Thoughts on automatic memory management

I have had a serious headache thinking about automatic memory management lately. My goal for YAFL is to hide all aspects of memory management, keep the overhead low, support embedded devices, have a small runtime and have consistent performance at all times. Most of that rules out a mark-sweep garbage collector and all of its derivatives, no matter how short the pause. Sorry Java, but no. I have concentrated my efforts on reference counting, accepting that the cost of atomic locking on SMP systems can be reduced with aggressive optimisation and reference borrowing.

… my head hurts … i haven’t slept in ages …

Slowly, recently, I have started to wonder if there is a way to have my cake and eat it, as they say.

Alright folks, let’s start with a simple premise. When using reference counting, there is no atomic operation on allocation, because you are guaranteed to hold the only reference. On release, if the reference count is 1, you can again avoid the atomic decrement, because your reference is the only one. This can be proven, but that’s another discussion, so for now, run with it. If we need to acquire an existing pointer, its count will exceed 1, that’s guaranteed. To what value we don’t know, but it will exceed 1. Short-lived objects can be released easily, even if they form graphs, so long as we have good optimisation and transfer ownership when building graphs. Even those pesky compound-typed results of functions are simple: they avoid atomics and release easily, so long as they aren’t shared.

Could we hybridise? An object is created with a counter… err… flag, let’s call it a flag now, of 1. If a pointer is acquired, its flag is set to 2. That’s all. On release, if the flag is 1 we free the memory immediately; otherwise we drop it. We let it leak. Let’s give this flag a meaningful name. Let’s call it the “Generation”.
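To make that concrete, here is a minimal sketch of the generation-flag idea, modelled in Python rather than as real YAFL runtime code. The class and function names are invented for illustration; in the real runtime this would all be a word in the object header.

```python
# Sketch of the "generation flag": gen starts at 1; acquiring a second
# reference bumps it to 2 and it stays there; release frees immediately
# only while gen == 1, otherwise the object is left for the collector.

class Obj:
    __slots__ = ("gen", "freed")
    def __init__(self):
        self.gen = 1        # freshly allocated: we hold the only reference
        self.freed = False

def acquire(obj):
    obj.gen = 2             # no counting, just a flag: "shared at some point"
    return obj

def release(obj):
    if obj.gen == 1:
        obj.freed = True    # sole owner: free on the spot, no atomics needed
    # otherwise do nothing; the incremental collector reclaims it later

a = Obj()
release(a)                  # never shared -> freed immediately
b = Obj()
acquire(b)
release(b)                  # shared once -> deliberately "leaked" to the GC
```

Note there is no decrement anywhere: the only write on acquire is an unconditional store of 2, which is the whole point of the hybrid.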

Now we need to add a mark-sweep collector. We use a state machine attached to each thread to drive it, advancing a small amount each time an allocation function is called, scanning all reachable objects in generation 2 only. In a pure functional language this works, because generation 3 can’t be mutated to reference generation 2. If an object is retained after 5 or so scans, we move it into generation 3. I need to think more to get some meat on the bones here, but this feels right so far.

Traditional GCs periodically scan older generations. I propose that we never scan an older generation. Instead, the runtime action already stated above of setting the generation to 2 on acquire will automatically, and regularly in some cases, keep pulling active objects back into generation 2. In order for something in generation 3 to be unreachable, something referencing it in generation 2 must be collected. When it is collected we’ll move all outgoing gen 3 references back to gen 2.
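The pull-back rule can be sketched as follows. This is a Python toy, not the real collector, and `collect_gen2` stands in for whatever the state machine does when it decides a generation-2 object is unreachable.

```python
# Sketch of the "pull back" rule: generation 3 is never scanned. When a
# gen-2 object is collected, any gen-3 objects it referenced drop back to
# gen 2, so the next incremental scan decides whether they still live.

class Node:
    def __init__(self, gen, refs=()):
        self.gen = gen
        self.refs = list(refs)

def collect_gen2(node, roots):
    # Hypothetical collection of one unreachable gen-2 object.
    assert node.gen == 2 and node not in roots
    for child in node.refs:
        if child.gen == 3:
            child.gen = 2   # pull the old object back into the scanned set
    node.refs.clear()

old = Node(gen=3)
young = Node(gen=2, refs=[old])
collect_gen2(young, roots=set())   # old is now back in generation 2
```

The invariant being relied on is the purely functional one: nothing in generation 3 can ever gain a reference into generation 2, so demoting on collection is the only traffic between the generations.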

This all seems a bit airy-fairy, and frankly it is a bit. It’s still a thought in progress. Somehow this also links in with the fiber library I have added. There is a serious lack of detail around the state machine in my mental model. I don’t know if it’ll keep up with high churn.

What I do know is that this is the first time in this project that I have felt that I might even be close to solving the problem of having memory management that scales from simple single core embedded devices all the way up to massive multicore beasts without degrading performance.

Even if it is slightly less efficient than a state-of-the-art solution, I value consistency much more.

More to come.


Equivalence principle

This isn’t rocket science, it’s YAFL. In YAFL these two statements are considered to be equivalent:

  • let square = (x) => x * x
  • fun square(x) => x * x

If you look deep inside the AST you’ll see that there is a separate mechanism for representing let and fun objects, but that is really there to make debugging the compiler easier. Semantically, in the YAFL language, they are the same.

YAFL also supports overloading.

Think about it…. The above two statements are equivalent, and overloading is supported. Sooooo

  • let square = (x: Float32) => x * x
  • let square = (x: Float64) => x * x

It is completely valid to have two lets of the same name, and the compiler will deal with it gracefully. We do this by distinguishing loads by the receiver type, i.e. what the recipient of an expression expects can be a factor in choosing which ‘square’ to load.

This goes one step further, as it’s not just to do with lambdas and functions. The following is also valid:

  • let anumber = 100i32
  • let anumber = 100i64
  • fun square(x: Int32) => x * x
  • fun main() => square(anumber)

Because of the context in which ‘anumber’ is used, it is clear that we are referring to the first instance. Just like a function call context with parameters can make it clear which lambda type we are looking for when loading ‘square’.
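The resolution rule above can be sketched as a lookup keyed by the expected type. This is a Python toy, not the YAFL compiler’s real data model; the binding table and type strings are invented for illustration.

```python
# Sketch of receiver-typed overload resolution: several bindings share one
# name, and the expected type at the use site picks the unique winner.

bindings = {
    "anumber": [("Int32", 100), ("Int64", 100)],
    "square":  [("Float32 -> Float32", lambda x: x * x),
                ("Float64 -> Float64", lambda x: x * x)],
}

def resolve(name, expected_type):
    candidates = [v for t, v in bindings[name] if t == expected_type]
    if len(candidates) != 1:
        raise TypeError(f"ambiguous or missing overload for {name!r}")
    return candidates[0]

# square(anumber), where square takes Int32, forces the Int32 'anumber'
n = resolve("anumber", "Int32")
```

If no expected type is available and more than one binding matches, the only honest answer is an ambiguity error, which is what the sketch raises.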

There is a symmetry here that works, it just works. But it does start to mess with my mind when generics come into the picture, and that comes next.

YAFL generics, again

I have been stuck in a thought loop. In Rust traits provide generic constraints and they provide the equivalent of class/interface in other languages. The distinction is in how you use them. Traits are very similar to the Haskell concept of class, which has nothing to do with classes as you or I know them. And class/interfaces as we know them from C++/C#/Java etc have nothing to do with traits.

For my first attempt at formalising the grammar around generics I went for the Rust model of traits and trait implementations, which I kept separate from the OO class/interface model I borrowed from C#/Java. This still left me feeling uneasy. Rust managed to cross over these concepts. Should I try harder, or is it OK that I have two generic domains?

Then I went back to the origins of what a class means, where the original concept in Scheme was as a kind of module. I thought further that really it’s got a lot in common with modules, but with additional parameterised state. But it is still a set of functions in a scope. And I thought about Kotlin’s extension methods, and how they are used for passing in domain APIs to code blocks. Scope, state, modules. Generics. All coming together.

At this point my head was starting to hurt, until something suddenly struck me. The fundamental concept I was seeing here is that the C#/Java concept of class/interface is equivalent to the Rust concept of trait, IFF the class has a zero-parameter constructor. In that case, it might as well be a singleton, and it might as well be implicitly in scope.

Suddenly the pieces are falling into place. Let’s play.

# Interface defines some functions that can operate on T
interface Addable<T> {
    fun `+`(a: T, b: T): T
}

# Zero param constructable class implements the functions
class Int32Addable() : Addable<Int32> {
    fun `+`(a: Int32, b: Int32): Int32 = plusFunction(a, b)
}

# This function can add anything if there is an implementation
# of Addable available at the call site
fun addSomething<T>(a: T, b: T): T where Addable<T> = a + b

# Or we could construct an instance of Int32Addable and pass
# it in as an object.
fun addSomething<T>(a: T, b: T, impl: Addable<T>): T = impl.`+`(a, b)

fun main() {
    # These two calls are functionally equivalent, but
    # semantically and in terms of runtime code generation
    # are passing functionality in different ways.
    print(addSomething(1, 2))
    print(addSomething(1, 2, Int32Addable()))
}
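The equivalence between those two calls can be sketched as dictionary passing, which is roughly how Haskell lowers type classes. This is a Python illustration of the lowering, not the YAFL code generator; the lookup table and names are invented.

```python
# Sketch of how both addSomething calls could lower: the 'where Addable<T>'
# form becomes an extra hidden parameter carrying the implementation, which
# is exactly what the explicit form passes by hand.

class Int32Addable:
    # zero-parameter constructable "class as trait impl"
    def add(self, a, b):
        return a + b

def add_something(a, b, impl):
    # both surface forms reach this one lowered function
    return impl.add(a, b)

IMPLS = {int: Int32Addable()}       # call-site lookup for the implicit form

def add_something_implicit(a, b):
    return add_something(a, b, IMPLS[type(a)])

x = add_something_implicit(1, 2)          # compiler-found impl
y = add_something(1, 2, Int32Addable())   # programmer-passed impl
```

Seen this way, the “zero-parameter constructor” condition is what makes the compiler-found lookup safe: there is nothing to configure, so one shared instance serves every call site.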

This isn’t everything, not by a long shot. It looks nice, but I have just re-introduced the diamond inheritance problem that Rust so elegantly avoided. So it’s still a work in progress.

VPN to home

I have gigabit internet. It has long been a dream and so when it arrived I pounced at the opportunity. There was unfortunately a drawback, in that the provider uses CGNAT. Carrier Grade NAT is NAT as we all know it, but on an industrial scale. It’s used mostly by the mobile operators to reduce the number of IP addresses they need to own, but fixed line operators are getting in on the act. The big drawback if you run a home VPN server is that it wrecks dynamic DNS solutions.

I pay extra for a static IP address. There are technical solutions that shouldn’t require this, but they require effort and hacking around with software that I just don’t want to do. Anyway, it doesn’t cost much to have a static IP. After the ISP asked me a few questions to make sure I wasn’t just chasing something I didn’t understand, it was set up within hours.

My router is a Netgear Nighthawk R7000 with a built-in VPN server. Unfortunately it’s a bit old, and not quite as secure as the people that maintain Linux packages would like, so by default they don’t let me connect my Linux laptop to it. Most advice you find will say “use a larger key”. Well, I can try petitioning Netgear, but I doubt that they’ll update the firmware, so I am stuck trying to convince OpenVPN to let me connect.

Well, here we go again. 100s of forums and comments later, I finally have the magic incantation. Edit the file ‘client2.ovpn’ that you obtained from the router and add the following (replacing the redacted IP of course):

tls-cipher "DEFAULT:@SECLEVEL=0"
tls-version-min 1.0
route-gateway 192.168.xx.1
data-ciphers AES-128-CBC

I’m not giving much away with that IP, after all it’s the same one most of us have. I just randomise the penultimate octet to avoid a clash with whatever public network I happen to be going over.

Simply put, this is telling OpenSSL (the underlying library used by OpenVPN for encryption) to allow lower security levels, and telling OpenVPN where to route packets. The last line is a rename of the existing ‘ciphers’ setting, because for some reason it needs to be renamed. Ugh, I don’t know. Just do it, it works.

I don’t need to do any of this on any other platform. I guess Linux is that good.

Linux encourages more reading. Reading is good.

Sharing a folder with VMs

I am sure this used to be easier. Oh yes, I used to use a Mac with Parallels for virtualisation, it was easier.

I have my Fedora 38 host using virt-manager to run a Windows 10 guest in a QEMU virtual machine. It’s pretty slick, but I can’t yet transfer files between the guest and host.

Looking in the virt-manager devices I see that I can add ‘virtiofs’ for sharing a host folder. Perfect, let’s try that. It needs shared memory, OK I can switch that on. Still not working, oh, it doesn’t work with Windows at all.

Next I read about Spice WebDAV. So, I add the new spice channel ‘org.spice-space.webdav.0’ and install the guest tools for spice webdav. I have read that it shares the folder ‘Public’ by default, which is fine by me. But still nothing in Windows.

With frustration starting to creep in, I go old school and decide to set up a Samba share on the internal virtual network. Does it work? Does it heck! I have lost a lot of hair over this, so let’s just accept that there is a magic incantation that works. I read so many posts, and so many rants, and each one had a piece of the puzzle. Frankly this post will be a piece of somebody else’s puzzle, because every deployment is different.

Before I do the grand reveal, the key bits of information that finally unblocked it for me were:

  1. You cannot use ‘guest ok = yes’ with ‘valid users’ (note ‘guest ok’ and ‘public’ mean the same)
  2. For SELinux to work, you need to give extra permission to the folder like so ‘chcon -t samba_share_t Shared’.
  3. Restrict Samba to use the internal VM network, which in my case is virbr0.
  4. Punch a hole through the firewall so that Samba can be connected to via the internal VM network like so:
    firewall-cmd --get-active-zones
    sudo firewall-cmd --permanent --zone=libvirt --add-service=samba
    sudo firewall-cmd --reload

And now with no further ado, here is the magic incantation I finally used in ‘/etc/samba/smb.conf’.

[global]
        workgroup = SAMBA
        map to guest = bad user
        bind interfaces only = yes
        interfaces = lo virbr0
        guest account = myuser

[Shared]
        comment = Shared
        path = /home/myuser/Shared
        security = share
        read only = no
        browseable = yes
        writable = yes
        map to guest = bad user

        public = yes
        guest ok = yes
        guest only = yes
        force user = myuser
        force group = myuser
        force create mode = 0644
        force directory mode = 0755

Linux is an education.

Thoughts on generics in YAFL

I have had to take a break from YAFL development for the last month due to work commitments. For my own sanity, I have a clear priority ladder of family, work, hobbies, in that order. But now things are starting to lighten up, the clouds are parting, and I need to start collecting my thoughts again.

Before I go into the generics discussion I had one small thought about constructors. Currently I have gone for an approach similar to Kotlin, where the class is effectively also a function of the same name, which is the constructor. I want to move that to be a static member function named ‘new’ instead. For me it just provides more clarity, but so far I have avoided supporting static member functions. I think I still will avoid them, but with a caveat that’ll become clear as we dive into generics.

The model I want to follow has its roots in Haskell and Rust, which have similar concepts referenced by the ‘class’ and ‘trait’ keywords respectively. The basic idea is that you can declare a set of well-defined functions in a named group, and a type is said to be generically compatible with that named group if a specialised set of implementations is available.

I’ll take that concept, but not exactly as defined in these languages. In Rust for example the ‘trait’ concept is conflated with the ‘interface’ concept that can be found in Java/Kotlin/C#, which makes sense for the Rust memory model, but is harder to fit into languages that hide memory management from the programmer, like Java/Python etc. That aside though, I do have issue with the idea that the same named thing can be thought of as a type in the strict sense but also as a generic constraint. These, to me, are substantially different ideas.

YAFL already has ‘interface’ and ‘class’ keywords that behave in a similar way to Java/Kotlin/C#, and I am happy with that. I don’t want to force anything in there, so the generics idea will be kept separate, at least for now whilst we get this first implementation sorted.

Now it might be best to show some ideas, and explain what I am aiming for. The syntax will borrow from Rust and Kotlin:

trait Numeric<T> {
    let T::MAX: T
    let T::MIN: T
    fun `+`(l: T, r: T): T
    fun `-`(l: T, r: T): T
}

impl Numeric<Int32> {
    let Int32::MAX = 2147483647
    let Int32::MIN = -2147483648
    fun `+`(l: Int32, r: Int32): Int32 => someAddThingy(l, r)
    fun `-`(l: Int32, r: Int32): Int32 => someSubThingy(l, r)
}

fun processNumber<T>(a: T, b: T): T where Numeric<T> => a + T::MAX

‘::’ means scoped by. It’s a syntax used in Rust and C++ that I like, used there for module scoping. Here I propose to use it for type scoping as well, so we have MAX scoped by the type T. In the function we require that the type T complies with the Numeric trait, which it does. The function can be proven to be correct for any T that has an implementation of Numeric.

Anyone familiar with Rust will be looking at that syntax with a furrowed brow, wondering why I put the T in angle brackets next to the trait name. It’s because this isn’t about one generic type. Try this on for size:

trait Convertible<X, Y> {
    fun X::new(r: Y): X
}

impl Convertible<Int32, String> {
    fun Int32::new(r: String): Int32 => parseToInt(r)
}

fun addToString<X>(a: X, b: String): X where Convertible<X, String>, Numeric<X> => a + X::new(b)

That’s nice. We’ve effectively added a new constructor to ‘Int32’ that takes string. We’ve required that X is convertible from ‘String’ and that it’s a numeric type so that we can add. It’s all built on the idea of a trait that defines the relationship between multiple types.
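The multi-type nature of the trait can be sketched by keying the impl table on the full type tuple. This is another Python illustration of the lowering, with an invented lookup table; in the real compiler the lookup would of course be resolved statically.

```python
# Sketch of a two-parameter trait: the impl table is keyed by the whole
# type tuple, so Convertible<Int32, String> is just the (int, str) entry.
# Python's native '+' stands in for the Numeric<X> constraint here.

CONVERTIBLE = {(int, str): int}     # X::new(r: Y): X  ->  int("...")

def add_to_string(a, b):
    # where Convertible<X, String>: look up the conversion for X's type
    new = CONVERTIBLE[(type(a), type(b))]
    return a + new(b)
```

Because the key is the pair, adding `Convertible<Float64, String>` later is just another entry, with no relationship to the Int32 one.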

Back to my opening statement, I don’t want to support static members on classes. This trait syntax clearly has done something semantically similar, so we’ll utilise that. The primary constructor of a class will be defined as ‘new’ in an anonymous impl block, and all other constructors can be explicitly declared by the programmer in the same way like so:

impl {
    fun MyClass::new(a: Int32, b: Float32): MyClass => MyClass::new(1, 2, 3, a, b)
}

Looks like Rust, and as I said before it borrows a lot, unashamedly so, as Rust is brilliant. But I also borrow much from Kotlin, C# and even a little from Python. It’s a chimera.

There is one more thing though. YAFL tries to infer a lot in order to reduce code bloat, so where possible I’d like for it to avoid the need to declare trait constraints on generic parameters. Ideally, it should note the operations used in a function body, and locate trait definitions in the imported namespaces that uniquely match. If found, we can imply the constraints. So:

fun addToString<X>(a: X, b: String) => a + X::new(b)

Implicitly, we can see that X is expected to have a function ‘new’ that takes a parameter of type String. In our imports there is a trait defined for this, uniquely. The result X is added to ‘a’, another X. There is another trait defined for maths operations, so we can imply that as the next constraint. Ultimately, I believe we can remove much of the boilerplate, whilst retaining good code clarity.
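The inference step can be sketched as a reverse lookup from operations to traits. The trait table and operation names below are invented for illustration; the real compiler would work over resolved symbols, not strings.

```python
# Sketch of inferring trait constraints from a function body: collect the
# operations used, then find the unique imported trait defining each one.

TRAITS = {
    "Numeric":     {"+", "-"},
    "Convertible": {"new"},
}

def infer_constraints(ops_used):
    constraints = set()
    for op in ops_used:
        matches = [name for name, ops in TRAITS.items() if op in ops]
        if len(matches) != 1:
            raise TypeError(f"operation {op!r} is ambiguous or unknown")
        constraints.add(matches[0])
    return constraints

# the body of addToString uses X::new and '+', so both are implied
inferred = infer_constraints({"new", "+"})
```

The uniqueness check is the important part: if two imported traits define the same operation, inference must fail loudly and the programmer writes the constraint by hand.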

I’ll stop here, and mull over this a bit. So far though, this feels good, and keeps the concepts simple but powerful.

Virt-Manager running Windows 10

My employer has provided me with a lovely new laptop, a completely blank slate, ready for me to install an OS and set up. I have gone with Fedora 38 Beta because of the maturity of Fedora in general, and a general masochistic desire for pain.

Some of my work is on Windows 10, for which I have a virtual machine that I copied over from the old laptop. Here’s where my pain begins.

Permissions – SELinux as set up on Fedora won’t let KVM touch the hard disk image I have copied over. Try “chcon -t virt_content_t -R PATH/TO/VMs”.

You want multiple VMs to talk to each other without the IP address changing all the time, and they must be able to access the internet no matter what network you are actually connected to – use the bridge. It’s probably already set up as “virbr0”. That’s all, just use bridged networking. Each machine will be allocated an IP on next startup that will remain with it for a very long time.

Display and fractional scaling issues – Set “Scale Display” to “Never” and enable “Auto resize VM with window”. Set video to QXL and then edit the XML to double each of the RAM values. Make sure you have added a spice channel, target name “com.redhat.spice.0”. Install spice tools on the guest machine. Set DPI to 200% in the guest. Whatever fractional scaling you use on your Linux desktop, the guest will be appropriately scaled.

Sleep mode doesn’t work or isn’t present – This is a pain, because suspend to RAM basically doesn’t work, it crashes on resume. Suspend to disk on the other hand does work, with coaxing, and there’s no button in the UI to initiate it. Edit the XML and ensure “suspend-to-disk” is enabled. Install the virtio client tools in the guest, they can be downloaded from the Proxmox web site. Add a new spice channel for the QEMU guest agent. Restart the guest. You can now hibernate the VM from the host with this command line “virsh -c qemu:///system dompmsuspend win10dev disk”.

Naturally much of this will work differently, or not at all for you, but here it is as future notes for me, and hopefully the start of finding your solutions.

We don’t use Linux because it’s easy.

YAFL progress

Parallel execution done easily and cheaply is utmost on the list of goals for YAFL. The model I am aiming for uses tuple construction as the implicit place to insert fork/joins to have parallel execution of sub-expressions, where the compiler has determined them to be expensive enough to benefit. The fork/join logic itself is designed to be very cheap. Initially I was aiming for a continuation passing style for this, until I realised just how expensive and pervasive it would be. Instead I have gone down the fibers route, with a custom fiber (and yes, I am using the American spelling, deal with it) library designed to have very low cost and low memory footprint. This is all inspired by the approach taken by the Go programming language, with its segmented stacks.

A fiber library will in general launch a small number of OS threads on startup equivalent to the number of CPU threads available. Fibers are like old school multitasking where each thread of execution has to yield control. The stacks tend to be very small, and it is usually tightly coupled with an IO library that uses OS async APIs in order to suspend fibers and resume them when IO is complete. On such a system the cost of context switching between fibers is extremely low, because the fiber library need only save a very limited set of registers that the calling ABI does not assume to be clobbered by callees, whereas the OS on a thread context switch must save ALL CPU STATE. It’s quite a lot. There are definite benefits to fibers but you have to buy into them 100%, or maybe build it in to the language at a low level such that you don’t realise it’s even happening.

In YAFL a fiber has a small stack that is not pre-allocated. After a stack has been re-used some number of times, e.g. 10,000, it is released back to the OS. If that means that we don’t have enough stacks new ones will be allocated. The point here is that the memory for a stack is allocated by the OS on demand as individual pages are accessed. Most work on a fiber will be short lived and have a small stack, but sometimes some job will cause more pages to be used, and once allocated they don’t get released back to the OS. By ageing them out of the system we allow this unused memory to go back to the OS over time, whilst retaining the benefit of re-using stacks without making OS calls all the time. Stacks are held in a LRU queue so that each time a stack is requested for a new fiber it is likely to be hot, and the queue is local to a thread so it is likely to be hot in the L1 cache of a single CPU thread.
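The stack pool described above can be sketched with a deque and a reuse counter. This is a Python model of the policy only, not the real allocator; MAX_REUSES is lowered from the ~10,000 in the text so the demo is short.

```python
# Sketch of the fiber stack pool: an LRU deque of reusable stacks, each
# aged out after a fixed number of reuses so its lazily-faulted pages can
# return to the OS. The front of the deque is the hottest stack.

from collections import deque

MAX_REUSES = 3

class Stack:
    def __init__(self):
        self.reuses = 0

pool = deque()          # front = most recently released, likely cache-hot

def get_stack():
    return pool.popleft() if pool else Stack()

def release_stack(stack):
    stack.reuses += 1
    if stack.reuses < MAX_REUSES:
        pool.appendleft(stack)   # keep it hot for the next fiber
    # else: drop it, letting its pages go back to the OS over time

s = get_stack()
release_stack(s)
```

Keeping the pool thread-local, as the text says, is what makes the front-of-deque stack likely to still be in that CPU’s L1 cache.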

A fiber, like an OS thread, has a priority. That priority is set automatically when a new fiber is started, to the next ordinal number available, or to put it another way, it is taken from an ever-incrementing counter. Once assigned, the fiber is scheduled according to its priority, so when it suspends whilst waiting for IO and is later re-scheduled, it gets to jump the queue against all the new work. It’s not totally fair, but it should help to ensure low latency on IO without starving out other fibers.
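The priority rule can be sketched with a counter and a min-heap. This is a toy model of the scheduling policy, not the real scheduler; fiber names are invented.

```python
# Sketch of the fiber priority rule: priority is taken from an incrementing
# counter at spawn, and the run queue is a min-heap, so a fiber resuming
# after IO (an older, lower number) jumps ahead of newly spawned work.

import heapq
import itertools

counter = itertools.count()
run_queue = []

def spawn(name):
    prio = next(counter)
    heapq.heappush(run_queue, (prio, name))
    return prio

def resume(prio, name):
    # re-queue a fiber that was suspended on IO, keeping its old priority
    heapq.heappush(run_queue, (prio, name))

old = spawn("io-bound")
heapq.heappop(run_queue)        # it runs, then suspends waiting for IO
spawn("fresh-work")
resume(old, "io-bound")
next_up = heapq.heappop(run_queue)[1]   # the resumed fiber goes first
```

The unfairness is deliberate: old fibers always win, which is exactly the “jump the queue” behaviour the text describes.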

At the moment the compiler doesn’t do any automatic detection of parallel workloads. It’s early days, and I need to prove the technology first. For now the compiler supports a keyword ‘__parallel__’ which constructs a tuple of each sub-expression in the parameter list, executing each sub-expression in parallel. In terms of generated LLVM IR it ends up being relatively simple, but I’ll save that for the next post, maybe with some worked examples of the necessary transforms. For now it’s good to know that the YAFL compiler emits linear LLVM IR, with hints to guide the parallel transform phase. If the parallel phase never happens, the code will still compile and generate correct code, just without parallel execution. It’s cool.
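A rough model of what `__parallel__` lowers to can be written with a thread pool standing in for the fiber runtime. The function names are invented; only the fork/join shape is the point.

```python
# Sketch of '__parallel__': each sub-expression becomes a task, and the
# tuple construction is the join point. A thread pool stands in for the
# fiber runtime; the observable semantics are the same either way.

from concurrent.futures import ThreadPoolExecutor

def expensive_a():
    return sum(range(1000))

def expensive_b():
    return sum(range(2000))

def parallel(*thunks):
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(t) for t in thunks]   # fork
        return tuple(f.result() for f in futures)    # join

result = parallel(expensive_a, expensive_b)
```

Because the join produces an ordinary tuple, dropping the parallel transform and calling the thunks in order gives an identical result, which is the “still compiles and generates correct code” property mentioned above.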

Sleep matters

You know how it feels, late at night, you have been mulling over that problem that has bugged you all day. You don’t want to put the thought down, and wonder if spending just one more hour in front of the screen will solve it.

Stop. You’re not just delaying sleep, you’re ruining your quality of sleep. Write down your thoughts, make notes on the options you are considering, just enough that you won’t worry about losing your train of thought. And then walk away.

Don’t underestimate the power of sleep in solving problems. Tomorrow you are more likely to find the solution if you have slept well, but it’s hard to sleep well if you’re worried about losing important ideas.

Good night, sleep well.

Could you live with fixed size arrays only

If the trade-off for a memory manager that works efficiently at high and low scale (think embedded devices) was that each array in your language must be of a size that is fixed at compile time, and cannot exceed some very strict maximum sizes, would that be a price worth paying?

If it was a fully functional language, there’s a good chance that all collection types are persistent, meaning that they are very unlikely to use large arrays ever anyway. But even persistent collections implemented in a language like Java use dynamic arrays, arrays that have a size chosen at runtime. Using a set of pre-defined fixed size arrays, say of size 4, 8, 12, 16, 24 etc up to 396 (yes an arbitrary figure, but it is meaningful) would be a workable and quite efficient alternative.
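Size-class selection can be sketched in a few lines. The text only names 4, 8, 12, 16, 24 and 396; the intermediate classes below are invented to fill out the table for illustration.

```python
# Sketch of fixed size-class selection: a runtime request for n elements
# is rounded up to the smallest predefined class, and anything beyond the
# largest class is rejected by design.

import bisect

# 4..24 and 396 are from the text; the values in between are made up.
SIZE_CLASSES = [4, 8, 12, 16, 24, 32, 48, 64, 96, 128, 192, 256, 396]

def size_class(n):
    i = bisect.bisect_left(SIZE_CLASSES, n)
    if i == len(SIZE_CLASSES):
        raise ValueError(f"allocation of {n} exceeds the fixed maximum")
    return SIZE_CLASSES[i]
```

A fixed table like this is what keeps fragmentation bounded: every freed block is immediately reusable by any future allocation of the same class.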

If you’re wondering what an array is, sorry, my posts are technical. If you’re wondering what a persistent collection is then your instinct might be misleading you into thinking that it has something to do with databases. Sorry, that is far from the truth. If you cannot mutate collections, the methods for growing a collection are very different, as you can’t simply assign a new element at the end of an array. Instead functional languages (and non-functional ones sometimes, but that’s another tale) encode each modification as a tree of changes, and then each access enters that tree at a specific node to see what it looks like at that version. So, replace the word ‘persistent’ with ‘versioned’ and your instinct about ‘versioned collections’ should be better.
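The simplest possible versioned collection makes the point. This is a classic cons-list sketch in Python, not YAFL code; every “append” is a new node sharing the whole of the previous version.

```python
# Sketch of a persistent ("versioned") collection: each append builds a
# new node pointing at the old version, so every version stays readable
# and no array is ever resized in place.

class PList:
    __slots__ = ("head", "tail")
    def __init__(self, head, tail=None):
        self.head, self.tail = head, tail

    def appended(self, value):
        return PList(value, self)       # new version; old one untouched

    def to_list(self):
        out, node = [], self
        while node is not None:
            out.append(node.head)
            node = node.tail
        return list(reversed(out))

v1 = PList(1).appended(2)
v2 = v1.appended(3)                     # v1 is still exactly [1, 2]
```

Real persistent vectors use wide branching trees rather than a linked list, but the allocation pattern is the same: lots of small fixed-size nodes, never one large array.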

The key here is that persistent collections rarely if ever need to allocate a large array.

So what’s the problem, why do we want to impose such a strict limit on things? We could allow runtime-sized arrays, and we could allow large arrays. If we allow large arrays, code that runs nicely on a system with lots of memory will struggle on a smaller system even if it could fit within available memory, mostly due to fragmentation. Working on smaller systems is a design goal of YAFL. OK, so we limit maximum array size but allow them to be sized at runtime. Then we have a secondary problem of knowing when you will hit the limit. It’ll complicate code more than the effects of having fixed-size arrays with nice compile-time errors, and likely introduce more errors.

YAFL is an experimental language, and it is a vanity project, so as always I will do as I please. I’m part way down the path of implementing array support, and will see what strings and collections look like when implemented on top of this. I’ll report back with my findings in a few weeks.

PS. It’s not just about array size. It’s about having a maximum size for any heap allocation, to reduce the possibility of fragmentation. On embedded systems this is a big risk without dipping into the murky world of compacting memory management, and that is definitely not simple.