Parallel execution that is easy and cheap is at the top of the list of goals for YAFL. The model I am aiming for uses tuple construction as the implicit place to insert fork/joins, so that sub-expressions run in parallel where the compiler has determined them to be expensive enough to benefit. The fork/join logic itself is designed to be very cheap. Initially I was aiming for a continuation passing style for this, until I realised just how expensive and pervasive it would be. Instead I have gone down the fibers route, with a custom fiber (and yes, I am using the American spelling, deal with it) library designed to have very low cost and a low memory footprint. This is all inspired by the approach taken by the Go programming language with its tiny goroutine stacks (originally segmented, though Go has since moved to contiguous stacks that are copied on growth).
A fiber library will in general launch a small number of OS threads on startup, typically one per available CPU thread. Fibers are like old-school cooperative multitasking, where each thread of execution has to yield control explicitly. The stacks tend to be very small, and the fiber library is usually tightly coupled with an IO library that uses the OS's async APIs to suspend fibers and resume them when IO is complete. On such a system the cost of context switching between fibers is extremely low: because a switch is an ordinary function call, the fiber library need only save the callee-saved registers, the small set the calling ABI guarantees are preserved across a call, whereas the OS on a thread context switch must save ALL CPU STATE. It's quite a lot. There are definite benefits to fibers, but you have to buy into them 100%, or maybe build them into the language at a low level such that you don't realise it's even happening.
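To make that cost comparison concrete, here's roughly what a minimal fiber switch looks like on x86-64 System V (Linux). This is my own illustrative sketch, not YAFL's actual runtime: a production version would also preserve the mxcsr and x87 control words, and handle other platforms.

```c
/* fiber_switch.c — illustrative sketch, x86-64 System V ABI only.
 * A fiber switch is an ordinary function call, so the compiler has already
 * spilled any live caller-saved registers before calling it. The switch
 * itself only has to preserve the callee-saved set: rbx, rbp, r12-r15,
 * plus the stack pointer. */

struct fiber {
    void *rsp;  /* saved stack pointer; the registers live on the fiber's own stack */
};

void fiber_switch(struct fiber *from, struct fiber *to);

__asm__(
    ".text\n"
    ".globl fiber_switch\n"
    "fiber_switch:\n"            /* rdi = from, rsi = to */
    "  pushq %rbp\n"
    "  pushq %rbx\n"
    "  pushq %r12\n"
    "  pushq %r13\n"
    "  pushq %r14\n"
    "  pushq %r15\n"
    "  movq  %rsp, (%rdi)\n"     /* from->rsp = rsp */
    "  movq  (%rsi), %rsp\n"     /* rsp = to->rsp */
    "  popq  %r15\n"
    "  popq  %r14\n"
    "  popq  %r13\n"
    "  popq  %r12\n"
    "  popq  %rbx\n"
    "  popq  %rbp\n"
    "  ret\n"                    /* resumes at the return address on the new stack */
);
```

Six pushes, six pops, two moves and a ret. A brand-new fiber's stack just needs to be seeded with an entry trampoline as its "return address" before the first switch to it.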
In YAFL a fiber has a small stack that is not pre-allocated. After a stack has been re-used some number of times, e.g. 10,000, it is released back to the OS; if that means we don't have enough stacks, new ones will be allocated. The point here is that the memory for a stack is allocated by the OS on demand, as individual pages are first touched. Most work on a fiber will be short-lived and use a small stack, but occasionally a job will cause more pages to be faulted in, and once allocated they don't get released back to the OS. By ageing stacks out of the system we let that unused memory drain back to the OS over time, whilst retaining the benefit of re-using stacks without making OS calls all the time. Stacks are held in an LRU queue, so each time a stack is requested for a new fiber it is likely to be the most recently used one; and because the queue is local to a thread, that stack is likely to be hot in the L1 cache of a single CPU thread.
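As a concrete illustration, a minimal per-thread stack pool along those lines might look like this. The names, sizes and counts here are my own placeholders, not the actual YAFL runtime:

```c
/* stack_pool.c — sketch of a per-thread fiber stack pool with ageing. */
#include <stddef.h>
#include <stdlib.h>
#include <sys/mman.h>

#define STACK_SIZE  (256 * 1024)  /* reserved size; pages only commit when touched */
#define MAX_REUSES  10000         /* age a stack out after this many fibers */

struct stack {
    struct stack *next;   /* intrusive free-list link */
    void         *base;   /* mmap'd region */
    unsigned      uses;
};

/* One pool per OS thread: no locking, and the most recently released
 * (hottest) stack is the first to be handed out again. */
static __thread struct stack *pool;

static struct stack *stack_acquire(void)
{
    struct stack *s = pool;
    if (s) {
        pool = s->next;
        return s;  /* reused stack: its touched pages are already resident */
    }
    s = malloc(sizeof *s);
    if (!s) abort();
    /* Anonymous mappings are lazily backed: the OS only commits the
     * pages the fiber actually touches. */
    s->base = mmap(NULL, STACK_SIZE, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (s->base == MAP_FAILED) abort();
    s->uses = 0;
    return s;
}

static void stack_release(struct stack *s)
{
    if (++s->uses >= MAX_REUSES) {
        /* Ageing out: unmapping hands every page, including the extra
         * ones a deep-stacked job faulted in, back to the OS. */
        munmap(s->base, STACK_SIZE);
        free(s);
        return;
    }
    s->next = pool;  /* push to the front so the next acquire gets the hot one */
    pool = s;
}
```

Releasing to the front and acquiring from the front gives exactly the behaviour described above: hot stacks circulate, cold over-grown ones sink and eventually hit their reuse limit.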
A fiber, like an OS thread, has a priority. That priority is set automatically when a new fiber is started: it is the next ordinal number available, or to put it another way, it is taken from an ever-incrementing counter. Once assigned, the fiber is scheduled according to its priority, so when it suspends whilst waiting for IO and is later re-scheduled, it gets to jump the queue ahead of all the newer work. It's not totally fair, but it should help to ensure low latency on IO without starving other fibers.
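A sketch of that scheme, under my own assumptions about the data structures (a global atomic counter and a min-heap run queue; none of these names are YAFL's real API):

```c
/* scheduler.c — fibers are stamped once at creation from a monotonic
 * counter; the run queue always pops the lowest stamp, so an old fiber
 * resumed after IO runs before newer work. Bounds checks omitted. */
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

struct fiber_task {
    uint64_t priority;  /* lower = older = runs first */
    /* ... saved context, stack, etc. ... */
};

static _Atomic uint64_t next_priority;

static void fiber_init(struct fiber_task *f)
{
    /* Stamped once and kept across suspends, so re-scheduling after IO
     * puts the fiber ahead of everything started since. */
    f->priority = atomic_fetch_add(&next_priority, 1);
}

#define QUEUE_CAP 1024
static struct fiber_task *run_queue[QUEUE_CAP];
static size_t queue_len;

/* Min-heap keyed on priority. */
static void queue_push(struct fiber_task *f)
{
    size_t i = queue_len++;
    while (i > 0) {
        size_t parent = (i - 1) / 2;
        if (run_queue[parent]->priority <= f->priority) break;
        run_queue[i] = run_queue[parent];  /* sift newer entries down */
        i = parent;
    }
    run_queue[i] = f;
}

static struct fiber_task *queue_pop(void)
{
    if (queue_len == 0) return NULL;
    struct fiber_task *top  = run_queue[0];       /* oldest runnable fiber */
    struct fiber_task *last = run_queue[--queue_len];
    size_t i = 0;
    for (;;) {
        size_t child = 2 * i + 1;
        if (child >= queue_len) break;
        if (child + 1 < queue_len &&
            run_queue[child + 1]->priority < run_queue[child]->priority)
            child++;
        if (run_queue[child]->priority >= last->priority) break;
        run_queue[i] = run_queue[child];
        i = child;
    }
    run_queue[i] = last;
    return top;
}
```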
At the moment the compiler doesn't do any automatic detection of parallel workloads. It's early days, and I need to prove the technology first. For now the compiler supports a keyword '__parallel__', which constructs a tuple from the sub-expressions in its parameter list, executing each sub-expression in parallel. In terms of generated LLVM IR it ends up being relatively simple, but I'll save that for the next post, maybe with some worked examples of the necessary transforms. For now it's good to know that the YAFL compiler emits linear LLVM IR, with hints to guide the parallel transform phase. If the parallel phase never runs, the code still compiles and generates correct results, just without parallel execution. It's cool.
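To give a flavour before that next post, here's a rough C rendering of the shape I'd expect the lowered code to take for something like __parallel__(f(x), g(y), h(z)). The yafl_fork/yafl_join entry points are hypothetical names invented for illustration, not YAFL's real runtime ABI:

```c
/* parallel_tuple.c — sketch of a lowered __parallel__ tuple construction. */
#include <stdint.h>

typedef struct join_handle join_handle;

/* Run fn(arg) on a new fiber; the handle is later redeemed for the result. */
extern join_handle *yafl_fork(int64_t (*fn)(int64_t), int64_t arg);
/* Suspend the current *fiber* (not the OS thread) until the work completes. */
extern int64_t      yafl_join(join_handle *h);

extern int64_t f(int64_t), g(int64_t), h(int64_t);

struct tuple3 { int64_t a, b, c; };

struct tuple3 parallel_tuple(int64_t x, int64_t y, int64_t z)
{
    /* Fork all but the last sub-expression... */
    join_handle *ha = yafl_fork(f, x);
    join_handle *hb = yafl_fork(g, y);
    /* ...evaluate the last one on the current fiber, for free... */
    int64_t c = h(z);
    /* ...then join, assembling the tuple in source order. */
    struct tuple3 t = { yafl_join(ha), yafl_join(hb), c };
    return t;
}
```

Strip out the forks and joins and call the three functions directly, and you get exactly the linear version: which is why the untransformed IR still compiles to correct, merely sequential, code.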