I dropped away from posting here for a bit. After that last post I had a cardio-ablation, and it went well. All is good, and touch wood, will stay that way for a long time to come.
Unfortunately, during recovery, I got out of a few good habits. Programming my pet project, though, wasn't one of them, and so here we are, with something to say.
CPS (Continuation Passing Style) has been what I've wanted for the internals of async function calling for a while now, and I am soooooooOOOOOO happy today, because I have finally got code generation not just working, but the result compiling and running without blowing up!
Ok, so now for some knowledge download.
A normal function call puts a few parameters into registers and passes the rest on the stack. The called function moves the stack pointer to make a bit of space for its locals and does its work. This can happen recursively, so the stack grows. As functions return, the stack pointer is unwound and the stack shrinks back down. This happens a lot. A HUGE amount.
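To make that concrete, here's a tiny example of my own (nothing to do with the project) where each call pushes a frame and each return pops one:
#include <stdio.h>

/* Each call to sumTo pushes a new stack frame; they all pop again as the
   returns cascade back up. */
int sumTo(int n) {
    if (n == 0)
        return 0;
    return n + sumTo(n - 1);
}

int main(void) {
    printf("%d\n", sumTo(100));   /* briefly uses ~100 nested frames */
    return 0;
}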
The thing to note about stack-based function calling is that it is mostly single-threaded. If you want multiple threads you have to create more stacks, and then you need to wait for them to complete. That is, if you want to stick with the stack-based model.
CPS is slightly different, in that every function takes an additional parameter, the so-called continuation, which tells it what to do next. It's usually a callback function of some kind. And the thing is, a language that reads like it is stack-based can generate CPS output.
I’ll get on to the why later. First, let’s play a little. Here’s an example bit of C code that adds 10 to a number and prints the result. Not much going on here.
#include <stdio.h>

int addTen(int number) {
    return number + 10;
}

void main() {
    printf("The answer is %d\n", addTen(20));
}
Let’s convert that to CPS. It’s easy enough.
#include <stdio.h>

void addTen(int number, void (*continuation)(int)) {
    continuation(number + 10);
}

void main_2(int result) {
    printf("The answer is %d\n", result);
}

void main() {
    addTen(20, main_2);
}
What happens here is that we are calling the function ‘addTen’ with the additional continuation function. We’ve split ‘main’ into two functions, the setup and the continuation.
If only that were all there was to it, but in reality functions need to carry some state around with them. Let's try another example.
#include <stdio.h>

int addTen(int number) {
    return number + 10;
}

void main(char* message) {
    printf("The %s is %d\n", message, addTen(20));
}
Now the continuation can't just be a pointer to another C function; it needs some state to go with it.
#include <stdio.h>
#include <stdlib.h>

void addTen(int number, void (*continuation)(void*, int), void* context) {
    continuation(context, number + 10);
}

struct context {
    char* message;
};

/* The continuation takes a void* so it matches the pointer type above,
   then casts it back to the real context type. */
void main_2(void* ctx, int result) {
    struct context* context = ctx;
    printf("The %s is %d\n", context->message, result);
    free(context);
}

void main(char* message) {
    struct context* context = malloc(sizeof(struct context));
    context->message = message;
    addTen(20, main_2, context);
}
You might be looking at this wondering what it's all about, and why there are still function calls. Look at the calls to 'addTen' and to 'continuation': they sit at the very end of each function, with nothing left to do afterwards. With optimization enabled, the C compiler can turn each of these into a tail call, a jump rather than a call. The stack doesn't grow.
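A quick way to see what "nothing left to do afterwards" means is this toy example of my own (plain C doesn't guarantee the transformation, but gcc and clang do it reliably at -O2 and above):
#include <stdio.h>

void done(void) {
    printf("Lift off\n");
}

/* Counts down via tail calls: every call is the last thing its caller does,
   so an optimizing compiler can emit jumps instead of calls and the stack
   never grows, no matter how big n is. */
void countdown(int n, void (*continuation)(void)) {
    if (n == 0) {
        continuation();                  /* tail call */
        return;
    }
    printf("%d\n", n);
    countdown(n - 1, continuation);      /* tail call */
}

int main(void) {
    countdown(3, done);
    return 0;
}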
Now we’re getting closer. As it stands this isn’t efficient, but replace malloc/free with a bump pointer allocator and a garbage collector and things speed up a bunch. However, it’ll never be quite as fast as the stack calling convention, so why do we do it?
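To give a flavour of what I mean by a bump pointer allocator, here's a minimal sketch of the idea (my own illustration, not the allocator in my runtime, and it leaves reclamation entirely to an imagined garbage collector):
#include <stddef.h>
#include <stdint.h>

#define ARENA_SIZE (1 << 20)

static uint8_t arena[ARENA_SIZE];
static size_t bump = 0;

/* Allocation is just advancing a pointer; there is no per-object free.
   A garbage collector would reclaim the dead objects later. */
void* bump_alloc(size_t size) {
    size = (size + 7) & ~(size_t)7;        /* keep allocations 8-byte aligned */
    if (bump + size > ARENA_SIZE)
        return NULL;                       /* out of space: a real GC would collect here */
    void* result = &arena[bump];
    bump += size;
    return result;
}
In that style, the context allocation above would become bump_alloc(sizeof(struct context)) and the free in main_2 would simply disappear.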
It is often said that async/await as implemented in C# is like an infection, in that once you start using it, it spreads through your code until nearly everything is async. The downside is that any function declared as async has slightly higher overhead, but the upside is that you can do many things at once without the overhead of too many stacks. Stack space is a big overhead in high throughput servers.
My thought was: what if we start at the other end? Everything is async unless we can prove otherwise. Have an environment with only the minimum number of threads needed to match the physical hardware, and break all work into these async + continuation chunks that can be spread over many cores.
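To make the shape of that environment a bit more concrete, here is a minimal sketch of the idea using POSIX threads. This is entirely my own illustration, not the real runtime: the names (task, schedule, worker, print_number) are made up, the worker count is hard-coded, and a real scheduler would be far more careful. Workers pull continuation-plus-context chunks off a shared queue and run them to completion:
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* A unit of work: a continuation plus the context it needs. */
typedef struct task {
    void (*continuation)(void* context);
    void* context;
    struct task* next;
} task;

static task* queue_head = NULL;
static task* queue_tail = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t queue_ready = PTHREAD_COND_INITIALIZER;

/* Any piece of code can schedule the next chunk of work, from any thread. */
void schedule(void (*continuation)(void*), void* context) {
    task* t = malloc(sizeof(task));
    t->continuation = continuation;
    t->context = context;
    t->next = NULL;
    pthread_mutex_lock(&queue_lock);
    if (queue_tail) queue_tail->next = t; else queue_head = t;
    queue_tail = t;
    pthread_cond_signal(&queue_ready);
    pthread_mutex_unlock(&queue_lock);
}

/* Each worker (ideally one per hardware thread) pops chunks and runs them. */
static void* worker(void* unused) {
    (void)unused;
    for (;;) {
        pthread_mutex_lock(&queue_lock);
        while (queue_head == NULL)
            pthread_cond_wait(&queue_ready, &queue_lock);
        task* t = queue_head;
        queue_head = t->next;
        if (queue_head == NULL) queue_tail = NULL;
        pthread_mutex_unlock(&queue_lock);
        t->continuation(t->context);
        free(t);
    }
    return NULL;
}

/* Example continuation: prints the number in its context, then frees it. */
void print_number(void* context) {
    printf("Got %d\n", *(int*)context);
    free(context);
}

int main(void) {
    /* One worker per core; hard-coded to 4 just for the sketch. */
    for (int i = 0; i < 4; i++) {
        pthread_t thread;
        pthread_create(&thread, NULL, worker, NULL);
    }
    for (int i = 0; i < 8; i++) {
        int* n = malloc(sizeof(int));
        *n = i;
        schedule(print_number, n);
    }
    sleep(1);   /* crude: give the workers a moment before the process exits */
    return 0;
}
In that world the CPS code above wouldn't call its continuation directly; it would hand the continuation and its context to something like schedule, which is how the same linear-looking source ends up spread across the cores. (Build with -pthread.)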
Could we make a language and environment where we get parallelism without any coder effort? It’s not a new idea, just one that I wanted to run with. It’s the core idea behind the YAFL language I am writing. Write linear code, generate parallel code, use CPS as the main mechanism to enable this.
I guess my next write-up needs to be about how this actually works, with some real-world examples. As it stands the output is thousands of lines, but it compiles down quite small; I rely heavily on the C compiler's dead code removal.
And then there's the garbage collection. It's critical that it's ultra fast and introduces no pauses. I have thought a lot about that, and CPS might just enable this as well. Again, one for another day.