
Compiler Series: Starting a Programming Language in C
First post in a series on building a programming language from scratch in C.
I have always found it fun to build parsers and then do something with the information gathered in the AST. In high school I built a simple scripting language, lightscript, as my final thesis. The code still lives somewhere on my GitHub. If you open the project you will see there is not much documentation, and I am scared to even look at the code 💀.
What always bothered me is that I never had the time to turn it into an actual typed language with a proper compiler and a virtual machine backing it. It was just an interpreter with some built-in functions and working variables, arrays and functions. Granted, at the time I probably would not have known how to do that without the knowledge I have now.
I needed to take a break from all my other projects, and I figured it would be fun to write a VM-based programming language to reset my brain and have fun with coding again.
So, as the title suggests, I will be building a programming language called Zin. The name comes from my own nickname - short, simple and easy to remember.
This is the first post in a series where I'll be documenting each step of building the language from tokenizer to virtual machine and compiler/codegen.
Why C
I initially considered Zig. It has some genuinely great ideas - tagged unions, comptime, built-in testing, arena allocators in the standard library. I spent some time prototyping in it and there's a lot to like about the language on paper.
But after spending more time with it, C won out for a few reasons:
- Stability - Zig is great but it is not stable. The language is still pre-1.0, and things break between versions. For a project I want to build over time and come back to, I need a language that won't shift under my feet.
- Tooling - ZLS has gotten better since the last time I tried it, but it's still not where I need it to be. I spent more time fighting the tooling than writing code. C's tooling story is ancient and boring, which is exactly what I want.
- Full control - with C I implement everything myself. There's no magic, no hidden allocator behavior, no compiler doing things behind my back. Every allocation, every data structure, every abstraction is mine. For a compiler project, that's a feature, not a burden.
- It's what compilers are written in - many widely used compilers and language runtimes are written in C or C++ (GCC, Clang/LLVM, CPython, Lua). There are decades of literature, reference implementations, and battle-tested patterns to draw on. When I get stuck, the answers are already out there.
Arena allocators, tagged unions via enum + union, defer-like cleanup with gotos or macros - all of this is doable in C. It's more manual, but for a project like this I actually want that level of control. Every line of code is intentional.
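To make that concrete, here is a minimal sketch of those three idioms in one place. It is illustrative only and not Zin's actual code: the names `Arena`, `Node`, `eval` and `run_demo`, the tiny two-node AST, and the fixed 16-byte alignment are all assumptions made up for this example.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Assumed alignment for this sketch; 16 bytes covers common 64-bit targets. */
#define ARENA_ALIGN ((size_t)16)

/* Arena (bump) allocator: one buffer, handed out piece by piece, freed at once. */
typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} Arena;

static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

static void *arena_alloc(Arena *a, size_t size) {
    /* Round the current offset up so each allocation starts aligned. */
    size_t start = (a->used + ARENA_ALIGN - 1) & ~(ARENA_ALIGN - 1);
    if (start > a->cap || size > a->cap - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

static void arena_free(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* Tagged union: the enum is the tag, the union holds the payload. */
typedef enum { NODE_INT, NODE_ADD } NodeKind;

typedef struct Node Node;
struct Node {
    NodeKind kind;                        /* says which member of `as` is valid */
    union {
        long int_value;                   /* valid when kind == NODE_INT */
        struct { Node *lhs, *rhs; } add;  /* valid when kind == NODE_ADD */
    } as;
};

static long eval(const Node *n) {
    switch (n->kind) {
    case NODE_INT: return n->as.int_value;
    case NODE_ADD: return eval(n->as.add.lhs) + eval(n->as.add.rhs);
    }
    assert(!"unknown node kind");
    return 0;
}

/* Builds and evaluates 2 + 40. Returns 0 on success, -1 on failure. */
static int run_demo(long *out) {
    int rc = -1;
    Arena arena;
    if (!arena_init(&arena, 1024)) return rc;

    Node *lhs = arena_alloc(&arena, sizeof *lhs);
    Node *rhs = arena_alloc(&arena, sizeof *rhs);
    Node *sum = arena_alloc(&arena, sizeof *sum);
    if (!lhs || !rhs || !sum) goto cleanup;

    lhs->kind = NODE_INT; lhs->as.int_value = 2;
    rhs->kind = NODE_INT; rhs->as.int_value = 40;
    sum->kind = NODE_ADD; sum->as.add.lhs = lhs; sum->as.add.rhs = rhs;

    *out = eval(sum);
    rc = 0;

cleanup:
    /* Single exit path: a manual stand-in for `defer`. */
    arena_free(&arena);
    return rc;
}
```

Calling `run_demo(&value)` from a `main` function returns 0 and leaves 42 in `value`. The point is that nothing is freed node by node: every error path jumps to one `cleanup` label, and the whole tree disappears with a single `arena_free`.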
What to expect
Each post in this series will cover a concrete step:
- Tokenizer - scanning source bytes into tokens
- Parser - my own pattern-based parsing logic for statements, expressions, etc. No flex/bison and no recursion.
- AST - representing the program as a tree
- Type checking - starting simple, evolving from there
- Code generation - targeting our own bytecode VM
I'll share the actual code, the decisions behind it, and the mistakes along the way.
Next up: actual language design.