Compiler Series: Starting a Programming Language in C

First post in a series on building a programming language from scratch in C.

Tags: c, compilers, languages

I have always found it fun to build parsers and do something with the information gathered from the AST. In high school, I built a simple scripting language called lightscript as my final thesis. The code still lives somewhere on my GitHub. If you open the project, you will see there is not a lot of documentation, and I am scared to even look at this code 💀.

What always bothered me is that I did not have the time to make it an actual typed language with a proper compiler and a virtual machine backing it. It was just an interpreter with some built-in functions and working variables, arrays and functions. Granted, at the time I probably would not have known how to do it; I did not have the knowledge I have now.

I needed to take a break from all my other projects, and I figured that writing a VM-based programming language would be a good way to reset my brain and have fun with coding again.

So, as the title suggests, I will be building a programming language called Zin. The name comes from my own nickname - short, simple and easy to remember.

This is the first post in a series where I'll be documenting each step of building the language from tokenizer to virtual machine and compiler/codegen.

Why C

I initially considered Zig. It has some genuinely great ideas - tagged unions, comptime, built-in testing, arena allocators in the standard library. I spent some time prototyping in it and there's a lot to like about the language on paper.

But after spending more time with it, C won out for a few reasons:

  • Stability - Zig is great but it is not stable. The language is still pre-1.0, and things break between versions. For a project I want to build over time and come back to, I need a language that won't shift under my feet.
  • Tooling - ZLS has gotten better since the last time I tried it, but it's still not where I need it to be. I spent more time fighting the tooling than writing code. C's tooling story is ancient and boring, which is exactly what I want.
  • Full control - with C I implement everything myself. There's no magic, no hidden allocator behavior, no compiler doing things behind my back. Every allocation, every data structure, every abstraction is mine. For a compiler project, that's a feature, not a burden.
  • It's what compilers are written in - many widely used compilers and language runtimes (GCC, Clang, CPython, Lua) are written in C or C++. There are decades of literature, reference implementations, and battle-tested patterns in C. When I get stuck, the answers are already out there.

Arena allocators, tagged unions via enum + union, defer-like cleanup with gotos or macros - all of this is doable in C. It's more manual, but for a project like this I actually want that level of control. Every line of code is intentional.
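To make that concrete, here is a minimal sketch of the three patterns in plain C11: a bump-pointer arena, a tagged union built from an enum plus a union, and goto-based cleanup. None of this is Zin's actual code. The names (Arena, Value, sum_values and so on) are hypothetical and only there for illustration, and the arena assumes alignments no larger than what malloc already guarantees.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Arena: one buffer, bump-pointer allocation, everything freed at once. */
typedef struct {
    unsigned char *base;
    size_t cap;
    size_t used;
} Arena;

/* Returns 1 on success, 0 if the backing allocation failed. */
static int arena_init(Arena *a, size_t cap) {
    a->base = malloc(cap);
    a->cap = cap;
    a->used = 0;
    return a->base != NULL;
}

/* Returns NULL when the arena is full. `align` must be a power of two. */
static void *arena_alloc(Arena *a, size_t size, size_t align) {
    assert(align != 0 && (align & (align - 1)) == 0);
    size_t start = (a->used + align - 1) & ~(align - 1); /* round up */
    if (start > a->cap || size > a->cap - start) return NULL;
    a->used = start + size;
    return a->base + start;
}

static void arena_free(Arena *a) {
    free(a->base);
    a->base = NULL;
    a->cap = a->used = 0;
}

/* Tagged union: the enum says which member of the union is live. */
typedef enum { VAL_INT, VAL_BOOL } ValueKind;

typedef struct {
    ValueKind kind;
    union {
        int64_t i;
        int b;
    } as;
} Value;

static int64_t value_to_int(Value v) {
    switch (v.kind) {
    case VAL_INT:  return v.as.i;
    case VAL_BOOL: return v.as.b ? 1 : 0;
    }
    return 0;
}

/* Defer-like cleanup: every exit after arena_init goes through `cleanup`.
   Returns 0 on success, -1 on failure. Overflow checks are omitted. */
static int sum_values(const int64_t *xs, size_t n, int64_t *out) {
    int rc = -1;
    Arena arena;
    if (!arena_init(&arena, n * sizeof(Value) + 16)) return rc;

    Value *vals = arena_alloc(&arena, n * sizeof(Value), _Alignof(Value));
    if (vals == NULL) goto cleanup;

    int64_t total = 0;
    for (size_t i = 0; i < n; i++) {
        vals[i].kind = VAL_INT;
        vals[i].as.i = xs[i];
        total += value_to_int(vals[i]);
    }
    *out = total;
    rc = 0;

cleanup:
    arena_free(&arena);
    return rc;
}
```

The nice part of the arena is that nothing allocated from it needs an individual free. That fits a compiler well, since tokens and AST nodes tend to live and die together.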

What to expect

Each post in this series will cover a concrete step:

  1. Tokenizer - scanning source bytes into tokens
  2. Parser - my own pattern-based parsing logic for statements, expressions, etc. No flex/bison and no recursion.
  3. AST - representing the program as a tree
  4. Type checking - starting simple, evolving from there
  5. Code generation - targeting our own bytecode VM
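As a taste of step 1, "scanning source bytes into tokens" can be sketched in a few dozen lines of C. This is not Zin's tokenizer, which comes in a later post. The token kinds and the names Lexer, Token and next_token are assumptions made up for this example, and it only knows integers, identifiers and single-character punctuation.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical token kinds, just enough for the example. */
typedef enum { TOK_EOF, TOK_INT, TOK_IDENT, TOK_PUNCT } TokenKind;

typedef struct {
    TokenKind kind;
    const char *start; /* points into the source; not NUL-terminated */
    size_t len;
} Token;

typedef struct {
    const char *src;
    size_t pos;
    size_t len;
} Lexer;

static Lexer lexer_new(const char *src) {
    assert(src != NULL);
    Lexer lx = { src, 0, strlen(src) };
    return lx;
}

static int is_ident_char(unsigned char c) {
    return isalnum(c) || c == '_';
}

/* Skips whitespace, then returns the next token. Returns TOK_EOF forever
   once the input is exhausted. */
static Token next_token(Lexer *lx) {
    while (lx->pos < lx->len && isspace((unsigned char)lx->src[lx->pos]))
        lx->pos++;

    Token t = { TOK_EOF, lx->src + lx->pos, 0 };
    if (lx->pos >= lx->len) return t;

    size_t begin = lx->pos;
    unsigned char c = (unsigned char)lx->src[lx->pos];

    if (isdigit(c)) {
        t.kind = TOK_INT;
        while (lx->pos < lx->len && isdigit((unsigned char)lx->src[lx->pos]))
            lx->pos++;
    } else if (isalpha(c) || c == '_') {
        t.kind = TOK_IDENT;
        while (lx->pos < lx->len &&
               is_ident_char((unsigned char)lx->src[lx->pos]))
            lx->pos++;
    } else {
        t.kind = TOK_PUNCT; /* any other byte is a one-character token */
        lx->pos++;
    }

    t.len = lx->pos - begin;
    return t;
}
```

Tokens here are just slices of the source buffer (a pointer plus a length), so the scanner itself never allocates.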

I'll share the actual code, the decisions behind it, and the mistakes along the way.

Next up: actual language design.