Writing a compiler [tutorial]
« on: February 15, 2013, 07:27:00 pm »
Compiler tutorials often combine an overly simple source language with a useless target. This one combines a "sufficiently C-like language" with x64 (AMD64) as the target, so you can actually compile some interesting code with it and run it on your computer (no emulator needed).
How this tutorial works: first, general information on how to write a compiler like one of mine; then, more about what I actually did in it.
Some knowledge of x64 assembly may come in handy.
The Architecture
To keep things simple, the code-gen will come almost directly after the parser, with no significant optimizations or even type checking. Maybe I'll get into those later.
The picture of the architecture then becomes this:
Code: [Select]
input source code -> [parser] -> AST -> [semantic analysis] -> annotated AST -> [code-gen] -> blocks of code -> [linker] -> .exe file
AST means Abstract Syntax Tree.
The Parser
Compiler courses at universities tend to focus on parsing a lot, because it has a lot of "interesting" theory to talk about. In real life, though, you don't build your own parse tables; you use a parser generator. Parsing is, essentially, a solved problem. I use Coco/R a lot, because I like to work in C# and it's one of the few good parser generators that come with a plug-in for Visual Studio. In any case, it doesn't really matter how you decide to parse; the end result will probably be an Abstract Syntax Tree.
Abstract Syntax Trees
The Abstract Syntax Tree you get out of the parser is a tree structure of the input source code. It contains all the relevant information from the source in a format more suitable for a computer.
Semantic Analysis
In serious compilers, you'd do a lot more here, such as type checking and checking for other errors (Definite Assignment, for example). But to get a functioning compiler, all you need is to bind name-uses to their definitions (strictly speaking you don't even need this, but having it allows your language to make sense). This is where things finally get non-trivial enough to warrant an explanation.
Unless you used "The Parser Hack" (mixing the Symbol Table up with the parser), names in the AST, such as variable names and function names, are textual. But in the rest of the compiler, you'll need to know what the actual thing they refer to is, not just the name. That's easy enough to fix, though.
Walk through the AST, and for every "scope", you make a new symbol table and push it on a stack.
For everything that declares anything, put that thing in the top-most symbol table.
For every use of a name, search for that name in the stack of symbol tables. Use the definition closest to the top of the stack.
Practical considerations:
- the stack can be The Stack, by doing it recursively.
- though slightly "dirty", using a writable field in the AST nodes that represent variable uses and function calls to store a reference to the AST node that declared them, works great.
- using a map<string, Node> (or equivalent) for the symbol tables makes sense, and I see no reason not to do it.
- function arguments should have their declaration be active inside the scope of their functions, not in the global scope, unless you really want your language to not make sense.
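To make that concrete, here is a minimal C# sketch of the binding walk. The node types, the Children list and the Declaration field are made up for illustration - they stand in for whatever your AST actually looks like:
Code: [Select]
using System.Collections.Generic;

// Hypothetical AST node types, just enough to illustrate the walk.
abstract class Node { public List<Node> Children = new List<Node>(); }
class Block : Node { }
class VarDecl : Node { public string Name; }
class NameUse : Node { public string Name; public Node Declaration; }

static class Binder
{
    public static void Bind(Node node, Stack<Dictionary<string, Node>> scopes)
    {
        if (node is VarDecl decl)
            scopes.Peek()[decl.Name] = decl;          // declare in the top-most table
        if (node is NameUse use)
            foreach (var table in scopes)             // Stack<T> enumerates top-down,
                if (table.TryGetValue(use.Name, out var d))
                { use.Declaration = d; break; }       // so the nearest definition wins

        bool opensScope = node is Block;              // function nodes would open one too
        if (opensScope) scopes.Push(new Dictionary<string, Node>());
        foreach (var child in node.Children) Bind(child, scopes);
        if (opensScope) scopes.Pop();
    }
}
Calling Binder.Bind(root, new Stack<Dictionary<string, Node>>()) on a Block root pushes the global scope automatically. This version uses an explicit stack; as noted above, it can just as well be The Stack if you pass the scope along through the recursion.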
Code-gen
This is where the trouble starts. Code-gen is pretty hard to get through, and this is, I think, where most compiler projects get stuck. Number 1 reason: trying to generate fast code. Everyone likes fast code, but you'll get into lots of trouble - Register Allocation is just about doable (though nontrivial), but you'd need to match tree-patterns with a huge database of patterns to get some nice code. I like fast code too, though, so this is exactly what I'll be doing in my next compiler, and maybe I'll write a short tutorial for that when I get it working.
But for now, there will be slow code. Stack machine code, in fact. But AMD64 isn't a stack machine! And that's really OK, that just means there would be explicit pushes and pops. But on 64bit Windows, you shouldn't do that. It's not like you can't, but you shouldn't, and I'll get into why when I talk about the Linker. But fortunately, you'll only ever use a statically determinable amount of stack, at most the maximum depth of an expression. At most, but it can be less - using the stack for everything would be so wasteful that it would make me cry, so I keep the top of the stack in RAX. The offset on the stack where a temporary value should go is easily determined by keeping track of what the stack-depth would have been if you had used pushes and pops.
Generating code for a binary (2-operand) expression is easy then: generate the code for the Right Hand Side, save the value in the stack at the right offset, generate the code for the Left Hand Side, then use the appropriate "op RAX, [RSP + offset]" instruction.
That last part should make clear why the Right Hand Side comes first: RAX, the destination, has to hold the Left Hand Side (an "op [RSP + offset], RAX" form would put the result back into memory instead of into RAX).
An optimization is immediately obvious: generate "op RAX, constant" if the Right Hand Side is a constant (or the Left Hand Side, if the operation is commutative). Feel free to do it - I did.
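In C#, the scheme plus the constant shortcut could look roughly like this. BinOp, Const, EmitExpr, IsCommutative and Emit are made-up names, and Emit writes textual assembly here only for readability - the real thing calls the byte-emitting functions described later:
Code: [Select]
int depth;     // how deep the value stack would be if we had used pushes and pops
int spillBase; // start of the spill area, past the outgoing-argument area

void EmitBinary(BinOp node)
{
    if (node.Right is Const rc)
    {
        EmitExpr(node.Left);                      // LHS -> RAX
        Emit($"{node.Op} RAX, {rc.Value}");       // op RAX, constant
    }
    else if (node.Left is Const lc && IsCommutative(node.Op))
    {
        EmitExpr(node.Right);                     // RHS -> RAX, safe to swap
        Emit($"{node.Op} RAX, {lc.Value}");
    }
    else
    {
        EmitExpr(node.Right);                     // RHS first...
        int offset = spillBase + 8 * depth++;     // ...spilled where a push would have gone
        Emit($"mov [RSP + {offset}], RAX");
        EmitExpr(node.Left);                      // LHS last, so it ends up in RAX
        depth--;
        Emit($"{node.Op} RAX, [RSP + {offset}]"); // result stays in RAX
    }
}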
Generating code for a variable may require some trickery: if you have array types that should "decay" into pointers, for example, using a variable of that type should give its address instead of its value.
For your usual local variables, you'd generate a "mov RAX, [RBP + offset]". But what should the offset be? That's where an extra walk over the AST comes in: it sums the sizes of the local variables, and the running total at the point where a variable is declared becomes that variable's offset (you can use smarter layouts, but this works). In this walk over the AST, you should also keep track of the maximum number of arguments to functions that you call, because you'll need it.
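The walk itself is short; again the node types and fields are invented:
Code: [Select]
int localBytes, maxCallArgs; // both feed into the prologue later

void Layout(Node node)
{
    if (node is VarDecl decl)
    {
        localBytes += decl.Type.SizeInBytes; // running total of local sizes...
        decl.Offset = localBytes;            // ...is this variable's offset
    }
    if (node is Call call)
        maxCallArgs = Math.Max(maxCallArgs, call.Args.Count);

    foreach (var child in node.Children) Layout(child);
}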
Generating code for a constant is trivial: "mov RAX, constant"
Generating code for assignments. Suddenly the Left Hand Side is not a value, but a location. It's possible to use a special LValue code-gen for that, but I just special-cased everything right there. In the language I compile there isn't much to choose from anyway: local scalar variables (including function arguments), local array dereference or pointer dereference. (no globals, though they wouldn't be hard to add)
Generating code for "return" is interesting. Code should be generated for its argument (if it has one), but then what? How do you return? Not "ret", or you'd skip the epilogue of the function. A jump would need to know the offset to the end.. but that isn't known yet. So a "sort of like half-a-stage" at the very end of the code-gen takes care of that.
Practical: you need to know the label of the end of this function. For this reason and others, you need to know the current function, so keep track of it somehow.
Generating code for "if" and "while". Again labels are needed, and the Linker will deal with them. The new and interesting bit here is conditions. It's possible to use the normal expression code-gen and then do a "test rax, rax", and always use the Z flag. But the code generated that way sucks too much for me, so I wrote a special code-gen for conditions. Tip: only the root node of the AST of the condition is really special, and you can have a rule that it must be a comparison. That means the special condition code-gen only has one method in it.
Generating code for functions. The body is trivial once you have the rest done; it's the prologue and epilogue that are the problem. You should have the "maximum stack usage by expressions" somewhere, and the "space used by locals", and the "maximum number of arguments to functions that you call" (multiply that by 8). Don't forget to align the stack to 16 bytes, otherwise Windows functions that you call will likely crash.
Also, to simplify the code-gen for using variables and calling functions, the arguments to this function should be saved to their home locations on the stack. Optionally a frame pointer can be used; you don't really need it, but I chose to use it because the generated code is a bit easier to understand and debug (bugs in the code-gen will happen, so that's not unimportant).
"return" needed a label to just before the epilogue, so actually put that label there.
Generating code for strings: "lea RAX, [rip+somelabel]". Tell the linker about the string and this usage, though. It has to put it in your data segment and patch up the label reference.
Generating code for imports: you don't. They just have a label. You don't even have to tell the Linker about this, if it is ever called then the call can do that.
Generating code for function calls: mostly trivial, except when calling imports. Those have to be called as "call [rip+label]", and you have to tell the Linker about it. Names were bound, so at this point you can distinguish internal calls from calls to imports, and you know the details of the import, such as which dll it came from (because you have its declaration node), so you can do the actual importing in the call node. This means unused imports won't be imported.
You can make up the rest (unary operators, the special handling of division and modulo, for loops, etc), it's just more of the same.
Blocks of Code and the "sort of like half-a-stage"
At this point, you really don't want to be stuck with textual assembly code - you'd only have to parse it again. I just made a giant list of functions like "add_q_r_r(R r, R rm)" that output the right bytes into a buffer, and then collect a bunch of buffers. Every "block" has a buffer, an optional label, and an optional "target label".
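For example, "add_q_r_r" - add between two 64bit registers - encodes "REX.W 03 /r" directly. Here R is a hypothetical register enum whose numeric value is the hardware register number (RAX = 0 through R15 = 15):
Code: [Select]
List<byte> buffer = new List<byte>(); // the current block's code buffer

void add_q_r_r(R r, R rm)
{
    int reg = (int)r, mem = (int)rm;
    buffer.Add((byte)(0x48 | ((reg >> 3) << 2) | (mem >> 3))); // REX.W, plus R/B bits for R8..R15
    buffer.Add(0x03);                                          // opcode: add r64, r/m64
    buffer.Add((byte)(0xC0 | ((reg & 7) << 3) | (mem & 7)));   // ModRM: mod=11 (register direct)
}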
The "half stage" then walks over the blocks, and does a couple of things:
- it tries to fix up jumps. But what if they're not within range? If you used the 32bit-offset versions, you're in real trouble: an offset too big for 32 bits would mean your code doesn't fit in 2GB, which makes no sense. If you used 8bit offsets, you may have to go back, change the instruction, and change the length of the block. While resizes keep happening, keep iterating this procedure (a resize may cause a jump target to fall out of range)
- when all the sizes are set in stone, fix up the actual offsets of jumps and calls to internal functions.
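The iteration, in sketch form (the Block fields and helpers are invented):
Code: [Select]
bool changed = true;
while (changed)                      // a widened jump can push others out of range
{
    changed = false;
    foreach (var block in blocks)
    {
        if (block.TargetLabel == null || !block.IsShortJump) continue;
        long disp = OffsetOf(block.TargetLabel) - EndOffsetOf(block);
        if (disp < sbyte.MinValue || disp > sbyte.MaxValue)
        {
            WidenToRel32(block);     // rewrite the jump; the block grows
            changed = true;
        }
    }
}
foreach (var block in blocks)        // sizes are final, now the offsets are real
    if (block.TargetLabel != null) PatchDisplacement(block);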
Linker
The separate compilation model is 99% a relic of the past. So the "Linker", or what's left of it, has lost the task of gathering stuff from object files. In fact, there are no object files.
What it does, then, is:
- create import and exception directories
- patch up import references (and data references, if you have, say, string literals)
- lay out the program code and data etc in virtual memory
- write it all into a valid .exe file
Generating the import directory is easy when you know its structure. The exception directory is trickier, though. It has to record, for every function, where it began and where it ended, how much stack it allocated, which register (if any) it used as a frame pointer and at what offset relative to RSP, and, most annoyingly of all, the exact sequence of events in the prologue.
Most of that information is only available in the code-gen function that handles the AST node for functions, so that's where I put it into an object and hand it to the Linker, which just keeps it in a list until its own stage runs.
Patching up import references. To do this, you need the big list of places where imports were called, and the import directory you just made. Simply write the offsets into the code.
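A sketch, assuming imports are called with the 6-byte "FF 15 disp32" form of "call [rip+label]"; the displacement counts from the end of the instruction (the site fields and IatRvaOf are invented):
Code: [Select]
foreach (var site in importCallSites)
{
    uint targetRva = IatRvaOf(site.Import); // IAT slot from the directory just built
    uint nextInstructionRva = site.Rva + 6; // "call [rip+disp32]" is 6 bytes
    int disp = (int)(targetRva - nextInstructionRva);
    BitConverter.GetBytes(disp).CopyTo(code, site.FileOffset + 2); // skip the FF 15
}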
Laying stuff out in virtual memory is simple: put everything in its segment and align the segments - by the segment alignment, that is, not by the file alignment.
Writing it out into a valid .exe seems hard but isn't really; just study the structure here. In theory there's a huge number of fields you could fill in, but most of them aren't necessary to make the program run. Remember to align the sections you're writing by the file alignment, though, which is typically lower than the segment alignment (in other words, in virtual space there is more padding than there will be in the file).
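The two alignments in code, with the typical values (0x1000 and 0x200) as assumptions:
Code: [Select]
const uint SectionAlignment = 0x1000; // a page, used for the virtual layout
const uint FileAlignment    = 0x200;  // a disk sector, used for the file layout

static uint AlignUp(uint value, uint alignment)
    => (value + alignment - 1) & ~(alignment - 1);

// Each section starts at AlignUp(...) of the previous one's end:
// in memory by SectionAlignment, on disk only by FileAlignment,
// which is why the file is smaller than the loaded image.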
More
I will probably expand this part quite a few times. There's a lot I could write, but most of it seems irrelevant to me right now. Be sure to ask things - I've found that I explain things better when asked about them, compared to when I'm just monologuing.
One thing I did not mention above is that I have types. Kind of. The types are relevant, for example, in arrays and pointer-dereference. Strings pretend to be pointers to char. Pointers use 64bit math.
Also, I have an "op RAX, [RBP + offset]" optimization, and something similar for array accesses, and I reverse commutative ops when it helps, to save quite a lot of code that I'd have to wade through when debugging.
Constant folding and using trivial identities (eg x*0=0). The way I do this is by giving every expression-node a virtual method Optimize that by default returns "this", but many sub-types override it with a method that returns a constant if the result of that node is (locally - ie it does not track whether a variable is a known constant) known to be a constant.
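In sketch form, for multiplication (node types invented; note that folding x*0 is only safe when expressions have no side effects):
Code: [Select]
abstract class Expr { public virtual Expr Optimize() => this; }
class Const : Expr { public long Value; public Const(long v) { Value = v; } }

class Mul : Expr
{
    public Expr Left, Right;
    public override Expr Optimize()
    {
        Left = Left.Optimize(); Right = Right.Optimize();   // fold children first
        if (Left is Const a && Right is Const b)
            return new Const(a.Value * b.Value);            // both constant: fold completely
        if (Left is Const x && x.Value == 0 || Right is Const y && y.Value == 0)
            return new Const(0);                            // trivial identity x*0 = 0
        return this;                                        // locally unknown: leave it
    }
}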