A compiler is a program that translates source code written in one programming language (the source language) into another language (the target language), which is often machine code that a computer's CPU can execute directly.
This translation process is complex and is broken down into a series of distinct stages or phases. Each phase takes the output of the previous phase as its input and produces an intermediate representation of the source program.
The main phases of a compiler are typically grouped into two major parts, the Front End (analysis) and the Back End (synthesis), often with a machine-independent middle stage between them.
This separation allows for great modularity. For example, you can build a compiler that supports multiple languages and multiple target machines by creating a front end for each language and a back end for each machine, and then mixing and matching them (e.g., the LLVM compiler infrastructure).
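The mix-and-match idea can be sketched with toy components. Everything below is hypothetical and chosen only for illustration: the two "languages" are prefix and postfix arithmetic, the shared IR is a postfix token list, and the two "targets" are a stack-machine listing and an infix string. It is not how LLVM itself is structured.

```python
OPS = {"+": "ADD", "*": "MUL"}  # operators understood by every component

def postfix_front_end(text):
    """Toy language 1: postfix such as '2 3 +'. Its tokens are already the IR."""
    return text.split()

def prefix_front_end(text):
    """Toy language 2: prefix such as '+ 2 3'. Reorders tokens into postfix IR."""
    tokens = iter(text.split())

    def parse():
        token = next(tokens)
        if token in OPS:
            left, right = parse(), parse()
            return left + right + [token]
        return [token]

    return parse()

def stack_vm_back_end(ir):
    """Toy target 1: instructions for an imaginary stack machine."""
    return [OPS[t] if t in OPS else f"PUSH {t}" for t in ir]

def infix_back_end(ir):
    """Toy target 2: a fully parenthesized infix expression."""
    stack = []
    for t in ir:
        if t in OPS:
            right, left = stack.pop(), stack.pop()
            stack.append(f"({left} {t} {right})")
        else:
            stack.append(t)
    return stack.pop()

FRONT_ENDS = {"prefix": prefix_front_end, "postfix": postfix_front_end}
BACK_ENDS = {"stack-vm": stack_vm_back_end, "infix": infix_back_end}

def compile_source(text, language, target):
    # Any front end can be paired with any back end because both agree on the IR.
    return BACK_ENDS[target](FRONT_ENDS[language](text))
```

With two front ends and two back ends, four language/target combinations are available without writing four separate compilers; for instance, `compile_source("+ 2 * 3 4", "prefix", "stack-vm")` and `compile_source("2 3 4 * +", "postfix", "infix")` share the same IR in the middle.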
Here are the main phases in their logical order:
Lexical analysis (also called scanning) is the first phase of the compiler. It reads the source code as a stream of characters and groups them into meaningful sequences called lexemes. For each lexeme, the lexical analyzer produces a token.
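The grouping of characters into lexemes and tokens can be sketched with a regular-expression scanner. This is a minimal sketch, not a production lexer: the token categories and the tiny keyword set in `TOKEN_SPEC` are assumptions made up for a small C-like language.

```python
import re

# Hypothetical token categories for a tiny C-like language (illustrative only).
# Order matters: KEYWORD must be tried before IDENTIFIER.
TOKEN_SPEC = [
    ("KEYWORD",     r"\b(?:int|float|return)\b"),
    ("IDENTIFIER",  r"[A-Za-z_]\w*"),
    ("CONSTANT",    r"\d+"),
    ("OPERATOR",    r"[=+\-*/]"),
    ("PUNCTUATION", r"[;(){}]"),
    ("SKIP",        r"\s+"),
]
MASTER_RE = re.compile("|".join(f"(?P<{name}>{pattern})" for name, pattern in TOKEN_SPEC))

def tokenize(source):
    """Return a list of (token_type, lexeme) pairs for the source string."""
    tokens = []
    pos = 0
    while pos < len(source):
        match = MASTER_RE.match(source, pos)
        if match is None:
            raise SyntaxError(f"unexpected character {source[pos]!r} at offset {pos}")
        if match.lastgroup != "SKIP":  # whitespace separates lexemes but is not a token
            tokens.append((match.lastgroup, match.group()))
        pos = match.end()
    return tokens

print(tokenize("int result = 10;"))
# [('KEYWORD', 'int'), ('IDENTIFIER', 'result'), ('OPERATOR', '='),
#  ('CONSTANT', '10'), ('PUNCTUATION', ';')]
```

Real lexers usually also record line and column numbers for each token so that later phases can report errors precisely.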
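For example, the lexical analyzer turns the statement `int result = 10;` into the token stream `<KEYWORD, int>`, `<IDENTIFIER, result>`, `<OPERATOR, =>`, `<CONSTANT, 10>`, `<PUNCTUATION, ;>`.

Syntax analysis (parsing) is the second phase. The parser takes the stream of tokens from the lexical analyzer and verifies that it can be generated by the grammar of the source language. It builds a tree-like representation of the code, typically an abstract syntax tree (AST), that shows its grammatical structure.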
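One common way to write such a parser by hand is recursive descent, with one method per grammar rule. The sketch below assumes a hypothetical mini-grammar for assignment statements (`statement`, `expr`, `term`, `factor`), tokens given as `(type, lexeme)` tuples, and an AST encoded as nested tuples; none of these choices come from a particular compiler.

```python
class Parser:
    """Recursive-descent parser for:  statement := IDENT '=' expr ';'
                                      expr      := term (('+'|'-') term)*
                                      term      := factor (('*'|'/') factor)*
                                      factor    := CONSTANT | IDENT | '(' expr ')'
    """

    def __init__(self, tokens):
        self.tokens = tokens  # list of (type, lexeme) tuples
        self.pos = 0

    def peek(self):
        return self.tokens[self.pos] if self.pos < len(self.tokens) else ("EOF", "")

    def expect(self, kind, text=None):
        tok_kind, tok_text = self.peek()
        if tok_kind != kind or (text is not None and tok_text != text):
            raise SyntaxError(f"expected {text or kind}, found {tok_text!r}")
        self.pos += 1
        return tok_text

    def statement(self):
        name = self.expect("IDENTIFIER")
        self.expect("OPERATOR", "=")
        value = self.expr()
        self.expect("PUNCTUATION", ";")
        return ("assign", name, value)

    def expr(self):
        node = self.term()
        while self.peek() in (("OPERATOR", "+"), ("OPERATOR", "-")):
            op = self.expect("OPERATOR")
            node = (op, node, self.term())  # left-associative
        return node

    def term(self):
        node = self.factor()
        while self.peek() in (("OPERATOR", "*"), ("OPERATOR", "/")):
            op = self.expect("OPERATOR")
            node = (op, node, self.factor())
        return node

    def factor(self):
        kind, text = self.peek()
        if kind == "CONSTANT":
            self.pos += 1
            return ("num", int(text))
        if kind == "IDENTIFIER":
            self.pos += 1
            return ("var", text)
        if (kind, text) == ("PUNCTUATION", "("):
            self.pos += 1
            node = self.expr()
            self.expect("PUNCTUATION", ")")
            return node
        raise SyntaxError(f"unexpected token {text!r}")

tokens = [("IDENTIFIER", "a"), ("OPERATOR", "="), ("IDENTIFIER", "b"),
          ("OPERATOR", "+"), ("IDENTIFIER", "c"), ("OPERATOR", "*"),
          ("CONSTANT", "10"), ("PUNCTUATION", ";")]
ast = Parser(tokens).statement()
# ('assign', 'a', ('+', ('var', 'b'), ('*', ('var', 'c'), ('num', 10))))
```

Because `term` is called from inside `expr`, multiplication binds tighter than addition, so the tree itself encodes operator precedence.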
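Semantic analysis checks the AST for semantic consistency with the language definition. It goes beyond syntax to check whether the code "makes sense": for example, that variables are declared before they are used and that operand types are compatible.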
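A minimal sketch of such checks is shown below. It assumes a hypothetical tuple-based AST (node kinds `num`, `str`, `var`, `declare`, and the arithmetic operators), a two-type language (`int` and `string`), and a plain dictionary standing in for the symbol table; the `SemanticError` class is likewise invented for this example.

```python
class SemanticError(Exception):
    """Raised when syntactically valid code violates a language rule."""

def check(node, symbols):
    """Return the type of a node, updating `symbols` (name -> type) on declarations."""
    kind = node[0]
    if kind == "num":
        return "int"
    if kind == "str":
        return "string"
    if kind == "var":
        name = node[1]
        if name not in symbols:
            raise SemanticError(f"'{name}' used before declaration")
        return symbols[name]
    if kind == "declare":  # ("declare", declared_type, name, init_expr)
        _, declared_type, name, init = node
        if name in symbols:
            raise SemanticError(f"'{name}' declared twice")
        init_type = check(init, symbols)
        if init_type != declared_type:
            raise SemanticError(
                f"cannot initialize {declared_type} '{name}' with {init_type}")
        symbols[name] = declared_type
        return declared_type
    if kind in ("+", "-", "*", "/"):
        left, right = check(node[1], symbols), check(node[2], symbols)
        if left != "int" or right != "int":
            raise SemanticError(
                f"operator '{kind}' needs int operands, got {left} and {right}")
        return "int"
    raise ValueError(f"unknown node kind {kind!r}")

symbols = {}
check(("declare", "int", "result", ("num", 10)), symbols)   # int result = 10;
check(("+", ("var", "result"), ("num", 1)), symbols)        # result + 1 is well-typed
```

Both `int s = "hi";` and a reference to an undeclared name are grammatical, yet each raises `SemanticError` here, which is the distinction between syntax and semantics.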
After the front-end analysis is complete, many compilers perform intermediate code generation: they produce an explicit, machine-independent intermediate representation (IR). A good IR is both easy to produce from the AST and easy to translate into target code.
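One widely taught IR is three-address code, in which each instruction has at most one operator and results are stored in temporaries. The sketch below lowers a tuple-based AST into such code; the AST encoding, the `t1, t2, ...` naming scheme, and the choice to load every constant into its own temporary are assumptions for illustration.

```python
import itertools

def generate_tac(statement):
    """Lower ("assign", name, expr) into a list of three-address instructions."""
    code = []
    counter = itertools.count(1)  # source of fresh temporary numbers

    def lower(node):
        """Emit code for `node` and return the name that holds its value."""
        kind = node[0]
        if kind == "var":
            return node[1]
        if kind == "num":
            temp = f"t{next(counter)}"
            code.append(f"{temp} = {node[1]}")
            return temp
        # Binary operator: evaluate both operands first, then combine them.
        left = lower(node[1])
        right = lower(node[2])
        temp = f"t{next(counter)}"
        code.append(f"{temp} = {left} {kind} {right}")
        return temp

    _, target, expr = statement
    result = lower(expr)
    code.append(f"{target} = {result}")
    return code

# AST for:  a = b + c * 10;
ast = ("assign", "a", ("+", ("var", "b"), ("*", ("var", "c"), ("num", 10))))
for line in generate_tac(ast):
    print(line)
```

The traversal is post-order, so operands are always computed before the instruction that uses them.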
For example, the statement `a = b + c * 10;` might be translated into the following three-address code:

    t1 = 10
    t2 = c * t1
    t3 = b + t2
    a = t3

Code optimization takes the intermediate code and tries to improve it to make the final program faster, smaller, or more power-efficient. This is often the most complex part of a modern compiler.
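As one small instance of such an improvement, the sketch below performs constant folding with simple constant propagation over straight-line code. The instruction format `(dest, op, left, right)`, the `"const"` marker, and the assumption that every name is assigned at most once are all simplifications made for this example; production optimizers work on richer IRs and handle control flow.

```python
import operator

# Only operations that are safe to evaluate at compile time in this sketch
# (division is left out to avoid deciding how to treat division by zero).
OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}

def fold_constants(code):
    """Replace operations on known constants with their computed values.

    Each instruction is (dest, op, left, right); operands are ints (constants)
    or strs (names). Assumes each name is assigned at most once.
    """
    known = {}   # name -> constant value discovered so far
    result = []
    for dest, op, left, right in code:
        # Propagate: substitute names whose values are already known constants.
        left = known.get(left, left)
        right = known.get(right, right)
        if op in OPS and isinstance(left, int) and isinstance(right, int):
            value = OPS[op](left, right)   # fold: compute at compile time
            known[dest] = value
            result.append((dest, "const", value, None))
        else:
            result.append((dest, op, left, right))
    return result

before = [("t1", "*", 2, 5), ("t2", "+", "x", "t1")]
after = fold_constants(before)
# [('t1', 'const', 10, None), ('t2', '+', 'x', 10)]
```

After this pass `t1` is no longer read by any instruction, so a separate dead-code elimination pass could remove it.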
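A classic example is constant folding, which evaluates constant expressions at compile time (e.g., replacing `2 * 5` with `10`).

Code generation is the final phase of the compiler. It takes the optimized intermediate code and translates it into the target language, which is typically the machine code or assembly language for a specific processor.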
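The translation step can be sketched with a deliberately naive generator. The target here is an invented load/store machine with mnemonics `LOAD`, `STORE`, `ADD`, `SUB`, `MUL`, `DIV` and registers `R0` and `R1`; it is not a real instruction set, and the IR format `(dest, op, left, right)` with `op=None` for a plain copy is also an assumption.

```python
MNEMONICS = {"+": "ADD", "-": "SUB", "*": "MUL", "/": "DIV"}

def operand(value):
    # Constants become immediates (e.g. #10); names are treated as memory locations.
    return f"#{value}" if isinstance(value, int) else str(value)

def generate(code):
    """Translate (dest, op, left, right) instructions into pseudo-assembly lines."""
    asm = []
    for dest, op, left, right in code:
        asm.append(f"LOAD R0, {operand(left)}")
        if op is not None:
            asm.append(f"LOAD R1, {operand(right)}")
            asm.append(f"{MNEMONICS[op]} R0, R0, R1")
        asm.append(f"STORE {dest}, R0")   # every result goes straight back to memory
    return asm

ir = [("t2", "*", "c", 10), ("t3", "+", "b", "t2"), ("a", None, "t3", None)]
for line in generate(ir):
    print(line)
```

This version reloads every operand from memory and always uses the same two registers. A realistic back end would add instruction selection and register allocation so that values such as `t2` stay in a register between instructions.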
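Two other important components are active throughout the compilation process: the symbol table, which stores information about identifiers (such as their names, types, and scopes), and the error handler, which detects and reports errors in every phase.

The following table summarizes the phases: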
| Phase | Input | Output | Main Task |
| ------------------------- | --------------------------------------- | ------------------------------------- | ----------------------------------------------------------------------- |
| Front End | | | Analysis (Language-Dependent) |
| 1. Lexical Analysis | Source Code | Stream of Tokens | Group characters into "words" (tokens). |
| 2. Syntax Analysis | Stream of Tokens | Abstract Syntax Tree (AST) | Verify grammatical structure. |
| 3. Semantic Analysis | Abstract Syntax Tree (AST) | Annotated AST | Check for meaning, types, and scope. |
| Middle | | | Bridging the Gap |
| 4. Intermediate Code Gen | Annotated AST | Intermediate Representation (IR) | Create a machine-independent representation. |
| Back End | | | Synthesis (Machine-Dependent) |
| 5. Code Optimization | Intermediate Representation (IR) | Optimized IR | Improve the code's efficiency (speed, size). |
| 6. Code Generation | Optimized IR | Target Code (e.g., Assembly) | Translate IR to machine-specific instructions and allocate registers. |