This project implements a custom programming language grad
and its compiler, written in Rust. The compiler follows a multi-stage process to transform source code into executable bytecode, which is then interpreted by a custom Stack Based Virtual Machine (VM).
Try the language in the playground.
cargo install grad
echo "let a = 10; print(a);" > example.grad
grad run example.grad
- Compiler Overview
- Lexical Analysis
- Parsing
- Abstract Syntax Tree (AST)
- Code Generation
- Virtual Machine
- String Interning
- Example: Program Compilation and Execution
- Future Improvements
The compiler follows these main stages:
- Lexical Analysis - Tokenizes the input source code.
- Parsing - Builds an Abstract Syntax Tree (AST).
- Code Generation - Transforms the AST into bytecode.
- Virtual Machine - Executes the generated bytecode.
The lexical analysis is performed by the Lexer
struct. It tokenizes/splits the input source code into a series of Token
s.
pub struct Lexer {
pub tokens: Vec<Token>,
}
#[derive(Debug, PartialEq, Clone)]
pub struct Token {
pub token_type: TokenType,
pub lexeme: String,
pub span: std::ops::Range<usize>,
}
The Lexer
uses logos to identify different token types such as keywords, identifiers, literals, and operators. It also return the span (start and end positions) of each token in the source code.
The parsing stage is implemented using a recursive descent parser with Pratt parsing for expressions.
pub struct Parser<'a> {
lexer: &'a mut Lexer,
}
The parser uses methods like parse_statement()
, parse_expression()
, and various other parsing functions to build the Abstract Syntax Tree (AST).
The expression parsing uses the Pratt parsing technique for handling operator precedence:
fn expr_bp(lexer: &mut Lexer, min_bp: u8) -> ParseResult<ASTNode> {
// ... (Pratt parsing implementation by matklad)
}
This allows for efficient and correct parsing of complex expressions with different operator precedences.
The AST is represented using the ASTNode
enum:
pub enum ASTNode {
IntNumber(i64),
FloatNumber(f64),
Identifier(String),
Boolean(bool),
String(String),
Op(Ops, Vec<ASTNode>),
Callee(String, Vec<ASTNode>),
Let(String, Vec<ASTNode>),
Assign(String, Vec<ASTNode>),
If(Vec<ASTNode>, Vec<ASTNode>, Option<Vec<ASTNode>>),
While(Vec<ASTNode>, Vec<ASTNode>),
Print(Vec<ASTNode>),
Function(String, Vec<String>, Vec<ASTNode>),
Block(Vec<ASTNode>),
}
This structure allows for representing various language constructs, including literals, variables, function calls, control flow statements, and more.
The code generation phase transforms the AST into bytecode that can be executed by the Virtual Machine. This process is handled by the Compiler
struct:
pub struct Compiler {
chunk: Chunk,
interner: Interner,
locals: Vec<Local>,
local_count: usize,
scope_depth: u8,
functions: Vec<Function>,
function_count: usize,
}
The compiler emits bytecode instructions represented by the OpCode
enum:
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[repr(u8)]
pub enum OpCode {
OpConstant,
OpNil,
OpTrue,
OpFalse,
// ... (other opcodes)
}
These instructions, along with their operands, are stored in a Chunk
:
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
pub code: Vec<VectorType>,
pub constants: Vec<ValueType>,
}
The VectorType
enum allows for storage of both opcodes and constant indices in the same vector.
The Virtual Machine (VM) is executes the generated bytecode. It's implemented in the VM
struct:
pub struct VM {
pub chunk: Chunk,
ip: usize,
stack: [ValueType; STACK_MAX],
stack_top: usize,
pub interner: Interner,
globals: HashMap<StringObjIdx, ValueType>,
call_frames: Vec<CallFrame>,
frame_index: usize,
}
The VM uses a stack-based architecture for executing instructions. It maintains a stack for operands and local variables, a global variable table, and call frames for function calls.
The main execution loop of the VM interprets each opcode and performs the corresponding operation:
pub fn run(&mut self) -> Result {
// ... (main execution loop)
}
To optimize string handling, the compiler uses string interning via the Interner
struct:
pub struct Interner {
pub map: HashMap<String, usize>,
vec: Vec<String>,
}
It allows for efficient storage and comparison of strings by assigning unique indices to each unique string.
Here's a simple program to demonstrate how it progresses through each stage of the compiler.
print("Hello, world!");
let a = 4.0;
let b = 2**2;
print(a + b);
The lexer breaks down the program into tokens:
[
Token { token_type: PRINT, lexeme: "print", span: 0..5 }
Token { token_type: LeftParen, lexeme: "(", span: 5..6 }
Token { token_type: String, lexeme: "\"Hello, world!\"", span: 6..21 }
Token { token_type: RightParen, lexeme: ")", span: 21..22 }
Token { token_type: SEMICOLON, lexeme: ";", span: 22..23 }
Token { token_type: LET, lexeme: "let", span: 25..28 }
Token { token_type: Identifier, lexeme: "a", span: 29..30 }
...
]
The parser creates an Abstract Syntax Tree:
Print
String(""Hello, world!"")
Let(a)
FloatNumber(4)
Let(b)
Op(PostfixOp(StarStar))
IntNumber(2)
IntNumber(2)
Print
Op(BinaryOp(Add))
Identifier(a)
Identifier(b)
The compiler generates bytecode:
0000 OP_CONSTANT 0 | intr->"Hello, world!"
0002 OP_PRINT
0003 OP_CONSTANT 2 | 4
0005 OP_DEFINE_GLOBAL 1 | intr->a
0007 OP_CONSTANT 4 | 2
0009 OP_CONSTANT 5 | 2
0011 OP_POWER
0012 OP_DEFINE_GLOBAL 3 | intr->b
0014 OP_GET_GLOBAL 6 | intr->a
0016 OP_GET_GLOBAL 7 | intr->b
0018 OP_ADD
0019 OP_PRINT
0020 OP_RETURN
The VM executes the bytecode, resulting in the output:
Hello, world!
8
While this compiler implements core functionality, there are several areas for potential improvements:
- Type System: Implement a static type system with type inference for improved safety and performance.
- Optimization: Add optimization passes to improve the generated bytecode's efficiency.
- Error Handling: Enhance error reporting with more detailed messages and source code locations.
- Garbage Collection: Implement a garbage collector for automatic memory management.
- REPL: Implement a Read-Eval-Print Loop for interactive programming.