Caution!
July 18, 2010: Oops, dates on my computer were screwed up for one day, hence the commits timestamps were all wrong. I fixed it, and force-pushed the changes to github. Unfortunately, if you cloned / fetched the repository recently, you'll have to re-clone it again now -- merges / rebases will not work otherwise.
This project is inspired by this series of blog posts by Anatoly Vorobey (the link is in Russian).
Scheme implementations are a dime a dozen these days. In fact, the best ones are even free. (My personal favorite is Chicken Scheme ). Nobody needs another scheme interpreter, not even me. So why write it? For fun, obviously. Besides, sometimes I think every programmer should implement some kind of Lisp at least once -- and no, suffering the effects of Greenspun's 10th Rule doesn't give you a free pass.
So. The plan is to implement a relatively straightforward Scheme interpreter in C. I plan to keep C codebase as small as possible -- maybe under 2000 lines tops, maybe under 1500.
Minimum viable feature set: a Lisp-1, with lexical scoping, Fixnum and floating point arithmetics, unhygienic macros, tail-call elimination, reentrant continuations, and automatic memory management (garbage collection).
Note
July 18, 2010: The minimum viable feature set implemented. It's rough around the edges, but all the big things are in place and seem to work. Took me 5 days to get here, by git log.
- Latest achievements:
- Mark-and-sweep stop-the-world Garbage Collection * gray set implementation is more efficient now
- C source lines: 1134, LoC: 942.
- What works:
- Some builtin arithmetics (fixnum and double), list functions.
lambda
, non-builtin function calls.- Lexical scope.
- Tail-call elimination.
quasiquote
and user defined macros viadefmacro
.call-with-current-continuation
works.read
/print
implemented, Read-Eval-Print Loop reimplemented in scheme.- Automatic memory management / Garbage Collection.
- What doesn't:
- Error handling is just not there.
- vectors
- syntax-rules
As of now, it's a more or less vanilla SECD [1] machine, modified for varargs, special forms and tail calls elimination. Modifications are as follows:
- Stack is not a simple list, but a list of lists.
apply
removes the top list of the stack (so we can support varargs). - Some
PROCEDURE
s are markedFL_SYNTAX
. When SECD machine detects a syntax call (takes some look-ahead), arguments are not evaluated. Also, ifFL_EVAL
flag is set for syntax, the return value of the procedure is queued up for re-evaluation. - When a tail call is detected, new dump frames allocation is skipped
in
apply
.
Tagged values. scm_val
is C long
. Lower 2 bits define the semantics
of the upper 30 / 62 bits as follows. We rely on the cell allocator to
always align cells on 4-byte boundary. Since we have our own allocator,
it's easy to enforce.
Machine word bit values | scm_val type |
---|---|
<30 or 62 bits of pointer>00 | Cell ptr, type info in cell |
<31 or 63 bits of fixnum>1 | Fixnum |
<24 or 56 bits of data><6 bits of tag>10 | Extended tag (next 6 bits) |
Primitive types are: CHAR
, BOOL
, FIXNUM
, SYMBOL
,
SPECIAL
.
PROCEDURE is a toplevel type. flags used are SYNTAX, BUILTIN.
- PROCEDURE
- is a cons(DEFINITION, env)
- DEFINITION for non-BUILTIN
- is a cons(formals, body)
- DEFINITION for BUILTIN
- is a cons(CFUNC, hint)
- CFUNC
- is an
scm_val (*cfunc)(scm_val params, <SECD machine ptr>, scm_val hint)
In SECD machine [1], continuation is the content of dump register. So, basically, we capture the state of SECD machine, and we can restore it later.
I admittedly don't understand macros well. For now, quasiquote
is
implemented, and hooked up as the mechanism for user-defined macros. It
cons()
-es like there's no tomorrow, of course, but hey, it gets the job
done.
Pretty naive tri-color mark-and-sweep Garbage Collection [2]. We do our
own C stack walking to collect pointers referencing something inside of our
memory pool. GC provides gc_register()
call to notify GC about memory
locations which may be of interest for GC. SECD machine uses it to register
S
, E
, C
and D
.
- Memory management: we can try to force every non-cell blob object (like string data) to be always pointed at by exactly one cell. Then we get a pool for cells and another pool for blobs. Objects in cell pool can be garbage-collected trivially (walking C-stack may be necessary, though), blobs are freed when the referencing cell is GC-ed. I don't think I care enough to do real compaction -- freelist should be enough.
- Garbage Collection improvements:
- unroll the unnecessary "scm-aware
cons()
" code changes gc_unregister()
- memory management for blobs (like strings, file descriptors, etc) and vectors
- unroll the unnecessary "scm-aware
- Error handling (probably via error continuation?)
- More builtin primitives
- Bootstrap prelude.scm further
- 64-bit support and other portability issues
No idea yet, some code cleanup is due, I guess. After that, memory management improvements, error handling and scheme bootstrapping.
[1] | (1, 2) A Rational Deconstruction of Landin's SECD Machine |
[2] | Wikipedia: Garbage collection (computer science) # Tri-color marking |