Posted on April 8, 2015
Notes from UW CSE P 505 – Writing an interpreter
One of the canonical exercises in graduate-level Computer Science Programming Language courses is writing an interpreter for a Scheme-like language, thanks to Scheme’s relatively simple syntax and minimalist design. My UW CSE P 505 class was no exception. Using Haskell as our implementation language, we started off implementing a simple version of the Scheme language and gradually added features to illustrate concepts such as store-passing style, continuation-passing style, and type systems. The way we approached the problem of writing an interpreter was particularly insightful, and the following is a high-level overview.
The key takeaway from the course is to think of the interpreter as a series of mappers between successive layers. Much like a web application ultimately maps from the data access layer all the way up to the presentation layer that presents it to the end user, an interpreter maps from an input string to a core expression tree. By adding multiple intermediate steps along the way, each layer is only responsible for a few specific tasks, making it easier to reason about any given step.
The first task of the interpreter is to tokenize the input. Let’s call this method parseToken. It will take in a string and return a list of tokens. It will strip out all the whitespace and comments until we are left with a sequence of tokens that represent numbers, parentheses, and identifiers. In the Haskell code for the interpreter, the Token type looks something like this:
data Brace = Round   -- ()
           | Square  -- []
           | Curly   -- {}
           deriving (Eq, Show)
data Token = Open Brace      -- An open brace (of the given shape).
           | Close Brace     -- A closing brace (of the given shape).
           | NumTok Integer  -- A numeric literal.
           | IdTok String    -- An identifier.
           deriving (Eq, Show)
Note that at this layer, an ID Token has no semantic meaning. It exists as a catch-all bucket for anything that is neither a brace nor a number. The ID Token could be a keyword, an operator, a variable, or something else entirely, but the tokenizing layer does not care. All it cares about is finding them. Passing in “{(5)}” to the parseToken method returns [Open Curly, Open Round, NumTok 5, Close Round, Close Curly].
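As a rough illustration, a tokenizer along these lines could be sketched as follows. This is not the course's implementation: the `;` line-comment syntax, the helper names isDelimiter and classify, and the decision to treat only runs of digits as numbers (so a negative literal such as -5 would come out as an IdTok) are all assumptions made for this sketch.

```haskell
import Data.Char (isDigit, isSpace)

data Brace = Round | Square | Curly
  deriving (Eq, Show)

data Token = Open Brace | Close Brace | NumTok Integer | IdTok String
  deriving (Eq, Show)

-- Hypothetical sketch: turn a string into a flat list of tokens.
-- Assumes comments start with ';' and run to the end of the line.
parseToken :: String -> [Token]
parseToken [] = []
parseToken (c:cs)
  | isSpace c = parseToken cs                         -- skip whitespace
  | c == ';'  = parseToken (dropWhile (/= '\n') cs)   -- skip a line comment
  | c == '('  = Open Round   : parseToken cs
  | c == ')'  = Close Round  : parseToken cs
  | c == '['  = Open Square  : parseToken cs
  | c == ']'  = Close Square : parseToken cs
  | c == '{'  = Open Curly   : parseToken cs
  | c == '}'  = Close Curly  : parseToken cs
  | otherwise =
      -- Everything up to the next delimiter is one number or identifier.
      let (word, rest) = break isDelimiter (c:cs)
      in classify word : parseToken rest
  where
    isDelimiter x = isSpace x || x `elem` "()[]{};"
    -- A run of digits is a number; anything else lands in the IdTok bucket.
    classify w
      | all isDigit w = NumTok (read w)
      | otherwise     = IdTok w
```

Note how classify has no idea whether an identifier is a keyword, an operator, or a variable; that decision is deferred to the later layers.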
The subsequent task is to convert the list of tokens into an SExpression tree. Let’s call this method parseSExp. It will take a list of tokens as input and return a single SExp and any leftover tokens not consumed in the parsing (more on this later). This can be visualized as a tree with the single SExp being the root. It may also be helpful to think of the top level SExpression as containing a nested list of SExpressions as well.
data SExp = NumS Integer  -- numeric expression
          | IdS String    -- identifier
          | ListS [SExp]  -- list of S-expressions
          deriving Eq
Because this layer deals with SExpressions, we create new data types for it. Although at first glance IdS and NumS seem redundant with IdTok and NumTok, it’s important for each layer to have its own types so that in the mapping code it is always clear which layer we are mapping to and from. Converting open and close brace tokens into the newly introduced ListS type is the main responsibility of this layer. An open brace token denotes the start of a new list, while a close brace token denotes the end of one. This means the SExp parser needs to make sure the braces are balanced correctly. parseSExp reads exactly one SExpression, so if the input starts with a number or identifier outside any pair of braces, that single token is the whole result. Any tokens not consumed in the parsing are left over (and can optionally be returned in the output from parseSExp).
If we run parseSExp on the list of tokens corresponding to “(foo (4 bar))”:
[Open Round, IdTok "foo", Open Round, NumTok 4, IdTok "bar", Close Round, Close Round]
We get:
ListS [IdS "foo", ListS [NumS 4, IdS "bar"]]
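To make the brace matching concrete, here is one way parseSExp might be written. It is only a sketch: using Either String for failures, the helper parseList, the error messages, and the extra Show instance on SExp are assumptions of this example, not details from the course code.

```haskell
data Brace = Round | Square | Curly
  deriving (Eq, Show)

data Token = Open Brace | Close Brace | NumTok Integer | IdTok String
  deriving (Eq, Show)

data SExp = NumS Integer | IdS String | ListS [SExp]
  deriving (Eq, Show)  -- Show added here only to make the sketch easy to inspect

-- Hypothetical sketch: parse exactly one S-expression and return it
-- together with whatever tokens were not consumed.
parseSExp :: [Token] -> Either String (SExp, [Token])
parseSExp []                = Left "unexpected end of input"
parseSExp (NumTok n : rest) = Right (NumS n, rest)
parseSExp (IdTok s : rest)  = Right (IdS s, rest)
parseSExp (Open b : rest)   = parseList b [] rest
parseSExp (Close _ : _)     = Left "unexpected closing brace"

-- Collect S-expressions until a closing brace of the expected shape appears.
parseList :: Brace -> [SExp] -> [Token] -> Either String (SExp, [Token])
parseList _ _ [] = Left "missing closing brace"
parseList b acc (Close b' : rest)
  | b == b'   = Right (ListS (reverse acc), rest)
  | otherwise = Left "mismatched braces"
parseList b acc toks = do
  (e, rest) <- parseSExp toks   -- parse one element, then keep going
  parseList b (e : acc) rest
```

Returning the leftover tokens is what lets parseList call parseSExp repeatedly: each call consumes one element and hands back the rest of the stream.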
Again, this layer is not concerned with the language’s syntax, just with mapping brace tokens into lists of SExpressions and checking that the braces match. It is in the subsequent step that we map the nested list of SExpressions into an expression tree. Let’s call this method parseExpr. Assuming no syntax errors (which would cause the method to fail accordingly), the output from parseExpr represents a valid expression in our language that can be interpreted. This does not necessarily mean, however, that interpreting it will not result in runtime errors. For example, the expression (+ (* 2 3) (+ 1 2)) is represented as a tree with root “+” and two subtrees with * and + as their roots, respectively. Representing the expression as a tree means that we can write our subsequent parse and interpret methods recursively, simplifying the logic.
For simplicity’s sake, let’s say that our language supports if statements, functions, apply statements, and with* statements (a binding expression that binds a variable to a function or some value):
data Expr = NumE Integer
          | IfE Expr Expr Expr
          | VarE Var
          | FunE [Var] Expr
          | AppE [Expr]
          | WithStarE [(Var, Expr)] Expr
          deriving (Eq, Show)
FunE, AppE, and WithStarE all take a list of arguments. This makes writing an interpreter for the expression more complex than necessary. To simplify things further, we want to desugar our language into a core syntax. We can get rid of WithStarE, which is just sugar for applying a function to the bound value. Likewise, by utilizing currying, we can restrict functions and apply statements to exactly one argument. Any function or application with more than one argument desugars into nested one-argument functions and partial applications. The core expression would look something like:
data CExpr = NumC Integer
           | IfC CExpr CExpr CExpr
           | VarC Var
           | FunC Var CExpr
           | AppC CExpr CExpr
           deriving (Eq, Show)
Here are some examples of desugaring. The following with* expression
(with* ([x (+ 1 2)]
        [y (* x x)]
        [x (+ y 3)])
  (+ x y))
Would desugar into something that looks like this.
((fun (x)
   ((fun (y)
      ((fun (x)
         (+ x y))
       (+ y 3)))
    (* x x)))
 (+ 1 2))
In terms of actual CExpr, it’d parse into:
(AppC (FunC "x" (AppC (FunC "y" (AppC (FunC "x" (AppC (AppC (VarC "+") (VarC "x")) (VarC "y"))) (AppC (AppC (VarC "+") (VarC "y")) (NumC 3)))) (AppC (AppC (VarC "*") (VarC "x")) (VarC "x")))) (AppC (AppC (VarC "+") (NumC 1)) (NumC 2)))
Here’s a simpler example:
((fun (x) (* x x)) (+ 1 2))
Desugars into
AppC (FunC "x" (AppC (AppC (VarC "*") (VarC "x")) (VarC "x")))
     (AppC (AppC (VarC "+") (NumC 1)) (NumC 2))
This concept of desugaring is a useful pattern to have in the toolbox: Whenever the logic for parsing or interpreting an expression becomes too unwieldy, mapping it to a simpler intermediate language with more rigid constraints helps to simplify things.
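The desugaring rules described above can be sketched as a single recursive function. Treat this as an illustration under stated assumptions rather than the course's code: the name desugar, the definition of Var as a String, and the handling of edge cases (an empty application is rejected with error, and a function with an empty parameter list simply collapses to its body) are choices made for this sketch.

```haskell
type Var = String  -- assumption: variables are plain strings

data Expr = NumE Integer
          | IfE Expr Expr Expr
          | VarE Var
          | FunE [Var] Expr
          | AppE [Expr]
          | WithStarE [(Var, Expr)] Expr
          deriving (Eq, Show)

data CExpr = NumC Integer
           | IfC CExpr CExpr CExpr
           | VarC Var
           | FunC Var CExpr
           | AppC CExpr CExpr
           deriving (Eq, Show)

-- Hypothetical sketch: map the surface syntax onto the core syntax.
desugar :: Expr -> CExpr
desugar (NumE n)    = NumC n
desugar (VarE x)    = VarC x
desugar (IfE c t e) = IfC (desugar c) (desugar t) (desugar e)
-- (fun (x y) body) becomes (fun (x) (fun (y) body)).
desugar (FunE vars body) = foldr FunC (desugar body) vars
-- (f a b) becomes ((f a) b): applications nest to the left.
desugar (AppE (f : args)) = foldl AppC (desugar f) (map desugar args)
desugar (AppE [])         = error "desugar: empty application"
-- Each with* binding becomes a one-parameter function applied immediately,
-- so later bindings can see (and shadow) earlier ones.
desugar (WithStarE binds body) = foldr wrap (desugar body) binds
  where
    wrap (x, e) inner = AppC (FunC x inner) (desugar e)
```

Using foldr for parameters and bindings but foldl for arguments mirrors the shape of the examples: function bodies nest to the right, while curried applications nest to the left.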
The final step in the interpreter, of course, is to actually interpret the expression. This maps the desugared CExpr into a Value type. Again, because our CExpr is a tree, we can call our interp method starting from the top level of the tree. If there are no runtime errors, this reduces to a final value:
data Val = NumV Integer
         | BoolV Bool
         | FunV Var CExpr Env
         | PrimV String (Val -> Result Val)
The PrimV type here is of interest, as it represents the primitive operators in our language. Its first argument is a string naming the operator (e.g. “+” for addition), and its second argument is a function that takes a Val and returns a Result Val. In order for our interp method to succeed, we pass in an initial environment that contains all the predefined primitives. Note that our desugared CExpressions have no notion of a Boolean expression, because true and false are predefined in this initial environment instead. This environment grows during interpretation as we encounter additional variables and their bindings (via function application expressions). When interp encounters a function application, it binds the function’s parameter (the desugared CExpr guarantees only one per function) to the argument’s value in the environment. When it encounters a variable, it looks the variable up in the environment; if it does not exist, the result is an error.
To give you an idea of what the interp code looks like, the case for an IfC expression would look something like this:
case expr of
  IfC cond b1 b2 -> case interp cond env of
    Ok (BoolV True) -> interp b1 env
    Ok (BoolV False) -> interp b2 env
    _ -> Err "Conditional does not evaluate to a boolean"
  -- … other cases here
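For a sense of how the remaining cases might fit around it, here is a compact sketch of a whole interp function. Everything beyond the IfC case is an assumption of this example: the Result and Env definitions, the error messages, the names addPrim and initialEnv, the curried encoding of “+”, and binding true and false directly to BoolV values are illustrative guesses, not the course's code.

```haskell
type Var = String

data CExpr = NumC Integer | IfC CExpr CExpr CExpr | VarC Var
           | FunC Var CExpr | AppC CExpr CExpr
           deriving (Eq, Show)

data Result a = Ok a | Err String   -- assumed shape of the Result type
type Env = [(Var, Val)]             -- assumed: a simple association list

data Val = NumV Integer
         | BoolV Bool
         | FunV Var CExpr Env
         | PrimV String (Val -> Result Val)

interp :: CExpr -> Env -> Result Val
interp expr env = case expr of
  NumC n -> Ok (NumV n)
  VarC x -> case lookup x env of
    Just v  -> Ok v
    Nothing -> Err ("Unbound variable: " ++ x)
  FunC x body -> Ok (FunV x body env)       -- capture the defining environment
  IfC cond b1 b2 -> case interp cond env of
    Ok (BoolV True)  -> interp b1 env
    Ok (BoolV False) -> interp b2 env
    _ -> Err "Conditional does not evaluate to a boolean"
  AppC f a -> case interp f env of
    Err msg -> Err msg
    Ok fv -> case interp a env of
      Err msg -> Err msg
      Ok av -> apply fv av
  where
    -- Extend the closure's environment with the single parameter binding.
    apply (FunV x body closEnv) v = interp body ((x, v) : closEnv)
    apply (PrimV _ op) v          = op v
    apply _ _                     = Err "Application of a non-function"

-- A two-argument primitive is curried: the first application returns
-- another PrimV that waits for the second argument.
addPrim :: Val
addPrim = PrimV "+" (\a -> Ok (PrimV "+" (\b -> add a b)))
  where
    add (NumV x) (NumV y) = Ok (NumV (x + y))
    add _ _               = Err "+ expects two numbers"

initialEnv :: Env
initialEnv = [("+", addPrim), ("true", BoolV True), ("false", BoolV False)]
```

With these definitions, interpreting the desugared form of ((fun (x) (+ x x)) 2) against initialEnv walks through the AppC, FunC, VarC, and PrimV paths in turn.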
A complete treatment of how to represent addition, subtraction, comparison operators and the like as a PrimV in Haskell is beyond the scope of this article, as is a full walkthrough of the parsing and interpretation methods. As the popular saying goes, these are left as an exercise for the reader (or perhaps explained in a subsequent tutorial).