Compiler From Scratch - Part 2
Lexing
Lexing is the process of producing tokens from an input to be later ingested by a parser.
Consider some textual input like:
1 + 2
We have potential for quite a few tokens here, depending on how you create your lexer. For example:
NUMBER- number tokens could represent any digit, or integer, or even floating points1,2
+- plus token would just represent the+symbol- white space for anything separating the other tokens
You may even need tokens to represent new lines if your language has semantic meaning behind the
new lines. This is only one possible way of tokenizing this stream. You might actually say there is
no + token, instead opting for an OP token and suggesting that any of +, -, *, etc would
produce and OP token. If you were doing that though, you would need to pass back either a literal
association of that OP token to an associated operator, or a string literal to identify the operator.
Otherwise, you’d be losing semantic meaning of the token when you come to parsing. i.e, “What the fuck
do I do with OP?” as opposed to “Ohh, OP(+) means a plus operation”.
Semantic meaningis effectively the answer to the question “What do I do with ‘blah’?” when it comes to parsing, etc.
The lexer you design should be able to pass back to the caller as much semantic value as possible per
token. In some designs, you would attach a line and column value to the token, so if errors happen,
you can trace back which token the error was emitted in relation to.
When tokens have been produced by a lexer, for example NUMBER OP(+) NUMBER, then the parser will be
able to go, “number, operation, number” and realise it’s going to need to perform an operation, double
check which operation, “gotta do a plus here”, then create a representation of this fact. A parser
will do nothing except bundle this knowledge in a way that can be interpretted efficiently in a
later stage. This will be the creation of an Abstract Syntax Tree (AST). We will go more into detail
about this another time.
There are quite a few options for lexing. Simplest is to have a cursor move through, one character at a time, determining what’s at the cursor, working out whether a full token has been seen, what token is it, and then emitting it. Another way is to just use regular expressions and rules for each token, and attempt to find tokens using those regex rules as you move through the stream. We need a way to represent the tokens we find, in any case.
In Golang, I would opt for this kind of structure initially to represent tokens. It uses iota and typed
the token kinds as uint so it’s easiy to initialize a bunch of them as constants.
type TokenKind uint
const (
TKEof TokenKind = iota
TKUnknown
TKInteger
TKAdd
)
type Token struct {
Kind TokenKind
Literal string
}
The Token struct is super simple, just a Kind to represent a token kind, and the actual string
value from lexing in Literal. For now, we only need 3 tokens to tokenize a language that only
consists of addition operations on integers. We also include an eof token and unknown token to
handle the cases of when we read to the end of an input and for when we don’t know what the fuck
was in the input stream.
For lexing our language so far, we don’t care about whitespace, so there are no tokens for it and we will be ignoring all whitespace. So first, a small helper function:
func isWhitespace(b byte) bool {
return b == ' ' || b == '\t' || b == '\n' || b == '\r'
}
We also need a way to determine if a character we are reading is a digit or not:
func isDigit(b byte) bool {
return b >= '0' && b <= '9'
}
With this in place, we can actually write a lexer. There are multiple ways to write lexers, and this is just an iterative lexer using a system of byte reading and unreading. Realistically, to continue this process, especially in Go, you would want to use runes, and actually peek the input rather. But I don’t intend to extend this method because it becomes tedious. This is a quick example before diving into something more extensible and maintainable.
type Lexer struct {
input *bufio.Reader
finished bool
}
func NewLexer(input io.Reader) *Lexer {
return &Lexer{
input: bufio.NewReader(input),
finished: false,
}
}
func (l *Lexer) Next() Token {
// If lexer has finished, always pass back an EOF
if l.finished {
return NewToken(TKEof, "")
}
// Read a byte off the input. On any error, we assume EOF for simplicity
b, err := l.input.ReadByte()
if err != nil {
return NewToken(TKEof, "")
}
// Skip all whitespace characters
if isWhitespace(b) {
return l.Next()
}
// Check if it's a '+' operator
if b == '+' {
return NewToken(TKAdd, string(b))
}
// Consume a number, and set `err` if EOF'd at the end
if isDigit(b) {
number, err := l.consumeNumber(b)
if err != nil {
l.finished = true
}
return NewToken(TKInteger, number)
}
return NewToken(TKUnknown, string(b))
}
The lexer in this case is rather simple, utilizing Golang’s bufio.Reader for byte reading. If the
lexer has finished, it must return EOF, otherwise, skip all whitespace and check whether the token
is a known operator, a number, or if nothing else, an unknown character.
To consume a number, we must loop through the next set of bytes though until we see something that
is not a digit:
func (l *Lexer) consumeNumber(b byte) (string, error) {
result := strings.Builder{}
result.WriteByte(b)
for {
b, err := l.input.ReadByte()
if err != nil {
return result.String(), err
}
if !isDigit(b) {
break
}
result.WriteByte(b)
}
l.input.UnreadByte()
return result.String(), nil
}
We can test this simply:
func main() {
input := "1 + 2 + 1337 bob"
lexer := NewLexer(strings.NewReader(input))
for {
next := lexer.Next()
fmt.Println(next)
if next.Kind == TKEof {
break
}
}
}
And we end up with this result:
{2 1}
{3 +}
{2 2}
{3 +}
{2 1337}
{1 b}
{1 o}
{1 b}
{0 }
In the output, we can see the 1 + 2 + 1337 easily enough with associated numerical value
of their TokenKind. We also see 0 for EOF and 1 (meaning Unknown) for the textual characters from
bob at the end. So we are lexing everything as hoped.
Although this is a doable strategy for all lexing, you can see that it would become quite tedious
and potentially quite nested as you add more types and operators to check for. For more operators,
which are presumable 2 bytes or even 3 bytes long, you need at least that depth of nesting to
handle them. Then, for more complex types such as strings, and floats, you would need more methods
to consume those types. Then need a map for all the keywords and check whether the string you have
is now a keyword, then switch to produce the correct token for each of those. The lexer suddenly
starts growing a bit tedious and a bit awkward to edit. Or at least, I think so and I dislike it.
The above code is something I would never actually use for lexing; I prefer the next approach which
is to use regex and match the front of the input stream to a token defined by a regular expression.
For a simple demonstration, we’ll use the same input and this code which we’ll just fmt.Println to
the screen to examine:
func isNumber(input string) (int64, bool) {
reggie := regexp.MustCompile("^[0-9]+")
number := reggie.FindString(input)
if number == "" {
return 0, false
}
actualNumber, err := strconv.ParseInt(number, 10, 64)
if err != nil {
return 0, false
}
return actualNumber, true
}
And this prints:
1 true
to the screen indicating we used regex to pull the first 1 or number from the input. So the goal
for a regex based lexer is to create rules matching regex to tokens and for the whole input, keep
taking the matched tokens off the front, and tokenizing the whole string based on those rules. Note
that the regular expression used contained ^ at the front, because it’s important we grab tokens
only at the front of the stream; lexing out of order would be an annoying fucky wucky that would be
impossible to parse later.
Something I like to do for this kind of scrolling window view into a stream is to create a scrolling
buffer. It allows us to interogate what’s in it’s window, as well as scroll through it N bytes. This
isn’t super important. However, I don’t necessarily want to read entire files into memory, or keep
reading byte buffers and parsing tokens from them, working out which bytes weren’t part of the token
and then putting them back, etc. A scroll buffer does this in a way, but abstracts it from the
lexer and is a neat tool in general. The caveat is that the scroll buffer needs to have a window
size that it looks at to build tokens and this can have problems if your tokens are capturing very
large strings. But we will solve that issue when we get to it. For now, the ScrollBuffer:
type ScrollBuffer struct {
Buffer []byte
size int
r io.Reader
}
func NewScrollBuffer(size int, r io.Reader) *ScrollBuffer {
scroller := &ScrollBuffer{
size: size,
r: r,
Buffer: make([]byte, size),
}
scroller.Scroll(size)
return scroller
}
func (sb *ScrollBuffer) Scroll(n int) {
remainder := sb.Buffer[n:]
incoming := make([]byte, n)
io.ReadFull(sb.r, incoming)
sb.Buffer = append(remainder, incoming...)
}
func (sb *ScrollBuffer) Zeroed() bool {
for _, b := range sb.Buffer {
if b != 0 {
return false
}
}
return true
}
It’s pretty simple but it does have artificial constraints it confers onto the lexer, but we will redesign this in the future as issues pop up. One thing to keep in mind though, is that the scroll buffer completely disregards internal errors of the reader and will scroll indefinitely, always scrolling through with a window filled with zeroes. It is the caller’s responsibility to check this and stop using the scroller at that point.
Now we need to improve the lexer to handle regex matching using a scroll buffer. Firstly, we need a way to represent rules:
type LexerRule struct {
reggie *regexp.Regexp
kind TokenKind
disregard bool
}
func NewLexerRule(expr string, kind TokenKind, disregard bool) LexerRule {
return LexerRule{
reggie: regexp.MustCompile(expr),
kind: kind,
disregard: disregard,
}
}
var ShitLangLexerRules = []LexerRule{
NewLexerRule("^[0-9]+", TKInteger, false),
NewLexerRule("^\\+", TKAdd, false),
NewLexerRule("^[ \t\r\n]+", TKWhitespace, true),
}
I’ve gone and added some simple rules to the list, and a new Token for whitespace, TKWhitespace.
I use MustCompile since the app should shit itself if the regular expression provided is worthless.
The rules simply match the regular expression to the token kind it should produce, and for added
benefit, we have a disregard flag. This is used for whitespace since we know what it is, and we
need to lex through it, but we don’t want whitespace tokens for parsing (but YOU might if you want
a indentation based language, or other whitespace is important in your compiler). Whitespace is
neither unknown or an illegal token to have, so we just ignore them when they come up.
Now, our lexer is a lot simpler and won’t grow much more:
package main
import (
"io"
"regexp"
)
type Lexer struct {
scroller *ScrollBuffer
rules []LexerRule
}
const BufSize = 1024
func NewLexer(input io.Reader, rules []LexerRule) *Lexer {
return &Lexer{
scroller: NewScrollBuffer(BufSize, input),
rules: rules,
}
}
func (l *Lexer) Next() Token {
if l.scroller.Zeroed() {
return NewToken(TKEof, "")
}
for _, rule := range l.rules {
match := rule.reggie.Find(l.scroller.Buffer)
if match != nil {
l.scroller.Scroll(len(match))
if rule.disregard {
return l.Next()
}
return NewToken(rule.kind, string(match))
}
}
unknownToken := NewToken(TKUnknown, string(l.scroller.Buffer[0]))
l.scroller.Scroll(1)
return unknownToken
}
We create a new lexer with a ScrollBuffer internal, and it has a window size of 1024 which is
almost completely arbitrary. Each time we call Next() on the lexer, it checks if the scroll
window is zero, in which case, we should produce an EOF, otherwise, loop through all the lexer
rules and attempt to match and create tokens. If we see one that we don’t care about, i.e. disregard
we just recursively call Next() and move along. If we fail to create any tokens from any rules,
we must produce the unknown token.
To me, this seems a lot simpler and easier to extend long term. For the simple case, new tokens
just means adding a NewLexerRule to the rule set we parse to the lexer. Now when we run this, we get:
{3 1}
{4 +}
{3 2}
{4 +}
{3 1337}
{2 b}
{2 o}
{2 b}
{0 }
This is great for now. Lexing numbers, +, and throwing TKUnknown for textual chars, and finally
producing an EOF at the end. Best yet, no whitespace.
This is all we need for a simple lexer capable of producing tokens for a calculator type compiler that can only do addition operations and shit itself at anything else. And that’s just what we’re going to build right now, and then expand on it later.
Keep in mind that we’re losing information for error handling in the compiler by not including line numbers or columns, and that’s a conscious decision right now. It is trivial to add this tracking in, but I stray from complexity initially, and would rather tackle error handling in a unified way later.