Compiler From Scratch - Part 2

Lexing

Lexing is the process of producing tokens from an input to be later ingested by a parser. Consider some textual input like:

1 + 2

We have potential for quite a few tokens here, depending on how you create your lexer. For example:

NUMBER - number tokens could represent any digit, or integer, or even floating points
- 1, 2
+ - plus token would just represent the + symbol
- white space for anything separating the other tokens

You may even need tokens to represent new lines if your language has semantic meaning behind the new lines. This is only one possible way of tokenizing this stream. You might actually say there is no + token, instead opting for an OP token and suggesting that any of +, -, *, etc would produce and OP token. If you were doing that though, you would need to pass back either a literal association of that OP token to an associated operator, or a string literal to identify the operator. Otherwise, you’d be losing semantic meaning of the token when you come to parsing. i.e, “What the fuck do I do with OP?” as opposed to “Ohh, OP(+) means a plus operation”.

Semantic meaning is effectively the answer to the question “What do I do with ‘blah’?” when it comes to parsing, etc.

The lexer you design should be able to pass back to the caller as much semantic value as possible per token. In some designs, you would attach a line and column value to the token, so if errors happen, you can trace back which token the error was emitted in relation to.

When tokens have been produced by a lexer, for example NUMBER OP(+) NUMBER, then the parser will be able to go, “number, operation, number” and realise it’s going to need to perform an operation, double check which operation, “gotta do a plus here”, then create a representation of this fact. A parser will do nothing except bundle this knowledge in a way that can be interpretted efficiently in a later stage. This will be the creation of an Abstract Syntax Tree (AST). We will go more into detail about this another time.

There are quite a few options for lexing. Simplest is to have a cursor move through, one character at a time, determining what’s at the cursor, working out whether a full token has been seen, what token is it, and then emitting it. Another way is to just use regular expressions and rules for each token, and attempt to find tokens using those regex rules as you move through the stream. We need a way to represent the tokens we find, in any case.

In Golang, I would opt for this kind of structure initially to represent tokens. It uses iota and typed the token kinds as uint so it’s easiy to initialize a bunch of them as constants.

type TokenKind uint

const (
	TKEof TokenKind = iota
	TKUnknown
	TKInteger
	TKAdd
)

type Token struct {
	Kind    TokenKind
	Literal string
}

The Token struct is super simple, just a Kind to represent a token kind, and the actual string value from lexing in Literal. For now, we only need 3 tokens to tokenize a language that only consists of addition operations on integers. We also include an eof token and unknown token to handle the cases of when we read to the end of an input and for when we don’t know what the fuck was in the input stream.

For lexing our language so far, we don’t care about whitespace, so there are no tokens for it and we will be ignoring all whitespace. So first, a small helper function:

func isWhitespace(b byte) bool {
	return b == ' ' || b == '\t' || b == '\n' || b == '\r'
}

We also need a way to determine if a character we are reading is a digit or not:

func isDigit(b byte) bool {
	return b >= '0' && b <= '9'
}

With this in place, we can actually write a lexer. There are multiple ways to write lexers, and this is just an iterative lexer using a system of byte reading and unreading. Realistically, to continue this process, especially in Go, you would want to use runes, and actually peek the input rather. But I don’t intend to extend this method because it becomes tedious. This is a quick example before diving into something more extensible and maintainable.

type Lexer struct {
	input    *bufio.Reader
	finished bool
}

func NewLexer(input io.Reader) *Lexer {
	return &Lexer{
		input:    bufio.NewReader(input),
		finished: false,
	}
}

func (l *Lexer) Next() Token {
  // If lexer has finished, always pass back an EOF
	if l.finished {
		return NewToken(TKEof, "")
	}

  // Read a byte off the input. On any error, we assume EOF for simplicity
	b, err := l.input.ReadByte()
	if err != nil {
		return NewToken(TKEof, "")
	}

	// Skip all whitespace characters
	if isWhitespace(b) {
		return l.Next()
	}

	// Check if it's a '+' operator
	if b == '+' {
		return NewToken(TKAdd, string(b))
	}

  // Consume a number, and set `err` if EOF'd at the end
  if isDigit(b) {
	  number, err := l.consumeNumber(b)
	  if err != nil {
		  l.finished = true
	  }

	  return NewToken(TKInteger, number)
	}

	return NewToken(TKUnknown, string(b))
}

The lexer in this case is rather simple, utilizing Golang’s bufio.Reader for byte reading. If the lexer has finished, it must return EOF, otherwise, skip all whitespace and check whether the token is a known operator, a number, or if nothing else, an unknown character. To consume a number, we must loop through the next set of bytes though until we see something that is not a digit:

func (l *Lexer) consumeNumber(b byte) (string, error) {
	result := strings.Builder{}
	result.WriteByte(b)

	for {
		b, err := l.input.ReadByte()
		if err != nil {
			return result.String(), err
		}

		if !isDigit(b) {
			break
		}

		result.WriteByte(b)
	}

	l.input.UnreadByte()
	return result.String(), nil
}

We can test this simply:

func main() {
	input := "1 + 2 + 1337 bob"
	lexer := NewLexer(strings.NewReader(input))
	for {
		next := lexer.Next()
		fmt.Println(next)
		if next.Kind == TKEof {
			break
		}
	}
}

And we end up with this result:

{2 1}
{3 +}
{2 2}
{3 +}
{2 1337}
{1 b}
{1 o}
{1 b}
{0 }

In the output, we can see the 1 + 2 + 1337 easily enough with associated numerical value of their TokenKind. We also see 0 for EOF and 1 (meaning Unknown) for the textual characters from bob at the end. So we are lexing everything as hoped. Although this is a doable strategy for all lexing, you can see that it would become quite tedious and potentially quite nested as you add more types and operators to check for. For more operators, which are presumable 2 bytes or even 3 bytes long, you need at least that depth of nesting to handle them. Then, for more complex types such as strings, and floats, you would need more methods to consume those types. Then need a map for all the keywords and check whether the string you have is now a keyword, then switch to produce the correct token for each of those. The lexer suddenly starts growing a bit tedious and a bit awkward to edit. Or at least, I think so and I dislike it.

The above code is something I would never actually use for lexing; I prefer the next approach which is to use regex and match the front of the input stream to a token defined by a regular expression. For a simple demonstration, we’ll use the same input and this code which we’ll just fmt.Println to the screen to examine:

func isNumber(input string) (int64, bool) {
	reggie := regexp.MustCompile("^[0-9]+")

	number := reggie.FindString(input)
	if number == "" {
		return 0, false
	}

	actualNumber, err := strconv.ParseInt(number, 10, 64)
	if err != nil {
		return 0, false
	}

	return actualNumber, true
}

And this prints:

1 true

to the screen indicating we used regex to pull the first 1 or number from the input. So the goal for a regex based lexer is to create rules matching regex to tokens and for the whole input, keep taking the matched tokens off the front, and tokenizing the whole string based on those rules. Note that the regular expression used contained ^ at the front, because it’s important we grab tokens only at the front of the stream; lexing out of order would be an annoying fucky wucky that would be impossible to parse later.

Something I like to do for this kind of scrolling window view into a stream is to create a scrolling buffer. It allows us to interogate what’s in it’s window, as well as scroll through it N bytes. This isn’t super important. However, I don’t necessarily want to read entire files into memory, or keep reading byte buffers and parsing tokens from them, working out which bytes weren’t part of the token and then putting them back, etc. A scroll buffer does this in a way, but abstracts it from the lexer and is a neat tool in general. The caveat is that the scroll buffer needs to have a window size that it looks at to build tokens and this can have problems if your tokens are capturing very large strings. But we will solve that issue when we get to it. For now, the ScrollBuffer:

type ScrollBuffer struct {
	Buffer []byte
	size   int
	r      io.Reader
}

func NewScrollBuffer(size int, r io.Reader) *ScrollBuffer {
	scroller := &ScrollBuffer{
		size:   size,
		r:      r,
		Buffer: make([]byte, size),
	}
	scroller.Scroll(size)
	return scroller
}

func (sb *ScrollBuffer) Scroll(n int) {
	remainder := sb.Buffer[n:]
	incoming := make([]byte, n)
	io.ReadFull(sb.r, incoming)
	sb.Buffer = append(remainder, incoming...)
}

func (sb *ScrollBuffer) Zeroed() bool {
	for _, b := range sb.Buffer {
		if b != 0 {
			return false
		}
	}
	return true
}

It’s pretty simple but it does have artificial constraints it confers onto the lexer, but we will redesign this in the future as issues pop up. One thing to keep in mind though, is that the scroll buffer completely disregards internal errors of the reader and will scroll indefinitely, always scrolling through with a window filled with zeroes. It is the caller’s responsibility to check this and stop using the scroller at that point.

Now we need to improve the lexer to handle regex matching using a scroll buffer. Firstly, we need a way to represent rules:

type LexerRule struct {
	reggie *regexp.Regexp
	kind   TokenKind
	disregard bool
}

func NewLexerRule(expr string, kind TokenKind, disregard bool) LexerRule {
	return LexerRule{
		reggie: regexp.MustCompile(expr),
		kind:   kind,
		disregard: disregard,
	}
}

var ShitLangLexerRules = []LexerRule{
	NewLexerRule("^[0-9]+", TKInteger, false),
	NewLexerRule("^\\+", TKAdd, false),
	NewLexerRule("^[ \t\r\n]+", TKWhitespace, true),
}

I’ve gone and added some simple rules to the list, and a new Token for whitespace, TKWhitespace. I use MustCompile since the app should shit itself if the regular expression provided is worthless. The rules simply match the regular expression to the token kind it should produce, and for added benefit, we have a disregard flag. This is used for whitespace since we know what it is, and we need to lex through it, but we don’t want whitespace tokens for parsing (but YOU might if you want a indentation based language, or other whitespace is important in your compiler). Whitespace is neither unknown or an illegal token to have, so we just ignore them when they come up.

Now, our lexer is a lot simpler and won’t grow much more:

package main

import (
	"io"
	"regexp"
)

type Lexer struct {
	scroller *ScrollBuffer
	rules    []LexerRule
}

const BufSize = 1024 

func NewLexer(input io.Reader, rules []LexerRule) *Lexer {
	return &Lexer{
		scroller: NewScrollBuffer(BufSize, input),
		rules:    rules,
	}
}

func (l *Lexer) Next() Token {
	if l.scroller.Zeroed() {
		return NewToken(TKEof, "")
	}

	for _, rule := range l.rules {
		match := rule.reggie.Find(l.scroller.Buffer)
		if match != nil {
			l.scroller.Scroll(len(match))
			if rule.disregard {
				return l.Next()
			}

			return NewToken(rule.kind, string(match))
		}
	}
	unknownToken := NewToken(TKUnknown, string(l.scroller.Buffer[0]))
	l.scroller.Scroll(1)
	return unknownToken
}

We create a new lexer with a ScrollBuffer internal, and it has a window size of 1024 which is almost completely arbitrary. Each time we call Next() on the lexer, it checks if the scroll window is zero, in which case, we should produce an EOF, otherwise, loop through all the lexer rules and attempt to match and create tokens. If we see one that we don’t care about, i.e. disregard we just recursively call Next() and move along. If we fail to create any tokens from any rules, we must produce the unknown token.

To me, this seems a lot simpler and easier to extend long term. For the simple case, new tokens just means adding a NewLexerRule to the rule set we parse to the lexer. Now when we run this, we get:

{3 1}
{4 +}
{3 2}
{4 +}
{3 1337}
{2 b}
{2 o}
{2 b}
{0 }

This is great for now. Lexing numbers, +, and throwing TKUnknown for textual chars, and finally producing an EOF at the end. Best yet, no whitespace.

This is all we need for a simple lexer capable of producing tokens for a calculator type compiler that can only do addition operations and shit itself at anything else. And that’s just what we’re going to build right now, and then expand on it later.

Keep in mind that we’re losing information for error handling in the compiler by not including line numbers or columns, and that’s a conscious decision right now. It is trivial to add this tracking in, but I stray from complexity initially, and would rather tackle error handling in a unified way later.