Building a Toy Programming Language in Python

I thought it would be fun to go outside of my comfort zone of web development topics and write about something completely different and new, something I have never written about before. So today, I'm going to show you how to implement a programming language!

The project will parse and execute programs written in a simple language I called my (I know it's a lame name, but hey, it is "my" language). The implementation is going to be in Python, without any external dependencies. The my language is simple enough to make this project reasonably short, but also complex enough to make it interesting. If you are only interested in the complete code, you can find it in this GitHub repository. If you want to learn, then read on!

In this first installment of this series, I'll show you how to build a very basic programming language that implements a print statement. Then in the second part I'll extend the language to support variables and mathematical expressions. The end goal is a programming language implementation that can execute programs such as this one:

a = 3
b = 4 + a * 2
print b + 1

Once you learn the techniques involved in bringing my to life, you will be able to extend the language in any way you like and make it yours too. Sound interesting? Let's get started!

The Components of a Programming Language

It is not the point of this article to bore you with a discussion of compiler theory, but at the same time I feel it is important to begin with a basic idea of the project structure. It is certainly possible to start coding away and figure things out intuitively as you go, but given that this is a topic that has been well studied it does not make sense to ignore all the research and knowledge that is available.

So, in very general terms, a programming language implementation is composed of the following phases:

  • Lexical analysis, where the program's source code is read and divided up into individual units called tokens.
  • Syntax analysis, where the sequence of tokens provided by the lexical analyzer (or lexer) is checked against the grammar of the language. The module that performs this task is called the parser.
  • Code generation, where an intermediate code is generated as a lower-level representation of the sequence of tokens parsed from the source file. The module that performs this task is sometimes called the compiler, though this term is often used to refer to this and the previous two phases combined.
  • Execution, where the intermediate code produced by the compiler is read and executed. The module that performs this task is called the interpreter.
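
To make these phases concrete, here is a hypothetical sketch of how they chain together in a traditional implementation. None of these functions exist in this project; they are placeholders that only illustrate the shape of the pipeline.

def lex(source):        # lexical analysis: source text in, tokens out
    ...

def parse(tokens):      # syntax analysis: tokens in, parse tree out
    ...

def generate(tree):     # code generation: tree in, intermediate code out
    ...

def execute(program):   # execution: run the intermediate code
    ...

def run(source):
    return execute(generate(parse(lex(source))))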

Some languages have a more complex structure that includes, for example, one or more code optimization phases after code generation, leading to more efficient code. Others have simplifications, such as a single phase that merges code generation and execution, which means that code is executed on the fly as it is parsed.

Regardless of the exact structure of a language implementation, the important point to remember is that using a well defined structure based on the above phases will help keep the complexity of the project under control.

The "My" Programming Language, Version 0.1

If you are familiar with my tutorials, you know that I like to build projects in small steps. For that reason, the first version of the my language is going to be extremely simple, with just a print statement that prints numbers. Here is an example program written in the first version of my:

print 1
print 52

The print keyword is a reserved word of the language. Numbers in this language are integers. When this program is executed, the expected output should be:

1
52

The documentation that describes the syntax of a language is called the language grammar. This is very important to have, as it serves as a reference when implementing the language. The grammar for this first version of the my language is:

<program> = <statement> [ <statement> ... ]
<statement> = <print_statement> "\n"
<print_statement> = "print" <expression>
<expression> = number

To make sense of this notation, read it from top to bottom. The first rule says that a program in this language is defined as a statement, optionally followed by more statements.

In the second rule, a statement is defined as a print statement, followed by an end of line. The third rule states that a print statement must have the literal word "print" followed by an expression. The fourth and final rule defines an expression as a number. Remember that this is a first attempt, expressions will later be expanded.

The "\n", "print" and number elements used in this grammar are terminal symbols, or tokens. These do not require grammar rules because they are tokens directly produced by the lexer. The remaining symbols, which are shown enclosed in < and >, are defined in terms of other symbols or tokens.

The Lexer

The function of the lexical analyzer (or lexer) is to read the source code of the language and produce a sequence of tokens. In this section I'll show you how to build a lexer for the my language.

Consider once again the example my program:

print 1
print 52

With this input, the lexer should produce the following list of tokens:

  • "print"
  • "number", 1
  • "\n"
  • "print"
  • "number", 52
  • "\n"

The My class shown below accepts the source code of a my program as a string. Its tokens() method returns an iterator that produces tokens from this code, in accordance with the grammar shared above.

class My:
    def __init__(self, code):
        self.code = code
        self.line_nr = 0

    def raise_error(self, message):
        raise ValueError(f'{self.line_nr}: {message}')

    def tokens(self):
        for line in self.code.strip().split('\n'):
            self.line_nr += 1
            for token in line.strip().split(' '):
                if token == 'print':
                    yield (token,)
                elif token.isnumeric():
                    yield ('number', int(token))
                else:
                    self.raise_error(f'Syntax Error: Invalid token {token}')
            yield ('\n',)

Copy the above code to a file named my.py. Before I explain some of the interesting details in this implementation, let's give this lexer a try. Run the following in a Python prompt:

>>> from my import My
>>> p = My('''print 1
... print 52
... ''')
>>> list(p.tokens())
[('print',), ('number', 1), ('\n',), ('print',), ('number', 52), ('\n',)]

This is pretty cool, right? Each token is returned as a tuple. For the "print" and "\n" tokens, the tuples have a single element with the token itself. For the number token, a second element is added with the actual value of the number.

The tokens() method is built as a generator, a special kind of Python function that returns an iterator. The elements returned by the iterator are produced with the yield keyword, inside the body of the function. You can see that the method splits the source code into lines and iterates over them, then for each line it splits on the spaces. Inside this double for-loop, each part is checked to see if it matches the "print" or number tokens. The "\n" token is treated as a special case because it is only valid at the end of the line, so it is added automatically after each outer loop iteration.
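
If generators are new to you, here is a tiny standalone example, unrelated to the my language, that shows how yield produces the elements of an iterator one at a time:

>>> def countdown(n):
...     while n > 0:
...         yield n
...         n -= 1
...
>>> list(countdown(3))
[3, 2, 1]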

What happens when a token does not match the "print" or number tokens defined in the language grammar? That is an error, so for that case the raise_error() method is called. This method raises a ValueError exception with the error message, which is the equivalent of a syntax error being reported in a Python script. Here is another console example where an error is raised due to an invalid token:

>>> p = My('''print 1
... print "foo"
... ''')
>>> list(p.tokens())
Traceback (most recent call last):
...
ValueError: 2: Syntax Error: Invalid token "foo"

Notice the 2: in the error message. This is the line number of the error! The line_nr attribute of the My class counts the lines as they are analyzed, and this makes it possible to report a line number when an error occurs. A more sophisticated lexer could also keep track of column numbers and accurately show the position of the error within the line as well.
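
As an illustration of the idea, and not something the my lexer does, column positions could be recovered with the str.index() method while scanning each line:

>>> line = 'print  52'
>>> col = 0
>>> for token in line.split():
...     col = line.index(token, col)  # column where this token starts
...     print(token, 'starts at column', col)
...     col += len(token)
...
print starts at column 0
52 starts at column 7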

Because the tokens() method returns an iterator, its contents can be printed by converting it to a list, as I did in the Python prompt examples shown above. Another way to use this method, which is more practical for a language parser, is along with Python's next() function. Here is how that works, still in the Python console:

>>> p = My('''print 1
... print 52
... ''')
>>> token_feed = p.tokens()
>>> next(token_feed)
('print',)
>>> next(token_feed)
('number', 1)
>>> next(token_feed)
('\n',)
>>> next(token_feed)
('print',)
>>> next(token_feed)
('number', 52)
>>> next(token_feed)
('\n',)
>>> next(token_feed)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
StopIteration

Here the token_feed variable is assigned the iterator returned by the tokens() method. Then each time the next() function is called with this iterator as an argument, another token is returned. When all the tokens have been returned, the StopIteration exception is raised. This is going to be the basis of how the parser will communicate with the lexer to ask for tokens. The next_token() method of the My class, shown below, will be used for this task:

class My:
    def __init__(self, code):
        self.code = code
        self.line_nr = 0
        self.token_feed = self.tokens()

    # ...

    def next_token(self):
        try:
            token = next(self.token_feed)
        except StopIteration:
            token = None
        return token

In the constructor, the token_feed attribute is initialized with the token iterator. The next_token() method then returns tokens from this iterator one by one. When the end of the token stream is reached, the method catches the StopIteration exception and returns None instead, which the parser will interpret as having reached the end of the code.
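
Once this version of the class is saved to my.py, the None sentinel can be seen in action in a new Python session. Note that the interactive prompt does not echo None, so print() is used to make it visible:

>>> from my import My
>>> p = My('print 1')
>>> p.next_token()
('print',)
>>> p.next_token()
('number', 1)
>>> p.next_token()
('\n',)
>>> print(p.next_token())
None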

To evaluate if the token stream matches a certain rule, it is often useful for the parser to "look ahead" at the next token before it knows whether the rule under consideration is a match or not. The next_token() method shown above returns the next token, but at the same time it consumes it, meaning that once a token has been returned it has to be used or else it is lost. What would be most useful is if the parser could change its mind with regard to a token and return it to the stream, so that another rule in the parser can get it when it calls next_token().

The idea of being able to take the next token and optionally return it to the stream is implemented below in the return_token() method:

class My:
    def __init__(self, code):
        self.code = code
        self.line_nr = 0
        self.token_feed = self.tokens()
        self.returned_token = None

    # ...

    def next_token(self):
        if self.returned_token is not None:
            token = self.returned_token
            self.returned_token = None
        else:
            try:
                token = next(self.token_feed)
            except StopIteration:
                token = None
        return token

    def return_token(self, token):
        if self.returned_token is not None:
            raise RuntimeError('Cannot return more than one token at a time')
        self.returned_token = token

In this version of the My class, a returned_token attribute is initialized to None. If a token provided by next_token() cannot be used, the parser can give it back with a call to return_token(), so that the same token is returned the next time next_token() is called. This creates a complication, because there is no easy way to return an item to an iterator. My implementation puts the returned token in the new returned_token attribute. The next_token() implementation is now expanded to return this token instead of the next one from the iterator. With this little trick the parser can retrieve a token, do some work with it and then change its mind and return it. One side effect of this implementation is that it is only possible to return one token, so a RuntimeError is raised if two tokens are returned in a row.

Save the above changes to the my.py file, and then start a new Python session to play with the new token methods:

>>> from my import My
>>> p = My('''print 1
... print 52
... ''')
>>> p.next_token()
('print',)
>>> p.next_token()
('number', 1)
>>> p.return_token(('number', 1))
>>> p.next_token()
('number', 1)
>>> p.next_token()
('\n',)

Hopefully this example clarifies how the parser is going to manage the token stream.

The Parser

The task of the parser is to match the sequence of tokens to the language grammar. For each rule in the grammar, there is going to be a method that determines if the token stream matches the rule or not. This style of parser, where each grammar rule maps to a function that can call the functions of other rules, is commonly known as a recursive descent parser.

Here is the first rule in the language grammar:

<program> = <statement> [ <statement> ... ]

As a naming convention, I'm going to use the rule names in the parsing methods. The rule is named <program>, so I'm going to call the method that parses it parse_program(). Another convention is that all the methods that parse rules will return True if they were able to match the rule to the incoming tokens, or False if not. A return of False would tell the caller that this rule isn't present in the stream.

The <program> rule is a high-level rule that is defined in terms of other rules. While this may appear to complicate things, it is in fact quite the contrary, because any reference to another rule is handled simply by calling that rule's parsing method.

I think the best way to understand what I mean here is to look at the definition of the parse_program() method:

class My:
    # ...

    def parse_program(self):
        if not self.parse_statement():
            self.raise_error('Expected: statement')
        token = self.next_token()
        while token is not None:
            self.return_token(token)
            if not self.parse_statement():
                self.raise_error('Expected: statement')
            token = self.next_token()
        return True

The <program> rule says that the very first thing to look for is something that matches the <statement> rule. This is implemented by calling a parse_statement() method (which does not exist yet). Like all the methods that implement the parsing of a rule, the return value of the call indicates if the parsing was successful or not. When parsing the <program> rule, a failure to parse a first statement indicates that the program is incorrect, as there are no alternative rules that can be considered instead. For that reason, if the initial statement cannot be parsed, that is considered an error that halts the parsing.

After the first statement, a <program> rule allows one or more optional statements, until the end of the program is reached. This means that after parsing the first statement, the parser for <program> needs to consider one of two possible paths, both valid:

  1. There is another statement in the token stream
  2. There are no more tokens, meaning that the end of the program was reached

To decide which of the two paths is the one that matches the input stream of tokens, the next token is obtained. If this token is not None, that means that there are more statements in the program. In that case, the token is returned to the stream and another call to parse_statement() is made to parse the next statement. Returning the token to the stream is important, because this token needs to be available to the parse_statement() method.

When the end of the program is reached, the next token is going to be None, and this will cause the while-loop to exit. At this point the program has been completely parsed!
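
To make the flow concrete, here is the sequence of calls that parse_program() makes for the two-line example program:

parse_statement()   consumes ('print',) ('number', 1) ('\n',)
next_token()        returns ('print',)   not None, so more statements follow
return_token()      puts ('print',) back for the next rule to consume
parse_statement()   consumes ('print',) ('number', 52) ('\n',)
next_token()        returns None          end of program, the loop exits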

But of course, the parser is still incomplete, as there are three more rules that need to be implemented. Let's look at the second rule:

<statement> = <print_statement> "\n"

Here is the parse_statement() method that implements this rule:

class My:
    # ...

    def parse_statement(self):
        if not self.parse_print_statement():
            self.raise_error('Expected: print statement')
        token = self.next_token()
        if token[0] != '\n':
            self.raise_error('Expected: end of line')
        return True

This actually makes sense, right? A statement is composed of a print statement followed by an end-of-line token. If either of them isn't present, an error is reported. To check for the <print_statement> rule, a call to the parse_print_statement() method is made. The check for the end-of-line token does not require any method calls because it can be done with a simple comparison.

There are two rules left to implement:

<print_statement> = "print" <expression>
<expression> = number

Here are the corresponding parsing methods:

class My:
    # ...

    def parse_print_statement(self):
        token = self.next_token()
        if token[0] != 'print':
            self.return_token(token)
            return False
        if not self.parse_expression():
            self.raise_error('Expected: expression')
        return True

    def parse_expression(self):
        token = self.next_token()
        if token[0] != 'number':
            self.return_token(token)
            return False
        return True

A difference in these two parsing methods is that when they don't recognize the first token they put the token back on the stream and then return False, whereas the previous two rules raised errors. When a rule returns False, it is giving the parent rule the option to attempt to call other rules before giving up with an error. This is a sensible thing to do in lower level rules, because they do not have the context to know if there are other alternatives that the parent rule can explore.
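
To see why this matters, consider a hypothetical future version of the grammar with a second type of statement. The parse_assignment_statement() method below is invented purely for illustration, but it shows how the parent rule could try each alternative in turn before giving up:

class My:
    # ...

    def parse_statement(self):
        # hypothetical: try each statement type in turn; a rule that
        # returned False has already put its token back on the stream,
        # so the next rule sees the stream unchanged
        if not self.parse_print_statement():
            if not self.parse_assignment_statement():
                self.raise_error('Expected: statement')
        token = self.next_token()
        if token[0] != '\n':
            self.raise_error('Expected: end of line')
        return True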

It may sound hard to believe, but this is all there is to the parser module of the my language. To tie the collection of rules together, let's add a run() method that runs the program through the parser:

class My:
    # ...

    def run(self):
        try:
            return self.parse_program()
        except ValueError as exc:
            print(str(exc))
            return False

This method catches the ValueError exception that is raised for various error conditions when the raise_error() method is invoked. When an error occurs, the error message is printed, and then the run() method returns False, to indicate that the program failed.

You can try parsing some simple programs in the Python prompt:

>>> from my import My
>>> p = My('''print 1
... print 52
... ''')
>>> p.run()
True

This returns a simple True, but that means a lot. It means that all the tokens in this program comply with the rules of the grammar of the language. Let's try some invalid programs to see how the output changes:

>>> p = My('''print 1
... print
... ''')
>>> p.run()
2: Expected: expression
False

In this example, I have omitted the argument to print in line 2. The error reports this mistake, and the top-level run() method returns False, to indicate that the program did not parse.

Let's try another one:

>>> p = My('''print 1 2
... ''')
>>> p.run()
1: Expected: end of line
False

For this second error, I added an extra number at the end of the print statement in line 1, and the parser indicates that an end-of-line was expected there. Fantastic!

The Interpreter

The parser for the first version of the language is now complete, so now it is time to evaluate the options for the remaining work:

  • One option is to define a set of low-level coding instructions and then add code generation of these instructions to the parsing methods. Then separately from that, build an interpreter that runs these instructions.
  • A second option is to directly implement the code execution logic within the parsing methods, without generating intermediate instructions.

To keep things simple, I'm going to go with the second option, which works well for smaller languages.

In terms of implementing the program execution for this language, the secret technique that makes it possible is the stack. A stack is a primitive data structure that supports two main operations, push and pop, to add and remove elements respectively. The interesting aspect is that elements are removed in reverse order to how they were added, which is the same as saying that the last element in is the first element out (LIFO). The stack is at the core of pretty much all execution environments for programming languages.
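
Python lists already behave as stacks, with append() acting as push and pop() as pop. A quick console session demonstrates the LIFO ordering:

>>> stack = []
>>> stack.append(1)   # push
>>> stack.append(52)  # push
>>> stack.pop()       # pop: the last element in is the first out
52
>>> stack.pop()
1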

Here is a very simple execution stack implementation for the my language:

class My:
    def __init__(self, code):
        self.code = code
        self.line_nr = 0
        self.token_feed = self.tokens()
        self.returned_token = None
        self.stack = []

    # ...

    def stack_push(self, arg):
        self.stack.append(arg)

    def stack_pop(self):
        return self.stack.pop()

How is the stack used? It's simple. Rules such as <expression>, which have a resulting value after they are parsed, push this result to the stack. Rules such as <print_statement>, which consume a value, pop a value from the stack. Let's implement the stack in these two rules:

class My:
    # ...

    def parse_print_statement(self):
        token = self.next_token()
        if token != ('print',):
            self.return_token(token)
            return False
        if not self.parse_expression():
            self.raise_error('Expected: expression')

        value = self.stack_pop()
        print(value)
        return True

    def parse_expression(self):
        token = self.next_token()
        if token[0] != 'number':
            self.return_token(token)
            return False

        self.stack_push(token[1])
        return True

    # ...

After a successful parse of the <expression> rule, and before returning True, the value associated with the expression is pushed to the stack. For the simplistic expressions in the current version of the language, the number to push is given as the second value in the number token tuple.

The <print_statement> rule, which needs a value to print, pops a value from the stack and prints it. This is all it takes to execute the print statement!
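
Here is the stack activity while the two-line example program runs:

print 1     parse_expression() pushes 1      stack: [1]
            parse_print_statement() pops 1   stack: []    prints 1
print 52    parse_expression() pushes 52     stack: [52]
            parse_print_statement() pops 52  stack: []    prints 52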

In case you are wondering, the LIFO property of the stack prevents values from different statements from ever getting mixed up. Even with complicated mathematical expressions, the stack is able to maintain the running state of the application, with every expression being consumed by the proper owner.

Running Programs Stored in Files

The my language is somewhat useless as it can only print numbers, but it is a very good base to continue building from. Before I end this first part, it would be convenient to make the my language work with programs stored in files on disk. This can be implemented at the bottom of my.py:

import sys

class My:
    # ...

if __name__ == '__main__':
    with open(sys.argv[1], 'rt') as f:
        code = f.read()
    program = My(code)
    program.run()

Now you can create silly programs that print numbers, store them in a file and execute them. For example, write the following program in a file called test.my:

print 1
print 52

Then run the program as follows:

$ python my.py test.my
1
52

Very cool!

In the second part of this tutorial, I'll show you how to extend the my language to support variables and mathematical expressions.
