Building a Toy Programming Language in Python, Part 2

Posted by
on under

Welcome to the second part of my tutorial on creating a toy language I called my in Python. In the first part I showed how how to implement the lexer, parser and interpreter for a very simple version of the language that did only one thing: print numbers.

With the first phase of the project completed, and with a solid base to start building from, I'll show you how to extend the language in a few ways. The purpose of this is to demonstrate how to modify this language, so that you can then build your own extensions. If you need a reference for the code, this GitHub repository has the working code that you can try.

The "My" Programming Language, Version 0.5

Let's go back to the language grammar and make some improvements, so that the my language can do some more interesting things. In this iteration I'm going to add variables, which would make it possible to run programs such as this one:

a = 25
print a

The updated grammar is shown below:

<program> = <statement> [ <statement> ... ]
<statement> = (<print_statement> | <assignment>) "\n"
<print_statement> = "print" <expression>
<assignment> = identifier "=" <expression>
<expression> = number | identifier

The main difference is that the <statement> rule now has two possible options, either <print_statement> as in the previous version, or a new <assignment> rule. An assignment is defined as an identifier token, followed by a "=" token, and then by an expression. The <expression> rule is expanded to allow either numbers or identifiers as well.

These are the tasks that are required to upgrade our language implementation:

  • Expand the tokens() method to support the new identifier and = tokens. An identifier will be defined as a sequence of characters that starts with a letter and is not a reserved word.
  • Expand parse_statement() to include assignments
  • Write a parse_assignment() method
  • Expand parse_expression() to include identifiers

Let's do these in order. Here is the new tokens() method:

    def tokens(self):
        for line in self.code.strip().split('\n'):
            self.line_nr += 1
            for token in line.strip().split(' '):
                if token in ['print', '=']:
                    yield (token,)
                elif token.isnumeric():
                    yield ('number', int(token))
                elif token[0].isalpha():
                    yield ('identifier', token)
                else:
                    self.raise_error(f'Invalid token {token}')
            yield ('\n',)

This was a small change, I just added the = as a predefined token and then considered identifiers any unknown tokens that start with a letter.

Next, the parse_statement() method:

    def parse_statement(self):
        if not self.parse_print_statement() and not self.parse_assignment():
            self.raise_error('Expected: print statement or assignment')
        token = self.next_token()
        if token[0] != '\n':
            self.raise_error('Expected: end of line')
        return True

Here I just needed to expand the first if-statement to check for the two possible statements that are allowed. I have also updated the error message to include the new assignments.

So far so good. Now let's write the new parse_assignment() method:

    def parse_assignment(self):
        token = self.next_token()
        if token[0] != 'identifier':
            self.return_token(token)
            return False
        identifier = token[1]
        token = self.next_token()
        if token[0] != '=':
            self.raise_error('Expected: =')
        if not self.parse_expression():
            self.raise_error('Expected: expression')

        self.vars[identifier] = self.stack_pop()
        return True

This method is written in the same style as the other ones, following the rule definition in the grammar. The method first looks for an identifier, then for the "=" token, and finally an expression. As with the print statement, if the first token (the identifier) does not match, instead of raising an error False is returned, to give the caller the option to try other rules before giving up.

After the parsing is complete, this method needs to implement the execution of the assignment. For this, the My class is going to maintain a vars attribute, which is going to be a dictionary with the values of all the variables. To execute the assignment, a dictionary key with the name of the identifier is added, with the value retrieved from the top of the stack.

Of course the vars attribute has to be initialized in the class constructor:

    def __init__(self, code):
        self.code = code
        self.line_nr = 0
        self.token_feed = self.tokens()
        self.returned_token = None
        self.stack = []
        self.vars = {}

The final change is the expansion of the parse_expression() method to handle the new definition of this rule. Here is the updated method:

    def parse_expression(self):
        token = self.next_token()
        if token[0] not in ['number', 'identifier']:
            self.return_token(token)
            return False

        if token[0] == 'identifier':
            if token[1] not in self.vars:
                self.raise_error(f'Syntax Error: Unknown variable {token[1]}')
            else:
                self.stack_push(self.vars[token[1]])
        else:
            self.stack_push(token[1])
        return True

There are two changes here. In the parsing stage, both number and identifier tokens are accepted, as indicated in the grammar rule. In the execution phase, when the token is an identifier, the value of the variable is pushed to the stack instead of the variable name. Here it is also easy to check that the variable is defined by looking for its name in the vars dictionary. An error is raised when the variable is unknown. No change is necessary in the execution phase when the token is a number.

Ready to test the second version of the language? Put the following program in the test.my file:

a = 25
print a

Then run the program to see variables in action:

$ python my.py test.my
25

Now change the print to use a bad variable name. For example:

a = 25
print b

This program raises an error:

❯ python my.py test.my
2: Syntax Error: Unknown variable b

Now, thanks to the way the <expression> rule is defined, an assignment can have a variable on the right side. This isn't very useful yet, but it will be once more complex mathematical expressions are implemented. For now, you can test this with the following program:

a = 25
b = a
print b

The "My" Programming Language, Version 0.9

I hope you are as excited as I am seeing how slowly but surely the little my language is starting to take shape. The next improvement is localized to the <expression> rule. Let's make this rule accept more interesting expressions that include additions, subtractions, multiplications and divisions, between numbers, variables or any combination of them. Here is the new version of this rule:

<expression> = <value> [ <operator> <expression> ]
<operator> = "+" | "-" | "*" | "/"
<value> = number | identifier

The old <expression> rule is now renamed to <value>. The new <expression> is defined as a value, optionally followed by an operator and another expression, which in this rule is defined recursively.

Here is an example program that uses the new expression syntax:

a = 1 + 2 * 3
b = a / 2
print b + 1

Before I show you the implement of these new rules, I think it would be useful to discuss how the stack is going to be used to help parse expressions of this level of complexity. There are three different events of significance to consider:

  • When the <value> rule executes, the actual value is pushed to the stack. This is the behavior that is currently implemented for the <expression> rule.
  • When the <operator> rule executes, the stack will be collapsed, then the result of the collapse and the operator will be pushed to the stack.
  • When a rule needs to get the result of an expression, it will collapse the stack and use the result.

Here you can see that there is a new operation on the stack, the collapse. The idea of a collapse is to pop the last three elements from the stack, which should be a value, an operator, and another value, and then push the result of performing the operation on those two values. This can be repeated until the stack has a single element with the final result of the expression.

This is probably confusing, so let me show you how the stack is going to look while the 1 + 2 * 3 expression is parsed.

Token Stack Comment
1 1
+ 1,+ Before pushing the +, a collapse is done on the stack, but since the stack has only one element, there is nothing to do.
2 1,+,2
* 3,* Before pushing the *, the stack is collapsed, which means that the 1 and the 2 are combined with the + operator, leaving a result of 3 in the stack
3 3,*,3

After the expression is parsed, a rule is going to need to use its result. At that point, a final collapse will be issued, which for the example expression used above would return a result of 9.

Note that solving expressions left to right as in this example is unusual, because in general multiplications and divisions take precedence over additions and subtractions. At this point the my interpreter does not know about this, this will be added later.

To begin with this work, the tokens() method has to be expanded to understand the four operators:

    def tokens(self):
        for line in self.code.strip().split('\n'):
            self.line_nr += 1
            for token in line.strip().split(' '):
                if token in ['print', '=', '+', '-', '*', '/']:
                    yield (token,)
                elif token.isnumeric():
                    yield ('number', int(token))
                elif token[0].isalpha():
                    yield ('identifier', token)
                else:
                    self.raise_error(f'Invalid token {token}')
            yield ('\n',)

The new collapse operation on the stack is shown below:

    def stack_collapse(self):
        while len(self.stack) > 1:
            value2 = self.stack_pop()
            prev_op = self.stack_pop()
            value1 = self.stack_pop()
            if prev_op == '+':
                self.stack_push(value1 + value2)
            elif prev_op == '-':
                self.stack_push(value1 - value2)
            elif prev_op == '*':
                self.stack_push(value1 * value2)
            elif prev_op == '/':
                self.stack_push(value1 // value2)
        return self.stack.pop()

Collapsing the stack is implemented as a while-loop that runs until there is a single value left in the stack. In each iteration, three elements are removed, value2, operator and value1 in that order. Then a new element is pushed with the result of performing the corresponding operation on the two values.

When the stack is fully collapsed, there should be a single element left with the final result. The method pops this last result and returns it to the caller, leaving the stack completely empty.

Even though expressions are now much more complex, the <expression> rule itself is simpler to implement, as it uses recursion. Here is the new parse_expression() method:

    def parse_expression(self):
        if not self.parse_value():
            return False
        if self.parse_operator():
            self.parse_expression()
        return True

Easy, right? After a value is successfully parsed, the method checks if an operator can also be parsed, and in that case a recursive expression is parsed after it, as required by the grammar.

The parse_value() method is identical to the old parse_expression() method:

    def parse_value(self):
        token = self.next_token()
        if token[0] not in ['number', 'identifier']:
            self.return_token(token)
            return False

        if token[0] == 'identifier':
            if token[1] not in self.vars:
                self.raise_error(f'Syntax Error: Unknown variable {token[1]}')
            else:
                self.stack_push(self.vars[token[1]])
        else:
            self.stack_push(token[1])
        return True

Finally, here is the parse_operator() rule:

    def parse_operator(self):
        token = self.next_token()
        if token[0] not in ['+', '-', '*', '/']:
            self.return_token(token)
            return False

        self.stack_push(self.stack_collapse())
        self.stack_push(token[0])
        return True

Note how the execution portion of this method collapses the stack, pushes the final result back to it as its first and only element, and only then pushes the operator in question.

To complete this upgrade, the two rules that consume expressions, which are <print_statement> and <assignment> need to be updated to collapse the stack to get their expression results:

    def parse_print_statement(self):
        token = self.next_token()
        if token[0] != 'print':
            self.return_token(token)
            return False
        if not self.parse_expression():
            self.raise_error('Expected: expression')

        value = self.stack_collapse()
        print(value)
        return True

    def parse_assignment(self):
        token = self.next_token()
        if token[0] != 'identifier':
            self.return_token(token)
            return False
        identifier = token[1]
        token = self.next_token()
        if token[0] != '=':
            self.raise_error('Expected =')
        if not self.parse_expression():
            self.raise_error('Expected expression')

        self.vars[identifier] = self.stack_collapse()
        return True

Here is the example program I shared above once again:

a = 1 + 2 * 3
b = a / 2
print b + 1

Put this code in test.py and then execute it to see my calculating expressions:

$ python my.py test.my
5

Great! Remember that expressions are solved left to right for now, so the a variable ends up with the value 9 as a result of the addition being evaluated before the multiplication. Then b is assigned a 4 (integer division is used), and finally a 5 is printed.

You probably noticed that all the example programs I've been testing have spaces between tokens. If you are used to other programming languages you may be tempted to eliminate some of the spaces, like maybe writing print b+1 instead of print b + 1. Unfortunately the lexer implementation relies on spaces acting as token separators, so the spaces are required in this language. Most languages have more elaborate lexers that are able to separate tokens without relying on white space, but for now I'm not going to worry about this.

The "My" Programming Language, Version 1.0!

In this section I'm going to make a last improvement to bring the my language to a very humble version 1. The improvement is going to be to use a natural order of operations when solving mathematical expressions, with multiplications and divisions being solved before additions and subtractions.

To help the expression evaluator know what order to use when solving expressions, each supported operation will be assigned a precedence value. The evaluation algorithm will check those values, and will always solve operations with higher precedence first. Here are the precedence values for the four operations:

operation precedence value
+ 1
- 1
* 2
/ 2

This is a straightforward way to define that multiplications and divisions must be solved before additions and substractions. When there is a sequence of operations with the same precedence value, a left-to-right order will be used.

How does precendence values help? There are a couple of changes that need to be made when working with the stack:

  • For each operation pushed to the stack, its precedence value must be pushed along with it.
  • When pushing a new operation on the stack, any operations that have equal or higher value, starting from the top of the stack, must be collapsed first. Once the top operation in the stack has a lower precedence, the collapsing must stop.

To explain this it is best to look at an example. Consider the expression 4 + 5 * 6 - 7. With a strictly left-to-right evaluator, the result would be 47, but when using the natural evaluation order the multiplication would be done first, and then the final result would be 27.

Let's run this expression step-by-step with a stack. Note that operators are now pushed to the stack with their precedence, for example +[1] or *[2].

Token Stack Comment
4 4
+ 4,+[1] Before pushing the + the stack is collapsed. Since the stack has only one element, there is nothing to do.
5 4,+[1],5
* 4,+[1],5,*[2] Before pushing the * the stack is collapsed. Because +[1] has lower precedence than *[2], nothing is done.
6 4,+[1],5,*[2],6
- 34,-[1] Before pushing the - the stack is collapsed. Because *[2] has higher precedence than -[1], the multiplication is collapsed to 30. And because +[1] is equal to -[1], the addition is collapsed as well, leaving the stack with just a 34.
7 34,-[1],7

At this point the expression is completely parsed. The rule that needs to consume this expression will issue a final stack collapse, which will return 27 as the result.

The first change to implement this is to add a dictionary with the precedences of all operators. This can be added as a class variable:

class My:
    precedence = {'+': 1, '-': 1, '*': 2, '/': 2}

    # ...

When pushing an operator to the stack, its precedence needs to be added as well. To implement this, instead of pushing the operator token as a string, a tuple with the string and the precedence will be pushed. This is the updated parse_operator() method:

    def parse_operator(self):
        token = self.next_token()
        if token[0] not in ['+', '-', '*', '/']:
            self.return_token(token)
            return False

        self.stack_push(self.stack_collapse())
        self.stack_push((token[0], self.precedence[token[0]]))
        return True

The stack_collapse() method indiscriminately collapsed all operations until the stack was empty. This has to be done more carefully now. First of all, the incoming operator needs to be passed as an argument so that its precedence can be compared against those in the stack:

    def stack_collapse(self, next_operator=None):
        op_precedence = 0 if next_operator is None else \
            self.precedence[next_operator]
        # ...

Here you can see that next_operator is now an argument which can be a string with the operator that needs to be pushed next. When a statement needs a final result for the expression currently in the stack there isn't going to be a next operator, so in that case the default of None will set the operator precedence to 0. Since 0 is lower than any operator precedences, it will cause the full collapse that is needed to retrieve the final result.

The collapse loop now needs to run while the operation precedence in the stack is higher than the incoming operation. Here is the complete stack_collapse() method:

    def stack_collapse(self, next_operator=None):
        op_precedence = 0 if next_operator is None else \
            self.precedence[next_operator]
        while len(self.stack) > 1 and self.stack[-2][1] > op_precedence:
            value2 = self.stack_pop()
            prev_op = self.stack_pop()[0]
            value1 = self.stack_pop()
            if prev_op == '+':
                self.stack_push(value1 + value2)
            elif prev_op == '-':
                self.stack_push(value1 - value2)
            elif prev_op == '*':
                self.stack_push(value1 * value2)
            elif prev_op == '/':
                self.stack_push(value1 // value2)
        return self.stack.pop()

Going back to the parse_operator() method, the next_operator argument must be added to the collapse call:

    def parse_operator(self):
        token = self.next_token()
        if token[0] not in ['+', '-', '*', '/']:
            self.return_token(token)
            return False

        self.stack_push(self.stack_collapse(next_operator=token[0]))
        self.stack_push((token[0], self.precedence[token[0]]))
        return True

This is very tricky, so be sure to review these changes carefully to understand how everything works. Write the new test program in test.my:

a = 4 + 5 * 6 - 7
print a

And now try the new interpreter on it:

$ python my.py test.my
27

Final Thoughts

I hope my step-by-step development of the little my language gave you a good understanding of how programming languages are constructed. I counted 137 lines of Python code in the final implementation, which is quite small for a complete language implementation!

But of course, the language remains extremely simple. There are a lot of interesting extensions that can be added:

  • A smarter lexer that can find tokens even when there are no white space between them
  • Use of parenthesis to alter the operator precedence when evaluating expressions
  • A code generator that writes compiled my code to disk files (maybe to .myc files, similar to Python's .pyc), and a separate interpreter that executes them.
  • Control statements such as if, while, maybe even for
  • Callable functions
  • And many more ideas that you can copy from Python or your favorite programming language!

At this time I'm not sure if I'll continue developing this language and implementing some of the above ideas. I guess it will depend on interest, so please let me know what you thought of this tutorial so far and if you'd like to see me add more advanced features.

Become a Patron!

Hello, and thank you for visiting my blog! If you enjoyed this article, please consider supporting my work on this blog on Patreon!

6 comments
  • #1 Emiliano said

    Thanks for your effort explaining how programming languages work Miguel!

  • #2 Andy Cook said

    I've really enjoyed what you've written so far and I have learned a lot. Personally I would like to see a further part showing how parentheses are dealt with. Thanks.

  • #3 Laurent said

    Thanks for this very interesting series of articles. Would very much like to see the compiling part of the 'my' language. Thanks again !

  • #4 Cosme said

    Could something like this be used to create a "domain specific language"?

  • #5 Miguel Grinberg said

    @Cosme: yes, of course!

  • #6 Connor said

    i would love to see more of this series! it explains everything super well, and i was able to actually understand the code, something rare in tutorials these days :)

Leave a Comment