Antlr: The basics of parsing programming languages source code

In today's world, the process of turning written program code into something that can be executed is often taken for granted and overlooked. The compilation process is intricate and can still take an uncomfortable amount of time for really big enterprise applications. This blog post describes the basics of how a piece of code can be parsed into a structure that can be logically understood, so that operations can be executed accordingly.

Today we have a plethora of tools that can help us with this task. An older example is the pair Lex and YACC (Yet Another Compiler Compiler), which were used with the C programming language; a Python port named PLY (Python Lex-Yacc) also exists today. Many languages, like C# and Python, have internal helper libraries that work only on code written in their own language and can be used to build tooling - .NET's Roslyn is one such example. We will discuss Antlr4. Antlr4 is written in Java, but has ports for many different languages (including my preferred one - Python).

A code parser can be used to create a new programming language or to parse an existing one for different purposes, including:

1. Translating a piece of code into binary data for execution (assembly or compilation)

2. Executing the parsed code directly (interpreted languages)

3. Performing a static code analysis - security audits, code review audits

4. Code deobfuscation

5. Code generation

Example

I have implemented a pretty simple math interpreter. It accepts a file containing code, parses it, and then executes the commands. The statements I chose to implement demonstrate how to approach implementing a simple programming language parser. The example is definitely not thorough so readers can try to extend it with their own language implementations while learning Antlr.

The example's speed, optimization, and elegance were a secondary focus, so I would never recommend this approach for any kind of production use; it is for learning only.

Implementation

Tokenizing

The first part of any language processing is turning the text into logical tokens. This is done with a lexer. When defining a lexer, we define a list of token types and describe what each type looks like. For instance, we could define a token named DEF which matches the literal text def: wherever the word def appears in the code, a DEF token will be created. We could also define a token named NAME as ('a'..'z' | 'A'..'Z')+, which matches any run of one or more letters. Note that the NAME definition also covers the DEF definition, so it is really important that tokens which can include one another are defined in the correct order. In this case, we would define the DEF token first and then the NAME token.

In the previous example, we saw the character +. This is one of three characters, borrowed from regular-expression syntax, that define multiplicity: *, +, and ?, which denote "zero or more", "one or more", and "zero or one" respectively. There are other special constructs, mostly taken from regular expressions, to denote the sets of characters that can belong to a token. For instance, a dot (.) represents any character, while ranges and alternatives can denote a set of characters: ('a' .. 'z' | 'A' .. 'Z' | '0' .. '9' | '_') denotes a single lowercase or uppercase letter, digit, or underscore.

It is also important to note that the tokenizer works without knowledge of the grammar rules (which are the next step), so it won't choose between two definitions depending on what you expect to appear in a rule - it will always produce the first one that matches. This is why, if you define the NAME token first, it will match any def word as a NAME token, which you might not want. It is also the reason you can't use the word def where a name is expected - instead of matching a NAME token, the lexer will produce a DEF token and the parser will throw a syntax error.
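The effect of definition order can be sketched in plain Python with a toy first-match tokenizer. This is a simplification (Antlr's actual matching is more sophisticated, but definition order still breaks ties between equal-length matches), and the token names here are just the hypothetical DEF/NAME pair from the text:

```python
import re

# Hypothetical token specs, tried strictly in order: DEF before NAME.
TOKEN_SPECS = [
    ("DEF", r"def\b"),
    ("NAME", r"[A-Za-z]+"),
    ("WS", r"[ \t\n]+"),
]

def tokenize(text):
    """Yield (token_type, lexeme) pairs, skipping whitespace."""
    pos = 0
    while pos < len(text):
        for name, pattern in TOKEN_SPECS:
            m = re.match(pattern, text[pos:])
            if m:
                if name != "WS":
                    yield (name, m.group())
                pos += m.end()
                break
        else:
            raise SyntaxError(f"unrecognized character at offset {pos}")

print(list(tokenize("def foo")))
# With DEF listed first, "def" becomes a DEF token, not a NAME.
```

Swapping the first two entries in TOKEN_SPECS would make "def" come out as a NAME token, which is exactly the ordering pitfall described above.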

Finally, it is important to note that some characters can be marked as ignored - usually spaces, tabs, newlines, and so on. These are simply skipped and don't need to exist between tokens.

There are a few ways to test whether the correct tokens are produced. The first one in my example is to use the grun.bat file with the -tokens flag (see my GitHub repository and the Antlr getting-started instructions for how to run it). This prints out all the found tokens, which you can check against your expected list. You could also do it programmatically in Python by inspecting the contents of the antlr4.CommonTokenStream(lexer) output.

Grammar rules: EBNF

Figure 1: Antlr parse tree representation in Visual Studio Code, my editor of choice

EBNF is shorthand for Extended Backus-Naur Form. With it, we can write a set of rules that define the desired programming language tree structure. There is always a root node, which usually represents the whole script, file, or program, and then branches toward the individual lines and instructions.

When the rules are written, the language parser receives a list of tokens as input and performs rule matching on them. When a particular rule is matched to a subset of tokens, the rule's callback method is executed, which means you can implement different actions for each rule. If a particular token sequence can't be matched to any rule, a syntax error is raised and you can perform error handling, like reporting the line and column where the error was found.

Abstract Syntax Tree

Parsing a program with a language parser gives us a structure called a CST (Concrete Syntax Tree) as the result. This tree contains every token found (including things like line terminators) and every rule node. If we want to perform something simple like a one-pass validation, we might implement it directly on the CST and be done. But if we want to perform multiple passes over the program, it is much smarter to map the CST to an AST (Abstract Syntax Tree). In the AST we keep all the information we need and drop everything we don't - for instance, tokens that don't contribute any information of their own (line terminators, equals signs, anything that can be derived from other data in the node). We could also merge nodes into one where that is logically sound.
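The CST-to-AST mapping can be sketched with a toy example. The node classes and the tuple-based CST shape below are hypothetical stand-ins, not the structures from my repository; the point is only that punctuation tokens disappear during the mapping:

```python
from dataclasses import dataclass

# Hypothetical AST node types for a tiny expression language.
@dataclass
class Number:
    value: float

@dataclass
class BinOp:
    op: str
    left: object
    right: object

def to_ast(cst):
    """Map a toy CST (nested tuples of rule names and token strings)
    to AST nodes, dropping tokens that carry no information of their own."""
    kind = cst[0]
    if kind == "number":
        return Number(float(cst[1]))
    if kind == "add":
        # ("add", left, "+", right, ";") -> the "+" and ";" tokens vanish
        return BinOp("+", to_ast(cst[1]), to_ast(cst[3]))
    raise ValueError(f"unknown CST node {kind!r}")

ast = to_ast(("add", ("number", "1"), "+", ("number", "2"), ";"))
print(ast)
```

After the mapping, every later pass (evaluation, analysis, code generation) works on the smaller, cleaner AST instead of the raw parse tree.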

Error handling

As previously mentioned, Antlr has some nice error handling capabilities. Firstly, every token (including error tokens) carries useful metadata, like its line number and column, which can be used to produce very informative error reports. Furthermore, if the tokenizer encounters a piece of text that can't be matched to any token specification, an error token is produced. Lastly, when the parser encounters a valid token that doesn't fit any of the parsing rules, an error is again raised, which can be handled appropriately.
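The line/column metadata Antlr attaches to tokens is easy to reproduce for a character offset, which is handy when formatting your own error messages. A minimal sketch (1-based lines and 0-based columns, matching Antlr's convention):

```python
def line_and_column(text, offset):
    """Compute the 1-based line and 0-based column of a character offset."""
    line = text.count("\n", 0, offset) + 1
    last_newline = text.rfind("\n", 0, offset)
    column = offset - (last_newline + 1)
    return line, column

source = "a = 1;\nb = $;\n"
offset = source.index("$")       # the unrecognized character
line, col = line_and_column(source, offset)
print(f"line {line}:{col} unrecognized character '$'")
```

This is the same shape of report Antlr's default error listener produces (`line X:Y ...`), so a custom handler built on this metadata fits in naturally.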

Antlr also has the capabilities to recover from errors, but that is fuel for another blog post.

Using the grammar

Writing the grammar

Like pretty much any other language parser, Antlr supports separating the tokenizer specification from the parser specification, but it also supports keeping them in the same file. My personal recommendation is to keep them in one file for any small learning project, but to separate them for anything even slightly more complex.


COMPARE_EQ: '==';
COMPARE_NE: '!=';
COMPARE_G: '>';
COMPARE_GE: '>=';
COMPARE_L: '<';
COMPARE_LE: '<=';

IF: 'if';
ELSE: 'else';
WHILE: 'while';
BREAK: 'break';
PRINT: 'print';
DEF: 'def';
RETURN: 'return';
NAME: NAME_CHAR+;
NUMBER: DIGIT+ (DOT DIGIT+)?;

WS: [ \n\t\r]+ -> skip;

fragment DIGIT: ('0'..'9');
fragment NAME_CHAR: ('0'..'9'|'a'..'z'|'A'..'Z'|'_'|'-');
Code listing 2: Fragment of a lexer definition
In the above code listing we can see a fragment of a simple math calculator lexer definition. Tokens are specified by writing a name and a regex-like pattern matching their signature. Some tokens are static words, while others can contain an arbitrary number of letters and/or digits. Tokens marked as fragments are never turned into tokens themselves; they are only used as building blocks inside other token definitions. It is important to note that token names must start with an uppercase letter - this is how Antlr differentiates rule names from token names.



script : body;
body : (command | comment)*;
command : assign SEMI
| if_command
| while_command
| break_command SEMI
| print_command SEMI
| function_command
| function_call SEMI
| return_command SEMI;
if_command : IF BRACE_L value BRACE_R CURLY_L body CURLY_R (elseBody=else_command)? # IfCommandBody
| IF BRACE_L value BRACE_R command (elseBody=else_command)? # IfCommandSingle;
else_command : ELSE CURLY_L elseBody=body CURLY_R # ElseCommandBody
| ELSE elseLine=command # ElseCommandSingle;

Code listing 4: Fragment of a parser definition
The rules are also defined with names, which start with a lowercase letter. Each rule can reference a subrule, which gets "called" (or rather inserted by name into position) where its name appears. Each rule can have several alternatives (separated by the | character), so in the example above a command can be an assignment, an if block, a while block, and so on. When specifying rules we can also use the quantifiers *, +, and ? to define how many times a rule can appear in a given position; with no quantifier, the rule must appear exactly once. In the listing above, a body consists of zero or more commands or comments. An important detail is that the parenthesized group (command | comment) matches a single command or a single comment, and the whole group can repeat, so any interleaving of commands and comments is acceptable. If we had written command* comment* instead, there could be zero or more commands followed by zero or more comments, but they could not intermix.
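The difference between (command | comment)* and command* comment* can be demonstrated with regular expressions over a token stream, since EBNF quantifiers behave the same way. Here "c" and "m" are assumed shorthands for a command token and a comment token:

```python
import re

# (command | comment)*  vs  command* comment*, over abbreviated tokens.
intermixed = re.compile(r"(?:c|m)*$")
segregated = re.compile(r"c*m*$")

assert intermixed.match("cmcm")        # any interleaving is accepted
assert segregated.match("ccmm")        # commands first, then comments: fine
assert not segregated.match("cmcm")    # intermixing is rejected
```

All three assertions hold, which mirrors exactly how the two grammar variants would accept or reject the same sequence of commands and comments.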

Walking the grammar tree

Antlr4 supports two ways of walking the generated CST: you can implement a tree visitor or you can implement a listener. In either case, Antlr generates base classes for your convenience, which you can then subclass and override.

The listener contains an enter* / exit* callback pair for each rule (where * is the rule name), and the callbacks are fired as the tree is walked. On the plus side, this method doesn't rely on recursive method calls, so you can't run into a stack overflow. On the negative side, you can't control the flow of the tree walk, and results can't be returned from the callbacks to the parent context. The first point means the tree is walked depth-first from the root and back, and you can't change that. The second means you have to implement your own solution (usually a stack-based one) to build up and return data.
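The stack-based workaround usually looks something like the sketch below. The listener class and callbacks here are simplified stand-ins (not Antlr's generated ones), and the walker is simulated by calling the callbacks by hand in the order a real walk over "1 + 2" would fire them:

```python
class ExprListener:
    """Stand-in for a generated listener; results flow through a stack."""
    def __init__(self):
        self.stack = []

    def exitNumber(self, value):
        # A leaf pushes its value for the enclosing rule to consume.
        self.stack.append(value)

    def exitAdd(self):
        # Children were pushed in order; pop them to build the parent result.
        right = self.stack.pop()
        left = self.stack.pop()
        self.stack.append(left + right)

listener = ExprListener()
# Simulate the tree walker firing callbacks for the expression "1 + 2":
listener.exitNumber(1)
listener.exitNumber(2)
listener.exitAdd()
print(listener.stack[-1])  # the result of the addition
```

When the walk finishes, the final result sits on top of the stack - the standard pattern for getting "return values" out of a listener.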

The visitor contains a set of visit* overrides, which call the appropriate methods to visit the node's children. This means there can be many recursive method calls, which can lead to a stack overflow. But it also means a child can return a value that the parent handles directly, without any additional machinery. Also, if you so choose, you can omit or reorder calls to the children and thus control the flow of the tree walk.
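The visitor's return-value flow can be sketched in a few lines. The node classes and visitor here are simplified stand-ins for Antlr's generated ones, but the shape (one visit* method per node type, children returning values to the parent call) is the same:

```python
class Num:
    def __init__(self, value):
        self.value = value

class Add:
    def __init__(self, left, right):
        self.left, self.right = left, right

class EvalVisitor:
    """Stand-in for a generated visitor; children return values directly."""
    def visit(self, node):
        # Dispatch on the node type, as Antlr does per rule.
        return getattr(self, f"visit{type(node).__name__}")(node)

    def visitNum(self, node):
        return node.value

    def visitAdd(self, node):
        # Each recursive call returns its subtree's value to the parent.
        return self.visit(node.left) + self.visit(node.right)

tree = Add(Num(1), Add(Num(2), Num(3)))  # 1 + (2 + 3)
print(EvalVisitor().visit(tree))
```

No stack bookkeeping is needed - the call stack itself carries the intermediate results, which is exactly why deep trees can eventually overflow it.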

Both of these methods have one big drawback: when new rules are added, the base listener / visitor is regenerated, so you must carefully compare the differences and add overrides for the new methods to your custom subclass. This makes maintaining listeners and visitors a hassle. It is also why you should override the generated listener / visitor instead of changing it directly - when you regenerate it with the new rules, all your custom code is gone! It is worth mentioning that the listener has one big advantage over the visitor: it implements a method pair named enterEveryRule / exitEveryRule, while the visitor has no equivalent. This allows us to implement a sort of reflection approach that maps the AST generically even when the rules change.

In my example, I chose to implement the visitor pattern because I couldn't find much information about it and wanted to learn it. There was one interesting hurdle to overcome: the visitor implements the calls to each child, which work nicely, but it also implements a method called visitChildren, which behaves unexpectedly out of the box. The next code section lists the original implementation details; the full implementation can be found here.


def visitChildren(self, node):
    result = self.defaultResult()
    n = node.getChildCount()
    for i in range(n):
        if not self.shouldVisitNextChild(node, result):
            return result
        c = node.getChild(i)
        childResult = c.accept(self)
        result = self.aggregateResult(result, childResult)
    return result

def defaultResult(self):
    return None

def aggregateResult(self, aggregate, nextResult):
    return nextResult
Code listing 6: ParseTreeVisitor implementation details
If you follow the execution, you will notice that the child results are not aggregated into a list - only the last child's result is returned. In my case, after all the children were visited, the last one returned None, so the result of visitChildren was always None. To overcome this, override defaultResult to return an empty list and aggregateResult to append nextResult to the aggregate.
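The fix can be demonstrated in a self-contained sketch. The BaseVisitor below is a simplified stand-in that reproduces the aggregation logic of Antlr's visitChildren (the shouldVisitNextChild check is omitted for brevity), and the Node class is a hypothetical tree node, not Antlr's:

```python
class BaseVisitor:
    """Simplified stand-in mirroring Antlr's default aggregation logic."""
    def visitChildren(self, node):
        result = self.defaultResult()
        for child in node.children:
            result = self.aggregateResult(result, self.visit(child))
        return result

    def visit(self, node):
        return node.value if not node.children else self.visitChildren(node)

    def defaultResult(self):
        return None            # default: lose everything but the last child

    def aggregateResult(self, aggregate, nextResult):
        return nextResult

class ListVisitor(BaseVisitor):
    def defaultResult(self):
        return []              # start each aggregation with a fresh list

    def aggregateResult(self, aggregate, nextResult):
        aggregate.append(nextResult)
        return aggregate       # keep every child's result

class Node:
    def __init__(self, value=None, children=()):
        self.value, self.children = value, list(children)

tree = Node(children=[Node(1), Node(2), Node(3)])
print(ListVisitor().visitChildren(tree))  # all child results are kept
```

With the two overrides in place, visitChildren returns a list of all the children's results instead of only the last one, which is what the interpreter needs when a rule has several meaningful children.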

The entire code for this app is available on my GitHub page. I believe it is a nice starting point for someone wanting to learn about Antlr4. The best way to start is to make this code run. Following the next few steps should do the trick:

1. Install Java

2. Install Antlr

3. Install Antlr Python target

4. Pull or Clone the repository

5. Run it!

Go through the code, debug it, implement other code blocks like foreach and most of all, have fun! If there are any questions or thoughts, post them below and I'll be glad to read and answer them.

Project on GitHub: simple-math-calculator

Conclusion

As stated previously, this is an attempt to make a simple interpreter for educational purposes, without any attempt at making it perform well. My only goal was for it to work and for me to learn something. The whole calculation approach is based on a method that knows how to execute anything, called in a pretty much recursive fashion. This means the more lines a math file has, the closer the execution gets to a stack overflow. This can, of course, be avoided, but it is a good enough example for the purpose at hand.