Since the last MCD post, I've also written a post detailing the work I've done on the design of the language: Fundamentals of Mona. Now I'm continuing with detailing the development of the actual implementation, specifically the syntax analysis phase of the compiler, or parsing.

This post focuses just on the parsing process, and does not touch upon building intermediate representations and symbol table management, which, while related to syntax analysis, are topics that deserve a post on their own. Also that's what I'm currently working on.

Role of the syntax analysis phase

The output of the previous phase, lexical analysis, is a stream of tokens: tagged pieces of information analogous to words in a sentence -- with both content and kind/type (e.g. identifiers, string literals, etc). Syntax analysis determines the actual horizontal/sentential relationship between the tokens, analogous to analyzing parts of speech in natural language -- i.e. the function of various constructs.

In terms of implementation, the most important role of parsing is structuring the flat input which is in the form of a token stream, by matching it to grammatical rules, and in the end forming a tree-like structure called an abstract syntax tree (AST).

Here's an illustration of the process:

  • Example code parsed into an AST (source)

Such structuring amounts to translating the already-existing programmer-intended structures in the code being parsed, into a form native to the compiler (the language which the compiler is written in, for Mona that's currently Scala), enabling the structures to be processed further.

Context-free grammars

In order to parse a string of some language, it is often required to formally specify the language's grammar (or at least some part of it). I won't be going into an in-depth analysis of various categories of formal grammars, but it is enough to note that most programming languages are largely specified by context-free grammars, which are symbol rewriting systems consisting of a set of nonterminal symbols, a set of terminal symbols, and production rules of specific form which specify the actual structure of the constructs in the language.

The terminal symbols are essentially token signifiers, meaning that they are used in the production rules as they will appear in the source code, and cannot be rewritten further. An example terminal would be something like Identifier, signifying the token class for identifiers, with the specific lexeme providing the concrete identifier string (such as counter, some binding in the source code). Terminals can only appear on the right hand side of production rules, while nonterminals are used on both sides and don't appear in the material source code, since their function is describing its structure.

Example of a production rule:

SimpleArgument -> Terminal(OpenBlock), OptionalNewlines, StatementSequence, Terminal(CloseBlock)

It specifies that the SimpleArgument nonterminal, a component of the Mona grammar, can be rewritten as a sequence of terminals and nonterminals, in this case describing the structure of expression blocks such as { some_function_call () }. The expanded sequence, however, includes two additional nonterminals -- OptionalNewlines, and StatementSequence, which are further processed by the parsing algorithm. Each nonterminal can have multiple production rules, and one of the main tasks in developing a parser is determining which production rule should be applied in a particular context (this is an automated process).

The process of parsing amounts to consuming the token input stream, and recursively expanding correct nonterminals, in order to match the resulting terminals in the productions with the inputs. This process is further explained in the Recursive descent section.

Syntax directed translation

Simply describing the structure of the grammar is practically insufficient for complete syntax analysis, since the final output is an AST with special construction methods which depend on the production rules being applied, and a populated symbol table -- a data structure containing information about identifiers used in the code. Additional actions need to be included in the parsing process, aside from just analyzing the structure, in order to obtain these results. This concept is called syntax directed translation, and its implementation is twofold:

  1. Attaching functions (code fragments) to production rules, mainly for creating and manipulating AST nodes and entries in the symbol table. For example, the above production rule for SimpleArgument could have a fragment attached to it, which would produce an AST node ExpressionNode to denote an expression (which, when evaluated, would have the value of the last expression in that statement -- it's a recursive structure); and
  2. Creating a system for passing the information produced by fragments through the grammatical structure as it's being parsed. In the aforementioned rule, the two nonterminals appearing on the right side would have to be expanded and parsed before the final fragment for the entire production is called. That fragment will use the information provided by the symbols on the right side, i.e. their fragments. This part will be explained in further detail below.

Specification of the Mona grammar

Monac embeds the Mona grammar in its source code, by describing the nonterminals, terminals, production rules, and fragments as objects in the Scala language:

  • Terminals are instantiated from classTags describing the token type they signify. Example: Terminal(classTag[Newlines]);
  • Nonterminals are just objects without properties (comparable to elements of an enum);
  • Both terminals and nonterminals extend the Symbol class;
  • Code fragments are functions which accept a single argument of type Context, containing the information about the scope in which the production is being expanded (for example, the scope of a function if an expression contained in that function is being parsed), and a list of AST nodes produced by the symbols on the right-hand side of the rule. The function returns an AST node for the production, and its side-effect is the potential populating of the symbol table;
  • Production rules are Scala 3-tuples of type NonTerminal -> List[Symbol] -> ((Context) => ASTNode), meaning that each left-hand side and right-hand side combination forming a rule is also tied to a specific fragment, activated when the rule is applied.

The grammar has not been finalized, i.e. only a subset of Mona can be parsed. Current progress can be seen here: MonaGrammar.scala.

Recursive descent

The recursive descent parsing algorithm in monac is taken mostly from the Dragon Book, with the exception that not each nonterminal has its own expansion function. These functions, which in the Dragon Book use mutual recursion to expand the nonterminals as needed, are replaced by a single parse function of two arguments: the symbol of the nonterminal to be expanded, and a symbol table for the parent scope of the expansion.

This is the algorithm used by the parse function:

  1. The next token is requested from the input stream;
  2. A production is selected depending on that token. If no appropriate rule can be found, an error occurs;
  3. The symbols from the right hand side are processed in sequence. Nonterminals get processed by recursively calling the parse function on them (with a new parent scope), while terminals are wrapped in appropriate ASTNode structures;
  4. The code fragment function is called with a context constructed from the parent scope passed into the parse function and a list of symbols processed in step 3.

Error recovery currently remains unimplemented as other things are of higher priority.

The entire parsing algorithm can be seen here: Parser.scala.

Selecting productions

While the parsing algorithm described above is quite simple, it relies on a data structure which is more complicated to compute: a parse table. This data structure is a matrix indexed by a nonterminal and a token, which contains information about which production rule should be applied for that specific combination. The data structure assumes the parser is of the class LL(k=1), where the number 1 refers to the number of lookahead tokens which need to be provided (of course, the data structure could be modified to support any concrete k, but 1 is simplest to implement).

The requirement for being able to construct an LL(1) parser is that the grammar is left-factored, meaning no two different right-hand sides for the same nonterminal, or their expansions, begin with the same symbols. This can be ensured with a simple algorithm for left factoring, and I do it while writing down the grammar since automating it would only complicate the process.

Once a left-factored, unambiguous grammar is provided, two data structures need to be computed before the final parse table can be constructed: first and follow sets, for each symbol in the grammar.

  • The first set provides information about the terminals which can appear as beginning of a sequence of symbols. If the aforementioned production rule for SimpleArgument was its only one, first(SimpleArgument) would equal Set({);
  • The follow set analogously is the set of terminal appearing after a sequence of symbols. If expressions in some language can be parenthesized like ( Expression ), then follow(Expression) would include ).

The algorithms for computing these sets completely and using them to construct are unnecessary for this post, but can be seen in the constructParseTable function here: Grammar.scala.

Computing the parse table need only be done once for each iteration of the grammar, since the results are saved in a file which is loaded by the compiler proper at the start.

Further development

While I've implemented the most important parts of the parsing engine, I'm still working on the results of the syntax analysis phase, that is AST node construction and symbol table management.