History of Parsing Methods

Anatomy of the Compilation Process

The different tasks to be done by a compiler can be broken down as follows:
     Preprocessing - handling include files, macros, and manifest
          constants
     Scanning - breaking the input file into tokens and managing the
          symbol table
     Parsing - verifying that the input tokens form valid syntax
     Semantics - deriving meaning from the structure of the input tokens
     Optimization - finding ways to accomplish the same task more quickly
     Code Generation - generating assembly language code for a specific
          computer
     Assembly - assembling the assembly language code into an object
          module
     Linking - combining one or more object modules with libraries to
          form an executable (the job of the ld program)
The primary emphasis of this course is to study parsing techniques and the theory that makes them work.

BNF notation

John Backus, the principal designer of FORTRAN, and Peter Naur, then the editor of the Algol Bulletin, both attended a conference on Algol held in Paris in 1960. On the return trip they discussed the definition of Algol-60 as they had perceived it. Although they had attended the same meetings and discussions, they came away from the conference with widely differing interpretations of what the language would look like and how it would work. Together they worked on refining a notation for describing the grammars of languages. This notation is called BNF, or Backus Naur Form. Some authors have dropped Peter Naur's name from their definitions and called it Backus Normal Form instead.

An example of a BNF grammar that describes the rules for the +, -, *, and / operators might look something like:

     EXP    => EXP + TERM
     EXP    => EXP - TERM
     EXP    => TERM
     TERM   => TERM * FACTOR
     TERM   => TERM / FACTOR
     TERM   => FACTOR
     FACTOR => ( EXP )
     FACTOR => identifier
The names EXP, TERM, and FACTOR are just arbitrary names that represent groups of symbols; for example, FACTOR is defined as either an identifier (a variable name) or an expression in parentheses. Symbols that are composed of other symbols are called Non-Terminal symbols. Symbols such as keywords, operators, and names, or any other symbol that may appear literally in a program, are called Terminal symbols.

The Terminal symbols for this grammar are +, -, *, /, (, ), and identifier. The Non-Terminal symbols are EXP, TERM, and FACTOR.
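
To make the distinction concrete, the following small sketch (Python is used here purely for illustration; the RULES representation is our own, not part of the course material) stores the productions of the example grammar as data and derives the two sets of symbols from them: the Non-Terminals are the symbols that appear on a left-hand side, and the Terminals are everything else that appears on a right-hand side.

     # The example grammar stored as (left-hand side, right-hand side) pairs.
     RULES = [
         ('EXP',    ['EXP', '+', 'TERM']),
         ('EXP',    ['EXP', '-', 'TERM']),
         ('EXP',    ['TERM']),
         ('TERM',   ['TERM', '*', 'FACTOR']),
         ('TERM',   ['TERM', '/', 'FACTOR']),
         ('TERM',   ['FACTOR']),
         ('FACTOR', ['(', 'EXP', ')']),
         ('FACTOR', ['identifier']),
     ]

     # Non-Terminals are the symbols defined by some rule; Terminals are the rest.
     nonterminals = {lhs for lhs, _ in RULES}
     terminals = {sym for _, rhs in RULES for sym in rhs} - nonterminals

     print(sorted(nonterminals))   # ['EXP', 'FACTOR', 'TERM']
     print(sorted(terminals))      # ['(', ')', '*', '+', '-', '/', 'identifier']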

Early Parsing Methods

Grammar Embedded in the Code

The earliest compilers were written with the definition of the language buried deep within the code. With these compilers it was very difficult to verify that the compiler accepted all of the language syntax and only the language syntax. This became especially difficult when the definition of the language was changed for later versions. All compilers before the early 1960s were of this type because there was no uniform method of describing language grammars.

Recursive Descent

With the advent of the BNF notation for describing languages, compiler writers designed the structure of their subroutines and functions to correspond to the structure of the BNF definition of the language. Using our example grammar above, there would be separate functions to handle EXP's, TERM's, and FACTOR's. The EXP function would call itself and the TERM function, and so on. This way, when it came time to update the language to meet changing standards, it was easier to find where the changes should be made. It also made it much easier to verify that the compiler accepted all legal syntax and only the legal syntax.
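
As a rough illustration (a minimal sketch in Python, not taken from any particular compiler; the Parser class, the token list, and the treatment of any non-operator token as an identifier are all assumptions made here), a recursive descent recognizer for the example grammar might look like the following. The left-recursive rules such as EXP => EXP + TERM cannot be turned directly into recursive calls, so they are handled with loops.

     class Parser:
         def __init__(self, tokens):
             self.tokens = tokens   # a list of strings, e.g. ['(', 'a', '+', 'b', ')']
             self.pos = 0

         def peek(self):
             return self.tokens[self.pos] if self.pos < len(self.tokens) else None

         def expect(self, tok):
             if self.peek() != tok:
                 raise SyntaxError("expected %r, found %r" % (tok, self.peek()))
             self.pos += 1

         def exp(self):
             # EXP => EXP + TERM | EXP - TERM | TERM, with the left
             # recursion rewritten as a loop.
             self.term()
             while self.peek() in ('+', '-'):
                 self.pos += 1
                 self.term()

         def term(self):
             # TERM => TERM * FACTOR | TERM / FACTOR | FACTOR
             self.factor()
             while self.peek() in ('*', '/'):
                 self.pos += 1
                 self.factor()

         def factor(self):
             # FACTOR => ( EXP ) | identifier
             if self.peek() == '(':
                 self.pos += 1
                 self.exp()
                 self.expect(')')
             elif self.peek() is not None and self.peek() not in ('+', '-', '*', '/', ')'):
                 self.pos += 1      # any other token is treated as an identifier
             else:
                 raise SyntaxError("unexpected token %r" % (self.peek(),))

     Parser(['(', 'a', '+', 'b', ')', '*', 'c']).exp()   # accepted
     Parser(['a', '+', '*', 'b']).exp()                  # raises SyntaxError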

Recursive descent does not guarantee that the parser matches the grammar; it only makes it easier for the compiler writer to verify the accuracy of the parser. The search for better parsing methods continued with techniques that analyzed the grammars and attempted to automate the construction of the parser. The first such method was called Top Down Parsing, or LL Parsing.

Top Down Parsing

The top down parsing method, also called a predictive parse or LL parse, requires that we reorganize the grammar so that the first symbol of each rule defining a given Non-Terminal indicates which rule to choose for that Non-Terminal. This transformation can be attempted on any grammar but is sometimes awkward, and there are some grammars that cannot be parsed correctly with Top Down Parsing methods at all.
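
For the example expression grammar, the usual reorganization removes the left recursion by introducing new Non-Terminals so that the next input token selects the rule. (The names EXPTAIL and TERMTAIL below are chosen here only for illustration; the rules marked (empty) produce nothing at all and are selected when the next token is not one of the listed operators.)

     EXP      => TERM EXPTAIL
     EXPTAIL  => + TERM EXPTAIL
     EXPTAIL  => - TERM EXPTAIL
     EXPTAIL  => (empty)
     TERM     => FACTOR TERMTAIL
     TERMTAIL => * FACTOR TERMTAIL
     TERMTAIL => / FACTOR TERMTAIL
     TERMTAIL => (empty)
     FACTOR   => ( EXP )
     FACTOR   => identifier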

Bottom Up Parsing

The bottom up parse, also called an LR parse, is the most powerful parsing method. It also has the most complicated set of algorithms for building the parse tables. There are several algorithms for building LR parse tables, but the tables they produce are all driven by the same parsing algorithm.
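
As a rough illustration of that common parsing algorithm, here is a small Python sketch of the shift-reduce engine. The grammar (just S => ( S ) and S => x), the ACTION and GOTO tables, and the parse function below were all built by hand purely for illustration; a real LR generator would compute the tables from the grammar automatically.

     # Grammar:  1: S => ( S )     2: S => x
     RULES = {1: ('S', 3), 2: ('S', 1)}   # rule number -> (left side, length of right side)

     # Hand-built parse tables.  's' = shift to state, 'r' = reduce by rule,
     # 'acc' = accept.  '$' marks the end of the input.
     ACTION = {
         (0, '('): ('s', 2), (0, 'x'): ('s', 3),
         (1, '$'): ('acc', 0),
         (2, '('): ('s', 2), (2, 'x'): ('s', 3),
         (3, ')'): ('r', 2), (3, '$'): ('r', 2),
         (4, ')'): ('s', 5),
         (5, ')'): ('r', 1), (5, '$'): ('r', 1),
     }
     GOTO = {(0, 'S'): 1, (2, 'S'): 4}

     def parse(tokens):
         tokens = tokens + ['$']
         stack = [0]                            # a stack of states
         pos = 0
         while True:
             entry = ACTION.get((stack[-1], tokens[pos]))
             if entry is None:
                 raise SyntaxError('unexpected %r' % tokens[pos])
             kind, arg = entry
             if kind == 's':                    # shift: consume one token
                 stack.append(arg)
                 pos += 1
             elif kind == 'r':                  # reduce: pop the rule body, push the GOTO state
                 lhs, length = RULES[arg]
                 del stack[len(stack) - length:]
                 stack.append(GOTO[(stack[-1], lhs)])
             else:                              # accept
                 return True

     print(parse(['(', '(', 'x', ')', ')']))    # True
     parse(['(', 'x'])                          # raises SyntaxError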

LR(0)

The first of the LR table generation algorithms is called LR(0), and it generates tables that are somewhat large. The LR(0) algorithm cannot handle grammars with certain types of ambiguities.

LR(1)

The second algorithm, LR(1), handles these grammars correctly. The LR(1) algorithm generates correct parse tables for grammars with all of the ambiguities that are found in most useful languages. The biggest strike against LR(1) parse tables is that they are much larger than the LR(0) tables.

SLR

The third algorithm attempts to handle some of the ambiguities that LR(0) fails on while keeping the parse tables the same size as those generated by LR(0). It is called Simple LR, or SLR.

LALR

The last algorithm, Look Ahead LR, generates parse tables that are the same size as those of LR(0), yet handles nearly all of the ambiguities that are handled by LR(1).