Packrat parsing is a novel and practical
method for implementing linear-time parsers
for grammars defined in Top-Down Parsing
Language (TDPL). While TDPL was originally
created as a formal model for top-down
parsers with backtracking capability, this
thesis extends TDPL into a powerful
general-purpose notation for describing
language syntax, providing a compelling
alternative to traditional context-free
grammars (CFGs). Common syntactic idioms
that cannot be represented concisely in a
CFG are easily expressed in TDPL, such as
longest-match disambiguation and syntactic
predicates, making it possible to describe
the complete lexical and grammatical syntax
of a practical programming language in a
single TDPL grammar.
Packrat parsing is an adaptation of a
30-year-old tabular parsing algorithm that
was never put into practice until now. A
packrat parser can recognize any string
defined by a TDPL grammar in linear time,
providing the power and flexibility of a
backtracking recursive descent parser
without the attendant risk of exponential
parse time. A packrat parser can recognize
any LL(k) or LR(k) language, as well as many
languages requiring unlimited lookahead that
cannot be parsed by shift/reduce
parsers. Packrat parsing also provides
better composition properties than LL/LR
parsing, making it more suitable for dynamic
or extensible languages. The primary
disadvantage of packrat parsing is its
storage cost, which is a constant multiple
of the total input size rather than being
proportional to the nesting depth of the
syntactic constructs appearing in the input.
Monadic combinators and lazy evaluation
enable elegant and direct implementations of
packrat parsers in recent functional
programming languages such as Haskell. Three
different packrat parsers for the Java
language are presented here, demonstrating
the construction of packrat parsers in
Haskell using primitive pattern matching,
using monadic combinators, and by automatic
generation from a declarative parser
specification. The prototype packrat parser
generator developed for the third case
itself uses a packrat parser to read its
parser specifications, and supports full
TDPL notation extended with semantic
predicates, allowing parsing decisions to
depend on the semantic values of other
syntactic entities. Experimental results
show that all of these packrat parsers run
reliably in linear time, efficiently support
scannerless parsing with integrated
lexical analysis, and provide the
user-friendly error-handling facilities
necessary in practical applications.