This thesis documents several new developments in the theory of parsing, and also the
practical value of their implementation in the Copper parser generator.
The most widely-used apparatus for syntactic analysis of programming languages
consists of a scanner based on deterministic finite automata, built from a set of regular
expressions called the lexical syntax, and a separate parser, operating on the output of
this scanner, built from a context-free grammar and utilizing the LALR(1) algorithm.
While the LALR(1) algorithm has the advantage of guaranteeing a non-ambiguous
parse, and the approach of keeping the scanner and parser separate make the compilation
process more clean and modular, it is also a brittle approach. The class of grammars
that can be parsed with an LALR(1) parser is not closed under grammar composition,
and small changes to an LALR(1) grammar can remove the grammar from the LALR(1)
class. Also, the separation of scanner and parser prevent the use, in any organized way of parser context to resolve ambiguities in the lexical syntax. One area in which both of these drawbacks pose a particular problem is that of
parsing embedded and extensible languages. In particular, it forms one of the last
major impediments to the development of an extensible compiler in which language
extensions are imported and composed by the end user (programmer) in an analogous
manner to the way libraries are presently imported. This is due not only to the problem
of the LALR(1) grammar class not being closed under composition, but to the very
real possibility that the lexical syntax of two different extensions will clash, making it
impossible to construct a scanner without extensive manual resolution of ambiguities,
if at all. This thesis presents three innovations that are a large step towards eliminating parsing
as an Achilles’ heel in this application. Firstly, it describes a new algorithm of scanning
called context-aware scanning, in which the scanner at each scan is made aware of
what sorts of tokens are recognized as valid by the parser at that point. By allowing the
use of parser context in the scanner to disambiguate, context-aware scanning makes the
specification of embedded languages much simpler — instead of specifying a scanner
that must explicitly switch “modes” to scan on the different embedded languages, one
simply compiles a context-aware scanner from a single lexical specification, which has
implicit “modes” to scan properly on each embedded language. Similarly, in an extensible
language, this puts a degree of separation between the lexical syntax of different
language extensions, so that any clashes of this sort will not be an issue.
Secondly, the thesis describes an analysis that operates on grammar extensions of a certain form, and can recognize their membership in a class of extensions that can be
composed with each other and still produce a deterministic parser—enabling end-users
to compose extensions written by different language developers with this guarantee of
determinism. The analysis is made practical by context-aware scanning, which ensures
a lack of lexical issues to go along with the lack of syntactic nondeterminism. It is this
analysis — the first of its kind — that is the largest step toward realizing the sort of
extensible compiler described above, as the writer of each extension can test it independently
using the analysis and thus resolve any lexical or syntactic issues with the
extension before the end user ever sees it.
Thirdly, the thesis describes a corollary of this analysis, which allows extensions
that have passed the analysis to be distributed in parse table form and then composed
on-the-fly by the end users, with the same guarantee of determinism. Besides expediting
the operation of composition, this also enables the use of the analysis in situations
where the writer of a language or language extension does not want to release its grammar
to the public.
Finally, the thesis discusses how these approaches have been implemented and
made practical in Copper, including runtime tests, implementations and analyses of
real-world grammars and extensions.
University of Minnesota Ph.D. dissertation. July 2010. Major: Computer Science. Advisor: Eric Van Wyk. 1 computer file (PDF); xx, 296 pages, appendix A.
Context-aware scanning and determinism-preserving grammar composition, in theory and practice.
Retrieved from the University of Minnesota Digital Conservancy,
Content distributed via the University of Minnesota's Digital Conservancy may be subject to additional license and use restrictions applied by the depositor.