GENERAL CONTEXT-FREE RECOGNITION AND PARSING
BASED ON VIABLE PREFIXES
By
D. CLAY WILSON
A DISSERTATION PRESENTED TO THE GRADUATE SCHOOL
OF THE UNIVERSITY OF FLORIDA IN
PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY
UNIVERSITY OF FLORIDA
1990
Dedicated to my parents, in humble recognition of their lifelong support and encouragement and of the countless sacrifices they have made on my behalf.
ACKNOWLEDGMENTS
George Logothetis is an educator in the true sense of the word. His commitment to his students is exemplary. It is an honor to be his first Ph.D. advisee. George's contribution to my development as a person and a computer scientist is indelible. His imprint on this dissertation is no less so.
My associations with Manuel Bermudez and Joe Wilson have been most rewarding, stimulating, and enjoyable. Perhaps unbeknownst to them, they kept me going when the going got tough.
My sincere appreciation is extended to Randy Chow and David Wilson for agreeing to serve on my supervisory committee. They took on a task which my alter ego would have refused.
Jean's unfailing devotion is as perplexing as it is sustaining. Her sense of humor in the face of adversity is remarkable. She gives purpose to life.
TABLE OF CONTENTS
ACKNOWLEDGMENTS
LIST OF FIGURES
ABSTRACT

CHAPTER

I INTRODUCTION
  Overview
  Literature Review
  Outline in Brief

II NOTATION AND TERMINOLOGY
  Elements of Formal Language Theory
  Context-Free Grammars and Languages
  State-Transition Graphs and Finite-State Automata

III GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK
  Recognition Based on Derivations
  Top-Down Right-to-Left Recognition
  Top-Down Left-to-Right Recognition
  Discussion

IV GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK
  Bottom-Up Left-to-Right Recognition
  Discussion

V ON EARLEY'S ALGORITHM
  Earley's General Recognizer
  A Modified Earley Recognizer
  Earley's Algorithm and Viable Prefixes
  Earley's Algorithm and Viable Suffixes
  Discussion

VI A GENERAL BOTTOM-UP RECOGNIZER
  Control Automata and Recognition Graphs
  The General-LR(0) Recognizer
  Earley's Algorithm Revisited
  Implementation Considerations
  The Complexity of Recognition
  On Garbage Collection and Lookahead
  Discussion

VII A GENERAL BOTTOM-UP PARSER
  From Recognition to Parsing
  The General-LR(0) Parser
  The Complexity of Parsing
  Garbage Collection Revisited
  Discussion

VIII CONCLUSION
  Summary of Main Results
  Directions for Future Research

REFERENCES
BIOGRAPHICAL SKETCH
LIST OF FIGURES
FIGURE
3.1 A General Top-Down Correct-Suffix Recognizer
3.2 A General Top-Down Correct-Prefix Recognizer
4.1 A General Bottom-Up Correct-Prefix Recognizer
5.1 Earley's General Recognizer
5.2 A Modified Earley Recognizer
5.3 The Definition of the State Derivative of a Path
6.1 The General-LR(0) Recognizer
6.2 The General-NLR(0) Recognizer
6.3 A Modified Reduce Function
7.1 The General-LR(0) Parser
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy

GENERAL CONTEXT-FREE RECOGNITION AND PARSING
BASED ON VIABLE PREFIXES
By
D. Clay Wilson
May 1990
Chairman: Dr. Manuel E. Bermudez
Major Department: Computer and Information Sciences
Viable prefixes play an important role in LR parsing theory. In the work presented here, viable prefixes have a commensurately central role in a theory of general context-free recognition and parsing.

A set-theoretic framework for describing general context-free recognition is presented. The operators and operands in the framework are regularity-preserving relations and regular sets of viable prefixes, respectively. A basic operation consists of computing the image of a regular set of viable prefixes under one of the relations. By extension, general recognition is characterized in terms of computing a sequence of regular sets.

For implementation purposes, finite-state automata are used to represent the regular sets. A general bottom-up recognizer that constructs an appropriate sequence of automata is described in detail. The regular languages accepted by these automata correspond to the sets of viable prefixes computed by the recognizer's set-theoretic counterpart. The automata are constructed under the guidance of a control automaton which accepts the viable prefixes of the subject grammar. Ultimately, the automata-based recognizer is extended to a truly general bottom-up parser.

Earley's algorithm is analyzed in the context of our viable-prefix-based framework, as it provides a convenient vehicle for illustrating some of our ideas. We describe how Earley's algorithm implicitly tracks the sets of viable prefixes that arise in our model. Moreover, by modifying Earley's recognizer to construct a certain directed graph, the representation of these sets is made explicit.

Our set-theoretic framework yields elegant and succinct characterizations of general context-free recognition that appear to capture the essence of the task. On the practical front, a general bottom-up parser is described in sufficient detail to be readily implemented. Although its practical potential is not evaluated here, the parser is intended for use in problem areas that require more flexible parsers than are provided within the efficient but restricted LR framework. Regardless, our viable-prefix-based treatment of recognition and parsing provides a particularly appropriate framework within which the continuum between LR parsers and our general parsers may be further investigated.
CHAPTER I
INTRODUCTION
Context-free recognition is the algorithmic process by which the membership of a string x in a context-free language L is decided. This involves determining whether x is derived by some context-free grammar G where L = L(G). Parsing is the process of ascertaining the syntactic structure imparted to x by G.

From a theoretical standpoint, context-free recognition and parsing hold considerable interest in their own right. Yet context-free grammars and their recognizers and parsers have substantial practical value as well. Most notably, results from parsing theory have proven indispensable to the implementation of programming languages. Other areas of application include natural language processing [34], syntactic pattern recognition [18], and code generation in compilers [10].

Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of G, a general recognizer (resp. parser) recognizes (resp. parses) x with respect to G. The work presented here contributes to the area of general context-free recognition and parsing. The following section provides some motivation and a brief overview of this dissertation.
Overview
The LR parsers, namely those parsers that effect a Left-to-right scan of the input while producing a Right parse, define the most powerful class of deterministic parsers. Earley's algorithm, on the other hand, is arguably the most efficient general parser. Despite the fact that LR parsers are restricted to LR(k) grammars whereas Earley's algorithm can parse strings against any context-free grammar, there are close parallels between the two.

Both LR parsers and Earley's algorithm are based on items. Each state of an LR parser corresponds to a set of LR items. Earley's algorithm constructs a sequence of state sets during recognition. The states manipulated by Earley's algorithm, call them Earley states, are slightly elaborated LR items.

Earley's algorithm and LR parsers scan the input string from left to right, recognizing an incrementally longer prefix of it in the process. That is, they are correct-prefix recognizers.

Both LR parsers and Earley's algorithm work in a bottom-up fashion. An LR parser determines the reversed rightmost derivation of an input string. In contrast, Earley's algorithm has the capability of producing all of the reversed rightmost derivations of an input string.

The relationship between Earley's algorithm and LR parsers can be described on a more fundamental level in terms of viable prefixes. Viable prefixes are certain prefixes of right sentential forms. At each point during a parse, the contents of an LR parser's stack implicitly represent a viable prefix which derives the portion of the input string parsed to that point. We let VP(G) denote the set of viable prefixes of a grammar G. In addition, let VP(G, x) denote the set of those viable prefixes of G which derive x, a string over the terminal alphabet of G.

Turning now to Earley's algorithm, consider a point in a parse at which some prefix x of the input string has been processed. The sequence of Earley state sets constructed up to that point encapsulates the strings in VP(G, x). The manner in which VP(G, x) is normally represented in the state sets is rather indirect. However, this representation can be made explicit through a variant of Earley's algorithm which constructs a directed graph whose vertices are the states generated by the original algorithm. Under an appropriate interpretation, this graph yields a finite-state automaton which accepts VP(G, x). Details of this proposed graphical variant of Earley's algorithm are supplied later.

Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of G, VP(G, x) is a regular language. This fact can be established analytically. Alternatively, the graphical variant of Earley's algorithm mentioned above provides a constructive proof of this result.

In light of these observations, the primary thrust of this work is the formal development of an approach to general context-free recognition and parsing that is based on explicitly computing VP(G, x) for an incrementally longer prefix x of the input string. In particular, the viable prefix is the central concept upon which useful general recognizers and parsers are founded. The development is rigorous, yet we strive for clarity and elegance by resorting to basic principles wherever possible. In short, our approach to general recognition and parsing generalizes the role played by viable prefixes in LR parsers in order to accommodate arbitrary grammars.

This work consists of three logical divisions. In the first (Chapters III and IV), the mathematical foundation for our viable-prefix-based approach to recognition and parsing is developed. The basic tools are a handful of binary relations on strings. General recognition is described using these relations and simple set-theoretic concepts. A key property of the relations is that they preserve regularity. Consequently, general top-down and bottom-up recognition schemes are defined in terms of computing the images of regular sets of viable prefixes under these relations. In short, general recognition is reduced to computing a sequence of regular sets.

In the second major division (Chapter V), Earley's algorithm is used as a vehicle for demonstrating the efficacy of our set-theoretic approach to general recognition. In particular, the graph-based variant of Earley's algorithm is presented there. This modified algorithm illustrates one way in which VP(G, x) can be explicitly computed, where x is a prefix of the input string. In the process of analyzing our Earley derivative, some subtle properties of Earley's original algorithm are also revealed and its relationship with LR parsers is clarified.

The last part of this work (Chapters VI and VII) casts our approach to general recognition and parsing into an automata-theoretic framework. First, a general recognizer is described in considerable detail. The recognizer uses an automaton which accepts VP(G) to guide the construction of an automaton that accepts VP(G, x), where x is some prefix of the input string. For convenience, the description of the algorithm employs the LR(0) automaton of G as the guiding automaton. However, the algorithm allows a rather broad range of VP(G)-accepting automata to be used instead. For example, employing the nondeterministic LR(0) automaton of G as the controlling automaton yields a general recognizer which works quite similarly to our graph-based Earley algorithm. Finally, this automata-based recognizer is extended to a general parser. Means for representing parse forests and handling ambiguity are described. The recognizer and parser are presented in enough detail to be readily implemented. In anticipation of this, many practical issues are discussed.
Literature Review
A comprehensive introduction to formal languages and automata is presented by Hopcroft and Ullman [24]. These two related disciplines are prerequisites to a study of context-free recognition and parsing. An up-to-date monograph on parsing theory has been written by Sippu and Soisalon-Soininen [39]. Two volumes by Aho and Ullman [6,7] contain a wealth of information; numerous parsing algorithms are presented, both general and restricted, along with much of the theory underlying them.

Some early general parsing algorithms are compared by Griffiths and Petrick [22]. All of the algorithms surveyed rely on backtracking, so they run in O(c^n) time in the worst case (n is the length of the input string).
Although it is restricted to Chomsky Normal Form grammars, the Cocke-Younger-Kasami algorithm [6,19,46] is regarded as the first general parser to run in polynomial time (O(n^3)). The n × n parse matrix that the algorithm constructs accounts for an O(n^2) space complexity. Recall that the matrix entries are filled with sets of nonterminal symbols.
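The tabular scheme just described can be sketched in a few lines of Python. This is an illustrative reconstruction rather than the cited algorithm verbatim: the grammar representation (a `rules` map from right-hand sides to sets of left-hand-side nonterminals) and the name `cyk_recognize` are assumptions of ours.

```python
from itertools import product

def cyk_recognize(x, rules, start):
    """Minimal CYK recognizer for a grammar in Chomsky Normal Form.
    `rules` maps each right-hand side -- a terminal 'a' or a pair
    ('B', 'C') -- to the set of left-hand-side nonterminals."""
    n = len(x)
    if n == 0:
        return False            # CNF grammars derive no null string
    # table[i][j] = set of nonterminals deriving x[i : i + j + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(x):                      # length-1 substrings
        table[i][0] = set(rules.get(a, set()))
    for length in range(2, n + 1):                 # longer substrings
        for i in range(n - length + 1):
            for k in range(1, length):             # split point
                for B, C in product(table[i][k - 1],
                                    table[i + k][length - k - 1]):
                    table[i][length - 1] |= rules.get((B, C), set())
    return start in table[0][n - 1]
```

For example, with the CNF grammar S → AB | AY, Y → SB, A → a, B → b (which generates {aⁿbⁿ | n ≥ 1}), the recognizer accepts "aabb" and rejects "aab".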
A version of the Cocke-Younger-Kasami algorithm that is restricted to unambiguous grammars is presented by Kasami and Torii [25]. The time and space bounds of this algorithm are both O(n^2 log n). Another version, which employs linked lists in place of the parse matrix, is described by Manacher [32]. This alternate storage discipline allows unambiguous grammars to be recognized in quadratic time, a marked improvement over the corresponding cubic bound of the original algorithm.

The Cocke-Younger-Kasami algorithm was reduced to matrix multiplication by Valiant [44]. Using this result, Strassen's technique for multiplying matrices [41] is applied to obtain an asymptotic worst-case time complexity of O(n^2.81) for general recognition.¹ Due to the overhead associated with this method, it is primarily of theoretical interest.
In contrast to the Cocke-Younger-Kasami algorithm, Earley's algorithm [6,13,14] can process any grammar. Like LR parsers, Earley's algorithm is based on sets of items. Although its worst-case time and space bounds are also O(n^3) and O(n^2), respectively, it performs significantly better on large classes of grammars. Specifically, unambiguous grammars are parsed in O(n^2) time, and only O(n) time is needed to parse LR(k) grammars, provided that k-symbol lookahead is used in the latter case. Earley's algorithm is examined further in later chapters.
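As a rough illustration of the item-set mechanics discussed here, the following Python sketch implements Earley's three operations (Predictor, Scanner, Completer). The grammar encoding and the name `earley_recognize` are our own assumptions, and for simplicity the sketch makes no special provision for ε-productions (whose completion within a single state set requires extra care), so it should be read as a schematic rather than a faithful rendering of Earley's published algorithm.

```python
def earley_recognize(grammar, start, x):
    """Minimal Earley recognizer.  `grammar` maps nonterminals to lists
    of right-hand sides (tuples of symbols); symbols not appearing as
    keys are terminals.  An item (A, rhs, dot, origin) records that
    rhs[:dot] derives x[origin : current position].
    Assumes no epsilon-productions."""
    n = len(x)
    sets = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        sets[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        agenda = list(sets[i])
        while agenda:
            A, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                   # Predictor
                    for alt in grammar[sym]:
                        new = (sym, alt, 0, i)
                        if new not in sets[i]:
                            sets[i].add(new); agenda.append(new)
                elif i < n and sym == x[i]:          # Scanner
                    sets[i + 1].add((A, rhs, dot + 1, origin))
            else:                                    # Completer
                for B, brhs, bdot, borg in list(sets[origin]):
                    if bdot < len(brhs) and brhs[bdot] == A:
                        new = (B, brhs, bdot + 1, borg)
                        if new not in sets[i]:
                            sets[i].add(new); agenda.append(new)
    return any(A == start and dot == len(rhs) and origin == 0
               for A, rhs, dot, origin in sets[n])
```

With the grammar E → E + T | T, T → a, the sketch accepts "a" and "a+a" while rejecting "a+".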
Efficiency improvements that may be gained by employing LL- and LR-like lookahead² in Earley's algorithm are reported by Bouckaert et al. [9]. They concluded that FIRST sets are more useful than FOLLOW sets for reducing the number of superfluous items generated during recognition. In short, FIRST (resp. FOLLOW) information reduces the number of items generated by Earley's Predictor (resp. Completer) operation. See Christopher et al. [10] for an example of an application of Earley's algorithm; specifically, it is used to generate optimized code in a Graham-Glanville style code generator [17]. If desired, Earley's algorithm may be extended to include error recovery [3,31].
1 Even faster techniques for matrix multiplication have been developed since.
2 That is, FIRST and FOLLOW sets, respectively.
An algorithm that is a hybrid of the Cocke-Younger-Kasami and Earley algorithms is described by Graham et al. [19,20]. This algorithm also accommodates arbitrary grammars. Like the Cocke-Younger-Kasami algorithm, an n × n parse matrix is constructed. However, the matrix positions are filled with sets of LR items instead of sets of nonterminals. Practical issues are discussed in detail, and claims are made that more efficient implementations are attainable than are allowed by Earley's algorithm. Subcubic versions based on matrix multiplication techniques are also described.

The class of LR(k) grammars was introduced by Knuth in the seminal paper on LR parsing theory [27]. Knuth described a method for constructing a deterministic parser for an LR(k) grammar, observed that the set of viable prefixes of an arbitrary grammar is a regular language, and proved that it is undecidable whether an arbitrary grammar is LR(k) for free k ≥ 0. The discovery of LR(k) grammars was quite significant in light of their relationship to the deterministic context-free languages [16].
Knuth's technique for parser construction is generally deemed impractical due to the enormous number of parse states that can result. The SLR(k) [12] and LALR(k) [11,29] grammars define two important subclasses of the LR(k) grammars which allow this problem to be addressed satisfactorily. Relatively compact LR parsers for grammars in these subclasses can be constructed efficiently.
Tomita's algorithm [42,43] extends the conventional LR parsing algorithm to use parse tables that contain multiply-defined entries. Conflicting parse actions are handled by employing a graph-structured stack to keep track of the different parse histories. However, some grammars cause the stack to grow without bound in instances where no input is consumed, so the algorithm is not general. Tomita's algorithm is discussed in greater detail later.
The application of Tomita's algorithm to a system which supports the incremental generation of parsers is reported by Heering et al. [23]. Specifically, Tomita's algorithm is adapted to work with an incrementally generated LR(0) automaton. The states of the automaton are created based on need. Moreover, the system accommodates extensible grammars whereby changes in the grammar during parsing produce corresponding changes in the relevant portions of the automaton.
Work which is similar in spirit to ours is that of Mayer [33]: deterministic canonical bottom-up parsing is examined in terms of reduction classes, where a reduction class is a pair of strings whose first and second components represent the left and right contexts, respectively, of parsing actions. Conditions are imposed on these reduction classes which ensure determinism, termination, and correctness. In short, the cited paper presents a framework for describing deterministic canonical bottom-up parsers, whereas our aim is a framework for characterizing general recognition and parsing.
Outline in Brief
This introductory chapter ends with a very short synopsis of the remaining chapters. The next chapter reviews some basic definitions and terminology. Chapters III through VII comprise the main body of this dissertation. Concluding remarks are made in Chapter VIII.
Chapters III and IV develop the mathematical foundation for this work. Set-theoretic characterizations of general top-down recognition and general bottom-up recognition are presented in those two chapters.
Earley's algorithm is the subject of the fifth chapter. In particular, our graphical variant of Earley's algorithm is presented there.
A general automata-based bottom-up recognizer is described in detail in Chapter VI. Chapter VII extends this recognizer into a general parser.
The major results of this dissertation are summarized in Chapter VIII. In addition, directions for future research, of which there are several, are delineated in that final chapter.
CHAPTER II
NOTATION AND TERMINOLOGY
This chapter summarizes some of the elementary formal aspects of this work, viz., assorted mathematical notation and definitions. In particular, some basic concepts of formal languages, directed graphs, and finite-state automata are reviewed. A more comprehensive presentation of the relevant theory can be found in the monograph by Sippu and Soisalon-Soininen [39].
Elements of Formal Language Theory
An alphabet, denoted in this section by Σ, is a finite set of symbols. A string over Σ is a finite sequence of elements from Σ; the null string corresponds to the empty sequence and is denoted by ε. A (formal) language over Σ is a set of strings over Σ; the set of all strings over Σ is denoted by Σ* and Σ⁺ = Σ* \ {ε}.
The length of a string is the number of symbols that it contains. The length of a string x ∈ Σ* is denoted by len(x), where len is defined recursively as follows: len(ε) = 0; ∀a ∈ Σ, len(a) = 1; ∀x, y ∈ Σ*, len(xy) = len(x) + len(y).
The previous definition used the notion of string concatenation, viz., xy. Concatenation is generalized to apply to languages as follows. Given two languages L and L′ and a string x, LL′ = {yz | y ∈ L, z ∈ L′}, xL = {x}L, and Lx = L{x}. The identity and zero of concatenation are ε and ∅ (the empty set), respectively. Thus, with x denoting either a string or a language, xε = εx = x and x∅ = ∅x = ∅.
Let L be a language and i a natural number. The ith power of L, Lⁱ, is defined recursively by L⁰ = {ε} and Lⁱ⁺¹ = LⁱL. The positive closure of L and the Kleene closure of L are defined by L⁺ = ∪_{i>0} Lⁱ and L* = ∪_{i≥0} Lⁱ = L⁺ ∪ {ε}, respectively.
Let x, y, and z be arbitrary strings over Σ and let w = xyz. Then x is a prefix of w, y is a substring of w, and z is a suffix of w. If 0 < len(x) < len(w) holds, then x is a proper prefix of w; similarly, if 0 < len(z) < len(w) holds, then z is a proper suffix of w. We define PREFIX(x) = {y ∈ Σ* | x = yz for some z ∈ Σ*} and SUFFIX(x) = {z ∈ Σ* | x = yz for some y ∈ Σ*}. If k is a natural number, then k:x (resp. x:k) denotes the unique prefix (resp. suffix) of x of length min{len(x), k}. This notation is extended to languages as follows. For L ⊆ Σ*, PREFIX(L) = ∪_{x∈L} PREFIX(x), SUFFIX(L) = ∪_{x∈L} SUFFIX(x), k:L = {k:x | x ∈ L}, and L:k = {x:k | x ∈ L}.
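For finite languages, the prefix, suffix, and concatenation operations above translate directly into Python; the function names below are our own, chosen to mirror the notation.

```python
def prefixes(x):
    """PREFIX(x) = {y | x = yz for some z}, including '' and x itself."""
    return {x[:i] for i in range(len(x) + 1)}

def suffixes(x):
    """SUFFIX(x) = {z | x = yz for some y}."""
    return {x[i:] for i in range(len(x) + 1)}

def k_prefix(k, x):
    """k:x, the unique prefix of x of length min(len(x), k)."""
    return x[:k]

def concat(L1, L2):
    """LL' = {yz | y in L, z in L'} for finite languages."""
    return {y + z for y in L1 for z in L2}
```

For instance, prefixes("ab") yields {"", "a", "ab"}, and concat({"a"}, {"", "b"}) yields {"a", "ab"}, matching the identity role of ε.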
The reversal of a string x ∈ Σ*, denoted by x^R, is defined recursively as follows: ε^R = ε; ∀a ∈ Σ, a^R = a; ∀x, y ∈ Σ*, (xy)^R = y^R x^R. Similarly, the reversal of a language L is defined by L^R = {x^R | x ∈ L}.
Context-Free Grammars and Languages
A (context-free) grammar is denoted by G = (V, T, P, S) where V is an alphabet known as the vocabulary of G, T ⊆ V and N = V \ T are the terminal and nonterminal alphabets, respectively, P ⊆ N × V* is the finite set of productions, and S ∈ N is the start symbol. The following conventions are generally adhered to: a, b, c, t ∈ T; w, x, y, z ∈ T*; A, B, C, S ∈ N; X, Y, Z ∈ V. In addition, lowercase Greek letters denote strings in V*. An arbitrary grammar G is assumed throughout the rest of this section.
A production (A, w) ∈ P is written A → w; A and w are the left-hand side and right-hand side of the production, respectively. A group of productions that share the same left-hand side, viz., A → w₁, A → w₂, ..., A → wₙ, n ≥ 1, may be abbreviated as A → w₁ | w₂ | ... | wₙ. A production with a right-hand side of ε is called a null production or ε-production.

It is common to specify a grammar by listing only its productions. In this case, the left-hand side of the first production or production group in the list is taken to be the start symbol. The nonterminal and terminal alphabets can be inferred from the productions.
If A → w is a production in P, then A → α·β is an item of G for each α and β such that w = αβ. The size of G is defined as |G| = Σ{len(Aw) | A → w ∈ P}. Note that the size of G is equivalent to |{A → α·β | A → α·β is an item of G}|. The reversal of G is the grammar G^R = (V, T, P^R, S) where P^R = {A → w^R | A → w ∈ P}.
The derives relation (⇒), a binary relation induced on V* by P, is defined formally by ⇒ = {(αAβ, αwβ) | α, β ∈ V*, A → w ∈ P}. A string γ ∈ V* such that S ⇒* γ holds¹ in G is called a sentential form of G; the set of the sentential forms of G is denoted by SF(G). The (context-free) language that is generated by G is defined by L(G) = SF(G) ∩ T*. Each member of L(G) is called a sentence of G. We use PREFIX(G) and SUFFIX(G) as abbreviations for PREFIX(L(G)) and SUFFIX(L(G)), respectively.
For A ∈ N and X ∈ V, if A ⇒* αXβ holds in G for some α, β ∈ V*, then X is reachable from A. A symbol X ∈ V is nullable if X ⇒* ε holds in G. A string γ ∈ V* is nullable if every symbol in γ is nullable. In particular, ε is trivially nullable.
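Nullability as defined above admits a standard least-fixed-point computation. The sketch below is illustrative: it assumes a representation of P as a list of (A, w) pairs with w a tuple of symbols, which is our own convention rather than anything fixed by this chapter.

```python
def nullable_symbols(productions):
    """Least fixed point: the set grows until no production A -> w with
    every symbol of w already nullable contributes a new symbol.
    `productions` is a list of (A, w) pairs, w a tuple of symbols."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for A, w in productions:
            # all(...) is vacuously true for w == (), i.e. A -> epsilon
            if A not in nullable and all(X in nullable for X in w):
                nullable.add(A)
                changed = True
    return nullable
```

For the productions S → AB, A → ε, B → b | ε, the computation yields {A, B, S}: A and B are nullable directly, and S becomes nullable because every symbol of AB is.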
A symbol X ∈ V is useful if either X = S or S ⇒* αXβ ⇒* w holds in G for some α, β ∈ V* and w ∈ T*; otherwise, X is useless. A grammar is reduced if every symbol in its vocabulary is useful. An arbitrary grammar G can be transformed into an equivalent reduced grammar² in O(|G|) time [39]. In light of this result and for the convenience that it provides, all grammars are assumed to be reduced throughout this work.
A grammar G is $-augmented if, for distinguished symbols S′ and $, P contains a production of the form S′ → S$ where S′ ∈ V is the (new) start symbol and $ ∈ T is a sentence endmarker. Moreover, S′ → S$ is the only production in which S′ and $ occur. Whenever we are working with a $-augmented grammar, all input strings are assumed to end with $.
1 The transitive (resp. reflexive-transitive) closure of a binary relation is denoted by the superscript + (resp. *).
2 For our purposes, two grammars are equivalent if they generate the same language.
State-Transition Graphs and Finite-State Automata
A state-transition graph (STG) is denoted by G = (Q, Σ, δ) where Q is a finite set of states, Σ is an alphabet, and δ ⊆ (Q × (Σ ∪ {ε})) × Q is the transition relation.³ Thus, an STG differs from a finite-state automaton only in that it does not have a start state or a set of final states designated for it. A member ((p, a), q) ∈ δ is read as a transition from p to q on a; p is the source of the transition and q is the target. A member ((p, a), q) ∈ δ is also written as (p, a, q) ∈ δ or q ∈ δ(p, a); the latter may be written as q = δ(p, a) if (p, a, q), (p, a, r) ∈ δ implies that q = r. A transition on ε is known as an ε-transition. An STG is ε-free if it has no ε-transitions. For the remainder of this section we assume an arbitrary STG G = (Q, Σ, δ).
The following property holds for all STGs that arise in this work. If (p, a, q), (p, b, r) ∈ δ and a ≠ b, then q ≠ r; in words, distinct transitions which share the same source state access distinct target states. Thus, for any pair of states p, q ∈ Q, there is at most one transition from p to q.
A path in G and the string over Σ* that it spells are defined inductively as follows. For each state q ∈ Q, (q) denotes a path in G from q to q spelling ε; for m ≥ 1, if (q₀, ..., q_{m−1}) is a path in G from q₀ to q_{m−1} spelling x and (q_{m−1}, a, q_m) ∈ δ, then (q₀, ..., q_m) is a path in G from q₀ to q_m spelling xa.
The succ function, succ: Q × Σ* → 2^Q, is defined by succ(p, x) = {q ∈ Q | there is a path in G from p to q spelling x}. Extending this function to R ⊆ Q, succ(R, x) = ∪_{q∈R} succ(q, x). The pred function, pred: Q × Σ* → 2^Q, is defined in terms of succ by pred(q, x) = {p ∈ Q | q ∈ succ(p, x)} and is similarly extended to subsets of Q.
3 A subscript is given to G later to differentiate it from a grammar.
The inverse of G is denoted by G⁻¹ = (Q, Σ, δ⁻¹) where (p, a, q) ∈ δ⁻¹ if and only if (q, a, p) ∈ δ, i.e., the transitions of G are reversed in G⁻¹.
A finite-state automaton (FSA) is denoted by M = (G, q₀, F) = (Q, Σ, δ, q₀, F) where G = (Q, Σ, δ) is an STG, q₀ ∈ Q is the start state, and F ⊆ Q is the set of final states. Each state in Q is assumed to be reachable from q₀. If G is ε-free, then M is also ε-free. If M is ε-free and (p, a, q), (p, a, r) ∈ δ implies that q = r, then M is deterministic. An arbitrary (resp. deterministic) FSA is called an NFA (resp. DFA). The (regular) language accepted by M is defined by L(M) = {x ∈ Σ* | succ(q₀, x) ∩ F ≠ ∅}. A state q ∈ Q is dead if no final state is reachable from it.
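For an ε-free STG, succ and the acceptance condition defining L(M) can be transcribed almost literally. The dictionary encoding of δ below is our own convention, not one fixed by this chapter.

```python
def succ(delta, states, x):
    """succ(R, x): states reachable from any state in R along a path
    spelling x, for an epsilon-free STG with transition relation
    delta[(p, a)] = set of target states."""
    current = set(states)
    for a in x:
        current = {q for p in current for q in delta.get((p, a), set())}
    return current

def accepts(delta, q0, finals, x):
    """L(M) membership test: succ(q0, x) meets the final states."""
    return bool(succ(delta, {q0}, x) & set(finals))
```

As a small example, the two-state automaton with δ = {(0, a) → {0}, (0, b) → {1}} accepts exactly a*b: "aab" and "b" are accepted, while "aba" is not. A pred function could be obtained in the same style by running succ over the inverse relation δ⁻¹.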
CHAPTER III
GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general top-down recognition is developed in this chapter. Two contrasting top-down recognition schemes are presented; they are distinguished by the direction in which the input string is scanned, viz., right-to-left or left-to-right. Since the two schemes turn out to be mirror images, one is derived in terms of the other. Our approach to general recognition is based on certain regularity properties of context-free grammars. Consequently, the framework is designed to highlight these properties.

The primary purpose of this chapter is to catalog some formal aspects of general top-down recognition. An investigation of the practical utility of the two general top-down recognition schemes is left for future work. However, the theoretical development contained herein is invaluable toward deriving a practical, truly general, bottom-up parser; that is the thrust of the remaining chapters. An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Recognition Based on Derivations
In a top-down approach to recognition, an attempt is made to construct a parse tree for an input string, perhaps implicitly, by starting at the root and progressing toward the leaves. The downward growth of an incomplete parse tree occurs at the frontier of the tree, which may be represented by the string of grammar symbols that label its nodes. A basic step in constructing the parse tree involves applying the ⇒ relation to this linearized form of the frontier. However, the derives relation is too undisciplined, in general, for describing top-down recognition in a useful fashion, since there is no indication of which nonterminal symbol to replace at each step. Instead, rightmost and leftmost derivations are preferred for the additional constraints that they place on the parse tree construction process.
Since rightmost and leftmost derivations are defined in terms of subrelations of the ⇒ relation, they also construct parse trees top-down. In addition, they impose a canonical1 order on the construction of parse trees. Specifically, rightmost derivations construct parse trees from right to left, whereas leftmost derivations construct them from left to right. Some basic notions about rightmost and leftmost derivations are briefly reviewed next.
Rightmost and leftmost derivations are based on the r-derives (⇒_r) and l-derives (⇒_l) relations, respectively. These relations are formally defined by ⇒_r = {(σAz, σωz) | σ ∈ V*, A→ω ∈ P, z ∈ T*} and ⇒_l = {(xAφ, xωφ) | x ∈ T*, A→ω ∈ P, φ ∈ V*}. Rightmost derivations (resp. leftmost derivations) are defined in terms of the reflexive-transitive closure of ⇒_r (resp. ⇒_l) in the usual fashion.
For γ ∈ V*, if S ⇒_r* γ holds in G, then γ is called a right sentential form of G. The set of the right sentential forms of G is denoted by SF_r(G). The inclusion SF_r(G) ⊆ SF(G) holds and is typically, but not always, proper. In contrast, for w ∈ T*, S ⇒* w holds in G if and only if S ⇒_r* w holds in G. Thus, L(G) = {w ∈ T* | S ⇒_r* w holds in G}.
For A ∈ N and X ∈ V, if A ⇒_r+ αX holds in G for some α ∈ V*, then X is right-reachable from A; furthermore, if X = A, then A is right-recursive. A grammar that has a right-recursive nonterminal is a right-recursive grammar. A symbol X ∈ V is nullable in G if and only if X ⇒* ε holds in G.
Any string γ ∈ V* such that S ⇒_l* γ holds in G is a left sentential form of G. The set of the left sentential forms of G is denoted by SF_l(G). Similar to the above, the inclusion SF_l(G) ⊆ SF(G) holds and is generally proper. In addition, L(G) = {w ∈ T* | S ⇒_l* w holds in G}.
Given A ∈ N and X ∈ V, if A ⇒_l+ Xβ holds in G for some β ∈ V*, then X is left-reachable from A; if it further holds that X = A, then A is left-recursive. A grammar is
1 In the literature, the term "canonical" is typically associated with rightmost derivations only.
left-recursive if at least one of its nonterminals is left-recursive. Finally, X ∈ V is nullable in G if and only if X ⇒_l* ε holds in G.
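Nullability is easily computed in practice by a standard fixed-point iteration. The sketch below is the classical algorithm rendered in Python (our own illustration, with productions encoded as (lhs, rhs) pairs over single-character symbols; it is not taken from the text).

```python
# Classical fixed-point computation of the nullable nonterminals:
# X is nullable iff some production X -> w has every symbol of w nullable
# (vacuously so for w = epsilon).

def nullable_symbols(productions):
    """Return the set of nonterminals X with X =>* epsilon."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(s in nullable for s in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

productions = [("S", "AB"), ("A", ""), ("B", "A"), ("B", "b")]
```

Since nullability is defined identically via ⇒, ⇒_r, or ⇒_l (as the surrounding text notes), one computation serves all three characterizations.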
Top-Down Right-to-Left Recognition
A general top-down recognition scheme that scans the input string from right to left is formally developed next.2 This scheme is based on two binary relations on V*. Through these two fundamental relations, a set-theoretic characterization of general top-down right-to-left recognition which succinctly captures the essence of the task is derived.
In concert, the two relations refine and supplant the r-derives relation. Certain regularity properties of context-free grammars that are central to our treatment of recognition are characterized directly and rather elegantly by the two relations; by comparison, a description of these properties in terms of r-derives is indirect and somewhat awkward. It is in this sense that the two relations refine the r-derives relation. Moreover, the two relations provide alternate definitions of the right sentential forms and sentences of a grammar. In that respect, the r-derives relation is supplanted by them.
Strong Rightmost Derivations
The strong rightmost derives relation (⇒_R) is defined by ⇒_R = {(αA, αω) | α ∈ V*, A→ω ∈ P}. Thus, ⇒_R is a subrelation of ⇒_r with domain V*N. For brevity, the strong rightmost derives relation is called the R-derives relation.
Strong rightmost derivations are defined in terms of the reflexive-transitive closure of ⇒_R. Thus, every strong rightmost derivation is also a rightmost derivation. The following series of lemmas compares some elementary properties of rightmost and strong rightmost derivations.
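The contrast with r-derives can be made concrete: an R-derives step applies only when the sentential form ends in a nonterminal, and that trailing nonterminal is the symbol rewritten. A minimal Python sketch, under the same illustrative encoding as before:

```python
# Sketch of one R-derives step. Unlike r-derives, the domain is V*N: the form
# must END in a nonterminal, and that final symbol is replaced.

def R_derive_step(productions, nonterminals, form):
    """All strings obtained from `form` by one application of R-derives."""
    if form and form[-1] in nonterminals:
        return {form[:-1] + rhs
                for (lhs, rhs) in productions if lhs == form[-1]}
    return set()  # forms not in V*N have no R-derives successors

productions = [("S", "aSb"), ("S", "")]
```

Note that aSb, which does have an r-derives successor, has no R-derives successor at all, since its last symbol is a terminal; this is precisely the gap that the chop relation introduced below fills.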
Lemma 3.1 For α, β ∈ V*, if α ⇒_R* β holds in G, then α ⇒_r* β holds in G.
Proof. This follows directly from the fact that ⇒_R is a subrelation of ⇒_r. □
2 For the moment, we ignore the fact that a right-to-left scan of the input is not particularly useful in practice.
Lemma 3.2 For α, β ∈ V* and A ∈ N, if α ⇒_r* βA holds in G, then α ⇒_R* βA holds in G.
Proof. Let n represent the length of a rightmost derivation of βA from α. By induction on n, we show that there exists an n-step strong rightmost derivation of βA from α.
Basis (n = 0). Assume that α ⇒_r^0 βA holds in G. This implies that α = βA, since ⇒_r^0 is equivalent to the identity relation on V*. Since ⇒_R^0 is also equivalent to the identity relation on V*, α ⇒_R^0 βA also holds in G.
Induction (n > 0). By assumption, α ⇒_r^n βA holds in G. The last step in a particular n-step derivation of βA from α can take two distinct forms. These are analyzed in the following two cases.
Case (i): α ⇒_r^{n-1} γB ⇒_r γωA = βA for some γ ∈ V* and B→ωA ∈ P. By the induction hypothesis, α ⇒_R^{n-1} γB holds in G. Since γB ⇒_R γωA holds in G by definition, we conclude that α ⇒_R^n βA holds in G.
Case (ii): α ⇒_r^{n-1} βAB ⇒_r βA for some B→ε ∈ P. By the induction hypothesis, α ⇒_R^{n-1} βAB holds in G. Thus, α ⇒_R^n βA also holds in G since βAB ⇒_R βA holds. In both cases, we have shown that α ⇒_R^n βA holds in G. □
Lemma 3.3 For α ∈ V* and a ∈ T, if α ⇒_r* βa holds in G for some β ∈ V*, then α ⇒_R* γa holds in G for some γ ∈ V*.
Proof. Assume that α ⇒_r* βa holds in G for some β ∈ V*. If α = γa for some γ ∈ V*, then α ⇒_R^0 γa trivially holds in G. Otherwise, suppose that α does not end with a. In this case, every rightmost derivation of βa from α is nontrivial. We analyze one such rightmost derivation and focus on the step that causes a to become the rightmost symbol in a string occurring in that derivation. The initial segment of the derivation up to and including this step can take two distinct forms.
Case (i): α ⇒_r* δA ⇒_r δωa for some δ ∈ V* and A→ωa ∈ P. By Lemma 3.2, α ⇒_R* δA holds in G. By definition, δA ⇒_R δωa holds in G. Thus, α ⇒_R* γa holds in G when we let γ = δω.
Case (ii): α ⇒_r* δaA ⇒_r δa for some δ ∈ V* and A→ε ∈ P. Similar to Case (i), α ⇒_R* δaA and δaA ⇒_R δa both hold in G. Now we let γ = δ to conclude that α ⇒_R* γa holds in G. We have demonstrated in both cases that α ⇒_R* γa holds in G for some γ ∈ V*. □
Lemma 3.4 For A ∈ N and X ∈ V, X is right-reachable from A in G if and only if A ⇒_R+ αX holds in G for some α ∈ V*.
Proof. If X is right-reachable from A in G, then A ⇒_r+ βX holds in G for some β ∈ V*. If X ∈ N, then A ⇒_R+ βX also holds in G by Lemma 3.2. If X ∈ T, then Lemma 3.3 applies, i.e., A ⇒_R+ αX holds in G for some α ∈ V*. Conversely, suppose that A ⇒_R+ αX holds in G for some α ∈ V*. It follows directly from Lemma 3.1 that X is right-reachable from A. □
Corollary For A ∈ N, A is right-recursive in G if and only if A ⇒_R+ βA holds in G for some β ∈ V*. □
Lemma 3.5 For X ∈ V, X is nullable in G if and only if X ⇒_R* ε holds in G.
Proof. If X ∈ T, X is not nullable in G and X ⇒_R* ε does not hold in G. Now suppose that X ∈ N. If X is nullable in G, then every rightmost derivation which demonstrates this must be of the form X ⇒_r* A ⇒_r ε for some A→ε ∈ P. From Lemma 3.2 and the fact that A ⇒_R ε holds in G, we conclude that X ⇒_R* ε holds in G. Conversely, X ⇒_R* ε immediately implies that X is nullable in G since ⇒_R is a subrelation of ⇒_r. □
Corollary For γ ∈ V*, γ is nullable in G if and only if γ ⇒_R* ε holds in G. □
One final lemma is presented before introducing the companion relation to ⇒_R. The lemma is useful for motivating this second relation.
Lemma 3.6 For α ∈ V*, at least one of the following two statements is true: (1) α ⇒_R* βa holds in G for some β ∈ V* and a ∈ T; (2) α ⇒_R* ε holds in G.
Proof. If α = ε, then statement (2) holds trivially. Now suppose that α ≠ ε. Since G is reduced, α ⇒_r* x holds in G for some x ∈ T*. If x = ε, then statement (2) again holds, from the corollary to Lemma 3.5. Otherwise, x = ya for some y ∈ T* and a ∈ T. By Lemma 3.3, it now follows that α ⇒_R* βa holds in G for some β ∈ V*. □
Lemma 3.3, in contrast to Lemma 3.2, illustrates that a rightmost derivation departs from a strong rightmost derivation following the step where a terminal symbol first appears at the right end of a string occurring in the rightmost derivation. The role of the second relation that we introduce is to dispense with terminal symbols as they appear at the right end of strings in strong rightmost derivations. Specifically, the chop relation (⊣) is defined by ⊣ = {(αa, α) | α ∈ V*, a ∈ T}. For every a ∈ T, ⊣_a denotes the subrelation of ⊣ with domain V*a. Thus, for α, β ∈ V* and a ∈ T, α ⊣_a β holds if and only if α ⊣ β and α = βa hold.
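On a finite set of strings, the chop relation and its subrelations are immediate to realize; the following Python fragment is our own illustration under the same single-character encoding as the earlier sketches.

```python
# Sketch of the chop relation: chop removes a trailing terminal. The
# subrelation for a fixed terminal a applies only to strings ending in a.

def chop_a(forms, a):
    """Image of `forms` under the chop subrelation for terminal a."""
    return {form[:-1] for form in forms if form.endswith(a)}

def chop_all(forms, terminals):
    """Image of `forms` under the full chop relation (any trailing terminal)."""
    return {form[:-1] for form in forms if form and form[-1] in terminals}
```

Strings ending in a nonterminal (or the empty string) are simply outside the domain of chop and contribute nothing to the image.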
The relation product ⇒_R* ⊣, a useful composition that is suggested by Lemma 3.6, is used extensively in what follows. Formally, for α, β ∈ V*, α (⇒_R* ⊣) β holds in G if and only if α ⇒_R* βa ⊣ β holds in G for some a ∈ T; this latter expression is usually written as α (⇒_R* ⊣_a) β.
For clarity, we describe inductively the notation that we employ for exploiting the reflexive-transitive closure of (⇒_R* ⊣); similar conventions are applied to other relation products that are introduced later. For all α ∈ V*, α (⇒_R* ⊣)^0_ε α holds in G; for α, β, γ ∈ V*, y ∈ T^{n-1} with n ≥ 1, and a ∈ T, if α (⇒_R* ⊣)^{n-1}_y β and β (⇒_R* ⊣)_a γ hold in G, then α (⇒_R* ⊣)^n_{ay} γ holds in G. The order of ay in the latter expression reflects the fact that the terminal symbols of a string are generated by ⇒_R and chopped by ⊣ from right to left. Finally, if α (⇒_R* ⊣)^n_y β holds in G for some α, β ∈ V* and y ∈ T^n with n ≥ 0, then for convenience we may instead write this expression as α (⇒_R* ⊣)_y β, α (⇒_R* ⊣)* β, or α (⇒_R* ⊣)^n β according to whether or not the string y or its length n is relevant.
Right Sentential Forms Revisited
Next we investigate how arbitrary rightmost derivations are mimicked by the ⇒_R and ⊣ relations. In short, a rightmost derivation is represented as a sequence of strong rightmost derivations interspersed with chops of terminal symbols. As a result of this analysis, the precise manner in which right sentential forms and sentences are generated by the two new relations is revealed.
Lemma 3.7 For α, β ∈ V*, if α ⇒_R* β holds in G, then αz ⇒_r* βz holds in G for every z ∈ T*.
Proof. If α ⇒_R* β holds in G, then α ⇒_r* β holds in G by Lemma 3.1. The consequent in its full generality can then be established by an induction on the length of an arbitrary string z ∈ T*. □
Lemma 3.8 For α, β ∈ V* and z ∈ T*, if α (⇒_R* ⊣)_z β holds in G, then α ⇒_r* βz holds in G.
Proof. The proof is by induction on n = len(z).
Basis (n = 0). In this case, z = ε. By assumption, α (⇒_R* ⊣)^0_ε β holds in G. It must then be the case that α = β, so α ⇒_r* βz trivially holds in G.
Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^{n-1}. Assume that α (⇒_R* ⊣)^n_{ay} β holds in G. Then α (⇒_R* ⊣)^{n-1}_y γ (⇒_R* ⊣)_a β holds in G for some γ ∈ V*. By the induction hypothesis, α ⇒_r* γy holds in G. Furthermore, γ (⇒_R* ⊣)_a β implies that γ ⇒_R* βa ⊣_a β holds in G. By Lemma 3.7, γy ⇒_r* βay holds in G, so α ⇒_r* βay = βz also holds in G. □
Lemma 3.9 For α, β ∈ V* and z ∈ T*, if α (⇒_R* ⊣)_z ⇒_R* β holds in G, then α ⇒_r* βz holds in G.
Proof. By assumption, α (⇒_R* ⊣)_z ⇒_R* β holds in G. This implies that α (⇒_R* ⊣)_z γ ⇒_R* β holds in G for some γ ∈ V*. By Lemma 3.8, α ⇒_r* γz holds in G. Since γ ⇒_R* β holds in G, γz ⇒_r* βz holds in G by Lemma 3.7. Therefore, α ⇒_r* βz holds in G. □
Lemma 3.10 For α, β, γ ∈ V* and x, y ∈ T*, if α (⇒_R* ⊣)_x ⇒_R* β and β (⇒_R* ⊣)_y ⇒_R* γ hold in G, then α (⇒_R* ⊣)_{yx} ⇒_R* γ holds in G.
Proof. The key observation here is that the expression α (⇒_R* ⊣)_x ⇒_R* (⇒_R* ⊣)_y ⇒_R* γ may be rewritten as α (⇒_R* ⊣)_x (⇒_R* ⊣)_y ⇒_R* γ; to make this transformation, the occurrence of ⇒_R* preceding (⇒_R* ⊣)_y in the first expression is "absorbed" by (⇒_R* ⊣)_y if y ≠ ε, and by the occurrence of ⇒_R* preceding γ otherwise. It is now immediate that α (⇒_R* ⊣)_{yx} ⇒_R* γ holds in G. □
Lemma 3.11 For α ∈ V* and z ∈ T*, αz (⇒_R* ⊣)_z ⇒_R* α holds in G.
Proof. This is shown by an easy induction on n = len(z).
Basis (n = 0). Trivially, α (⇒_R* ⊣)^0_ε ⇒_R^0 α holds in G.
Induction (n > 0). Let z = ay for some a ∈ T and y ∈ T^{n-1}. By the induction hypothesis, αay (⇒_R* ⊣)_y ⇒_R* αa holds in G. Observing that αa ⇒_R^0 αa ⊣_a α ⇒_R^0 α holds in G establishes that αa (⇒_R* ⊣)_a ⇒_R* α also holds. It now follows from Lemma 3.10 that αz = αay (⇒_R* ⊣)_{ay} ⇒_R* α holds in G. □
Lemma 3.12 For α, β ∈ V*, let α ⇒_r* β hold in G. Furthermore, let β = γx for some γ ∈ V* and x ∈ T* where γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest suffix of β consisting solely of terminal symbols). Then α (⇒_R* ⊣)_x ⇒_R* γ holds in G.
Proof. The proof is by induction on the length n of a rightmost derivation of β from α.
Basis (n = 0). Thus, α = β. Write α as γx for some γ ∈ V* and x ∈ T* where x is the longest suffix of α contained in T*. In this case, α = γx (⇒_R* ⊣)_x ⇒_R* γ holds in G by Lemma 3.11.
Induction (n > 0). A rightmost derivation of β from α consisting of n steps is of the form α ⇒_r^{n-1} δAz ⇒_r δωz = β for some δ ∈ V*, A→ω ∈ P, and z ∈ T*. By the induction hypothesis, α (⇒_R* ⊣)_z ⇒_R* δA holds in G. Since δA ⇒_R δω holds in G, α (⇒_R* ⊣)_z ⇒_R* δω also holds. Now write δω as γy for some γ ∈ V* and y ∈ T* where y is the longest suffix of δω made up entirely of terminal symbols. By Lemma 3.11, δω = γy (⇒_R* ⊣)_y ⇒_R* γ holds in G. It then follows from Lemma 3.10 that α (⇒_R* ⊣)_{yz} ⇒_R* γ holds in G. Finally, we note that β = γyz where, by construction, yz is the longest suffix of β that is comprised of only terminal symbols. □
Theorem 3.13 SF_r(G) = {γ ∈ V* | S (⇒_R* ⊣)_z ⇒_R* α holds in G for some α ∈ V* and z ∈ T* such that γ = αz}.
Proof. Suppose that S (⇒_R* ⊣)_z ⇒_R* α holds in G for some α ∈ V* and z ∈ T*. By Lemma 3.9, S ⇒_r* αz also holds in G, so αz ∈ SF_r(G). Conversely, suppose that S ⇒_r* γ holds in G for some γ ∈ V*. Let γ = αz for some α ∈ V* and z ∈ T* such that z is the longest suffix of γ which is a terminal string. Then S (⇒_R* ⊣)_z ⇒_R* α holds in G by Lemma 3.12. □
Corollary L(G) = {w ∈ T* | S (⇒_R* ⊣)_w ⇒_R* ε holds in G}. □
Corollary SUFFIX(G) = {z ∈ T* | S (⇒_R* ⊣)_z α holds in G for some α ∈ V*}. □
Viable Prefixes
A concept that plays a central role in LR parsing theory is that of a viable prefix. Viable prefixes are also prominent in our treatment of general recognition and parsing. Viable prefixes are defined in terms of rightmost derivations and right sentential forms as follows.3 A string γ ∈ V* is a viable prefix of G if S ⇒_r* δAz ⇒_r δαβz = γβz holds in G for some δ ∈ V*, A→αβ ∈ P, and z ∈ T*. Thus, viable prefixes are certain prefixes of right sentential forms. The set of viable prefixes of G is denoted by VP(G).
In the next series of lemmas, a definition of the viable prefixes of G in terms of the R-derives and chop relations is developed. It transpires that this definition is remarkably similar to the definition of SF_r(G) just given. Since viable prefixes are defined via nontrivial rightmost derivations from S, our definition is carefully tailored to include S in VP(G) only in case S ⇒_r+ Sα holds in G for some α ∈ V*.
Lemma 3.14 For α, β ∈ V*, if α ⇒_R β holds in G and α is a viable prefix of G, then β is a viable prefix of G.
Proof. Since α ⇒_R β holds in G by assumption, α = γA and β = γω for some γ ∈ V* and A→ω ∈ P. Also by assumption, α ∈ VP(G), so S ⇒_r* δBz ⇒_r δστz = ατz holds in G for some δ ∈ V*, B→στ ∈ P with δσ = α, and z ∈ T*. Since G is reduced, τ ⇒_r* y holds in G for some y ∈ T*. Thus, S ⇒_r* ατz ⇒_r* αyz = γAyz ⇒_r γωyz = βyz holds in G, which shows that β is a viable prefix of G. □
Lemma 3.15 For α, β ∈ V*, if α ⇒_R* β holds in G and α is a viable prefix of G, then β is a viable prefix of G.
Proof. Applying the preceding lemma, this lemma is established by an easy induction on the length of a strong rightmost derivation of β from α. □
3 This definition is borrowed from Sippu and Soisalon-Soininen [38]. Although it differs slightly from others (cf. [5]), it is more appropriate to our needs.
Lemma 3.16 For α, β ∈ V*, if α ⊣ β holds in G and α is a viable prefix of G, then β is a viable prefix of G.
Proof. From the hypothesis, α = βa for some a ∈ T. Conventional definitions of viable prefixes [5] prescribe that every prefix of a viable prefix of G is also a viable prefix of G. However, this property is not immediate from the definition that we have adopted. A proof that this property does hold under our definition is provided by Sippu and Soisalon-Soininen [38]. The essence of their argument is based on the existence of a rightmost derivation of the form S ⇒_r* δAz ⇒_r δσaτz = βaτz for some δ ∈ V*, A→σaτ ∈ P, and z ∈ T*. This derivation form demonstrates that both βa = α and β are viable prefixes of G. □
Lemma 3.17 For γ ∈ V*, if ω (⇒_R* ⊣)_z γ holds in G for some S→ω ∈ P and z ∈ T*, then γ is a viable prefix of G.
Proof. The proof is by induction on n = len(z).
Basis (n = 0). In this case, z = ε. By assumption, ω (⇒_R* ⊣)^0_ε γ holds in G for some S→ω ∈ P. Then γ must equal ω, which is a viable prefix of G since S ⇒_r ω holds in G.
Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^{n-1}. Assume that ω (⇒_R* ⊣)^n_{ay} γ holds in G. Then ω (⇒_R* ⊣)^{n-1}_y β (⇒_R* ⊣)_a γ holds in G for some β ∈ V*. By the induction hypothesis, β ∈ VP(G). Now β (⇒_R* ⊣)_a γ implies that β ⇒_R* γa ⊣_a γ holds in G. It follows from Lemmas 3.15 and 3.16 that γa and γ are also viable prefixes of G. □
Lemma 3.18 For γ ∈ V*, if ω (⇒_R* ⊣)_z ⇒_R* γ holds in G for some S→ω ∈ P and z ∈ T*, then γ is a viable prefix of G.
Proof. Assume that ω (⇒_R* ⊣)_z ⇒_R* γ holds in G for some S→ω ∈ P and z ∈ T*. This implies that ω (⇒_R* ⊣)_z β ⇒_R* γ holds in G for some β ∈ V*. By Lemma 3.17, β is a viable prefix of G. Thus, γ is also in VP(G) by Lemma 3.15. □
Lemma 3.19 For γ ∈ V*, if γ is a viable prefix of G, then ω (⇒_R* ⊣)_z ⇒_R* γ holds in G for some S→ω ∈ P and z ∈ T*.
Proof. By assumption, γ ∈ VP(G). Thus, S ⇒_r* δAy ⇒_r δαβy = γβy holds in G for some δ ∈ V*, A→αβ ∈ P, and y ∈ T*. From the proof of Lemma 3.12, S (⇒_R* ⊣)_y ⇒_R* γβ holds in G. Since G is reduced, β ⇒_r* x holds in G for some x ∈ T*; hence β (⇒_R* ⊣)_x ⇒_R* ε holds in G by Lemma 3.12, and since ⇒_R and ⊣ operate only at the right end of a string, γβ (⇒_R* ⊣)_x ⇒_R* γ also holds in G. Combining these results in the manner of Lemma 3.10, S (⇒_R* ⊣)_z ⇒_R* γ holds in G, where z = xy. Since the nontrivial rightmost derivation of γβy from S must have a first step of the form S ⇒_r ω for some S→ω ∈ P, ω (⇒_R* ⊣)_z ⇒_R* γ holds in G. □
Theorem 3.20 VP(G) = {γ ∈ V* | ω (⇒_R* ⊣)_z ⇒_R* γ holds in G for some S→ω ∈ P and z ∈ T*}.
Proof. This theorem follows directly from Lemmas 3.18 and 3.19. □
Corollary VP(G) = {γ ∈ V* | S (⇒_R ∪ ⊣)+ γ holds in G}. □
One final observation is that VP(G) is closed under (⇒_R ∪ ⊣). Indeed, this is immediate from Lemmas 3.14 and 3.16. Due to its importance in general canonical top-down recognition, this property is formally recorded below.
Corollary For α, β ∈ V*, if α ∈ VP(G) and α (⇒_R ∪ ⊣)* β holds in G, then β ∈ VP(G). □
General Top-Down Correct-Suffix Recognition
Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G is described next. In this scheme, w is scanned from right to left. As a consequence, an incrementally longer suffix of w is recognized in the process.
The general recognition scheme effectively pursues all of the possible rightmost derivations of w in parallel. This is carried out through regularity-preserving operations on regular subsets of VP(G). Adoption of this approach obviates the need for backtracking.
General context-free recognition is an inherently nondeterministic task. Hence, it is not generally possible to pursue the rightmost derivations of w exclusively. Instead, at the point where a suffix z of w has been processed, all rightmost derivations (from S) of all strings in T*z ∩ L(G) are followed (i.e., of all sentences that have z as a suffix).
The essence of the recognition scheme, called General-RR, is simple. Let z ∈ T* be a suffix of w and suppose that all proper suffixes of z are known members of SUFFIX(G). The set of strings defined by {α ∈ VP(G) | S ⇒_r* αz holds in G} is used to determine if z is a member of SUFFIX(G). This set is nonempty if and only if z ∈ SUFFIX(G). Moreover, it contains ε if and only if z ∈ L(G). The General-RR recognition scheme is described in greater detail in what follows. For reference, the recognizer is presented as Figure 3.1.
function General-RR (G = (V, T, P, S); w ∈ T*)
// w = a_1 a_2 ··· a_n, n ≥ 0, each a_i ∈ T
PVP_RR(G, ε) := {ω | S→ω ∈ P}
for i := 0 to n-1 do
    VP_RR(G, w:i) := ⇒_R*(PVP_RR(G, w:i))
    PVP_RR(G, w:i+1) := ⊣_{a_{n-i}}(VP_RR(G, w:i))
    if PVP_RR(G, w:i+1) = ∅ then Reject(w) fi
od
VP_RR(G, w) := ⇒_R*(PVP_RR(G, w))
if ε ∈ VP_RR(G, w) then Accept(w) else Reject(w) fi
end
Figure 3.1 - A General Top-Down Correct-Suffix Recognizer
For an arbitrary string z ∈ T*, two sets of viable prefixes are identified with z. The first set consists of the primitive RR-associates of z (in G) and is defined by PVP_RR(G, z) = {α ∈ V* | ω (⇒_R* ⊣)_z α holds in G for some S→ω ∈ P}. The second set is a superset of the first; it consists of the RR-associates of z (in G) and is defined by VP_RR(G, z) = {α ∈ V* | ω (⇒_R* ⊣)_z ⇒_R* α holds in G for some S→ω ∈ P}. By Theorems 3.13 and 3.20, VP_RR(G, z) = {α ∈ VP(G) | S ⇒_r* αz holds in G}, which equates to the set described in the preceding paragraph. Input string w is recognized by computing PVP_RR(G, w:i) and VP_RR(G, w:i) in turn as i ranges from 0 to len(w).
In words, VP_RR(G, z) is the reflexive-transitive closure of PVP_RR(G, z) under the ⇒_R relation. This fact is made explicit by expressing VP_RR(G, z) as {β ∈ V* | α ⇒_R* β holds in G for some α ∈ PVP_RR(G, z)}. Thus, if PVP_RR(G, z) is known, VP_RR(G, z) is obtained from it through appropriate application of the ⇒_R relation.
The incremental aspect of General-RR becomes apparent in the computation of a set of primitive RR-associates. Specifically, given VP_RR(G, z) and a ∈ T, PVP_RR(G, az) is obtained by an application of the ⊣_a relation since PVP_RR(G, az) = {β ∈ V* | α ⊣_a β holds in G for some α ∈ VP_RR(G, z)}. It is apparent that PVP_RR(G, z) and VP_RR(G, z) are both nonempty if and only if z ∈ SUFFIX(G). The computation of the primitive RR-associates of ε, a suffix of every w ∈ T*, serves as the initialization step. Specifically, PVP_RR(G, ε) = {ω | S→ω ∈ P}.
Lastly, the conditions for termination of General-RR are specified. First suppose that w ∈ L(G). In this case, VP_RR(G, w) is the last set of RR-associates computed; after it is in place, w is accepted based on the fact that ε ∈ VP_RR(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ SUFFIX(G) also holds, then there is a unique string z ∈ T* which is the shortest suffix of w such that z ∉ SUFFIX(G) holds. In this case, PVP_RR(G, z) is the first empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ SUFFIX(G) both hold, then ε ∉ VP_RR(G, w) by definition. In either case, the input string is rejected.
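To make the scheme concrete, the following Python sketch is entirely our own: it restricts attention to grammars with no right-recursive nonterminal, so that the ⇒_R-closure of any finite set is itself finite, whereas the text manipulates regular, possibly infinite sets. Symbols are single characters and productions are (lhs, rhs) pairs; these encodings are illustrative assumptions.

```python
# A runnable sketch of General-RR, valid only when every R-derives closure is
# finite (no right-recursive nonterminal). The regular-language machinery of
# the dissertation is replaced by explicit finite sets.

def R_closure(productions, nonterminals, forms):
    """VP_RR from PVP_RR: close a finite set under R-derives."""
    closure, work = set(forms), list(forms)
    while work:
        form = work.pop()
        if form and form[-1] in nonterminals:       # domain of R-derives is V*N
            for lhs, rhs in productions:
                if lhs == form[-1] and form[:-1] + rhs not in closure:
                    closure.add(form[:-1] + rhs)
                    work.append(form[:-1] + rhs)
    return closure

def general_rr(productions, nonterminals, start, w):
    """Return True iff w is in L(G), scanning w from right to left."""
    pvp = {rhs for lhs, rhs in productions if lhs == start}   # PVP_RR(G, eps)
    for a in reversed(w):                                     # chop a_{n-i}
        vp = R_closure(productions, nonterminals, pvp)
        pvp = {form[:-1] for form in vp if form.endswith(a)}  # chop relation
        if not pvp:
            return False        # current suffix is not in SUFFIX(G)
    return "" in R_closure(productions, nonterminals, pvp)    # eps in VP_RR?

# S -> AB, A -> Aa | a, B -> b; L(G) = a+ b, and no nonterminal is
# right-recursive, so every closure computed above is finite.
grammar = [("S", "AB"), ("A", "Aa"), ("A", "a"), ("B", "b")]
```

Running the recognizer on "aab" chops b first, yielding PVP {A}, then chops the two a's through the closures {A, Aa, a}, and finally finds ε in the last closure, exactly as the loop of Figure 3.1 prescribes.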
The correctness of the General-RR recognition scheme is formally established in the following two lemmas. The supporting arguments are quite straightforward given the collective results to this point.
Lemma 3.21 Let w ∈ L(G) be arbitrary. If General-RR is applied to G and w, then General-RR accepts w.
Proof. By definition, PVP_RR(G, w:i) and VP_RR(G, w:i) are nonempty for all i, 0 ≤ i ≤ len(w), since every suffix of w is a member of SUFFIX(G). Thus, w is not rejected within the loop. Moreover, since w ∈ L(G), ε ∈ VP_RR(G, w) holds, so General-RR accepts w. □
Lemma 3.22 Let w ∉ L(G) be arbitrary. If General-RR is applied to G and w, then General-RR rejects w.
Proof. There are two cases to consider based on whether or not w is in SUFFIX(G).
Case (i): w ∈ SUFFIX(G). In this case, PVP_RR(G, w:i) and VP_RR(G, w:i) are nonempty for all i, 0 ≤ i ≤ len(w), so the loop runs to completion. However, since w ∉ L(G), ε ∉ VP_RR(G, w), and w is rejected in the final step.
Case (ii): w ∉ SUFFIX(G). Let z be the shortest suffix of w such that z ∉ SUFFIX(G). Then PVP_RR(G, z) is the first empty set computed by the recognizer, and w is rejected inside the loop. □
Certain regularity properties that are inherent to all context-free grammars are exploited by General-RR. Specifically, for an arbitrary string z ∈ T*, PVP_RR(G, z) and VP_RR(G, z) are regular languages. This fact is proven in this section. Toward that end, some known theoretical results, including one which is rather obscure, are cited below. Since proofs of these results are not replicated here, the proofs that follow are quite brief.
A type of formal rewriting system known as a regular canonical system is defined by C = (Σ, Π) where Σ is an alphabet and Π is a finite set of (rewriting) rules [21,30,37]. Each rule in Π takes the form ξα → ξβ where α, β ∈ Σ* and ξ denotes an arbitrary string over Σ, i.e., a variable. The form of a rule indicates that the left-hand side may be rewritten to its corresponding right-hand side only at the extreme right end of a string. Thus, much like R-derives, the C-derives relation induced on Σ* by Π is defined by ⇒_C = {(γα, γβ) | γ ∈ Σ*, ξα→ξβ ∈ Π}. Given two languages L1, L2 ⊆ Σ*, define r(L1, C, L2) by {δ ∈ Σ* | γ1 ⇒_C* γ2δ holds in C for some γ1 ∈ L1 and γ2 ∈ L2}.
A key result from the literature relevant to regular canonical systems is the following.
Fact 3.1 Let C = (Σ, Π) be a regular canonical system and let L1 and L2 be regular languages over Σ. Then r(L1, C, L2) is a regular language over Σ.
Proof. This is a restatement of Theorem 3 from Greibach [21]. □
The proof that PVP_RR(G, z) and VP_RR(G, z) are regular languages is based indirectly on proofs that ⇒_R and ⊣ are regularity-preserving relations. First, a relationship is established between context-free grammars and regular canonical systems. Specifically, for a grammar G = (V, T, P, S), the regular canonical system induced by G is defined by C = (V, Π_P) where Π_P = {ξA → ξω | A→ω ∈ P}.
Lemma 3.23 Relation ⇒_R is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar, C = (V, Π_P) the regular canonical system induced by G, and L an arbitrary regular language over V. By Fact 3.1, r(L, C, {ε}) = {δ ∈ V* | γ ⇒_C* δ holds in C for some γ ∈ L} is regular. Since the ⇒_R and ⇒_C relations are equivalent, ⇒_R*(L) = r(L, C, {ε}). Therefore, ⇒_R is regularity-preserving. □
Lemma 3.24 Relation ⊣ is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar and let L ⊆ V* be an arbitrary regular language. The quotient of a language L1 with respect to a language L2 is defined by L1/L2 = {x | xy ∈ L1 for some y ∈ L2}. Since the quotient of a regular language with respect to an arbitrary set is a regular language [24], for each a ∈ T, ⊣_a(L) = L/{a} is regular. Therefore, ⊣ is regularity-preserving. □
Theorem 3.25 Let G = (V, T, P, S) be an arbitrary grammar and let z ∈ T* be an arbitrary string. Then PVP_RR(G, z) and VP_RR(G, z) are regular languages.
Proof. By induction on len(z), this theorem follows from Lemmas 3.23 and 3.24 and the fact that PVP_RR(G, ε) = {ω | S→ω ∈ P} is regular. □
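The quotient construction behind Lemma 3.24 is easy to exhibit on a DFA: the transitions stay fixed, and the new final states are exactly those with an a-transition into an old final state. The dictionary encoding below is our own illustrative assumption.

```python
# Sketch of L/{a} on a DFA: keep the same states and transitions, and make
# final every state whose a-transition lands in an old final state.

def quotient_by_symbol(states, delta, finals, a):
    """Final-state set of a DFA accepting L/{a}; transitions are unchanged."""
    return {p for p in states if (p, a) in delta and delta[(p, a)] in finals}

# DFA accepting exactly {ab}: 0 -a-> 1 -b-> 2, with 2 final.
states = {0, 1, 2}
delta = {(0, "a"): 1, (1, "b"): 2}
```

With final states {2}, quotienting by b produces the new final set {1}, i.e., a DFA for {a}, which is precisely ⊣_b applied to the regular set {ab}.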
Top-Down Left-to-Right Recognition
In this section, a general top-down recognition scheme that presumes a left-to-right scan of the input string is formally developed. Toward that end, consider the two relations on V* defined by {(Aβ, ωβ) | A→ω ∈ P, β ∈ V*} and {(aβ, β) | a ∈ T, β ∈ V*}. Informally, these relations represent left-biased counterparts of ⇒_R and ⊣, respectively. Along the lines of General-RR, a general top-down correct-prefix recognizer can be based on these two relations. Specifically, leftmost derivations, left sentential forms, etc., can be defined in terms of these relations analogously to how rightmost derivations, right sentential forms, etc., are expressed in terms of ⇒_R and ⊣. However, an alternate approach is suggested by the following result.
Fact 3.2 For α, β ∈ V*, (1) α ⇒_r* β holds in G if and only if α^R ⇒_l* β^R holds in G^R; (2) α ⇒_l* β holds in G if and only if α^R ⇒_r* β^R holds in G^R.
Proof. A slightly stronger statement is presented by Sippu and Soisalon-Soininen as Fact 3.1 [38]. □
For future reference, some useful equivalences that are implied by Fact 3.2 include the following: (1) L(G^R) = (L(G))^R, (2) PREFIX(G^R) = (SUFFIX(G))^R, and (3) SUFFIX(G^R) = (PREFIX(G))^R.
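The first equivalence can be checked by brute force on a small grammar: G^R is obtained by reversing every right-hand side, and enumerating short sentences of both grammars shows one language to be the symbol-wise reversal of the other. The enumeration helper, the length bound, and the sample grammar are our own devices for the demonstration (the helper assumes an ε-free grammar so that sentential forms never shrink).

```python
# Brute-force check of L(G^R) = (L(G))^R on a tiny epsilon-free grammar.

def sentences(productions, nonterminals, start, max_len):
    """All terminal strings of length <= max_len derivable from `start`."""
    found, frontier = set(), {start}
    while frontier:
        nxt = set()
        for form in frontier:
            i = next((k for k, s in enumerate(form) if s in nonterminals), None)
            if i is None:
                found.add(form)           # no nonterminal left: a sentence
                continue
            for lhs, rhs in productions:
                if lhs == form[i]:
                    new = form[:i] + rhs + form[i + 1:]
                    if len(new) <= max_len:   # prune; lengths never shrink
                        nxt.add(new)
        frontier = nxt
    return found

g = [("S", "ab"), ("S", "aSb")]               # L(G) = { a^n b^n | n >= 1 }
gR = [(lhs, rhs[::-1]) for lhs, rhs in g]     # G^R: S -> ba | bSa
```

Up to length 6, G yields {ab, aabb, aaabbb} while G^R yields {ba, bbaa, bbbaaa}, the reversals.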
Fact 3.2 is exploited rather extensively in what follows. In particular, leftmost derivations in G, and ultimately general top-down correct-prefix recognition, are described in terms of strong rightmost derivations in G^R and the chop relation. Consequently, a substantial portion of the results derived in the previous section are useful here as well. This economizes on our efforts considerably.
Strong Rightmost Derivations in Reversed Grammars
The R-derives relation induced on V* by P^R is defined by ⇒_R = {(αA, αω) | α ∈ V*, A→ω ∈ P^R}. The relationship between strong rightmost derivations in G^R and leftmost derivations in G is the subject of the next series of lemmas.4
Lemma 3.26 For α, β ∈ V*, if α ⇒_R* β holds in G^R, then α^R ⇒_l* β^R holds in G.
Proof. By assumption, α ⇒_R* β holds in G^R. This implies that α ⇒_r* β holds in G^R by Lemma 3.1. It follows from Fact 3.2 that α^R ⇒_l* β^R holds in G. □
Lemma 3.27 For α, β ∈ V* and A ∈ N, if α ⇒_l* Aβ holds in G, then α^R ⇒_R* β^R A holds in G^R.
4 The chop relations relevant to G and G^R are identical.
Proof. Assume that α ⇒_l* Aβ holds in G. By Fact 3.2, α^R ⇒_r* (Aβ)^R = β^R A holds in G^R. Thus, α^R ⇒_R* β^R A also holds in G^R by Lemma 3.2. □
Lemma 3.28 For α ∈ V* and a ∈ T, if α ⇒_l* aβ holds in G for some β ∈ V*, then α^R ⇒_R* γa holds in G^R for some γ ∈ V*.
Proof. If α ⇒_l* aβ holds in G for some β ∈ V*, then α^R ⇒_r* (aβ)^R = β^R a holds in G^R by Fact 3.2. By Lemma 3.3, it follows that α^R ⇒_R* γa holds in G^R for some γ ∈ V*. □
Lemma 3.29 For A ∈ N and X ∈ V, X is left-reachable from A in G if and only if X is right-reachable from A in G^R.
Proof. Assume that A ⇒_l+ Xβ holds in G for some β ∈ V*. By Fact 3.2, A ⇒_r+ (Xβ)^R = β^R X holds in G^R, so A ⇒_R+ αX holds in G^R for some α ∈ V*. This latter conclusion follows from Lemma 3.2 if X ∈ N, and from Lemma 3.3 otherwise. Conversely, suppose that A ⇒_R+ αX holds in G^R for some α ∈ V*. It follows from Lemma 3.1 that A ⇒_r+ αX holds in G^R. By Fact 3.2, A ⇒_l+ (αX)^R = Xα^R holds in G. □
Corollary For A ∈ N, A is left-recursive in G if and only if A is right-recursive in G^R. □
Clearly, the nullability of vocabulary symbols is invariant with respect to grammar reversal. Thus, the following statements are equivalent for X ∈ V: (1) X is nullable in G; (2) X ⇒_R* ε holds in G; (3) X ⇒_R* ε holds in G^R. This observation is easily generalized to strings in V*.
Although Lemma 3.6 obviously applies to G^R, it is restated below in terms of G^R because of its importance in showing how the ⇒_R and ⊣ relations cooperate.
Lemma 3.30 For α ∈ V*, at least one of the following two statements is true: (1) α ⇒_R* βa holds in G^R for some β ∈ V* and a ∈ T; (2) α ⇒_R* ε holds in G^R. □
Left Sentential Forms Revisited
The left sentential forms and sentences of G are defined in terms of the R-derives and chop relations of G^R. Similar to rightmost derivations, a leftmost derivation in G is rendered as an alternation of strong rightmost derivations in G^R and rightmost chops of terminal symbols.
Lemma 3.31 For α, β ∈ V* and x ∈ T*, if α (⇒_R* ⊣)_x ⇒_R* β holds in G^R, then α^R ⇒_l* (βx)^R = x^R β^R holds in G.
Proof. By assumption, α (⇒_R* ⊣)_x ⇒_R* β holds in G^R. It follows from Lemma 3.9 that α ⇒_r* βx also holds in G^R. This implies that α^R ⇒_l* (βx)^R = x^R β^R holds in G by Fact 3.2. □
Lemma 3.32 For α, β ∈ V*, let α ⇒_l* β hold in G. Write β as xγ for some x ∈ T* and γ ∈ V* such that γ ∈ NV* if β ∈ T*NV* and γ = ε otherwise (i.e., x is the longest prefix of β that is made up of only terminal symbols). Then α^R (⇒_R* ⊣)_{x^R} ⇒_R* γ^R holds in G^R.
Proof. Assume that the conditions in the hypothesis of the lemma hold. From the assumption that α ⇒_l* β holds in G and Fact 3.2, α^R ⇒_r* β^R holds in G^R. Since β = xγ, β^R = (xγ)^R = γ^R x^R. Thus, x^R is the longest suffix of β^R that is made up of terminal symbols alone. We conclude from Lemma 3.12 that α^R (⇒_R* ⊣)_{x^R} ⇒_R* γ^R holds in G^R. □
Theorem 3.33 SF_l(G) = {γ ∈ V* | S (⇒_R* ⊣)_x ⇒_R* α holds in G^R for some α ∈ V* and x ∈ T* such that γ = (αx)^R}.
Proof. First suppose that S (⇒_R* ⊣)_x ⇒_R* α holds in G^R for some α ∈ V* and x ∈ T*. By Lemma 3.31, this implies that S ⇒_l* (αx)^R = x^R α^R holds in G, so (αx)^R is a left sentential form of G. Conversely, assume that S ⇒_l* γ holds in G for some γ ∈ V*. Let γ = x^R α^R = (αx)^R for some x ∈ T* and α ∈ V* such that x^R is the longest prefix of γ contained in T*. This implies, by Lemma 3.32, that S (⇒_R* ⊣)_x ⇒_R* α holds in G^R. □
Corollary L(G) = {w ∈ T* | S (=>_R* |)_{w^R}* =>_R* ε holds in G^R}. □
Corollary PREFIX(G) = {x ∈ T* | S (=>_R* |)_{x^R}* =>_R* α holds in G^R for some α ∈ V*}. □
Viable Suffixes
A top-down complement to the class of LR(k) grammars is the class of LL(k) grammars [28,36]. A theory of LL(k) parsing that is a dual to the theory of LR(k) parsing is developed by Sippu and Soisalon-Soininen [38]. In particular, the concept of a viable suffix is introduced as the LL dual to the viable prefix and plays a commensurately central role in the theory. Symmetrically to the definition of viable prefixes, viable suffixes are defined in terms of leftmost derivations and left sentential forms. A string γ ∈ V* is a viable suffix of G if S =>_lm* xAδ =>_lm xαβδ holds in G for some x ∈ T*, A→αβ ∈ P, and δ ∈ V*, and γ = (βδ)^R. Thus, viable suffixes are reversals of certain suffixes of left sentential forms. The set of viable suffixes of G is denoted by VS(G).
The next series of lemmas develops a definition of the viable suffixes of G in terms of the =>_R and | relations of G^R. In that regard, the following result is useful.
Fact 3.3 (1) A string γ ∈ V* is a viable prefix of G if and only if γ is a viable suffix of G^R; (2) a string γ ∈ V* is a viable suffix of G if and only if γ is a viable prefix of G^R.
Proof. This is presented by Sippu and Soisalon-Soininen as Fact 3.2 [38]. □
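Fact 3.3 reduces questions about viable suffixes to questions about viable prefixes of the reversed grammar. As an illustrative aside, membership in VP(G) can be decided with the standard nondeterministic automaton of dotted items, a construction borrowed from LR theory rather than part of the formal development here. The Python sketch below assumes a hypothetical grammar S → aSb | ab with single-character symbols; is_viable_prefix, eps_close, and is_viable_suffix are illustrative names.

```python
# Deciding VP(G) with the standard NFA of dotted items (every state
# accepting), then deciding VS(G) via Fact 3.3(2): a string is a
# viable suffix of G iff it is a viable prefix of G^R.

def eps_close(states, productions):
    """Close a set of items (A, rhs, dot) under prediction: if the
    dot precedes a nonterminal B, add the items B -> . omega."""
    work, closed = list(states), set(states)
    while work:
        A, rhs, d = work.pop()
        if d < len(rhs) and rhs[d] in productions:
            for alt in productions[rhs[d]]:
                item = (rhs[d], alt, 0)
                if item not in closed:
                    closed.add(item); work.append(item)
    return closed

def is_viable_prefix(gamma, productions, start='S'):
    """gamma is in VP(G) iff some item is reachable from the initial
    items S -> . omega by reading gamma symbol by symbol."""
    states = eps_close({(start, rhs, 0) for rhs in productions[start]},
                       productions)
    for X in gamma:
        states = eps_close(
            {(A, rhs, d + 1) for (A, rhs, d) in states
             if d < len(rhs) and rhs[d] == X}, productions)
        if not states:
            return False
    return True

def is_viable_suffix(gamma, productions, start='S'):
    # Fact 3.3(2): gamma in VS(G) iff gamma in VP(G^R).
    rev = {A: [rhs[::-1] for rhs in alts]
           for A, alts in productions.items()}
    return is_viable_prefix(gamma, rev, start)

G = {'S': ['aSb', 'ab']}
print(is_viable_prefix('aaS', G))   # True
print(is_viable_prefix('Sb', G))    # False
print(is_viable_suffix('bbS', G))   # True
```

For example, bbS is a viable suffix of this grammar because S =>_lm aSb =>_lm a aSb b, and bbS is the reversal of the suffix Sbb of the left sentential form aaSbb.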
Lemma 3.34 For α, β ∈ V*, if α is a viable suffix of G and α =>_R β holds in G^R, then β is a viable suffix of G.
Proof. If α is a viable suffix of G, then α is a viable prefix of G^R. Since α =>_R β holds in G^R, β is a viable prefix of G^R as well. Therefore, β is a viable suffix of G. □
Lemma 3.35 For α, β ∈ V*, if α is a viable suffix of G and α =>_R* β holds in G^R, then β is a viable suffix of G.
Proof. This is a consequence of Lemmas 3.15 and 3.34. □
Lemma 3.36 For α, β ∈ V*, if α is a viable suffix of G and α | β holds in G^R, then β is a viable suffix of G.
Proof. Using Fact 3.3, the proof of this lemma parallels that of Lemma 3.16. □
Lemma 3.37 For γ ∈ V*, if ω (=>_R* |)_x* =>_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*, then γ is a viable suffix of G.
Proof. Assume that ω (=>_R* |)_x* =>_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*. By Lemma 3.18, this implies that γ is a viable prefix of G^R. Thus, γ ∈ VS(G) by Fact 3.3. □
Lemma 3.38 For γ ∈ V*, if γ is a viable suffix of G, then ω (=>_R* |)_x* =>_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*.
Proof. Assume that γ is a viable suffix of G. By Fact 3.3, γ is also a viable prefix of G^R. Thus, ω (=>_R* |)_x* =>_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T* by Lemma 3.19. □
Theorem 3.39 VS(G) = {γ ∈ V* | ω (=>_R* |)_x* =>_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*}.
Proof. This theorem combines Lemmas 3.37 and 3.38. □
Corollary VS(G) = {γ ∈ V* | S (=>_R ∪ |)+ γ holds in G^R}. □
Corollary For α, β ∈ V*, if α ∈ VS(G) and α (=>_R ∪ |)* β holds in G^R, then β ∈ VS(G). □
General Top-Down Correct-Prefix Recognition
Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G that is a left-to-right analog of General_RR is described next. This scheme, called General_LL, scans w from left to right as it recognizes an incrementally longer prefix of the input string. General_LL effectively pursues all of the leftmost derivations of w in parallel through regularity-preserving operations on regular subsets of VS(G).
Again, the inherent nondeterminism of general context-free recognition subverts any attempt to follow exclusively the leftmost derivations of w. Instead, at the point where a prefix x of w has been processed, all leftmost derivations (from S) of all strings in xT* ∩ L(G) are followed (i.e., of all sentences that have x as a prefix).
The essence of General_LL mirrors that of General_RR. Let x ∈ T* be a prefix of w. Suppose that all proper prefixes of x are members of PREFIX(G). The set of strings defined by {β ∈ VS(G) | S =>_lm* x β^R holds in G} determines whether x ∈ PREFIX(G) holds. This set is nonempty if and only if x ∈ PREFIX(G), and it contains ε if and only if x ∈ L(G). General_LL, shown in Figure 3.2, is described in greater detail in what follows.
For arbitrary x ∈ T*, two sets of viable suffixes are identified with x. The first set, the primitive LL-associates of x (in G), is defined by PVS_LL(G, x) = {β ∈ V* | ω (=>_R* |)_{x^R}* β holds in G^R for some S→ω ∈ P^R}. The other set contains the LL-associates of x (in G) and is
function General_LL (G^R = (V, T, P^R, S); w ∈ T*)
// w = a1 a2 ... an, n ≥ 0, each ai ∈ T
PVS_LL(G, ε) := {ω | S→ω ∈ P^R}
for i := 0 to n-1 do
    VS_LL(G, i:w) := =>_R*(PVS_LL(G, i:w))
    PVS_LL(G, i+1:w) := |_{a_{i+1}}(VS_LL(G, i:w))
    if PVS_LL(G, i+1:w) = ∅ then Reject(w) fi
od
VS_LL(G, w) := =>_R*(PVS_LL(G, w))
if ε ∈ VS_LL(G, w) then Accept(w) else Reject(w) fi
end
Figure 3.2 - A General Top-Down Correct-Prefix Recognizer
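To make Figure 3.2 concrete, the following Python sketch runs General_LL with the LL-associate sets represented as explicit finite sets. This representation only works when the =>_R closure in G^R stays finite; it does for the hypothetical grammar G: S → aSb | ab, because every production of G begins with a terminal and hence every right-hand side of G^R ends with one. In general VS_LL(G, x) is an infinite regular set (Theorem 3.42) and must be represented by automata. r_closure and general_ll are illustrative names.

```python
# A finite-set simulation of General_LL (Figure 3.2).
# G: S -> aSb | ab, so G^R has the productions S -> bSa | ba.

def r_closure(strings, productions_R):
    """Close under =>_R of G^R: a string whose last symbol is a
    nonterminal A is expanded by every production A -> omega."""
    work, closed = list(strings), set(strings)
    while work:
        gamma = work.pop()
        if gamma and gamma[-1] in productions_R:
            for omega in productions_R[gamma[-1]]:
                new = gamma[:-1] + omega
                if new not in closed:
                    closed.add(new); work.append(new)
    return closed

def general_ll(w, productions_R, start='S'):
    pvs = set(productions_R[start])              # PVS(G, epsilon)
    for a in w:
        vs = r_closure(pvs, productions_R)       # apply =>_R*
        pvs = {g[:-1] for g in vs if g.endswith(a)}   # chop |_a
        if not pvs:
            return False                          # Reject(w)
    return '' in r_closure(pvs, productions_R)    # epsilon test

PR = {'S': ['bSa', 'ba']}                         # productions of G^R
print(general_ll('aabb', PR))   # True
print(general_ll('abab', PR))   # False
```

Note how the run mirrors the figure exactly: a closure under =>_R of G^R, a chop of the next input symbol, an emptiness test, and a final test for ε.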
defined by VS_LL(G, x) = {β ∈ V* | ω (=>_R* |)_{x^R}* =>_R* β holds in G^R for some S→ω ∈ P^R}. By Theorems 3.33 and 3.39, VS_LL(G, x) = {β ∈ VS(G) | S =>_lm* x β^R holds in G}, which is precisely the set described in the previous paragraph. Input string w is recognized by computing PVS_LL(G, i:w) and VS_LL(G, i:w) as i ranges from 0 to len(w).
The set VS_LL(G, x) is equivalently expressed as {β ∈ V* | α =>_R* β holds in G^R for some α ∈ PVS_LL(G, x)}; this form explicitly reflects that VS_LL(G, x) is the image of PVS_LL(G, x) under the reflexive-transitive closure of the =>_R relation. Thus, VS_LL(G, x) is computed by applying =>_R* to PVS_LL(G, x).
Given VS_LL(G, x) and a ∈ T, PVS_LL(G, xa) is determined from VS_LL(G, x) through an application of the |_a relation, since PVS_LL(G, xa) = {β ∈ V* | α |_a β holds in G^R for some α ∈ VS_LL(G, x)}. Clearly, PVS_LL(G, x) and VS_LL(G, x) are both nonempty if and only if x ∈ PREFIX(G). The initialization step entails computing the primitive LL-associates of ε, i.e., PVS_LL(G, ε) = {ω | S→ω ∈ P^R}.
The conditions under which General_LL terminates are analogous to those of General_RR. If w ∈ L(G), then VS_LL(G, w) is the last set of LL-associates computed; after it is known, w is accepted since ε ∈ VS_LL(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, then there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVS_LL(G, x) is the first empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ PREFIX(G) both hold, then VS_LL(G, w) is found not to contain ε. In either case, w is rejected.
The correctness of the General_LL recognition scheme is formally established in the following two lemmas.
Lemma 3.40 Let w ∈ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL accepts w.
Proof. Since every prefix of w is in PREFIX(G), PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). Moreover, ε ∈ VS_LL(G, w) since w ∈ L(G). Therefore, General_LL accepts w. □
Lemma 3.41 Let w ∉ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL rejects w.
Proof. There are two cases to consider depending on whether or not w ∈ PREFIX(G). Case (i): w ∈ PREFIX(G). In this case, PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). However, since w ∉ L(G), VS_LL(G, w) does not contain ε, so w is rejected.
Case (ii): w ∉ PREFIX(G). Let x ∈ T* be the unique string which is the longest prefix of w such that x ∈ PREFIX(G) holds. Let len(x) = m and note that 0 ≤ m < len(w). Then PVS_LL(G, (m+1):w) is the first empty set computed by the recognizer, so w is rejected. □
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LL are identified in the following.
Theorem 3.42 Let G = (V, T, P, S) be an arbitrary grammar and x an arbitrary string over T. Then PVS_LL(G, x) and VS_LL(G, x) are regular languages.
Proof. The proof is by induction on len(x) = n. In particular, we show that PVS_LL(G, x) = PVP_RR(G^R, x^R) and VS_LL(G, x) = VP_RR(G^R, x^R). The proof is mostly an exercise in recalling definitions and putting them in the appropriate form.
Basis (n = 0). The following two equalities are obvious: (1) PVS_LL(G, ε) = {ω ∈ V* | S→ω ∈ P^R} = PVP_RR(G^R, ε); (2) VS_LL(G, ε) = {β ∈ V* | α =>_R* β holds in G^R for some α ∈ PVS_LL(G, ε)} = {β ∈ V* | α =>_R* β holds in G^R for some α ∈ PVP_RR(G^R, ε)} = VP_RR(G^R, ε).
Induction (n > 0). Let x = ya for some y ∈ T^{n-1} and a ∈ T. By the induction hypothesis, PVS_LL(G, y) = PVP_RR(G^R, y^R) and VS_LL(G, y) = VP_RR(G^R, y^R). Hence, PVS_LL(G, ya) = {β ∈ V* | α |_a β holds in G^R for some α ∈ VS_LL(G, y)} = {β ∈ V* | α |_a β holds in G^R for some α ∈ VP_RR(G^R, y^R)} = PVP_RR(G^R, a y^R) = PVP_RR(G^R, (ya)^R). Finally, VS_LL(G, ya) = {β ∈ V* | α =>_R* β holds in G^R for some α ∈ PVS_LL(G, ya)} = {β ∈ V* | α =>_R* β holds in G^R for some α ∈ PVP_RR(G^R, (ya)^R)} = VP_RR(G^R, (ya)^R). From Theorem 3.25, we conclude that PVS_LL(G, x) and VS_LL(G, x) are regular languages. □
Discussion
A simple framework for describing general canonical top-down recognition was presented. The set-theoretic framework is based on two relations on strings, =>_R and |. A key property of both of these relations is that they preserve regularity. The essence of general top-down recognition was captured in terms of computing the images of regular sets under these relations.
The definitions of the various objects of importance in the framework, namely sentences, suffixes and prefixes of sentences, right and left sentential forms, etc., were cast in terms of the =>_R and | relations. Consequently, it is a small step from these definitions to the recognition schemes that are based on them. In addition, the correctness of the recognizers is particularly easy to establish.
Given the impracticality of scanning input strings from right to left, it is worth reflecting on why strong rightmost derivations were chosen over strong leftmost derivations as a point of origin. If General_LL had been developed first, the evolution from General_LL to General_RR certainly would have been no more involved than the progression in the other direction. However, strong rightmost derivations were favored from the outset because viable prefixes are considerably more ingrained in the literature than are viable suffixes.5 In addition, the bottom-up left-to-right counterpart to General_RR that is developed in the next chapter is derived directly from General_RR. Considerable attention is devoted to this derivative of the General_RR recognition scheme in the rest of this work.
5 To date, we have yet to find a reference to Sippu and Soisalon-Soininen [38] in the literature.
CHAPTER IV
GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general bottom-up recognition is developed next. In particular, a general bottom-up recognition scheme that scans input strings from left to right is presented. The bottom-up left-to-right character of the recognition scheme, called General_LR, intimates that it is an inverse of General_RR. Indeed, General_LR is directly derived from General_RR through inverses of the R-derives and chop relations. Consequently, General_LR also exploits certain regularity properties of context-free grammars.
In keeping with Chapter III, some formal aspects of general bottom-up recognition are examined in a set-theoretic framework. Later chapters affect a less abstract character; specifically, General_LR is cast into concrete terms, viz., state-transition graphs and finite-state automata. Ultimately, a general bottom-up parser based on General_LR is described. An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Bottom-Up Left-to-Right Recognition
In a bottom-up approach to recognition, an attempt is made to construct a parse tree for an input string, perhaps implicitly, by starting from the leaves and working toward the root. A basic step in the upward synthesis of a parse tree involves grafting together the roots of one or more subtrees into a larger subtree. Suppose that the collection of these subtrees is represented by the string of grammar symbols which label their roots. A grafting operation may be described in terms of applying the inverse of the => relation to this linearized form of the partially constructed parse tree. That is, the occurrence of a production right-hand side in this string is replaced by (or reduced to) the corresponding left-hand side nonterminal symbol; this symbol labels the root of the subtree produced by the grafting operation. By performing reductions according to the inverse of the =>_rm relation instead, a canonical left-to-right order is imposed on the parse tree construction process.
However, an alternative to the inverse of the =>_rm relation is provided by inverses of the =>_R and | relations. The inverse of =>_R is used to represent reversed strong rightmost derivations. The inverse of | introduces terminal symbols at the right end of strings. These two inverse relations cooperate to mimic reversed rightmost derivations.
Reversed Rightmost Derivations
The reduce relation (|=) is the inverse of the R-derives relation, i.e., |= = (=>_R)^{-1}; it is formally defined by |= = {(αω, αA) | α ∈ V*, A→ω ∈ P}. The shift relation (-|) is the inverse of the chop relation, i.e., -| = |^{-1}; thus, -| = {(α, αa) | α ∈ V*, a ∈ T}. For each a ∈ T, -|_a denotes the subrelation of -| with range V*a. More specifically, for α, β ∈ V* and a ∈ T, α -|_a β if and only if α -| β and β = αa.
For the most part, the results in this chapter are obtained through simple manipulations of relational expressions. Two equalities on relational expressions that are regularly used in these transformations are recorded in the following.
Fact 4.1 Let R and S be binary relations on V*, i.e., R, S ⊆ V* × V*. Then the following two statements hold: (1) (R*)^{-1} = (R^{-1})*; (2) (R S)^{-1} = S^{-1} R^{-1}. □
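Fact 4.1(2) can be sanity-checked on small finite relations. The sketch below, with hypothetical helper names compose and inverse, is only a finite illustration, not a proof.

```python
# Finite check of Fact 4.1(2): the inverse of a relation product is
# the product of the inverses taken in the opposite order.
def compose(R, S):
    # relation product R S: first relate by R, then by S
    return {(a, c) for (a, b1) in R for (b2, c) in S if b1 == b2}

def inverse(R):
    return {(b, a) for (a, b) in R}

R = {(1, 2), (2, 3)}
S = {(2, 4), (3, 5)}
assert inverse(compose(R, S)) == compose(inverse(S), inverse(R))
assert inverse(inverse(R)) == R
print('Fact 4.1(2) holds on this example')
```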
Some useful applications of Fact 4.1 include the following.
(2) (=>_R* |)^{-1} = |^{-1} (=>_R*)^{-1} = -| |=*;
(3) ((=>_R* |)* =>_R*)^{-1} = (=>_R*)^{-1} ((=>_R* |)^{-1})* = |=* (-| |=*)*.
Despite the appearance of (-| |=*) in the last construct of both (2) and (3), the relation product (|=* -|) is more appropriate to our needs. Indeed, since relation composition is associative, the following equivalence holds: |=* (-| |=*)* = (|=* -|)* |=*.
The interpretation of the relation product (|=* -|) is explicitly described as follows. For α, β ∈ V*, α (|=* -|) β holds in G if and only if α |=* γ and γ -|_a β hold in G for some γ ∈ V* and a ∈ T. This is expressed more neatly as α (|=* -|_a) β. The notation relevant to the reflexive-transitive closure of this product is as follows. For all α ∈ V*, α (|=* -|)_ε^0 α holds in G; for α, β, γ ∈ V*, x ∈ T^{n-1} with n ≥ 1, and a ∈ T, if α (|=* -|)_x^{n-1} β and β (|=* -|_a) γ hold in G, then α (|=* -|)_{xa}^n γ holds in G. If α (|=* -|)_x^n β holds in G for some α, β ∈ V* and x ∈ T^n, n ≥ 0, any of the expressions α (|=* -|)_x* β, α (|=* -|)* β, or α (|=* -|)^n β may be used to denote this if the string x or its length n is not relevant.
The following lemma compares relational expressions involving the =*R and I relations with relational expressions involving the = and  relations.
Lemma 4.1 For α, β ∈ V* and x ∈ T*, α (=>_R* |)_x* =>_R* β holds in G if and only if β (|=* -|)_x* |=* α holds in G.
Proof. First suppose that x = ε. By definition, both (=>_R* |)^0 and (|=* -|)^0 are equivalent to the identity relation on V*. Thus, the following statements are equivalent.
(1) α (=>_R* |)_ε^0 =>_R* β;
(2) α =>_R* β;
(3) β (=>_R*)^{-1} α;
(4) β |=* α; and
(5) β (|=* -|)_ε^0 |=* α.
Now let x = a1 a2 ... an, n ≥ 1. The following statements are equivalent in this case.
(1) α (=>_R* |)_x^n =>_R* β;
(2) α (=>_R* |_{an}) ... (=>_R* |_{a1}) =>_R* β;
(3) β ((=>_R* |_{an}) ... (=>_R* |_{a1}) =>_R*)^{-1} α;
(4) β (=>_R*)^{-1} (|_{a1})^{-1} (=>_R*)^{-1} ... (|_{an})^{-1} (=>_R*)^{-1} α;
(5) β |=* (-|_{a1} |=*) ... (-|_{an} |=*) α;
(6) β (|=* -|_{a1}) ... (|=* -|_{an}) |=* α; and
(7) β (|=* -|)_x^n |=* α. □
The next two lemmas demonstrate how reversed rightmost derivations are represented by the = and  relations.
Lemma 4.2 For α, β ∈ V* and x ∈ T*, if α (|=* -|)_x* |=* β holds in G, then β =>_rm* αx holds in G.
Proof. By Lemma 4.1, the hypothesis implies that β (=>_R* |)_x* =>_R* α holds in G. It follows from Lemma 3.9 that β =>_rm* αx holds in G. □
Lemma 4.3 For α, β ∈ V*, let α =>_rm* β hold in G. Furthermore, let β = γx for some γ ∈ V* and x ∈ T* such that γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest suffix of β consisting solely of terminal symbols). Then γ (|=* -|)_x* |=* α holds in G.
Proof. The hypothesis and its conditions imply that α (=>_R* |)_x* =>_R* γ holds in G (see Lemma 3.12). Therefore, γ (|=* -|)_x* |=* α holds in G by Lemma 4.1. □
Lemma 4.4 L(G) = {w ∈ T* | ε (|=* -|)_w* |=* S holds in G}.
Proof. This is a consequence of Lemmas 4.2 and 4.3. □
The following connection is established between PREFIX(G) and the |= and -| relations.
Lemma 4.5 PREFIX(G) ⊆ {x ∈ T* | ε (|=* -|)_x* |=* α holds in G for some α ∈ V*}.
Proof. Let x ∈ PREFIX(G) be arbitrary. The corollaries to Theorem 3.13, together with the assumption that G is reduced, yield that β (=>_R* |)_x* =>_R* ε holds in G for some β ∈ V*. By Lemma 4.1, ε (|=* -|)_x* |=* β also holds in G. Hence, ε (|=* -|)_x* |=* α holds in G for some α ∈ V*. □
The set inclusion of the preceding lemma is almost invariably proper. For example, consider the grammar with production set P = {S→a}. Although this grammar generates only {a}, ε (|=* -|)_{a^i}* a^i holds for all i ≥ 0. In fact, equality holds in Lemma 4.5 only for grammars which have an empty terminal alphabet.
Viable Prefixes Revisited
Lemma 4.5 suggests that the reduce and shift relations, as defined, are inadequate as a basis for general bottom-up correct-prefix recognition. Indeed, the source of their deficiency is revealed when they are examined under the guise of viable prefixes.
First, recall that VP(G) is closed with respect to =>_R and |. Formally, a string α ∈ V* is a viable prefix of G if and only if ω (=>_R ∪ |)* α holds in G for some S→ω ∈ P. The complementary situation that exists with respect to the |= and -| relations is investigated in the next series of lemmas.
Lemma 4.6 For α, β ∈ V*, if α |= β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The contrapositive of this implication is proven, so we assume that β ∈ VP(G). Since α |= β holds in G, β =>_R α also holds. By Lemma 3.14, this implies that α ∈ VP(G). □
Corollary For α, β ∈ V*, if α |= β holds in G and β ∈ VP(G), then α ∈ VP(G). □
Lemma 4.7 For α, β ∈ V*, if α -| β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The proof is similar to that of Lemma 4.6. Lemma 3.16 is relevant in this case. □
Corollary For α, β ∈ V*, if α -| β holds in G and β ∈ VP(G), then α ∈ VP(G). □
Lemma 4.8 For α, β ∈ V*, if α (|= ∪ -|)* β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. Since α (|= ∪ -|)* β holds in G by assumption, α (|= ∪ -|)^n β holds for some n ≥ 0. Applying Lemmas 4.6 and 4.7, this lemma is proven by induction on n. □
Corollary For α, β ∈ V*, if α (|= ∪ -|)* β holds in G and β ∈ VP(G), then α ∈ VP(G). □
By Lemma 4.8, V*\VP(G) is closed with respect to (|= ∪ -|). The implication of this complementary closure property for Lemma 4.1 is addressed in the following.
Lemma 4.9 For α, β ∈ V* and x ∈ T*, if α ∈ VP(G) and α (=>_R* |)_x* =>_R* β holds in G, then β (|=* -|)_x* |=* α holds in G when |= and -| are restricted to VP(G).
Proof. By assumption, α is a viable prefix of G and α (=>_R* |)_x* =>_R* β holds in G. From Lemma 4.1, β (|=* -|)_x* |=* α also holds in G. That this latter expression holds when |= and -| are restricted to VP(G) follows from Lemma 4.8 and its corollary. □
Our immediate goal is to describe general bottom-up left-to-right recognition as the inverse of general top-down right-to-left recognition, with the viable prefix being the central unifying concept. From that standpoint, it is undesirable for the reduce and shift relations to stray outside of VP(G). Consequently, these two relations are redefined to restrict them explicitly to VP(G) as follows: |= = {(αω, αA) | α ∈ V*, A→ω ∈ P, αA ∈ VP(G)} and -| = {(α, αa) | α ∈ V*, a ∈ T, αa ∈ VP(G)}. From the closure result of Lemma 4.8, restricting the ranges of these two relations to VP(G) effectively restricts their domains to VP(G) as well. Henceforth, these new restricted versions of |= and -| are in effect at all times.
Lemma 4.10 VP(G) = {α ∈ V* | ε (|=* -|)_x* |=* α holds in G for some x ∈ T*}.
Proof. Since the |= and -| relations are restricted to VP(G), it is clear that any string α ∈ V* such that ε (|=* -|)_x* |=* α holds in G for some x ∈ T* is a viable prefix of G. In order to show that every viable prefix of G is similarly produced, let α be an arbitrary member of VP(G). From Theorem 3.20, ω (=>_R* |)_z* =>_R* α holds in G for some S→ω ∈ P and z ∈ T*. Since G is reduced, α (=>_R* |)_x* =>_R* ε holds in G for some x ∈ T* (implying that xz ∈ L(G)). It follows from Lemma 4.9 that ε (|=* -|)_x* |=* α holds in G. □
Corollary L(G) = {w ∈ T* | ε (|=* -|)_w* |=* ω holds in G for some S→ω ∈ P}. □
Corollary PREFIX(G) = {x ∈ T* | ε (|=* -|)_x* |=* α holds in G for some α ∈ V*}. □
Finally, the following lemma motivates, ex post facto, the relation product (|=* -|).
Lemma 4.11 For α ∈ VP(G), at least one of the following two statements is true: (1) α (|=* -|_a) β holds in G for some β ∈ V* and a ∈ T; (2) α |=* ω holds in G for some S→ω ∈ P.
Proof. By Theorem 3.20, ω (=>_R* |)_z* =>_R* α holds in G for some S→ω ∈ P and z ∈ T*. By Lemma 4.9, α (|=* -|)_z* |=* ω also holds in G. If z = ε, then α (|=* -|)^0 |=* ω holds, which demonstrates that statement (2) is true. Otherwise, z = ay for some a ∈ T and y ∈ T*. In this case, α (|=* -|_a) β and β (|=* -|)_y* |=* ω hold in G for some β ∈ V*, so statement (1) is true. □
General Bottom-Up Correct-Prefix Recognition
Now that |= and -| are defined as inverses, albeit restricted ones, of =>_R and |, respectively, the transition from General_RR to General_LR is completed by also inverting the direction in which an input string w ∈ T* is scanned. Accordingly, the essence of General_LR is that all of the reversed rightmost derivations of w ∈ T* are followed in parallel.
Once again, there are theoretical limits on the precision to which this task may be carried out; that is, it is not possible to pursue exclusively the reversed rightmost derivations of w in the general case. Instead, at the point where a prefix x of w has been processed, all reversed rightmost derivations (from ε) of all strings in xT* ∩ L(G) are followed (i.e., of all sentences that have x as a prefix).
As in the top-down recognition schemes, regularity-preserving operations on regular subsets of VP(G) are the key to General_LR. Correct-prefix recognition is performed, i.e., the membership in PREFIX(G) of an incrementally longer prefix of w is ascertained as w is scanned from left to right. Given a prefix x of w, the inclusion of x in PREFIX(G) is determined from the set {α ∈ VP(G) | α =>_rm* x holds in G}. This set is nonempty if and only if x ∈ PREFIX(G), and it contains ω for some S→ω ∈ P if and only if x ∈ L(G). Figure 4.1 presents a high-level description of General_LR; a more detailed discussion follows.
function General_LR (G = (V, T, P, S); w ∈ T*)
// w = a1 a2 ... an, n ≥ 0, each ai ∈ T
PVP_LR(G, ε) := {ε}
for i := 0 to n-1 do
    VP_LR(G, i:w) := |=*(PVP_LR(G, i:w))
    PVP_LR(G, i+1:w) := -|_{a_{i+1}}(VP_LR(G, i:w))
    if PVP_LR(G, i+1:w) = ∅ then Reject(w) fi
od
VP_LR(G, w) := |=*(PVP_LR(G, w))
if ω ∈ VP_LR(G, w) for some S→ω ∈ P then Accept(w) else Reject(w) fi
end
Figure 4.1 - A General Bottom-Up Correct-Prefix Recognizer
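The restriction of |= and -| to VP(G) is what gives General_LR its correct-prefix property, but representing VP(G) requires the automata developed later. As a rough illustration of the underlying mechanics, the Python sketch below implements the unrestricted characterization of Lemma 4.4 instead: it closes a finite set of strings under the unrestricted reduce relation, shifts each input symbol, and accepts if S is produced at the end. It therefore cannot reject early (cf. Lemma 4.5), and termination of the reduce closure is guaranteed only for ε-free grammars. The grammar S → aSb | ab and the names reduce_closure and recognize are hypothetical.

```python
# A naive recognizer based on Lemma 4.4, using the UNRESTRICTED
# reduce and shift relations. Strings over V* are encoded as Python
# strings of single-character symbols.

def reduce_closure(strings, productions):
    """Close a finite set of strings under reduce: alpha.omega |=
    alpha.A whenever A -> omega is a production (omega nonempty)."""
    work, closed = list(strings), set(strings)
    while work:
        gamma = work.pop()
        for lhs, rhs in productions:
            if rhs and gamma.endswith(rhs):   # epsilon-free: rhs != ''
                new = gamma[:len(gamma) - len(rhs)] + lhs
                if new not in closed:
                    closed.add(new); work.append(new)
    return closed

def recognize(w, productions, start='S'):
    current = reduce_closure({''}, productions)   # start from epsilon
    for a in w:
        current = {gamma + a for gamma in current}        # shift -|_a
        current = reduce_closure(current, productions)    # apply |=*
    return start in current            # epsilon (|=* -|)_w* |=* S ?

P = [('S', 'aSb'), ('S', 'ab')]
print(recognize('aabb', P))   # True:  aabb is in {a^n b^n | n >= 1}
print(recognize('aab', P))    # False
```

The closure terminates because, for an ε-free grammar, reductions never lengthen a string, so only finitely many strings of bounded length can appear.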
Let x ∈ T* be an arbitrary string. The primitive LR-associates of x (in G) are defined by PVP_LR(G, x) = {α ∈ VP(G) | ε (|=* -|)_x* α holds in G}. Clearly, PVP_LR(G, ε) = {ε}. The LR-associates of x (in G) are defined by VP_LR(G, x) = {α ∈ VP(G) | ε (|=* -|)_x* |=* α holds in G}. By Lemma 4.2, this set is equivalent to {α ∈ VP(G) | α =>_rm* x holds in G}.
An input string w ∈ T* is recognized by General_LR through the computation of PVP_LR(G, i:w) and VP_LR(G, i:w) as i ranges from 0 to len(w). The process terminates when either an empty set is produced or the input string is exhausted. Analogous to the top-down recognition schemes, the relationships between VP_LR(G, x) and PVP_LR(G, x), and between PVP_LR(G, xa) and VP_LR(G, x), are significant. Specifically, for x ∈ T* and a ∈ T, VP_LR(G, x) = {β ∈ VP(G) | α |=* β holds in G for some α ∈ PVP_LR(G, x)} = |=*(PVP_LR(G, x)) and PVP_LR(G, xa) = {β ∈ VP(G) | α -|_a β holds in G for some α ∈ VP_LR(G, x)} = -|_a(VP_LR(G, x)).
The conditions for termination are analogous to those for General_RR and General_LL. Given an input string w ∈ T*, first suppose that w ∈ L(G). In this case, VP_LR(G, w) is the last set of LR-associates computed by General_LR; after it is completed, w is accepted based on the fact that ω ∈ VP_LR(G, w) for some S→ω ∈ P if and only if w ∈ L(G). Alternatively, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVP_LR(G, x) is the first empty set computed by the recognizer. On the other hand, suppose that w ∉ L(G) and w ∈ PREFIX(G) both hold. In this case, it is discovered that ω ∉ VP_LR(G, w) for every S→ω ∈ P. In either case, the input string is rejected by General_LR.
The correctness of General_LR is recorded more formally in the next two lemmas.
Lemma 4.12 Let w ∈ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR accepts w.
Proof. From earlier results, PVP_LR(G, i:w) and VP_LR(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). Moreover, ω ∈ VP_LR(G, w) for some S→ω ∈ P since w ∈ L(G). Therefore, General_LR accepts w. □
Lemma 4.13 Let w ∉ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR rejects w.
Proof. There are two cases to consider according to whether or not w is in PREFIX(G). Case (i): w ∈ PREFIX(G). In this case, PVP_LR(G, i:w) and VP_LR(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). However, since w ∉ L(G), ω ∉ VP_LR(G, w) for every S→ω ∈ P, so w is rejected. Case (ii): w ∉ PREFIX(G). Let x be the longest prefix of w such that x ∈ PREFIX(G) holds, and let len(x) = m, 0 ≤ m < len(w). Then PVP_LR(G, (m+1):w) is the first empty set computed by the recognizer, so w is rejected. □
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LR are identified in this section. Specifically, for an arbitrary string x ∈ T*, PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Lemma 4.14 Relation |=* is regularity-preserving.
Proof. Let G = (V, T, P, S) be an arbitrary grammar and let L be an arbitrary regular subset of VP(G). Define the regular canonical system C = (V, Π) such that Π = {(ω, A) | A→ω ∈ P}. Since =>_C is defined on V* and |= is defined on VP(G) ⊆ V*, |= is a subrelation of =>_C. By Fact 3.1, L' = r(L, C, {ε}) is a regular language. Since regular languages are closed under intersection, L' ∩ VP(G) is regular. Clearly, |=*(L) ⊆ L' ∩ VP(G) holds, since |= is a subrelation of =>_C that is restricted to VP(G). The converse inclusion, viz., L' ∩ VP(G) ⊆ |=*(L), is obtained by applying the corollary to Lemma 4.6. Specifically, for α ∈ L and β ∈ L' ∩ VP(G), if α =>_C* β holds in C, then α |=* β holds in G. Thus, |=*(L) = L' ∩ VP(G), so |=* is regularity-preserving. □
Lemma 4.15 Relation -|_a is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar, a a terminal symbol in T, and L an arbitrary regular subset of VP(G). Since regular languages are closed under concatenation, La is a regular language. However, La may contain some strings which are not viable prefixes of G. This is rectified by intersecting La with VP(G). Since regular languages are also closed under intersection, La ∩ VP(G) is regular. Clearly, αa ∈ V* is contained in La ∩ VP(G) if and only if α ∈ L and αa ∈ VP(G) (i.e., α -|_a αa holds in G). Thus, -|_a(L) = La ∩ VP(G), so -|_a is regularity-preserving. □
Theorem 4.16 Let G = (V, T, P, S) be an arbitrary grammar and let x be an arbitrary string over T. Then PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Proof. Applying Lemmas 4.14 and 4.15 and noting that PVP_LR(G, ε) = {ε} is regular, the theorem is proven by induction on len(x). □
Discussion
A simple description of general left-to-right bottom-up recognition was presented. The General_LR recognition scheme was derived from General_RR by defining the inverses of =>_R and |, restricting them to VP(G), reversing the direction in which the input string is scanned, and manipulating some relational expressions. The two inverse relations, |= and -|, preserve regularity. Thus, the essence of general left-to-right bottom-up recognition was captured in terms of computing the images of regular subsets of VP(G) under these relations.
Together, the results in Chapters III and IV provide a succinct and elegant characterization of general context-free recognition. This was accomplished by starting from two binary relations on strings and applying basic set-theoretic concepts. There was no need to resort to automata, although automata are certainly useful for implementing the abstract recognizers. In short, the formal development contained in these two chapters provides a framework, founded on a minimal number of kernel concepts, within which the intrinsic properties of general canonical context-free recognizers may be further investigated.
The denotations "RR", "LL", and "LR" that pervade Chapters III and IV were suggested by Knuth [28], where the following deterministic context-free grammar classes and the methods of their analysis are enumerated:
RR(k) - scan from right to left, deduce rightmost derivations;
LL(k) - scan from left to right, deduce leftmost derivations;
LR(k) - scan from left to right, deduce reversed rightmost derivations; and
RL(k) - scan from right to left, deduce reversed leftmost derivations.
Here, k ≥ 0 indicates the length of the lookahead strings used. Note that the use of these denotations is meant to evince a generalization of the respective parsing methods rather than a generalization of the grammatical classes. A corresponding General_RL recognition scheme is not included here. To mesh with the other recognition schemes, it would utilize the |= and -| relations defined in terms of G^R. Images of regular subsets of VS(G) under these relations would be tracked by General_RL as an input string is scanned from right to left.
The General_RR recognition scheme was developed primarily as a stepping stone to General_LL and General_LR. General_RR is given little attention in the remaining chapters. Consequently, VP_LR(G, x) (resp. PVP_LR(G, x)) is simplified to VP(G, x) (resp. PVP(G, x)). Similarly, VS(G, x) (resp. PVS(G, x)) is used to denote VS_LL(G, x) (resp. PVS_LL(G, x)).
CHAPTER V
ON EARLEY'S ALGORITHM
In this chapter, Earley's general context-free recognizer is examined and its relationship to the General_LR and General_LL recognition schemes is ascertained. In particular, a modified version of Earley's recognizer is presented which builds a state-transition graph in addition to the state sets that are constructed by Earley's original algorithm. Analyses of certain properties of the resulting STG reveal parallels between Earley's algorithm and the General_LR and General_LL recognizers. Throughout this chapter, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary input string w = a1 a2 ... a_{n+1}, n ≥ 0, with ai ∈ T\{$} for 1 ≤ i ≤ n and a_{n+1} = $, are assumed.
Earley's General Recognizer
Recall that A→α.β is an item of G whenever there is a production of the form A→αβ in P. The bracketed pair [A→α.β, j], where A→α.β is an item of G and j is a natural number, is called an Earley state of G (or state, for short). Earley's algorithm, in recognizing w with respect to G, constructs a sequence of state sets Si, 0 ≤ i ≤ n+1.
Each Si is initialized to a finite set of states which we denote by basis(Si). For 0 ≤ i ≤ n+1:
basis(Si) = {[S'→.S$, 0]} if i = 0
basis(Si) = {[A→αa_i.β, j] | [A→α.a_iβ, j] ∈ S_{i-1}} if 1 ≤ i ≤ n+1
The lone state in basis(S0), [S'→.S$, 0], is called the initial state; it will be denoted by s0. For i > 0, basis(Si) is constructed by the Earley Scanner function.
A state-set closure function, informally called SClosure, completes the construction of a set of Earley states. That is, for 0 ≤ i ≤ n+1:
Si = SClosure(basis(Si)) if 0 ≤ i ≤ n
Si = basis(Si) if i = n+1
For 0 ≤ i ≤ n, Si is the smallest set of states satisfying the following three conditions:
(1) Every state in basis(Si) is in Si.
(2) If [A→α.Bβ, j] is in Si, then for all B→ω ∈ P, [B→.ω, i] is in Si.
(3) If [B→ω., j] is in Si, then for all [A→α.Bβ, k] in Sj, [A→αB.β, k] is in Si.
The states added to Si by rules (2) and (3) above correspond to the states that are spawned by the Earley Predictor and Completer functions, respectively. Thus, SClosure embodies both of these functions. The number of states added to Si during its closure is finite; after all possible states are added, we say that Si is closed.
Figure 5.1 presents Earley's general context-free recognizer in terms of the notation defined above. A Scanner function is assumed which computes basis(S_{i+1}) from Si and a_{i+1}, 0 ≤ i ≤ n.
function Earley (G = (V, T, P, S); w ∈ T*)
// w = a1 a2 ... a_{n+1}, n ≥ 0, ai ∈ T\{$} for 1 ≤ i ≤ n, a_{n+1} = $
basis(S0) := {[S'→.S$, 0]}
for i := 0 to n do
    Si := SClosure(basis(Si))
    basis(S_{i+1}) := Scanner(Si, a_{i+1})
    if basis(S_{i+1}) = ∅ then Reject(w) fi
od
S_{n+1} := basis(S_{n+1})
Accept(w)
end
Figure 5.1 - Earley's General Recognizer
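For concreteness, the recognizer of Figure 5.1 can be rendered in Python as below. States are tuples (A, ω, dot, j), and the Predictor, Completer, and Scanner are folded into one worklist loop. The sketch assumes an ε-free grammar (the snapshot of Sj taken in the Completer is then safe) and uses the hypothetical $-augmented grammar S' → S$, S → aSb | ab; earley is an illustrative name.

```python
# A minimal sketch of Earley's recognizer for a $-augmented,
# epsilon-free grammar with single-character symbols.

def earley(w, grammar, start="S'"):
    w = w + '$'                        # append the endmarker a_{n+1}
    n = len(w)
    S = [set() for _ in range(n + 1)]
    S[0].add((start, grammar[start][0], 0, 0))     # initial state s0
    for i in range(n + 1):
        work = list(S[i])              # close S_i (SClosure)
        while work:
            lhs, rhs, dot, j = work.pop()
            if dot < len(rhs) and rhs[dot] in grammar:   # Predictor
                for alt in grammar[rhs[dot]]:
                    t = (rhs[dot], alt, 0, i)
                    if t not in S[i]:
                        S[i].add(t); work.append(t)
            elif dot == len(rhs):                        # Completer
                for (l2, r2, d2, k) in list(S[j]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        t = (l2, r2, d2 + 1, k)
                        if t not in S[i]:
                            S[i].add(t); work.append(t)
        if i < n:                                        # Scanner
            for (lhs, rhs, dot, j) in S[i]:
                if dot < len(rhs) and rhs[dot] == w[i]:
                    S[i + 1].add((lhs, rhs, dot + 1, j))
            if not S[i + 1]:
                return False                             # Reject(w)
    return any(l == start and d == len(r)
               for (l, r, d, j) in S[n])                 # Accept?

G = {"S'": ['S$'], 'S': ['aSb', 'ab']}
print(earley('aabb', G))   # True
print(earley('abb', G))    # False
```

Rejection on an empty basis(S_{i+1}) reflects the correct-prefix behaviour that the chapter goes on to relate to General_LR.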
For O
For 0
The correctness of Earley's algorithm is based on the criteria that place a state in a particular state set [6]. In that regard, the following statements are made.
Fact 5.1 For 0 ≤ j ≤ i ≤ n+1, [A→α.β, j] ∈ Si if and only if α =>* a_{j+1} ... a_i holds in G and S' =>_lm* (j:w)Aδ holds in G for some δ ∈ V*. □
Facts 5.2 and 5.3 below ascribe bottomup and topdown interpretations, respectively, to Fact 5.1.
Fact 5.2 For 0 ≤ j ≤ i ≤ n+1, [A→α.β, j] ∈ Si if and only if δ =>_rm* j:w and δα =>_rm* i:w hold in G for some δ ∈ V* with δα ∈ VP(G). □
Note that δ ∈ VP(G, j:w) and δα ∈ VP(G, i:w). We say that [A→α.β, j] ∈ Si is valid for δα ∈ VP(G, i:w); in particular, [A→.αβ, j] ∈ Sj is valid for δ ∈ VP(G, j:w). If αβ ≠ ε also holds, then we say that [A→α.β, j] ∈ Si properly cuts δα ∈ VP(G, i:w).
Fact 5.3 For 0 ≤ j ≤ i ≤ n+1, [A→α.β, j] ∈ Si if and only if S' =>_lm* (j:w)Aδ and S' =>_lm* (i:w)βδ hold in G for some δ ∈ V*. □
In this case, note that (Aδ)^R ∈ VS(G, j:w) and (βδ)^R ∈ VS(G, i:w). We say that [A→α.β, j] ∈ Si is valid for (βδ)^R ∈ VS(G, i:w); in particular, [A→.αβ, j] ∈ Sj is valid for (Aδ)^R ∈ VS(G, j:w).
A Modified Earley Recognizer
A modified version of Earley's recognizer, called Earley', is described next. Earley' differs from Earley's algorithm in that it constructs a statetransition graph. The STG constructed by Earley' is denoted by GE, = (QE', V, 6E,). The states in QE' are the Earley states that are generated by Earley's algorithm. The state transitions in 6E, are described below.
In recognizing w with respect to G, Earley' builds the same sequence of state sets as Earley's algorithm. In addition, a sequence of state-transition sets, viz., E_i for 0 ≤ i ≤ n+1, is constructed.
A particular set of state transitions E_i is constructed analogously to S_i. That is, (1) E_i is initialized to a finite set of transitions denoted by basis(E_i), and (2) a transition-set closure function, called EClosure, is applied to basis(E_i) to complete the construction of E_i. For 0 ≤ i ≤ n+1,

    basis(E_i) = ∅  if i = 0
    basis(E_i) = {(s, a_i, t) | s = [A→α.a_iβ, j] ∈ S_i-1, t = [A→αa_i.β, j] ∈ S_i}  if 1 ≤ i ≤ n+1

Thus, basis(E_i) for i > 0 is determined from S_i-1, S_i, and a_i; basis(E_0) is a special case. For i > 0, the transitions in basis(E_i) may be installed by a slightly modified Earley Scanner function.
For 0 ≤ i ≤ n+1,

    E_i = EClosure(basis(E_i))  if 0 ≤ i ≤ n
    E_i = basis(E_i)            if i = n+1
For 0 ≤ i ≤ n, EClosure(basis(E_i)) is the smallest set of transitions such that:
(1) Every transition in basis(E_i) is in E_i.
(2) If s = [A→α.Bβ, j] is in S_i, then for all B→ω ∈ P, (s, ε, t) is in E_i where
t = [B→.ω, i] ∈ S_i.
(3) If [B→ω., j] is in S_i, then for all s = [A→α.Bβ, k] in S_j, (s, B, t) is in E_i where
t = [A→αB.β, k] ∈ S_i.
Transitions added to E_i by rules (2) and (3) above correlate closely with the states that are generated by the Predictor and Completer functions, respectively.
A high-level description of Earley' is given in Figure 5.2. In that figure, we assume (1) a generalized Closure function which concurrently constructs S_i and E_i from their bases, 0 ≤ i ≤ n, and (2) a generalized Scanner function which concurrently computes basis(S_i+1) and basis(E_i+1) from S_i and a_i+1.
function Earley'(G = (V, T, P, S); w ∈ T*)
  // w = a_1a_2⋯a_n+1, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_n+1 = $.
  basis(S_0), basis(E_0) := {[S'→.S$, 0]}, ∅
  for i := 0 to n do
    (S_i, E_i) := Closure(basis(S_i), basis(E_i))
    (basis(S_i+1), basis(E_i+1)) := Scanner(S_i, a_i+1)
    if basis(S_i+1) = ∅ then Reject(w) fi
  od
  S_n+1, E_n+1 := basis(S_n+1), basis(E_n+1)
  Accept(w)
end
Figure 5.2  A Modified Earley Recognizer
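To make the construction of the transition sets concrete, the following Python sketch augments an Earley recognizer so that every Scanner, Predictor, and Completer step also records a transition of the Earley state graph, in the spirit of Figure 5.2. The encodings (states as (lhs, rhs, dot, j) tuples, "" standing for ε labels) and all identifiers are illustrative assumptions, not the dissertation's.

```python
# A sketch of Earley' (Figure 5.2): the recognizer augmented to record the
# transitions of the Earley state graph. A state [A -> alpha . beta, j] is
# (lhs, rhs, dot, j); a transition is (source, label, target), "" meaning eps.

def earley_prime(productions, start, tokens):
    """Return the transitions of the Earley state graph if w is accepted, else None."""
    S = [set() for _ in range(len(tokens) + 1)]
    E = set()
    S[0].add(("S'", (start, "$"), 0, 0))
    for i, a in enumerate(tokens):
        _closure(S, E, i, productions)
        for st in S[i]:                          # Scanner: basis(S_{i+1}), basis(E_{i+1})
            lhs, rhs, dot, j = st
            if dot < len(rhs) and rhs[dot] == a:
                t = (lhs, rhs, dot + 1, j)
                S[i + 1].add(t)
                E.add((st, a, t))
        if not S[i + 1]:
            return None                          # Reject(w)
    return E                                     # Accept(w)

def _closure(S, E, i, productions):
    work = list(S[i])
    while work:
        st = work.pop()
        lhs, rhs, dot, j = st
        if dot < len(rhs) and rhs[dot] in productions:
            for body in productions[rhs[dot]]:   # rule (2): (s, eps, [B -> .w, i])
                t = (rhs[dot], body, 0, i)
                E.add((st, "", t))
                if t not in S[i]:
                    S[i].add(t); work.append(t)
            for c in list(S[i]):                 # a B already completed within S_i
                if c[0] == rhs[dot] and c[2] == len(c[1]) and c[3] == i:
                    t = (lhs, rhs, dot + 1, j)
                    E.add((st, rhs[dot], t))
                    if t not in S[i]:
                        S[i].add(t); work.append(t)
        elif dot == len(rhs):
            for s2 in list(S[j]):                # rule (3): (s, B, [A -> aB.b, k])
                l2, r2, d2, k = s2
                if d2 < len(r2) and r2[d2] == lhs:
                    t = (l2, r2, d2 + 1, k)
                    E.add((s2, lhs, t))
                    if t not in S[i]:
                        S[i].add(t); work.append(t)
```

With the hypothetical grammar E → E + T | T, T → a, the returned edge set contains, for example, the scan transition from [T→.a, 0] to [T→a., 0] on a, and a rooted path through these edges spells a viable prefix.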
The STG G_E' is informally called the Earley state graph. When the Earley state graph is complete, G_E' = (Q_E', V, δ_E') where Q_E' = ∪_{0≤i≤n+1} S_i and δ_E' = ∪_{0≤i≤n+1} E_i.
As every state in G_E' is reachable from the initial state s_0 = [S'→.S$, 0], s_0 is also called the root of G_E'. A path in G_E' which begins at the root is called a rooted path in G_E'.
Earley's Algorithm and Viable Prefixes
Let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Earley' to G and w. In this section, the strings over V that are spelled by rooted paths in G_E' are analyzed. It transpires that the string spelled by an arbitrary rooted path in G_E' is a viable prefix of G. Moreover, the string spelled by a rooted path in G_E' which terminates at a state in S_i, 0 ≤ i ≤ n+1, is shown to be a member of VP(G, i:w).
Lemma 5.1 For 0 ≤ j ≤ i ≤ n+1, let s = [A→α.β, j] ∈ S_i such that len(α) > 0. Then every transition to s in G_E' is of the form (r, X, s) where α = α_1X and r = [A→α_1.Xβ, j].
Proof (len(α) > 0). Since len(α) > 0, α = α_1X for some α_1 ∈ V* and X ∈ V, i.e., s = [A→α_1X.β, j]. Thus, s was added to S_i by either the Scanner or the Completer. In either case, every transition to s in G_E' is of the form (r, X, s) such that r = [A→α_1.Xβ, j] ∈ S_i' for some i', j ≤ i' ≤ i. □
Corollary For 0 ≤ i ≤ n+1, every transition in G_E' to a state of the form [B→.ω, i] ∈ S_i is on ε.
Lemma 5.2 Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that p spells γ ∈ V* and s_m = [A→α.β, j] ∈ S_i for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ VP(G, i:w) and s_m is valid for γ. Proof. The proof is by induction on m.
Basis (m = 0). Thus, i = 0, s_m = s_0 = [S'→.S$, 0] ∈ S_0, and γ = ε. The consequent trivially holds in this case.
Induction (m > 0). Two cases are analyzed based on whether or not α = ε. Case (i): α = ε. In this case, j = i and s_m was added to S_i by the Predictor. Thus, s_m-1 = [B→α'.Aτ, j'] ∈ S_i for some B→α'Aτ ∈ P and j', 0 ≤ j' ≤ i, and the last transition in p is on ε. By the induction hypothesis, p' = (s_0, ..., s_m-1) spells γ, γ ∈ VP(G, i:w), and s_m-1 is valid for γ; since s_m-1 is valid for γ and A→αβ ∈ P, s_m is also valid for γ. Case (ii): α ≠ ε. By Lemma 5.1, the last transition in p is (s_m-1, X, s_m) where α = α_1X and s_m-1 = [A→α_1.Xβ, j] ∈ S_i' for some i', j ≤ i' ≤ i. Thus, γ = γ_1X where p' = (s_0, ..., s_m-1) spells γ_1. By the induction hypothesis, γ_1 ∈ VP(G, i':w) and s_m-1 is valid for γ_1; it follows that γ = γ_1X ∈ VP(G, i:w) and that s_m is valid for γ. □
Corollary Let p = (s_0, ..., s_m), m ≥ 0, be a rooted path in G_E' such that p spells γ ∈ V* and s_m = [A→α.β, j] ∈ basis(S_i) for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ PVP(G, i:w).
Proof. If m = 0, then γ = ε and i = 0. By definition, ε ∈ PVP(G, 0:w). Otherwise, suppose that m > 0. Since s_m ∈ basis(S_i), the last transition in p is on a_i ∈ T, i.e., i > 0 and γ = γ_1a_i for some γ_1 ∈ V*. Therefore, γ ∈ PVP(G, i:w). □
The next lemma provides the converse to Lemma 5.2.
Lemma 5.3 Let γ be a string in VP(G, i:w) and let [A→α.β, j] ∈ S_i be a state which is valid for γ for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there is a rooted path in G_E' which terminates at [A→α.β, j] and spells γ.
Proof. This lemma appears to be rather more difficult than Lemma 5.2 to prove rigorously. In lieu of a formal proof, an intuitive argument is given. First the following observations are made.
(1) Every state which is valid for γ is in S_i. Otherwise, a contradiction of Fact 5.2
would result.
(2) If γ ≠ ε, then there is some state s ∈ S_i such that s is valid for γ and s properly
cuts γ. In particular, Earley states that are added by the Scanner or Completer
properly cut the viable prefixes that they are valid for.
(3) If γ ≠ ε, then for each state s ∈ S_i which is valid for γ there is a state r ∈ S_i such
that (i) r is also valid for γ, (ii) r properly cuts γ, and (iii) there exists a path in
G_E' from r to s which spells ε.
Given these observations, an informal inductive argument proceeds as follows, where the induction is on len(γ).
Basis (len(γ) = 0). For each state s ∈ S_0 which is valid for ε ∈ VP(G, 0:w), there exists a rooted path in G_E' to s which spells ε.
Induction (len(γ) > 0). Let γ = γ_1X for some γ_1 ∈ V* and X ∈ V. By points (2) and (3) above, we may assume that [A→α.β, j] ∈ S_i properly cuts γ, i.e., α = α_1X for some α_1 ∈ V*. Let s = [A→α_1X.β, j] ∈ S_i. For every i', j ≤ i' ≤ i, such that r = [A→α_1.Xβ, j] ∈ S_i' is valid for γ_1 ∈ VP(G, i':w), the transition (r, X, s) is in δ_E'. By the induction hypothesis, there is a rooted path in G_E' to r which spells γ_1; extending that path with (r, X, s) yields a rooted path to s which spells γ. □
Theorem 5.4 For 0 ≤ i ≤ n+1, let G_E',i = (∪_{0≤j≤i} S_j, V, ∪_{0≤j≤i} E_j) and let M_E',i = (G_E',i, s_0, S_i) denote an NFA. Then L(M_E',i) = VP(G, i:w).
Proof. This theorem follows from Lemmas 5.2 and 5.3. □
Corollary For 0 ≤ i ≤ n+1, let G_E',i,b = (∪_{0≤j≤i} S_j, V, ∪_{0≤j<i} E_j ∪ basis(E_i)) and let M_E',i,b = (G_E',i,b, s_0, basis(S_i)) denote an NFA. Then L(M_E',i,b) = PVP(G, i:w). □
Theorem 5.4 and its Corollary establish a direct relationship between Earley' and the General_LR recognition scheme. Indeed, Earley' prescribes one possible approach to realizing an implementation of General_LR. Note that the foregoing analysis of G_E' provides a constructive proof that for arbitrary x ∈ T*, VP(G, x) and PVP(G, x) are regular languages.
Earley's Algorithm and Viable Suffixes
The last section considered strings in V* that are spelled by rooted paths in G_E'. The string spelled by a path in G_E' is determined directly from the grammar symbols that label the transitions in that path. In this section, another string over V is associated with a path in G_E', viz., a string that is derived from the states in that path. Specifically, the state derivative of a path in G_E' is defined recursively by the state-derivative function given in Figure 5.3.
function state-derivative((s_0, s_1, ..., s_m))
  // (s_0, s_1, ..., s_m), m ≥ 0, is a path in G_E'.
  if m = 0 then // Let s_0 = [A→α.β, j].
    return(β^R)
  else if s_0 = [A→α.Xβ, j] and s_1 = [A→αX.β, j] then // (s_0, X, s_1) ∈ δ_E'
    return(state-derivative((s_1, s_2, ..., s_m)))
  else // Let s_0 = [A→α.Bβ, j] and s_1 = [B→.ω, i].
    return(β^R (state-derivative((s_1, s_2, ..., s_m))))
  fi
end
Figure 5.3  The Definition of the State Derivative of a Path
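The recursion of Figure 5.3 can be transcribed directly. In this Python sketch a state [A→α.β, j] is the tuple (lhs, rhs, dot, j), a path is a list of such states, and consecutive states are joined either by a transition on the dotted symbol or by an ε (predict) transition; the encodings are my own assumptions.

```python
# A transcription of the state-derivative function of Figure 5.3. For the
# base case the derivative is beta reversed; a transition on a grammar symbol
# contributes nothing; a predict (eps) transition from [A -> alpha . B beta, j]
# contributes beta reversed. Results are tuples of grammar symbols.

def state_derivative(path):
    lhs, rhs, dot, j = path[0]
    if len(path) == 1:
        return tuple(reversed(rhs[dot:]))              # return beta^R
    s1 = path[1]
    if s1 == (lhs, rhs, dot + 1, j):
        # (s0, X, s1) is a transition on X = rhs[dot]: it contributes nothing.
        return state_derivative(path[1:])
    # Otherwise (s0, eps, s1) with s0 = [A -> alpha . B beta, j]:
    return tuple(reversed(rhs[dot + 1:])) + state_derivative(path[1:])
```

For the hypothetical augmented grammar S' → S$, S → a, the one-state rooted path at [S'→.S$, 0] has derivative ($, S), matching the basis case of Lemma 5.5, and extending the path through the prediction of S → a and the scan of a yields ($,).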
Again, let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Earley' to G and w. It transpires that the state derivative of an arbitrary rooted path in G_E' is a viable suffix of G. Moreover, the state derivative of a rooted path in G_E' which terminates at a state in S_i, 0 ≤ i ≤ n+1, is shown to be a member of VS(G, i:w).
Lemma 5.5 Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that γ ∈ V* is the state derivative of p and s_m = [A→α.β, j] ∈ S_i for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ VS(G, i:w) and s_m is valid for γ. Proof. The proof is by induction on m.
Basis (m = 0). Thus, i = 0, s_m = s_0 = [S'→.S$, 0] ∈ S_0, and γ = $S. By definition, $S ∈ VS(G, 0:w) and s_0 is clearly valid for $S.
Induction (m > 0). Two cases are analyzed, based on whether or not α = ε. Case (i): α = ε. In this case, j = i and s_m = [A→.β, i] was added to S_i by the Predictor. Let s_m-1 = [B→α'.Aτ, j'] ∈ S_i for some B→α'Aτ ∈ P and j', 0 ≤ j' ≤ i. The state derivatives of p' = (s_0, ..., s_m-1) and p are (Aτδ)^R and (βτδ)^R = γ, respectively, for some δ ∈ V*. By the induction hypothesis, (Aτδ)^R ∈ VS(G, i:w) and s_m-1 is valid for it; since A→β ∈ P, γ ∈ VS(G, i:w) and s_m is valid for γ. Case (ii): α ≠ ε. Then α = α_1X and, by Lemma 5.1, s_m-1 = [A→α_1.Xβ, j] ∈ S_i' for some i', j ≤ i' ≤ i. The state derivatives of p' = (s_0, ..., s_m-1) and p are (Xβδ)^R and (βδ)^R = γ, respectively, for some δ ∈ V*; the claim follows from the induction hypothesis applied to p'. □
Corollary Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that γ ∈ V* is the state derivative of p and s_m = [A→α.β, j] ∈ basis(S_i) for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ PVS(G, i:w).
Proof. If m = 0, then i = 0 and s_m = s_0 = [S'→.S$, 0] ∈ basis(S_0). The state derivative of p is $S = γ, which is in PVS(G, 0:w) by definition. If m > 0, then i > 0, s_m-1 = [A→α_1.a_iβ, j] ∈ S_i-1, and s_m = [A→α_1a_i.β, j] for some α_1 ∈ V*, i.e., α = α_1a_i. The state derivatives of p' = (s_0, s_1, ..., s_m-1) and p are (a_iβδ)^R and (βδ)^R = γ, respectively, for some δ ∈ V*. By Lemma 5.5, (a_iβδ)^R ∈ VS(G, i-1:w), so γ ∈ PVS(G, i:w). □
The next lemma provides the converse to Lemma 5.5.
Lemma 5.6 Let γ be a string in VS(G, i:w) and let [A→α.β, j] ∈ S_i be a state which is valid for γ for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there is a rooted path in G_E' whose state derivative is γ and which terminates at [A→α.β, j].
Proof. A rigorous proof of this lemma has so far eluded us. Consequently, a very informal intuitive argument is given instead. A more convincing proof is left for future work.
Observe that the basic result provided by Lemmas 5.2 and 5.3 is a graphical interpretation of Fact 5.2 in terms of certain properties of G_E'. In turn, the goal of Lemmas 5.5 and 5.6 is a graphical interpretation of Fact 5.3 in terms of certain other properties of G_E'.
Consider VP(G, i:w) and VS(G, i:w) for some i, 0 ≤ i ≤ n+1.
In contrast to the case with General_LR, Lemmas 5.5 and 5.6 establish a more covert relationship between Earley' and General_LL. This is in keeping with the relative complexity of the definitions of the spelling of a path and its state derivative.
Discussion
A graphical variant of Earley's algorithm was examined within the framework established in the previous two chapters. In the process, some properties of Earley's algorithm were identified and the efficacy of the General_LR and General_LL approaches to general recognition was established. Earley's algorithm is an excellent vehicle for demonstrating the effectiveness of General_LR and General_LL given that it is so well-known and highly regarded.
The analyses contained in the previous two sections illustrated how the sets of viable prefixes (resp. viable suffixes) tracked by General_LR (resp. General_LL) are explicitly represented in the state-transition graph that is constructed by Earley'. As Earley' is a direct descendant of Earley, it is fair to conclude that these same sets are represented implicitly in the Earley state sets that are constructed by Earley's original algorithm. By viewing Earley's algorithm from this novel perspective, its operation and correctness have been explained at a level of abstraction that is closer to that necessary for capturing the essence of general canonical recognition.
The structure of G_E' exhibits how Earley' subsumes both the General_LR and General_LL recognition schemes. Clearly, Earley' embodies General_LR considerably more directly than General_LL. In light of this, it is perhaps more apt to view Earley's algorithm as a general bottom-up recognizer.
Practical aspects of the General_LR recognition scheme are examined further in the next chapter, and Chapter VII extends it into a general parser. Thus, this chapter is transitional in that it bridges the abstract treatment of general recognition presented in Chapters III and IV with the concrete treatment of General_LR contained in Chapters VI and VII. Attempts at deriving a general parser from General_LL were unsuccessful. Thus, an investigation of the practical potential of General_LL is left for future work.
CHAPTER VI
A GENERAL BOTTOMUP RECOGNIZER
In this chapter, a general bottom-up recognizer that is directly based on the General_LR recognition scheme is presented. In particular, the algorithm constructs a graph in such a way that the regular sets of viable prefixes manipulated by General_LR are represented in this graph. Aside from complications that can arise due to nullable nonterminals, the recognizer is extended into a general parser rather seamlessly (parsing is the subject of the next chapter). Thus, in light of the algorithm's practical potential, several implementation issues are discussed. Throughout this chapter, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a_1a_2⋯a_n+1, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_n+1 = $, are assumed.
Control Automata and Recognition Graphs
The recognizer described in this chapter constructs a state-transition graph which we call the recognition graph. The correctness of the algorithm is based on properties of this graph. The recognition graph is constructed under the guidance of an FSA called the control automaton. The control automaton is determined from the subject grammar G and is fixed throughout the recognition process. In contrast, the recognition graph evolves during recognition; its structure is derived from the control automaton and the input string w.
For simplicity, the LR(0) automaton of G is used as the control automaton for guiding the recognition of w with respect to G; alternative control automata are suggested later. The LR(0) automaton of G is a DFA which is based on the canonical collection of sets of LR(0) items of G and the associated goto function [4,11]. Recall that each set is comprised of kernel and closure items. The item S'→.S$ is a kernel item, as are all items of the form A→α.β such that α ≠ ε. With the exception of S'→.S$, all items of the form A→.ω are closure items.
We denote the LR(0) automaton of G by M_C(G) = (I, V, goto, I_0, I) where I = {I_0, I_1, ..., I_m-1} is the collection of sets of LR(0) items. The "C" subscript is a reminder that M_C(G) is used as the control automaton during recognition. For convenience, we assume that S'→.S$ ∈ I_0 and S'→S$. ∈ I_m-1; in fact, the latter assumption implies that I_m-1 = {S'→S$.}. A detailed accounting of M_C(G) is not needed to describe how it is used to recognize strings. However, the following well-known facts about M_C(G) are useful.
(1) L(M_C(G)) = VP(G).
(2) Each I_j ∈ I\{I_0} has a unique entry symbol X ∈ V, i.e., the grammar symbol that
all transitions to I_j are made on. The entry symbol for I_j, j ≠ 0, is denoted by
entry(I_j). There are no transitions directed to I_0 in M_C(G), so entry(I_0) is not
defined.
(3) For I_j ∈ I, (i) if A→α.Xβ ∈ I_j, then A→αX.β ∈ I_k where I_k = goto(I_j, X);
(ii) if A→αX.β ∈ I_j, then A→α.Xβ ∈ I_k for all I_k ∈ pred(I_j, X); and
(iii) if A→ω. ∈ I_j and A ≠ S', then goto(I_k, A) is defined for all I_k ∈ pred(I_j, ω). Automaton M_C(G) is also denoted by M_C if G is understood.
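For readers who wish to experiment with a control automaton, the following Python sketch builds the canonical collection of sets of LR(0) items and the goto function for a $-augmented grammar. The grammar encoding and function names are assumptions of this sketch, not notation from the text.

```python
# A compact LR(0) construction: an item A -> alpha . beta is the triple
# (lhs, rhs, dot), a state is a frozenset of items, and goto is a dict
# keyed by (state, symbol).

def lr0_closure(items, productions):
    out, work = set(items), list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in productions:
            for body in productions[rhs[dot]]:
                item = (rhs[dot], body, 0)        # closure item B -> .omega
                if item not in out:
                    out.add(item); work.append(item)
    return frozenset(out)

def lr0_goto(state, X, productions):
    kernel = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == X}
    return lr0_closure(kernel, productions) if kernel else None

def lr0_automaton(productions, start):
    """Return (I0, set of states, goto dict) for the $-augmented grammar."""
    I0 = lr0_closure({("S'", (start, "$"), 0)}, productions)
    states, goto, work = {I0}, {}, [I0]
    while work:
        s = work.pop()
        for X in {r[d] for (_, r, d) in s if d < len(r)}:
            t = lr0_goto(s, X, productions)
            goto[(s, X)] = t
            if t not in states:
                states.add(t); work.append(t)
    return I0, states, goto
```

For the hypothetical grammar S → aSb | c the construction yields seven states, and the state reached from I_0 on S and then $ is {S'→S$.}, matching the convention that I_m-1 = {S'→S$.}.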
The precise manner in which the recognition graph is constructed is the essence of the algorithm described in the next section. Some general characteristics of recognition graphs are described in the remainder of this section.
The recognition graph constructed under the guidance of M_C is denoted by G_R(M_C) = (Q, V, δ). At the start of recognition, G_R(M_C) is set to an initial configuration. Additional states and transitions are added to Q and δ, respectively, as the recognition proceeds. The denotation G_R(M_C) is simplified to G_R whenever the intent is obvious.
Each state added to Q during recognition corresponds to a set of items I_j ∈ I of M_C, 0 ≤ j ≤ m-1, and an input position i, 0 ≤ i ≤ n+1. We write q_j:i for the state in Q that corresponds to I_j and input position i. The function ψ is defined to map a state in G_R to its associated set of items in M_C; thus, ψ(q_j:i) = I_j. For later use, we define Q_i = {q_j:i ∈ Q}, 0 ≤ i ≤ n+1.
Similarly, each transition added to δ during recognition corresponds to a transition in M_C. The members of δ are best described in terms of the mapping induced by goto and ψ, defined as follows: for p, q ∈ Q and X ∈ V, (p, X, q) ∈ δ only if goto(ψ(q), X) = ψ(p). Thus, each transition in G_R corresponds to the reversal of a transition in M_C. Consequently, all of the transitions out of a state p ∈ Q are on entry(ψ(p)). Valid transitions in G_R are also constrained by input position; specifically, (q_k:i, X, q_j:h) ∈ δ implies that h ≤ i.
The General_LR0 Recognizer
The general context-free recognizer, informally named General_LR0, is described in this section. Concurrently, intuitive arguments for its correctness are presented. Establishing the correctness of General_LR0 reduces to demonstrating that it is a faithful realization of the General_LR recognition scheme, i.e., that the sets of viable prefix associates that General_LR tracks are correctly represented in the graph constructed by General_LR0 as w is scanned from left to right.
General_LR0 is described in terms of how it operates when it is applied to G and w. Under the guidance of M_C(G), the LR(0) automaton of G, General_LR0 constructs a recognition graph G_R(M_C). Some general notions about recognition graphs were introduced in the last section. The description of General_LR0 that follows provides more specific details about how G_R is derived from M_C and w. For reference, General_LR0 is rendered in pseudocode in Figure 6.1.
1.  function General_LR0(G = (V, T, P, S); w ∈ T*)
2.    // w = a_1a_2⋯a_n+1, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_n+1 = $.
3.    // Let M_C(G) = (I, V, goto, I_0, I) be the LR(0) automaton for G.
4.    // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
5.    Q, δ := {q_0:0}, ∅  // Initialize G_R.
6.    // Let M_R = (G_R^-1, q_0:0, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
7.    for i := 0 to n do
8.      // Let M_R = (G_R^-1, q_0:0, Q_i). Then L(M_R) = PVP(G, i:w).
9.      Reduce(i)
10.     // Let M_R = (G_R^-1, q_0:0, Q_i). Then L(M_R) = VP(G, i:w).
11.     Shift(i)
12.     // Let M_R = (G_R^-1, q_0:0, Q_i+1). Then L(M_R) = PVP(G, i+1:w).
13.     if Q_i+1 = ∅ then Reject(w) fi
14.   od
15.   // Let M_R = (G_R^-1, q_0:0, Q_n+1). Then L(M_R) = PVP(G, w) = {S$}.
16.   Accept(w)
17. end
18. function Shift(i)
19.   Q_subset := {q ∈ Q_i | goto(ψ(q), a_i+1) is defined}
20.   while Q_subset ≠ ∅ do
21.     q := Remove(Q_subset)  // Let goto(ψ(q), a_i+1) = I_j.
22.     if q_j:i+1 ∉ Q then
23.       Q := Q ∪ {q_j:i+1}
24.     fi
25.     δ := δ ∪ {(q_j:i+1, a_i+1, q)}  // Never redundant.
26.   od
27. end
Figure 6.1  The General_LR0 Recognizer
Throughout its evolution, the structure of G_R is paramount. Certain intermediate stages in its construction hold particular interest. At each of these points, an FSA may be defined in terms of G_R^-1 which accepts one of the sets of viable prefix associates that is computed by the General_LR recognition scheme. The FSA derived from G_R^-1 is denoted by M_R. The inverse of G_R is desired since each of its transitions is reversed from the orientation of the corresponding transition in M_C.
It is important to remember that G_R evolves continuously throughout the recognition process. Consequently, "G_R" and "M_R" denote a different graph and automaton,
28. function Reduce(i)
29.   δ_subset := δ_i
30.   Traverse(Q_i, i)
31.   while δ_subset ≠ ∅ do
32.     (p, X, q) := Remove(δ_subset)
33.     for A→αX.β ∈ ψ(p) such that β ⇒* ε do
34.       for r ∈ succ(q, α^R) do  // Let goto(ψ(r), A) = I_j.
35.         if q_j:i ∉ Q then
36.           Q := Q ∪ {q_j:i}
37.           Traverse({q_j:i}, i)
38.         fi
39.         if (q_j:i, A, r) ∉ δ then
40.           δ := δ ∪ {(q_j:i, A, r)}
41.           δ_subset := δ_subset ∪ {(q_j:i, A, r)}
42.         fi
43.       od
44.     od
45.   od
46. end
47. function Traverse(Q_subset, i)
48.   while Q_subset ≠ ∅ do
49.     q := Remove(Q_subset)
50.     for goto(ψ(q), A) = I_j such that A ⇒* ε do  // A ∈ N
51.       if q_j:i ∉ Q then
52.         Q := Q ∪ {q_j:i}
53.         Q_subset := Q_subset ∪ {q_j:i}
54.       fi
55.       δ := δ ∪ {(q_j:i, A, q)}  // Never redundant.
56.     od
57.   od
58. end
Figure 6.1  continued
respectively, at distinct stages of recognition. The makeup of G_R at any given time determines which regular set is recognized by M_R. The General_LR0 recognizer is best understood through an appreciation of how it transforms G_R.
The General_LR0 recognizer is comprised of a main function (lines 1-17 in Figure 6.1) and three auxiliary functions, Shift, Reduce, and Traverse. The Shift function (lines 18-27) computes the shift relation whereas Reduce (lines 28-46) computes the reflexive-transitive closure of the reduce relation. The Traverse function (lines 47-58) is called from within Reduce. It handles certain transitions on nullable nonterminal symbols. A line-by-line description of the General_LR0 recognizer follows.
(Line 1) General_LR0 is supplied with two arguments, a reduced $-augmented grammar G and a string w over the terminal alphabet of G.
(Lines 2-4) By assumption, w is terminated with $. For simplicity, we also assume that the LR(0) automaton of G, M_C(G), is provided by some external agent.1 Each of w, M_C, and G_R is visible to the functions that require access to them.
(5-6) Graph G_R is initialized to contain the single state q_0:0. The comment in line 6 indicates that G_R^-1 can be trivially embedded into an FSA that accepts PVP(G, ε) = {ε} at this point. Henceforth, the following statement holds for G_R throughout the duration of recognition: for q_j:i ∈ Q, ψ(q_j:i) = I_j.
(7) This for loop iterates once for each terminal symbol in w. Having i range from 0 to n rather than from 1 to n+1 yielded a cleaner expression of the algorithm. The rest of the discussion primarily elaborates on the i-th iteration of this for loop for some i, 0 ≤ i ≤ n.
(8-10) The comment in line 8 is both a loop invariant and a precondition of the Reduce function. It clearly holds upon entry to the loop; the Reduce and Shift functions ensure that it also holds at the start of each iteration. This condition can be alternately stated as follows. A string γ ∈ V* is a member of PVP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_0:0 which spells γ^R. The comment in line 10 is a postcondition of the Reduce function and may be restated similarly; that is, a string γ ∈ V* is a member of VP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_0:0 which spells γ^R. Assuming that the precondition holds when Reduce is called, the Reduce function transforms G_R so that the postcondition holds.
1 An alternative is for General_LR0 to construct M_C as an initial task.
(10-12) The postcondition of Reduce in line 10 is also a precondition of the Shift function. A postcondition of the Shift function is given in line 12 and is similar to the loop invariant. However, in this case the following situation holds for G_R. A string γa_i+1 ∈ V* is a member of PVP(G, i+1:w) if and only if there is a path in G_R from some state q ∈ Q_i+1 to q_0:0 which spells a_i+1γ^R. Assuming that the precondition holds when Shift is called, the Shift function transforms G_R so that this postcondition holds.
(12-13) If Q_i+1 = ∅ at this point, then M_R has no final states. Thus, PVP(G, i+1:w) = ∅ and i+1:w ∉ PREFIX(G). Consequently, w ∉ L(G), so General_LR0 rejects w.
(15-16) Line 15 expresses a postcondition of the for loop. It holds upon completion of the n-th iteration (i.e., when i = n) provided that the postcondition of Shift and Q_i+1 ≠ ∅ both hold at the end of that iteration. In this case, w ∈ L(G), so General_LR0 accepts w.
Before continuing with the description of General_LR0, the following important properties of LR(0) automata are reiterated. Let A→ω. ∈ I_j hold for some A→ω ∈ P with A ≠ S' and I_j ∈ I. In addition, let δω be the spelling of an arbitrary path in M_C from I_0 to I_j for some δ ∈ V*. Then δω reduces to δA in G. Now let A→α.aβ ∈ I_j hold for some A→αaβ ∈ P and I_j ∈ I, and let δα be the spelling of an arbitrary path in M_C from I_0 to I_j. In this case δα shifts to δαa in G. Based on the manner in which G_R is derived from M_C, these two equivalence properties (i.e., the equivalence of paths from I_0 to I_j with respect to reduce and shift actions) are preserved in G_R (i.e., all paths in G_R^-1 from q_0:0 to q_j:i are equivalent with respect to shift and reduce actions). These equivalence properties are exploited by the Shift and Reduce functions.
(11,18) The Shift function is called with i as an argument. This makes the relationship between the values of i in General_LR0 and Shift explicit. The operation of the Shift function during its i-th invocation from General_LR0 is described for some i, 0 ≤ i ≤ n.
(19) At this point, we know that Q_i cannot be empty. Otherwise, the input string would have been rejected in an earlier iteration of the main for loop. The i-th call to Shift computes the shift transitions on a_i+1.2 Thus, we want to determine all states q ∈ Q_i for which there is
2 It is important to remember that i ranges from 0 to n.
a transition on a_i+1 from ψ(q) in M_C. The set variable called Q_subset is initialized to contain these states.
(20) Each state in Q_subset is considered in turn. No additional states are added to Q_subset within the while loop.
(21-25) A state q is removed from Q_subset. Since (ψ(q), a_i+1, I_j) is a transition in M_C, we need to add q_j:i+1 to Q and (q_j:i+1, a_i+1, q) to δ. It is possible that there is more than one transition on a_i+1 to I_j in M_C, so q_j:i+1 may have been added to Q in an earlier iteration of the while loop. This condition is checked in line 22 and q_j:i+1 is added to Q only if necessary. However, the transition (q_j:i+1, a_i+1, q) cannot already be in δ since there is only one transition on a_i+1 from ψ(q) in M_C. This transition is added to δ in line 25.
(27) By assumption, the precondition in line 10 holds when Shift is called. Based on the manner in which certain paths in G_R are extended by the Shift function under the guidance of M_C, the postcondition of Shift holds at this point.
The transformations of G_R made by Reduce are considerably more elaborate. This is not unexpected since Reduce computes the reflexive-transitive closure of a relation.
The operation of the Reduce function during its i-th invocation from General_LR0 is described for some i, 0 ≤ i ≤ n. During this invocation, Reduce adds states to Q_i and installs transitions from states in Q_i to states in Q_h for 0 ≤ h ≤ i.
(9,28) Like Shift, the Reduce function is supplied with i as an argument so that the relationship between the values of i in General_LR0 and Reduce is explicit.
(29) At this point, each transition in δ_i may come from a state that calls for one or more reductions. If i = 0, then there are no applicable transitions. If i > 0, the relevant transitions were installed in G_R by Shift during the previous iteration of the main for loop of General_LR0. In any case, a set variable called δ_subset is initialized to contain δ_i; it is crucial that this assignment occur before Traverse is called.
(30) In short, Traverse creates certain paths to states in Q_i that spell strings of nullable nonterminal symbols. Further discussion of the Traverse function is deferred until later. The Reduce function can be understood independently of it.
(31) Each transition in δ_subset is considered in turn. All reductions relevant to the source states of those transitions are performed. Additional transitions may be added to δ_subset within this loop.
(32) A transition (p, X, q) is removed from δ_subset.
(33) The set of items ψ(p) determines what reductions, if any, are applicable to p. Any kernel item of the form A→αX.β ∈ ψ(p) such that β ⇒* ε holds in G is relevant; that is, we see through certain nullable suffixes of production right-hand sides. In effect, a reduction from p on A→αXβ is performed. As described below, a path to p spelling β^R will have been installed in G_R by an earlier call to the Traverse function. In this way, any cycles created in Q_i by nullable nonterminals are left for Traverse to handle.
(34) At this point we are considering one particular reduction applicable to p, say A→αX.β ∈ ψ(p) where β is nullable. This reduction is performed by traversing certain paths in G_R from p that spell (αX)^R in order to locate the states in Q to which transitions on A must be made. In particular, we want to traverse only those paths that start with the transition (p, X, q). Any other transition from p will either have already been reduced through or else is in δ_subset waiting to be handled in a later iteration of the while loop. The states of interest are given by succ(q, α^R). It is precisely this application of succ that motivates reversing the transitions in G_R with respect to those in M_C.
(35-42) At this point we are dealing with one particular state r ∈ succ(q, α^R) and we assume that goto(ψ(r), A) = I_j for some I_j ∈ I. Thus, we need a state q_j:i in Q_i and a transition (q_j:i, A, r) in δ_i. Both of these objects may already exist in G_R, so they are conditionally created as indicated by the if statements. Incidentally, a transition is generated redundantly here only as the result of an ambiguity. If the transition is indeed new, it is added to δ_subset; any relevant reductions from q_j:i are performed through this transition when it is removed from δ_subset in a later iteration of the while loop.
(46) The postcondition of Reduce holds at this point. To help establish this fact, a subset of VP(G, i:w), denoted by VP'(G, i:w), is defined as follows: (1) for i = 0, VP'(G, 0:w) = PVP(G, 0:w); (2) for 0 < i ≤ n, VP'(G, i:w) is the smallest superset of PVP(G, i:w) that is closed under the reductions described above.
(30,37,47) Traverse deals solely with nullable nonterminals and productions with nullable right-hand sides. In lines 30 and 37, Traverse is called with a nonempty subset of Q as an argument which becomes associated with the set variable called Q_subset. Traverse has the effect of transforming G_R as if all sequences of reductions by productions that have nullable right-hand sides were carried out from the states in Q_subset. However, a transformation of G_R that produces the same result can be derived from a simple traversal of M_C. By adopting this alternative approach, complications that can arise due to cycles in G_R are avoided. Consider the states I_k ∈ I such that ψ(q) = I_k for some q ∈ Q_subset and traverse M_C beginning from these states along all transitions that are made on nullable nonterminals. The states and transitions encountered in this traversal are exactly those which would arise from performing the reduction sequences described above. Consequently, counterparts for all of the states and transitions encountered in this traversal are created in G_R. Thus, a particular subgraph of M_C is effectively embedded in Q_i by this process. The specific subgraph is determined by the composition of Q_subset when Traverse is called.
(48) Each state in Q_subset is considered in turn. Additional states may be added to Q_subset within the loop.
(49) A state q is removed from Q_subset.
(50) All transitions from ψ(q) in M_C that are made on some nullable nonterminal A are relevant. Let goto(ψ(q), A) = I_j be one such transition.
(51-55) We need a state q_j:i in Q_i and a transition (q_j:i, A, q) in δ_i. This state may already exist in Q_i, so it is conditionally created. If q_j:i is indeed new, it is added to Q_subset; the traversal will resume from q_j:i when it is removed from Q_subset in a later iteration of the while loop. However, the transition (q_j:i, A, q) is never generated redundantly; the discipline imposed by the graph traversal ensures that the transitions from each state encountered are considered at most once.
If the two calls to Traverse are removed from the Reduce function and the line "9a. Traverse(Q_i, i)" is added to General_LR0 following line 9, an equivalent transformation of G_R results, i.e., one that satisfies the condition stated in line 10. In this way, Traverse becomes a postprocessor of Reduce. However, for the purposes of parsing it is more appropriate to call Traverse from within Reduce as we have done in Figure 6.1. This will become evident in the next chapter when General_LR0 is extended into a general parser.
That General_LR0 correctly implements the General_LR recognition scheme may be established by induction on i. This induction depends, in turn, on proving that the Reduce (resp. Shift) function correctly transforms G_R such that the postcondition in line 10 (resp. line 12) holds if the precondition in line 8 (resp. line 10) holds before the function is called. Although the Shift and Reduce functions are not formally proven correct, it is expected that the above detailed explanation of General_LR0 provides sufficient intuitive evidence toward that end.
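To convey the flavor of the algorithm in executable form, here is a simplified Python sketch of the scheme, restricted to grammars without nullable nonterminals so that the Traverse machinery is unnecessary (every applicable item A→αX.β then has β = ε). It builds the LR(0) control automaton inline; all encodings, helper names, and the sample grammar are illustrative assumptions, and this is a sketch of the recognition scheme rather than a faithful rendering of Figure 6.1.

```python
# States of the recognition graph are pairs (item_set, position); transitions
# (p, X, q) run from p toward the root, i.e., reversed relative to M_C.

def _closure(items, prods):
    out, work = set(items), list(items)
    while work:
        l, r, d = work.pop()
        if d < len(r) and r[d] in prods:
            for body in prods[r[d]]:
                it = (r[d], body, 0)
                if it not in out:
                    out.add(it); work.append(it)
    return frozenset(out)

def _goto(state, X, prods):
    k = {(l, r, d + 1) for (l, r, d) in state if d < len(r) and r[d] == X}
    return _closure(k, prods) if k else None

def _succ(delta, q, symbols):
    """States reached from q along edges spelling `symbols` in order."""
    frontier = {q}
    for X in symbols:
        frontier = {v for (u, lab, v) in delta if lab == X and u in frontier}
    return frontier

def general_lr0(prods, start, tokens):
    """tokens must end in '$'; grammar must have no nullable nonterminals."""
    I0 = _closure({("S'", (start, "$"), 0)}, prods)
    Q, delta = {(I0, 0)}, set()
    for i, a in enumerate(tokens):
        # Reduce(i): close under reductions through newly installed edges.
        work = [e for e in delta if e[0][1] == i]
        while work:
            p, X, q = work.pop()
            for l, r, d in p[0]:
                if d == len(r) and r and r[-1] == X and l != "S'":
                    for s in _succ(delta, q, tuple(reversed(r[:-1]))):
                        tgt = _goto(s[0], l, prods)
                        if tgt is not None:
                            st, edge = (tgt, i), ((tgt, i), l, s)
                            Q.add(st)
                            if edge not in delta:
                                delta.add(edge); work.append(edge)
        # Shift(i): extend every eligible state on a_{i+1}.
        for q in [s for s in Q if s[1] == i]:
            tgt = _goto(q[0], a, prods)
            if tgt is not None:
                Q.add((tgt, i + 1))
                delta.add(((tgt, i + 1), a, q))
        if not any(s[1] == i + 1 for s in Q):
            return False    # Reject(w)
    return True             # Accept(w)
```

With the hypothetical grammar S → aSb | c, `general_lr0({"S": [("a","S","b"), ("c",)]}, "S", ["a","c","b","$"])` accepts, and unbalanced inputs are rejected; the `_succ` call mirrors the traversal of α^R performed by the Reduce function.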
Earley's Algorithm Revisited
A general recognizer that operates in a manner strikingly similar to Earley' is obtained by modifying General_LR0 to use a particular nondeterministic variant of the LR(0) automaton of G as its control automaton. The alternate control automaton, the modified algorithm, and its relationship to Earley' are briefly discussed in this section.
Alternate Control Automata
The nondeterministic LR(0) (or NLR(0)) automaton of G [24, p. 250] is denoted here by M_NC(G) = (I, V, goto, I_0, I) where
(1) I = {I_0, I_1, ..., I_m-1} = {{A→α.β} | A→αβ ∈ P},
(2) goto({A→α.Xβ}, X) = {A→αX.β}, and
(3) {B→.ω} ∈ goto({A→α.Bβ}, ε) for each B→ω ∈ P.
In this case, we prescribe that I_0 = {S'→.S$} and I_m-1 = {S'→S$.}. Again, M_NC(G) is simplified to M_NC when G is understood. If the standard subset construction algorithm for converting NFAs to DFAs is applied to M_NC, the (deterministic) LR(0) automaton of G is obtained, i.e., M_C(G).
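The NLR(0) automaton is simple enough to construct directly. In the Python sketch below an item triple (lhs, rhs, dot) stands for the singleton state {A→α.β}, and goto maps (state, symbol) to a set of states, with "" playing the role of ε; the encoding is an assumption of this sketch.

```python
# A tiny NLR(0) construction: one state per item, a transition on the dotted
# symbol, and eps-moves ("" labels) into the predictions of a dotted nonterminal.

def nlr0_automaton(productions, start):
    prods = dict(productions)
    prods["S'"] = [(start, "$")]                  # $-augmentation
    states = {(l, r, d) for l, bodies in prods.items()
              for r in bodies for d in range(len(r) + 1)}
    goto = {}
    for (l, r, d) in states:
        if d < len(r):
            goto.setdefault(((l, r, d), r[d]), set()).add((l, r, d + 1))
            if r[d] in prods:                     # eps-moves to [B -> .omega]
                for body in prods[r[d]]:
                    goto.setdefault(((l, r, d), ""), set()).add((r[d], body, 0))
    return ("S'", (start, "$"), 0), states, goto
```

Applying the standard subset construction to the result yields the deterministic LR(0) automaton, in line with the remark above.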
Some functions related to succ and pred are needed for navigating through NLR(0) automata and the recognition graphs derived from them. Toward that end, let G_0 = (Q, Σ, δ) be an STG. The ε̄succ and ε̄pred functions, both of type Q × Σ* → 2^Q, are defined recursively as follows.
(1) For q ∈ Q, ε̄succ(q, ε) = ε̄pred(q, ε) = {q};
(2) for p ∈ Q, a ∈ Σ, and x ∈ Σ*,
ε̄succ(p, xa) = {r ∈ Q | q ∈ ε̄succ(p, x), (q, a, r) ∈ δ} and
ε̄pred(p, ax) = {r ∈ Q | q ∈ ε̄pred(p, x), (r, a, q) ∈ δ}.
Thus, ε̄succ and ε̄pred effectively ignore ε-transitions. Note that if G_0 is ε-free, then ε̄succ (resp. ε̄pred) is identical to succ (resp. pred). The εsucc and εpred functions, both of type Q → 2^Q, are defined for dealing with ε-transitions. For p ∈ Q, εsucc(p) = {q ∈ Q | (p, ε, q) ∈ δ} and εpred(p) = {q ∈ Q | (q, ε, p) ∈ δ}. All four of these functions extend to subsets of Q in the usual fashion.
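A direct transcription of these four navigation functions is straightforward. The following sketch uses an assumed encoding of an STG as a set of (source, label, target) triples, with None standing for ε; the graph and all names are mine.

```python
# Sketch: the four navigation functions on an STG, with delta a set of
# (source, label, target) triples and None standing for eps.
delta = {(0, "a", 1), (1, None, 2), (2, "b", 3), (1, "b", 4)}

def ebar_succ(p, x):
    """States reachable from p along paths spelling x; eps-transitions
    are never taken (one non-eps symbol is consumed per step)."""
    states = {p}
    for a in x:
        states = {r for q in states for (s, lab, r) in delta
                  if s == q and lab == a}
    return states

def ebar_pred(p, x):
    """States from which p is reachable along paths spelling x."""
    states = {p}
    for a in reversed(x):
        states = {r for q in states for (r, lab, s) in delta
                  if s == q and lab == a}
    return states

def eps_succ(p):
    """One-step eps-successors: {q | (p, eps, q) in delta}."""
    return {q for (s, lab, q) in delta if s == p and lab is None}

def eps_pred(p):
    """One-step eps-predecessors: {q | (q, eps, p) in delta}."""
    return {s for (s, lab, t) in delta if t == p and lab is None}
```

On an ε-free graph the first two functions coincide with plain succ and pred, matching the remark above.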
The following facts apply to the NLR(0) automaton M_NC(G).
(1) L(M_NC(G)) = VP(G).
(2) Each I_j ∈ I \ {I_0} has a unique entry symbol X ∈ V ∪ {ε}, again denoted by entry(I_j).
(3) For {A→α.β} ∈ I such that A ≠ S', ε̄pred({A→α.β}, α) = {{A→.αβ}} and goto(I_j, A) is defined for each I_j ∈ εpred({A→.αβ}).
An Alternate Recognizer
The General_LR0 recognizer is modified to employ the NLR(0) automaton of G as a control automaton in place of the LR(0) automaton. The resulting algorithm, called General_NLR0, is displayed in Figure 6.2. Only a small number of minor changes were required to derive General_NLR0 from General_LR0. The differences between the two recognizers are discussed next.

The lines in Figure 6.2 were numbered so as to emphasize the correlation between the General_LR0 and General_NLR0 recognizers. Consequently, the line numbers cited below reference code in both Figures 6.1 and 6.2.

(3-4) It is explicitly recorded that the NLR(0) automaton of G, M_NC(G), is used as the control automaton in General_NLR0. Thus, the recognition graph constructed by General_NLR0, GR(M_NC), is derived from M_NC and the input string w.

(23) A state I_j of M_NC has more than one incoming transition only if entry(I_j) = ε. Therefore, q_j:i+1 is unconditionally added to Q at this point, i.e., lines 22 and 24 are not needed in Figure 6.2.

(33) Each set of items in M_NC is a singleton, so at most one reduction can apply to ψ(p). Thus, an if construct is more appropriate here in place of the for loop of Figure 6.1.
1. function General_NLR0(G = (V, T, P, S); w ∈ T*)
2. // w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T \ {$} for 1 ≤ i ≤ n, a_{n+1} = $.
3. // Let M_NC(G) = (I, V, goto, I_0, I_{m-1}) be the NLR(0) automaton for G.
4. // GR(M_NC) = (Q, V, δ) is an STG, the recognition graph.
5. Q, δ := {q_0:0}, ∅ // Initialize GR.
6. // Let MR = (GR⁻¹, q_0:0, Q_0). Then L(MR) = PVP(G, ε) = {ε}.
7. for i := 0 to n do
8.   // Let MR = (GR⁻¹, q_0:0, Q_i). Then L(MR) = PVP(G, i:w).
9.   Reduce(i)
10.  // Let MR = (GR⁻¹, q_0:0, Q_i). Then L(MR) = VP(G, i:w).
11.  Shift(i)
12.  // Let MR = (GR⁻¹, q_0:0, Q_{i+1}). Then L(MR) = PVP(G, i+1:w).
13.  if Q_{i+1} = ∅ then Reject(w) fi
14. od
15. // Let MR = (GR⁻¹, q_0:0, Q_{n+1}). Then L(MR) = PVP(G, w) = {S$}.
16. Accept(w)
17. end

18. function Shift(i)
19. Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20. while Q_subset ≠ ∅ do
21.   q := Remove(Q_subset) // Let goto(ψ(q), a_{i+1}) = I_j.
23.   Q := Q ∪ {q_j:i+1} // Never redundant.
25.   δ := δ ∪ {(q_j:i+1, a_{i+1}, q)} // Never redundant.
26. od
27. end

Figure 6.2. The General_NLR0 Recognizer
(34) The appropriate successors of p along paths in GR that spell Xαᴿ are located using the ε̄succ and εsucc functions (instead of the succ function). This is necessitated by the presence of ε-transitions in GR.

(50) Similar to General_LR0, the Traverse function of General_NLR0 effectively performs a certain traversal of M_NC. However, in this case we also want to step over ε-transitions. Traversing ε-transitions in this way mirrors the Earley Predictor function.
28. function Reduce(i)
29. δ_subset := δ_i
30. Traverse(Q_i, i)
31. while δ_subset ≠ ∅ do
32.   (p, X, q) := Remove(δ_subset)
33.   if {A→αX.β} = ψ(p) such that β ⇒* ε then
34.     for r ∈ εsucc(ε̄succ(q, αᴿ)) do // Let goto(ψ(r), A) = I_j.
35.       if q_j:i ∉ Q then
36.         Q := Q ∪ {q_j:i}
37.         Traverse({q_j:i}, i)
38.       fi
39.       if (q_j:i, A, r) ∉ δ then
40.         δ := δ ∪ {(q_j:i, A, r)}
41.         δ_subset := δ_subset ∪ {(q_j:i, A, r)}
42.       fi
43.     od
44.   fi
45. od
46. end

47. function Traverse(Q_subset, i)
48. while Q_subset ≠ ∅ do
49.   q := Remove(Q_subset)
50.   for goto(ψ(q), X) = I_j such that X ⇒* ε do // X ∈ N ∪ {ε}
51.     if q_j:i ∉ Q then
52.       Q := Q ∪ {q_j:i}
53.       Q_subset := Q_subset ∪ {q_j:i}
54.     fi
55.     δ := δ ∪ {(q_j:i, X, q)} // Never redundant.
56.   od
57. od
58. end

Figure 6.2 (continued)
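The guard X ⇒* ε on line 50 of Traverse depends only on the grammar, so it would normally be precomputed once before recognition begins. A minimal sketch of that standard fixpoint computation (the grammar encoding is mine):

```python
# Sketch: precompute the set of nullable symbols (X =>* eps).
def nullable_symbols(prods):
    """Least fixpoint: A is nullable iff some production A -> w has every
    symbol of w already known nullable (an empty w qualifies vacuously)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for A, rhs in prods:
            if A not in nullable and all(X in nullable for X in rhs):
                nullable.add(A)
                changed = True
    return nullable

# Example grammar: S -> A B, A -> eps, B -> b | A A.
prods = [("S", ("A", "B")), ("A", ()), ("B", ("b",)), ("B", ("A", "A"))]
```

Terminals never enter the set, so `all(...)` fails for any right-hand side containing one; the loop runs at most once per nonterminal added.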
Relationship to Earley's Algorithm
A connection between Earley's algorithm and General_NLR0 is established. The link between these two algorithms is made indirectly through Earley'. Specifically, we describe a correspondence between the Earley state graph constructed by Earley' and the recognition graph constructed by General_NLR0.
Let G_1 = (Q_1, Σ, δ_1) and G_2 = (Q_2, Σ, δ_2) be state-transition graphs. Graph G_1 is homomorphic (resp. isomorphic) to graph G_2 if there exists a surjection (resp. bijection) f: Q_1 → Q_2 which induces a surjection (resp. bijection) g: δ_1 → δ_2 defined by g((p, a, q)) = (f(p), a, f(q)), p, q ∈ Q_1, a ∈ Σ ∪ {ε}.
Let M_NC(G) = (I, V, goto, I_0, I_{m-1}) with I = {I_0, I_1, ..., I_{m-1}} be the NLR(0) automaton of G. Let GE' = (QE', V, δE') be the Earley state graph constructed by Earley' when it is applied to G and w. Lastly, let GR(M_NC) = (Q, V, δ) be the recognition graph constructed by General_NLR0 when it is applied to G and w. Graph GE' is homomorphic to GR⁻¹ as follows. The function f_1: QE' → Q defined by f_1([A→α.β, j] ∈ S_i) = q_k:i, where I_k = {A→α.β}, is a surjection which induces the surjection g_1 defined by g_1((r, X, s)) = (f_1(s), X, f_1(r)), r, s ∈ QE', X ∈ V ∪ {ε}.
If an STG G_1 is homomorphic to an STG G_2, then an STG Ĝ_1 can be derived from G_1 such that G_1 is homomorphic to Ĝ_1 and Ĝ_1 is isomorphic to G_2. Our comparison of Earley' and General_NLR0 is concluded by defining an STG ĜE' = (Q̂E', V, δ̂E') such that GE' is homomorphic to ĜE' and ĜE' is isomorphic to GR⁻¹.
For 0 ≤ i ≤ n+1, ĜE' is obtained from GE' by merging into a single state, denoted s_k:i, all states [A→α.β, j] ∈ S_i that contain the same item, where I_k = {A→α.β}.
That ĜE' is isomorphic to GR⁻¹ is established as follows. Define the function f_2: Q̂E' → Q by f_2(s_k:i) = q_k:i. The function f_2 is a bijection which induces the bijection g_2 defined by g_2((r, X, s)) = (f_2(s), X, f_2(r)), r, s ∈ Q̂E', X ∈ V ∪ {ε}. Therefore, ĜE' is isomorphic to GR⁻¹.
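The homomorphism test itself is mechanical. The sketch below checks the definition given earlier (f onto Q2, induced g onto delta2) in its plain, non-edge-reversing form, on toy graphs of my own choosing:

```python
# Sketch: is STG G1 = (Q1, d1) homomorphic to G2 = (Q2, d2) under f?
def homomorphic(Q1, d1, Q2, d2, f):
    """f must map Q1 onto Q2, and (p, a, q) -> (f(p), a, f(q)) must map
    d1 onto d2 (surjections in both cases)."""
    if {f[q] for q in Q1} != set(Q2):
        return False                      # f is not onto Q2
    image = {(f[p], a, f[q]) for (p, a, q) in d1}
    return image == set(d2)               # induced g lands in and covers d2

# Collapsing states 1 and 2 of G1 yields G2.
Q1, d1 = {0, 1, 2}, {(0, "a", 1), (0, "a", 2), (1, "b", 0)}
Q2, d2 = {"x", "y"}, {("x", "a", "y"), ("y", "b", "x")}
f = {0: "x", 1: "y", 2: "y"}
```

For the Earley-graph correspondence above one would compare against the reversed recognition graph, since g_1 reverses each transition.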
Implementation Considerations
For the remainder of this chapter, we turn our attention back to the General_LR0 recognizer. In this section, some issues that are pertinent to implementing General_LR0 are addressed. Specifically, means for properly handling graph cycles and for efficiently implementing the relevant set operations and the succ function are discussed. A satisfactory resolution of these issues facilitates the complexity analyses undertaken in the next section.
Graph Cycles
In any application which involves graphs that are not necessarily acyclic, graph cycles are a matter of concern. Neither LR(0) automata nor the recognition graphs constructed by General_LR0 are guaranteed to be acyclic.

Let M_C(G) denote the LR(0) automaton of G and let GR(M_C) denote the recognition graph constructed by General_LR0 when it is applied to G and w. Since all paths in GR are reflected in M_C, albeit in reverse, GR is cyclic only if M_C is also cyclic. However, the converse does not hold; M_C may have cycles that are not replicated in a recognition graph regardless of the input string.
Properties of context-free grammars that give rise to cycles of any kind in LR(0) automata are identified first. Since L(M_C) = VP(G), M_C is cyclic if and only if VP(G) contains strings of unbounded length. Thus, M_C is cyclic if and only if for some A ∈ N, α ∈ V* with α ≠ ε, and γ ∈ T*, A ⇒+ αAγ holds in G. That is, for all i ≥ 0, δα^iA ∈ VP(G) for some δ ∈ V*. Note that α may contain terminal symbols.
Grammatical properties which give rise to those cycles in M_C that can also be reproduced in GR are considered next. Since the above conditions characterize all possible cycles in M_C, a restriction on those conditions is sought. Assume for the moment that GR is cyclic. Given an arbitrary transition in GR of the form (q_k:i, X, q_j:h), we know that h ≤ i. Hence, every transition along a cycle in GR must be of the form (q_k:i, X, q_j:i), and such transitions are generated only by the Traverse function, i.e., only on symbols X such that X ⇒* ε. Whether such a cycle is actually introduced into a recognition graph depends on the input string as well as the subject grammar.
A result by Soisalon-Soininen and Tarhio [40] relating to the concept of a looping LR parser was helpful in identifying the grammatical properties that give rise to cyclic recognition graphs. Looping LR parsers are discussed in conjunction with a method for constructing deterministic LR parsers for some non-LR(k) grammars [2]; this method involves disambiguating multiply-defined parse table entries. A looping LR parser is an LR parser that has a parsing configuration such that all subsequent actions are reductions. The non-LR(k) grammars for which looping LR parsers can be produced (i.e., for some set of disambiguation choices) can be characterized as follows.
Fact 6.1. A looping LR parser can be constructed for G if and only if for some A ∈ N and α, β ∈ V* the following three statements hold in G: (1) A ⇒+ αAβ, (2) α ⇒* ε, and (3) if α = ε, then β ⇒* ε.
Proof. This is the main result presented by Soisalon-Soininen and Tarhio [40]. □
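Fact 6.1 can be tested mechanically. The sketch below (all names and the grammar encoding are mine) computes the nullable symbols, then closes the one-step relation "A directly derives a form with B exposed" under composition while tracking whether the accumulated left context is empty, whether it is nullable, and whether the accumulated right context is nullable; the three conditions of the fact are then read off any A-to-A edge.

```python
# Sketch: test the three conditions of Fact 6.1 for a grammar given as
# (lhs, rhs) pairs; assumes a reduced grammar, as in the text.
def nullable_set(prods):
    nullable, changed = set(), True
    while changed:
        changed = False
        for A, rhs in prods:
            if A not in nullable and all(X in nullable for X in rhs):
                nullable.add(A)
                changed = True
    return nullable

def has_looping_lr_parser(prods, nonterminals):
    nullable = nullable_set(prods)
    # One-step edges A -> B with flags (left_empty, left_nullable, right_nullable).
    step = set()
    for A, rhs in prods:
        for i, B in enumerate(rhs):
            if B in nonterminals:
                left, right = rhs[:i], rhs[i + 1:]
                step.add((A, B, i == 0,
                          all(X in nullable for X in left),
                          all(X in nullable for X in right)))
    # Transitive closure; the flags compose conjunctively along a derivation.
    closed, changed = set(step), True
    while changed:
        changed = False
        for (A, B, le1, ln1, rn1) in list(closed):
            for (B2, C, le2, ln2, rn2) in step:
                edge = (A, C, le1 and le2, ln1 and ln2, rn1 and rn2)
                if B2 == B and edge not in closed:
                    closed.add(edge)
                    changed = True
    # A =>+ alpha A beta with alpha =>* eps, and beta =>* eps if alpha = eps.
    return any(A == C and ln and (not le or rn)
               for (A, C, le, ln, rn) in closed)
```

A cyclic grammar (S → S | a) and a left-nullable one (S → AS | b with A → ε) both satisfy the predicate; plain right recursion over a terminal (S → aS | b) does not.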
In summary, a cycle in M_C is introduced into GR only if it spells a nontrivial string of nullable nonterminal symbols. Paths spelling strings of nullable nonterminals which can cause cycles are introduced into GR by the Traverse function. This is effectively carried out through a traversal of M_C where each state in M_C is considered at most once. Once cycles are present in GR, they are traversed, if at all, in the Reduce function. Specifically, the computation of the succ function implies a traversal of certain paths in GR, including those which contain cycles. An implementation of the succ function which properly deals with cycles in GR is described in a later subsection. In either case, cyclic control automata and recognition graphs do not pose any particular difficulty to General_LR0.
Set Operations
Two sets are maintained by General_LR0 during recognition, viz., Q and δ. Two set operations are used in the process. One operation is that of determining if a particular object is an element of a set. The other operation is that of adding an object to a set. Efficient means for implementing these operations with respect to both Q and δ are described below.
The operations on Q are considered first. We assume that the states in Q_i are stored on a separate linked list for each value of i. Thus, whether or not q_j:i exists in Q can be determined by scanning a list of at most m items. A state is added to Q by simply linking it into the appropriate list. Since m is independent of the input length, both set operations of interest can be performed with respect to Q in constant time.
Membership in Q can be resolved faster using the following scheme. A boolean flag is associated with each state in M_C. The flags are reset to false at the beginning of each iteration of the main for loop in General_LR0. When a state q_j:i is added to Q by either Reduce or Shift in the ith iteration, 0 ≤ i ≤ n, the flag associated with state I_j of M_C is set to true. Membership of q_j:i in Q is then decided by a single flag inspection.
The overhead associated with resetting m boolean flags each time through the loop can be avoided by using integer flags instead. The flags are initialized to -1. When a state q_j:i is added to Q in the ith iteration, 0 ≤ i ≤ n, the flag associated with I_j is set to i. Then q_j:i is in Q if and only if that flag currently holds the value i.
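The integer-flag scheme is the familiar version-stamp trick; a minimal sketch (the class and its names are mine):

```python
# Sketch: integer flags make "is q_j:i in Q?" a single comparison, with no
# per-iteration reset of the m flags.
class MembershipFlags:
    def __init__(self, m):
        self.flag = [-1] * m           # one flag per state of M_C, initially -1

    def add(self, j, i):
        self.flag[j] = i               # q_j:i was added during iteration i

    def contains(self, j, i):
        return self.flag[j] == i       # stale values never match iteration i

flags = MembershipFlags(4)
flags.add(2, 0)
```

Because iteration numbers only increase, a flag left over from an earlier iteration can never be mistaken for a current one.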
Managing the transition set δ is slightly more involved. We assume that all of the transitions out of q_j:i, 0 ≤ j < m, are stored on a linked list associated with that state. Let (q_j:i, A, q_k:h), h ≤ i, be a transition that is to be added to δ. Using bookkeeping similar to that described above for Q, it can be decided whether the transition is already present, and a new transition is added by linking it into the list of q_j:i. In this way, both set operations of interest can be performed with respect to δ in constant time as well.
The succ Function
The last significant aspect of General_LR0 that needs explication is its use of the succ function. This subsection proposes one approach to implementing succ. A revised Reduce function is presented which incorporates the method. The modified function is displayed in Figure 6.3.

Each use of succ in Reduce implies that a traversal of GR is carried out. An auxiliary stack, the Succ_Stack, is used by Reduce to effect this traversal. Each entry in the stack records an intermediate stage in the traversal of GR that is required to compute the succ function.
Consider the reference to the succ function in line 34 of Figure 6.1. Based on properties of control automata and recognition graphs, the following holds: succ(q, αᴿ) = {r ∈ Q | there is a path in GR from q to r spelling αᴿ} = {r ∈ Q | there is a path in GR from q to r of length len(α)}. Motivated by this observation, each entry in Succ_Stack is a triple (r', A, d) where
(1) r' is a state in GR to which some path traversal from q has progressed, (2) A is the left-hand side of the production being reduced, and (3) d is the distance left to go before a state in succ(q, αᴿ) is reached, where 0 ≤ d ≤ len(α).
(1-3) These three lines correspond to lines 28-30 of Figure 6.1.

(4) The Succ_Stack is initially empty.
1. function Reduce(i) // Revised to implement the succ function.
2. δ_subset := δ_i
3. Traverse(Q_i, i)
4. Succ_Stack := ∅
5. while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
6.   if Succ_Stack = ∅ then
7.     (p, X, q) := Remove(δ_subset)
8.     for A→αX.β ∈ ψ(p) such that β ⇒* ε do
9.       Push(Succ_Stack, (q, A, len(α)))
10.    od
11.  else // Succ_Stack ≠ ∅
12.    (r, A, d) := Pop(Succ_Stack)
13.    if d > 0 then // Let entry(ψ(r)) = X.
14.      for r' ∈ Q such that (r, X, r') ∈ δ do
15.        Push(Succ_Stack, (r', A, d-1))
16.      od
17.    else // d = 0, let goto(ψ(r), A) = I_j.
18.      if q_j:i ∉ Q then
19.        Q := Q ∪ {q_j:i}
20.        Traverse({q_j:i}, i)
21.      fi
22.      if (q_j:i, A, r) ∉ δ then
23.        δ := δ ∪ {(q_j:i, A, r)}
24.        δ_subset := δ_subset ∪ {(q_j:i, A, r)}
25.      fi
26.    fi
27.  fi
28. od
29. end

Figure 6.3. A Modified Reduce Function
(5) This while loop corresponds to the while loop at line 31 in Figure 6.1. However, in this case there are two collections to exhaust before the loop terminates.
(6) The true branch of the if statement deals with items in δ_subset and the false branch deals with items in Succ_Stack. The if predicate is written so that items in Succ_Stack have priority over items in δ_subset. Clearly the predicate is true in the first iteration of the while loop.
(7-8) These two lines are the same as lines 32-33 of Figure 6.1.
(9) Instead of invoking the succ function as in line 34 of Figure 6.1, we initiate the graph traversal of GR that is implied by that use of succ. Specifically, (q, A, len(α)) is pushed onto Succ_Stack to record that we want to find the successors of q which are located at the ends of paths of length len(α) from q; moreover, when each of these states is found, a transition on A will be made to it from an appropriate state in Q_i.

(11) The Succ_Stack is not empty, so one of its entries is processed.

(12) An item (r, A, d) is removed from Succ_Stack.

(13-16) If d > 0, then the stage in the traversal of GR that is recorded by (r, A, d) has not progressed far enough. Let entry(ψ(r)) = X. Then every transition out of r is on X. For each state r' ∈ Q such that (r, X, r') is a transition in GR, (r', A, d-1) is pushed onto Succ_Stack. By effectively moving to r', the length of the traversal has been increased by 1. Consequently, the distance remaining is decreased by 1.

(17-25) If d = 0, then r ∈ succ(q, αᴿ) for some q and α referred to in lines 7-9. Lines 18-25 are identical to lines 35-42 of Figure 6.1.
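Stripped of the recognition-graph bookkeeping, the Succ_Stack computation is a depth-bounded traversal: find the states at the ends of paths of exactly length d from q. A sketch with an assumed adjacency-map encoding (transition labels are omitted because, as noted above, every transition out of a recognition-graph state carries that state's entry symbol):

```python
# Sketch: states at path distance exactly d from q, via an explicit stack
# of (state, distance-remaining) pairs; cycles are harmless because the
# remaining distance strictly decreases on every step.
def succ_at_distance(out_edges, q, d):
    found, stack = set(), [(q, d)]
    while stack:
        r, left = stack.pop()
        if left == 0:
            found.add(r)
        else:
            for r2 in out_edges.get(r, ()):
                stack.append((r2, left - 1))
    return found

# A cyclic graph: 0 -> 1, 1 -> 0 and 1 -> 2.
out_edges = {0: [1], 1: [0, 2]}
```

The work done is proportional to the number of such paths, which is the source of the per-reduction blow-up examined in the complexity analysis of the next section.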
The Complexity of Recognition
In this section, some worst-case complexity bounds are established for the General_LR0 recognizer. Specifically, we consider the amount of space and time required by General_LR0, in the worst case, when it is applied to G and w. In the following, it is convenient to assume that w ∈ L(G). In addition, the LR(0) automaton of G, M_C(G), is assumed to have m states.
Bounds on space requirements are derived first. They are useful in determining the time bounds. In both cases, bounds are established for arbitrary G and for arbitrary unambiguous G.
Space Bounds
The space complexity of General_LR0 is determined by placing an upper bound on the number of states and transitions in GR at the point when w is accepted. The sizes of the auxiliary data structures, i.e., Q_subset, δ_subset, and Succ_Stack, are accounted for later.
First, we assume that G is arbitrary. For 0 ≤ i ≤ n+1, Q_i contains at most m states. Hence, GR contains at most m(n+2) ∈ O(n) states.

Consider δ_i for some i, 0 ≤ i ≤ n. Every transition in δ_i leads from a state in Q_i to a state in Q_h for some h ≤ i. Consequently, since each Q_h has at most m states, δ_i has at most m²(i+1) transitions. In addition, δ_{n+1} contains one transition. Thus, there are at most 1 + Σ_{i=0}^{n} m²(i+1) ∈ O(n²) transitions in GR.
Summarizing, GR contains at most O(n) states and O(n²) transitions. Therefore, the space complexity of General_LR0 for arbitrary G is O(n²). An ambiguous grammar that meets this worst-case space bound is the one with production set {S → SS | a}.
The space complexity of General_LR0 remains O(n²) even if G is unambiguous. For example, the unambiguous grammar with production set {S → aSa | a | ε} meets this worst-case space bound.
Time Bounds
The time complexity of General_LR0 is determined by placing an upper bound on the time required to construct GR. It transpires that the complexity of General_LR0 is dominated by the complexity of the Reduce function. The following remarks are made in light of the earlier observations regarding the efficiency of the set operations used by General_LR0.

The main function invokes the Shift and Reduce functions n+1 times each. Thus, the time complexity of General_LR0 is determined from the time spent in these two functions throughout the duration of recognition.
At most m states and m transitions are installed in GR during any one invocation of the Shift function. Thus, over n+1 calls, O(n) time is spent within Shift.
In analyzing the complexity of the Reduce function, the time spent within Traverse is accounted for separately. In any one invocation of Reduce, the Traverse function is called at most m times. That is, in the worst case it is called once for each state in Q_i. Within any one invocation of Traverse, at most m states and m² transitions are added to the recognition graph. Thus, over n+1 calls to Reduce, O(n) time is spent within the Traverse function.
In assessing the contribution of the Reduce function to the time complexity of General_LR0, we first assume that G is unambiguous. For some i, 1 ≤ i ≤ n, consider the ith call to Reduce. The transitions in δ_i, of which there are O(i), are precisely the transitions that are cycled through δ_subset. Although at least one of these transitions must have been generated in the most recent invocation of Shift, for simplicity we assume that all O(i) of them are created by Reduce. Under this assumption, each transition in δ_i results from traversing some path in GR that spells the reversal of some prefix of a production right-hand side. This traversal is effected through the use of the Succ_Stack. Let p = max({len(ω) | A→ω ∈ P}). Thus, at most m²ip entries are cycled through Succ_Stack while all of the reductions relevant to the ith call to Reduce are performed. Together, at most m²i(p+1) items are cycled through δ_subset and Succ_Stack. Since Σ_{i=1}^{n} m²i(p+1) ∈ O(n²), the total time spent in Reduce over n+1 calls is O(n²). Accumulating the total time consumed by Shift, Traverse, and Reduce, we conclude that General_LR0 runs in O(n²) time in the worst case if G is unambiguous.
Now assume that G is arbitrary. Again, we want to determine the total number of items cycled through δ_subset and Succ_Stack during the ith call to Reduce for some i, 1 ≤ i ≤ n.³
³ All of the work is done by Traverse when i = 0 since δ_0 = ∅ when Reduce is called in that instance.
Suppose that a transition (p, X, q) is removed from δ_subset and that A→αX.β ∈ ψ(p). Further suppose that len(αX) = p. While traversing all of the paths in GR that emanate from p, pass through (p, X, q), and spell Xαᴿ, an upper bound on the number of items that are cycled through Succ_Stack is given by Σ_{j=0}^{p-1} i^j ∈ O(i^{p-1}). Since there are O(i) transitions in δ_i that may be reduced back through, O(i^p) entries may be cycled through Succ_Stack during the ith call to Reduce. Since Σ_{i=1}^{n} i^p ∈ O(n^{p+1}), General_LR0 runs in O(n^{p+1}) time in the worst case.
The worst-case running time of General_LR0 does not compare favorably with that of Earley's recognizer. However, the parsing version of General_LR0 also runs in O(n^{p+1}) time in the worst case. As shown in the next chapter, this bound more properly reflects the time required to construct a convenient representation of all the possible parses of an input string. In contrast, the O(n³) bound does not take into account the time required by Earley's algorithm to analyze its more indirectly represented parse forest.
We have not yet accounted for the maximum sizes potentially attained by the auxiliary data structures Q_subset, δ_subset, and Succ_Stack. The set variable Q_subset holds at most m states in either Shift or Traverse. In Reduce, the set variable δ_subset contains at most m²(i+1) transitions. Since access to Succ_Stack follows a LIFO discipline, it contains at most O(i) entries at any time. Therefore, the space required for these structures does not contradict the worst-case space bounds for General_LR0 that were derived above.
On Garbage Collection and Lookahead
Garbage collection and lookahead provide means for improving the efficiency of the General_LR0 recognizer. Garbage collection is relevant to reclaiming the space occupied by states and transitions in GR when they become superfluous to the remainder of the recognition task. Lookahead is used for selectively generating only those states and transitions that are consistent with the current lookahead string. Some basic notions regarding the use of garbage collection and lookahead within General_LR0 are discussed briefly.

Recalling the set-theoretic foundation of General_LR0 helps to motivate the utility of garbage collection. Since GR represents the sets of viable prefixes that are tracked by the recognizer, the notion of a dead state as it applies to MR identifies nonessential states of the recognition graph. Whether GR is considered at line 10 or line 12 of General_LR0, all states that are dead with respect to MR at those points, as well as all transitions emanating from them, are no longer needed. Consequently, the space used by these states and transitions can be reclaimed for later use.

In order to determine an appropriate location within General_LR0 to invoke garbage collection, note that if MR contains no dead states before Reduce is called, then it has no dead states when Reduce terminates. However, the same remark does not apply to the Shift function. In particular, states can become dead during the ith call to Shift, where 0 ≤ i ≤ n, if a proper subset of the states in Q_i have transitions generated to them. Thus, it is convenient to perform garbage collection in conjunction with the Shift function by anticipating the states that become dead as a result of it.
An appropriate place to perform garbage collection is immediately following line 19 in the Shift function. The following simple scheme is sufficient.
(1) Mark all states that are reached in a traversal of GR that begins at the states in Q_subset.
(2) In a second traversal that starts from the states in Q_i \ Q_subset, delete from Q the states that were not marked in step (1) and delete from δ the transitions that emanate from those states.
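A compressed sketch of this two-pass scheme over an adjacency-map encoding of GR (names mine; for brevity the sweep enumerates the unmarked states directly rather than traversing from Q_i \ Q_subset):

```python
# Sketch: mark from the live roots, then sweep unmarked states together
# with the transitions that emanate from them.
def collect_garbage(Q, out_edges, live_roots):
    marked, stack = set(), list(live_roots)
    while stack:                          # pass 1: mark
        q = stack.pop()
        if q not in marked:
            marked.add(q)
            stack.extend(out_edges.get(q, ()))
    for q in Q - marked:                  # pass 2: sweep
        out_edges.pop(q, None)
    return marked

Q = {0, 1, 2, 3}
out_edges = {1: [0], 2: [0], 3: [2]}
live = collect_garbage(Q, out_edges, live_roots=[1])
```

Because the mark pass tolerates revisiting, cycles in GR cause no trouble here, in contrast to the reference-count scheme dismissed below.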
Note that a garbage collection scheme based on reference counts would be far less straightforward due to the self-references which arise from cycles in the recognition graph. Moreover, the simple mark-and-sweep garbage collection procedure outlined above applies readily to General_NLR0 as well.
Although garbage collection can improve the space efficiency of General_LR0, it obviously incurs a time penalty. For 0 ≤ i ≤ n, there are O((i+1)²) states and transitions in GR prior to the ith call of Shift. Thus, the procedure outlined above may be performed in O((i+1)²) time. Observe that this is no worse than the worst-case time complexity of the Reduce function.
In practice, one would probably want to perform garbage collection less often than on every input symbol. Regardless, a similar procedure involving two graph traversals would still apply. The first traversal begins from certain states in the most recently completed state subset Q_i and marks all states reached in the process. In the second traversal, all unmarked states and their outgoing transitions are deleted from the recognition graph.
The basic goal of garbage collection is to contract periodically the size of the recognition graph. As a consequence, space taken up by nonessential states and transitions becomes eligible for reuse. In contrast, the aim of lookahead is to anticipate the states and transitions that are necessary to recognize the input string. In short, lookahead is used within Shift, Reduce, and Traverse to selectively generate those states and transitions that are consistent with the current lookahead string.
In order to make use of lookahead, the items in the control automaton are attributed with appropriate lookahead strings. The literature on the computation and use of lookahead in the context of LR parsers is quite extensive. The type of lookahead typically used in conjunction with LR(0) automata is either SLR(k) lookahead [12] or LALR(k) lookahead [8,11,29].⁴ Without going into detail, the use of k-symbol lookahead in General_LR0⁵ for some k > 0 impacts the following locations in Figure 6.1.
(Line 19) Q_subset is computed to contain only those states q ∈ Q_i such that the shift on a_{i+1} from ψ(q) is consistent with the lookahead string.

(33) Only those reductions are initiated from p that are consistent with the current k-symbol lookahead. This comment also applies to line 8 in Figure 6.3.

(50) Transitions on nullable nonterminal symbols are selectively made based on their consistency with the k-symbol lookahead string.
⁴ Almost invariably, k = 1.
⁵ This is somewhat of a misnomer when lookahead is employed.
The costs of employing lookahead include the space that is needed for storing lookahead strings in the control automaton and the time associated with matching the k-symbol lookahead string from the input string with occurrences of it in the control automaton. If k = 1, as is generally the case, the overhead of using lookahead is not usually an issue.

On the other hand, the benefits of using lookahead can be substantial. Space is saved by reducing the number of states and transitions that are needlessly created. In addition, time is saved that would otherwise be spent generating unnecessary pieces of the recognition graph and traversing paths that would be called for by Reduce in the absence of lookahead. Most significantly, General_LR0 runs in linear space and time if G is an LR(k) grammar, provided that k-symbol lookahead is used.
Discussion
The Earley' and General_LR0 recognizers both construct state-transition graphs. In each case, the STG is used for representing the sets of viable prefixes that are tracked by the General_LR recognition scheme. The graph constructed by Earley', GE', is derived interpretively in the sense that the Earley states that are generated during recognition drive the construction of the graph. In contrast, GR is constructed under the guidance of a precomputed control automaton. This distinction is obscured somewhat by the General_NLR0 recognizer. General_NLR0 constructs a state-transition graph that is quite similar to GE', but does so under the guidance of the NLR(0) automaton of G.
The General_LR0 and General_NLR0 recognizers illustrate extremal examples of a basic approach to general recognition that entails constructing a recognition graph under the guidance of a controlling automaton. In each case, (1) the structure of the recognition graph is mirrored in the control automaton, (2) the recognition graph is used to represent the sets of viable prefixes that are tracked by the General_LR recognition scheme, and (3) the control automaton accepts the viable prefixes of G. Other possible control automata are suggested by the fact that the LR(0) automaton of G can be obtained by applying the subset construction algorithm to the NLR(0) automaton of G. Any automaton intermediate between the NLR(0) and LR(0) automata that is built during subset construction provides a viable candidate for a control automaton. One main advantage of LR(0) automata is their determinism, whereas a favorable feature of NLR(0) automata is their comparatively smaller number of states. Automata that are intermediate between these two extremes can be tailored to balance both of these factors. The choice of possible control automata is broadened still further when lookahead is introduced. An investigation of alternate control automata is left for future work.
Of the known context-free recognition algorithms, General_LR0 is most like Tomita's algorithm without lookahead [42,43]. In this form, Tomita's algorithm interprets a parse table derived from the LR(0) automaton of G and maintains a so-called graph-structured stack that is similar in structure to our recognition graph. However, a transition of the form (p, A, q) is represented by two edges of the form (p, r_A) and (r_A, q) where p, q correspond to parse states and r_A is a symbol vertex. In effect, the symbol vertices play the role of our transition labels. Due to the use of these symbol vertices, the correspondence between the states and edges in the graph-structured stack and the states and transitions of the underlying LR(0) automaton is not as precise as in General_LR0. In addition, the symbol vertices needlessly increase the number of vertices and edges in the graph-structured stack, increase the lengths of paths that are traversed during reductions by a factor of 2, and complicate the operations which manage the stack.

Tomita's algorithm cannot handle cyclic grammars [42]. However, it also fails to handle some noncyclic grammars that contain ε-productions. In short, any grammar that may introduce a cycle into the graph-structured stack is troublesome. These grammars are exactly the grammars that can introduce cycles into our recognition graphs.
Tomita's algorithm independently keeps track of edges that may need to be reduced back through and states that have yet to be acted on (a state is acted on to determine what parse moves are relevant to it). In contrast, other than the special attention given certain nonterminal transitions, General_LR0 uniformly lets the transitions stored in δ_subset drive the reduction process.
The special handling required of nullable nonterminals is common to all general recognizers that allow ε-productions. The manner in which Tomita's algorithm deals with ε-productions is the cause for its limited coverage. For i = 0 to n+1, the states in U_i = ∪_{j≥0} U_{i,j} are generated by Tomita's algorithm as follows (U_i corresponds to our Q_i).
(1) Let j = 0.
(2) If i = 0, then U_{0,0} contains only the start state; otherwise, U_{i,0} is comprised of the states that resulted from shift moves on a_i from states in U_{i-1}.
(3) If all of the states in U_{i,j} have been considered, then all of the reductions have been performed at stage i. The shift moves on a_{i+1} are performed next.
(4) Perform all pending reductions by non-ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j}.
(5) Perform all pending reductions by ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j+1}.
(6) Let j = j+1 and return to step (3).
Thus, reductions by ε-productions are delayed until there are no other reductions to be made. As a consequence of this treatment of ε-productions, the map ψ: U_i → I is not necessarily one-to-one, where I represents the states in the underlying LR(0) automaton. This is an undesirable anomaly that further obfuscates the operation of the algorithm. By comparison, General_LR0 ensures that ψ: Q_i → I is always one-to-one.
The fact that Tomita's algorithm fails to handle some noncyclic grammars with ε-productions was also observed by Nozohoor-Farshi [35]; in particular, grammars for which there exists an A ∈ N such that A ⇒+ αAβ and α ⇒+ ε hold in G, but β ⇒* ε does not hold, are focused on. In order to accept grammars of this kind, a modification to Tomita's algorithm is proposed which allows cycles in the graph-structured stack. The basic approach to handling such cycles is outlined as follows: when a nonterminal transition is installed from a state q ∈ U_i that already existed in the graph, all states in U_i which were previously acted on are reconsidered to see if any reductions from them pass through the new transition. This is apparently sufficient, but the details of how it is accomplished are not provided.
The worst-case time complexity of Tomita's algorithm is also O(n^{p+1}) [26]. In comparison, recall that the complexity of Earley's algorithm is not affected by the length of production right-hand sides. Accompanying the complexity analysis by Kipps [26] is a modified version of Tomita's algorithm that has a worst-case running time in O(n³). In short, additional inter-state links are used for decreasing the number of paths that must be traversed when performing reductions. However, the plethora of set-union and set-membership operations contained in the algorithm does not make it clear that O(n³) time is obtained. In any case, this modification subverts the algorithm's ability to construct a parse forest, so it is only useful for recognition.
CHAPTER VII
A GENERAL BOTTOM-UP PARSER
The General_LR0 recognizer is extended into a general bottom-up parser in this chapter. The transformation from general recognizer to general parser is straightforward in all but one respect: some effort must be expended to parse arbitrary derivations of the empty string. Briefly, a parse of an input string is represented by appropriately annotating the transitions of the recognition graph. Ambiguity is accommodated by attaching multiple annotations to relevant transitions. As usual, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T \ {$} for 1 ≤ i ≤ n, with a_{n+1} = $, are assumed.
From Recognition to Parsing
Implementations of deterministic bottom-up parsers, of which LR parsers are exemplary, are not obliged to build an explicit parse tree for the input string. Whether or not a parse tree is indeed constructed is primarily dictated by the requirements of the application to which the parser is applied. Other influential factors include memory constraints and the interface between the parser and other processing components.
In contrast, general bottom-up parsers typically cannot avoid explicit parse tree representations. When parsing with a nondeterministic grammar, a forest of parse trees, rather than an identifiably unique tree, is typically relevant to the input string. Due to theoretical limitations on the discrimination afforded by lookahead, this behavior is even observed with unambiguous grammars. In any case, some representation of the parse forest must be built during parsing so that a unique parse can eventually be produced.
In light of these observations, the parsing version of General_LR0, General_LR0′, overtly maintains a representation of a parse forest. The manner in which this is accomplished is a simple generalization of the following proposed scheme for explicitly constructing a parse tree within an LR parser.
Suppose that G is an LR grammar. We consider a hypothetical LR parser for G and describe one way to explicitly build a parse tree for an input string in conjunction with the parse stack. We may assume that the parser is based on some LR automaton for G, say M. At any point during a parse, the contents of the stack is a sequence of states from M. The parse tree that is synthesized during parsing is represented by associating a node in the tree with each state in the stack other than the bottommost state.
Let the contents of the stack at some point be s_0 s_1 ⋯ s_m, m ≥ 0, where each s_i is a state of M; in particular, s_0 is the start state of M. For 1 ≤ i ≤ m, let X_i be the entry symbol for state s_i. Thus, X_1 X_2 ⋯ X_m is the viable prefix of G that is implicitly represented by the supposed stack contents. If m = 0, the relevant viable prefix is ε. For 1 ≤ i ≤ m, a parse tree node is associated with state s_i.
A shift action always creates a new leaf node. Suppose that the current input symbol is a and the next action of the parser is to shift a from s_m. As a result of this action, the contents of the stack becomes s_0 s_1 ⋯ s_m t_1, where goto(s_m, a) = t_1. As a side effect, a new parse tree node is generated, labeled with a, and attached to t_1 in the stack.
A reduce action typically generates one internal node. However, when reducing by an ε-production, a leaf node is also created. Suppose that the next action called for by the parser is to reduce by production A → ε. This action transforms the contents of the stack to s_0 s_1 ⋯ s_m t_2, where goto(s_m, A) = t_2. Two new tree nodes are generated as a side effect. One tree node is a leaf that is labeled with ε. The second is an internal tree node; it is labeled with A, set to point to the new leaf, and attached to t_2.
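The stack/tree coupling described above can be sketched concretely. The following Python fragment is ours, not the dissertation's: it drives hand-built ACTION/GOTO tables (an assumption, written for the hypothetical toy grammar S′ → S$, S → aSb | ε) and pairs each stack entry above the bottom with the tree node for its entry symbol, exactly as in the scheme just described.

```python
# Sketch: building an explicit parse tree alongside an LR parse stack.
# Toy grammar (hypothetical example): S' -> S $ ,  S -> a S b | epsilon.
# ACTION/GOTO below were hand-built for this grammar's LR(0) automaton.

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)
    def __repr__(self):
        return self.label if not self.children else \
            f"{self.label}({','.join(map(repr, self.children))})"

# ACTION[state][lookahead]: ('shift', state) | ('reduce', lhs, rhs_len) | ('accept',)
ACTION = {
    0: {'a': ('shift', 2), 'b': ('reduce', 'S', 0), '$': ('reduce', 'S', 0)},
    1: {'$': ('accept',)},
    2: {'a': ('shift', 2), 'b': ('reduce', 'S', 0), '$': ('reduce', 'S', 0)},
    3: {'b': ('shift', 4)},
    4: {'b': ('reduce', 'S', 3), '$': ('reduce', 'S', 3)},
}
GOTO = {(0, 'S'): 1, (2, 'S'): 3}

def parse(tokens):
    """Each stack entry above the bottom carries the node for its entry symbol."""
    stack = [(0, None)]                      # bottommost state has no node
    i = 0
    while True:
        state, la = stack[-1][0], tokens[i]
        act = ACTION[state][la]
        if act[0] == 'shift':                # shift: create a new leaf
            stack.append((act[1], Node(la)))
            i += 1
        elif act[0] == 'reduce':             # reduce: create an internal node
            lhs, k = act[1], act[2]
            if k == 0:                       # epsilon-reduction also makes a leaf
                children = [Node('ε')]
            else:
                children = [nd for _, nd in stack[-k:]]
                del stack[-k:]
            stack.append((GOTO[(stack[-1][0], lhs)], Node(lhs, children)))
        else:                                # accept: top node is the parse tree
            return stack[-1][1]
```

For input `ab$` this yields the tree `S(a,S(ε),b)`; the ε-reduction in the middle of the parse creates both the ε leaf and its parent node, as in the scheme above.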

Our immediate goal is to describe general bottom-up left-to-right recognition as the inverse of general top-down right-to-left recognition, with the viable prefix being the central unifying concept. From that standpoint, it is undesirable for the reduce and shift relations to stray outside of VP(G). Consequently, these two relations are redefined to explicitly restrict them to VP(G) as follows: ⊨ = {(αω, αA) | α ∈ V*, A → ω ∈ P, αA ∈ VP(G)} and ⊣ = {(α, αa) | α ∈ V*, a ∈ T, αa ∈ VP(G)}. From the closure result of Lemma 4.8, restricting the ranges of these two relations to VP(G) effectively restricts their domains to VP(G) as well. Henceforth, these new restricted versions of ⊨ and ⊣ are in effect at all times.
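The effect of the restriction can be made concrete by brute force. The sketch below is ours, not the dissertation's: for a hypothetical toy grammar (S′ → S$, S → aSb | ε, with S′ written as Z), it first enumerates VP(G) directly from rightmost derivations and then iterates the restricted reduce and shift relations starting from ε, in the spirit of the characterization given next in Lemma 4.10. All names and the length bound MAXLEN are assumptions of this illustration.

```python
# Brute-force illustration (ours) of the restricted reduce (|=) and shift (-|)
# relations for the toy grammar  S' -> S $ ,  S -> a S b | epsilon  (S' is Z).

PRODS = [('Z', 'S$'), ('S', 'aSb'), ('S', '')]
NONTERMS = {'Z', 'S'}
MAXLEN = 6          # bound everything to strings of length <= MAXLEN

def viable_prefixes():
    """Prefixes of gamma*delta for rightmost steps S' =>r* gamma A w =>r gamma delta w."""
    vp, forms = {''}, {'Z'}
    for _ in range(2 * MAXLEN):                 # enough derivation depth
        new = set()
        for form in forms:
            pos = max((k for k, c in enumerate(form) if c in NONTERMS), default=-1)
            if pos < 0:
                continue
            for lhs, rhs in PRODS:
                if lhs == form[pos]:
                    gd = form[:pos] + rhs       # gamma*delta; suffix is terminal
                    new.add(gd + form[pos + 1:])
                    vp.update(gd[:k] for k in range(len(gd) + 1))
        forms |= new
    return {a for a in vp if len(a) <= MAXLEN}

VP = viable_prefixes()

def step(alpha):
    """All beta with alpha |= beta (reduce) or alpha -| beta (shift), within VP(G)."""
    out = {alpha + a for a in 'ab$' if alpha + a in VP}          # shift
    for lhs, rhs in PRODS:                                       # reduce
        if alpha.endswith(rhs):
            cand = alpha[:len(alpha) - len(rhs)] + lhs
            if cand in VP:
                out.add(cand)
    return {b for b in out if len(b) <= MAXLEN}

reach, frontier = {''}, {''}
while frontier:
    frontier = set().union(*map(step, frontier)) - reach
    reach |= frontier
```

Because the ranges are restricted to VP(G), iterating ⊨ and ⊣ from ε reproduces exactly the (length-bounded) set of viable prefixes; the strings reachable from ε never stray outside VP(G).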
Lemma 4.10 VP(G) = {α ∈ V* | ε (⊨* ⊣)^x ⊨* α holds in G for some x ∈ T*}.
Proof. Since the ⊨ and ⊣ relations are restricted to VP(G), it is clear that any string α ∈ V* such that ε (⊨* ⊣)^x ⊨* α holds in G for some x ∈ T* is a viable prefix of G. In order to show that every viable prefix of G is similarly produced, let α be an arbitrary member of VP(G). From Theorem 3.20, ω (⇒ᵣ* ⊢)^z ⇒ᵣ* α holds in G for some S → ω ∈ P and z ∈ T*. Since G is reduced, α (⇒ᵣ* ⊢)^x ⇒ᵣ* ε holds in G for some x ∈ T* (implying xz ∈ L(G)). It follows from Lemma 4.9 that ε (⊨* ⊣)^x ⊨* α holds in G.
Corollary L(G) = {w ∈ T* | ε (⊨* ⊣)^w ⊨* ω holds in G for some S → ω ∈ P}.
Corollary PREFIX(G) = {x ∈ T* | ε (⊨* ⊣)^x ⊨* α holds in G for some α ∈ V*}.
Finally, the following lemma motivates, ex post facto, the relation product (⊨* ⊣).
Lemma 4.11 For α ∈ VP(G), at least one of the following two statements is true: (1) α ⊨* β ⊣ βa holds in G for some β ∈ V* and a ∈ T; (2) α ⊨* ω holds in G for some S → ω ∈ P.
Proof. By Theorem 3.20, ω (⇒ᵣ* ⊢)^z ⇒ᵣ* α holds in G for some S → ω ∈ P and z ∈ T*. By Lemma 4.9, α (⊨* ⊣)^z ⊨* ω also holds in G. If z = ε, then α ⊨* ω holds, which demonstrates that statement (2) is true. Otherwise, z = ay for some a ∈ T and y ∈ T*. In this case, α ⊨* β ⊣ βa (⊨* ⊣)^y ⊨* ω holds in G for some β ∈ V*. This last expression implies that α ⊨* β ⊣ βa holds for some β ∈ V*, so statement (1) is true.
Basis (m = 0). Thus, i = 0, s_m = s_0 = [S′ → ·S$, 0] ∈ S_0, and γ = ε. The consequent trivially holds in this case.
Induction (m > 0). Two cases are analyzed based on whether or not α = ε.
Case (i): α = ε. In this case, j = i and s_m was added to S_i by the Predictor. Thus, s_{m-1} = [B → κ·Aτ, j′] ∈ S_i for some B → κAτ ∈ P and j′, 0 ≤ j′ ≤ i. Let p′ = (s_0, s_1, …, s_{m-1}). Clearly p′ also spells γ. By Fact 5.2, κ ⇒ᵣ* a_{j′+1} a_{j′+2} ⋯ a_i holds in G. By the induction hypothesis, γ = δκ for some δ ∈ V* such that S′ ⇒ᵣ* δBy and δ ⇒* a_1 a_2 ⋯ a_{j′} hold in G for some y ∈ T*. That is, γ ∈ VP(G, i:w) and [B → κ·Aτ, j′] ∈ S_i is valid for γ. Since δBy ⇒ᵣ* δκAxy holds in G for some x ∈ T*, [A → ·β, i] ∈ S_i is also valid for γ ∈ VP(G, i:w).
Case (ii): α ≠ ε. Therefore, s_m was added to S_i by either the Scanner or the Completer, i.e., α = α′X for some α′ ∈ V* and X ∈ V. Let s_{m-1} = [A → α′·Xβ, j] ∈ S_{i′} for some i′, j ≤ i′ ≤ i, and let p′ = (s_0, s_1, …, s_{m-1}). By Lemma 5.1, p′ spells δα′ for some δ ∈ V* such that δα′X = γ. By Fact 5.2, α′ ⇒ᵣ* a_{j+1} a_{j+2} ⋯ a_{i′} holds in G. Therefore, by the induction hypothesis, S′ ⇒ᵣ* δAy and δ ⇒* a_1 a_2 ⋯ a_j hold in G for some y ∈ T*. That is, δα′ ∈ VP(G, i′:w) and [A → α′·Xβ, j] ∈ S_{i′} is valid for δα′. If X ∈ T, then X = a_i and i′ = i−1. If X ∈ N, then X ⇒* a_{i′+1} a_{i′+2} ⋯ a_i holds in G. Consequently, γ ∈ VP(G, i:w) and [A → α′X·β, j] ∈ S_i is valid for γ.
Corollary Let p = (s_0, s_1, …, s_m), m ≥ 0, be a rooted path in G_E′ such that p spells γ ∈ V* and s_m = [A → α·β, j] ∈ basis(S_i) for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ PVP(G, i:w).
Proof. If m = 0, then γ = ε and i = 0. By definition, ε ∈ PVP(G, 0:w). Otherwise, suppose that m > 0. Since s_m ∈ basis(S_i), the last transition in p is on a_i ∈ T, i.e., i > 0 and γ = γ_1 a_i for some γ_1 ∈ V*. Therefore, γ ∈ PVP(G, i:w).
The next lemma provides the converse to Lemma 5.2.
Lemma 5.3 Let γ be a string in VP(G, i:w) and let [A → α·β, j] ∈ S_i be a state which is valid for γ for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there is a rooted path in G_E′ to [A → α·β, j] ∈ S_i which spells γ.
nonterminal transitions, General_LR0 uniformly lets the transitions stored in δ_subset drive the reduction process.
The special handling required of nullable nonterminals is common to all general recognizers that allow ε-productions. The manner in which Tomita's algorithm deals with ε-productions is the cause of its limited coverage. For i = 0 to n+1, the states in U_i = ⋃_{j≥0} U_{i,j} are generated by Tomita's algorithm as follows (U_i corresponds to our Q_i).
(1) Let j = 0.
(2) If i = 0, then U_{0,0} contains only the start state; otherwise, U_{i,0} is comprised of the states that resulted from shift moves on a_i from states in U_{i-1}.
(3) If all of the states in U_{i,j} have been considered, then all of the reductions have been performed at stage i. The shift moves on a_{i+1} are performed next.
(4) Perform all pending reductions by non-ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j}.
(5) Perform all pending reductions by ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j+1}.
(6) Let j = j+1 and return to step (3).
Thus, reductions by ε-productions are delayed until there are no other reductions to be made. As a consequence of this treatment of ε-productions, ψ: U_i → I is not necessarily one-to-one, where I represents the set of states of the underlying LR(0) automaton. This is an undesirable anomaly that further obfuscates the operation of the algorithm. By comparison, General_LR0 ensures that ψ is always one-to-one.
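The staging of steps (1)-(6) can be rendered schematically. The sketch below is ours, with toy integer stand-ins for parse states and table-driven reduction maps, not Tomita's actual data structures: step (4) closes U_{i,j} under non-ε reductions, step (5) lets ε-reductions feed U_{i,j+1}, and the stage ends when nothing new appears.

```python
# Schematic sketch (ours) of Tomita's U_{i,j} staging for one input position i.

def tomita_stage(initial, red, eps_red):
    """Return the list [U_{i,0}, U_{i,1}, ...] produced by steps (1)-(6)."""
    U = [set(initial)]                    # (1)-(2): U_{i,0}
    j = 0
    while True:
        work = list(U[j])                 # (4): close U_{i,j} under non-eps reductions
        while work:
            s = work.pop()
            for t in red(s):
                if t not in U[j]:
                    U[j].add(t)
                    work.append(t)
        new = set()                       # (5): eps-reductions feed U_{i,j+1}
        for s in U[j]:
            new |= eps_red(s)
        new -= set().union(*U)
        if not new:                       # (3): nothing left to reduce at stage i
            return U
        U.append(new)                     # (6)
        j += 1

# Toy reduction maps: non-eps reductions 0 -> 2 and 3 -> 5; eps-reduction 2 -> 3
# (performed only after the others are exhausted).
RED, EPS = {0: {2}, 3: {5}}, {2: {3}}
stages = tomita_stage({0}, lambda s: RED.get(s, set()), lambda s: EPS.get(s, set()))
```

Here `stages == [{0, 2}, {3, 5}]`: the ε-reduction out of state 2 is delayed to the second subset, which is exactly why ψ: U_i → I need not be one-to-one under Tomita's staging.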
Given the impracticality of scanning input strings from right to left, it is worth reflecting on why strong rightmost derivations were chosen over strong leftmost derivations as a point of origin. If General_LL had been developed first, the evolution from General_LL to General_RR certainly would have been no more involved than the progression in the other direction. However, strong rightmost derivations were favored from the outset because viable prefixes are considerably more ingrained in the literature than are viable suffixes.5 In addition, the bottom-up left-to-right counterpart to General_RR that is developed in the next chapter is derived directly from General_RR. Considerable attention is devoted to this derivative of the General_RR recognition scheme in the rest of this work.
5 To date, we have yet to find a reference to Sippu and Soisalon-Soininen [38] in the literature.
If A → ω is a production in P, then A → α·β is an item of G for each α and β such that ω = αβ. The size of G is defined as |G| = Σ{len(Aω) | A → ω ∈ P}. Note that the size of G is equivalent to |{A → α·β | A → αβ is an item of G}|. The reversal of G is the grammar G^R = (V, T, P^R, S) where P^R = {A → ω^R | A → ω ∈ P}.
The derives relation (⇒), a binary relation induced on V* by P, is defined formally by ⇒ = {(αAβ, αωβ) | α, β ∈ V*, A → ω ∈ P}. A string γ ∈ V* such that S ⇒* γ holds1 in G is called a sentential form of G; the set of the sentential forms of G is denoted by SF(G). The (context-free) language that is generated by G is defined by L(G) = SF(G) ∩ T*. Each member of L(G) is called a sentence of G. We use PREFIX(G) and SUFFIX(G) as abbreviations for PREFIX(L(G)) and SUFFIX(L(G)), respectively.
For A ∈ N and X ∈ V, if A ⇒+ αXβ holds in G for some α, β ∈ V*, then X is reachable from A. A symbol X ∈ V is nullable if X ⇒* ε holds in G. A string γ ∈ V* is nullable if every symbol in γ is nullable. In particular, ε is trivially nullable.
A symbol X ∈ V is useful if either X = S or S ⇒* αXβ ⇒* w holds in G for some α, β ∈ V* and w ∈ T*; otherwise, X is useless. A grammar is reduced if every symbol in its vocabulary is useful. An arbitrary grammar G can be transformed into an equivalent reduced grammar2 in O(|G|) time [39]. In light of this result and for the convenience that it provides, all grammars are assumed to be reduced throughout this work.
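The two usefulness conditions can be checked by a pair of fixed-point passes: first keep symbols that generate some terminal string, then keep those reachable from S. The sketch below is ours, a simple quadratic version rather than the O(|G|)-time algorithm of [39]; the encoding of productions as (lhs, rhs-tuple) pairs is an assumption of this illustration, with nonterminals taken to be exactly the left-hand sides.

```python
def reduce_grammar(prods, start):
    """prods: list of (lhs, rhs-tuple); returns the productions of an
    equivalent reduced grammar (simple fixed-point sketch)."""
    nonterms = {a for a, _ in prods}
    def ok(x, gen):                       # x is a terminal or a generating nonterminal
        return x not in nonterms or x in gen
    gen, changed = set(), True            # pass 1: generating symbols
    while changed:
        changed = False
        for a, rhs in prods:
            if a not in gen and all(ok(x, gen) for x in rhs):
                gen.add(a)
                changed = True
    prods = [(a, rhs) for a, rhs in prods
             if a in gen and all(ok(x, gen) for x in rhs)]
    reach, changed = {start}, True        # pass 2: symbols reachable from start
    while changed:
        changed = False
        for a, rhs in prods:
            if a in reach:
                for x in rhs:
                    if x in nonterms and x not in reach:
                        reach.add(x)
                        changed = True
    return [(a, rhs) for a, rhs in prods if a in reach]
```

For example, with productions S → Ab, A → a, B → S, C → C, the first pass drops C → C (C generates nothing) and the second drops B → S (B is unreachable from S).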
A grammar G is $-augmented if, for distinguished symbols S′ and $, P contains a production of the form S′ → S$ where S′ ∈ V is the (new) start symbol and $ ∈ T is a sentence endmarker. Moreover, S′ → S$ is the only production in which S′ and $ occur. Whenever we are working with a $-augmented grammar, all input strings are assumed to end with $.
1 The transitive (resp. reflexive-transitive) closure of a binary relation is denoted by + (resp. *).
2 For our purposes, two grammars are equivalent if they generate the same language.
empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ PREFIX(G) both hold, then VS_LL(G, w) is found not to contain ε. In either case, w is rejected.
The correctness of the General_LL recognition scheme is formally established in the following two lemmas.
Lemma 3.40 Let w ∈ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL accepts w.
Proof. Since every prefix of w is in PREFIX(G), PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). Thus, the for loop completes len(w) iterations. By assumption, w ∈ L(G), so ε ∈ VS_LL(G, w). Consequently, w is accepted by General_LL in the second if statement.
Lemma 3.41 Let w ∉ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL rejects w.
Proof. There are two cases to consider depending on whether or not w ∈ PREFIX(G).
Case (i): w ∈ PREFIX(G). In this case, PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w), so the for loop completes len(w) iterations. Since w ∉ L(G) by assumption, ε ∉ VS_LL(G, w). Therefore, w is rejected by General_LL in the if statement that follows the for loop.
Case (ii): w ∉ PREFIX(G). Let x ∈ T* be the unique string which is the longest prefix of w such that x ∈ PREFIX(G) holds. Let len(x) = m and note that 0 ≤ m < len(w). Since PVS_LL(G, i:x) and VS_LL(G, i:x) are nonempty for all i, 0 ≤ i ≤ m, General_LL completes m iterations. During the (m+1)st iteration, PVS_LL(G, (m+1):w) = ∅ is computed. Therefore, w is rejected by General_LL in the first if statement.
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LL are identified in the following.
The inverse of G is denoted by G⁻¹ = (Q, Σ, δ⁻¹), where (p, a, q) ∈ δ⁻¹ if and only if (q, a, p) ∈ δ; i.e., the transitions of G are reversed in G⁻¹.
A finite-state automaton (FSA) is denoted by M = (G, q_0, F) = (Q, Σ, δ, q_0, F), where G = (Q, Σ, δ) is an STG, q_0 ∈ Q is the start state, and F ⊆ Q is the set of final states. Each state in Q is assumed to be reachable from q_0. If G is ε-free, then M is also ε-free. If M is ε-free and (p, a, q), (p, a, r) ∈ δ implies that q = r, then M is deterministic. An arbitrary (resp. deterministic) FSA is called an NFA (resp. DFA). The (regular) language accepted by M is defined by L(M) = {x ∈ Σ* | succ(q_0, x) ∩ F ≠ ∅}. A state q ∈ Q is dead if no final state is reachable from it.
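These definitions translate directly into executable form. The sketch below is ours, not part of the dissertation: an NFA is given as a dictionary mapping (state, symbol) pairs to successor sets, and succ, membership in L(M), and dead states are computed exactly as defined above.

```python
# Small executable rendering (ours) of succ, L(M) membership, and dead states
# for an NFA given as delta: (state, symbol) -> set of successor states.

def succ(delta, states, x):
    for a in x:
        states = set().union(*(delta.get((q, a), set()) for q in states))
    return states

def accepts(delta, q0, F, x):
    return bool(succ(delta, {q0}, x) & F)      # succ(q0, x) meets F

def dead_states(Q, delta, F):
    """A state is dead iff no final state is reachable from it."""
    adj = {}
    for (p, _), qs in delta.items():
        adj.setdefault(p, set()).update(qs)
    dead = set()
    for q in Q:
        seen, stack = {q}, [q]                 # depth-first reachability from q
        while stack:
            for r in adj.get(stack.pop(), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        if not (seen & F):
            dead.add(q)
    return dead
```

For instance, with transitions (0,a)→{0,1}, (0,b)→{2}, (1,b)→{1}, (2,a)→{2} and F = {1}, the string `a` is accepted, `b` is not, and state 2 is dead.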
28. function Reduce(i)
29.   δ_subset := δ_i
30.   Traverse(Q_i, i)
31.   while δ_subset ≠ ∅ do
32.     (p, X, q) := Remove(δ_subset)
33.     if A → αX·β ∈ ψ(p) such that β ⇒* ε then
34.       for r ∈ succ(q, α^R) do  // Let goto(ψ(r), A) = I_j.
35.         if q_{j:i} ∉ Q then
36.           Q := Q ∪ {q_{j:i}}
37.           Traverse({q_{j:i}}, i)
38.         fi
39.         if (q_{j:i}, A, r) ∉ δ_i then
40.           δ_i := δ_i ∪ {(q_{j:i}, A, r)}
41.           δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
42.         fi
43.       od
44.     fi
45.   od
46. end
47. function Traverse(Q_subset, i)
48.   while Q_subset ≠ ∅ do
49.     q := Remove(Q_subset)
50.     for goto(ψ(q), X) = I_j such that X ⇒* ε do  // X ∈ N ∪ {ε}
51.       if q_{j:i} ∉ Q then
52.         Q := Q ∪ {q_{j:i}}
53.         Q_subset := Q_subset ∪ {q_{j:i}}
54.       fi
55.       δ_i := δ_i ∪ {(q_{j:i}, X, q)}  // Never redundant.
56.     od
57.   od
58. end
Figure 6.2 continued
Relationship to Earley's Algorithm
A connection between Earley's algorithm and General_NLR0 is established. The link between these two algorithms is made indirectly through Earley′. Specifically, we describe a correspondence between the Earley state graph constructed by Earley′ and the recognition graph constructed by General_NLR0.
Let G_1 = (Q_1, Σ, δ_1) and G_2 = (Q_2, Σ, δ_2) be state-transition graphs. Graph G_1 is homomorphic (resp. isomorphic) to graph G_2 if there exists a surjection (resp. bijection)
LIST OF FIGURES
FIGURE Page
3.1 A General Top-Down Correct-Suffix Recognizer 24
3.2 A General Top-Down Correct-Prefix Recognizer 33
4.1 A General Bottom-Up Correct-Prefix Recognizer 43
5.1 Earley's General Recognizer 49
5.2 A Modified Earley Recognizer 52
5.3 The Definition of the State Derivative of a Path 56
6.1 The General_LR0 Recognizer 63
6.2 The General_NLR0 Recognizer 73
6.3 A Modified Reduce Function 80
7.1 The General_LR0 Parser 97
The incremental aspect of General_RR becomes apparent in the computation of a set of primitive RR-associates. Specifically, given VP_RR(G, z) and a ∈ T, PVP_RR(G, az) is obtained by an application of the ⊢_a relation, since PVP_RR(G, az) = {β ∈ V* | α ⊢_a β holds in G for some α ∈ VP_RR(G, z)}. It is apparent that PVP_RR(G, z) and VP_RR(G, z) are both nonempty if and only if z ∈ SUFFIX(G). The computation of the primitive RR-associates of ε, a suffix of every w ∈ T*, serves as the initialization step. Specifically, PVP_RR(G, ε) = {ω | S → ω ∈ P}.
Lastly, the conditions for termination of General_RR are specified. First suppose that w ∈ L(G). In this case, VP_RR(G, w) is the last set of RR-associates computed; after it is in place, w is accepted based on the fact that ε ∈ VP_RR(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ SUFFIX(G) also holds, then there is a unique string z ∈ T* which is the shortest suffix of w such that z ∉ SUFFIX(G) holds. In this case, PVP_RR(G, z) is the first empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ SUFFIX(G) both hold, then ε ∉ VP_RR(G, w) by definition. In either case, the input string is rejected.
The correctness of the General_RR recognition scheme is formally established in the following two lemmas. The supporting arguments are quite straightforward given the collective results to this point.
Lemma 3.21 Let w ∈ L(G) be arbitrary. If General_RR is applied to G and w, then General_RR accepts w.
Proof. By definition, PVP_RR(G, w:i) and VP_RR(G, w:i) are nonempty for all i, 0 ≤ i ≤ len(w). Thus, the for loop completes len(w) iterations. Since w ∈ L(G) by assumption, ε ∈ VP_RR(G, w). Therefore, w is accepted by General_RR in the second if statement.
Lemma 3.22 Let w ∉ L(G) be arbitrary. If General_RR is applied to G and w, then General_RR rejects w.
Proof. There are two cases to consider based on whether or not w is in SUFFIX(G).
The costs of employing lookahead include the space that is needed for storing lookahead strings in the control automaton and the time associated with matching the k-symbol lookahead string from the input with its occurrences in the control automaton. If k = 1, as is generally the case, the overhead of using lookahead is not usually an issue.
On the other hand, the benefits of using lookahead can be substantial. Space is saved by reducing the number of states and transitions that are needlessly created. In addition, time is saved that would otherwise be spent generating unnecessary pieces of the recognition graph and traversing paths that would be called for by Reduce in the absence of lookahead. Most significantly, General_LR0 runs in linear space and time if G is an LR(k) grammar, provided that k-symbol lookahead is used.
Discussion
The Earley′ and General_LR0 recognizers both construct state-transition graphs. In each case, the STG is used for representing the sets of viable prefixes that are tracked by the General_LR recognition scheme. The graph constructed by Earley′, G_E′, is derived interpretively in the sense that the Earley states that are generated during recognition drive the construction of the graph. In contrast, G_R is constructed under the guidance of a precomputed control automaton. This distinction is obscured somewhat by the General_NLR0 recognizer. General_NLR0 constructs a state-transition graph that is quite similar to G_E′, but does so under the guidance of the NLR(0) automaton of G.
The General_LR0 and General_NLR0 recognizers illustrate extremal examples of a basic approach to general recognition that entails constructing a recognition graph under the guidance of a controlling automaton. In each case, (1) the structure of the recognition graph is mirrored in the control automaton, (2) the recognition graph is used to represent the sets of viable prefixes that are tracked by the General_LR recognition scheme, and (3) the control automaton accepts the viable prefixes of G. Other possible control automata are suggested by the fact that the LR(0) automaton of G can be obtained by applying the subset construction to the NLR(0) automaton of G.
A → αX·β ∈ ψ(p). Further suppose that len(αX) = p. While traversing all of the paths in G_R that emanate from p, pass through (p, X, q), and spell Xα^R, an upper bound on the number of items that are cycled through Succ_Stack is given by Σ_{j=0}^{p-1} i^j ∈ O(i^{p-1}). Since there are O(i) transitions in δ_i that may be reduced back through, O(i^p) entries may be cycled through Succ_Stack during the i-th call to Reduce. Since Σ_{i=1}^{n} i^p ∈ O(n^{p+1}), General_LR0 runs in O(n^{p+1}) time in the worst case.
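The closing summation can be checked numerically. The snippet below is ours, a sanity check rather than a proof: it confirms that Σ_{i=1}^{n} i^p ≤ n^{p+1} for sample values of p and n, and that the ratio of the two sides approaches 1/(p+1), which is what places the sum in O(n^{p+1}).

```python
# Numerical sanity check that sum_{i=1}^{n} i**p lies in O(n**(p+1)).
for p in (1, 2, 3):
    for n in (10, 100, 1000):
        s = sum(i ** p for i in range(1, n + 1))
        assert s <= n ** (p + 1)                        # the bound itself
        if n == 1000:                                   # ratio tends to 1/(p+1)
            assert abs(s / n ** (p + 1) - 1 / (p + 1)) < 0.01
```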
The worst-case running time of General_LR0 does not compare favorably with Earley's recognizer. However, the parsing version of General_LR0 also runs in O(n^{p+1}) in the worst case. As shown in the next chapter, this bound more properly reflects the time required to construct a convenient representation of all the possible parses of an input string. In contrast, the O(n^3) bound does not take into account the time required by Earley's algorithm to analyze its more indirectly represented parse forest.
We have not yet accounted for the maximum sizes potentially attained by the auxiliary data structures Q_subset, δ_subset, and Succ_Stack. The set variable Q_subset holds at most m states in either Shift or Traverse. In Reduce, the set variable δ_subset contains at most m²(i+1) transitions. Since access to Succ_Stack follows a FIFO discipline, it contains at most O(i) entries at any time. Therefore, the space required for these structures does not contradict the worst-case space bounds for General_LR0 that were derived above.
On Garbage Collection and Lookahead
Garbage collection and lookahead provide means for improving the efficiency of the General_LR0 recognizer. Garbage collection is relevant to reclaiming the space occupied by states and transitions in G_R when they become superfluous to the remainder of the recognition task. Lookahead is used for selectively generating only those states and transitions that are consistent with the current lookahead string. Some basic notions regarding the use of garbage collection and lookahead within General_LR0 are discussed briefly.
1. function General_LR0(G = (V, T, P, S); w ∈ T*)
2.   // w = a_1 a_2 ⋯ a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_{n+1} = $.
3.   // Let M_C(G) = (I, V, goto, I_0, I) be the LR(0) automaton for G.
4.   // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
5.   Q, δ := {q_{0:0}}, ∅  // Initialize G_R.
6.   // Let M_R = (G_R⁻¹, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
7.   for i := 0 to n do
8.     // Let M_R = (G_R⁻¹, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
9.     Reduce(i)
10.    // Let M_R = (G_R⁻¹, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.    Shift(i)
12.    // Let M_R = (G_R⁻¹, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.    if Q_{i+1} = ∅ then Reject(w) fi
14.  od
15.  // Let M_R = (G_R⁻¹, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.  Accept(w)
17. end
18. function Shift(i)
19.   Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.   while Q_subset ≠ ∅ do
21.     q := Remove(Q_subset)  // Let goto(ψ(q), a_{i+1}) = I_j.
22.     if q_{j:i+1} ∉ Q then
23.       Q := Q ∪ {q_{j:i+1}}
24.     fi
25.     δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)}  // Never redundant.
26.   od
27. end
Figure 6.1 The General_LR0 Recognizer
Throughout its evolution, the structure of G_R is paramount. Certain intermediate stages in its construction hold particular interest. At each of these points, an FSA may be defined in terms of G_R⁻¹ which accepts one of the sets of viable prefix associates that is computed by the General_LR recognition scheme. The FSA derived from G_R⁻¹ is denoted by M_R. The inverse of G_R is desired since each of its transitions is reversed from the orientation of the corresponding transition in M_C.
It is important to remember that G_R evolves continuously throughout the recognition process. Consequently, G_R and M_R denote a different graph and automaton at different points in time.
CHAPTER VIII
CONCLUSION
Summary of Main Results
The first part of this work presented a framework for describing general canonical context-free recognition. The framework has a structurally simple mathematical foundation. The essence of general canonical recognition was captured using a small number of binary relations and basic set-theoretic concepts. Each general recognition scheme that was presented followed the same script while exploiting inherent properties of viable prefixes. Specifically, general recognition was reduced to computing a sequence of regular sets in each case. Regularity-preserving relations were applied to effect the set-to-set mappings. Our characterization of general recognition is novel and rather elegant. Its clarity and simplicity confirm that viable prefixes are especially suitable bases for general recognition. Moreover, our framework offers a conceptual breakthrough toward a better understanding of the quintessence of general canonical recognition.
Earley's algorithm proved an especially fitting vehicle for demonstrating the efficacy of the General_LR and General_LL recognition schemes. In particular, our graphical variant of Earley's recognizer, Earley′, illustrated one way of realizing explicit representations for the sets of viable prefixes and viable suffixes that are tracked by these two complementary schemes. The fact that General_LR is directly manifested by Earley′ led us to conclude that it is more appropriate to interpret Earley's algorithm as a bottom-up method rather than a top-down one. Regardless of which interpretation one favors, Earley′ provided much new insight into Earley's algorithm. Specifically, a deeper understanding of Earley's algorithm was gained and its relationship with LR parsers was clarified.
General Bottom-Up Correct-Prefix Recognition
Now that ⊨ and ⊣ are defined as inverses, albeit restricted, of ⇒ᵣ and ⊢, respectively, the transition from General_RR to General_LR is completed by also inverting the direction in which an input string w ∈ T* is scanned. Accordingly, the essence of General_LR is that all of the reversed rightmost derivations of w ∈ T* are followed in parallel.
Once again, there are theoretical limits on the precision to which this task may be carried out; that is, it is not possible to pursue exclusively the reversed rightmost derivations of w in the general case. Instead, at the point where a prefix x of w has been processed, all reversed rightmost derivations (from ε) of all strings in xT* ∩ L(G) are followed (i.e., all sentences that have x as a prefix).
As in the top-down recognition schemes, regularity-preserving operations on regular subsets of VP(G) are the key to General_LR. Correct-prefix recognition is performed; i.e., the membership in PREFIX(G) of an incrementally longer prefix of w is ascertained as w is scanned from left to right. Given a prefix x of w, the inclusion of x in PREFIX(G) is determined from the set {α ∈ VP(G) | α ⇒* x holds in G}. This set is nonempty if and only if x ∈ PREFIX(G), and it contains ω for some S → ω ∈ P if and only if x ∈ L(G). Figure 4.1 presents a high-level description of General_LR; a more detailed discussion follows.
function General_LR(G = (V, T, P, S); w ∈ T*)
  // w = a_1 a_2 ⋯ a_n, n ≥ 0, each a_i ∈ T
  PVP_LR(G, ε) := {ε}
  for i := 0 to n−1 do
    VP_LR(G, i:w) := ⊨*(PVP_LR(G, i:w))
    PVP_LR(G, i+1:w) := ⊣_{a_{i+1}}(VP_LR(G, i:w))
    if PVP_LR(G, i+1:w) = ∅ then Reject(w) fi
  od
  VP_LR(G, w) := ⊨*(PVP_LR(G, w))
  if ω ∈ VP_LR(G, w) for some S → ω ∈ P then Accept(w) else Reject(w) fi
end
Figure 4.1 A General Bottom-Up Correct-Prefix Recognizer
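Figure 4.1 can be exercised directly on a small example. The sketch below is ours, not the dissertation's: the grammar is the hypothetical S → aSb | ab, VP(G) has been enumerated by hand up to the lengths needed for the test inputs (an assumption of this illustration), a reduce-closure plays the role of ⊨*, and a guarded shift plays ⊣_a.

```python
# Figure 4.1, instantiated for the toy grammar S -> a S b | a b.
# VP(G), hand-enumerated up to length 5: a^i, a^i b, a^i S, a^i S b.
PRODS = [('S', 'aSb'), ('S', 'ab')]
VP = {'', 'a', 'aa', 'aaa', 'aaaa',
      'ab', 'aab', 'aaab', 'aaaab',
      'aS', 'aaS', 'aaaS', 'aaaaS',
      'aSb', 'aaSb', 'aaaSb'}

def reduce_closure(X):          # the |=* step: VP_LR := |=*(PVP_LR)
    X, work = set(X), list(X)
    while work:
        alpha = work.pop()
        for lhs, rhs in PRODS:
            if alpha.endswith(rhs):
                cand = alpha[:len(alpha) - len(rhs)] + lhs
                if cand in VP and cand not in X:
                    X.add(cand)
                    work.append(cand)
    return X

def shift(X, a):                # the -|_a step, restricted to VP(G)
    return {alpha + a for alpha in X if alpha + a in VP}

def general_lr(w):
    pvp = {''}                  # PVP_LR(G, epsilon)
    for a in w:
        vp = reduce_closure(pvp)
        pvp = shift(vp, a)
        if not pvp:
            return False        # Reject: prefix left PREFIX(G)
    vp = reduce_closure(pvp)
    return any(rhs in vp for lhs, rhs in PRODS if lhs == 'S')   # Accept?
```

On `aabb` the recognizer shifts to `aab`, reduces `ab` to reach `aS`, shifts to `aSb`, and accepts because the right-hand side `aSb` survives the final reduce-closure; `aab` and `ba` are rejected.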
General_LR0. In any case, a set variable called δ_subset is initialized to contain δ_i; it is crucial that this assignment occur before Traverse is called.
(30) In short, Traverse creates certain paths to states in Q_i that spell strings of nullable nonterminal symbols. Further discussion of the Traverse function is deferred until later. The Reduce function can be understood independently of it.
(31) Each transition in δ_subset is considered in turn. All reductions relevant to the source states of those transitions are performed. Additional transitions may be added to δ_subset within this loop.
(32) A transition (p, X, q) is removed from δ_subset.
(33) The set of items ψ(p) determines what reductions, if any, are applicable to p. Any kernel item of the form A → αX·β ∈ ψ(p) such that β ⇒* ε holds in G is relevant; that is, we see through certain nullable suffixes of production right-hand sides. In effect, a reduction from p on A → αXβ is performed. As described below, a path to p spelling β^R will have been installed in G_R by an earlier call to the Traverse function. In this way, any cycles created in Q_i by nullable nonterminals are left for Traverse to handle.
(34) At this point we are considering one particular reduction applicable to p, say A → αX·β ∈ ψ(p) where β is nullable. This reduction is performed by traversing certain paths in G_R from p that spell (αX)^R to locate the states in Q to which transitions on A must be made. In particular, we want to traverse only those paths that start with the transition (p, X, q). Any other transition from p will either have already been reduced through or else is in δ_subset waiting to be handled in a later iteration of the while loop. The states of interest are given by succ(q, α^R). It is precisely this application of succ that motivates reversing the transitions in G_R with respect to those in M_C.
(35-42) At this point we are dealing with one particular state r ∈ succ(q, α^R), and we assume that goto(ψ(r), A) = I_j for some I_j ∈ I. Thus, we need a state q_{j:i} in Q_i and a transition (q_{j:i}, A, r) in δ_i. Both of these objects may already exist in G_R, so they are conditionally created as indicated by the if statements. Incidentally, a transition is generated redundantly
First, we assume that G is arbitrary. For 1 ≤ i ≤ n+1, Q_i contains at most m states, and Q_0 contains exactly one state. Thus, there are at most m(n+1)+1 ∈ O(n) states in G_R.
Consider δ_i for some i, 0 ≤ i ≤ n+1. In the worst case, there may be a transition from every state in Q_i to every state in ⋃_{0≤j≤i} Q_j. The number of states in ⋃_{0≤j≤i} Q_j is at most m(i+1). Consequently, since Q_i has at most m states, δ_i has at most m²(i+1) transitions. In addition, δ_{n+1} contains one transition. Thus, there are at most 1 + Σ_{i=0}^{n} m²(i+1) ∈ O(n²) transitions in G_R.
Summarizing, G_R contains at most O(n) states and O(n²) transitions. Therefore, the space complexity of General_LR0 for arbitrary G is O(n²). An ambiguous grammar that meets this worst-case space bound is the following: {S → SS | a | ε}.
The space complexity of General_LR0 remains O(n²) even if G is unambiguous. For example, the unambiguous grammar with production set {S → aSa | a | ε} meets this worst-case space bound.
Time Bounds
The time complexity of General_LR0 is determined by placing an upper bound on the time required to construct G_R. It transpires that the complexity of General_LR0 is dominated by the complexity of the Reduce function. The following remarks are made in light of the earlier observations regarding the efficiency of the set operations used by General_LR0.
The main function invokes the Shift and Reduce functions n+1 times each. Thus, the time complexity of General_LR0 is determined from the time spent in these two functions throughout the duration of recognition.
At most m states and m transitions are installed in G_R during any one invocation of the Shift function. Thus, over n+1 calls, O(n) time is spent within Shift.
In analyzing the complexity of the Reduce function, the time spent within Traverse is accounted for separately. In any one invocation of Reduce, the Traverse function is called at most m times. That is, in the worst case it is called once for each state in Q_i. Within any
tions. Specifically, leftmost derivations, left sentential forms, etc., can be defined in terms of these relations analogously to how rightmost derivations, right sentential forms, etc., are expressed in terms of ⇒ᵣ and ⊢. However, an alternate approach is suggested by the following result.
Fact 3.2 For α, β ∈ V*, (1) α ⇒* β holds in G if and only if α^R ⇒* β^R holds in G^R; (2) α ⇒ₗ* β holds in G if and only if α^R ⇒ᵣ* β^R holds in G^R.
Proof. A slightly stronger statement is presented by Sippu and Soisalon-Soininen as Fact 3.1 [38].
For future reference, some useful equivalences that are implied by Fact 3.2 include the following: (1) L(G^R) = (L(G))^R, (2) PREFIX(G^R) = (SUFFIX(G))^R, and (3) SUFFIX(G^R) = (PREFIX(G))^R.
Fact 3.2 is exploited rather extensively in what follows. In particular, leftmost derivations in G, and ultimately general top-down correct-prefix recognition, are described in terms of strong rightmost derivations in G^R and the chop relation. Consequently, a substantial portion of the results derived in the previous section are useful here as well. This economizes on our efforts considerably.
Strong Rightmost Derivations in Reversed Grammars
The R-derives relation induced on V* by P^R is defined by ⇒_R = {(αA, αω) | α ∈ V*, A → ω ∈ P^R}. The relationship between strong rightmost derivations in G^R and leftmost derivations in G is the subject of the next series of lemmas.⁴
Lemma 3.26 For α, β ∈ V*, if α ⇒*_R β holds in G^R, then α^R ⇒*_L β^R holds in G.
Proof. By assumption, α ⇒*_R β holds in G^R. This implies that α ⇒* β holds in G^R by Lemma 3.1. It follows from Fact 3.2 that α^R ⇒*_L β^R holds in G.
Lemma 3.27 For α, β ∈ V* and A ∈ N, if α ⇒*_L Aβ holds in G, then α^R ⇒*_R β^R A holds in G^R.
⁴ The chop relations relevant to G and G^R are identical.
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Manuel E. Bermudez, Chairman
Assistant Professor of Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
George Logothetis, Cochairman
Assistant Professor of Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
Yuan-Chieh Chow
Professor of Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation for the degree of Doctor of Philosophy.
(25) The transitions installed by Shift are assigned appropriate parse annotations. As described earlier, the parse annotation for a_{i+1} ∈ T is denoted by [a_{i+1}].
(34) This line reflects the new form taken by the transitions of the recognition graph.
28. function Reduce(i)
29.     δ_subset := ∅
30.     Traverse(Q_i, i)
31.     Succ_Stack := ∅
32.     while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
33.         if Succ_Stack = ∅ then
34.             (p, X, q, [π]) := Remove(δ_subset)
35.             for A → αX·β ∈ ψ(p) such that β ⇒* ε do
36.                 // Let [π_β] be the parse annotation for β.
37.                 Push(Succ_Stack, (q, A, len(α), [&π, π_β]))
38.             od
39.         else // Succ_Stack ≠ ∅
40.             (r, A, d, [π₁]) := Pop(Succ_Stack)
41.             if d > 0 then // Let X = entry(ψ(r)).
42.                 for r' ∈ Q such that (r, X, r', [π₂]) ∈ δ do
43.                     Push(Succ_Stack, (r', A, d−1, [&π₂, π₁]))
44.                 od
45.             else // d = 0, let goto(ψ(r), A) = I_j.
46.                 if q_{j:i} ∉ Q then
47.                     Q := Q ∪ {q_{j:i}}
48.                     Traverse({q_{j:i}}, i)
49.                 fi
50.                 if (q_{j:i}, A, r, [π]) ∉ δ for any [π] then
51.                     δ := δ ∪ {(q_{j:i}, A, r, [π₁])}
52.                     δ_subset := δ_subset ∪ {(q_{j:i}, A, r, [π₁])}
53.                 else // Let (q_{j:i}, A, r, [π₂]) ∈ δ hold for some [π₂].
54.                     Disambiguate((q_{j:i}, A, r, [π₁]), (q_{j:i}, A, r, [π₂]))
55.                 fi
56.             fi
57.         fi
58.     od
59. end
Figure 7.1 continued
(35-38) As in the recognizer, we need to initiate all relevant reductions from p by pushing appropriate entries onto Succ_Stack. However, the computation of the succ function that is carried out here must also construct parse annotations for the transitions installed by Reduce. A fourth field is added to each entry in Succ_Stack for this purpose. In short, this
State-Transition Graphs and Finite-State Automata
A state-transition graph (STG) is denoted by G = (Q, Σ, δ) where Q is a finite set of states, Σ is an alphabet, and δ ⊆ (Q × (Σ ∪ {ε})) × Q is the transition relation.³ Thus, an STG differs from a finite-state automaton only in that it does not have a start state or a set of final states designated for it. A member ((p, a), q) ∈ δ is read as a transition from p to q on a; p is the source of the transition and q is the target. A member ((p, a), q) ∈ δ is also written as (p, a, q) ∈ δ or q ∈ δ(p, a); the latter may be written as q = δ(p, a) if (p, a, q), (p, a, r) ∈ δ implies that q = r. A transition on ε is known as an ε-transition. An STG is ε-free if it has no ε-transitions. For the remainder of this section we assume an arbitrary STG G = (Q, Σ, δ).
The following property holds for all STGs that arise in this work. If (p, a, q), (p, b, r) ∈ δ and a ≠ b, then q ≠ r; in words, distinct transitions which share the same source state access distinct target states. Thus, for any pair of states p, q ∈ Q, there is at most one transition from p to q.
A path in G and the string over Σ that it spells are defined inductively as follows. For each state q ∈ Q, (q) denotes a path in G from q to q spelling ε; for m ≥ 1 and q_0, q_1, ..., q_m ∈ Q, if (q_0, q_1, ..., q_{m−1}) denotes a path in G from q_0 to q_{m−1} spelling x and (q_{m−1}, a, q_m) ∈ δ, then (q_0, q_1, ..., q_m) denotes a path in G from q_0 to q_m spelling xa. The length of a path is the number of transitions that it contains. A state q is reachable from a state p if and only if there exists a path in G from p to q.
The succ function, succ: Q × Σ* → 2^Q, is defined by succ(p, x) = {q ∈ Q | there exists a path in G from p to q spelling x}. Extending this function to R ⊆ Q, succ(R, x) = ∪_{q∈R} succ(q, x). The pred function, pred: Q × Σ* → 2^Q, is defined in terms of succ by pred(q, x) = {p ∈ Q | q ∈ succ(p, x)} and is similarly extended to subsets of Q.
³ A subscript is given to G later to differentiate it from a grammar.
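For concreteness, the succ and pred functions can be sketched as follows for the ε-free case (a minimal illustration, not from the dissertation; the state and symbol names are made up):

```python
# A minimal sketch of an epsilon-free STG with the succ and pred functions
# defined above. State and symbol names are illustrative.

class STG:
    def __init__(self, states, alphabet, transitions):
        self.Q = set(states)           # finite set of states
        self.sigma = set(alphabet)     # alphabet
        self.delta = set(transitions)  # transitions (p, a, q)

    def step(self, p, a):
        """All targets of transitions from p on symbol a."""
        return {q for (src, sym, q) in self.delta if src == p and sym == a}

    def succ(self, R, x):
        """States reachable from a state (or set R) along a path spelling x."""
        current = {R} if isinstance(R, str) else set(R)
        for a in x:
            current = {q for p in current for q in self.step(p, a)}
        return current

    def pred(self, R, x):
        """pred(q, x) = {p | q in succ(p, x)}, extended to subsets of Q."""
        R = {R} if isinstance(R, str) else set(R)
        return {p for p in self.Q if self.succ(p, x) & R}

g = STG({"0", "1", "2"}, {"a", "b"},
        {("0", "a", "1"), ("1", "b", "2"), ("0", "a", "2")})
print(g.succ("0", "ab"))   # {'2'}
print(g.pred("2", "a"))    # {'0'}
```

In the general (non-ε-free) case the definition above quantifies over paths, which may include ε-transitions; that case is handled by the Σsucc and εsucc functions introduced later.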
BIOGRAPHICAL SKETCH
The author spent his early formative years in Georgetown, Massachusetts. An undergraduate education in Physics at Rensselaer Polytechnic Institute was followed by four years with the General Electric Company in upstate New York. Thence the corporate chains were shorn for the charms of the South Pacific. Not wanting too much of a good thing, the author resumed his education as a graduate student at the University of Florida. In due time, an M.S. in Computer Science was attained. Spurred on by the sage counseling of his mentor, a Ph.D. in Computer Science was pursued as an afterthought.
Earley's Algorithm Revisited
A general recognizer that operates strikingly similarly to Earley' is obtained by modifying General_LR0 to use a particular nondeterministic variant of the LR(0) automaton for G as a control automaton. The alternate control automaton, the modified algorithm, and its relationship to Earley' are briefly discussed in this section.
Alternate Control Automata
The nondeterministic LR(0) (or NLR(0)) automaton of G [24, p. 250] is denoted here by M_NC(G) = (I, V, goto, I_0, I) where
(1) I = {I_0, I_1, ..., I_{m−1}} = {{A → α·β} | A → αβ ∈ P},
(2) goto({A → α·Xβ}, X) = {A → αX·β}, and
(3) {B → ·ω} ∈ goto({A → α·Bβ}, ε) for each B → ω ∈ P.
In this case, we prescribe that I_0 = {S' → ·S$} and I_{m−1} = {S' → S$·}. Again, M_NC(G) is simplified to M_NC when G is understood. If the standard subset construction algorithm for converting NFAs to DFAs is applied to M_NC, the (deterministic) LR(0) automaton of G is obtained, i.e., M_C(G).
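The subset construction invoked here is the standard one; a hedged sketch follows (the NFA is given as a transition dictionary, and all names are illustrative). Applied to M_NC, with its singleton item sets playing the role of NFA states, it yields the sets-of-items states of M_C:

```python
# Hedged sketch of the standard subset construction for an NFA given as a
# transition dict mapping (state, symbol) to a set of target states.

from collections import deque

EPS = ""  # epsilon

def eps_closure(states, delta):
    """All states reachable from `states` via epsilon-transitions only."""
    stack, closure = list(states), set(states)
    while stack:
        p = stack.pop()
        for q in delta.get((p, EPS), ()):
            if q not in closure:
                closure.add(q)
                stack.append(q)
    return frozenset(closure)

def subset_construction(start, alphabet, delta):
    """Return (dfa_states, dfa_delta, dfa_start) for the NFA (delta, start)."""
    dfa_start = eps_closure({start}, delta)
    dfa_states, dfa_delta = {dfa_start}, {}
    work = deque([dfa_start])
    while work:
        S = work.popleft()
        for a in alphabet:
            targets = {q for p in S for q in delta.get((p, a), ())}
            if not targets:
                continue
            T = eps_closure(targets, delta)
            dfa_delta[(S, a)] = T
            if T not in dfa_states:
                dfa_states.add(T)
                work.append(T)
    return dfa_states, dfa_delta, dfa_start
```

Only the reachable subsets are generated, mirroring the usual construction of the canonical LR(0) collection.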
Some functions related to succ and pred are needed for navigating through NLR(0) automata and the recognition graphs derived from them. Toward that end, let G_0 = (Q, Σ, δ) be an STG. The Σsucc and Σpred functions, both of type Q × Σ* → 2^Q, are defined recursively as follows.
(1) For q ∈ Q, Σsucc(q, ε) = Σpred(q, ε) = {q};
(2) for p ∈ Q, a ∈ Σ, and x ∈ Σ*,
Σsucc(p, xa) = {r ∈ Q | q ∈ Σsucc(p, x) and (q, a, r) ∈ δ for some q} and
Σpred(p, ax) = {r ∈ Q | q ∈ Σpred(p, x) and (r, a, q) ∈ δ for some q}.
Thus, Σsucc and Σpred effectively ignore ε-transitions. Note that if G_0 is ε-free, then Σsucc (resp. Σpred) is identical to succ (resp. pred). The εsucc and εpred functions, both of type Q → 2^Q, are defined for dealing with ε-transitions. For p ∈ Q, εsucc(p) =
Let (q_{j:i}, A, q_{k:h}), h ≤ i, be a transition that is about to be added to δ. When q_{k:h} was created, an integer flag was attached to it for each nonterminal transition out of I_k in M_C; these flags are initialized to −1. If (q_{j:i}, A, q_{k:h}) ∉ δ, then the flag attached to q_{k:h} that is associated with the transition out of I_k on A is less than i. When the transition is added to δ, this flag is set equal to i. The effectiveness of this scheme is a consequence of the order in which transitions are added to δ. Thus, both set operations can be performed with respect to δ in constant time as well.
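The timestamp-flag test just described can be sketched in a few lines (a minimal illustration with hypothetical names; the real scheme attaches the flags to recognition-graph states as they are created):

```python
# Illustrative sketch of the integer-flag membership test described above.
# Each state carries one flag per nonterminal transition out of its item set;
# a transition on A to this state is present in delta at input position i
# exactly when the flag for A equals i.

class State:
    def __init__(self, nonterminals):
        # Flags are initialized to -1 ("no such transition installed yet").
        self.flag = {A: -1 for A in nonterminals}

def has_transition(state, A, i):
    """Constant-time test: was a transition on A to `state` installed at step i?"""
    return state.flag[A] == i

def add_transition(state, A, i):
    """Record that the transition on A to `state` was installed at step i."""
    state.flag[A] = i

q = State({"A", "B"})
assert not has_transition(q, "A", 3)
add_transition(q, "A", 3)
assert has_transition(q, "A", 3)
```

The scheme is sound only because, as noted above, transitions to a state on a given nonterminal are added in nondecreasing order of i, so an old flag value can never be mistaken for the current step.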
The succ Function
The last significant aspect of General_LR0 that needs explication is its use of the succ function. This subsection proposes one approach to implementing succ. A revised Reduce function is presented which incorporates the method. The modified function is displayed in Figure 6.3.
Each use of succ in Reduce implies that a traversal of G_R is carried out. An auxiliary stack, the Succ_Stack, is used by Reduce to effect this traversal. Each entry in the stack records an intermediate stage in the traversal of G_R that is required to compute the succ function.
Consider the reference to the succ function in line 34 of Figure 6.1. Based on properties of control automata and recognition graphs, the following holds: succ(q, α^R) = {r ∈ Q | there is a path in G_R from q to r spelling α^R} = {r ∈ Q | there is a path in G_R from q to r of length len(α^R)}. Motivated by this observation, each entry in Succ_Stack is a triple (r', A, d) where (1) r' is a state in G_R to which some path traversal from q has progressed, (2) A is the left-hand side of the production being reduced, and (3) d is the distance left to go before a state in succ(q, α^R) is reached, where d ≤ len(α^R). The commentary that follows clarifies how Succ_Stack is used to compute the succ function.
(1-3) These three lines correspond to lines 28-30 of Figure 6.1.
(4) The Succ_Stack is initially empty.
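Because every path from q spelling α^R has length len(α), the stack-driven computation of succ reduces to propagating (state, distance) pairs. A sketch (illustrative only, not the dissertation's Reduce function; the graph is a plain adjacency dict):

```python
# Sketch of the stack-driven traversal described above: succ(q, alpha^R) is
# exactly the set of states reachable from q along paths of len(alpha)
# transitions, so it can be computed by propagating (state, distance) pairs.

def succ_by_distance(graph, q, d):
    """All states reachable from q along paths of exactly d transitions."""
    result = set()
    stack = [(q, d)]                 # mirrors Succ_Stack entries (r', d)
    while stack:
        r, remaining = stack.pop()
        if remaining == 0:
            result.add(r)            # a member of succ(q, alpha^R) is reached
        else:
            for r2 in graph.get(r, ()):
                stack.append((r2, remaining - 1))
    return result

g = {"q3": ["q2"], "q2": ["q1", "q0"], "q1": ["q0"]}
print(sorted(succ_by_distance(g, "q3", 2)))  # ['q0', 'q1']
```

The real Reduce function interleaves this propagation with the installation of new transitions, which is why the distance field is carried in each Succ_Stack entry rather than computed in a separate pass.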
Earley's Algorithm and Viable Prefixes
Let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Earley' to G and w. In this section, the strings over V that are spelled by rooted paths in G_E' are analyzed. It transpires that the string spelled by an arbitrary rooted path in G_E' is a viable prefix of G. Moreover, the string spelled by a rooted path in G_E' which terminates at a state in S_i, 0 ≤ i ≤ n+1, is a member of VP(G, i:w). These results complement the bottom-up interpretation of Earley's algorithm as exemplified by Fact 5.2.
Lemma 5.1 For 0 ≤ j ≤ i ≤ n+1, let s = [A → α·β, j] be a state in S_i. Then every path of length len(α) to s in G_E' spells α.
Proof. The proof is by induction on len(α) = m.
Basis (m = 0). In this case α = ε, so j = i and s = [A → ·β, j]. The unique path of length 0 to s in G_E' is denoted by (s). By definition, this trivial path spells ε.
Induction (m > 0). Since len(α) > 0, α = α'X for some α' ∈ V* and X ∈ V, i.e., s = [A → α'X·β, j]. Thus, s was added to S_i by either the Scanner or the Completer. In either case, every transition to s in G_E' is of the form (r, X, s) such that r = [A → α'·Xβ, j] ∈ S_{i'} for some i', j ≤ i' ≤ i. By the induction hypothesis, every path of length len(α') to r in G_E' spells α'. Consequently, every path of length len(α) to s in G_E' spells α'X = α.
Corollary For 0 ≤ j ≤ i ≤ n+1, let s = [A → α·β, j] be a state in S_i. Then every path of length len(α) to s in G_E' begins at [A → ·αβ, j] ∈ S_j.
Lemma 5.2 Let ρ = (s_0, ..., s_m), m ≥ 0, be a rooted path in G_E' such that ρ spells γ ∈ V* and s_m = [A → α·β, j] ∈ S_i for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ VP(G, i:w) and [A → α·β, j] ∈ S_i is valid for γ.
Proof. By Lemma 5.1 and its Corollary, γ = δα for some δ ∈ V*; we show by induction on m that S' ⇒*_R δAy and δ ⇒* a_1 a_2 ··· a_j hold in G for some y ∈ T*.
Proof. Assume that γ is a viable suffix of G. By Fact 3.3, γ is also a viable prefix of G^R. Thus, ω(⇒*_R ⌐)^x ⇒*_R γ holds in G^R for some S → ω ∈ P^R and x ∈ T*, by Lemma 3.19.
Theorem 3.39 VS(G) = {γ ∈ V* | ω(⇒*_R ⌐)^x ⇒*_R γ holds in G^R for some S → ω ∈ P^R and x ∈ T*}.
Proof. This theorem combines Lemmas 3.37 and 3.38.
Corollary VS(G) = {γ ∈ V* | S(⇒_R ∪ ⌐)⁺ γ holds in G^R}.
Corollary For α, β ∈ V*, if α ∈ VS(G) and α(⇒_R ∪ ⌐)* β holds in G^R, then β ∈ VS(G).
General Top-Down Correct-Prefix Recognition
Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G that is a left-to-right analog of General_RR is described next. This scheme, called General_LL, scans w from left to right as it recognizes an incrementally longer prefix of the input string. General_LL effectively pursues all of the leftmost derivations of w in parallel through regularity-preserving operations on regular subsets of VS(G).
Again, the inherent nondeterminism of general context-free recognition subverts any attempt to follow exclusively the leftmost derivations of w. Instead, at the point where a prefix x of w has been processed, all leftmost derivations (from S) of all strings in xT* ∩ L(G) are followed (i.e., of all sentences that have x as a prefix).
The essence of General_LL mirrors that of General_RR. Let x ∈ T* be a prefix of w. Suppose that all proper prefixes of x are members of PREFIX(G). The set of viable suffixes associated with x below determines whether x ∈ PREFIX(G) holds: the set is nonempty if and only if x ∈ PREFIX(G), and it contains ε if and only if x ∈ L(G). General_LL, shown in Figure 3.2, is described in greater detail in what follows.
For arbitrary x ∈ T*, two sets of viable suffixes are identified with x. The first set, the primitive LL-associates of x (in G), is defined by PVS_LL(G, x) = {β ∈ V* | ω(⇒*_R ⌐)^{x^R} ⇒*_R β holds in G^R for some S → ω ∈ P^R}. The other set contains the LL-associates of x (in G) and is
Recalling the set-theoretic foundation of General_LR0 helps to motivate the utility of garbage collection. Since G_R represents the sets of viable prefixes that are tracked by the recognizer, the notion of a dead state as it applies to M_R identifies nonessential states of the recognition graph. Whether G_R is considered at line 10 or line 12 of General_LR0, all states that are dead with respect to M_R at those points, as well as all transitions emanating from them, are no longer needed. Consequently, the space used by these states and transitions can be reclaimed for later use.
In order to determine an appropriate location within General_LR0 to invoke garbage collection, note that if M_R contains no dead states before Reduce is called, then it has no dead states when Reduce terminates. However, the same remark does not apply to the Shift function. In particular, states can become dead during the i-th call to Shift, 0 ≤ i ≤ n, whenever only a proper subset of the states in Q_i have transitions generated to them. Thus, it is convenient to perform garbage collection in conjunction with the Shift function by anticipating the states that become dead as a result of it.
An appropriate place to perform garbage collection is immediately following line 19 in the Shift function. The following simple scheme is sufficient.
(1) Mark all states that are reached in a traversal of G_R that begins at the states in Q_subset.
(2) In a second traversal that starts from the states in Q_i \ Q_subset, delete from Q the states that were not marked in step (1) and delete from δ the transitions that emanate from those states.
Note that a garbage collection scheme based on reference counts would be far less straightforward due to the self-references which arise from cycles in the recognition graph. Moreover, the simple mark-and-sweep garbage collection procedure outlined above applies readily to General_NLR0 as well.
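The two-step scheme can be sketched directly (an illustrative mark-and-sweep over an adjacency-dict graph; names such as live_roots are hypothetical stand-ins for Q_subset and Q_i \ Q_subset):

```python
# Sketch of the two-pass mark-and-sweep scheme outlined above. The graph maps
# each state to the targets of its outgoing transitions.

def mark(graph, live_roots):
    """Step (1): mark every state reachable from the states in live_roots."""
    marked, stack = set(live_roots), list(live_roots)
    while stack:
        p = stack.pop()
        for q in graph.get(p, ()):
            if q not in marked:
                marked.add(q)
                stack.append(q)
    return marked

def sweep(graph, candidates, marked):
    """Step (2): delete unmarked states reachable from candidates, together
    with the transitions that emanate from them."""
    dead = mark(graph, candidates) - marked
    for p in dead:
        graph.pop(p, None)           # removes the state and its transitions
    return dead

g = {"a": ["c"], "b": ["c"], "c": []}
m = mark(g, {"a"})                   # 'a' and 'c' stay live
sweep(g, {"b"}, m)                   # 'b' is dead; shared 'c' survives
print(sorted(g))                     # ['a', 'c']
```

Note how the shared state survives because it was marked in the first pass; this is exactly the situation that makes reference counting awkward in the presence of cycles.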
Although garbage collection can improve the space efficiency of General_LR0, it obviously incurs a time penalty. For 0 ≤ i ≤ n,
Outline in Brief 7
II NOTATION AND TERMINOLOGY 8
Elements of Formal Language Theory 8
Context-Free Grammars and Languages 9
State-Transition Graphs and Finite-State Automata 11
III GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK 13
Recognition Based on Derivations 13
Top-Down Right-to-Left Recognition 15
Top-Down Left-to-Right Recognition 27
Discussion 35
IV GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK 37
Bottom-Up Left-to-Right Recognition 37
Discussion 46
V ON EARLEY'S ALGORITHM 48
Earley's General Recognizer 48
A Modified Earley Recognizer 51
Earley's Algorithm and Viable Prefixes 53
Earley's Algorithm and Viable Suffixes 56
Discussion 59
Lastly, suppose that the next action called for by the parser is to reduce by production A → X_{m−r} ··· X_m, r ≥ 0, i.e., the length of the right-hand side is strictly greater than 0. If goto(s_{m−r−1}, A) = t, then the contents of the stack becomes s_0 s_1 ··· s_{m−r−1} t and a new tree node labeled with A is attached to t. In addition, this new internal node is set to point to each of the nodes that were associated with the states s_{m−r}, ..., s_{m−1}, s_m before the reduction was made.
Upon accepting the input string w, the contents of the stack is s_0 s' s'' where goto(s_0, S) = s' and goto(s', $) = s''. At this point, the root of the parse tree for a_1 a_2 ··· a_n is attached to s'.
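The stack discipline just described can be sketched as follows (a toy illustration with a hypothetical goto table, not the dissertation's parser):

```python
# Sketch of an LR stack whose entries pair an automaton state with the
# parse-tree node attached to it. `goto` is a hypothetical transition table.

class Node:
    def __init__(self, label, children=()):
        self.label, self.children = label, list(children)

def shift(stack, goto, a):
    """Push goto(top, a) with a leaf node labeled a attached to it."""
    s = stack[-1][0]
    stack.append((goto[(s, a)], Node(a)))

def reduce(stack, goto, A, r):
    """Reduce by a production with right-hand-side length r: pop r entries
    and attach their nodes as children of a new node labeled A."""
    children = [node for (_, node) in stack[-r:]]
    del stack[-r:]
    s = stack[-1][0]
    stack.append((goto[(s, A)], Node(A, children)))

# Parsing "a b" with toy productions S -> a B and B -> b:
goto = {(0, "a"): 1, (1, "b"): 2, (1, "B"): 3, (0, "S"): 4}
stack = [(0, None)]                  # bottommost state carries no node
shift(stack, goto, "a")
shift(stack, goto, "b")
reduce(stack, goto, "B", 1)          # B -> b
reduce(stack, goto, "S", 2)          # S -> a B
root = stack[-1][1]
print(root.label, [c.label for c in root.children])  # S ['a', 'B']
```

The generalization to a parse forest replaces the one-node-per-state association with annotations on recognition-graph transitions, as described next.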
A parse forest for the input string is synthesized by General_LR0' in an analogous fashion. Specifically, the Shift and Reduce functions are modified to annotate the recognition graph with information sufficient for representing the parse forest. The parse annotations are attached to the transitions of the recognition graph since the connectivity of the graph, i.e., as exhibited through the transitions, reflects the structure of the parse forest.
Overlooking many of the details that are supplied later, General_LR0' constructs a parse forest as follows. When a transition on a ∈ T is created by Shift, a leaf node labeled with a is attached to that transition. A transition that is created by Reduce corresponds to an internal node of the parse forest. The parse annotation attached to it includes pointers to the parse annotations associated with the transitions that were traversed along the way toward creating that transition (i.e., the transitions traversed in the computation of the succ function). The transitions created by Traverse are annotated so as to avoid creating circularities in the parse forest that arise due to unbounded derivations of the empty string. In short, Traverse resolves all ambiguous derivations of ε.
A transition that is multiply-defined, i.e., due to ambiguity, can have a distinct parse annotation attached to it for each path in the recognition graph that reduced to that transition. In this way, the parse forest becomes a factored representation of all possible parse trees for the input string (excluding ambiguous derivations of ε). However, the presentation
A → α·β such that α ≠ ε. With the exception of S' → ·S$, all items of the form A → ·ω are closure items.
We denote the LR(0) automaton of G by M_C(G) = (I, V, goto, I_0, I) where I = {I_0, I_1, ..., I_{m−1}} is the collection of sets of LR(0) items. The C subscript is a reminder that M_C(G) is used as the control automaton during recognition. For convenience, we assume that S' → ·S$ ∈ I_0 and S' → S$· ∈ I_{m−1}; in fact, the latter assumption implies that I_{m−1} = {S' → S$·}. A detailed accounting of M_C(G) is not needed to describe how it is used to recognize strings. However, the following well-known facts about M_C(G) are useful.
(1) L(M_C(G)) = VP(G).
(2) Each I_j ∈ I \ {I_0} has a unique entry symbol X ∈ V, i.e., the grammar symbol that all transitions to I_j are made on. The entry symbol for I_j, j ≠ 0, is denoted by entry(I_j). There are no transitions directed to I_0 in M_C(G), so entry(I_0) is not defined.
(3) For I_j ∈ I, (i) if A → α·Xβ ∈ I_j, then A → αX·β ∈ I_k where I_k = goto(I_j, X); (ii) if A → αX·β ∈ I_j, then A → α·Xβ ∈ I_k for all I_k ∈ pred(I_j, X); and (iii) if A → α·β ∈ I_j and A ≠ S', then goto(I_k, A) is defined for all I_k ∈ pred(I_j, α).
Automaton M_C(G) is also denoted by M_C if G is understood.
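A hedged sketch of the sets-of-items construction for a small grammar follows (items are (lhs, rhs, dot) triples; the grammar shown is illustrative and is not one of the dissertation's examples):

```python
# Sketch of the canonical LR(0) collection and its goto function. A grammar
# is a dict mapping each nonterminal to a list of right-hand-side tuples.

from collections import deque

def closure(items, prods):
    """Add A -> .omega for every nonterminal appearing right after a dot."""
    items = set(items)
    work = deque(items)
    while work:
        (lhs, rhs, dot) = work.popleft()
        if dot < len(rhs) and rhs[dot] in prods:     # nonterminal after dot
            for omega in prods[rhs[dot]]:
                item = (rhs[dot], omega, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(I, X, prods):
    """Advance the dot over X in every applicable item of I, then close."""
    return closure({(l, r, d + 1) for (l, r, d) in I
                    if d < len(r) and r[d] == X}, prods)

def lr0_collection(prods, start_item):
    I0 = closure({start_item}, prods)
    collection, table = {I0}, {}
    work = deque([I0])
    while work:
        I = work.popleft()
        for X in {r[d] for (_, r, d) in I if d < len(r)}:
            J = goto(I, X, prods)
            table[(I, X)] = J
            if J not in collection:
                collection.add(J)
                work.append(J)
    return collection, table

# S' -> S $, S -> a S | a
prods = {"S'": [("S", "$")], "S": [("a", "S"), ("a",)]}
I, tbl = lr0_collection(prods, ("S'", ("S", "$"), 0))
print(len(I))  # 5
```

Fact (2) above can be observed directly in `tbl`: all transitions into a given item set are made on the same grammar symbol.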
The precise manner in which the recognition graph is constructed is the essence of the algorithm described in the next section. Some general characteristics of recognition graphs are described in the remainder of this section.
The recognition graph constructed under the guidance of M_C is denoted by G_R(M_C) = (Q, V, δ). At the start of recognition, G_R(M_C) is set to an initial configuration. Additional states and transitions are added to Q and δ, respectively, as the recognition proceeds. The denotation G_R(M_C) is simplified to G_R whenever the intent is obvious.
Each state added to Q during recognition corresponds to a set of items I_j ∈ I of M_C, 0 ≤ j ≤ m−1, and to a position i in w, 0 ≤ i ≤ n+1, where position i of w immediately follows a_i; the 0th position of w immediately precedes a_1. A subscript of j:i is used to denote the
[15] Earley, J. Ambiguity and precedence in syntax description. Acta Inf. 4(2), pp. 183-92, 1975.
[16] Ginsburg, S. and Greibach, S. Deterministic context-free languages. Inf. Control 9(6), pp. 620-48, Dec. 1966.
[17] Glanville, R. S. and Graham, S. L. A new method for compiler code generation. In Conference Record of the Fifth Annual ACM Symposium on Principles of Programming Languages, pp. 231-40, Association for Computing Machinery, New York, N. Y., Jan. 1978.
[18] Gonzalez, R. C. and Thomason, M. G. Syntactic Pattern Recognition, An Introduction. Addison-Wesley, Reading, Mass., 1978.
[19] Graham, S. L. and Harrison, M. A. Parsing of general context-free languages. In Advances in Computers, ed. M. Rubinoff and M. C. Yovits, vol. 14, pp. 77-185, Academic Press, New York, N. Y., 1976.
[20] Graham, S. L., Harrison, M. A., and Ruzzo, W. L. An improved context-free recognizer. ACM Trans. Program. Lang. Syst. 2(3), pp. 415-62, July 1980.
[21] Greibach, S. A. A note on pushdown store automata and regular systems. Proc. Amer. Math. Soc. 18(2), pp. 263-8, April 1967.
[22] Griffiths, T. and Petrick, S. On the relative efficiencies of context-free grammar recognizers. Commun. ACM 8(5), pp. 289-300, May 1965.
[23] Heering, J., Klint, P., and Rekers, J. Incremental generation of parsers. ACM SIGPLAN Notices 24(7), pp. 179-91, July 1989.
[24] Hopcroft, J. E. and Ullman, J. D. Introduction to Automata Theory, Languages, and Computation. Addison-Wesley, Reading, Mass., 1979.
[25] Kasami, T. and Torii, K. A syntax analysis procedure for unambiguous context-free grammars. J. ACM 16(3), pp. 423-31, July 1969.
[26] Kipps, J. R. Analysis of Tomita's algorithm for general context-free parsing. In Proceedings of the International Workshop on Parsing Technologies, pp. 193-202, Carnegie-Mellon U., Pittsburgh, Pa., 28-31 Aug. 1989.
[27] Knuth, D. E. On the translation of languages from left to right. Inf. Control 8(6), pp. 607-39, Oct. 1965.
[28] Knuth, D. E. Top-down syntax analysis. Acta Inf. 1(2), pp. 79-110, 1971.
[29] Kristensen, B. B. and Madsen, O. L. Methods for computing LALR(k) lookahead. ACM Trans. Program. Lang. Syst. 3(1), pp. 60-82, Jan. 1981.
[30] Langmaack, H. Application of regular canonical systems to grammars translatable from left to right. Acta Inf. 1(2), pp. 111-14, 1971.
[31] Lyon, G. Syntax-directed least-errors analysis for context-free languages: a practical approach. Commun. ACM 17(1), pp. 3-14, Jan. 1974.
q, namely &π. In addition, it must include the parse annotation relevant to the nullable suffix β, referred to here as [π_β]. One of the tasks of Traverse is to compute [π_β] and associate it with the item A → αX·β of ψ(p); in particular, Traverse will have done this by the time this reduction is made. Thus, the parse annotation [&π, π_β] is the fourth field of the entry pushed onto Succ_Stack that corresponds to this reduction.
(40) This line reflects the new form of the Succ_Stack entries. At this point [π₁] represents a nonempty sequence of pointers to parse annotations. These parse annotations correspond to the suffix of some production right-hand side that is being reduced to A.
(42-44) This loop demonstrates how parse annotations are built up during the course of computing the succ function. For every transition (r, X, r', [π₂]) that is traversed within this loop, a pointer to [π₂] together with the parse annotation built up so far, [π₁], becomes part of the parse annotation for the transition on A that is eventually installed in the recognition graph. Thus, [&π₂, π₁] is the fourth field of the appropriate entry pushed on the Succ_Stack.
(50-55) If (q_{j:i}, A, r, [π]) ∉ δ for any parse annotation [π], we proceed as before. The transition (q_{j:i}, A, r, [π₁]) is installed in G_R and added to δ_subset to allow for subsequent reductions back through it. Note that at this point π₁ represents a nonempty sequence of pointers to parse annotations corresponding to the right-hand side of some production that has been reduced to A; more specifically, the sequence of pointers corresponds to a path in G_R^{−1} that spells that right-hand side. On the other hand, if (q_{j:i}, A, r, [π₂]) ∈ δ for some parse annotation [π₂], then an ambiguity has been detected. The Disambiguate function is invoked, the details of which are not specified here, to decide which parse annotation out of [π₁] and [π₂] to retain with the transition.
It is apparent from Figure 7.1 that the Traverse function is substantially more extensive than before. It now consists of three while loops. Each loop is discussed in turn.
The first while loop is very similar to the single while loop contained in the version of Traverse used by the General_LR0 recognizer. Two new lines have been added and one line has been modified.
Corollary For 0 ≤ i ≤ n+1, define G_E',i,b = (∪_{0≤j≤i} S_j ∪ basis(S_i), V, ∪_{0≤j≤i} E_j ∪ basis(E_i)) and let M_E',i,b = (G_E',i,b, s_0, basis(S_i)) denote an NFA. Then L(M_E',i,b) = PVP(G, i:w).
Theorem 5.4 and its Corollary establish a direct relationship between Earley' and the General_LR recognition scheme. Indeed, Earley' prescribes one possible approach to realizing an implementation of General_LR. Note that the foregoing analysis of G_E' provides a constructive proof that for arbitrary x ∈ T*, VP(G, x) and PVP(G, x) are regular languages.
Earley's Algorithm and Viable Suffixes
The last section considered strings in V* that are spelled by rooted paths in G_E'. The string spelled by a path in G_E' is determined directly from the grammar symbols that label the transitions in that path. In this section, another string over V is associated with a path in G_E', viz., a string that is derived from the states in that path. Specifically, the state derivative of a path in G_E' is defined recursively by the state-derivative function given in Figure 5.3.
function state_derivative((s_0, s_1, ..., s_m))
// (s_0, s_1, ..., s_m), m ≥ 0, is a path in G_E'.
if m = 0 then // Let s_0 = [A → α·β, j].
    return(β^R)
else if s_0 = [A → α·Xβ, j] and s_1 = [A → αX·β, j] then // (s_0, X, s_1) ∈ δ_E'
    return(state_derivative((s_1, s_2, ..., s_m)))
else // Let s_0 = [A → α·Bβ, j] and s_1 = [B → ·ω, i].
    return(β^R (state_derivative((s_1, s_2, ..., s_m))))
fi
end
Figure 5.3 The Definition of the State Derivative of a Path
Again, let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Earley' to G and w. It transpires that the state derivative of an arbitrary rooted path in G_E' is a viable suffix of G. Moreover, the state derivative of a rooted path in G_E' which terminates
(1) Every transition in basis(E_i) is in E_i.
(2) If s = [A → α·Bβ, j] is in S_i, then for all B → ω ∈ P, (s, ε, t) is in E_i where t = [B → ·ω, i] ∈ S_i.
(3) If [B → ω·, j] is in S_i, then for all s = [A → α·Bβ, k] in S_j, (s, B, t) is in E_i where t = [A → αB·β, k] ∈ S_i.
Transitions added to E_i by rules (2) and (3) above correlate closely with the states that are generated by the Predictor and Completer functions, respectively.
A high-level description of Earley' is given in Figure 5.2. In that figure, we assume (1) a generalized Closure function which concurrently constructs S_i and E_i, 0 ≤ i ≤ n, after they are initialized to basis(S_i) and basis(E_i), respectively, and (2) a modified Scanner function which computes basis(S_{i+1}) and basis(E_{i+1}), 0 ≤ i ≤ n. The correctness of Earley' follows from the well-established correctness of Earley's original algorithm.
function Earley'(G = (V, T, P, S); w ∈ T*)
// w = a_1 a_2 ··· a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $.
basis(S_0), basis(E_0) := {[S' → ·S$, 0]}, ∅
for i := 0 to n do
    (S_i, E_i) := Closure(basis(S_i), basis(E_i))
    (basis(S_{i+1}), basis(E_{i+1})) := Scanner(S_i, a_{i+1})
    if basis(S_{i+1}) = ∅ then Reject(w) fi
od
S_{n+1}, E_{n+1} := basis(S_{n+1}), basis(E_{n+1})
Accept(w)
end
Figure 5.2 A Modified Earley Recognizer
The STG G_E' is informally called the Earley state graph. When the Earley state graph is complete, G_E' = (Q_E', V, δ_E') where Q_E' = ∪_{0≤i≤n+1} S_i and δ_E' = ∪_{0≤i≤n+1} E_i.
As every state in G_E' is reachable from the initial state s_0 = [S' → ·S$, 0], s_0 is also called the root of G_E'. A path in G_E' which begins at the root is called a rooted path in G_E'.
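In the same spirit as Figure 5.2, a compact Earley recognizer can be sketched as follows (the E_i transition sets are omitted, the start symbol is assumed to have a single production, and ε-productions are not handled; all names are illustrative):

```python
# A compact Earley recognizer sketch. A state [A -> alpha . beta, j] is
# encoded as (A, rhs, dot, j). The worklist loop plays the role of the
# generalized Closure function (Predictor + Completer); epsilon-productions
# would require extra care in the Completer and are deliberately excluded.

def earley_recognize(prods, start, tokens):
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0] = {(start, prods[start][0], 0, 0)}
    for i in range(n + 1):
        work = list(S[i])
        while work:                                      # Closure
            (A, rhs, dot, j) = work.pop()
            if dot < len(rhs) and rhs[dot] in prods:     # Predictor
                for omega in prods[rhs[dot]]:
                    t = (rhs[dot], omega, 0, i)
                    if t not in S[i]:
                        S[i].add(t); work.append(t)
            elif dot == len(rhs):                        # Completer
                for (B, rhs2, d2, k) in list(S[j]):
                    if d2 < len(rhs2) and rhs2[d2] == A:
                        t = (B, rhs2, d2 + 1, k)
                        if t not in S[i]:
                            S[i].add(t); work.append(t)
        if i < n:                                        # Scanner
            S[i + 1] = {(A, rhs, dot + 1, j)
                        for (A, rhs, dot, j) in S[i]
                        if dot < len(rhs) and rhs[dot] == tokens[i]}
            if not S[i + 1]:
                return False                             # Reject(w)
    return any(A == start and dot == len(rhs)
               for (A, rhs, dot, j) in S[n])             # Accept(w)?

# S -> a S | a, with a hypothetical augmented start S' -> S:
prods = {"S'": [("S",)], "S": [("a", "S"), ("a",)]}
print(earley_recognize(prods, "S'", ["a", "a"]))  # True
print(earley_recognize(prods, "S'", []))          # False
```

Reading each scanned or completed item back to the item it advanced traces out exactly the rooted paths in the Earley state graph analyzed above.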
a ∈ T. This is expressed more neatly as α(⊨* ⌐)_a β. The notation relevant to the reflexive transitive closure of this product is as follows. For all α ∈ V*, α(⊨* ⌐)^ε α holds in G; for α, β, γ ∈ V*, x ∈ T^{n−1} with n ≥ 1, and a ∈ T, if α(⊨* ⌐)^x β and β(⊨* ⌐)_a γ hold in G, then α(⊨* ⌐)^{xa} γ holds in G. If α(⊨* ⌐)^x β holds in G for some α, β ∈ V* and x ∈ T^n, n ≥ 0, any of the expressions α(⊨* ⌐)^x β, α(⊨* ⌐)^n β, or α(⊨* ⌐)* β may be used to denote this if the string x or its length n is not relevant.
The following lemma compares relational expressions involving the ⇒_R and ⌐ relations with relational expressions involving the ⊨ and ⌐ relations.
Lemma 4.1 For α, β ∈ V* and x ∈ T*, α(⇒*_R ⌐)^x ⇒*_R β holds in G if and only if β(⊨* ⌐)^{x^R} ⊨* α holds in G.
Proof. First suppose that x = ε. By definition, both (⇒*_R ⌐)^ε and (⊨* ⌐)^ε are equivalent to the identity relation on V*. Thus, the following statements are equivalent.
(1) α(⇒*_R ⌐)^ε ⇒*_R β;
(2) α ⇒*_R β;
(3) β (⇒*_R)^{−1} α;
(4) β ⊨* α; and
(5) β(⊨* ⌐)^ε ⊨* α.
Now let x = a_1 a_2 ··· a_n, n ≥ 1. The following statements are equivalent in this case.
(1) α(⇒*_R ⌐)^{a_1 a_2 ··· a_n} ⇒*_R β;
(2) α ⇒*_R γ_1 a_1 ⌐ γ_1, γ_{i−1} ⇒*_R γ_i a_i ⌐ γ_i for 2 ≤ i ≤ n, and γ_n ⇒*_R β, for some γ_1, γ_2, ..., γ_n ∈ V*;
(3) β ⊨* γ_n, γ_i a_i ⊨* γ_{i−1} for 2 ≤ i ≤ n, and γ_1 a_1 ⊨* α, for some γ_1, γ_2, ..., γ_n ∈ V* (each step of (2) read in reverse, by the basis case);
(4) β(⊨* ⌐)_{a_n}(⊨* ⌐)_{a_{n−1}} ··· (⊨* ⌐)_{a_1} ⊨* α;
(5) β(⊨* ⌐)^{a_n a_{n−1} ··· a_1} ⊨* α, i.e., β(⊨* ⌐)^{x^R} ⊨* α.
1.  function General_NLR0(G = (V, T, P, S); w ∈ T*)
2.  // w = a_1 a_2 ··· a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $.
3.  // Let M_NC(G) = (I, V, goto, I_0, I) be the NLR(0) automaton for G.
4.  // G_R(M_NC) = (Q, V, δ) is an STG, the recognition graph.
5.      Q, δ := {q_{0:0}}, ∅ // Initialize G_R.
6.      // Let M_R = (G_R^{−1}, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
7.      for i := 0 to n do
8.          // Let M_R = (G_R^{−1}, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
9.          Reduce(i)
10.         // Let M_R = (G_R^{−1}, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.         Shift(i)
12.         // Let M_R = (G_R^{−1}, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.         if Q_{i+1} = ∅ then Reject(w) fi
14.     od
15.     // Let M_R = (G_R^{−1}, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.     Accept(w)
17. end
18. function Shift(i)
19.     Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.     while Q_subset ≠ ∅ do
21.         q := Remove(Q_subset) // Let goto(ψ(q), a_{i+1}) = I_j.
23.         Q := Q ∪ {q_{j:i+1}} // Never redundant.
25.         δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)} // Never redundant.
26.     od
27. end
Figure 6.2 The General_NLR0 Recognizer
(34) The appropriate successors of p along paths in G_R that spell Xα^R are located using the Σsucc and εsucc functions (instead of the succ function). This is necessitated by the presence of ε-transitions in G_R.
(50) Similar to General_LR0, the Traverse function of General_NLR0 effectively performs a certain traversal of M_NC. However, in this case we also want to step over ε-transitions. Traversing ε-transitions in this way mirrors the Earley Predictor function.
Theorem 3.42 Let G = (V, T, P, S) be an arbitrary grammar and x an arbitrary string over T. Then PVS_LL(G, x) and VS_LL(G, x) are regular languages.
Proof. The proof is by induction on len(x) = n. In particular, we show that PVS_LL(G, x) = PVP_RR(G^R, x^R) and VS_LL(G, x) = VP_RR(G^R, x^R). The proof is mostly an exercise in recalling definitions and putting them in the appropriate form.
Basis (n = 0). The following two equalities are obvious: (1) PVS_LL(G, ε) = {ω ∈ V* | S → ω ∈ P^R} = PVP_RR(G^R, ε); (2) VS_LL(G, ε) = {β ∈ V* | α ⇒*_R β holds in G^R for some α ∈ PVS_LL(G, ε)} = {β ∈ V* | α ⇒*_R β holds in G^R for some α ∈ PVP_RR(G^R, ε)} = VP_RR(G^R, ε).
Induction (n > 0). Let x = ya for some y ∈ T^{n−1} and a ∈ T. By the induction hypothesis, PVS_LL(G, y) = PVP_RR(G^R, y^R) and VS_LL(G, y) = VP_RR(G^R, y^R). Hence, PVS_LL(G, ya) = {β ∈ V* | α ⌐_a β for some α ∈ VS_LL(G, y)} = {β ∈ V* | α ⌐_a β for some α ∈ VP_RR(G^R, y^R)} = PVP_RR(G^R, a y^R) = PVP_RR(G^R, (ya)^R). Finally, VS_LL(G, ya) = {β ∈ V* | α ⇒*_R β holds in G^R for some α ∈ PVS_LL(G, ya)} = {β ∈ V* | α ⇒*_R β holds in G^R for some α ∈ PVP_RR(G^R, (ya)^R)} = VP_RR(G^R, (ya)^R). From Theorem 3.25, we conclude that PVS_LL(G, x) and VS_LL(G, x) are regular languages.
Discussion
A simple framework for describing general canonical top-down recognition was presented. The set-theoretic framework is based on two relations on strings, ⇒_r̄ and ⊢. A key property of both of these relations is that they preserve regularity. The essence of general top-down recognition was captured in terms of computing the images of regular sets under these relations.
The definitions of the various objects of importance in the framework, namely sentences, suffixes and prefixes of sentences, right and left sentential forms, etc., were cast in terms of the ⇒_r̄ and ⊢ relations. Consequently, it is a small step from these definitions to the recognition schemes that are based on them. In addition, the correctness of the recognizers is particularly easy to establish.
In light of these observations, the parsing version of General_LR0, General_LR0′, overtly maintains a representation of a parse forest. The manner in which this is accomplished is a simple generalization of the following proposed scheme for explicitly constructing a parse tree within an LR parser.
Suppose that G is an LR grammar. We consider a hypothetical LR parser for G and describe one way to explicitly build a parse tree for an input string in conjunction with the parse stack. We may assume that the parser is based on some LR automaton for G, say M. At any point during a parse, the contents of the stack is a sequence of states from M. The parse tree that is synthesized during parsing is represented by associating a node in the tree with each state in the stack other than the bottommost state.
Let the contents of the stack at some point be s_0 s_1 ... s_m, m ≥ 0, where each s_i is a state of M; in particular, s_0 is the start state of M. For 1 ≤ i ≤ m, let X_i denote the entry symbol for state s_i. Thus, X_1 X_2 ... X_m is the viable prefix of G that is implicitly represented by the supposed stack contents. If m = 0, the relevant viable prefix is ε. For 1 ≤ i ≤ m, assume that some representation of a parse tree node labeled with X_i is attached to the entry for s_i in the stack. The shift and reduce actions of M generate additional tree nodes as follows.
A shift action always creates a new leaf node. Suppose that the current input symbol is a and the next action of the parser is to shift a from s_m. As a result of this action, the contents of the stack becomes s_0 s_1 ... s_m t_1 where goto(s_m, a) = t_1. As a side effect, a new parse tree node is generated, labeled with a, and attached to t_1 in the stack.
A reduce action typically generates one internal node. However, when reducing by an ε-production, a leaf node is also created. Suppose that the next action called for by the parser is to reduce by production A → ε. This action transforms the contents of the stack to s_0 s_1 ... s_m t_2 where goto(s_m, A) = t_2. Two new tree nodes are generated as a side effect. One tree node is a leaf that is labeled with ε. The second is an internal tree node; it is labeled with A, set to point to the new leaf, and attached to t_2.
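The scheme just described is easy to render concretely. The following Python sketch is our own illustration, not code from an actual implementation; the Node class, the goto table encoding, and the function names are hypothetical stand-ins for the corresponding parts of a real LR parser.

```python
class Node:
    """A parse tree node; one node is attached to each stack entry."""
    def __init__(self, label, children=None):
        self.label = label
        self.children = children or []

def shift(stack, goto, a):
    """Shift terminal a: push t1 = goto(s_m, a) together with a new leaf."""
    s_m, _ = stack[-1]
    stack.append((goto[(s_m, a)], Node(a)))

def reduce(stack, goto, A, rhs_len):
    """Reduce by A -> X1...Xk (k = rhs_len): pop k entries, make their tree
    nodes the children of a new internal node labeled A, and push
    t2 = goto(s_m, A) with that node.  Reducing by an epsilon-production
    (k = 0) additionally creates a leaf labeled with epsilon."""
    if rhs_len > 0:
        children = [node for _, node in stack[-rhs_len:]]
        del stack[-rhs_len:]
    else:
        children = [Node("ε")]
    s_m, _ = stack[-1]
    stack.append((goto[(s_m, A)], Node(A, children)))
```

For instance, with a goto table in which goto(s_0, a) = s_1 and goto(s_0, S) = s_2, shifting a and then reducing by S → a leaves the node for S attached to the topmost stack entry.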
For 0 ≤ i ≤ n+1, each state in S_i has the form [A → α•β, j] with 0 ≤ j ≤ i; the second component of an Earley state is called a backpointer. Thus, note that in rule (2) of the S_Closure function, 0 ≤ j ≤ i holds.
For 0 ≤ i ≤ n+1, S_i ≠ ∅ if and only if i:w ∈ PREFIX(G); this is just a restatement of the fact that Earley's algorithm is a correct-prefix recognizer. Moreover, w ∈ L(G) if and only if S_{n+1} = {[S′ → S$•, 0]}. Conversely, w ∉ L(G) if and only if S_i = ∅ for some i, 0 ≤ i ≤ n+1; in that case, S_j ≠ ∅ for all j, 0 ≤ j < i, such that j:w ∈ PREFIX(G) holds.
The correctness of Earley's algorithm is based on the criteria which place a state in a particular state set [6]. In that regard, the following statements are made.
Fact 5.1 For 0 ≤ i ≤ n+1, [A → α•β, j] ∈ S_i if and only if α ⇒* a_{j+1} a_{j+2} ... a_i and S′ ⇒* δAγ hold in G for some δ, γ ∈ V* such that δ ⇒* a_1 a_2 ... a_j holds in G.
Facts 5.2 and 5.3 below ascribe bottom-up and top-down interpretations, respectively, to Fact 5.1.
Fact 5.2 For 0 ≤ i ≤ n+1, [A → α•β, j] ∈ S_i if and only if α ⇒* a_{j+1} a_{j+2} ... a_i and S′ ⇒*_r δAy hold in G for some δ ∈ V* and y ∈ T* such that δ ⇒* a_1 a_2 ... a_j holds in G.
Note that δ ∈ VP(G, j:w) and δα ∈ VP(G, i:w). We say that [A → α•β, j] ∈ S_i is valid for δα ∈ VP(G, i:w); in particular, [A → •αβ, j] ∈ S_j is valid for δ ∈ VP(G, j:w). If α ≠ ε also holds, then we say that [A → α•β, j] ∈ S_i properly cuts δα ∈ VP(G, i:w).
Fact 5.3 For 0 ≤ i ≤ n+1, [A → α•β, j] ∈ S_i if and only if α ⇒* a_{j+1} a_{j+2} ... a_i and S′ ⇒*_l a_1 a_2 ... a_j Aδ hold in G for some δ ∈ V*.
In this case, note that (Aδ)^R ∈ VS(G, j:w) and (βδ)^R ∈ VS(G, i:w). We say that [A → α•β, j] ∈ S_i is valid for (βδ)^R ∈ VS(G, i:w); in particular, [A → •αβ, j] ∈ S_j is valid for (Aδ)^R ∈ VS(G, j:w).
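Facts 5.1 through 5.3 concern states of the form [A → α•β, j]. As a concrete companion, the following Python sketch implements a basic Earley recognizer over such states. It is our own minimal version with a hypothetical grammar encoding, and for simplicity it assumes a grammar without ε-productions, so the well-known subtlety with completing nullable nonterminals within the same state set does not arise.

```python
def earley_recognize(grammar, start, tokens):
    """Earley recognizer.  grammar maps each nonterminal to a list of
    right-hand sides (tuples of symbols); a state is a quadruple
    (A, rhs, dot, j) standing for [A -> alpha . beta, j]."""
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    for rhs in grammar[start]:
        S[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        worklist = list(S[i])
        while worklist:
            A, rhs, dot, j = worklist.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:                      # Predictor
                    for prod in grammar[sym]:
                        st = (sym, prod, 0, i)
                        if st not in S[i]:
                            S[i].add(st)
                            worklist.append(st)
                elif i < n and tokens[i] == sym:        # Scanner
                    S[i + 1].add((A, rhs, dot + 1, j))
            else:                                       # Completer
                for B, rhs2, dot2, k in list(S[j]):
                    if dot2 < len(rhs2) and rhs2[dot2] == A:
                        st = (B, rhs2, dot2 + 1, k)
                        if st not in S[i]:
                            S[i].add(st)
                            worklist.append(st)
    return any(A == start and dot == len(rhs) and j == 0
               for A, rhs, dot, j in S[n])
```

For the left-recursive grammar S → S + a | a, this recognizer accepts a+a and rejects a+.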
ambiguities are made convenient by the structure of the parse forest. In contrast, Earley's parser produces a rather indirect representation of the parse forest. Little is said in the literature of how this affects the ease with which a parse is produced or with which ambiguities are resolved by Earley's parser.
The hypothetical Disambiguate function referred to in Figure 7.1 allowed us to keep the specification of transitions simple. By assumption, Disambiguate resolved ambiguities at the point where they were first detected, so only one parse annotation was ever attached to a given transition in GR. Of course, this assumption is unrealistic in the general case. A substantive treatment of ambiguity and its resolution is well beyond the scope of this work. However, the following very basic observations may be made.
The task confronted by Disambiguate in line 54 of Figure 7.1 is to determine which of the transitions (q_{j:i}, A, r, [π_1]) and (q_{j:i}, A, r, [π_2]) to retain in GR. In order of increasing complexity, a selection may be made based on the following strategies.
(1) A direct comparison of [π_1] and [π_2].
(2) A combination of (1) and an analysis of the subparse trees referred to by [π_1] and [π_2], respectively.
(3) An analysis of the surrounding context in combination with (1) and (2).
The Disambiguate function could conceivably resolve ambiguities that entailed analyses of type (1) or (2) above. On the other hand, ambiguities requiring type (3) analysis would have to be postponed until later in the parse if they depended on right context. Some simple approaches to handling ambiguity are described by Aho et al. [2], Earley [15], Tarhio [41], and Wharton [45].
Both LR parsers and Earley's algorithm are based on items. Each state of an LR parser corresponds to a set of LR items. Earley's algorithm constructs a sequence of state sets during recognition. The states manipulated by Earley's algorithm (call them Earley states) are slightly elaborated LR items.
Earley's algorithm and LR parsers scan the input string from left to right, recognizing an incrementally longer prefix of it in the process. That is, they are correct-prefix recognizers.
Both LR parsers and Earley's algorithm work in a bottom-up fashion. An LR parser determines the reversed rightmost derivation of an input string. In contrast, Earley's algorithm has the capability of producing all of the reversed rightmost derivations of an input string.
The relationship between Earley's algorithm and LR parsers can be described on a more fundamental level in terms of viable prefixes. Viable prefixes are certain prefixes of right sentential forms. At each point during a parse, the contents of an LR parser's stack implicitly represents a viable prefix which derives the portion of the input string parsed to that point. We let VP(G) denote the set of viable prefixes of a grammar G. In addition, let VP(G, x) denote the set of those viable prefixes of G which derive x, a string over the terminal alphabet of G.
Turning now to Earley's algorithm, consider a point in a parse at which some prefix x of the input string has been processed. The sequence of Earley state sets constructed up to that point encapsulates the strings in VP(G, x). The manner in which VP(G, x) is normally represented in the state sets is rather indirect. However, this representation can be made explicit through a variant of Earley's algorithm which constructs a directed graph whose vertices are the states generated by the original algorithm. Under an appropriate interpretation, this graph yields a finite-state automaton which accepts VP(G, x). Details of this proposed graphical variant of Earley's algorithm are supplied later.
The last two chapters were devoted to describing practical recognizers and parsers that are derived from the General_LR recognition scheme. Automata-based versions of General_LR are obtained by using an automaton that accepts VP(G) to guide the construction of a state-transition graph, the recognition graph. The recognition graph explicitly represents the sets of viable prefixes that are computed by General_LR. In the discussion of the algorithms, LR(0) and NLR(0) automata were used as control automata. However, other choices are possible, such as automata that are intermediate between the LR(0) and NLR(0) automata as well as automata that are attributed with lookahead. The General_LR0 parser can process arbitrary reduced context-free grammars. To accommodate especially ill-designed grammars, simple means for dealing with pathological grammar properties were presented. Finally, the parse forest representation used by the General_LR0 parser is easy to understand and convenient for handling ambiguity.
We have included some discussion of how the Earley and Tomita algorithms compare to ours. Although the O(n^{p+1}) worst-case time complexity of the General_LR0 recognizer does not compare favorably with the O(n^3) worst-case complexity of Earley's recognizer, it is expected that General_LR0 would outperform Earley's algorithm in most practical situations. Moreover, it is more convenient to work with the representation of the parse forest that is used in our framework. The General_LR0 algorithm is in the same complexity class as Tomita's algorithm. This is not a surprising result given the similarities between the two. However, our algorithm can parse any reduced grammar. Thus, we have generalized Tomita's algorithm; ironically, our general algorithm is also simpler than Tomita's. Lastly, our framework provides some firm theoretical justification for Tomita-like parsers. Tomita's algorithm is notably lacking in that respect in that it is more of an ad hoc generalization of the standard LR parsing algorithm.
The general parsers derived in our framework, viz., the General_LR0 parser and its variants, are appropriate to areas of application which require more flexible parsers than are provided within the confines of LR parsing theory. In a more general sense, our work pro-
state in Q that corresponds to I_j and input position i, e.g., q_{j:i}. The function ψ: Q → I is defined to map a state in GR to its associated set of items in M_C; thus, ψ(q_{j:i}) = I_j. For later use, we define Q_i = {q_{j:i} ∈ Q}, 0 ≤ i ≤ n+1, i.e., Q_i consists of all states in Q that correspond to input position i.
Similarly, each transition added to δ during recognition corresponds to a transition in M_C. The members of δ are best described in terms of the goto mapping of M_C as induced by ψ, as follows: for p, q ∈ Q and X ∈ V, (p, X, q) ∈ δ only if goto(ψ(q), X) = ψ(p). Thus, each transition in GR corresponds to the reversal of a transition in M_C. Consequently, all of the transitions out of a state p ∈ Q are on entry(ψ(p)). Valid transitions in GR are also constrained by input position; specifically, (q_{k:i}, X, q_{j:h}) ∈ δ implies that h ≤ i. For later use, we define δ_i = {(q_{j:i}, X, p) ∈ δ}, i.e., δ_i consists of all transitions in GR that emanate from states in Q_i.
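Under these conventions, a direct representation of GR suggests itself: key each state by the pair (j, i), maintain Q_i as an index by input position, and store δ as a set of reversed transitions. The sketch below is our own illustration of this bookkeeping; the class and method names are hypothetical.

```python
from collections import defaultdict

class RecognitionGraph:
    """States are pairs (j, i): item set I_j of M_C at input position i.
    Transitions run in the reverse direction of those of M_C."""
    def __init__(self):
        self.Q = set()                       # states q_{j:i}
        self.delta = set()                   # transitions (p, X, q)
        self.by_position = defaultdict(set)  # Q_i, indexed by i

    def add_state(self, j, i):
        q = (j, i)
        if q not in self.Q:
            self.Q.add(q)
            self.by_position[i].add(q)
        return q

    def add_transition(self, p, X, q):
        """Record (p, X, q); the caller guarantees goto(psi(q), X) = psi(p)."""
        self.delta.add((p, X, q))

    def delta_i(self, i):
        """All transitions that emanate from states in Q_i."""
        return {t for t in self.delta if t[0] in self.by_position[i]}
```

The position index makes the sets Q_i and δ_i used by the algorithm available without scanning all of Q.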
The General_LR0 Recognizer
The general context-free recognizer, informally named General_LR0, is described in this section. Concurrently, intuitive arguments for its correctness are presented. Establishing the correctness of General_LR0 reduces to demonstrating that it is a faithful realization of the General_LR recognition scheme, i.e., that the sets of viable prefix associates that General_LR tracks are correctly represented in the graph constructed by General_LR0 as w is scanned from left to right.
General_LR0 is described in terms of how it operates when it is applied to G and w. Under the guidance of M_C(G), the LR(0) automaton of G, General_LR0 constructs a recognition graph GR(M_C). Some general notions about recognition graphs were introduced in the last section. The description of General_LR0 that follows provides more specific details about how GR is derived from M_C and w. For reference, General_LR0 is rendered in pseudocode in Figure 6.1.
to replace at each step. Instead, rightmost and leftmost derivations are preferred for the additional constraints that they place on the parse tree construction process.
Since rightmost and leftmost derivations are defined in terms of subrelations of the ⇒ relation, they also construct parse trees top-down. In addition, they impose a canonical¹ order on the construction of parse trees. Specifically, rightmost derivations construct parse trees from right to left, whereas leftmost derivations construct them from left to right. Some basic notions about rightmost and leftmost derivations are briefly reviewed next.
Rightmost and leftmost derivations are based on the r-derives (⇒_r) and l-derives (⇒_l) relations, respectively. These relations are formally defined by ⇒_r = {(αAz, αωz) | α ∈ V*, A → ω ∈ P, z ∈ T*} and ⇒_l = {(xAβ, xωβ) | x ∈ T*, A → ω ∈ P, β ∈ V*}. Rightmost derivations (resp. leftmost derivations) are defined in terms of the reflexive-transitive closure of ⇒_r (resp. ⇒_l) in the usual fashion.
For γ ∈ V*, if S ⇒*_r γ holds in G, then γ is called a right sentential form of G. The set of the right sentential forms of G is denoted by SF_r(G). The inclusion SF_r(G) ⊆ SF(G) holds and is typically, but not always, proper. In contrast, for w ∈ T*, S ⇒* w holds in G if and only if S ⇒*_r w holds in G. Thus, L(G) = {w ∈ T* | S ⇒*_r w holds in G}.
For A ∈ N and X ∈ V, if A ⇒+_r αX holds in G for some α ∈ V*, then X is right-reachable from A; furthermore, if X = A, then A is right-recursive. A grammar that has a right-recursive nonterminal is a right-recursive grammar. A symbol X ∈ V is nullable in G if and only if X ⇒*_r ε holds in G.
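Nullability is computable as a least fixpoint over the productions. The following Python sketch is our own illustration; the production encoding is an assumption made for the example.

```python
def nullable_symbols(productions):
    """Compute {A | A =>* epsilon} by iterating to a fixpoint.
    productions: list of (lhs, rhs) pairs, rhs a tuple of symbols.
    A nonterminal becomes nullable once some right-hand side for it
    consists entirely of already-nullable symbols (vacuously so for
    an epsilon-production, whose rhs is the empty tuple)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(X in nullable for X in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```

Terminals never appear as a left-hand side and are therefore never classified as nullable, as required.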
Any string γ ∈ V* such that S ⇒*_l γ holds in G is a left sentential form of G. The set of the left sentential forms of G is denoted by SF_l(G). Similar to the above, the relationship SF_l(G) ⊆ SF(G) holds and is generally proper. In addition, L(G) = {w ∈ T* | S ⇒*_l w holds in G}.
Given A ∈ N and X ∈ V, if A ⇒+_l Xβ holds in G for some β ∈ V*, then X is left-reachable from A; if it further holds that X = A, then A is left-recursive. A grammar is
1 In the literature, the term "canonical" is typically associated with rightmost derivations only.
The denotations RR, LL, and LR that pervade Chapters III and IV were suggested by Knuth [28], where the following deterministic context-free grammar classes and the methods of their analysis are enumerated:
RR(k): scan from right to left, deduce rightmost derivations;
LL(k): scan from left to right, deduce leftmost derivations;
LR(k): scan from left to right, deduce reversed rightmost derivations; and
RL(k): scan from right to left, deduce reversed leftmost derivations.
Here, k ≥ 0 indicates the length of lookahead strings used. Note that the use of these denotations is meant to evince a generalization of the respective parsing methods rather than a generalization of the grammatical classes. A corresponding General_RL recognition scheme is not included here. To mesh with the other recognition schemes, it would utilize the ⇒_r̄ and ⊢ relations defined in terms of G^R. Images of regular subsets of VS(G) under these relations would be tracked by General_RL as an input string is scanned from right to left.
The General_RR recognition scheme was developed primarily as a stepping stone to General_LL and General_LR. General_RR is given little attention in the remaining chapters. Consequently, VP(G, x) (resp. PVP(G, x)) is used to denote VP_LR(G, x) (resp. PVP_LR(G, x)). Similarly, VS(G, x) (resp. PVS(G, x)) is used to denote VS_LL(G, x) (resp. PVS_LL(G, x)).
CHAPTER I
INTRODUCTION
Context-free recognition is the algorithmic process by which the membership of a string x within a context-free language L is decided. This involves determining whether x is derived by some context-free grammar G where L = L(G). Parsing is the process of ascertaining the syntactic structure imparted to x by G.
From a theoretical standpoint, context-free recognition and parsing hold considerable interest in their own right. Yet context-free grammars and their recognizers and parsers have substantial practical value as well. Most notably, results from parsing theory have proven indispensable to the implementation of programming languages. Other areas of application include natural language processing [34], syntactic pattern recognition [18], and code generation in compilers [10].
Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of G, a general recognizer (resp. parser) recognizes (resp. parses) x with respect to G. The work presented here contributes to the area of general context-free recognition and parsing. The following section provides some motivation and a brief overview of this dissertation.
Overview
The LR parsers, namely those parsers that effect a Left-to-right scan of the input while producing a Right parse, define the most powerful class of deterministic parsers. Earley's algorithm, on the other hand, is arguably the most efficient general parser. Despite the fact that LR parsers are restricted to LR(k) grammars whereas Earley's algorithm can parse strings against any context-free grammar, there are close parallels between the two.
Only minor modifications were required of the Shift and Reduce functions in order to accommodate parsing. The Traverse function, on the other hand, was changed substantially. It is important to note that Traverse can handle the most ill-formed grammars. For example, consider the grammar with the production set P = {S_0 → S_0 a, S_0 → a, S_0 → S_1, ..., S_{k−1} → S_0} for some k ≥ 1. This grammar was submitted by Graham et al. [20, p. 429] as an example of a particularly bad worst case. Although this is a contrived example, the ability to effectively deal with pathological conditions if and when they arise is valuable from both a theoretical and practical standpoint. Toward that end, the Traverse function handles the worst situations in a fairly straightforward manner. Nevertheless, Traverse can be tailored to meet the specific requirements of the subject grammar if the generality it provides is not needed.
A parse annotation for a nonterminal transition is manufactured as a sequence of pointers to the parse annotations that are encountered while a path is traversed during a reduction. Tomita's algorithm performs similar operations to construct a parse forest. In his parsing algorithm, the symbol vertices of the recognizer are used for storing pointers to the nodes of the parse forest. Of course, the complexity introduced into the recognizer by the symbol vertices and the ad hoc manner in which productions are handled carry over to the parser.
In Tomita's algorithm, the parse forest is built separately from the graph-structured stack. General_LR0′ constructs the parse forest more or less on top of the recognition graph, but could just as easily build the parse forest separately as well. The choice that is made for an actual implementation primarily has implications on garbage collection.
The worst-case time complexity of General_LR0′ matches that of General_LR0. With respect to General_LR0′, the expression n^{p+1} reflects the time required, in the worst case, to construct a direct representation of the parse forest. Thus, the relative inefficiency of General_LR0 as compared to Earley's recognizer is offset by the benefits accrued by General_LR0′. Specifically, the traversals that are required to produce a parse and to resolve
a transition on a_{i+1} from ψ(q) in M_C. The set variable called Q_subset is initialized to contain these states.
(20) Each state in Q_subset is considered in turn. No additional states are added to Q_subset within the while loop.
(21-25) A state q is removed from Q_subset. Since (ψ(q), a_{i+1}, I_j) is a transition in M_C, we need to add q_{j:i+1} to Q and (q_{j:i+1}, a_{i+1}, q) to δ. It is possible that there is more than one transition on a_{i+1} to I_j in M_C, so q_{j:i+1} may have been added to Q in an earlier iteration of the while loop. This condition is checked in line 22 and q_{j:i+1} is added to Q only if it is necessary. However, the transition (q_{j:i+1}, a_{i+1}, q) cannot already be in δ since there is only one transition on a_{i+1} from ψ(q) in M_C. This transition is added to δ in line 25.
(27) By assumption, the precondition in line 10 holds when Shift is called. Based on the manner in which certain paths in GR are extended by the Shift function under the guidance of M_C, the postcondition of Shift holds at this point.
The transformations of GR made by Reduce are considerably more elaborate. This is not unexpected since Reduce computes the reflexive-transitive closure of a relation.
The operation of the Reduce function during its ith invocation from General_LR0 is described for some i, 0 ≤ i ≤ n. During this invocation, Reduce adds states to Q_i and installs transitions from states in Q_i to states in Q_j where 0 ≤ j ≤ i. The transitions among states in Q_i warrant special treatment. They are problematic in the general case as they can introduce cycles into the recognition graph. These transitions, always made on nullable nonterminals, are handled separately by the Traverse function.
(9,28) Like Shift, the Reduce function is supplied with i as an argument so that the relationship between the values of i in General_LR0 and Reduce is explicit.
(29) At this point, each transition in δ_i may come from a state that calls for one or more reductions. If i = 0, then there are no applicable transitions. If i > 0, the relevant transitions were installed in GR by Shift during the previous iteration of the main for loop of
CHAPTER VII
A GENERAL BOTTOM-UP PARSER
The General_LR0 recognizer is extended into a general bottom-up parser in this chapter. The transformation from general recognizer to general parser is straightforward in all but one respect: some effort must be expended to parse arbitrary derivations of the empty string. Briefly, a parse of an input string is represented by appropriately annotating the transitions of the recognition graph. Ambiguity is accommodated by attaching multiple annotations to relevant transitions. As usual, an arbitrary reduced S-augmented grammar G = (V, T, P, S) and an arbitrary string w = a_1 a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, and a_{n+1} = $, are assumed throughout.
From Recognition to Parsing
Implementations of deterministic bottom-up parsers, of which LR parsers are exemplary, are not obliged to build an explicit parse tree for the input string. Whether or not a parse tree is indeed constructed is primarily dictated by the requirements of the application to which the parser is applied. Other factors which are influential include memory constraints and the interface between the parser and other processing components.
In contrast, general bottom-up parsers typically cannot avoid explicit parse tree representations. When parsing against a nondeterministic grammar, a forest of parse trees, rather than an identifiably unique tree, is typically relevant to the input string. Due to theoretical limitations on the discrimination afforded by lookahead, this behavior is even observed with unambiguous grammars. In any case, some representation of the parse forest must be built during parsing so that a unique parse can eventually be produced.
Lemma 3.3, in contrast to Lemma 3.2, illustrates that a rightmost derivation departs from a strong rightmost derivation following the step where a terminal symbol first appears at the right end of a string occurring in the rightmost derivation. The role of the second relation that we introduce is to dispense with terminal symbols as they appear at the right end of strings in strong rightmost derivations. Specifically, the chop relation is defined by ⊢ = {(αa, α) | α ∈ V*, a ∈ T}. For every a ∈ T, ⊢_a denotes the subrelation of ⊢ with domain V*a. Thus, for α, β ∈ V* and a ∈ T, α ⊢_a β holds if and only if α ⊢ β and α = βa hold.
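At the level of sets of strings, applying ⊢_a amounts to erasing a trailing a, i.e., taking a right quotient by {a}, which is why the chop relation preserves regularity. A small Python illustration of the set-level image, with grammar symbols modeled as single characters (our own example):

```python
def chop_image(L, a):
    """Image of the string set L under the chop subrelation for terminal a:
    alpha |-_a beta holds iff alpha = beta + a."""
    return {alpha[:-1] for alpha in L if alpha.endswith(a)}
```

Strings not ending in a have no ⊢_a-successor and simply drop out of the image.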
The relation product ⇒*_r̄ ⊢, a useful composition that is suggested by Lemma 3.6, is used extensively in what follows. Formally, for α, β ∈ V*, α ⇒*_r̄ ⊢ β holds in G if and only if α ⇒*_r̄ βa ⊢ β holds in G for some a ∈ T; this latter expression is usually written as α (⇒*_r̄ ⊢)_a β.
For clarity, we describe inductively the notation that we will employ for exploiting the reflexive-transitive closure of (⇒*_r̄ ⊢); similar conventions are applied to other relation products that are introduced later. For all α ∈ V*, α (⇒*_r̄ ⊢)^ε α holds in G; for α, β, γ ∈ V*, y ∈ T^{n−1} with n ≥ 1, and a ∈ T, if α (⇒*_r̄ ⊢)^y β and β (⇒*_r̄ ⊢)_a γ hold in G, then α (⇒*_r̄ ⊢)^{ay} γ holds in G. The order of ay in the latter expression reflects the fact that the terminal symbols of a string are generated by ⇒_r̄ and chopped by ⊢ from right to left. Finally, if α (⇒*_r̄ ⊢)^y β holds in G for some α, β ∈ V* and y ∈ T^n with n ≥ 0, then for convenience we may instead write this expression as α (⇒*_r̄ ⊢)^y β, α (⇒*_r̄ ⊢)* β, or α (⇒*_r̄ ⊢)^n β according to whether or not the string y or its length n is relevant.
Right Sentential Forms Revisited
Next we investigate how arbitrary rightmost derivations are mimicked by the ⇒_r̄ and ⊢ relations. In short, a rightmost derivation is represented as a sequence of strong rightmost derivations interspersed with chops of terminal symbols. As a result of this analysis, the precise manner in which right sentential forms and sentences are generated by the two new relations is revealed.
pushed onto Succ_Stack to record that we want to find the successors of q which are located at the ends of paths of length len(ω) from q; moreover, when each of these states is found, a transition on A will be made to it from an appropriate state in Q_i.
(11) The Succ_Stack is not empty, so one of its entries is processed.
(12) An item (r, A, d) is removed from Succ_Stack.
(13-16) If d > 0, then the stage in the traversal of GR that is recorded by (r, A, d) has not progressed far enough. Let entry(ψ(r)) = X. Then every transition out of r is on X. For each state r′ ∈ Q such that (r, X, r′) is a transition in GR, (r′, A, d−1) is pushed onto Succ_Stack. By effectively moving to r′, the length of the traversal has been increased by 1. Consequently, the distance remaining is decreased by 1.
(17-25) If d = 0, then r ∈ succ(q, ω) for some q and ω referred to in lines 7-9. Lines 18-25 are identical to lines 35-42 of Figure 6.1.
The Complexity of Recognition
In this section, some worst-case complexity bounds are established for the General_LR0 recognizer. Specifically, we consider the amount of space and time required by General_LR0, in the worst case, when it is applied to G and w. In the following, it is convenient to assume that w ∈ L(G). In addition, the LR(0) automaton of G, M_C(G), is assumed to have m states.
Bounds on space requirements are derived first. They are useful in determining the time bounds. In both cases, bounds are established for arbitrary G and for arbitrary unambiguous G.
Space Bounds
The space complexity of General_LR0 is determined by placing an upper bound on the number of states and transitions in GR at the point when w is accepted. The sizes of the auxiliary data structures, i.e., Q_subset, Q_subset′, and Succ_Stack, are accounted for later.
in G, this loop determines an appropriate parse annotation to associate with the nullable suffix β. Thus, p is readied for any reductions that are initiated from it in the for loop at line 35 of the Reduce function. Note that at this point none of the states in Q_subset′ have had reductions made from them yet.
(83) Each state in Q_subset′ is considered in turn. No new states are added to Q_subset′ within the loop.
(84) A state q is removed from Q_subset′.
(85) For each A → αX•β ∈ ψ(q) such that β ⇒* ε holds in G, we want to associate a parse annotation with the nullable suffix β. This becomes the parse annotation [π_1] that is referred to in lines 36-37 of the Reduce function.
(86-87) If β = ε, then the appropriate parse annotation to associate with β is [].
(88-91) Otherwise, β = B_1 B_2 ... B_m for some B_j ∈ N and m ≥ 1. Due to the processing done in the second while loop, there is a path (q_m, q_{m−1}, ..., q_1, q) in GR which spells B_m B_{m−1} ... B_1. Let (q_j, B_j, q_{j−1}, [π_j]) ∈ δ, m ≥ j ≥ 2, and (q_1, B_1, q, [π_1]) ∈ δ, for some parse annotations π_j, be the transitions in that path. Then the appropriate parse annotation to associate with β in this case is [&π_1, &π_2, ..., &π_m].
The Complexity of Parsing
Worst-case complexity bounds for the General_LR0′ parser are easily derived from the complexity bounds of the recognizer. In the following, we assume that General_LR0′ is applied to G and w and that w ∈ L(G) holds. Space bounds are examined first.
The size of a parse annotation is bounded by some constant, e.g., the length p of the longest production right-hand side. If, as assumed, ambiguities are resolved when they are first detected, only one parse annotation is ever attached to a given transition in GR. Thus, the space complexity of the parser is the same as the space complexity of the recognizer. That is, the space complexity of General_LR0′ is O(n^2) if G is arbitrary, or unambiguous but otherwise arbitrary, and it is O(n) if G is LR(k) and k-symbol lookahead is employed.
An algorithm that is a hybrid of the Cocke-Younger-Kasami and Earley algorithms is described by Graham et al. [19,20]. This algorithm also accommodates arbitrary grammars. Like the Cocke-Younger-Kasami algorithm, an n×n parse matrix is constructed. However, the matrix positions are filled with sets of LR items instead of sets of nonterminals. Practical issues are discussed in detail and claims are made that more efficient implementations are attainable than are allowed by Earley's algorithm. Subcubic versions based on matrix multiplication techniques are also described.
The class of LR(k) grammars was introduced by Knuth in the seminal paper on LR parsing theory [27]. Knuth described a method for constructing a deterministic parser for an LR(k) grammar, observed that the set of viable prefixes of an arbitrary grammar is a regular language, and proved that it is undecidable whether an arbitrary grammar is LR(k) for free k ≥ 0. The discovery of LR(k) grammars was quite significant in light of their relationship to deterministic context-free languages [16].
Knuth's technique for parser construction is generally deemed impractical due to the enormous number of parse states that can result. The SLR(k) [12] and LALR(k) [11,29] grammars define two important subclasses of the LR(k) grammars which allow this problem to be addressed satisfactorily. Relatively compact LR parsers for grammars in these subclasses can be constructed efficiently.
Tomita's algorithm [42,43] extends the conventional LR parsing algorithm to use parse tables that contain multiply-defined entries. Conflicting parse actions are handled by employing a graph-structured stack to keep track of the different parse histories. However, some grammars cause the stack to grow without bound in instances where no input is consumed, so the algorithm is not general. Tomita's algorithm is discussed in greater detail later.
The application of Tomita's algorithm to a system which supports the incremental generation of parsers is reported by Heering et al. [23]. Specifically, Tomita's algorithm is adapted to work with an incrementally generated LR(0) automaton. The states of the auto-
Let x, y, and z be arbitrary strings over Σ and let w = xyz. Then x is a prefix of w, y is a substring of w, and z is a suffix of w. If 0 ≤ len(x) < len(w) holds, then x is a proper prefix of w; similarly, if 0 ≤ len(z) < len(w) holds, then z is a proper suffix of w. We define PREFIX(x) = {y ∈ Σ* | x = yz for some z ∈ Σ*} and SUFFIX(x) = {z ∈ Σ* | x = yz for some y ∈ Σ*}. If k is a natural number, then k:x (resp. x:k) denotes the unique prefix (resp. suffix) of x of length min{len(x), k}. This notation is extended to languages as follows. For L ⊆ Σ*, PREFIX(L) = ∪_{x∈L} PREFIX(x), SUFFIX(L) = ∪_{x∈L} SUFFIX(x), k:L = {k:x | x ∈ L}, and L:k = {x:k | x ∈ L}.
The reversal of a string x ∈ Σ*, denoted by x^R, is defined recursively as follows: ε^R = ε; ∀a ∈ Σ, a^R = a; ∀x, y ∈ Σ*, (xy)^R = y^R x^R. Similarly, the reversal of a language L is defined by L^R = {x^R | x ∈ L}.
Context-Free Grammars and Languages
A (context-free) grammar is denoted by G = (V, T, P, S) where V is an alphabet
known as the vocabulary of G, T ⊆ V and N = V \ T are the terminal and nonterminal
alphabets, respectively, P ⊆ N × V* is the finite set of productions, and S ∈ N is the start
symbol. The following conventions are generally adhered to: a, b, c, d ∈ T; w, x, y, z ∈ T*;
A, B, C, S ∈ N; X, Y, Z ∈ V. In addition, lowercase Greek letters denote strings in V*. An
arbitrary grammar G is assumed throughout the rest of this section.
A production (A, ω) ∈ P is written as A → ω; A and ω are called the left-hand side and the right-hand
side of the production, respectively. A group of productions that share the same left-hand
side, viz., A → ω_1, A → ω_2, ..., A → ω_n, n ≥ 1, may be abbreviated as
A → ω_1 | ω_2 | ··· | ω_n. A production with a right-hand side of ε is called a null production
or ε-production.
It is common to specify a grammar by listing only its productions. In this case, the
left-hand side of the first production or production group in the list is taken to be the start
symbol. The nonterminal and terminal alphabets can be inferred from the productions.
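This convention can be illustrated with a short sketch. The Python below (our illustration; the Grammar class and helper function are hypothetical, not from the text) infers S, N, and T from a listed production set:

```python
# Sketch of the convention above: a grammar is given by listing its
# productions; the start symbol is the left-hand side of the first
# production, and the alphabets are inferred.

from dataclasses import dataclass

@dataclass
class Grammar:
    V: set        # vocabulary
    T: set        # terminal alphabet
    N: set        # nonterminal alphabet
    P: list       # productions as (A, omega) pairs, omega a tuple of symbols
    S: str        # start symbol

def grammar_from_productions(prods: list) -> Grammar:
    S = prods[0][0]                          # first left-hand side
    N = {A for A, _ in prods}                # nonterminals: all left-hand sides
    V = N | {X for _, omega in prods for X in omega}
    return Grammar(V=V, T=V - N, N=N, P=prods, S=S)

# S -> a S b | e  (the second right-hand side is empty)
G = grammar_from_productions([("S", ("a", "S", "b")), ("S", ())])
assert G.S == "S" and G.T == {"a", "b"} and G.N == {"S"}
```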
Case (i): w ∈ SUFFIX(G). In this case, PVPrr(G, w:i) and VPrr(G, w:i) are nonempty for
all i, 0 ≤ i ≤ len(w), so the for loop completes len(w) iterations. Since w ∉ L(G) by assumption,
ε ∉ VPrr(G, w). Therefore, w is rejected by General_RR in the second if statement.
Case (ii): w ∉ SUFFIX(G). Since w ≠ ε, w = xay for some x, y ∈ T* and a ∈ T such that
y ∈ SUFFIX(G), but ay ∉ SUFFIX(G). Let len(y) = m and note that 0 ≤ m < len(w) must
hold. PVPrr(G, y:i) and VPrr(G, y:i) are nonempty for all i, 0 ≤ i ≤ m, so the for loop
completes m iterations. During the (m+1)st iteration, PVPrr(G, ay) = ∅ is computed.
Therefore, w is rejected by General_RR in the first if statement.
Regularity Properties
Certain regularity properties that are inherent to all context-free grammars are
exploited by General_RR. Specifically, for an arbitrary string z ∈ T*, PVPrr(G, z) and
VPrr(G, z) are regular languages. This fact is proven in this section. Toward that end, some
known theoretical results, including one which is rather obscure, are cited below. Since
proofs of these results are not replicated here, the proofs that follow are quite brief.
A type of formal rewriting system known as a regular canonical system is defined by C
= (Σ, Π) where Σ is an alphabet and Π is a finite set of (rewriting) rules [21,30,37]. Each rule
in Π takes the form ξα → ξβ where α, β ∈ Σ* and ξ denotes an arbitrary string over Σ, i.e.,
a variable. The form of a rule indicates that the left-hand side may be rewritten to its
corresponding right-hand side only at the extreme right end of a string. Thus, much like
R-derives, the C-derives relation induced on Σ* by Π is defined by ⇒C = {(γα, γβ) | γ ∈ Σ*,
ξα → ξβ ∈ Π}. Given two languages L_1, L_2 ⊆ Σ*, define r(L_1, C, L_2) = {δ ∈ Σ* | γ_1 ⇒C* γ_2δ holds
in C for some γ_1 ∈ L_1 and γ_2 ∈ L_2}.
A key result from the literature relevant to regular canonical systems is the following.
Fact 3.1 Let C = (Σ, Π) be a regular canonical system and let L_1 and L_2 be regular
languages over Σ. Then r(L_1, C, L_2) is a regular language over Σ.
Proof. This is a restatement of Theorem 3 from Greibach [21].
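A single C-derives step is easy to realize concretely. The Python sketch below (ours, purely illustrative) rewrites a suffix α of the current string to β, mirroring the restriction that rewriting occurs only at the right end:

```python
# Sketch of one C-derives step of a regular canonical system.
# A rule (alpha, beta) rewrites a suffix alpha of the current string to beta.

def c_derive_step(s: str, rules: list) -> set:
    """All strings reachable from s in one =>C step."""
    out = set()
    for alpha, beta in rules:
        if s.endswith(alpha):
            # Replace the suffix alpha with beta; the prefix (the "variable"
            # part xi) is carried over unchanged.
            out.add(s[: len(s) - len(alpha)] + beta)
    return out

# One rule: rewrite the suffix "ab" to "X".
assert c_derive_step("aab", [("ab", "X")]) == {"aX"}
assert c_derive_step("ba", [("ab", "X")]) == set()
```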
dantly here as the result of an ambiguity. If the transition is indeed new, it is added to
subset; any relevant reductions from q_{j:i} are performed through this transition when it is
removed from subset in a later iteration of the while loop.
(46) The postcondition of Reduce holds at this point. To help establish this fact, a subset
of VP(G, i:w), denoted by VP′(G, i:w), is defined as follows: (1) for i = 0, VP′(G, 0:w) =
PVP(G, 0:w); (2) for 0 < i ≤ n, VP′(G, i:w) is the set of strings αX such that α ∈ VP′(G, j:w)
for some j, 0 ≤ j < i, and X ⇒* a_{j+1}a_{j+2}···a_i holds in G. For 0 ≤ i ≤ n, VP′(G, i:w) ⊆ VP(G, i:w)
clearly holds. The states and transitions added to GR directly by Reduce ensure that
VP′(G, i:w) ⊆ L(MR) holds. The contribution that Traverse makes to the transformation of
GR can be assessed by noting that VP(G, i:w) = {αβ ∈ VP(G) | α ∈ VP′(G, i:w), β ⇒R* ε holds
in G}. The Traverse function creates any additional states in Q_i and transitions among
those states so that VP(G, i:w) \ VP′(G, i:w) ⊆ L(MR) also holds. Together, the Reduce and
Traverse functions guarantee that L(MR) = VP(G, i:w).
(30,37,47) Traverse deals solely with nullable nonterminals and productions with nullable
right-hand sides. In lines 30 and 37, Traverse is called with a nonempty subset of Q_i as
an argument which becomes associated with the set variable called Q_subset. Traverse has
the effect of transforming GR as if all sequences of reductions by productions that have nullable
right-hand sides are carried out from the states in Q_subset. However, a transformation
of GR that produces the same result can be derived from a simple traversal of MC. By
adopting this alternative approach, complications that can arise due to cycles in GR are
avoided. Consider the states I_k ∈ I such that ψ(q) = I_k for some q ∈ Q_subset and traverse
MC beginning from these states along all transitions that are made on nullable nonterminals.
The states and transitions encountered in this traversal are exactly those which would arise
from performing the reduction sequences described above. Consequently, counterparts for all
of the states and transitions encountered in this traversal are created in GR. Thus, a particular
subgraph of MC is effectively embedded in Q_i by this process. The specific subgraph is
determined by the composition of Q_subset when Traverse is called.
VI A GENERAL BOTTOM-UP RECOGNIZER 60
Control Automata and Recognition Graphs 60
The General_LR0 Recognizer 62
Earley's Algorithm Revisited 71
Implementation Considerations 75
The Complexity of Recognition 81
On Garbage Collection and Lookahead 84
Discussion 87
VII A GENERAL BOTTOM-UP PARSER 91
From Recognition to Parsing 91
The General_LR0 Parser 97
The Complexity of Parsing 102
Garbage Collection Revisited 103
Discussion 104
VIII CONCLUSION 107
Summary of Main Results 107
Directions for Future Research 109
REFERENCES 111
BIOGRAPHICAL SKETCH 114
A version of the Cocke-Younger-Kasami algorithm that is restricted to unambiguous
grammars is presented by Kasami and Torii [25]. The time and space bounds of this algorithm
are both O(n² log n). Another version which employs linked lists in place of the parse
matrix is described by Manacher [32]. This alternate storage discipline allows unambiguous
grammars to be recognized in quadratic time, a marked improvement over the corresponding
cubic bound of the original algorithm.
The Cocke-Younger-Kasami algorithm was reduced to matrix multiplication by Valiant
[44]. Using this result, Strassen's technique for multiplying matrices [1] is applied to obtain
an asymptotic worst-case time complexity of O(n^2.81) for general recognition.¹ Due to the
overhead associated with this method, it is primarily of theoretical interest.
In contrast to the Cocke-Younger-Kasami algorithm, Earley's algorithm [6,13,14] can
process any grammar. Like LR parsers, Earley's algorithm is based on sets of items.
Although its worst-case time and space bounds are also O(n³) and O(n²), respectively, it
performs significantly better on large classes of grammars. Specifically, unambiguous grammars
are parsed in O(n²) time, and only O(n) time is needed to parse LR(k) grammars provided
that k-symbol lookahead is used in the latter case. Earley's algorithm is examined
further in later chapters.
Efficiency improvements that may be gained by employing LL- and LR-like lookahead²
in Earley's algorithm are reported by Bouckaert et al. [9]. They concluded that FIRST sets
are more useful than FOLLOW sets for reducing the number of superfluous items generated
during recognition. In short, FIRST (resp. FOLLOW) information reduces the number of
items generated by Earley's Predictor (resp. Completer) operation. See Christopher et al.
[10] for an example of an application of Earley's algorithm; specifically, it is used to generate
optimized code in a Graham-Glanville style code generator [17]. If desired, Earley's algorithm
may be extended to include error recovery [3,31].
1 Even faster techniques for matrix multiplication have been developed since.
2 That is, FIRST and FOLLOW sets, respectively.
The last part of this work (Chapters VI and VII) casts our approach to general recognition
and parsing into an automata-theoretic framework. First, a general recognizer is
described in considerable detail. The recognizer uses an automaton which accepts VP(G) to
guide the construction of an automaton that accepts VP(G, x), where x is some prefix of the
input string. For convenience, the description of the algorithm employs the LR(0) automaton
of G as the guiding automaton. However, the algorithm allows for a rather broad range
of VP(G)-accepting automata to be used instead. For example, employing the nondeterministic
LR(0) automaton of G as a controlling automaton yields a general recognizer which
works quite similarly to our graph-based Earley algorithm. Finally, this automata-based
recognizer is extended to a general parser. Means for representing parse forests and handling
ambiguity are described. The recognizer and parser are presented in enough detail to be
readily implemented. In anticipation of this, many practical issues are discussed.
Literature Review
A comprehensive introduction to formal languages and automata is presented by Hopcroft
and Ullman [24]. These two related disciplines are prerequisites to a study of context-free
recognition and parsing. An up-to-date monograph on parsing theory has been written
by Sippu and Soisalon-Soininen [39]. Two volumes by Aho and Ullman [6,7] contain a wealth
of information; numerous parsing algorithms are presented, both general and restricted,
along with much of the theory underlying them.
Some early general parsing algorithms are compared by Griffiths and Petrick [22]. All
of the algorithms surveyed rely on backtracking, so they run in O(cⁿ) time in the worst case
(n is the length of the input string).
Although it is restricted to Chomsky Normal Form grammars, the Cocke-Younger-Kasami
algorithm [6,19,46] is regarded as the first general parser to run in polynomial time
(O(n³)). The n×n parse matrix that the algorithm constructs accounts for an O(n²) space
complexity. Recall that the matrix entries are filled with sets of nonterminal symbols.
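As a concrete illustration of the algorithm just described, the following Python sketch (ours; the encoding of the CNF productions as two dictionaries is an assumption of this sketch) fills the parse matrix bottom-up:

```python
# A compact sketch of the Cocke-Younger-Kasami recognizer for a grammar in
# Chomsky Normal Form. M[i][j] holds the set of nonterminals deriving the
# substring w[i:i+j+1]; w is accepted iff the start symbol lands in M[0][n-1].

def cyk(w: str, unit: dict, binary: dict, S: str) -> bool:
    # unit maps a terminal a to the nonterminals with A -> a;
    # binary maps a pair (B, C) to the nonterminals with A -> B C.
    n = len(w)
    M = [[set() for _ in range(n)] for _ in range(n)]
    for i, a in enumerate(w):
        M[i][0] = set(unit.get(a, ()))
    for j in range(1, n):                 # substring length j + 1
        for i in range(n - j):            # start position
            for k in range(j):            # split point
                for B in M[i][k]:
                    for C in M[i + k + 1][j - k - 1]:
                        M[i][j] |= set(binary.get((B, C), ()))
    return S in M[0][n - 1]

# {a^m b^m | m >= 1} in CNF: S -> A T | A B, T -> S B, A -> a, B -> b
unit = {"a": {"A"}, "b": {"B"}}
binary = {("A", "T"): {"S"}, ("A", "B"): {"S"}, ("S", "B"): {"T"}}
assert cyk("aabb", unit, binary, "S") and not cyk("abb", unit, binary, "S")
```

The triply nested loop over (length, start, split) is the source of the cubic time bound, and the matrix M itself accounts for the quadratic space bound noted above.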
Lemma 3.11 For α ∈ V* and z ∈ T*, αz (⇒R* ⊣)^z ⇒R* α holds in G.
Proof. This is shown by an easy induction on n = len(z).
Basis (n = 0). Trivially, α (⇒R* ⊣)^ε ⇒R* α holds in G.
Induction (n > 0). Let z = ay for some a ∈ T and y ∈ T^{n−1}. By the induction hypothesis,
αay (⇒R* ⊣)^y ⇒R* αa holds in G. Observing that αa ⇒R* αa ⊣ α holds in G establishes
that αa (⇒R* ⊣)^a ⇒R* α also holds. It now follows from Lemma 3.10 that αay (⇒R* ⊣)^{ay} ⇒R* α
holds in G.
Lemma 3.12 For α, β ∈ V*, let α ⇒r* β hold in G. Furthermore, let β = γx for some
γ ∈ V* and x ∈ T* where γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest suffix
of β consisting solely of terminal symbols). Then α (⇒R* ⊣)^x ⇒R* γ holds in G.
Proof. The proof is by induction on the length n of a rightmost derivation of β from α.
Basis (n = 0). Thus, α ⇒r^0 β = α. Write α as γx for some γ ∈ V* and x ∈ T* where x is the
longest suffix of α contained in T*. In this case, α = γx (⇒R* ⊣)^x ⇒R* γ holds in G by Lemma
3.11.
Induction (n > 0). A rightmost derivation of β from α consisting of n steps is of the form
α ⇒r^{n−1} δAz ⇒r δωz = β for some δ ∈ V*, A → ω ∈ P, and z ∈ T*. By the induction hypothesis,
α (⇒R* ⊣)^z ⇒R* δA holds in G. Since δA ⇒R δω holds in G, α (⇒R* ⊣)^z ⇒R* δω also holds. Now
write δω as γy for some γ ∈ V* and y ∈ T* where y is the longest suffix of δω made up
entirely of terminal symbols. By Lemma 3.11, δω = γy (⇒R* ⊣)^y ⇒R* γ holds in G. It then follows
from Lemma 3.10 that α (⇒R* ⊣)^{yz} ⇒R* γ holds in G. Finally, we note that β = γyz where,
by construction, yz is the longest suffix of β that is comprised of only terminal symbols.
Theorem 3.13 SFr(G) = {γ ∈ V* | S (⇒R* ⊣)^z ⇒R* α holds in G for some α ∈ V* and
z ∈ T* such that γ = αz}.
Proof. Suppose that S (⇒R* ⊣)^z ⇒R* α holds in G for some α ∈ V* and z ∈ T*. By Lemma 3.9,
S ⇒r* αz also holds in G, so αz ∈ SFr(G). Conversely, suppose that S ⇒r* γ holds in G for
some γ ∈ V*. Let γ = αz for some α ∈ V* and z ∈ T* such that z is the longest suffix of γ
which is a terminal string. Then S (⇒R* ⊣)^z ⇒R* α holds in G by Lemma 3.12.
A ∈ W, A ∈ W_i if and only if i is the number of steps in a shortest derivation of ε from A.
Of course, only those subsets W_i for which W_i ≠ ∅ holds are of interest. For each A ∈ W,
define elength(A) = i if and only if A ∈ W_i. Thus, elength(A) denotes the length of a shortest
derivation of ε from A.
In addition, a unique production is associated with each A ∈ W; this production is
denoted by nuller(A). The intent is for nuller(A) to be used in the first step of any derivation
of ε from A, or rather the last step in the complementary bottom-up parse of ε. By
making use of nuller(A), ambiguous derivations of ε from A, if they are possible in G, are
disambiguated by Traverse. For each A ∈ W, nuller(A) is defined by the first of the following
two rules which applies.
(1) If A → ε ∈ P, then nuller(A) = A → ε.
(2) Otherwise, nuller(A) = A → B_1B_2 ··· B_m for some A → B_1B_2 ··· B_m ∈ P,
m ≥ 1, such that elength(A) = 1 + Σ_{1≤j≤m} elength(B_j).
For each A ∈ W_i, there is a derivation of ε from A consisting of i steps in which the first step
is an application of nuller(A). If nuller(A) is determined by rule (2) above, then more than
one production may apply. In this case, an arbitrary choice can be made. Alternatively,
some criteria may be applied toward making this choice more purposeful, e.g., that which
minimizes m or the height of the resulting subparse tree.
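The layered definition of elength and nuller suggests a simple fixed-point computation. The sketch below (ours; the representation of productions as (A, ω) pairs is assumed) computes both tables at once; keeping only the minimal candidate length automatically realizes the preference for rule (1), since an ε-production always yields the candidate 1:

```python
# Sketch of computing elength and nuller by the layered definition above:
# W_1 holds the A with A -> e, and each further layer is built from shorter ones.

def elength_and_nuller(P: list):
    elength, nuller = {}, {}
    changed = True
    while changed:
        changed = False
        for A, omega in P:
            if any(X not in elength for X in omega):
                continue              # some right-hand-side symbol not yet known nullable
            cand = 1 + sum(elength[X] for X in omega)
            if A not in elength or cand < elength[A]:
                elength[A], nuller[A] = cand, (A, omega)
                changed = True
    return elength, nuller

# A -> B B | e,  B -> A
P = [("A", ("B", "B")), ("A", ()), ("B", ("A",))]
el, nu = elength_and_nuller(P)
assert el == {"A": 1, "B": 2}
assert nu["A"] == ("A", ())       # rule (1) wins for A
```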
Before concluding this section, some motivation for disambiguating all derivations of ε
is provided. Suppose that A ∈ W derives ε in more than one way. Then if some derivation
of ε from A is a segment of a parse for the input string, then any derivation of ε from A may
be substituted for this segment. In particular, this substitution may be made independently
of the context in which the segment occurs in the complete parse. If one derivation of ε from
A is preferred in a given context, either the grammar must be modified to account for this or
else the favored derivation must be specified by some context-sensitive means. Since
context-sensitive extensions to context-free grammars are beyond the scope of this work, we
choose to disambiguate all parses of ε so as to minimize derivation lengths.
object is an element of a set. The other operation is that of adding an object to a set.
Efficient means for implementing these operations with respect to both Q and δ are
described below.
The operations on Q are considered first. We assume that the states in Q_i are stored
on a separate linked list for each value of i. Thus, whether or not q_{j:i} exists in Q can be
determined by scanning a list of at most m items. A state is added to Q by simply linking it
into the appropriate list. Thus, both set operations of interest can be performed with respect
to Q in constant time.
Membership in Q can be resolved faster using the following scheme. A boolean flag is
associated with each state in MC. The flags are reset to false at the beginning of each iteration
of the main for loop in General_LR0. When a state q is added to Q by either Reduce
or Shift in the ith iteration, 0 ≤ i ≤ n, the flag associated with ψ(q) is set to true. In this
way, the membership of q in Q can be determined during the ith iteration by testing the flag
associated with ψ(q).
The overhead associated with resetting m boolean flags each time through the loop can
be avoided by using integer flags instead. The flags are initialized to −1. When a state q is
added to Q in the ith iteration, 0 ≤ i ≤ n, the flag associated with ψ(q) is set to i. The
membership of q in Q is resolved during the ith iteration by comparing i with the value of
the flag for ψ(q). If the flag's value is less than i, then q ∉ Q. Otherwise, the flag's value is
equal to i and q ∈ Q.
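The integer-flag scheme can be stated in a few lines. The following Python sketch (ours; the class and method names are hypothetical) shows why no per-iteration reset is needed:

```python
# Sketch of the integer-flag membership test described above. flag[s]
# records the last iteration in which a state mapping to LR(0) state s
# was added to Q; stale flags are simply outdated, never reset.

class StateSet:
    def __init__(self, m: int):
        self.flag = [-1] * m          # one flag per LR(0) state, initialized to -1

    def add(self, s: int, i: int):
        """Record that a state q with psi(q) = s was added to Q in iteration i."""
        self.flag[s] = i

    def contains(self, s: int, i: int) -> bool:
        """During iteration i, q is in Q iff its flag equals i."""
        return self.flag[s] == i

Q = StateSet(m=4)
Q.add(2, i=0)
assert Q.contains(2, i=0)
assert not Q.contains(2, i=1)     # stale flag from iteration 0; no reset required
```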
Managing the transition set δ is slightly more involved. We assume that all of the transitions
out of q_{j:i}, with 0 ≤ i ≤ n, are stored on a linked list associated with
q_{j:i}. Thus, a new transition out of q_{j:i} can simply be linked into this list. However, this list
may contain O(i+1) items, so it can be costly to scan the list in search of a transition. An
efficient method for resolving membership with respect to δ is described as follows. Note
that we need only be concerned with transitions on nonterminals since transitions on terminals
are never generated redundantly. Thus, we assume that entry(I_j) = A for some A ∈ N.
left-recursive if at least one of its nonterminals is left-recursive. Finally, X ∈ V is nullable in
G if and only if X ⇒* ε holds in G.
Top-Down Right-to-Left Recognition
A general top-down recognition scheme that scans the input string from right to left is
formally developed next.² This scheme is based on two binary relations on V*. Through
these two fundamental relations, a set-theoretic characterization of general top-down right-to-left
recognition which succinctly captures the essence of the task is derived.
In concert, the two relations refine and supplant the r-derives relation. Certain regularity
properties of context-free grammars that are central to our treatment of recognition are
characterized directly and rather elegantly by the two relations; by comparison, a description
of these properties in terms of r-derives is indirect and somewhat awkward. It is in this
sense that the two relations refine the r-derives relation. Moreover, the two relations provide
alternate definitions of the right sentential forms and sentences of a grammar. In that
respect, the r-derives relation is supplanted by them.
Strong Rightmost Derivations
The strong rightmost derives relation (⇒R) is defined by ⇒R = {(αA, αω) | α ∈ V*,
A → ω ∈ P}. Thus, ⇒R is a subrelation of ⇒r, with domain V*N. For brevity, the strong
rightmost derives relation is called the R-derives relation.
Strong rightmost derivations are defined in terms of the reflexive-transitive closure of
⇒R. Thus, every strong rightmost derivation is also a rightmost derivation. The following
series of lemmas compares some elementary properties of rightmost and strong rightmost
derivations.
Lemma 3.1 For α, β ∈ V*, if α ⇒R β holds in G, then α ⇒r β holds in G.
Proof. This follows directly from the fact that ⇒R is a subrelation of ⇒r.
² For the moment, we ignore the fact that a right-to-left scan of the input is not particularly useful
in practice.
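The R-derives relation is readily computed. The sketch below (ours, purely illustrative) enumerates the one-step ⇒R successors of a string, rewriting only a nonterminal that stands at the extreme right end:

```python
# Sketch: one-step successors of a string under the R-derives relation
# =>R = {(alpha A, alpha omega) | A -> omega in P}. Only a string ending
# in a nonterminal has successors; its rightmost symbol is rewritten.

def r_derive_step(s: tuple, P: list, N: set) -> set:
    if not s or s[-1] not in N:
        return set()                  # the domain of =>R is V*N
    A, alpha = s[-1], s[:-1]
    return {alpha + omega for B, omega in P if B == A}

# S -> a S b | e
P = [("S", ("a", "S", "b")), ("S", ())]
assert r_derive_step(("a", "S"), P, {"S"}) == {("a", "a", "S", "b"), ("a",)}
assert r_derive_step(("a",), P, {"S"}) == set()   # ends in a terminal
```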
tions in GR prior to the ith call of Shift. Thus, the procedure outlined above may be performed
in O((i+1)²) time. Observe that this is no worse than the worst-case time complexity
of the Reduce function.
In practice, one would probably want to perform garbage collection less frequently than on
every input symbol. Regardless, a similar procedure involving two graph traversals would
still apply. The first traversal begins from certain states in the most recently completed
state subset Q_i and marks all states reached in the process. In the second traversal, all
unmarked states and their outgoing transitions are deleted from the recognition graph.
The basic goal of garbage collection is to contract periodically the size of the recognition
graph. As a consequence, space taken up by nonessential states and transitions becomes
eligible for reuse. In contrast, the aim of lookahead is to anticipate the states and transitions
that are necessary to recognize the input string. In short, lookahead is used within Shift,
Reduce, and Traverse to selectively generate those states and transitions that are consistent
with the current lookahead string.
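The two traversals can be sketched directly. The Python below (ours; the adjacency-list representation of the recognition graph is an assumption of this sketch) marks every state reachable from the live states and then sweeps the rest:

```python
# Sketch of the two-pass garbage collection described above: mark every
# state reachable from the live states in Q_i along stored transitions,
# then delete unmarked states and their outgoing transitions.

def collect_garbage(states: set, trans: dict, live: set):
    """trans maps a state to a list of (symbol, target) transitions."""
    marked, work = set(live), list(live)
    while work:                            # traversal 1: mark
        q = work.pop()
        for _, r in trans.get(q, ()):
            if r not in marked:
                marked.add(r)
                work.append(r)
    for q in states - marked:              # traversal 2: sweep
        trans.pop(q, None)
    states &= marked                       # in-place contraction of the state set

states = {"q0", "q1", "q2"}
trans = {"q2": [("a", "q0")], "q1": [("b", "q0")]}
collect_garbage(states, trans, live={"q2"})
assert states == {"q0", "q2"} and "q1" not in trans
```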
In order to make use of lookahead, the items in the control automaton are attributed
with appropriate lookahead strings. The literature on the computation and use of lookahead
in the context of LR parsers is quite extensive. The type of lookahead typically used in conjunction
with LR(0) automata is either SLR(k) lookahead [12] or LALR(k) lookahead
[8,11,29].⁴ Without going into detail, the use of k-symbol lookahead in General_LR0⁵ for
some k > 0 impacts the following locations in Figure 6.1.
(Line 19) Q_subset is computed to contain only those states q ∈ Q_i such that the shift
on a_{i+1} from ψ(q) is consistent with the lookahead string.
(33) Only those reductions are initiated from p that are consistent with the current k-symbol
lookahead. This comment also applies to line 8 in Figure 6.3.
(50) Transitions on nullable nonterminal symbols are selectively made based on their
consistency with the k-symbol lookahead string.
⁴ Almost invariably, k = 1.
⁵ This is somewhat of a misnomer when lookahead is employed.
Case (ii): α ⇒r* δaA ⇒r δa for some δ ∈ V* and A → ε ∈ P. Similar to Case (i), α ⇒R* δaA and
δaA ⇒R δa both hold in G. Now we let γ = δ to conclude that α ⇒R* γa holds in G.
We have demonstrated in both cases that α ⇒R* γa holds in G for some γ ∈ V*.
Lemma 3.4 For A ∈ N and X ∈ V, X is right-reachable from A in G if and only if
A ⇒R* αX holds in G for some α ∈ V*.
Proof. If X is right-reachable from A in G, then A ⇒r* βX holds in G for some β ∈ V*. If
X ∈ N, then A ⇒R* βX also holds in G by Lemma 3.2. If X ∈ T, then Lemma 3.3 applies,
i.e., A ⇒R* αX holds in G for some α ∈ V*. Conversely, suppose that A ⇒R* αX holds in G
for some α ∈ V*. It follows directly from Lemma 3.1 that X is right-reachable from A.
Corollary For A ∈ N, A is right-recursive in G if and only if A ⇒R* αA holds in G for
some α ∈ V*.
Lemma 3.5 For X ∈ V, X is nullable in G if and only if X ⇒R* ε holds in G.
Proof. If X ∈ T, X is not nullable in G and X ⇒R* ε does not hold in G. Now suppose that
X ∈ N. If X is nullable in G, then every rightmost derivation which demonstrates this must
be of the form X ⇒r* A ⇒r ε for some A → ε ∈ P. From Lemma 3.2 and the fact that A ⇒R ε
holds in G, we conclude that X ⇒R* ε holds in G. Conversely, X ⇒R* ε immediately implies
that X is nullable in G since ⇒R is a subrelation of ⇒r. □
Corollary For γ ∈ V*, γ is nullable in G if and only if γ ⇒R* ε holds in G.
One final lemma is presented before introducing the companion relation to ⇒R. The
lemma is useful for motivating this second relation.
Lemma 3.6 For α ∈ V*, at least one of the following two statements is true: (1)
α ⇒R* βa holds in G for some β ∈ V* and a ∈ T; (2) α ⇒R* ε holds in G.
Proof. If α = ε, then statement (2) holds trivially. Now suppose that α ≠ ε. Since G is
reduced, α ⇒r* x holds in G for some x ∈ T*. If x = ε, then statement (2) again holds from
the corollary to Lemma 3.5. Otherwise, x = ya for some y ∈ T* and a ∈ T. By Lemma 3.3,
it now follows that α ⇒R* βa holds in G for some β ∈ V*.
Lemma 3.16 For α, β ∈ V*, if α ⊣ β holds in G and α is a viable prefix of G, then β is a
viable prefix of G.
Proof. From the hypothesis, α = βa for some a ∈ T. Conventional definitions of viable
prefixes [5] prescribe that every prefix of a viable prefix of G is also a viable prefix of G.
However, this property is not immediate from the definition that we have adopted. A proof
that this property does hold for our definition is provided by Sippu and Soisalon-Soininen [38].
The essence of their argument is based on the existence of a rightmost derivation of the form
S ⇒r* δAz ⇒r δσaτz = βaτz for some δ ∈ V*, A → σaτ ∈ P, and z ∈ T*. This derivation form
demonstrates that both βa = α and β are viable prefixes of G.
Lemma 3.17 For γ ∈ V*, if ω (⇒R* ⊣)^z γ holds in G for some S → ω ∈ P and z ∈ T*, then
γ is a viable prefix of G.
Proof. The proof is by induction on n = len(z).
Basis (n = 0). In this case, z = ε. By assumption, ω (⇒R* ⊣)^ε γ holds in G for some S → ω ∈ P.
Then γ must equal ω, which is a viable prefix of G since S ⇒r ω holds in G.
Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^{n−1}. Assume that
ω (⇒R* ⊣)^{ay} γ holds in G. Then ω (⇒R* ⊣)^y β (⇒R* ⊣)^a γ holds in G for some β ∈ V*. By the
induction hypothesis, β ∈ VP(G). Now β (⇒R* ⊣)^a γ implies that β ⇒R* γa ⊣ γ holds in G. It follows
from Lemmas 3.15 and 3.16 that γa and γ are also viable prefixes of G.
Lemma 3.18 For γ ∈ V*, if ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S → ω ∈ P and z ∈ T*,
then γ is a viable prefix of G.
Proof. Assume that ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S → ω ∈ P and z ∈ T*. This implies
that ω (⇒R* ⊣)^z β ⇒R* γ holds in G for some β ∈ V*. By Lemma 3.17, β is a viable prefix of G.
Thus, γ is also in VP(G) by Lemma 3.15.
Lemma 3.19 For γ ∈ V*, if γ is a viable prefix of G, then ω (⇒R* ⊣)^z ⇒R* γ holds in G for
some S → ω ∈ P and z ∈ T*.
Proof. By assumption, γ ∈ VP(G), so S ⇒r* δAy ⇒r δαβy holds in G for some
δ ∈ V*, A → αβ ∈ P, and y ∈ T* where γ = δα. From the proof of Lemma 3.12, S (⇒R* ⊣)^y ⇒R* γβ holds in G.
Lemma 3.2 For α, β ∈ V* and A ∈ N, if α ⇒r* βA holds in G, then α ⇒R* βA holds in
G.
Proof. Let n represent the length of a rightmost derivation of βA from α. By induction on
n, we show that there exists an identical n-step strong rightmost derivation of βA from α.
Basis (n = 0). Assume that α ⇒r^0 βA holds in G. This implies that α = βA, since ⇒r^0 is
equivalent to the identity relation on V*. Since ⇒R^0 is also equivalent to the identity relation
on V*, α ⇒R^0 βA also holds in G.
Induction (n > 0). By assumption, α ⇒r^n βA holds in G. The last step in a particular n-step
derivation of βA from α can take two distinct forms. These are analyzed in the following
two cases.
Case (i): α ⇒r^{n−1} γB ⇒r γδA = βA for some γ ∈ V* and B → δA ∈ P. By the induction
hypothesis, α ⇒R^{n−1} γB holds in G. Since γB ⇒R γδA holds in G by definition, we conclude
that α ⇒R^n βA holds in G.
Case (ii): α ⇒r^{n−1} βAB ⇒r βA for some B → ε ∈ P. By the induction hypothesis, α ⇒R^{n−1} βAB
holds in G. Thus, α ⇒R^n βA also holds in G since βAB ⇒R βA holds.
In both cases, we have shown that α ⇒R^n βA holds in G.
Lemma 3.3 For α ∈ V* and a ∈ T, if α ⇒r* βa holds in G for some β ∈ V*, then
α ⇒R* γa holds in G for some γ ∈ V*.
Proof. Assume that α ⇒r* βa holds in G for some β ∈ V*. If α = γa for some γ ∈ V*, then
α ⇒R^0 γa = α trivially holds in G. Otherwise, suppose that α does not end with a. In this
case, every rightmost derivation of βa from α is nontrivial. We analyze one such rightmost
derivation and focus on the step that causes a to become the rightmost symbol in a string
occurring in that derivation. The initial segment of the derivation up to and including this
step can take two distinct forms.
Case (i): α ⇒r* δA ⇒r δσa for some δ ∈ V* and A → σa ∈ P. By Lemma 3.2, α ⇒R* δA holds in
G. By definition, δA ⇒R δσa holds in G. Thus, α ⇒R* γa holds in G when we let γ = δσ.
1.  function Reduce(i)  // Revised to implement the succ function.
2.      subset := δ_i
3.      Traverse(Q_i, i)
4.      Succ_Stack := ∅
5.      while Succ_Stack ≠ ∅ or subset ≠ ∅ do
6.          if Succ_Stack = ∅ then
7.              (p, X, q) := Remove(subset)
8.              for A → αX·β ∈ ψ(p) such that β ⇒* ε do
9.                  Push(Succ_Stack, (q, A, len(α)))
10.             od
11.         else  // Succ_Stack ≠ ∅
12.             (r, A, d) := Pop(Succ_Stack)
13.             if d > 0 then  // Let entry(ψ(r)) = X.
14.                 for r′ ∈ Q such that (r, X, r′) ∈ δ do
15.                     Push(Succ_Stack, (r′, A, d − 1))
16.                 od
17.             else  // d = 0; let goto(ψ(r), A) = I_j.
18.                 if q_{j:i} ∉ Q then
19.                     Q := Q ∪ {q_{j:i}}
20.                     Traverse({q_{j:i}}, i)
21.                 fi
22.                 if (q_{j:i}, A, r) ∉ δ then
23.                     δ := δ ∪ {(q_{j:i}, A, r)}
24.                     subset := subset ∪ {(q_{j:i}, A, r)}
25.                 fi
26.             fi
27.         fi
28.     od
29. end
Figure 6.3 A Modified Reduce Function
(5) This while loop corresponds to the while loop at line 31 in Figure 6.1. However,
in this case there are two collections to exhaust before the loop terminates.
(6) The true branch of the if statement deals with items in subset and the false
branch deals with items in Succ_Stack. The if predicate is written so that items in
Succ_Stack have priority over items in subset. Clearly the predicate is true in the first
iteration of the while loop.
(7-8) These two lines are the same as lines 32-33 of Figure 6.1.
(9) Instead of invoking the succ function as in line 34 of Figure 6.1, we initiate the
graph traversal of GR that is implied by that use of succ. Specifically, (q, A, len(α)) is
(10-12) The postcondition of Reduce in line 10 is also a precondition of the Shift function.
A postcondition of the Shift function is given in line 12 and is similar to the loop invariant.
However, in this case the following situation holds for GR. A string γa_{i+1} ∈ V* is a
member of PVP(G, i+1:w) if and only if there is a path in GR from some state q ∈ Q_{i+1} to
q_{0:0} which spells a_{i+1}γ^R. Assuming that the precondition holds when Shift is called, the Shift
function transforms GR so that this postcondition holds.
(12-13) If Q_{i+1} = ∅ at this point, then MR has no final states. Thus, PVP(G, i+1:w) =
∅ and i+1:w ∉ PREFIX(G). Consequently, w ∉ L(G), so General_LR0 rejects w.
(15-16) Line 15 expresses a postcondition of the for loop. It holds upon completion of
the nth iteration (i.e., when i = n) provided that the postcondition of Shift and Q_{i+1} ≠ ∅ both
hold at the end of that iteration. In this case, w ∈ L(G), so General_LR0 accepts w.
Before continuing with the description of General_LR0, the following important properties
of LR(0) automata are reiterated. Let A → ω· ∈ I_j hold for some A → ω ∈ P with A ≠ S′
and I_j ∈ I. In addition, let δω be the spelling of an arbitrary path in MC from I_0 to I_j for
some δ ∈ V*. Then δω |= δA holds in G. Now let A → α·aβ ∈ I_j hold for some A → αaβ ∈ P
and I_j ∈ I, and let δα be the spelling of an arbitrary path in MC from I_0 to I_j. In this case
δα ⊢ δαa holds in G. Based on the manner in which GR is derived from MC, these two
equivalence properties (i.e., the equivalence of paths from I_0 to I_j with respect to reduce and
shift actions) are preserved in GR (i.e., all paths in GR from q_{0:0} to q_{j:i} are equivalent with
respect to shift and reduce actions). These equivalence properties are exploited by the Shift
and Reduce functions.
(11,18) The Shift function is called with i as an argument. This makes the relationship
between the values of i in General_LR0 and Shift explicit. The operation of the Shift function
during its ith invocation from General_LR0 is described for some i, 0 ≤ i ≤ n.
(19) At this point, we know that Q_i cannot be empty. Otherwise, the input string
would have been rejected in an earlier iteration of the main for loop. The ith call to Shift
computes the ⊢ relation.² Thus, we want to determine all states q ∈ Q_i for which there is
² It is important to remember that i ranges from 0 to n.
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
David C. Wilson
Professor of Mathematics
This dissertation was submitted to the Graduate Faculty of the College of Engineering
and to the Graduate School and was accepted as partial fulfillment of the requirements for
the degree of Doctor of Philosophy.
May 1990
Winfred M. Phillips
Dean, College of Engineering
Madelyn M. Lockhart
Dean, Graduate School
The next two lemmas demonstrate how reversed rightmost derivations are represented
by the |= and ⊢ relations.
Lemma 4.2 For α, β ∈ V* and x ∈ T*, if α (|=* ⊢)^x |=* β holds in G, then β ⇒r* αx holds
in G.
Proof. By Lemma 4.1, the hypothesis implies that β (⇒R* ⊣)^x ⇒R* α holds in G. It follows from
Lemma 3.9 that β ⇒r* αx holds in G.
Lemma 4.3 For α, β ∈ V*, let α ⇒r* β hold in G. Furthermore, let β = γx for some
γ ∈ V* and x ∈ T* such that γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest
suffix of β consisting solely of terminal symbols). Then γ (|=* ⊢)^x |=* α holds in G.
Proof. The hypothesis and its conditions imply that α (⇒R* ⊣)^x ⇒R* γ holds in G (see Lemma
3.12). Therefore, γ (|=* ⊢)^x |=* α holds in G by Lemma 4.1.
Lemma 4.4 L(G) = {w ∈ T* | ε (|=* ⊢)^w |=* S holds in G}.
Proof. This is a consequence of Lemmas 4.2 and 4.3.
The following connection is established between PREFIX(G) and the \= and  rela
tions.
Lemma 4.5 PREFEX(G) C (z G T*\ e(J=**)*ai holds in G for some aG F*}.
Proof. Let x GPREFEX(G) be arbitrary. The corollaries to Theorem 3.13 together with the
assumption that G is reduced yields that I)*=*j? e holds in G for some /?GF*. By
Lemma 4.1, e(J=*)also holds in G. Finally, this last expression implies that
e0=**)*ct(=*holds in G for some aGF*.
The set inclusion of the preceding lemma is almost invariably proper. For example,
consider the grammar with production set P = {5>a}. Although this grammar generates
{a}, e()=**)*, a holds for all i >0. In fact, equality holds in Lemma 4.5 only for grammars
which have an empty terminal alphabet.
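The point can be checked mechanically. The sketch below is our own rendering of the relations, not the dissertation's definitions: a reduce step replaces a suffix that matches a production body with its left-hand side, and a shift step appends an arbitrary terminal. Because shifts are unconstrained by any input, every aⁱ is reachable from ε even though the grammar generates only {a}.

```python
from collections import deque

# One-production grammar P = {S -> a}, modeled as body -> left-hand side.
P = {("a",): "S"}
T = ["a"]   # terminal alphabet

def reduce_steps(s):
    # Replace a matching suffix (the handle) with its left-hand side.
    for body, lhs in P.items():
        b = "".join(body)
        if s.endswith(b):
            yield s[: len(s) - len(b)] + lhs

def shift_steps(s):
    # Append ANY terminal -- shifts ignore the input entirely.
    for a in T:
        yield s + a

# Breadth-first search from the empty string under (reduce | shift)*.
reachable, frontier = {""}, deque([""])
while frontier and len(reachable) < 20:
    s = frontier.popleft()
    for t in list(reduce_steps(s)) + list(shift_steps(s)):
        if t not in reachable:
            reachable.add(t)
            frontier.append(t)

# Strings of terminals only: a^i for several i, although L(G) = {a}.
print(sorted(x for x in reachable if set(x) <= {"a"}))
```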
maton are created based on need. Moreover, the system accommodates extensible grammars whereby changes in the grammar during parsing produce corresponding changes in the relevant portions of the automaton.

Work which is similar in spirit to ours is that of Mayer [33]; deterministic canonical bottom-up parsing is examined in terms of reduction classes, where a reduction class is a pair of strings, the first and second components of which represent the left and right contexts, respectively, of parsing actions. Conditions are imposed on these reduction classes which ensure determinism, termination, and correctness. In short, the cited paper presents a framework for describing deterministic canonical bottom-up parsers, whereas our aim is a framework for characterizing general recognition and parsing.
Outline in Brief

This introductory chapter ends with a very short synopsis of the remaining chapters. The next chapter reviews some basic definitions and terminology. Chapters III through VII comprise the main body of this dissertation. Concluding remarks are made in Chapter VIII.

Chapters III and IV develop the mathematical foundation for this work. Set-theoretic characterizations of general top-down recognition and general bottom-up recognition are presented in those two chapters.

Earley's algorithm is the subject of the fifth chapter. In particular, our graphical variant of Earley's algorithm is presented there.

A general automata-based bottom-up recognizer is described in detail in Chapter VI. Chapter VII extends this recognizer into a general parser.

The major results of this dissertation are summarized in Chapter VIII. In addition, directions for future research, of which there are several, are delineated in that final chapter.
nullable nonterminal symbols. A line-by-line description of the General_LR0 recognizer follows.

(Line 1) General_LR0 is supplied with two arguments, a reduced S'-augmented grammar G and a string w over the terminal alphabet of G.

(Lines 2–4) By assumption, w is terminated with $. For simplicity, we also assume that the LR(0) automaton of G, M_C(G), is provided by some external agent.¹ Each of w, M_C, and G_R is visible to the functions that require access to them.

(Lines 5–6) Graph G_R is initialized to contain the single state q_0:0. The comment in line 6 indicates that G_R can be trivially embedded into an FSA that accepts PVP(G, ε) = {ε} at this point. Henceforth, the following statement holds for G_R throughout the duration of recognition. For q_j:i ∈ Q where 0 ≤ i ≤ n, every path from q_j:i to q_0:0 (1) spells the reversal of a string in VP(G, i:w), and (2) corresponds to the reversal of a path from I_0 to I_j in M_C. As seen below, even stronger statements may be made about G_R at particular points during recognition.

(Line 7) This for loop iterates once for each terminal symbol in w. Having i range from 0 to n rather than from 1 to n+1 yielded a cleaner expression of the algorithm. The rest of the discussion primarily elaborates on an ith iteration of this for loop for some i, 0 ≤ i ≤ n.

(Lines 8–10) The comment in line 8 is both a loop invariant and a precondition of the Reduce function. It clearly holds upon entry to the loop; the Reduce and Shift functions ensure that it also holds at the start of each iteration. This condition can be alternately stated as follows. A string γ ∈ V* is a member of PVP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_0:0 which spells γ^R. The comment in line 10 is a postcondition of the Reduce function and may be restated similarly; that is, a string γ ∈ V* is a member of VP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_0:0 which spells γ^R. Assuming that the precondition holds when Reduce is called, the Reduce function transforms G_R so that the postcondition holds.
¹ An alternative is for General_LR0 to construct M_C as an initial task.
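The invariant above can be pictured with a toy instance of the recognition graph. The states, the LR(0) state names I_0 through I_2, and the transitions below are invented for illustration, and the helper assumes the graph is acyclic; it collects the strings spelled by backward paths to q_0:0 and reverses them, so each result reads as a viable prefix itself.

```python
# A toy instance of the recognition graph G_R.  States are pairs
# q_j:i = (LR(0) state name, input position); every transition points
# backward toward the start state q_0:0.
q00, q11, q21 = ("I0", 0), ("I1", 1), ("I2", 1)
delta = {
    (q11, "a", q00),    # q_1:1 --a--> q_0:0
    (q21, "S", q00),    # q_2:1 --S--> q_0:0
}

def spelled_prefixes(state, delta, target):
    """Strings spelled by paths from `state` to `target`, reversed so
    that each result reads as the viable prefix itself."""
    if state == target:
        return {""}
    out = set()
    for (p, sym, q) in delta:
        if p == state:
            out |= {rest + sym for rest in spelled_prefixes(q, delta, target)}
    return out

print(spelled_prefixes(q11, delta, q00))    # the viable prefix "a"
print(spelled_prefixes(q21, delta, q00))    # the viable prefix "S"
```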
A Modified Earley Recognizer

A modified version of Earley's recognizer, called Earley', is described next. Earley' differs from Earley's algorithm in that it constructs a state-transition graph. The STG constructed by Earley' is denoted by G_E' = (Q_E', V, δ_E'). The states in Q_E' are the Earley states that are generated by Earley's algorithm. The state transitions in δ_E' are described below.

In recognizing w with respect to G, Earley' builds the same sequence of state sets as Earley's algorithm. In addition, a sequence of state-transition sets, viz., E_i for 0 ≤ i ≤ n+1, is constructed. These sets are also constructed in order of increasing i. In particular, E_i is constructed concurrently with S_i. For 0 ≤ i ≤ n+1, each member of E_i is a triple (s, X, t) where s ∈ S_j for some j, 0 ≤ j ≤ i, X ∈ V, and t ∈ S_i.

A particular set of state transitions E_i is constructed analogously to S_i. That is, (1) E_i is initialized to a finite set of transitions denoted by basis(E_i), and (2) a transition-set closure function, called E_Closure, is applied to basis(E_i) to complete the construction of E_i. For 0 ≤ i ≤ n+1,

basis(E_i) = ∅ if i = 0
basis(E_i) = {(s, a_i, t) | s = [A → α.a_iβ, j] ∈ S_{i-1}, t = [A → αa_i.β, j] ∈ S_i} if 1 ≤ i ≤ n+1

Note that basis(E_i) where i > 0 is determined from S_{i-1}, S_i, and a_i; basis(E_0) is a special case. For i > 0, the transitions in basis(E_i) may be installed by a slightly modified Earley Scanner function.
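Concretely, basis(E_i) can be read off from two consecutive Earley state sets. The sketch below uses our own item representation, tuples (lhs, body, dot, origin); it illustrates the definition above and is not the dissertation's modified Scanner.

```python
# How basis(E_i) can be read off from two consecutive Earley state sets.
def basis(S_prev, S_cur, a):
    """Return the transitions (s, a, t) with s = [A -> alpha . a beta, j]
    in S_{i-1} and t = [A -> alpha a . beta, j] in S_i."""
    out = set()
    for item in S_prev:
        (lhs, body, dot, origin) = item
        if dot < len(body) and body[dot] == a:
            t = (lhs, body, dot + 1, origin)   # advance the dot over a
            if t in S_cur:
                out.add((item, a, t))
    return out

# Toy example: grammar S -> a, input "a".
S0 = {("S", ("a",), 0, 0)}
S1 = {("S", ("a",), 1, 0)}
print(basis(S0, S1, "a"))
```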
For 0 ≤ i ≤ n, the E_Closure function is applied to basis(E_i) to complete the construction of E_i. Similar to S_{n+1}, E_{n+1} = basis(E_{n+1}).

E_i = E_Closure(basis(E_i)) if 0 ≤ i ≤ n
E_i = basis(E_i) if i = n+1

For 0 ≤ i ≤ n, E_Closure(basis(E_i)) is the smallest set of transitions which satisfies the following three rules.
The essence of the recognition scheme, called General_RR, is simple. Let z ∈ T* be a suffix of w and suppose that all proper suffixes of z are known members of SUFFIX(G). The set of strings defined by {α ∈ VP(G) | S ⇒rm* αz holds in G} is used to determine if z is a member of SUFFIX(G). This set is nonempty if and only if z ∈ SUFFIX(G). Moreover, it contains ε if and only if z ∈ L(G). The General_RR recognition scheme is described in greater detail in what follows. For reference, the recognizer is presented as Figure 3.1.
function General_RR(G = (V, T, P, S); w ∈ T*)
    // w = a_1 a_2 ... a_n, n ≥ 0, each a_i ∈ T
    PVP_RR(G, ε) := {ω | S → ω ∈ P}
    for i := 0 to n−1 do
        VP_RR(G, w:i) := ⇒rm*(PVP_RR(G, w:i))
        PVP_RR(G, w:i+1) := ⊣(VP_RR(G, w:i))
        if PVP_RR(G, w:i+1) = ∅ then Reject(w) fi
    od
    VP_RR(G, w) := ⇒rm*(PVP_RR(G, w))
    if ε ∈ VP_RR(G, w) then Accept(w) else Reject(w) fi
end

Figure 3.1 A General Top-Down Correct-Suffix Recognizer
For an arbitrary string z ∈ T*, two sets of viable prefixes are identified with z. The first set consists of the primitive RR-associates of z (in G) and is defined by PVP_RR(G, z) = {α ∈ V* | ω (⇒rm* ⊣)* α holds in G for some S → ω ∈ P}. The second set is a superset of the first; it consists of the RR-associates of z (in G) and is defined by VP_RR(G, z) = {α ∈ V* | ω (⇒rm* ⊣)* ⇒rm* α holds in G for some S → ω ∈ P}. By Theorems 3.13 and 3.20, VP_RR(G, z) = {α ∈ VP(G) | S ⇒rm* αz holds in G}, which equates to the set described in the preceding paragraph. Input string w is recognized by computing PVP_RR(G, w:i) and VP_RR(G, w:i) in turn as i ranges from 0 to len(w).

In words, VP_RR(G, z) is the reflexive-transitive closure of PVP_RR(G, z) under the ⇒rm relation. This fact is made explicit by expressing VP_RR(G, z) as {β ∈ V* | α ⇒rm* β holds in G for some α ∈ PVP_RR(G, z)}. Thus, if PVP_RR(G, z) is known, VP_RR(G, z) is obtained from it through appropriate application of the ⇒rm relation.
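Under these definitions the scheme can be prototyped directly. The sketch below models ⇒rm as expansion of a trailing nonterminal and ⊣ as stripping an expected terminal from the right end; the toy grammar, the tuple representation, and the bounded closure depth are our simplifications, not part of Figure 3.1.

```python
# A runnable sketch of the General_RR scheme for the toy grammar
# S -> aS | a.
GRAMMAR = {"S": [("a", "S"), ("a",)]}   # production bodies per lhs
N = set(GRAMMAR)                        # nonterminals

def rm_closure(strings, limit=8):
    """Close a set under rightmost expansion: if a string ends in a
    nonterminal, replace it with each production body.  `limit` bounds
    the work; a real implementation would exploit the regularity of
    viable prefixes instead."""
    out, work = set(strings), list(strings)
    while work and limit:
        limit -= 1
        alpha = work.pop()
        if alpha and alpha[-1] in N:
            for body in GRAMMAR[alpha[-1]]:
                new = alpha[:-1] + body
                if new not in out:
                    out.add(new)
                    work.append(new)
    return out

def unshift(strings, a):
    """The unshift step: strip an expected terminal off the right end."""
    return {alpha[:-1] for alpha in strings if alpha and alpha[-1] == a}

def general_rr(w):
    pvp = set(GRAMMAR["S"])             # PVP_RR(G, eps): bodies of S
    for a in reversed(w):               # grow the matched suffix of w
        pvp = unshift(rm_closure(pvp), a)
        if not pvp:
            return False                # Reject: not a suffix of a sentence
    return () in rm_closure(pvp)        # Accept iff eps in VP_RR(G, w)

print(general_rr(("a", "a")), general_rr(()))
```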
Viable Prefixes Revisited

Lemma 4.5 suggests that the reduce and shift relations, as defined, are inadequate as a basis for general bottom-up correct-prefix recognition. Indeed, the source of their deficiency is revealed when they are examined under the guise of viable prefixes.

First, recall that VP(G) is closed with respect to ⇒rm and ⊣. Formally, a string α ∈ V* is a viable prefix of G if and only if ω (⇒rm ∪ ⊣)* α holds in G for some S → ω ∈ P. The complementary situation that exists with respect to the ⊨ and ⊢ relations is investigated in the next series of lemmas.

Lemma 4.6 For α, β ∈ V*, if α ⊨ β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The contrapositive of this implication is proven, so we assume that β ∈ VP(G). Since α ⊨ β holds in G, β ⇒rm α also holds. By Lemma 3.14, this implies that α ∈ VP(G).
Corollary For α, β ∈ V*, if α ⊨ β holds in G and β ∈ VP(G), then α ∈ VP(G).

Lemma 4.7 For α, β ∈ V*, if α ⊢ β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The proof is similar to that of Lemma 4.6. Lemma 3.16 is relevant in this case.
Corollary For α, β ∈ V*, if α ⊢ β holds in G and β ∈ VP(G), then α ∈ VP(G).

Lemma 4.8 For α, β ∈ V*, if α (⊨ ∪ ⊢)* β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. Since α (⊨ ∪ ⊢)* β holds in G by assumption, α (⊨ ∪ ⊢)ⁿ β holds for some n ≥ 0. Applying Lemmas 4.6 and 4.7, this lemma is proven by induction on n.
Corollary For α, β ∈ V*, if α (⊨ ∪ ⊢)* β holds in G and β ∈ VP(G), then α ∈ VP(G).

By Lemma 4.8, V* \ VP(G) is closed with respect to ⊨ and ⊢. The effect on Lemma 4.1 of this complementary closure property is addressed in the following.

Lemma 4.9 For α, β ∈ V* and x ∈ T*, if α ∈ VP(G) and α (⇒rm* ⊣)* ⇒rm* β holds in G, then β (⊨* ⊢)* ⊨* α holds in G when ⊨ and ⊢ are restricted to VP(G).
Proof. By assumption, α is a viable prefix of G and α (⇒rm* ⊣)* ⇒rm* β holds in G. From Lemma 4.1, β (⊨* ⊢)* ⊨* α also holds in G. That this latter expression holds when ⊨ and ⊢ are restricted to VP(G) follows from Lemma 4.8 and its corollary.
28. function Reduce(i)
29.     subset := δ
30.     Traverse(Q_i, i)
31.     while subset ≠ ∅ do
32.         (p, X, q) := Remove(subset)
33.         for [A → αX.β] ∈ ψ(p) such that β ⇒* ε do
34.             for r ∈ succ(q, α^R) do    // Let goto(ψ(r), A) = I_j.
35.                 if q_j:i ∉ Q then
36.                     Q := Q ∪ {q_j:i}
37.                     Traverse({q_j:i}, i)
38.                 fi
39.                 if (q_j:i, A, r) ∉ δ then
40.                     δ := δ ∪ {(q_j:i, A, r)}
41.                     subset := subset ∪ {(q_j:i, A, r)}
42.                 fi
43.             od
44.         od
45.     od
46. end

47. function Traverse(Q_subset, i)
48.     while Q_subset ≠ ∅ do
49.         q := Remove(Q_subset)
50.         for goto(ψ(q), A) = I_j such that A ⇒* ε do    // A ∈ N
51.             if q_j:i ∉ Q then
52.                 Q := Q ∪ {q_j:i}
53.                 Q_subset := Q_subset ∪ {q_j:i}
54.             fi
55.             δ := δ ∪ {(q_j:i, A, q)}    // Never redundant.
56.         od
57.     od
58. end
Figure 6.1 continued
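Several tests in Reduce and Traverse have the form β ⇒* ε or A ⇒* ε. The nullable nonterminals can be computed once, up front, with a standard fixpoint iteration; the sketch below is that textbook technique in our own notation, not code from Figure 6.1.

```python
# Compute the set of nullable nonterminals (A =>* eps) by fixpoint
# iteration: a nonterminal is nullable once some body consists entirely
# of already-nullable symbols (the empty body counts vacuously).
def nullable(grammar):
    null = set()
    changed = True
    while changed:
        changed = False
        for lhs, bodies in grammar.items():
            if lhs in null:
                continue
            if any(all(s in null for s in body) for body in bodies):
                null.add(lhs)
                changed = True
    return null

# Example: A -> eps, B -> AA, C -> a.
print(nullable({"A": [()], "B": [("A", "A")], "C": [("a",)]}))
```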
respectively, at distinct stages of recognition. The makeup of G_R at any given time determines which regular set is recognized by M_R. The General_LR0 recognizer is best understood through an appreciation of how it transforms G_R.

The General_LR0 recognizer comprises a main function (lines 1–17 in Figure 6.1) and three auxiliary functions, Shift, Reduce, and Traverse. The Shift function (lines 18–27) computes the ⊢ relation whereas Reduce (lines 28–46) computes the ⊨* closure. The Traverse function (lines 47–58) is called from within Reduce. It handles certain transitions on
(2) Consider alternate control automata for implementing the General_LR recognition scheme (including automata that are attributed with lookahead). We have already suggested employing automata that are intermediate between NLR(0) automata and LR(0) automata.

(3) Identify means for classifying ambiguity and investigate disambiguation strategies.

As described, the General_LR0 parser produces a parse of the input string only after the string is accepted, i.e., like Earley's algorithm. It would be advantageous to be able to obtain parse fragments as soon as they are known to be part of a final parse. The parser would then behave more like an extended LR parser. The General_LR0 parser should be modified to provide for such a piecemeal delivery of a parse. Note that such a mechanism would have implications on garbage collection.

The O(n^(p+1)) worst-case time complexity of General_LR0 compares unfavorably with Earley's algorithm. The last topic that we suggest addresses this. A grammar is in canonical two-form if its productions are of the forms A → BC, A → B, A → a, and A → ε [39]. Clearly, every canonical two-form grammar can be recognized in O(n³) time. One possible approach to recognizing an arbitrary grammar in O(n³) time is to transform it into an equivalent canonical two-form grammar and recognize the input string with respect to the new grammar. A parse in the original grammar could then be reconstructed from the parse that is obtained in the transformed canonical two-form grammar.
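As a rough illustration of this last suggestion, the sketch below reaches two-form by introducing fresh nonterminals: one per terminal occurring in a long body, and one per split of a body longer than two symbols. It treats lowercase symbols as terminals by convention, and it is only our illustration of the idea, not the transformation of [39].

```python
# Split production bodies into canonical two-form (A -> BC, A -> B,
# A -> a, A -> eps) by introducing fresh nonterminals X1, X2, ...
def to_two_form(grammar):
    new, fresh = {}, [0]

    def name():
        fresh[0] += 1
        return f"X{fresh[0]}"

    def wrap(sym):
        # Lowercase symbols are terminals by convention (an assumption);
        # inside long bodies each gets a fresh nonterminal X -> a.
        if not sym.islower():
            return sym
        n = name()
        new.setdefault(n, []).append((sym,))
        return n

    for lhs, bodies in grammar.items():
        for body in bodies:
            if len(body) <= 1:                  # A -> B, A -> a, A -> eps
                new.setdefault(lhs, []).append(body)
                continue
            syms = [wrap(s) for s in body]
            cur = lhs
            while len(syms) > 2:                # peel one symbol at a time
                n = name()
                new.setdefault(cur, []).append((syms[0], n))
                cur, syms = n, syms[1:]
            new.setdefault(cur, []).append(tuple(syms))
    return new

# Example: S -> aSb | a.
print(to_two_form({"S": [("a", "S", "b"), ("a",)]}))
```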
Lemma 4.15 Relation ⊣ is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar, a a terminal symbol in T, and L an arbitrary regular subset of VP(G). Since regular languages are closed under concatenation, La is a regular language. However, La may contain some strings which are not viable prefixes of G. This is rectified by intersecting La with VP(G). Since regular languages are also closed under intersection, La ∩ VP(G) is regular. Clearly, αa ∈ ⊣_a(L) if and only if α ∈ L and αa ∈ VP(G) (i.e., α ⊣_a αa holds in G). Thus, ⊣_a(L) = La ∩ VP(G), so ⊣ is regularity-preserving.
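The single step used in the proof above can be made concrete on finite samples. The following Python sketch is illustrative only and is not from the dissertation: both L and VP(G) are approximated by finite sets of strings, and the toy grammar S → aSb | ab together with its viable-prefix sample are assumptions of the example (the real VP(G) is a regular, generally infinite, set).

```python
# Hedged sketch: the identity from the proof, ⊣_a(L) = La ∩ VP(G), on finite
# set approximations of L and VP(G).

def shift_image(L, a, VP):
    """La ∩ VP: append the terminal a to each string of L, keep viable prefixes."""
    return {alpha + a for alpha in L} & VP

# Hypothetical sample of viable prefixes for the toy grammar S -> aSb | ab:
VP = {"", "a", "aa", "aS", "aaS", "ab", "aab", "aSb", "aaSb"}

assert shift_image({"", "a"}, "a", VP) == {"a", "aa"}
assert shift_image({"aS", "aaS"}, "b", VP) == {"aSb", "aaSb"}
```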
Theorem 4.16 Let G = (V, T, P, S) be an arbitrary grammar and let x be an arbitrary string over T. Then PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Proof. Applying Lemmas 4.14 and 4.15 and noting that PVP_LR(G, ε) = {ε} is regular, the theorem is proven by induction on len(x).
Discussion
A simple description of general left-to-right bottom-up recognition was presented. The General_LR recognition scheme was derived from General_RR by defining the inverses of ⇒_R and ⊢, restricting them to VP(G), reversing the direction in which the input string is scanned, and manipulating some relational expressions. The two inverse relations, ⊨ and ⊣, preserve regularity. Thus, the essence of general left-to-right bottom-up recognition was captured in terms of computing the images of regular subsets of VP(G) under these relations.
Together, the results in Chapters III and IV provide a succinct and elegant characterization of general context-free recognition. This was accomplished by starting from two binary relations on strings and applying basic set-theoretic concepts. There was no need to resort to automata, although automata are certainly useful for implementing the abstract recognizers. In short, the formal development contained in these two chapters provides a framework, founded on a minimal number of kernel concepts, within which the intrinsic properties of general canonical context-free recognizers may be further investigated.
(61, 68) The set variable Q_subset' is initialized to the contents of Q_subset in line 61. In line 68, each new state that is added to Q within the first while loop is also added to Q_subset'. The states contained in Q_subset' after the first loop completes are processed later in the third while loop.
(70) The transitions on nullable nonterminals are not directly added to δ as before. Instead, they are entered into a list called δ_sorted_list. The elements of the form (p, A, q) in δ_sorted_list are sorted in order of increasing elength(A). The contents of δ_sorted_list are processed by the second while loop.
Within the second while loop, an appropriate parse annotation is determined for each element in δ_sorted_list and the annotated transitions are installed into the recognition graph. The parse annotation assigned to (p, A, q) is determined by nuller(A).
(73) Each element in δ_sorted_list is considered in turn. No additional elements are added to δ_sorted_list within this loop.
(74) The element (p, A, q) at the head of δ_sorted_list is removed. At this point, we know that elength(A) ≥ elength(A') for each element (p', A', q') removed from δ_sorted_list in an earlier iteration of the loop.
(75-76) Suppose that nuller(A) = A→ε. Then [ε] is the appropriate parse annotation for (p, A, q). Thus, the transition (p, A, q, [ε]) is added to δ.
(77-80) Otherwise, nuller(A) = A→B_1B_2⋯B_m for some production A→B_1B_2⋯B_m where m ≥ 1. This implies that elength(B_i) < elength(A) holds for each B_i. Since δ_sorted_list was sorted in order of increasing elength, an annotated transition on each B_i has already been installed in G_R. In particular, there must be a path (q_m, q_{m-1}, …, q_1, q) in G_R which spells B_mB_{m-1}⋯B_1. The transitions in this path are of the form (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some parse annotations [π_j]. In this case, (p, A, q, [&π_1, &π_2, …, &π_m]) is the appropriate transition to add to δ.
The third while loop processes the states contained in Q_subset'. In particular, for each state p in Q_subset' and each item of the form A→αX·β ∈ ι(p) such that β ⇒* ε holds
CHAPTER II
NOTATION AND TERMINOLOGY
This chapter summarizes some of the elementary formal aspects of this work, viz., assorted mathematical notation and definitions. In particular, some basic concepts of formal languages, directed graphs, and finite-state automata are reviewed. A more comprehensive presentation of the relevant theory can be found in the monograph by Sippu and Soisalon-Soininen [39].
Elements of Formal Language Theory
An alphabet, denoted in this section by Σ, is a finite set of symbols. A string over Σ is a finite sequence of elements from Σ; the null string corresponds to the empty sequence and is denoted by ε. A (formal) language over Σ is a set of strings over Σ; the set of all strings over Σ is denoted by Σ* and Σ⁺ = Σ*\{ε}.
The length of a string is the number of symbols that it contains. The length of a string x ∈ Σ* is denoted by len(x) where len is defined recursively as follows: len(ε) = 0; ∀a ∈ Σ, len(a) = 1; ∀x, y ∈ Σ*, len(xy) = len(x) + len(y).
The previous definition used the notion of string concatenation, viz., xy. Concatenation is generalized to apply to languages as follows. Given two languages L and L' and a string x, LL' = {yz | y ∈ L, z ∈ L'}, xL = {x}L, and Lx = L{x}. The identity and zero of concatenation are ε and ∅ (the empty set), respectively. Thus, with x denoting either a string or a language, xε = εx = x and x∅ = ∅x = ∅.
Let L be a language and i a natural number. The ith power of L, L^i, is defined recursively by L^0 = {ε} and L^{i+1} = LL^i. The positive closure of L and the Kleene closure of L are defined by L⁺ = ∪_{i>0} L^i and L* = ∪_{i≥0} L^i = L⁺ ∪ {ε}, respectively.
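For concreteness, the operations just defined can be realized directly on finite languages. The following Python sketch is illustrative and not part of the original text; since L* is generally infinite, only a bounded approximation of the Kleene closure is computed.

```python
# Hedged sketch of the language operations above, on finite sets of strings.
# The null string ε is the empty Python string "".

def concat(L1, L2):
    """LL' = {yz | y in L, z in L'}."""
    return {y + z for y in L1 for z in L2}

def power(L, i):
    """L^0 = {ε}; L^{i+1} = L L^i."""
    result = {""}
    for _ in range(i):
        result = concat(L, result)
    return result

def closure_upto(L, k):
    """Union of L^0 .. L^k: a finite approximation of the Kleene closure L*."""
    result = set()
    for i in range(k + 1):
        result |= power(L, i)
    return result

L = {"a", "bb"}
assert concat(L, {"c"}) == {"ac", "bbc"}
assert concat(L, set()) == set()          # the empty set is the zero of concatenation
assert power(L, 2) == {"aa", "abb", "bba", "bbbb"}
assert "" in closure_upto(L, 3) and "ab" not in closure_upto(L, 3)
```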
The proof that PVP_RR(G, z) and VP_RR(G, z) are regular languages is based indirectly on proofs that ⇒_R and ⊢ are regularity-preserving relations. First, a relationship is established between context-free grammars and regular canonical systems. Specifically, for a grammar G = (V, T, P, S), the regular canonical system induced by G is defined by C = (V, Π_P) where Π_P = {A→ω | A→ω ∈ P}.
Lemma 3.23 Relation ⇒_R is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar, C = (V, Π_P) the regular canonical system induced by G, and L an arbitrary regular language over V. By Fact 3.1, r(L, C, {ε}) = {δ ∈ V* | γ ⇒*_C δ holds in C for some γ ∈ L} is regular. Since the ⇒_R and ⇒_C relations are equivalent, ⇒*_R(L) = r(L, C, {ε}). Therefore, ⇒_R is regularity-preserving.
Lemma 3.24 Relation ⊢ is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar and let L ⊆ V* be an arbitrary regular language. The quotient of a language L_1 with respect to a language L_2 is defined by L_1/L_2 = {x | xy ∈ L_1 for some y ∈ L_2}. Since the quotient of a regular language with respect to an arbitrary set is a regular language [24], ∀a ∈ T, ⊢_a(L) = L/{a} is regular. Therefore, ⊢ is regularity-preserving.
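The quotient construction used in the proof of Lemma 3.24 is easy to exhibit on finite languages. The Python sketch below is illustrative only; the sample strings are arbitrary examples, not drawn from the dissertation.

```python
# Hedged sketch: right quotient L1/L2 = {x | xy in L1 for some y in L2}, and
# the chop image ⊢_a(L) = L/{a} used in the proof above.

def right_quotient(L1, L2):
    """Strip a suffix drawn from L2 off each string of L1 that has one."""
    return {w[: len(w) - len(y)] for w in L1 for y in L2 if w.endswith(y)}

def chop_image(L, a):
    """⊢_a(L) = L/{a}."""
    return right_quotient(L, {a})

L = {"Sa", "aba", "b"}
assert chop_image(L, "a") == {"S", "ab"}
assert right_quotient(L, {"ba"}) == {"a"}
assert right_quotient(L, {""}) == L       # quotient by {ε} is the identity
```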
Theorem 3.25 Let G = (V, T, P, S) be an arbitrary grammar and let z ∈ T* be an arbitrary string. Then PVP_RR(G, z) and VP_RR(G, z) are regular languages.
Proof. By induction on len(z), this theorem follows from Lemmas 3.23 and 3.24 and the fact that PVP_RR(G, ε) = {ω | S→ω ∈ P} is regular.
Top-Down Left-to-Right Recognition
In this section, a general top-down recognition scheme that presumes a left-to-right scan of the input string is formally developed. Toward that end, consider the two relations on V* defined by {(Aβ, ωβ) | A→ω ∈ P, β ∈ V*} and {(aβ, β) | a ∈ T, β ∈ V*}. Informally, these relations represent left-biased counterparts of ⇒_R and ⊢, respectively. Along the lines of General_RR, a general top-down correct-prefix recognizer can be based on these two relations.
60. function Traverse(Q_subset, i)
61.   Q_subset' := Q_subset
62.   while Q_subset ≠ ∅ do
63.     q := Remove(Q_subset)
64.     for goto(ι(q), A) = I_j such that A ⇒* ε do    // A ∈ N
65.       if q_{j:i} ∉ Q then
66.         Q := Q ∪ {q_{j:i}}
67.         Q_subset := Q_subset ∪ {q_{j:i}}
68.         Q_subset' := Q_subset' ∪ {q_{j:i}}
69.       fi
70.       Insert(δ_sorted_list, (q_{j:i}, A, q))    // Never redundant.
71.     od
72.   od
73.   while δ_sorted_list ≠ ∅ do
74.     (p, A, q) := Remove_head(δ_sorted_list)
75.     if nuller(A) = A→ε then
76.       δ := δ ∪ {(p, A, q, [ε])}
77.     else    // Let nuller(A) = A→B_1B_2⋯B_m, m ≥ 1.
78.       // ∃ a path (q_m, q_{m-1}, …, q_1, q) in G_R spelling B_mB_{m-1}⋯B_1,
79.       // i.e., (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some π_j.
80.       δ := δ ∪ {(p, A, q, [&π_1, &π_2, …, &π_m])}
81.     fi
82.   od
83.   while Q_subset' ≠ ∅ do
84.     q := Remove(Q_subset')
85.     for A→αX·β ∈ ι(q) such that β ⇒* ε do
86.       if β = ε then
87.         Let the parse annotation for β be []
88.       else    // Let β = B_1B_2⋯B_m, m ≥ 1.
89.         // ∃ a path (q_m, …, q_1, q) in G_R spelling B_mB_{m-1}⋯B_1, i.e.,
90.         // (q_j, B_j, q_{j-1}, [π_j]), (q_1, B_1, q, [π_1]) ∈ δ, m ≥ j ≥ 2, for some π_j.
91.         Let the parse annotation for β be [&π_1, &π_2, …, &π_m]
92.       fi
93.     od
94.   od
95. end
Figure 7.1 continued
field is used for storing the parse annotation corresponding to the path traversed so far in the course of making a reduction. Consider the reduction from p on the production A→αXβ where β ⇒* ε holds in G. The parse annotation of every transition on A that results from this reduction will include a pointer to the parse annotation of the transition on X from p to
vides a basis from which many issues relating to context-free recognition and parsing may be further investigated. Most notably, our viable-prefix-based model of recognition and parsing offers a particularly appropriate framework within which a broad spectrum of related parsing strategies (LR parsers, the Earley and Tomita algorithms, and our general parsers) may be further studied and compared.
Directions for Future Research
Before concluding, we suggest some possible directions for further research. There are several worthwhile prospects. Of course, it is assumed that the framework laid down herein would be used as a starting point for the endeavors described below.
Several automata-based versions of the General_LR recognition scheme were considered. Specifically, concrete realizations of General_LR were borne out by the Earley', General_LR0, and General_NLR0 recognizers. The other left-to-right recognition scheme, General_LL, was mimicked by Earley' in a rather obscure fashion. The automata-theoretic aspects of General_LL should be investigated to determine more direct means for tracking the sets of viable suffixes that are computed by it. Our preliminary findings along this line indicate that an automata-based General_LL recognizer that runs in O(n³) time in the worst case is indeed attainable. That is, the time complexity does not depend on the length of production right-hand sides as is the case with General_LR0. However, we were unable to extend this general viable-suffix-based recognizer into a parser, so further study of this issue was suspended.
It is expected that a pursuit of the following three topics would benefit from experimenting with actual implementations.
(1) Ascertain a more precise characterization of the O(n²)-time and O(n)-time grammar classes. It is well known that Earley's algorithm recognizes grammars with bounded ambiguity in quadratic time; moreover, even some ambiguous grammars are recognized in linear time.
CHAPTER III
GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general top-down recognition is developed in this chapter. Two contrasting top-down recognition schemes are presented; they are distinguished by the direction in which the input string is scanned, viz., right-to-left or left-to-right. Since the two schemes turn out to be mirror images, one is derived in terms of the other. Our approach to general recognition is based on certain regularity properties of context-free grammars. Consequently, the framework is designed accordingly to highlight these properties.
The primary purpose of this chapter is to catalog some formal aspects of general top-down recognition. An investigation of the practical utility of the two general top-down recognition schemes is left for future work. However, the theoretical development contained herein is invaluable toward deriving a practical, truly general, bottom-up parser; that is the thrust of the remaining chapters. An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Recognition Based on Derivations
In a top-down approach to recognition, an attempt is made to construct a parse tree for an input string, perhaps implicitly, by starting at the root and progressing toward the leaves. The downward growth of an incomplete parse tree occurs at the frontier of the tree which may be represented by the string of grammar symbols which label its nodes. A basic step in constructing the parse tree involves applying the ⇒ relation to this linearized form of the frontier. However, the derives relation is too undisciplined, in general, for describing top-down recognition in a useful fashion since there is no indication of which nonterminal symbol
Since G is reduced, β ⇒*_rm x holds in G for some x ∈ T*. Therefore, γβ ⇒*_rm γx and γβ(⇒*_R ⊢)*_x γ both hold in G. Combining these results in the manner of Lemma 3.10, S(⇒*_R ⊢)*_{xy} ⇒*_R γ holds in G. Since the nontrivial rightmost derivation of γxy from S must have a first step of the form S ⇒_R ω for some S→ω ∈ P, ω(⇒*_R ⊢)*_z ⇒*_R γ holds in G where z = xy.
Theorem 3.20 VP(G) = {γ ∈ V* | ω(⇒*_R ⊢)*_x ⇒*_R γ holds in G for some S→ω ∈ P and x ∈ T*}.
Proof. This theorem follows directly from Lemmas 3.18 and 3.19.
Corollary VP(G) = {γ ∈ V* | S(⇒_R ∪ ⊢)⁺ γ holds in G}.
One final observation is that VP(G) is closed under (⇒_R ∪ ⊢). Indeed, this is immediate from Lemmas 3.14 and 3.16. Due to its importance in general canonical top-down recognition, this property is formally recorded below.
Corollary For α, β ∈ V*, if α ∈ VP(G) and α(⇒_R ∪ ⊢)* β holds in G, then β ∈ VP(G).
General Top-Down Correct-Suffix Recognition
Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G is described next. In this scheme, w is scanned from right to left. As a consequence, an incrementally longer suffix of w is recognized in the process.
The general recognition scheme effectively pursues all of the possible rightmost derivations of w in parallel. This is carried out through regularity-preserving operations on regular subsets of VP(G). Adoption of this approach obviates the need for backtracking.
General context-free recognition is an inherently nondeterministic task. Hence, it is not generally possible to pursue the rightmost derivations of w exclusively. Instead, at the point where a suffix z of w has been processed, all rightmost derivations (from S) of all strings in T*z ∩ L(G) are followed (i.e., all sentences that have z as a suffix).
as the LL dual to the viable prefix and plays a commensurately central role in the theory. Symmetrically to the definition of viable prefixes, viable suffixes are defined in terms of leftmost derivations and left sentential forms. A string γ ∈ V* is a viable suffix of G if S ⇒*_lm xAδ ⇒_lm xαβδ = xαγ^R holds in G for some x ∈ T*, A→αβ ∈ P, and δ ∈ V*. Thus, viable suffixes are reversals of certain suffixes of left sentential forms. The set of viable suffixes of G is denoted by VS(G).
The next series of lemmas develops a definition of the viable suffixes of G in terms of the ⇒_R and ⊢ relations of G^R. In that regard, the following result is useful.
Fact 3.3 (1) A string γ ∈ V* is a viable prefix of G if and only if γ is a viable suffix of G^R; (2) a string γ ∈ V* is a viable suffix of G if and only if γ is a viable prefix of G^R.
Proof. This is presented by Sippu and Soisalon-Soininen as Fact 3.2 [38].
Lemma 3.34 For α, β ∈ V*, if α is a viable suffix of G and α ⇒_R β holds in G^R, then β is a viable suffix of G.
Proof. If α is a viable suffix of G, then α is a viable prefix of G^R. Since α ⇒_R β holds in G^R, β is a viable prefix of G^R as well. Therefore, β is a viable suffix of G.
Lemma 3.35 For α, β ∈ V*, if α is a viable suffix of G and α ⇒*_R β holds in G^R, then β is a viable suffix of G.
Proof. This is a consequence of Lemmas 3.15 and 3.34.
Lemma 3.36 For α, β ∈ V*, if α is a viable suffix of G and α ⊢ β holds in G^R, then β is a viable suffix of G.
Proof. Using Fact 3.3, the proof of this lemma parallels that of Lemma 3.16.
Lemma 3.37 For γ ∈ V*, if ω(⇒*_R ⊢)*_x ⇒*_R γ holds in G^R for some S→ω ∈ P^R and x ∈ T*, then γ is a viable suffix of G.
Proof. Assume that ω(⇒*_R ⊢)*_x ⇒*_R γ holds in G^R for some S→ω ∈ P^R and x ∈ T*. By Lemma 3.18, this implies that γ is a viable prefix of G^R. Thus, γ ∈ VS(G) by Fact 3.3.
Lemma 3.38 For γ ∈ V*, if γ is a viable suffix of G, then ω(⇒*_R ⊢)*_x ⇒*_R γ holds in G^R for some S→ω ∈ P^R and x ∈ T*.
Lemma 3.7 For α, β ∈ V*, if α ⇒*_R β holds in G, then αz ⇒*_rm βz holds in G for every z ∈ T*.
Proof. If α ⇒_R β holds in G, then α ⇒_rm β holds in G by Lemma 3.1. The consequent in its full generality can then be established by an induction on the length of an arbitrary string z ∈ T*.
Lemma 3.8 For α, β ∈ V* and z ∈ T*, if α(⇒*_R ⊢)*_z β holds in G, then α ⇒*_rm βz holds in G.
Proof. The proof is by induction on n = len(z).
Basis (n = 0). In this case, z = ε. By assumption, α(⇒*_R ⊢)*_ε β holds in G. It must be the case that α = β, so α ⇒*_rm β trivially holds in G.
Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^{n-1}. Assume that α(⇒*_R ⊢)^n_{ay} β holds in G. Then α(⇒*_R ⊢)^{n-1}_y γ(⇒*_R ⊢)_a β holds in G for some γ ∈ V*. By the induction hypothesis, α ⇒*_rm γy holds in G. Furthermore, γ(⇒*_R ⊢)_a β implies that γ ⇒*_R βa ⊢_a β holds in G. By Lemma 3.7, γy ⇒*_rm βay holds in G, so α ⇒*_rm βay = βz also holds in G.
Lemma 3.9 For α, β ∈ V* and z ∈ T*, if α(⇒*_R ⊢)*_z ⇒*_R β holds in G, then α ⇒*_rm βz holds in G.
Proof. By assumption, α(⇒*_R ⊢)*_z ⇒*_R β holds in G. This implies that α(⇒*_R ⊢)*_z γ ⇒*_R β holds in G for some γ ∈ V*. By Lemma 3.8, α ⇒*_rm γz holds in G. Since γ ⇒*_R β holds in G, γz ⇒*_rm βz holds in G by Lemma 3.7. Therefore, α ⇒*_rm βz holds in G.
Lemma 3.10 For α, β, γ ∈ V* and x, y ∈ T*, if α(⇒*_R ⊢)*_x ⇒*_R β and β(⇒*_R ⊢)*_y ⇒*_R γ hold in G, then α(⇒*_R ⊢)*_{yx} ⇒*_R γ holds in G.
Proof. The key observation relevant here is that the expression α(⇒*_R ⊢)*_x ⇒*_R β(⇒*_R ⊢)*_y ⇒*_R γ may be rewritten as α(⇒*_R ⊢)*_x (⇒*_R ⊢)*_y ⇒*_R γ; to make this transformation, the occurrence of ⇒*_R preceding β in the first expression is absorbed by (⇒*_R ⊢)*_y if y ≠ ε and by the occurrence of ⇒*_R preceding γ otherwise. It is now immediate that α(⇒*_R ⊢)*_{yx} ⇒*_R γ holds in G.
which induces a surjection (resp. bijection) g: δ_1 → δ_2 defined by g((p, X, q)) = (f(p), X, f(q)), p, q ∈ Q_1, X ∈ V ∪ {ε}.
Let M_NC(G) = (I, V, goto, I_0, I) with I = {I_0, I_1, …, I_{m-1}} be the NLR(0) automaton of G. Let G_E' = (Q_E', V, δ_E') be the Earley state graph constructed by Earley' when it is applied to G and w. Lastly, let G_R(M_NC) = (Q, V, δ) be the recognition graph constructed by General_NLR0 when it is applied to G and w. Graph G_E' is homomorphic to G_R as follows. The function f_1: Q_E' → Q defined by f_1([A→α·β, j] ∈ S_i) = q_{k:i} where I_k = {A→α·β} is a surjection which induces the surjection g_1: δ_E' → δ defined by g_1((r, X, s)) = (f_1(s), X, f_1(r)), r, s ∈ Q_E', X ∈ V ∪ {ε}.
If an STG G_1 is homomorphic to an STG G_2, then an STG G_1' can be derived from G_1 such that G_1 is homomorphic to G_1' and G_1' is isomorphic to G_2. Our comparison of Earley' and General_NLR0 is concluded by defining an STG G_E'' = (Q_E'', V, δ_E'') such that G_E' is homomorphic to G_E'' and G_E'' is isomorphic to G_R.
For 0 ≤ k < m and 0 ≤ i ≤ n+1, let s_{k:i} = {[A→α·β, j] ∈ S_i | I_k = {A→α·β}} and let Q_E'' = {s_{k:i} | s_{k:i} ≠ ∅}. The transitions of G_E'' are defined as follows. For r', s' ∈ Q_E'' and X ∈ V ∪ {ε}, (r', X, s') ∈ δ_E'' if and only if ∃ r, s ∈ Q_E' such that r ∈ r', s ∈ s', and (r, X, s) ∈ δ_E'. By construction, G_E' is homomorphic to G_E''.
That G_E'' is isomorphic to G_R is established as follows. Define the function f_2: Q_E'' → Q by f_2(s_{k:i}) = q_{k:i}. The function f_2 is a bijection which induces the bijection g_2: δ_E'' → δ defined by g_2((r', X, s')) = (f_2(s'), X, f_2(r')), r', s' ∈ Q_E'', X ∈ V ∪ {ε}. Therefore, G_E'' is isomorphic to G_R.
Implementation Considerations
For the remainder of this chapter, we turn our attention back to the General_LR0 recognizer. In this section, some issues that are pertinent to implementing General_LR0 are
Corollary Let p = (s_0, s_1, …, s_m), m ≥ 0, be a rooted path in G_E' such that γ ∈ V* is the state derivative of p and s_m = [A→α·β, j] ∈ basis(S_i) for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then γ ∈ PVS(G, i:w).
Proof. If m = 0, then i = 0, and s_m = s_0 = [S'→·S$, 0] ∈ basis(S_0). Thus, the state derivative of p is $S = γ which is in PVS(G, 0:w) by definition. If m > 0, then i > 0, s_{m-1} = [A→α'·a_iβ, j] ∈ S_{i-1}, and s_m = [A→α'a_i·β, j] for some α' ∈ V*, i.e., α = α'a_i. The state derivatives of p' = (s_0, s_1, …, s_{m-1}) and p are (a_iβδ)^R and (βδ)^R = γ, respectively, for some δ ∈ V*. By Lemma 5.5, (a_iβδ)^R ∈ VS(G, i-1:w), so γ ∈ PVS(G, i:w).
The next lemma provides the converse to Lemma 5.5.
Lemma 5.6 Let γ be a string in VS(G, i:w) and let [A→α·β, j] ∈ S_i be a state which is valid for γ for some A→αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there is a rooted path in G_E' to [A→α·β, j] with state derivative γ.
Proof. A rigorous proof of this lemma has so far eluded us. Consequently, a very informal intuitive argument is given instead. A more convincing proof is left for future work.
Observe that the basic result provided by Lemmas 5.2 and 5.3 is a graphical interpretation of Fact 5.2 in terms of certain properties of G_E'. In turn, the goal of Lemmas 5.5 and 5.6 is a graphical interpretation of Fact 5.3 in terms of certain other properties of G_E'.
Consider VP(G, i:w) and VS(G, i:w) for some i, 0 ≤ i ≤ n+1. It was previously established that γ ∈ V* is a member of VP(G, i:w) if and only if there is a rooted path in G_E' to some state in S_i which spells γ. Lemma 5.5 showed that γ ∈ V* is a member of VS(G, i:w) if there is a rooted path in G_E' to some state in S_i with state derivative γ. It would be rather counterintuitive and at variance with Fact 5.3 if the converse to the previous statement did not also hold. In fact, such a result would appear to subvert the generality of Earley's algorithm.
In contrast to the case with General_LR, Lemmas 5.5 and 5.6 establish a more covert relationship between Earley' and General_LL. This is in keeping with the relative complexity of the definitions of the spelling of a path and its state derivative.
Corollary L(G) = {w ∈ T* | S(⇒*_R ⊢)*_w ⇒*_R ε holds in G}.
Corollary SUFFIX(G) = {z ∈ T* | S(⇒*_R ⊢)*_z α holds in G for some α ∈ V*}.
Viable Prefixes
A concept that plays a central role in LR parsing theory is that of a viable prefix. Viable prefixes are also prominent in our treatment of general recognition and parsing. Viable prefixes are defined in terms of rightmost derivations and right sentential forms as follows. A string γ ∈ V* is a viable prefix³ of G if S ⇒*_rm δAz ⇒_rm δαβz = γβz holds in G for some δ ∈ V*, A→αβ ∈ P, and z ∈ T*. Thus, viable prefixes are certain prefixes of right sentential forms. The set of viable prefixes of G is denoted by VP(G).
In the next series of lemmas, a definition of the viable prefixes of G in terms of the R-derives and chop relations is developed. It transpires that this definition is remarkably similar to the definition of SF_R(G) just given. Since viable prefixes are defined via nontrivial rightmost derivations from S, our definition is carefully tailored to include S in VP(G) only in case S ⇒⁺_rm Sα holds in G for some α ∈ V*.
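The definition above can be animated by a bounded brute-force search: every rightmost step δAz ⇒_rm δαβz contributes the prefix δα for each split ω = αβ of the applied production's right-hand side. The Python sketch below is illustrative and not from the dissertation; the grammar S → aSb | c is a hypothetical example, and only derivations of bounded depth are explored.

```python
# Hedged sketch: enumerate (a finite portion of) VP(G) directly from the
# definition. Symbols are single characters and uppercase characters are
# nonterminals, so z = sf[pos + 1:] below is always a terminal string.

def viable_prefixes(P, S, depth):
    """All γ = δα arising within `depth` rightmost derivation steps."""
    found, forms = set(), {S}
    for _ in range(depth):
        nxt = set()
        for sf in forms:
            pos = max((k for k, X in enumerate(sf) if X.isupper()), default=-1)
            if pos < 0:
                continue                        # no nonterminal: derivation done
            delta, A, z = sf[:pos], sf[pos], sf[pos + 1:]
            for w in P.get(A, []):
                for cut in range(len(w) + 1):
                    found.add(delta + w[:cut])  # γ = δα for each split ω = αβ
                nxt.add(delta + w + z)          # the next right sentential form
        forms = nxt
    return found

vp = viable_prefixes({"S": ["aSb", "c"]}, "S", 3)
assert {"", "a", "aS", "aSb", "c", "aa", "aaS", "ac"} <= vp
assert "b" not in vp and "cb" not in vp
```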
Lemma 3.14 For α, β ∈ V*, if α ⇒_R β holds in G and α is a viable prefix of G, then β is a viable prefix of G.
Proof. Since α ⇒_R β holds in G by assumption, α = γA and β = γω for some γ ∈ V* and A→ω ∈ P. Also by assumption, α ∈ VP(G), so S ⇒*_rm δBz ⇒_rm δστz = ατz holds in G for some δ ∈ V*, B→στ ∈ P, and z ∈ T*. Since G is reduced, τ ⇒*_rm y holds in G for some y ∈ T*. Thus, S ⇒*_rm αyz = γAyz ⇒_rm γωyz = βyz holds in G which shows that β is a viable prefix of G.
Lemma 3.15 For α, β ∈ V*, if α ⇒*_R β holds in G and α is a viable prefix of G, then β is a viable prefix of G.
Proof. Applying the preceding lemma, this lemma is established by an easy induction on the length of a strong rightmost derivation of β from α.
³ This definition is borrowed from Sippu and Soisalon-Soininen [38]. Although it differs slightly from others (cf. [5]), it is more appropriate to our needs.
Consider a point during the parse of an input string at which we would like to perform garbage collection. If the garbage collection procedure proposed for General_LR0 is applied, the recognition graph may be contracted more than is desired for parsing. Specifically, transitions may be deleted from G_R whose parse annotations are part of the parse forest relevant to the prefix of the input string analyzed to that point. The marking phase of the garbage collection procedure must be modified accordingly to correct for this.
Consider the recognition graph just prior to performing garbage collection. Informally, we will refer to the states in G_R that are not deleted by our original garbage collection procedure as being essential to recognition. The states in G_R that are essential to parsing are defined inductively as follows.
(1) If p ∈ Q is essential to recognition, then p is essential to parsing.
(2) If p ∈ Q is essential to parsing and entry(p) = A for some A ∈ N, then for every transition (p, A, q, [π]) ∈ δ where [π] = [&π_1, &π_2, …, &π_m], m ≥ 1, let (r, X, s, [π_m]) ∈ δ be the rightmost transition referenced in [π]. Then r and all states reachable from r are essential to parsing.
The marking phase of the garbage collection procedure must be modified so as to mark all states in G_R that are essential to parsing. In order to accomplish this, certain branches of the parse forest must be traversed according to the inductive definition given above. The second step of the garbage collection procedure, that which deletes unmarked states and their outgoing transitions, remains unchanged.
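The two-phase shape of this procedure is ordinary mark-and-sweep reachability. The following Python sketch is illustrative only: it works on a bare directed graph rather than an annotated recognition graph, and the names and data layout are hypothetical.

```python
# Hedged sketch: mark every state reachable from the root states (standing in
# for the states essential to parsing), then sweep the rest together with
# their outgoing transitions.

def collect(edges, roots):
    marked, stack = set(), list(roots)
    while stack:                              # marking phase
        p = stack.pop()
        if p not in marked:
            marked.add(p)
            stack.extend(edges.get(p, ()))
    kept = {p: ts for p, ts in edges.items() if p in marked}
    return marked, kept                       # sweep: everything else is gone

edges = {1: [2], 2: [3], 4: [3]}
marked, kept = collect(edges, {1})
assert marked == {1, 2, 3}
assert 4 not in kept and 1 in kept
```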
Discussion
The General_LR0 recognizer was extended into a general context-free parser. The parse forest constructed by General_LR0' is represented by attaching appropriate parse annotations to the transitions of G_R. In effect, the parse forest is superimposed on the recognition graph.
A state-set closure function, informally called S_Closure, completes the construction of a set of Earley states. That is, for 0 ≤ i ≤ n+1,

    S_i = S_Closure(basis(S_i))    if 0 ≤ i ≤ n
    S_i = basis(S_i)               if i = n+1

Since S_{n+1} = basis(S_{n+1}), there is no need to apply S_Closure to basis(S_{n+1}).
For 0 ≤ i ≤ n, S_Closure(basis(S_i)) computes the smallest set S_i which satisfies the following three rules.
(1) Every state in basis(S_i) is in S_i.
(2) If [A→α·Bβ, j] is in S_i, then for all B→ω ∈ P, [B→·ω, i] is in S_i.
(3) If [B→ω·, j] is in S_i, then for all [A→α·Bβ, k] in S_j, [A→αB·β, k] is in S_i.
The states added to S_i by rules (2) and (3) above correspond to the states that are spawned by the Earley Predictor and Completer functions, respectively. Thus, S_Closure embodies both of these functions. The number of states added to S_i during its closure is finite; after all possible states are added, we say that S_i is closed.
Figure 5.1 presents Earley's general context-free recognizer in terms of the notation defined above. A Scanner function is assumed which computes basis(S_{i+1}) from S_i and a_{i+1}, 0 ≤ i ≤ n.

function Earley(G = (V, T, P, S); w ∈ T*)
  // w = a_1a_2⋯a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_{n+1} = $.
  basis(S_0) := {[S'→·S$, 0]}
  for i := 0 to n do
    S_i := S_Closure(basis(S_i))
    basis(S_{i+1}) := Scanner(S_i, a_{i+1})
    if basis(S_{i+1}) = ∅ then Reject(w) fi
  od
  S_{n+1} := basis(S_{n+1})
  Accept(w)
end
Figure 5.1 Earley's General Recognizer
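A direct rendering of Figure 5.1 in Python is given below. This sketch is not from the dissertation: grammars are dictionaries over single-character symbols (uppercase characters are nonterminals), an Earley state is the tuple (lhs, rhs, dot, j), S_Closure iterates rules (1) to (3) to a fixpoint, and the Scanner is inlined. The grammar used in the assertions is a hypothetical example.

```python
def s_closure(productions, sets, states, i):
    """Smallest superset of `states` closed under rules (2) and (3) above."""
    closed, changed = set(states), True
    while changed:
        changed = False
        for (A, rhs, d, j) in list(closed):
            new = set()
            if d < len(rhs) and rhs[d].isupper():          # rule (2): Predictor
                new |= {(rhs[d], r2, 0, i) for r2 in productions.get(rhs[d], [])}
            if d == len(rhs):                              # rule (3): Completer
                src = sets[j] if j != i else closed
                new |= {(B, r2, d2 + 1, k) for (B, r2, d2, k) in src
                        if d2 < len(r2) and r2[d2] == A}
            if not new <= closed:
                closed |= new
                changed = True
    return closed

def earley_recognize(productions, start, w):
    """Figure 5.1: w must already end in the endmarker '$'."""
    w = " " + w                      # shift to 1-based indexing: w = a_1 .. a_{n+1}
    n = len(w) - 2
    basis = {0: {("'", start + "$", 0, 0)}}                # [S' -> .S$, 0]
    sets = {}
    for i in range(n + 1):
        sets[i] = s_closure(productions, sets, basis[i], i)
        a = w[i + 1]                                       # Scanner on a_{i+1}
        basis[i + 1] = {(A, rhs, d + 1, j) for (A, rhs, d, j) in sets[i]
                        if d < len(rhs) and rhs[d] == a}
        if not basis[i + 1]:
            return False                                   # Reject(w)
    sets[n + 1] = basis[n + 1]
    return True                                            # Accept(w)

G = {"S": ["aSb", "ab"]}
assert earley_recognize(G, "S", "ab$")
assert earley_recognize(G, "S", "aabb$")
assert not earley_recognize(G, "S", "aab$")
assert not earley_recognize(G, "S", "ba$")
```

The fixpoint formulation of s_closure deliberately re-runs the Completer until nothing changes, which keeps the sketch correct even when a completed item and the items it completes land in the same set S_i.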
operation. By performing reductions according to the inverse of the ⇒_rm relation instead, a canonical left-to-right order is imposed on the parse tree construction process.
However, an alternative to the inverse of the ⇒_rm relation is provided by inverses of the ⇒_R and ⊢ relations. The inverse of ⇒_R is used to represent reversed strong rightmost derivations. The inverse of ⊢ introduces terminal symbols at the right end of strings. These two inverse relations cooperate to mimic reversed rightmost derivations.
Reversed Rightmost Derivations
The reduce relation (⊨) is the inverse of the R-derives relation, i.e., ⇒_R⁻¹ = ⊨; it is formally defined by ⊨ = {(αω, αA) | α ∈ V*, A→ω ∈ P}. The shift relation (⊣) is the inverse of the chop relation, i.e., ⊢⁻¹ = ⊣; thus, ⊣ = {(α, αa) | α ∈ V*, a ∈ T}. For each a ∈ T, ⊣_a denotes the subrelation of ⊣ with range V*a. More specifically, for α, β ∈ V* and a ∈ T, α ⊣_a β if and only if α ⊣ β and β = αa.
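The two relations can be tried out concretely. The Python sketch below is illustrative only (single-character symbols, uppercase = nonterminal) and the toy productions are hypothetical. Note that ⊨ reduces only a suffix of the string, which is why shift steps are needed to expose each handle at the right end.

```python
# Hedged sketch of the reduce (⊨) and shift (⊣_a) relations as functions from
# a string to the set of strings reachable in one step.

P = [("S", "AB"), ("A", "a"), ("B", "b")]   # toy productions A -> ω

def reduce_step(alpha):
    """⊨ = {(αω, αA) | α ∈ V*, A→ω ∈ P}: replace a suffix ω by its left side A."""
    return {alpha[: len(alpha) - len(w)] + A for A, w in P if alpha.endswith(w)}

def shift_step(alpha, a):
    """⊣_a: append the terminal a at the right end."""
    return {alpha + a}

# Recognizing "ab": ε ⊣_a a ⊨ A ⊣_b Ab ⊨ AB ⊨ S mirrors, in reverse, the
# rightmost derivation S ⇒ AB ⇒ Ab ⇒ ab of the toy grammar.
assert shift_step("", "a") == {"a"}
assert reduce_step("a") == {"A"}
assert reduce_step("Ab") == {"AB"}
assert reduce_step("AB") == {"S"}
```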
For the most part, the results in this chapter are obtained through simple manipulations of relational expressions. Two equalities on relational expressions that are regularly used in these transformations are recorded in the following.
Fact 4.1 Let R and S be binary relations on V*, i.e., R, S ⊆ V* × V*. Then the following two statements hold: (1) (R*)⁻¹ = (R⁻¹)*; (2) (R S)⁻¹ = S⁻¹ R⁻¹.
Some useful applications of Fact 4.1 include the following.
(1) (⇒*_R)⁻¹ = (⇒_R⁻¹)* = ⊨*
(2) (⇒*_R ⊢)⁻¹ = ⊢⁻¹ (⇒*_R)⁻¹ = ⊣ ⊨*
(3) ((⇒*_R ⊢)* ⇒*_R)⁻¹ = (⇒*_R)⁻¹ ((⇒*_R ⊢)*)⁻¹ = ⊨* (⊣ ⊨*)*
Despite the appearance of (⊣ ⊨*) in the last construct of both (2) and (3), the relation product (⊨* ⊣) is more appropriate to our needs. Indeed, since relation composition is associative, the following equivalence holds: ⊨*(⊣ ⊨*)* = (⊨* ⊣)* ⊨*.
The interpretation of the relation product (⊨* ⊣) is explicitly described as follows. For α, β ∈ V*, α(⊨* ⊣)β holds in G if and only if α ⊨* γ ⊣ γa = β holds in G for some γ ∈ V* and
CHAPTER V
ON EARLEY'S ALGORITHM
In this chapter, Earley's general context-free recognizer is examined and its relationship to the General_LR and General_LL recognition schemes is ascertained. In particular, a modified version of Earley's recognizer is presented which builds a state-transition graph in addition to the state sets that are constructed by Earley's original algorithm. Analyses of certain properties of the resulting STG reveal parallels between Earley's algorithm and the General_LR and General_LL recognizers. Throughout this chapter, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a_1a_2⋯a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n and a_{n+1} = $, are assumed.
Earley's General Recognizer
Recall that A→α·β is an item of G whenever there is a production of the form A→αβ in P. The bracketed pair [A→α·β, j], where A→α·β is an item of G and j is a natural number, is called an Earley state of G (or state, for short). Earley's algorithm, in recognizing w with respect to G, constructs a sequence of sets of Earley states S_i, 0 ≤ i ≤ n+1. The sets are constructed in order of increasing i beginning with S_0. Thus, set S_i is constructed only after all sets S_j with 0 ≤ j < i are in place.
Each S_i is initialized to a finite set of states which we denote by basis(S_i). For 0 ≤ i ≤ n+1,

    basis(S_i) = {[S'→·S$, 0]}                                  if i = 0
    basis(S_i) = {[A→αa_i·β, j] | [A→α·a_iβ, j] ∈ S_{i-1}}      if 1 ≤ i ≤ n+1

The lone state in basis(S_0), [S'→·S$, 0], is called the initial state; it will be denoted by s_0. For i > 0, basis(S_i) is constructed by the Earley Scanner function.
(48) Each state in Q_subset is considered in turn. Additional states may be added to Q_subset within the loop.
(49) A state q is removed from Q_subset.
(50) All transitions from ψ(q) in M_C that are made on some nullable nonterminal A are relevant. Let goto(ψ(q), A) = I_j be one such transition.
(51-55) We need a state q_{j:i} in Q_i and a transition (q_{j:i}, A, q) in δ_i. This state may already exist in Q_i, so it is conditionally created. If q_{j:i} is indeed new, it is added to Q_subset; the traversal will resume from q_{j:i} when it is removed from Q_subset in a later iteration of the while loop. However, the transition (q_{j:i}, A, q) is never generated redundantly; the discipline imposed by the graph traversal ensures that the transitions from each state encountered are considered at most once.
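The bookkeeping in steps (48)-(55) is, in essence, a worklist graph traversal with conditional state creation. The generic sketch below is ours, not the dissertation's Traverse function: the `expand` callback stands in for "the transitions on nullable nonterminals", and transitions here are oriented from the dequeued state toward its targets, whereas the text's transitions (q_{j:i}, A, q) run toward the current state.

```python
# Generic worklist traversal illustrating the discipline described in
# steps (48)-(55): each state enters the state set at most once, so the
# transitions out of each state are examined at most once.

def traverse(seed_states, expand):
    """expand(q) yields (target, label) pairs: transitions to add from q."""
    states = set(seed_states)
    transitions = set()
    worklist = list(seed_states)
    while worklist:
        q = worklist.pop()                       # a state is removed (cf. 49)
        for target, label in expand(q):
            transitions.add((q, label, target))  # a set: never added twice
            if target not in states:             # conditional creation (cf. 51-55)
                states.add(target)
                worklist.append(target)          # traversal resumes from it later
    return states, transitions
```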
If the two calls to Traverse are removed from the Reduce function and the line "9a. Traverse(Qᵢ, i)" is added to General_LR(0) following line 9, an equivalent transformation of G_R results, i.e., one that satisfies the condition stated in line 10. In this way, Traverse becomes a postprocessor of Reduce. However, for the purposes of parsing it is more appropriate to call Traverse from within Reduce as we have done in Figure 6.1. This will become evident in the next chapter when General_LR(0) is extended into a general parser.
That General_LR(0) correctly implements the General_LR recognition scheme may be established by induction on i. This induction depends, in turn, on proving that the Reduce (resp. Shift) function correctly transforms G_R such that the postcondition in line 10 (resp. line 12) holds if the precondition in line 8 (resp. line 10) holds before the function is called. Although the Shift and Reduce functions are not formally proven correct, it is expected that the above detailed explanation of General_LR(0) provides sufficient intuitive evidence toward that end.
TABLE OF CONTENTS
Page
ACKNOWLEDGMENTS iii
LIST OF FIGURES vi
ABSTRACT vii
CHAPTER
I INTRODUCTION 1
Overview 1
Literature Review 4
Outline in Brief 7
II NOTATION AND TERMINOLOGY 8
Elements of Formal Language Theory 8
Context-Free Grammars and Languages 9
State-Transition Graphs and Finite-State Automata 11
III GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK 13
Recognition Based on Derivations 13
Top-Down Right-to-Left Recognition 15
Top-Down Left-to-Right Recognition 27
Discussion 35
IV GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK 37
Bottom-Up Left-to-Right Recognition 37
Discussion 46
V ON EARLEY'S ALGORITHM 48
Earley's General Recognizer 48
A Modified Earley Recognizer 51
Earley's Algorithm and Viable Prefixes 53
Earley's Algorithm and Viable Suffixes 56
Discussion 59
VI A GENERAL BOTTOM-UP RECOGNIZER 60
Control Automata and Recognition Graphs 60
The General_LR(0) Recognizer 62
Earley's Algorithm Revisited 71
Implementation Considerations 75
The Complexity of Recognition 81
On Garbage Collection and Lookahead 84
Discussion 87
VII A GENERAL BOTTOM-UP PARSER 91
From Recognition to Parsing 91
The General_LR(0) Parser 97
The Complexity of Parsing 102
Garbage Collection Revisited 103
Discussion 104
VIII CONCLUSION 107
Summary of Main Results 107
Directions for Future Research 109
REFERENCES 111
BIOGRAPHICAL SKETCH 114
LIST OF FIGURES
Page
FIGURE
3.1 A General Top-Down Correct-Suffix Recognizer 24
3.2 A General Top-Down Correct-Prefix Recognizer 33
4.1 A General Bottom-Up Correct-Prefix Recognizer 43
5.1 Earley's General Recognizer 49
5.2 A Modified Earley Recognizer 52
5.3 The Definition of the State Derivative of a Path 56
6.1 The General_LR(0) Recognizer 63
6.2 The General_NLR(0) Recognizer 73
6.3 A Modified Reduce Function 80
7.1 The General_LR(0) Parser 97
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
GENERAL CONTEXTFREE RECOGNITION AND PARSING
BASED ON VIABLE PREFIXES
By
D. Clay Wilson
May 1990
Chairman: Dr. Manuel E. Bermudez
Major Department: Computer and Information Sciences
Viable prefixes play an important role in LR parsing theory. In the work presented here, viable prefixes have a commensurately central role in a theory of general context-free recognition and parsing.
A set-theoretic framework for describing general context-free recognition is presented. The operators and operands in the framework are regularity-preserving relations and regular sets of viable prefixes, respectively. A basic operation consists of computing the image of a regular set of viable prefixes under one of the relations. By extension, general recognition is characterized in terms of computing a sequence of regular sets.
For implementation purposes, finite-state automata are used to represent the regular sets. A general bottom-up recognizer that constructs an appropriate sequence of automata is described in detail. The regular languages accepted by these automata correspond to the sets of viable prefixes computed by the recognizer's set-theoretic counterpart. The automata are constructed under the guidance of a control automaton which accepts the viable prefixes of the subject grammar. Ultimately, the automata-based recognizer is extended to a truly general bottom-up parser.
Earley's algorithm is analyzed in the context of our viable prefix-based framework, as it provides a convenient vehicle for illustrating some of our ideas. We describe how Earley's algorithm implicitly tracks the sets of viable prefixes that arise in our model. Moreover, by modifying Earley's recognizer to construct a certain directed graph, the representation of these sets is made explicit.
Our set-theoretic framework yields elegant and succinct characterizations of general context-free recognition that appear to capture the essence of the task. On the practical front, a general bottom-up parser is described in sufficient detail to be readily implemented. Although its practical potential is not evaluated here, the parser is intended for use in problem areas that require more flexible parsers than are provided within the efficient but restricted LR framework. Regardless, our viable prefix-based treatment of recognition and parsing provides a particularly appropriate framework within which the continuum between LR parsers and our general parsers may be further investigated.
CHAPTER I
INTRODUCTION
Context-free recognition is the algorithmic process by which the membership of a string x within a context-free language L is decided. This involves determining whether x is derived by some context-free grammar G where L = L(G). Parsing is the process of ascertaining the syntactic structure imparted to x by G.
From a theoretical standpoint, context-free recognition and parsing hold considerable interest in their own right. Yet context-free grammars and their recognizers and parsers have substantial practical value as well. Most notably, results from parsing theory have proven indispensable to the implementation of programming languages. Other areas of application include natural language processing [34], syntactic pattern recognition [18], and code generation in compilers [10].
Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of
G, a general recognizer (resp. parser) recognizes (resp. parses) x with respect to G. The
work presented here contributes to the area of general context-free recognition and parsing.
The following section provides some motivation and a brief overview of this dissertation.
Overview
The LR parsers, namely those parsers that effect a Left-to-right scan of the input while producing a Right parse, define the most powerful class of deterministic parsers. Earley's algorithm, on the other hand, is arguably the most efficient general parser. Despite the fact that LR parsers are restricted to LR(k) grammars whereas Earley's algorithm can parse strings against any context-free grammar, there are close parallels between the two.
Both LR parsers and Earley's algorithm are based on items. Each state of an LR parser corresponds to a set of LR items. Earley's algorithm constructs a sequence of state sets during recognition. The states manipulated by Earley's algorithm, call them Earley states, are slightly elaborated LR items.
Earley's algorithm and LR parsers scan the input string from left to right, recognizing an incrementally longer prefix of it in the process. That is, they are correct-prefix recognizers.
Both LR parsers and Earley's algorithm work in a bottom-up fashion. An LR parser determines the reversed rightmost derivation of an input string. In contrast, Earley's algorithm has the capability of producing all of the reversed rightmost derivations of an input string.
The relationship between Earley's algorithm and LR parsers can be described on a more fundamental level in terms of viable prefixes. Viable prefixes are certain prefixes of right sentential forms. At each point during a parse, the contents of an LR parser's stack implicitly represent a viable prefix which derives the portion of the input string parsed to that point. We let VP(G) denote the set of viable prefixes of a grammar G. In addition, let VP(G, x) denote the set of those viable prefixes of G which derive x, a string over the terminal alphabet of G.
Turning now to Earley's algorithm, consider a point in a parse at which some prefix x of the input string has been processed. The sequence of Earley state sets constructed up to that point encapsulates the strings in VP(G, x). The manner in which VP(G, x) is normally represented in the state sets is rather indirect. However, this representation can be made explicit through a variant of Earley's algorithm which constructs a directed graph whose vertices are the states generated by the original algorithm. Under an appropriate interpretation, this graph yields a finite-state automaton which accepts VP(G, x). Details of this proposed graphical variant of Earley's algorithm are supplied later.
Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of G, VP(G, x) is a regular language. This fact can be established analytically. Alternatively, the graphical variant of Earley's algorithm mentioned above provides a constructive proof of this result.
In light of these observations, the primary thrust of this work is on the formal development of an approach to general context-free recognition and parsing that is based on explicitly computing VP(G, x). In particular, the viable prefix is the central concept upon which useful general recognizers and parsers are founded. The development is rigorous, yet we strive for clarity and elegance by resorting to basic principles wherever possible. In short, our approach to general recognition and parsing generalizes the role played by viable prefixes in LR parsers in order to accommodate arbitrary grammars.
This work consists of three logical divisions. In the first (Chapters III and IV), the mathematical foundation for our viable prefix-based approach to recognition and parsing is developed. The basic tools are a handful of binary relations on strings. General recognition is described using these relations and simple set-theoretic concepts. A key property of the relations is that they preserve regularity. Consequently, general top-down and bottom-up recognition schemes are defined in terms of computing the images of regular sets of viable prefixes under these relations. In short, general recognition is reduced to computing a sequence of regular sets.
In the second major division (Chapter V), Earley's algorithm is used as a vehicle for demonstrating the efficacy of our set-theoretic approach to general recognition. In particular, the graph-based variant of Earley's algorithm is presented there. This modified algorithm illustrates one way in which VP(G, x) may be computed for successively longer prefixes x of the input string. In the process of analyzing our Earley derivative, some subtle properties of Earley's original algorithm are also revealed and its relationship with LR parsers is clarified.
The last part of this work (Chapters VI and VII) casts our approach to general recognition and parsing into an automata-theoretic framework. First, a general recognizer is described in considerable detail. The recognizer uses an automaton which accepts VP(G) to guide the construction of an automaton that accepts VP(G, x), where x is some prefix of the input string. For convenience, the description of the algorithm employs the LR(0) automaton of G as the guiding automaton. However, the algorithm allows for a rather broad range of VP(G)-accepting automata to be used instead. For example, employing the nondeterministic LR(0) automaton of G as a controlling automaton yields a general recognizer which works quite similarly to our graph-based Earley algorithm. Finally, this automata-based recognizer is extended to a general parser. Means for representing parse forests and handling ambiguity are described. The recognizer and parser are presented in enough detail to be readily implemented. In anticipation of this, many practical issues are discussed.
Literature Review
A comprehensive introduction to formal languages and automata is presented by Hopcroft and Ullman [24]. These two related disciplines are prerequisites to a study of context-free recognition and parsing. An up-to-date monograph on parsing theory has been written by Sippu and Soisalon-Soininen [39]. Two volumes by Aho and Ullman [6,7] contain a wealth of information; numerous parsing algorithms are presented, both general and restricted, along with much of the theory underlying them.
Some early general parsing algorithms are compared by Griffiths and Petrick [22]. All of the algorithms surveyed rely on backtracking, so they run in O(cⁿ) time in the worst case (n is the length of the input string).
Although it is restricted to Chomsky Normal Form grammars, the Cocke-Younger-Kasami algorithm [6,19,46] is regarded as the first general parser to run in polynomial time (O(n³)). The n×n parse matrix that the algorithm constructs accounts for an O(n²) space complexity. Recall that the matrix entries are filled with sets of nonterminal symbols.
A version of the Cocke-Younger-Kasami algorithm that is restricted to unambiguous grammars is presented by Kasami and Torii [25]. The time and space bounds of this algorithm are both O(n² log n). Another version, which employs linked lists in place of the parse matrix, is described by Manacher [32]. This alternate storage discipline allows unambiguous grammars to be recognized in quadratic time, a marked improvement over the corresponding cubic bound of the original algorithm.
The Cocke-Younger-Kasami algorithm was reduced to matrix multiplication by Valiant [44]. Using this result, Strassen's technique for multiplying matrices [1] is applied to obtain an asymptotic worst-case time complexity of O(n^2.81) for general recognition.¹ Due to the overhead associated with this method, it is primarily of theoretical interest.
In contrast to the Cocke-Younger-Kasami algorithm, Earley's algorithm [6,13,14] can process any grammar. Like LR parsers, Earley's algorithm is based on sets of items. Although its worst-case time and space bounds are also O(n³) and O(n²), respectively, it performs significantly better on large classes of grammars. Specifically, unambiguous grammars are parsed in O(n²) time, and only O(n) time is needed to parse LR(k) grammars, provided that k-symbol lookahead is used in the latter case. Earley's algorithm is examined further in later chapters.
Efficiency improvements that may be gained by employing LL- and LR-like lookahead² in Earley's algorithm are reported by Bouckaert et al. [9]. They concluded that FIRST sets are more useful than FOLLOW sets for reducing the number of superfluous items generated during recognition. In short, FIRST (resp. FOLLOW) information reduces the number of items generated by Earley's Predictor (resp. Completer) operation. See Christopher et al. [10] for an example of an application of Earley's algorithm; specifically, it is used to generate optimized code in a Graham-Glanville style code generator [17]. If desired, Earley's algorithm may be extended to include error recovery [3,31].
1 Even faster techniques for matrix multiplication have been developed since.
2 That is, FIRST and FOLLOW sets, respectively.
An algorithm that is a hybrid of the Cocke-Younger-Kasami and Earley algorithms is described by Graham et al. [19,20]. This algorithm also accommodates arbitrary grammars. Like the Cocke-Younger-Kasami algorithm, it constructs an n×n parse matrix. However, the matrix positions are filled with sets of LR items instead of sets of nonterminals. Practical issues are discussed in detail and claims are made that more efficient implementations are attainable than are allowed by Earley's algorithm. Subcubic versions based on matrix multiplication techniques are also described.
The class of LR(k) grammars was introduced by Knuth in the seminal paper on LR parsing theory [27]. Knuth described a method for constructing a deterministic parser for an LR(k) grammar, observed that the set of viable prefixes of an arbitrary grammar is a regular language, and proved that it is undecidable whether an arbitrary grammar is LR(k) for some k ≥ 0. The discovery of LR(k) grammars was quite significant in light of their relationship to the deterministic context-free languages [16].
Knuth's technique for parser construction is generally deemed impractical due to the enormous number of parse states that can result. The SLR(k) [12] and LALR(k) [11,29] grammars define two important subclasses of the LR(k) grammars which allow this problem to be addressed satisfactorily. Relatively compact LR parsers for grammars in these subclasses can be constructed efficiently.
Tomita's algorithm [42,43] extends the conventional LR parsing algorithm to use parse tables that contain multiply-defined entries. Conflicting parse actions are handled by employing a graph-structured stack to keep track of the different parse histories. However, some grammars cause the stack to grow without bound in instances where no input is consumed, so the algorithm is not general. Tomita's algorithm is discussed in greater detail later.
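The node-merging idea behind the graph-structured stack can be sketched as follows. This fragment is ours and shows only one mechanism: a push of a parse state that already exists at the current input position is merged into one shared node with multiple predecessor edges, rather than duplicating the stack top. It is not a full parser.

```python
# Minimal sketch of graph-structured-stack node sharing. Nodes carry a
# parser state and a set of predecessor edges; `frontier` maps each
# state present at the current input position to its (unique) node.

class GSSNode:
    def __init__(self, state):
        self.state = state
        self.preds = set()   # predecessor nodes (edges into the stack)

def push(frontier, state, pred):
    """Push `state` with predecessor `pred`; merge if it already exists."""
    node = frontier.get(state)
    if node is None:                 # first push of this state here
        node = frontier[state] = GSSNode(state)
    node.preds.add(pred)             # a repeated push just adds an edge
    return node
```

Sharing stack tops this way is what keeps the number of nodes per input position bounded by the number of parse states, even when conflicting actions fork the parse.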
The application of Tomita's algorithm to a system which supports the incremental generation of parsers is reported by Heering et al. [23]. Specifically, Tomita's algorithm is adapted to work with an incrementally generated LR(0) automaton. The states of the automaton are created based on need. Moreover, the system accommodates extensible grammars whereby changes in the grammar during parsing produce corresponding changes in the relevant portions of the automaton.
Work which is similar in spirit to ours is that of Mayer [33]; deterministic canonical bottom-up parsing is examined in terms of reduction classes, where a reduction class is a pair of strings, the first and second components of which represent the left- and right-contexts, respectively, of parsing actions. Conditions are imposed on these reduction classes which ensure determinism, termination, and correctness. In short, the cited paper presents a framework for describing deterministic canonical bottom-up parsers, whereas our aim is a framework for characterizing general recognition and parsing.
Outline in Brief
This introductory chapter ends with a very short synopsis of the remaining chapters. The next chapter reviews some basic definitions and terminology. Chapters III through VII comprise the main body of this dissertation. Concluding remarks are made in Chapter VIII.
Chapters III and IV develop the mathematical foundation for this work. Set-theoretic characterizations of general top-down recognition and general bottom-up recognition are presented in those two chapters.
Earley's algorithm is the subject of the fifth chapter. In particular, our graphical variant of Earley's algorithm is presented there.
A general automata-based bottom-up recognizer is described in detail in Chapter VI. Chapter VII extends this recognizer into a general parser.
The major results of this dissertation are summarized in Chapter VIII. In addition, directions for future research, of which there are several, are delineated in that final chapter.
CHAPTER II
NOTATION AND TERMINOLOGY
This chapter summarizes some of the elementary formal aspects of this work, viz., assorted mathematical notation and definitions. In particular, some basic concepts of formal languages, directed graphs, and finite-state automata are reviewed. A more comprehensive presentation of the relevant theory can be found in the monograph by Sippu and Soisalon-Soininen [39].
Elements of Formal Language Theory
An alphabet, denoted in this section by Σ, is a finite set of symbols. A string over Σ is a finite sequence of elements from Σ; the null string corresponds to the empty sequence and is denoted by ε. A (formal) language over Σ is a set of strings over Σ; the set of all strings over Σ is denoted by Σ* and Σ⁺ = Σ*\{ε}.
The length of a string is the number of symbols that it contains. The length of a string x ∈ Σ* is denoted by len(x), where len is defined recursively as follows: len(ε) = 0; ∀a ∈ Σ, len(a) = 1; ∀x, y ∈ Σ*, len(xy) = len(x) + len(y).
The previous definition used the notion of string concatenation, viz., xy. Concatenation is generalized to apply to languages as follows. Given two languages L and L′ and a string x, LL′ = {yz | y ∈ L, z ∈ L′}, xL = {x}L, and Lx = L{x}. The identity and zero of concatenation are ε and ∅ (the empty set), respectively. Thus, with x denoting either a string or a language, xε = εx = x and x∅ = ∅x = ∅.
Let L be a language and i a natural number. The ith power of L, Lⁱ, is defined recursively by L⁰ = {ε} and Lⁱ⁺¹ = LLⁱ. The positive closure of L and the Kleene closure of L are defined by L⁺ = ∪_{i>0} Lⁱ and L* = ∪_{i≥0} Lⁱ = L⁺ ∪ {ε}, respectively.
Let x, y, and z be arbitrary strings over Σ and let w = xyz. Then x is a prefix of w, y is a substring of w, and z is a suffix of w. If 0 ≤ len(x) < len(w) holds, then x is a proper prefix of w; similarly, if 0 ≤ len(z) < len(w) holds, then z is a proper suffix of w. We define PREFIX(x) = {y ∈ Σ* | x = yz for some z ∈ Σ*} and SUFFIX(x) = {z ∈ Σ* | x = yz for some y ∈ Σ*}. If k is a natural number, then k:x (resp. x:k) denotes the unique prefix (resp. suffix) of x of length min{len(x), k}. This notation is extended to languages as follows. For L ⊆ Σ*, PREFIX(L) = ∪_{x∈L} PREFIX(x), SUFFIX(L) = ∪_{x∈L} SUFFIX(x), k:L = {k:x | x ∈ L}, and L:k = {x:k | x ∈ L}.
The reversal of a string x ∈ Σ*, denoted by xᴿ, is defined recursively as follows: εᴿ = ε; ∀a ∈ Σ, aᴿ = a; ∀x, y ∈ Σ*, (xy)ᴿ = yᴿxᴿ. Similarly, the reversal of a language L is defined by Lᴿ = {xᴿ | x ∈ L}.
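The PREFIX, SUFFIX, and k:x / x:k notations transcribe directly into Python over strings; the function names below are ours.

```python
# Direct transcriptions of the string notations defined above.

def prefixes(x):
    """PREFIX(x): all prefixes of x, including the null string and x."""
    return {x[:i] for i in range(len(x) + 1)}

def suffixes(x):
    """SUFFIX(x): all suffixes of x, including the null string and x."""
    return {x[i:] for i in range(len(x) + 1)}

def head(k, x):
    """k:x, the unique prefix of x of length min(len(x), k)."""
    return x[:k]

def tail(x, k):
    """x:k, the unique suffix of x of length min(len(x), k)."""
    return x[max(len(x) - k, 0):]
```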
Context-Free Grammars and Languages
A (context-free) grammar is denoted by G = (V, T, P, S) where V is an alphabet known as the vocabulary of G, T ⊆ V and N = V\T are the terminal and nonterminal alphabets, respectively, P ⊆ N×V* is the finite set of productions, and S ∈ N is the start symbol. The following conventions are generally adhered to: a, b, c, t ∈ T; w, x, y, z ∈ T*; A, B, C, S ∈ N; X, Y, Z ∈ V. In addition, lowercase Greek letters denote strings in V*. An arbitrary grammar G is assumed throughout the rest of this section.
A production (A, ω) ∈ P is written A → ω; A and ω are the left-hand side and the right-hand side of the production, respectively. A group of productions that share the same left-hand side, viz., A → ω₁, A → ω₂, ..., A → ωₙ, n ≥ 1, may be abbreviated as A → ω₁ | ω₂ | ··· | ωₙ. A production with a right-hand side of ε is called a null production or ε-production.
It is common to specify a grammar by listing only its productions. In this case, the left-hand side of the first production or production group in the list is taken to be the start symbol. The nonterminal and terminal alphabets can be inferred from the productions.
If A → ω is a production in P, then A → α·β is an item of G for each α and β such that ω = αβ. The size of G is defined as |G| = Σ{len(Aω) | A → ω ∈ P}. Note that the size of G is equivalent to |{A → α·β | A → α·β is an item of G}|. The reversal of G is the grammar Gᴿ = (V, T, Pᴿ, S) where Pᴿ = {A → ωᴿ | A → ω ∈ P}.
The derives relation (⇒), a binary relation induced on V* by P, is defined formally by ⇒ = {(αAβ, αωβ) | α, β ∈ V*, A → ω ∈ P}. A string γ ∈ V* such that S ⇒* γ holds¹ in G is called a sentential form of G; the set of the sentential forms of G is denoted by SF(G). The (context-free) language that is generated by G is defined by L(G) = SF(G) ∩ T*. Each member of L(G) is called a sentence of G. We use PREFIX(G) and SUFFIX(G) as abbreviations for PREFIX(L(G)) and SUFFIX(L(G)), respectively.
For A ∈ N and X ∈ V, if A ⇒⁺ αXβ holds in G for some α, β ∈ V*, then X is reachable from A. A symbol X ∈ V is nullable if X ⇒* ε holds in G. A string γ ∈ V* is nullable if every symbol in γ is nullable. In particular, ε is trivially nullable.
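The set of nullable symbols is computable by the standard least-fixpoint iteration over the productions. The sketch below, with its (lhs, rhs) pair encoding, is an illustration rather than anything prescribed by the text.

```python
# Least-fixpoint computation of the nullable symbols of a grammar.
# productions: iterable of (lhs, rhs) pairs, rhs a tuple of symbols;
# an epsilon-production has rhs == ().

def nullable_symbols(productions):
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            # a string is nullable iff every symbol in it is nullable,
            # which holds vacuously for the empty right-hand side
            if lhs not in nullable and all(X in nullable for X in rhs):
                nullable.add(lhs)
                changed = True
    return nullable
```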
A symbol X ∈ V is useful if either X = S or S ⇒* αXβ ⇒* w holds in G for some α, β ∈ V* and w ∈ T*; otherwise, X is useless. A grammar is reduced if every symbol in its vocabulary is useful. An arbitrary grammar G can be transformed into an equivalent reduced grammar² in O(|G|) time [39]. In light of this result and for the convenience that it provides, all grammars are assumed to be reduced throughout this work.
A grammar G is $-augmented if, for distinguished symbols S′ and $, P contains a production of the form S′ → S$ where S′ ∈ V is the (new) start symbol and $ ∈ T is a sentence endmarker. Moreover, S′ → S$ is the only production in which S′ and $ occur. Whenever we are working with a $-augmented grammar, all input strings are assumed to end with $.
1 The transitive (resp. reflexive-transitive) closure of a binary relation is denoted by superscript + (resp. *).
2 For our purposes, two grammars are equivalent if they generate the same language.
State-Transition Graphs and Finite-State Automata
A state-transition graph (STG) is denoted by G = (Q, Σ, δ) where Q is a finite set of states, Σ is an alphabet, and δ ⊆ (Q × (Σ ∪ {ε})) × Q is the transition relation.³ Thus, an STG differs from a finite-state automaton only in that it does not have a start state or a set of final states designated for it. A member ((p, a), q) ∈ δ is read as a transition from p to q on a; p is the source of the transition and q is the target. A member ((p, a), q) ∈ δ is also written as (p, a, q) ∈ δ or q ∈ δ(p, a); the latter may be written as q = δ(p, a) if (p, a, q), (p, a, r) ∈ δ implies that q = r. A transition on ε is known as an ε-transition. An STG is ε-free if it has no ε-transitions. For the remainder of this section we assume an arbitrary STG G = (Q, Σ, δ).
The following property holds for all STGs that arise in this work. If (p, a, q), (p, b, r) ∈ δ and a ≠ b, then q ≠ r; in words, distinct transitions which share the same source state access distinct target states. Thus, for any pair of states p, q ∈ Q, there is at most one transition from p to q.
A path in G and the string over Σ that it spells are defined inductively as follows. For each state q ∈ Q, (q) denotes a path in G from q to q spelling ε; for m ≥ 1 and q₀, q₁, ..., qₘ ∈ Q, if (q₀, ..., qₘ₋₁) denotes a path in G from q₀ to qₘ₋₁ spelling x and (qₘ₋₁, a, qₘ) ∈ δ, then (q₀, q₁, ..., qₘ) denotes a path in G from q₀ to qₘ spelling xa. The length of a path is the number of transitions that it contains. A state q is reachable from a state p if and only if there exists a path in G from p to q.
The succ function, succ: Q × Σ* → 2^Q, is defined by succ(p, x) = {q ∈ Q | there exists a path in G from p to q spelling x}. Extending this function to R ⊆ Q, succ(R, x) = ∪_{q∈R} succ(q, x). The pred function, pred: Q × Σ* → 2^Q, is defined in terms of succ by pred(q, x) = {p ∈ Q | q ∈ succ(p, x)} and is similarly extended to subsets of Q.
A subscript is given to G later to differentiate it from a grammar.
The inverse of G is denoted by G⁻¹ = (Q, Σ, δ⁻¹) where (p, a, q) ∈ δ⁻¹ if and only if (q, a, p) ∈ δ, i.e., the transitions of G are reversed in G⁻¹.
A finite-state automaton (FSA) is denoted by M = (G, q₀, F) = (Q, Σ, δ, q₀, F) where G = (Q, Σ, δ) is an STG, q₀ ∈ Q is the start state, and F ⊆ Q is the set of final states. Each state in Q is assumed to be reachable from q₀. If G is ε-free, then M is also ε-free. If M is ε-free and (p, a, q), (p, a, r) ∈ δ implies that q = r, then M is deterministic. An arbitrary (resp. deterministic) FSA is called an NFA (resp. DFA). The (regular) language accepted by M is defined by L(M) = {x ∈ Σ* | succ(q₀, x) ∩ F ≠ ∅}. A state q ∈ Q is dead if no final state is reachable from it.
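For an ε-free STG, the succ and pred functions, and hence FSA acceptance, follow directly from the definitions above. The sketch below is ours, with δ encoded as a set of (p, a, q) triples.

```python
# succ, pred, and acceptance for an epsilon-free STG whose transition
# relation delta is a set of (source, symbol, target) triples.

def succ(delta, states, x):
    """All states reachable from `states` along paths spelling x."""
    current = set(states)
    for a in x:
        current = {q for (p, b, q) in delta if p in current and b == a}
    return current

def pred(delta, states, x):
    """pred via succ on the inverse STG, over the reversed string."""
    inverse = {(q, a, p) for (p, a, q) in delta}
    return succ(inverse, states, x[::-1])

def accepts(delta, q0, finals, x):
    """L(M) membership: succ(q0, x) meets the final-state set."""
    return bool(succ(delta, {q0}, x) & set(finals))
```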
CHAPTER III
GENERAL TOP-DOWN RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general top-down recognition is developed in this chapter. Two contrasting top-down recognition schemes are presented; they are distinguished by the direction in which the input string is scanned, viz., right-to-left or left-to-right. Since the two schemes turn out to be mirror images, one is derived in terms of the other. Our approach to general recognition is based on certain regularity properties of context-free grammars. Consequently, the framework is designed to highlight these properties.
The primary purpose of this chapter is to catalog some formal aspects of general top-down recognition. An investigation of the practical utility of the two general top-down recognition schemes is left for future work. However, the theoretical development contained herein is invaluable toward deriving a practical, truly general, bottom-up parser; that is the thrust of the remaining chapters. An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Recognition Based on Derivations
In a top-down approach to recognition, an attempt is made to construct a parse tree for an input string, perhaps implicitly, by starting at the root and progressing toward the leaves. The downward growth of an incomplete parse tree occurs at the frontier of the tree, which may be represented by the string of grammar symbols which label its nodes. A basic step in constructing the parse tree involves applying the ⇒ relation to this linearized form of the frontier. However, the derives relation is too undisciplined, in general, for describing top-down recognition in a useful fashion, since there is no indication of which nonterminal symbol to replace at each step. Instead, rightmost and leftmost derivations are preferred for the additional constraints that they place on the parse tree construction process.
Since rightmost and leftmost derivations are defined in terms of subrelations of the ⇒ relation, they also construct parse trees top-down. In addition, they impose a canonical¹ order on the construction of parse trees. Specifically, rightmost derivations construct parse trees from right to left, whereas leftmost derivations construct them from left to right. Some basic notions about rightmost and leftmost derivations are briefly reviewed next.
Rightmost and leftmost derivations are based on the rderives (⇒ᵣ) and lderives (⇒ₗ) relations, respectively. These relations are formally defined by ⇒ᵣ = {(αAz, αωz) | α ∈ V*, A → ω ∈ P, z ∈ T*} and ⇒ₗ = {(xAβ, xωβ) | x ∈ T*, A → ω ∈ P, β ∈ V*}. Rightmost derivations (resp. leftmost derivations) are defined in terms of the reflexive-transitive closure of ⇒ᵣ (resp. ⇒ₗ) in the usual fashion.
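A one-step rendering of the rderives relation may make the definition concrete: rewrite the rightmost nonterminal of a sentential form in every possible way. The encoding below is an illustrative convenience, not the text's setup; symbols are single characters, and uppercase letters play the role of nonterminals.

```python
# One step of the rderives relation: the set of strings obtainable by
# rewriting the rightmost nonterminal of `form`. Scanning from the
# right guarantees that everything to the right of the rewritten
# nonterminal is a terminal string, as the definition requires.

def rderive_step(productions, form):
    """productions: list of (A, omega) pairs, both strings."""
    for i in range(len(form) - 1, -1, -1):
        if form[i].isupper():                    # rightmost nonterminal
            return {form[:i] + omega + form[i + 1:]
                    for A, omega in productions if A == form[i]}
    return set()                                 # form is all-terminal
```

The lderives relation is the mirror image: scan from the left instead.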
For γ ∈ V*, if S ⇒r* γ holds in G, then γ is called a right sentential form of G. The set of the right sentential forms of G is denoted by SFr(G). The inclusion SFr(G) ⊆ SF(G) holds and is typically, but not always, proper. In contrast, for w ∈ T*, S ⇒* w holds in G if and only if S ⇒r* w holds in G. Thus, L(G) = {w ∈ T* | S ⇒r* w holds in G}.

For A ∈ N and X ∈ V, if A ⇒r+ αX holds in G for some α ∈ V*, then X is right-reachable from A; furthermore, if X = A, then A is right-recursive. A grammar that has a right-recursive nonterminal is a right-recursive grammar. A symbol X ∈ V is nullable in G if and only if X ⇒r* ε holds in G.
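The nullable symbols just defined admit the usual least-fixed-point computation. A minimal sketch under the same hypothetical encoding (uppercase nonterminals, productions as string pairs, ε as the empty string):

```python
# Least fixed point: X is nullable iff some production X -> w exists
# with every symbol of w already known nullable (vacuously true for w = "").
P = [("S", "AB"), ("A", "aA"), ("A", ""), ("B", "b")]

def nullable_symbols(P):
    null = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in P:
            if lhs not in null and all(c in null for c in rhs):
                null.add(lhs)
                changed = True
    return null

print(nullable_symbols(P))   # {'A'}
```

Terminals are never added because they never occur as a left-hand side, matching the fact that no terminal is nullable.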
Any string γ ∈ V* such that S ⇒l* γ holds in G is a left sentential form of G. The set of the left sentential forms of G is denoted by SFl(G). Similar to the above, the relationship SFl(G) ⊆ SF(G) holds and is generally proper. In addition, L(G) = {w ∈ T* | S ⇒l* w holds in G}.

Given A ∈ N and X ∈ V, if A ⇒l+ Xβ holds in G for some β ∈ V*, then X is left-reachable from A; if it further holds that X = A, then A is left-recursive. A grammar is left-recursive if at least one of its nonterminals is left-recursive. Finally, X ∈ V is nullable in G if and only if X ⇒l* ε holds in G.

¹ In the literature, the term "canonical" is typically associated with rightmost derivations only.
Top-Down Right-to-Left Recognition

A general top-down recognition scheme that scans the input string from right to left is formally developed next.² This scheme is based on two binary relations on V*. Through these two fundamental relations, a set-theoretic characterization of general top-down right-to-left recognition which succinctly captures the essence of the task is derived.

In concert, the two relations refine and supplant the r-derives relation. Certain regularity properties of context-free grammars that are central to our treatment of recognition are characterized directly and rather elegantly by the two relations; by comparison, a description of these properties in terms of r-derives is indirect and somewhat awkward. It is in this sense that the two relations refine the r-derives relation. Moreover, the two relations provide alternate definitions of the right sentential forms and sentences of a grammar. In that respect, the r-derives relation is supplanted by them.

Strong Rightmost Derivations

The strong rightmost derives relation (⇒R) is defined by ⇒R = {(αA, αω) | α ∈ V*, A→ω ∈ P}. Thus, ⇒R is a subrelation of ⇒r with domain V*N. For brevity, the strong rightmost derives relation is called the R-derives relation.

Strong rightmost derivations are defined in terms of the reflexive-transitive closure of ⇒R. Thus, every strong rightmost derivation is also a rightmost derivation. The following series of lemmas compares some elementary properties of rightmost and strong rightmost derivations.
Lemma 3.1 For α, β ∈ V*, if α ⇒R* β holds in G, then α ⇒r* β holds in G.

Proof. This follows directly from the fact that ⇒R is a subrelation of ⇒r. □

² For the moment, we ignore the fact that a right-to-left scan of the input is not particularly useful in practice.
Lemma 3.2 For α, β ∈ V* and A ∈ N, if α ⇒r* βA holds in G, then α ⇒R* βA holds in G.

Proof. Let n represent the length of a rightmost derivation of βA from α. By induction on n, we show that there exists an identical n-step strong rightmost derivation of βA from α.

Basis (n = 0). Assume that α ⇒r⁰ βA holds in G. This implies that α = βA, since ⇒r⁰ is equivalent to the identity relation on V*. Since ⇒R⁰ is also equivalent to the identity relation on V*, α ⇒R⁰ α also holds in G.

Induction (n > 0). By assumption, α ⇒r^n βA holds in G. The last step in a particular n-step derivation of βA from α can take two distinct forms. These are analyzed in the following two cases.

Case (i): α ⇒r^(n-1) γB ⇒r γδA = βA for some γ ∈ V* and B→δA ∈ P. By the induction hypothesis, α ⇒R^(n-1) γB holds in G. Since γB ⇒R γδA holds in G by definition, we conclude that α ⇒R^n βA holds in G.

Case (ii): α ⇒r^(n-1) βAB ⇒r βA for some B→ε ∈ P. By the induction hypothesis, α ⇒R^(n-1) βAB holds in G. Thus, α ⇒R^n βA also holds in G since βAB ⇒R βA holds.

In both cases, we have shown that α ⇒R^n βA holds in G. □

Lemma 3.3 For α ∈ V* and a ∈ T, if α ⇒r* βa holds in G for some β ∈ V*, then α ⇒R* γa holds in G for some γ ∈ V*.

Proof. Assume that α ⇒r* βa holds in G for some β ∈ V*. If α = γa for some γ ∈ V*, then α ⇒R* γa = α trivially holds in G. Otherwise, suppose that α does not end with a. In this case, every rightmost derivation of βa from α is nontrivial. We analyze one such rightmost derivation and focus on the step that causes a to become the rightmost symbol in a string occurring in that derivation. The initial segment of the derivation up to and including this step can take two distinct forms.

Case (i): α ⇒r* δA ⇒r δσa for some δ ∈ V* and A→σa ∈ P. By Lemma 3.2, α ⇒R* δA holds in G. By definition, δA ⇒R δσa holds in G. Thus, α ⇒R* γa holds in G when we let γ = δσ.

Case (ii): α ⇒r* δaA ⇒r δa for some δ ∈ V* and A→ε ∈ P. Similar to Case (i), α ⇒R* δaA and δaA ⇒R δa both hold in G. Now we let γ = δ to conclude that α ⇒R* γa holds in G.

We have demonstrated in both cases that α ⇒R* γa holds in G for some γ ∈ V*. □
Lemma 3.4 For A ∈ N and X ∈ V, X is right-reachable from A in G if and only if A ⇒R+ αX holds in G for some α ∈ V*.

Proof. If X is right-reachable from A in G, then A ⇒r+ βX holds in G for some β ∈ V*. If X ∈ N, then A ⇒R+ βX also holds in G by Lemma 3.2. If X ∈ T, then Lemma 3.3 applies, i.e., A ⇒R+ αX holds in G for some α ∈ V*. Conversely, suppose that A ⇒R+ αX holds in G for some α ∈ V*. It follows directly from Lemma 3.1 that X is right-reachable from A. □

Corollary For A ∈ N, A is right-recursive in G if and only if A ⇒R+ αA holds in G for some α ∈ V*.

Lemma 3.5 For X ∈ V, X is nullable in G if and only if X ⇒R* ε holds in G.

Proof. If X ∈ T, X is not nullable in G and X ⇒R* ε does not hold in G. Now suppose that X ∈ N. If X is nullable in G, then every rightmost derivation which demonstrates this must be of the form X ⇒r* A ⇒r ε for some A→ε ∈ P. From Lemma 3.2 and the fact that A ⇒R ε holds in G, we conclude that X ⇒R* ε holds in G. Conversely, X ⇒R* ε immediately implies that X is nullable in G since ⇒R is a subrelation of ⇒r. □

Corollary For γ ∈ V*, γ is nullable in G if and only if γ ⇒R* ε holds in G.

One final lemma is presented before introducing the companion relation to ⇒R. The lemma is useful for motivating this second relation.

Lemma 3.6 For α ∈ V*, at least one of the following two statements is true: (1) α ⇒R* βa holds in G for some β ∈ V* and a ∈ T; (2) α ⇒R* ε holds in G.

Proof. If α = ε, then statement (2) holds trivially. Now suppose that α ≠ ε. Since G is reduced, α ⇒r* x holds in G for some x ∈ T*. If x = ε, then statement (2) again holds from the corollary to Lemma 3.5. Otherwise, x = ya for some y ∈ T* and a ∈ T. By Lemma 3.3, it now follows that α ⇒R* βa holds in G for some β ∈ V*. □
Lemma 3.3, in contrast to Lemma 3.2, illustrates that a rightmost derivation departs from a strong rightmost derivation following the step where a terminal symbol first appears at the right end of a string occurring in the rightmost derivation. The role of the second relation that we introduce is to dispense with terminal symbols as they appear at the right end of strings in strong rightmost derivations. Specifically, the chop relation is defined by ⊣ = {(αa, α) | α ∈ V*, a ∈ T}. For every a ∈ T, ⊣a denotes the subrelation of ⊣ with domain V*a. Thus, for α, β ∈ V* and a ∈ T, α ⊣a β holds if and only if α ⊣ β and α = βa hold.

The relation product ⇒R* ⊣, a useful composition that is suggested by Lemma 3.6, is used extensively in what follows. Formally, for α, β ∈ V*, α ⇒R* ⊣ β holds in G if and only if α ⇒R* βa ⊣ β holds in G for some a ∈ T; this latter expression is usually written as α (⇒R* ⊣)a β.

For clarity, we describe inductively the notation that we will employ for exploiting the reflexive-transitive closure of (⇒R* ⊣); similar conventions are applied to other relation products that are introduced later. For all α ∈ V*, α (⇒R* ⊣)^ε α holds in G; for α, β, γ ∈ V*, y ∈ T^(n-1) with n ≥ 1, and a ∈ T, if α (⇒R* ⊣)^y β and β (⇒R* ⊣)a γ hold in G, then α (⇒R* ⊣)^(ay) γ holds in G. The order of ay in the latter expression reflects the fact that the terminal symbols of a string are generated by ⇒R and chopped by ⊣ from right to left. Finally, if α (⇒R* ⊣)^y β holds in G for some α, β ∈ V* and y ∈ T^n with n ≥ 0, then for convenience we may instead write this expression as α (⇒R* ⊣)^y β, α (⇒R* ⊣)* β, or α (⇒R* ⊣)^n β according to whether or not the string y or its length n is relevant.
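A single application of the product (⇒R* ⊣)a to a set of strings can be sketched in a few lines. This is an illustration only, under the same hypothetical encoding as before (uppercase nonterminals, lowercase terminals); the explicit closure below terminates only when the ⇒R*-image of the set is finite, which fails for right-recursive grammars.

```python
# Demonstrates the chop relation and one application of (=>R* chop)_a to {S}.
P = [("S", "Ab"), ("A", "a")]

def R_closure(strings, P):
    # image of a finite string set under =>R*: repeatedly rewrite a string's
    # rightmost symbol while that symbol is a nonterminal
    seen, work = set(strings), list(strings)
    while work:
        s = work.pop()
        if s and s[-1].isupper():
            for lhs, rhs in P:
                t = s[:-1] + rhs
                if lhs == s[-1] and t not in seen:
                    seen.add(t)
                    work.append(t)
    return seen

def chop(strings, a):
    # the relation chop_a applied elementwise: keep strings ending in terminal a,
    # with that final occurrence of a removed
    return {s[:-1] for s in strings if s.endswith(a)}

step = chop(R_closure({"S"}, P), "b")   # one application of (=>R* chop)_b
print(step)                             # {'A'}
```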
Right Sentential Forms Revisited

Next we investigate how arbitrary rightmost derivations are mimicked by the ⇒R and ⊣ relations. In short, a rightmost derivation is represented as a sequence of strong rightmost derivations interspersed with chops of terminal symbols. As a result of this analysis, the precise manner in which right sentential forms and sentences are generated by the two new relations is revealed.

Lemma 3.7 For α, β ∈ V*, if α ⇒R* β holds in G, then αz ⇒r* βz holds in G for every z ∈ T*.

Proof. If α ⇒R* β holds in G, then α ⇒r* β holds in G by Lemma 3.1. The consequent in its full generality can then be established by an induction on the length of an arbitrary string z ∈ T*. □
Lemma 3.8 For α, β ∈ V* and z ∈ T*, if α (⇒R* ⊣)^z β holds in G, then α ⇒r* βz holds in G.

Proof. The proof is by induction on n = len(z).

Basis (n = 0). In this case, z = ε. By assumption, α (⇒R* ⊣)^ε β holds in G. It must be the case that α = β, so α ⇒r* β trivially holds in G.

Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^(n-1). Assume that α (⇒R* ⊣)^(ay) β holds in G. Then α (⇒R* ⊣)^y γ (⇒R* ⊣)a β holds in G for some γ ∈ V*. By the induction hypothesis, α ⇒r* γy holds in G. Furthermore, γ (⇒R* ⊣)a β implies that γ ⇒R* βa ⊣ β holds in G. By Lemma 3.7, γy ⇒r* βay holds in G, so α ⇒r* βay = βz also holds in G. □

Lemma 3.9 For α, β ∈ V* and z ∈ T*, if α (⇒R* ⊣)^z ⇒R* β holds in G, then α ⇒r* βz holds in G.

Proof. By assumption, α (⇒R* ⊣)^z ⇒R* β holds in G. This implies that α (⇒R* ⊣)^z γ ⇒R* β holds in G for some γ ∈ V*. By Lemma 3.8, α ⇒r* γz holds in G. Since γ ⇒R* β holds in G, γz ⇒r* βz holds in G by Lemma 3.7. Therefore, α ⇒r* βz holds in G. □
Lemma 3.10 For α, β, γ ∈ V* and x, y ∈ T*, if α (⇒R* ⊣)^x ⇒R* β and β (⇒R* ⊣)^y ⇒R* γ hold in G, then α (⇒R* ⊣)^(yx) ⇒R* γ holds in G.

Proof. The key observation relevant here is that the expression α (⇒R* ⊣)^x ⇒R* β (⇒R* ⊣)^y ⇒R* γ may be rewritten as α (⇒R* ⊣)^x (⇒R* ⊣)^y ⇒R* γ; to make this transformation, the occurrence of ⇒R* preceding β in the first expression is absorbed by (⇒R* ⊣)^y if y ≠ ε and by the occurrence of ⇒R* preceding γ otherwise. It is now immediate that α (⇒R* ⊣)^(yx) ⇒R* γ holds in G. □

Lemma 3.11 For α ∈ V* and z ∈ T*, αz (⇒R* ⊣)^z ⇒R* α holds in G.

Proof. This is shown by an easy induction on n = len(z).

Basis (n = 0). Trivially, α (⇒R* ⊣)^ε ⇒R* α holds in G.

Induction (n > 0). Let z = ay for some a ∈ T and y ∈ T^(n-1). By the induction hypothesis, αay (⇒R* ⊣)^y ⇒R* αa holds in G. Observing that αa ⇒R* αa ⊣a α ⇒R* α holds in G establishes that αa (⇒R* ⊣)^a ⇒R* α also holds. It now follows from Lemma 3.10 that αay (⇒R* ⊣)^(ay) ⇒R* α holds in G. □

Lemma 3.12 For α, β ∈ V*, let α ⇒r* β hold in G. Furthermore, let β = γx for some γ ∈ V* and x ∈ T* where γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest suffix of β consisting solely of terminal symbols). Then α (⇒R* ⊣)^x ⇒R* γ holds in G.

Proof. The proof is by induction on the length n of a rightmost derivation of β from α.

Basis (n = 0). Thus, α ⇒r⁰ β = α. Write α as γx for some γ ∈ V* and x ∈ T* where x is the longest suffix of α contained in T*. In this case, α = γx (⇒R* ⊣)^x ⇒R* γ holds in G by Lemma 3.11.

Induction (n > 0). A rightmost derivation of β from α consisting of n steps is of the form α ⇒r^(n-1) δAz ⇒r δωz = β for some δ ∈ V*, A→ω ∈ P, and z ∈ T*. By the induction hypothesis, α (⇒R* ⊣)^z ⇒R* δA holds in G. Since δA ⇒R δω holds in G, α (⇒R* ⊣)^z ⇒R* δω also holds. Now write δω as γy for some γ ∈ V* and y ∈ T* where y is the longest suffix of δω made up entirely of terminal symbols. By Lemma 3.11, δω = γy (⇒R* ⊣)^y ⇒R* γ holds in G. It then follows from Lemma 3.10 that α (⇒R* ⊣)^(yz) ⇒R* γ holds in G. Finally, we note that β = γyz where, by construction, yz is the longest suffix of β that is comprised of only terminal symbols. □
Theorem 3.13 SFr(G) = {γ ∈ V* | S (⇒R* ⊣)^z ⇒R* α holds in G for some α ∈ V* and z ∈ T* such that γ = αz}.

Proof. Suppose that S (⇒R* ⊣)^z ⇒R* α holds in G for some α ∈ V* and z ∈ T*. By Lemma 3.9, S ⇒r* αz also holds in G, so αz ∈ SFr(G). Conversely, suppose that S ⇒r* γ holds in G for some γ ∈ V*. Let γ = αz for some α ∈ V* and z ∈ T* such that z is the longest suffix of γ which is a terminal string. Then S (⇒R* ⊣)^z ⇒R* α holds in G by Lemma 3.12. □

Corollary L(G) = {w ∈ T* | S (⇒R* ⊣)^w ⇒R* ε holds in G}.

Corollary SUFFIX(G) = {z ∈ T* | S (⇒R* ⊣)^z α holds in G for some α ∈ V*}.
Viable Prefixes

A concept that plays a central role in LR parsing theory is that of a viable prefix. Viable prefixes are also prominent in our treatment of general recognition and parsing. Viable prefixes are defined in terms of rightmost derivations and right sentential forms as follows. A string γ ∈ V* is a viable prefix³ of G if S ⇒r* δAz ⇒r δαβz = γβz holds in G for some δ ∈ V*, A→αβ ∈ P, and z ∈ T*. Thus, viable prefixes are certain prefixes of right sentential forms. The set of viable prefixes of G is denoted by VP(G).

In the next series of lemmas, a definition of the viable prefixes of G in terms of the R-derives and chop relations is developed. It transpires that this definition is remarkably similar to the definition of SFr(G) just given. Since viable prefixes are defined via nontrivial rightmost derivations from S, our definition is carefully tailored to include S in VP(G) only in case S ⇒r+ Sα holds in G for some α ∈ V*.

Lemma 3.14 For α, β ∈ V*, if α ⇒R β holds in G and α is a viable prefix of G, then β is a viable prefix of G.

Proof. Since α ⇒R β holds in G by assumption, α = γA and β = γω for some γ ∈ V* and A→ω ∈ P. Also by assumption, α ∈ VP(G), so S ⇒r* δBz ⇒r δστz = ατz holds in G for some δ ∈ V*, B→στ ∈ P, and z ∈ T*. Since G is reduced, τ ⇒r* y holds in G for some y ∈ T*. Thus, S ⇒r* αyz = γAyz ⇒r γωyz = βyz holds in G, which shows that β is a viable prefix of G. □

Lemma 3.15 For α, β ∈ V*, if α ⇒R* β holds in G and α is a viable prefix of G, then β is a viable prefix of G.

Proof. Applying the preceding lemma, this lemma is established by an easy induction on the length of a strong rightmost derivation of β from α. □

³ This definition is borrowed from Sippu and Soisalon-Soininen [38]. Although it differs slightly from others (cf. [5]), it is more appropriate to our needs.
Lemma 3.16 For α, β ∈ V*, if α ⊣ β holds in G and α is a viable prefix of G, then β is a viable prefix of G.

Proof. From the hypothesis, α = βa for some a ∈ T. Conventional definitions of viable prefixes [5] prescribe that every prefix of a viable prefix of G is also a viable prefix of G. However, this property is not immediate from the definition that we have adopted. A proof that this property does hold for our definition is provided by Sippu and Soisalon-Soininen [38]. The essence of their argument is based on the existence of a rightmost derivation of the form S ⇒r* δAz ⇒r δσaτz = βaτz for some δ ∈ V*, A→σaτ ∈ P, and z ∈ T*. This derivation form demonstrates that both βa = α and β are viable prefixes of G. □

Lemma 3.17 For γ ∈ V*, if ω (⇒R* ⊣)^z γ holds in G for some S→ω ∈ P and z ∈ T*, then γ is a viable prefix of G.

Proof. The proof is by induction on n = len(z).

Basis (n = 0). In this case, z = ε. By assumption, ω (⇒R* ⊣)^ε γ holds in G for some S→ω ∈ P. Then γ must equal ω, which is a viable prefix of G since S ⇒r ω holds in G.

Induction (n > 0). In this case, z = ay for some a ∈ T and y ∈ T^(n-1). Assume that ω (⇒R* ⊣)^(ay) γ holds in G. Then ω (⇒R* ⊣)^y β (⇒R* ⊣)a γ holds in G for some β ∈ V*. By the induction hypothesis, β ∈ VP(G). Now β (⇒R* ⊣)a γ implies that β ⇒R* γa ⊣ γ holds in G. It follows from Lemmas 3.15 and 3.16 that γa and γ are also viable prefixes of G. □

Lemma 3.18 For γ ∈ V*, if ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S→ω ∈ P and z ∈ T*, then γ is a viable prefix of G.

Proof. Assume that ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S→ω ∈ P and z ∈ T*. This implies that ω (⇒R* ⊣)^z β ⇒R* γ holds in G for some β ∈ V*. By Lemma 3.17, β is a viable prefix of G. Thus, γ is also in VP(G) by Lemma 3.15. □
Lemma 3.19 For γ ∈ V*, if γ is a viable prefix of G, then ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S→ω ∈ P and z ∈ T*.

Proof. By assumption, γ ∈ VP(G), so S ⇒r* δAy ⇒r δαβy = γβy holds in G for some δ ∈ V*, A→αβ ∈ P, and y ∈ T*. From the proof of Lemma 3.12, S (⇒R* ⊣)^y ⇒R* γβ holds in G. Since G is reduced, β ⇒r* x holds in G for some x ∈ T*. Therefore, γβ ⇒r* γx and γβ (⇒R* ⊣)^x ⇒R* γ both hold in G. Combining these results in the manner of Lemma 3.10, S (⇒R* ⊣)^(xy) ⇒R* γ holds in G. Since the nontrivial rightmost derivation of γβy from S must have a first step of the form S ⇒r ω for some S→ω ∈ P, ω (⇒R* ⊣)^z ⇒R* γ holds in G where z = xy. □
Theorem 3.20 VP(G) = {γ ∈ V* | ω (⇒R* ⊣)^z ⇒R* γ holds in G for some S→ω ∈ P and z ∈ T*}.

Proof. This theorem follows directly from Lemmas 3.18 and 3.19. □

Corollary VP(G) = {γ ∈ V* | S (⇒R ∪ ⊣)+ γ holds in G}.

One final observation is that VP(G) is closed under (⇒R ∪ ⊣). Indeed, this is immediate from Lemmas 3.14 and 3.16. Due to its importance in general canonical top-down recognition, this property is formally recorded below.

Corollary For α, β ∈ V*, if α ∈ VP(G) and α (⇒R ∪ ⊣)* β holds in G, then β ∈ VP(G).
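For a small grammar, the first corollary suggests a direct way to enumerate VP(G): close {S} under one or more applications of ⇒R and ⊣. The sketch below is illustrative only, under the hypothetical string encoding used earlier; it terminates only when VP(G) is finite, which fails for recursive grammars, where the regular set VP(G) is infinite and needs an automaton representation.

```python
# Enumerate VP(G) for a tiny non-recursive grammar as the set of strings
# reachable from S by at least one (=>R or chop) step.
P = [("S", "AB"), ("A", "a"), ("B", "b")]

def viable_prefixes(P, start):
    seen, work = set(), [start]          # S itself is excluded unless re-derived
    while work:
        s = work.pop()
        succ = set()
        if s and s[-1].isupper():        # =>R : rewrite the rightmost symbol
            succ |= {s[:-1] + rhs for lhs, rhs in P if lhs == s[-1]}
        if s and s[-1].islower():        # chop : drop a trailing terminal
            succ.add(s[:-1])
        for t in succ:
            if t not in seen:
                seen.add(t)
                work.append(t)
    return seen

print(sorted(viable_prefixes(P, "S")))   # ['', 'A', 'AB', 'Ab', 'a']
```

Note that ε appears in the result, consistent with ab being a sentence of this grammar, while S does not, consistent with the tailored treatment of S in the definition.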
General Top-Down Correct-Suffix Recognition

Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G is described next. In this scheme, w is scanned from right to left. As a consequence, an incrementally longer suffix of w is recognized in the process.

The general recognition scheme effectively pursues all of the possible rightmost derivations of w in parallel. This is carried out through regularity-preserving operations on regular subsets of VP(G). Adoption of this approach obviates the need for backtracking.

General context-free recognition is an inherently nondeterministic task. Hence, it is not generally possible to pursue the rightmost derivations of w exclusively. Instead, at the point where a suffix z of w has been processed, all rightmost derivations (from S) of all strings in T*z ∩ L(G) are followed (i.e., all sentences that have z as a suffix).

The essence of the recognition scheme, called General_RR, is simple. Let z ∈ T* be a suffix of w and suppose that all proper suffixes of z are known members of SUFFIX(G). The set of strings defined by {α ∈ VP(G) | S ⇒r* αz holds in G} is used to determine if z is a member of SUFFIX(G). This set is nonempty if and only if z ∈ SUFFIX(G). Moreover, it contains ε if and only if z ∈ L(G). The General_RR recognition scheme is described in greater detail in what follows. For reference, the recognizer is presented as Figure 3.1.

function General_RR(G = (V, T, P, S); w ∈ T*)
  // w = a1 a2 ... an, n ≥ 0, each ai ∈ T
  PVPrr(G, ε) := {ω | S→ω ∈ P}
  for i := 0 to n-1 do
    VPrr(G, w:i) := ⇒R*(PVPrr(G, w:i))
    PVPrr(G, w:i+1) := ⊣a(n-i)(VPrr(G, w:i))
    if PVPrr(G, w:i+1) = ∅ then Reject(w) fi
  od
  VPrr(G, w) := ⇒R*(PVPrr(G, w))
  if ε ∈ VPrr(G, w) then Accept(w) else Reject(w) fi
end

Figure 3.1 A General Top-Down Correct-Suffix Recognizer
For an arbitrary string z ∈ T*, two sets of viable prefixes are identified with z. The first set consists of the primitive RR-associates of z (in G) and is defined by PVPrr(G, z) = {α ∈ V* | ω (⇒R* ⊣)^z α holds in G for some S→ω ∈ P}. The second set is a superset of the first; it consists of the RR-associates of z (in G) and is defined by VPrr(G, z) = {α ∈ V* | ω (⇒R* ⊣)^z ⇒R* α holds in G for some S→ω ∈ P}. By Theorems 3.13 and 3.20, VPrr(G, z) = {α ∈ VP(G) | S ⇒r* αz holds in G}, which equates to the set described in the preceding paragraph. Input string w is recognized by computing PVPrr(G, w:i) and VPrr(G, w:i) in turn as i ranges from 0 to len(w).

In words, VPrr(G, z) is the reflexive-transitive closure of PVPrr(G, z) under the ⇒R relation. This fact is made explicit by expressing VPrr(G, z) as {β ∈ V* | α ⇒R* β holds in G for some α ∈ PVPrr(G, z)}. Thus, if PVPrr(G, z) is known, VPrr(G, z) is obtained from it through appropriate application of the ⇒R* relation.

The incremental aspect of General_RR becomes apparent in the computation of a set of primitive RR-associates. Specifically, given VPrr(G, z) and a ∈ T, PVPrr(G, az) is obtained by an application of the ⊣a relation since PVPrr(G, az) = {β ∈ V* | α ⊣a β holds in G for some α ∈ VPrr(G, z)}. It is apparent that PVPrr(G, z) and VPrr(G, z) are both nonempty if and only if z ∈ SUFFIX(G). The computation of the primitive RR-associates of ε, a suffix of every w ∈ T*, serves as the initialization step. Specifically, PVPrr(G, ε) = {ω | S→ω ∈ P}.
Lastly, the conditions for termination of General_RR are specified. First suppose that w ∈ L(G). In this case, VPrr(G, w) is the last set of RR-associates computed; after it is in place, w is accepted based on the fact that ε ∈ VPrr(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ SUFFIX(G) also holds, then there is a unique string z ∈ T* which is the shortest suffix of w such that z ∉ SUFFIX(G) holds. In this case, PVPrr(G, z) is the first empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ SUFFIX(G) both hold, then ε ∉ VPrr(G, w) by definition. In either case, the input string is rejected.
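Figure 3.1 can be transcribed almost line for line when the RR-associate sets stay finite. The following is an illustrative sketch only, not the formal machinery: nonterminals are single uppercase letters, terminals lowercase, a production A→ω is the pair ("A", "ω"), and the ⇒R*-image is enumerated explicitly, so the sketch diverges on right-recursive grammars, whose regular associate sets are infinite and require an automaton representation.

```python
# A direct, finite-set transcription of the General_RR scheme of Figure 3.1.
P = [("S", "AB"), ("A", "a"), ("B", "b")]

def r_derives_closure(strings, P):
    # image under =>R* : rewrite a string's rightmost symbol while it is a nonterminal
    seen, work = set(strings), list(strings)
    while work:
        s = work.pop()
        if s and s[-1].isupper():
            for lhs, rhs in P:
                t = s[:-1] + rhs
                if lhs == s[-1] and t not in seen:
                    seen.add(t)
                    work.append(t)
    return seen

def chop(strings, a):
    # chop_a : keep strings ending in terminal a, with that occurrence removed
    return {s[:-1] for s in strings if s.endswith(a)}

def general_rr(P, start, w):
    pvp = {rhs for lhs, rhs in P if lhs == start}    # PVPrr(G, eps)
    for a in reversed(w):                            # right-to-left scan of w
        vp = r_derives_closure(pvp, P)               # VPrr for the current suffix
        pvp = chop(vp, a)                            # PVPrr for the longer suffix
        if not pvp:
            return False                             # suffix not in SUFFIX(G)
    return "" in r_derives_closure(pvp, P)           # eps in VPrr(G, w) iff w in L(G)

print(general_rr(P, "S", "ab"))   # True
print(general_rr(P, "S", "ba"))   # False
```

Rejection occurs either when a chop empties the working set (the first if statement of Figure 3.1) or when ε is absent from the final closure (the second).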
The correctness of the General_RR recognition scheme is formally established in the following two lemmas. The supporting arguments are quite straightforward given the collective results to this point.

Lemma 3.21 Let w ∈ L(G) be arbitrary. If General_RR is applied to G and w, then General_RR accepts w.

Proof. By definition, PVPrr(G, w:i) and VPrr(G, w:i) are nonempty for all i, 0 ≤ i ≤ len(w). Thus, the for loop completes len(w) iterations. Since w ∈ L(G) by assumption, ε ∈ VPrr(G, w). Therefore, w is accepted by General_RR in the second if statement. □

Lemma 3.22 Let w ∉ L(G) be arbitrary. If General_RR is applied to G and w, then General_RR rejects w.

Proof. There are two cases to consider based on whether or not w is in SUFFIX(G).

Case (i): w ∈ SUFFIX(G). In this case, PVPrr(G, w:i) and VPrr(G, w:i) are nonempty for all i, 0 ≤ i ≤ len(w), so the for loop completes len(w) iterations. Since w ∉ L(G) by assumption, ε ∉ VPrr(G, w). Therefore, w is rejected by General_RR in the second if statement.

Case (ii): w ∉ SUFFIX(G). Since w ≠ ε, w = xay for some x, y ∈ T* and a ∈ T such that y ∈ SUFFIX(G) but ay ∉ SUFFIX(G). Let len(y) = m and note that 0 ≤ m < len(w) must hold. PVPrr(G, y:i) and VPrr(G, y:i) are nonempty for all i, 0 ≤ i ≤ m, so the for loop completes m iterations. During the (m+1)st iteration, PVPrr(G, ay) = ∅ is computed. Therefore, w is rejected by General_RR in the first if statement. □
Regularity Properties

Certain regularity properties that are inherent to all context-free grammars are exploited by General_RR. Specifically, for an arbitrary string z ∈ T*, PVPrr(G, z) and VPrr(G, z) are regular languages. This fact is proven in this section. Toward that end, some known theoretical results, including one which is rather obscure, are cited below. Since proofs of these results are not replicated here, the proofs that follow are quite brief.

A type of formal rewriting system known as a regular canonical system is defined by C = (Σ, Π) where Σ is an alphabet and Π is a finite set of (rewriting) rules [21,30,37]. Each rule in Π takes the form ξα → ξβ where α, β ∈ Σ* and ξ denotes an arbitrary string over Σ, i.e., a variable. The form of a rule indicates that the left-hand side may be rewritten to its corresponding right-hand side only at the extreme right end of a string. Thus, much like R-derives, the C-derives relation induced on Σ* by Π is defined by ⇒C = {(γα, γβ) | γ ∈ Σ*, ξα → ξβ ∈ Π}. Given two languages L1, L2 ⊆ Σ*, define r(L1, C, L2) = {γ2 ∈ Σ* | γ1 ⇒C* γ2δ holds in C for some γ1 ∈ L1 and δ ∈ L2}.

A key result from the literature relevant to regular canonical systems is the following.

Fact 3.1 Let C = (Σ, Π) be a regular canonical system and let L1 and L2 be regular languages over Σ. Then r(L1, C, L2) is a regular language over Σ.

Proof. This is a restatement of Theorem 3 from Greibach [21]. □
The proof that PVPrr(G, z) and VPrr(G, z) are regular languages is based indirectly on proofs that ⇒R and ⊣ are regularity-preserving relations. First, a relationship is established between context-free grammars and regular canonical systems. Specifically, for a grammar G = (V, T, P, S), the regular canonical system induced by G is defined by C = (V, ΠP) where ΠP = {ξA → ξω | A→ω ∈ P}.

Lemma 3.23 Relation ⇒R is regularity-preserving.

Proof. Let G = (V, T, P, S) be a grammar, C = (V, ΠP) the regular canonical system induced by G, and L an arbitrary regular language over V. By Fact 3.1, r(L, C, {ε}) = {δ ∈ V* | γ ⇒C* δ holds in C for some γ ∈ L} is regular. Since the ⇒R and ⇒C relations are equivalent, ⇒R*(L) = r(L, C, {ε}). Therefore, ⇒R is regularity-preserving. □

Lemma 3.24 Relation ⊣ is regularity-preserving.

Proof. Let G = (V, T, P, S) be a grammar and let L ⊆ V* be an arbitrary regular language. The quotient of a language L1 with respect to a language L2 is defined by L1/L2 = {x | xy ∈ L1 for some y ∈ L2}. Since the quotient of a regular language with respect to an arbitrary set is a regular language [24], for every a ∈ T, ⊣a(L) = L/{a} is regular. Therefore, ⊣ is regularity-preserving. □

Theorem 3.25 Let G = (V, T, P, S) be an arbitrary grammar and let z ∈ T* be an arbitrary string. Then PVPrr(G, z) and VPrr(G, z) are regular languages.

Proof. By induction on len(z), this theorem follows from Lemmas 3.23 and 3.24 and the fact that PVPrr(G, ε) = {ω | S→ω ∈ P} is regular. □
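The quotient construction invoked in the proof of Lemma 3.24 has a simple automaton-level reading: if a DFA accepts L, then the same transition structure with accepting-state set {q | δ(q, a) ∈ F} accepts L/{a}. The sketch below uses a hypothetical dictionary encoding of DFAs and is an illustration of this standard construction, not of the dissertation's formalism.

```python
# L/{a} is recognized by the original DFA with a recomputed accepting-state set.
def quotient_dfa(states, delta, start, finals, a):
    # accept q in the quotient iff reading 'a' from q lands in an accepting state
    new_finals = {q for q in states if delta.get((q, a)) in finals}
    return states, delta, start, new_finals

def accepts(delta, start, finals, s):
    q = start
    for c in s:
        q = delta.get((q, c))
        if q is None:          # missing transition: implicit dead state
            return False
    return q in finals

# DFA for the singleton language {"Ab"}: 0 -A-> 1 -b-> 2 (accepting)
states = {0, 1, 2}
delta = {(0, "A"): 1, (1, "b"): 2}
_, _, start, finals = quotient_dfa(states, delta, 0, {2}, "b")
print(accepts(delta, start, finals, "A"))   # True: "A" is in {"Ab"}/{b}
```

Since only the accepting-state set changes, regularity is preserved, which is exactly what Lemma 3.24 needs for each ⊣a.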
Top-Down Left-to-Right Recognition

In this section, a general top-down recognition scheme that presumes a left-to-right scan of the input string is formally developed. Toward that end, consider the two relations on V* defined by {(Aβ, ωβ) | A→ω ∈ P, β ∈ V*} and {(aβ, β) | a ∈ T, β ∈ V*}. Informally, these relations represent left-biased counterparts of ⇒R and ⊣, respectively. Along the lines of General_RR, a general top-down correct-prefix recognizer can be based on these two relations. Specifically, leftmost derivations, left sentential forms, etc., can be defined in terms of these relations analogously to how rightmost derivations, right sentential forms, etc., are expressed in terms of ⇒R and ⊣. However, an alternate approach is suggested by the following result.

Fact 3.2 For α, β ∈ V*, (1) α ⇒r* β holds in G if and only if α^R ⇒l* β^R holds in G^R; (2) α ⇒l* β holds in G if and only if α^R ⇒r* β^R holds in G^R.

Proof. A slightly stronger statement is presented by Sippu and Soisalon-Soininen as Fact 3.1 [38]. □

For future reference, some useful equivalences that are implied by Fact 3.2 include the following: (1) L(G^R) = (L(G))^R; (2) PREFIX(G^R) = (SUFFIX(G))^R; (3) SUFFIX(G^R) = (PREFIX(G))^R.

Fact 3.2 is exploited rather extensively in what follows. In particular, leftmost derivations in G, and ultimately general top-down correct-prefix recognition, are described in terms of strong rightmost derivations in G^R and the chop relation. Consequently, a substantial portion of the results derived in the previous section are useful here as well. This economizes on our efforts considerably.
Strong Rightmost Derivations in Reversed Grammars

The R-derives relation induced on V* by P^R is defined by ⇒R = {(αA, αω) | α ∈ V*, A→ω ∈ P^R}. The relationship between strong rightmost derivations in G^R and leftmost derivations in G is the subject of the next series of lemmas.⁴

Lemma 3.26 For α, β ∈ V*, if α ⇒R* β holds in G^R, then α^R ⇒l* β^R holds in G.

Proof. By assumption, α ⇒R* β holds in G^R. This implies that α ⇒r* β holds in G^R by Lemma 3.1. It follows from Fact 3.2 that α^R ⇒l* β^R holds in G. □

Lemma 3.27 For α, β ∈ V* and A ∈ N, if α ⇒l* Aβ holds in G, then α^R ⇒R* β^R A holds in G^R.

Proof. Assume that α ⇒l* Aβ holds in G. By Fact 3.2, α^R ⇒r* (Aβ)^R = β^R A holds in G^R. Thus, α^R ⇒R* β^R A also holds in G^R by Lemma 3.2. □

⁴ The chop relations relevant to G and G^R are identical.
Lemma 3.28 For α ∈ V* and a ∈ T, if α ⇒l* aβ holds in G for some β ∈ V*, then α^R ⇒R* γa holds in G^R for some γ ∈ V*.

Proof. If α ⇒l* aβ holds in G for some β ∈ V*, then α^R ⇒r* (aβ)^R = β^R a holds in G^R by Fact 3.2. By Lemma 3.3, it follows that α^R ⇒R* γa holds in G^R for some γ ∈ V*. □

Lemma 3.29 For A ∈ N and X ∈ V, X is left-reachable from A in G if and only if X is right-reachable from A in G^R.

Proof. Assume that A ⇒l+ Xβ holds in G for some β ∈ V*. By Fact 3.2, A ⇒r+ (Xβ)^R = β^R X holds in G^R, so A ⇒R+ αX holds in G^R for some α ∈ V*. This latter conclusion follows from Lemma 3.2 if X ∈ N, and from Lemma 3.3 otherwise. Conversely, suppose that A ⇒R+ αX holds in G^R for some α ∈ V*. It follows from Lemma 3.1 that A ⇒r+ αX holds in G^R. By Fact 3.2, A ⇒l+ (αX)^R = X α^R holds in G. □

Corollary For A ∈ N, A is left-recursive in G if and only if A is right-recursive in G^R.
Clearly, the nullability of vocabulary symbols is invariant with respect to grammar reversal. Thus, the following statements are equivalent for X ∈ V: (1) X is nullable in G; (2) X ⇒R* ε holds in G; (3) X ⇒R* ε holds in G^R. This observation is easily generalized to strings in V*.

Although Lemma 3.6 obviously applies to G^R, it is restated below in terms of G^R because of its importance in showing how the ⇒R and ⊣ relations cooperate.

Lemma 3.30 For α ∈ V*, at least one of the following two statements is true: (1) α ⇒R* βa holds in G^R for some β ∈ V* and a ∈ T; (2) α ⇒R* ε holds in G^R.
Left Sentential Forms Revisited

The left sentential forms and sentences of G are defined in terms of the R-derives and chop relations of G^R. Similar to rightmost derivations, a leftmost derivation in G is rendered as an alternation of strong rightmost derivations in G^R and rightmost chops of terminal symbols.

Lemma 3.31 For α, β ∈ V* and x ∈ T*, if α (⇒R* ⊣)^x ⇒R* β holds in G^R, then α^R ⇒l* (βx)^R = x^R β^R holds in G.

Proof. By assumption, α (⇒R* ⊣)^x ⇒R* β holds in G^R. It follows from Lemma 3.9 that α ⇒r* βx also holds in G^R. This implies that α^R ⇒l* (βx)^R = x^R β^R holds in G by Fact 3.2. □
Lemma 3.32 For α, β ∈ V*, let α ⇒l* β hold in G. Write β as xγ for some x ∈ T* and γ ∈ V* such that γ ∈ NV* if β ∈ T*NV* and γ = ε otherwise (i.e., x is the longest prefix of β that is made up of only terminal symbols). Then α^R (⇒R* ⊣)^(x^R) ⇒R* γ^R holds in G^R.

Proof. Assume that the conditions in the hypothesis of the lemma hold. From the assumption that α ⇒l* β holds in G and Fact 3.2, α^R ⇒r* β^R holds in G^R. Since β = xγ, β^R = (xγ)^R = γ^R x^R. Thus, x^R is the longest suffix of β^R that is made up of terminal symbols alone. We conclude from Lemma 3.12 that α^R (⇒R* ⊣)^(x^R) ⇒R* γ^R holds in G^R. □

Theorem 3.33 SFl(G) = {γ ∈ V* | S (⇒R* ⊣)^x ⇒R* α holds in G^R for some α ∈ V* and x ∈ T* such that γ = (αx)^R}.

Proof. First suppose that S (⇒R* ⊣)^x ⇒R* α holds in G^R for some α ∈ V* and x ∈ T*. By Lemma 3.31, this implies that S ⇒l* (αx)^R = x^R α^R holds in G, so (αx)^R is a left sentential form of G. Conversely, assume that S ⇒l* γ holds in G for some γ ∈ V*. Let γ = x^R α^R = (αx)^R for x ∈ T* and α ∈ V* such that x^R is the longest prefix of γ contained in T*. This implies, by Lemma 3.32, that S (⇒R* ⊣)^x ⇒R* α holds in G^R. □

Corollary L(G) = {w ∈ T* | S (⇒R* ⊣)^(w^R) ⇒R* ε holds in G^R}.

Corollary PREFIX(G) = {z ∈ T* | S (⇒R* ⊣)^(z^R) α holds in G^R for some α ∈ V*}.
Viable Suffixes

A top-down complement to the class of LR(k) grammars is the class of LL(k) grammars [28,36]. A theory of LL(k) parsing that is a dual to the theory of LR(k) parsing is developed by Sippu and Soisalon-Soininen [38]. In particular, the concept of a viable suffix is introduced as the LL dual to the viable prefix and plays a commensurately central role in the theory. Symmetrically to the definition of viable prefixes, viable suffixes are defined in terms of leftmost derivations and left sentential forms. A string γ ∈ V* is a viable suffix of G if S ⇒l* xAδ ⇒l xαβδ = xαγ^R holds in G for some x ∈ T*, A→αβ ∈ P, and δ ∈ V*. Thus, viable suffixes are reversals of certain suffixes of left sentential forms. The set of viable suffixes of G is denoted by VS(G).
The next series of lemmas develops a definition of the viable suffixes of G in terms of the ⇒R and ⊣ relations of G^R. In that regard, the following result is useful.

Fact 3.3 (1) A string γ ∈ V* is a viable prefix of G if and only if γ is a viable suffix of G^R; (2) a string γ ∈ V* is a viable suffix of G if and only if γ is a viable prefix of G^R.

Proof. This is presented by Sippu and Soisalon-Soininen as Fact 3.2 [38]. □

Lemma 3.34 For α, β ∈ V*, if α is a viable suffix of G and α ⇒R β holds in G^R, then β is a viable suffix of G.

Proof. If α is a viable suffix of G, then α is a viable prefix of G^R. Since α ⇒R β holds in G^R, β is a viable prefix of G^R as well. Therefore, β is a viable suffix of G. □

Lemma 3.35 For α, β ∈ V*, if α is a viable suffix of G and α ⇒R* β holds in G^R, then β is a viable suffix of G.

Proof. This is a consequence of Lemmas 3.15 and 3.34. □

Lemma 3.36 For α, β ∈ V*, if α is a viable suffix of G and α ⊣ β holds in G^R, then β is a viable suffix of G.

Proof. Using Fact 3.3, the proof of this lemma parallels that of Lemma 3.16. □
Lemma 3.37 For γ ∈ V*, if ω (⇒_R* ⊣)^x ⇒_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*, then γ is a viable suffix of G.
Proof. Assume that ω (⇒_R* ⊣)^x ⇒_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*. By Lemma 3.18, this implies that γ is a viable prefix of G^R. Thus, γ ∈ VS(G) by Fact 3.3.
Lemma 3.38 For γ ∈ V*, if γ is a viable suffix of G, then ω (⇒_R* ⊣)^x ⇒_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*.
Proof. Assume that γ is a viable suffix of G. By Fact 3.3, γ is also a viable prefix of G^R. Thus, ω (⇒_R* ⊣)^x ⇒_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T* by Lemma 3.19.
Theorem 3.39 VS(G) = {γ ∈ V* | ω (⇒_R* ⊣)^x ⇒_R* γ holds in G^R for some S→ω ∈ P^R and x ∈ T*}.
Proof. This theorem combines Lemmas 3.37 and 3.38.
Corollary VS(G) = {γ ∈ V* | ω (⇒_R ∪ ⊣)* γ holds in G^R for some S→ω ∈ P^R}.
Corollary For α, β ∈ V*, if α ∈ VS(G) and α (⇒_R ∪ ⊣)* β holds in G^R, then β ∈ VS(G).
General Top-Down Correct-Prefix Recognition
Let w ∈ T* be an arbitrary input string. A top-down scheme for recognizing w with respect to G that is a left-to-right analog of General_RR is described next. This scheme, called General_LL, scans w from left to right as it recognizes an incrementally longer prefix of the input string. General_LL effectively pursues all of the leftmost derivations of w in parallel through regularity-preserving operations on regular subsets of VS(G).
Again, the inherent nondeterminism of general context-free recognition subverts any attempt to follow exclusively the leftmost derivations of w. Instead, at the point where a prefix x of w has been processed, all leftmost derivations (from S) of all strings in xT* ∩ L(G) are followed (i.e., all sentences that have x as a prefix).
The essence of General_LL mirrors that of General_RR. Let x ∈ T* be a prefix of w. Suppose that all proper prefixes of x are members of PREFIX(G). The set of strings defined by {β ∈ VS(G) | S ⇒* xβ^R holds in G} determines if x ∈ PREFIX(G) holds. This set is nonempty if and only if x ∈ PREFIX(G), and it contains ε if and only if x ∈ L(G). General_LL, shown in Figure 3.2, is described in greater detail in what follows.
For arbitrary x ∈ T*, two sets of viable suffixes are identified with x. The first set, the primitive LL-associates of x (in G), is defined by PVS_LL(G, x) = {β ∈ V* | ω (⇒_R* ⊣)^{x^R} β holds in G^R for some S→ω ∈ P^R}. The other set contains the LL-associates of x (in G) and is
function General_LL(G^R = (V, T, P^R, S); w ∈ T*)
// w = a_1 a_2 ··· a_n, n ≥ 0, each a_i ∈ T
  PVS_LL(G, ε) := {ω | S→ω ∈ P^R}
  for i := 0 to n−1 do
    VS_LL(G, i:w) := ⇒_R*(PVS_LL(G, i:w))
    PVS_LL(G, i+1:w) := ⊣_{a_{i+1}}(VS_LL(G, i:w))
    if PVS_LL(G, i+1:w) = ∅ then Reject(w) fi
  od
  VS_LL(G, w) := ⇒_R*(PVS_LL(G, w))
  if ε ∈ VS_LL(G, w) then Accept(w) else Reject(w) fi
end
Figure 3.2 A General Top-Down Correct-Prefix Recognizer
defined by VS_LL(G, x) = {β ∈ V* | ω (⇒_R* ⊣)^{x^R} ⇒_R* β holds in G^R for some S→ω ∈ P^R}. By Theorems 3.33 and 3.39, VS_LL(G, x) = {β ∈ VS(G) | S ⇒* xβ^R holds in G}, which is precisely the set described in the previous paragraph. Input string w is recognized by computing PVS_LL(G, i:w) and VS_LL(G, i:w) as i ranges from 0 to len(w).
The set VS_LL(G, x) is equivalently expressed as {β ∈ V* | α ⇒_R* β holds in G^R for some α ∈ PVS_LL(G, x)}; this form explicitly reflects that VS_LL(G, x) is the reflexive-transitive closure of PVS_LL(G, x) under the ⇒_R relation. Thus, VS_LL(G, x) is computed by applying ⇒_R* to PVS_LL(G, x).
Given VS_LL(G, x) and a ∈ T, PVS_LL(G, xa) is determined from VS_LL(G, x) through an application of the ⊣_a relation since PVS_LL(G, xa) = {β ∈ V* | α ⊣_a β holds in G^R for some α ∈ VS_LL(G, x)}. Clearly, PVS_LL(G, x) and VS_LL(G, x) are both nonempty if and only if x ∈ PREFIX(G). The initialization step entails computing the primitive LL-associates of ε, i.e., PVS_LL(G, ε) = {ω | S→ω ∈ P^R}.
The conditions under which General_LL terminates are analogous to those of General_RR. If w ∈ L(G), then VS_LL(G, w) is the last set of LL-associates computed; after it is known, w is accepted since ε ∈ VS_LL(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, then there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVS_LL(G, x) is the first empty set computed by the recognizer. Otherwise, if w ∉ L(G) and w ∈ PREFIX(G) both hold, then VS_LL(G, w) is found not to contain ε. In either case, w is rejected.
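The iteration just described can be sketched in code over explicit finite string sets. The grammar is a hypothetical example (S → aSb | ab, so G^R has S → bSa | ba), chosen so that every right-hand side of G^R ends in a terminal; only for that reason do the ⇒_R* closures stay finite here. In general VS_LL(G, x) is an infinite regular set and must be represented by a finite automaton, so this is a sketch of the scheme's structure, not of its general implementation.

```python
# Sketch of General_LL over explicit (finite) string sets.  Toy grammar
# (an assumption, not from the text): G has S -> aSb | ab, so G^R has
# S -> bSa | ba.  Uppercase letters denote nonterminals.

GR_PRODS = {"S": ["bSa", "ba"]}                 # productions of G^R

def r_closure(strings):
    """VS_LL from PVS_LL: close a set under =>_R (expand trailing nonterminals)."""
    closed, frontier = set(), list(strings)
    while frontier:
        s = frontier.pop()
        if s in closed:
            continue
        closed.add(s)
        if s and s[-1] in GR_PRODS:             # last symbol is a nonterminal
            frontier.extend(s[:-1] + rhs for rhs in GR_PRODS[s[-1]])
    return closed

def chop(strings, a):
    """PVS_LL for the next prefix: the image of a set under ⊣_a."""
    return {s[:-1] for s in strings if s.endswith(a)}

def general_ll(w):
    """Accept w iff w is in L(G); reject as soon as a prefix leaves PREFIX(G)."""
    pvs = set(GR_PRODS["S"])                    # PVS_LL(G, epsilon)
    for a in w:
        vs = r_closure(pvs)                     # VS_LL(G, i:w)
        pvs = chop(vs, a)                       # PVS_LL(G, i+1:w)
        if not pvs:
            return False                        # the prefix is not in PREFIX(G)
    return "" in r_closure(pvs)                 # epsilon test after the loop
```

The two set operations per scanned symbol mirror the two assignments in the loop of Figure 3.2, and the final membership test for ε mirrors the if statement that follows it.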
The correctness of the General_LL recognition scheme is formally established in the following two lemmas.
Lemma 3.40 Let w ∈ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL accepts w.
Proof. Since every prefix of w is in PREFIX(G), PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w). Thus, the for loop of General_LL completes len(w) iterations. By assumption, w ∈ L(G), so ε ∈ VS_LL(G, w). Consequently, w is accepted by General_LL in the second if statement.
Lemma 3.41 Let w ∉ L(G) be arbitrary. If General_LL is applied to G^R and w, then General_LL rejects w.
Proof. There are two cases to consider depending on whether or not w ∈ PREFIX(G).
Case (i): w ∈ PREFIX(G). In this case, PVS_LL(G, i:w) and VS_LL(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w), so the for loop completes len(w) iterations. Since w ∉ L(G) by assumption, ε ∉ VS_LL(G, w). Therefore, w is rejected by General_LL in the if statement that follows the for loop.
Case (ii): w ∉ PREFIX(G). Let x ∈ T* be the unique string which is the longest prefix of w such that x ∈ PREFIX(G) holds. Let len(x) = m and note that 0 ≤ m < len(w). Since PVS_LL(G, i:x) and VS_LL(G, i:x) are nonempty for all i, 0 ≤ i ≤ m, the for loop of General_LL completes m iterations. During the (m+1)st iteration, PVS_LL(G, (m+1):w) = ∅ is computed. Therefore, w is rejected by General_LL in the first if statement.
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LL are identified in the following.
Theorem 3.42 Let G = (V, T, P, S) be an arbitrary grammar and x an arbitrary string over T. Then PVS_LL(G, x) and VS_LL(G, x) are regular languages.
Proof. The proof is by induction on len(x) = n. In particular, we show that PVS_LL(G, x) = PVP_RR(G^R, x^R) and VS_LL(G, x) = VP_RR(G^R, x^R). The proof is mostly an exercise in recalling definitions and putting them in the appropriate form.
Basis (n = 0). The following two equalities are obvious: (1) PVS_LL(G, ε) = {ω ∈ V* | S→ω ∈ P^R} = PVP_RR(G^R, ε); (2) VS_LL(G, ε) = {β ∈ V* | α ⇒_R* β holds in G^R for some α ∈ PVS_LL(G, ε)} = {β ∈ V* | α ⇒_R* β holds in G^R for some α ∈ PVP_RR(G^R, ε)} = VP_RR(G^R, ε).
Induction (n > 0). Let x = ya for some y ∈ T^{n−1} and a ∈ T. By the induction hypothesis, PVS_LL(G, y) = PVP_RR(G^R, y^R) and VS_LL(G, y) = VP_RR(G^R, y^R). Hence, PVS_LL(G, ya) = {β ∈ V* | α ⊣_a β holds in G^R for some α ∈ VS_LL(G, y)} = {β ∈ V* | α ⊣_a β holds in G^R for some α ∈ VP_RR(G^R, y^R)} = PVP_RR(G^R, ay^R) = PVP_RR(G^R, (ya)^R). Finally, VS_LL(G, ya) = {β ∈ V* | α ⇒_R* β holds in G^R for some α ∈ PVS_LL(G, ya)} = {β ∈ V* | α ⇒_R* β holds in G^R for some α ∈ PVP_RR(G^R, (ya)^R)} = VP_RR(G^R, (ya)^R). From Theorem 3.25, we conclude that PVS_LL(G, x) and VS_LL(G, x) are regular languages.
Discussion
A simple framework for describing general canonical top-down recognition was presented. The set-theoretic framework is based on two relations on strings, ⇒_R and ⊣. A key property of both of these relations is that they preserve regularity. The essence of general top-down recognition was captured in terms of computing the images of regular sets under these relations.
The definitions of the various objects of importance in the framework, namely sentences, suffixes and prefixes of sentences, right and left sentential forms, etc., were cast in terms of the ⇒_R and ⊣ relations. Consequently, it is a small step from these definitions to the recognition schemes that are based on them. In addition, the correctness of the recognizers is particularly easy to establish.
Given the impracticality of scanning input strings from right to left, it is worth reflecting on why strong rightmost derivations were chosen over strong leftmost derivations as a point of origin. If General_LL had been developed first, the evolution from General_LL to General_RR certainly would have been no more involved than the progression in the other direction. However, strong rightmost derivations were favored from the outset because viable prefixes are considerably more ingrained in the literature than are viable suffixes.5 In addition, the bottom-up left-to-right counterpart to General_RR that is developed in the next chapter is derived directly from General_RR. Considerable attention is devoted to this derivative of the General_RR recognition scheme in the rest of this work.
5 To date, we have yet to find a reference to Sippu and Soisalon-Soininen [38] in the literature.
CHAPTER IV
GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general bottom-up recognition is developed next. In particular, a general bottom-up recognition scheme that scans input strings from left to right is presented. The bottom-up left-to-right character of the recognition scheme, called General_LR, intimates that it is an inverse of General_RR. Indeed, General_LR is directly derived from General_RR through inverses of the R-derives and chop relations. Consequently, General_LR also exploits certain regularity properties of context-free grammars.
In keeping with Chapter III, some formal aspects of general bottom-up recognition are examined in a set-theoretic framework. Later chapters affect a less abstract character; specifically, General_LR is cast into concrete terms, viz., state-transition graphs and finite-state automata. Ultimately, a general bottom-up parser based on General_LR is described.
An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Bottom-Up Left-to-Right Recognition
In a bottom-up approach to recognition, an attempt is made to construct a parse tree for an input string, perhaps implicitly, by starting from the leaves and working toward the root. A basic step in the upward synthesis of a parse tree involves grafting together the roots of one or more subtrees into a larger subtree. Suppose that the collection of these subtrees is represented by the string of grammar symbols which label their roots. A grafting operation may be described in terms of applying the inverse of the ⇒ relation to this linearized form of the partially constructed parse tree. That is, the occurrence of a production right-hand side in this string is replaced by (or reduced to) the corresponding left-hand side nonterminal symbol; this symbol labels the root of the subtree produced by the grafting operation. By performing reductions according to the inverse of the ⇒_rm relation instead, a canonical left-to-right order is imposed on the parse tree construction process.
However, an alternative to the inverse of the ⇒_rm relation is provided by inverses of the ⇒_R and ⊣ relations. The inverse of ⇒_R is used to represent reversed strong rightmost derivations. The inverse of ⊣ introduces terminal symbols at the right end of strings. These two inverse relations cooperate to mimic reversed rightmost derivations.
Reversed Rightmost Derivations
The reduce relation (⊨) is the inverse of the R-derives relation, i.e., ⇒_R^{-1} = ⊨; it is formally defined by ⊨ = {(αω, αA) | α ∈ V*, A→ω ∈ P}. The shift relation (⊢) is the inverse of the chop relation, i.e., ⊣^{-1} = ⊢; thus, ⊢ = {(α, αa) | α ∈ V*, a ∈ T}. For each a ∈ T, ⊢_a denotes the subrelation of ⊢ with range V*a. More specifically, for α, β ∈ V* and a ∈ T, α ⊢_a β if and only if α ⊢ β and β = αa.
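The one-step images of these two relations are easy to compute mechanically. The sketch below uses a hypothetical toy grammar (S → aSb | ab) and the unrestricted definitions just given; the restriction to VP(G) introduced later in the chapter is deliberately omitted here.

```python
# One-step images of the reduce (|=) and shift (|-_a) relations in their
# unrestricted form.  The grammar S -> aSb | ab is an illustrative
# assumption, not from the dissertation.  Uppercase = nonterminal.

PRODS = [("S", "aSb"), ("S", "ab")]

def reduce_image(s):
    """{beta | s |= beta}: replace a trailing right-hand side by its left-hand side."""
    return {s[: len(s) - len(rhs)] + lhs
            for lhs, rhs in PRODS if s.endswith(rhs)}

def shift_image(s, a):
    """{beta | s |-_a beta}: append the terminal a."""
    return {s + a}
```

For example, the string aab reduces only at its suffix ab, giving aS, while aSb reduces as a whole right-hand side to S.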
For the most part, the results in this chapter are obtained through simple manipulations of relational expressions. Two equalities on relational expressions that are regularly used in these transformations are recorded in the following.
Fact 4.1 Let R and S be binary relations on V*, i.e., R, S ⊆ V* × V*. Then the following two statements hold: (1) (R*)^{-1} = (R^{-1})*; (2) (R S)^{-1} = S^{-1} R^{-1}.
Some useful applications of Fact 4.1 include the following.
(1) (⇒_R*)^{-1} = (⇒_R^{-1})* = ⊨*;
(2) (⇒_R* ⊣)^{-1} = ⊣^{-1} (⇒_R*)^{-1} = ⊢ ⊨*;
(3) ((⇒_R* ⊣)* ⇒_R*)^{-1} = (⇒_R*)^{-1} ((⇒_R* ⊣)^{-1})* = ⊨* (⊢ ⊨*)*.
Despite the appearance of ⊢ ⊨* in the last construct of both (2) and (3), the relation product (⊨* ⊢) is more appropriate to our needs. Indeed, since relation composition is associative, the following equivalence holds: ⊨* (⊢ ⊨*)* = (⊨* ⊢)* ⊨*.
The interpretation of the relation product (⊨* ⊢) is explicitly described as follows. For α, β ∈ V*, α (⊨* ⊢) β holds in G if and only if α ⊨* γ ⊢ γa = β holds in G for some γ ∈ V* and a ∈ T. This is expressed more neatly as α (⊨* ⊢)_a β. The notation relevant to the reflexive-transitive closure of this product is as follows. For all α ∈ V*, α (⊨* ⊢)^ε α holds in G; for α, β, γ ∈ V*, x ∈ T^{n−1} with n ≥ 1, and a ∈ T, if α (⊨* ⊢)^x β and β (⊨* ⊢)_a γ hold in G, then α (⊨* ⊢)^{xa} γ holds in G. If α (⊨* ⊢)^x β holds in G for some α, β ∈ V* and x ∈ T^n, n ≥ 0, any of the expressions α (⊨* ⊢)^x β, α (⊨* ⊢)^n β, or α (⊨* ⊢)* β may be used to denote this if the string x or its length n is not relevant.
The following lemma compares relational expressions involving the ⇒_R and ⊣ relations with relational expressions involving the ⊨ and ⊢ relations.
Lemma 4.1 For α, β ∈ V* and x ∈ T*, α (⇒_R* ⊣)^x ⇒_R* β holds in G if and only if β (⊨* ⊢)^x ⊨* α holds in G.
Proof. First suppose that x = ε. By definition, both (⇒_R* ⊣)^ε and (⊨* ⊢)^ε are equivalent to the identity relation on V*. Thus, the following statements are equivalent.
(1) α (⇒_R* ⊣)^ε ⇒_R* β;
(2) α ⇒_R* β;
(3) β (⇒_R*)^{-1} α;
(4) β ⊨* α; and
(5) β (⊨* ⊢)^ε ⊨* α.
Now let x = a_1 a_2 ··· a_n, n ≥ 1. The following statements are equivalent in this case.
(1) α (⇒_R* ⊣)^x ⇒_R* β;
(2) β ((⇒_R* ⊣)^x ⇒_R*)^{-1} α;
(3) β (⇒_R*)^{-1} ((⇒_R* ⊣)^x)^{-1} α;
(4) β ⊨* (⊢_{a_1} ⊨*)(⊢_{a_2} ⊨*) ··· (⊢_{a_n} ⊨*) α;
(5) β (⊨* ⊢_{a_1})(⊨* ⊢_{a_2}) ··· (⊨* ⊢_{a_n}) ⊨* α; and
(6) β (⊨* ⊢)^x ⊨* α.
The next two lemmas demonstrate how reversed rightmost derivations are represented by the ⊨ and ⊢ relations.
Lemma 4.2 For α, β ∈ V* and x ∈ T*, if α (⊨* ⊢)^x ⊨* β holds in G, then β ⇒_rm* αx holds in G.
Proof. By Lemma 4.1, the hypothesis implies that β (⇒_R* ⊣)^x ⇒_R* α holds in G. It follows from Lemma 3.9 that β ⇒_rm* αx holds in G.
Lemma 4.3 For α, β ∈ V*, let α ⇒_rm* β hold in G. Furthermore, let β = γx for some γ ∈ V* and x ∈ T* such that γ ∈ V*N if β ∈ V*NT* and γ = ε otherwise (i.e., x is the longest suffix of β consisting solely of terminal symbols). Then γ (⊨* ⊢)^x ⊨* α holds in G.
Proof. The hypothesis and its conditions imply that α (⇒_R* ⊣)^x ⇒_R* γ holds in G (see Lemma 3.12). Therefore, γ (⊨* ⊢)^x ⊨* α holds in G by Lemma 4.1.
Lemma 4.4 L(G) = {w ∈ T* | ε (⊨* ⊢)^w ⊨* S holds in G}.
Proof. This is a consequence of Lemmas 4.2 and 4.3.
The following connection is established between PREFIX(G) and the ⊨ and ⊢ relations.
Lemma 4.5 PREFIX(G) ⊆ {x ∈ T* | ε (⊨* ⊢)^x α holds in G for some α ∈ V*}.
Proof. Let x ∈ PREFIX(G) be arbitrary. The corollaries to Theorem 3.13 together with the assumption that G is reduced yield that β (⇒_R* ⊣)^x ⇒_R* ε holds in G for some β ∈ V*. By Lemma 4.1, ε (⊨* ⊢)^x ⊨* β also holds in G. Finally, this last expression implies that ε (⊨* ⊢)^x α holds in G for some α ∈ V*.
The set inclusion of the preceding lemma is almost invariably proper. For example, consider the grammar with production set P = {S→a}. Although this grammar generates only {a}, ε (⊨* ⊢)^{a^i} α holds in G for some α ∈ V* and every i ≥ 0. In fact, equality holds in Lemma 4.5 only for grammars which have an empty terminal alphabet.
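The over-generation behind this proper inclusion can be observed directly. The sketch below works out the text's own example, P = {S→a}, with the unrestricted relations: for every i, ε (⊨* ⊢)^{a^i} reaches a nonempty set of strings, even though only ε and a are prefixes of sentences. The function names are illustrative assumptions.

```python
# The proper inclusion of Lemma 4.5 made concrete for P = {S -> a}.
# With the unrestricted reduce and shift relations, strings are reachable
# after every number of shifts, although PREFIX(G) = {epsilon, a}.

PRODS = [("S", "a")]

def reduce_closure(strings):
    """Close a set of strings under the unrestricted reduce relation |=."""
    closed, frontier = set(), list(strings)
    while frontier:
        s = frontier.pop()
        if s in closed:
            continue
        closed.add(s)
        for lhs, rhs in PRODS:
            if s.endswith(rhs):
                frontier.append(s[: len(s) - len(rhs)] + lhs)
    return closed

def reachable_after(i):
    """Strings beta with epsilon (|=* |-_a)^i beta, unrestricted relations."""
    current = {""}
    for _ in range(i):
        current = {s + "a" for s in reduce_closure(current)}
    return current
```

After two shifts the reachable strings are aa and Sa, neither of which is a viable prefix of the grammar; the sets never become empty, so unrestricted shifting alone cannot detect that a^i ∉ PREFIX(G) for i ≥ 2.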
Viable Prefixes Revisited
Lemma 4.5 suggests that the reduce and shift relations, as defined, are inadequate as a basis for general bottom-up correct-prefix recognition. Indeed, the source of their deficiency is revealed when they are examined under the guise of viable prefixes.
First, recall that VP(G) is closed with respect to ⇒_R and ⊣. Formally, a string α ∈ V* is a viable prefix of G if and only if ω (⇒_R ∪ ⊣)* α holds in G for some S→ω ∈ P. The complementary situation that exists with respect to the ⊨ and ⊢ relations is investigated in the next series of lemmas.
Lemma 4.6 For α, β ∈ V*, if α ⊨ β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The contrapositive of this implication is proven, so we assume that β ∈ VP(G). Since α ⊨ β holds in G, β ⇒_R α also holds. By Lemma 3.14, this implies that α ∈ VP(G).
Corollary For α, β ∈ V*, if α ⊨ β holds in G and β ∈ VP(G), then α ∈ VP(G).
Lemma 4.7 For α, β ∈ V*, if α ⊢ β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. The proof is similar to that of Lemma 4.6. Lemma 3.16 is relevant in this case.
Corollary For α, β ∈ V*, if α ⊢ β holds in G and β ∈ VP(G), then α ∈ VP(G).
Lemma 4.8 For α, β ∈ V*, if α (⊨ ∪ ⊢)* β holds in G and α ∉ VP(G), then β ∉ VP(G).
Proof. Since α (⊨ ∪ ⊢)* β holds in G by assumption, α (⊨ ∪ ⊢)^n β holds for some n ≥ 0. Applying Lemmas 4.6 and 4.7, this lemma is proven by induction on n.
Corollary For α, β ∈ V*, if α (⊨ ∪ ⊢)* β holds in G and β ∈ VP(G), then α ∈ VP(G).
By Lemma 4.8, V* \ VP(G) is closed with respect to the ⊨ and ⊢ relations. The effect on Lemma 4.1 of this complementary closure property is addressed in the following.
Lemma 4.9 For α, β ∈ V* and x ∈ T*, if α ∈ VP(G) and α (⇒_R* ⊣)^x ⇒_R* β holds in G, then β (⊨* ⊢)^x ⊨* α holds in G even when ⊨ and ⊢ are restricted to VP(G).
Proof. By assumption, α is a viable prefix of G and α (⇒_R* ⊣)^x ⇒_R* β holds in G. From Lemma 4.1, β (⊨* ⊢)^x ⊨* α also holds in G. That this latter expression holds when ⊨ and ⊢ are restricted to VP(G) follows from Lemma 4.8 and its corollary.
Our immediate goal is to describe general bottom-up left-to-right recognition as the inverse of general top-down right-to-left recognition with the viable prefix being the central unifying concept. From that standpoint, it is undesirable for the reduce and shift relations to stray outside of VP(G). Consequently, these two relations are redefined to explicitly restrict them to VP(G) as follows: ⊨ = {(αω, αA) | α ∈ V*, A→ω ∈ P, αA ∈ VP(G)} and ⊢ = {(α, αa) | α ∈ V*, a ∈ T, αa ∈ VP(G)}. From the closure result of Lemma 4.8, restricting the ranges of these two relations to VP(G) effectively restricts their domains to VP(G) as well. Henceforth, these new restricted versions of ⊨ and ⊢ are in effect at all times.
Lemma 4.10 VP(G) = {α ∈ V* | ε (⊨* ⊢)^x ⊨* α holds in G for some x ∈ T*}.
Proof. Since the ⊨ and ⊢ relations are restricted to VP(G), it is clear that any string α ∈ V* such that ε (⊨* ⊢)^x ⊨* α holds in G for some x ∈ T* is a viable prefix of G. In order to show that every viable prefix of G is similarly produced, let α be an arbitrary member of VP(G). From Theorem 3.20, ω (⇒_R* ⊣)^z ⇒_R* α holds in G for some S→ω ∈ P and z ∈ T*. Since G is reduced, α (⇒_R* ⊣)^x ⇒_R* ε holds in G for some x ∈ T* (implying xz ∈ L(G)). It follows from Lemma 4.9 that ε (⊨* ⊢)^x ⊨* α holds in G.
Corollary L(G) = {w ∈ T* | ε (⊨* ⊢)^w ⊨* ω holds in G for some S→ω ∈ P}.
Corollary PREFIX(G) = {x ∈ T* | ε (⊨* ⊢)^x α holds in G for some α ∈ V*}.
Finally, the following lemma motivates, ex post facto, the relation product (⊨* ⊢).
Lemma 4.11 For α ∈ VP(G), at least one of the following two statements is true: (1) α ⊨* β ⊢ βa holds in G for some β ∈ V* and a ∈ T; (2) α ⊨* ω holds in G for some S→ω ∈ P.
Proof. By Theorem 3.20, ω (⇒_R* ⊣)^z ⇒_R* α holds in G for some S→ω ∈ P and z ∈ T*. By Lemma 4.9, α (⊨* ⊢)^z ⊨* ω also holds in G. If z = ε, then α (⊨* ⊢)^ε ⊨* ω holds, which demonstrates that statement (2) is true. Otherwise, z = ay for some a ∈ T and y ∈ T*. In this case, α (⊨* ⊢)_a γ (⊨* ⊢)^y ⊨* ω holds in G for some γ ∈ V*. This last expression implies that α ⊨* β ⊢ βa = γ holds for some β ∈ V*, so statement (1) is true.
General Bottom-Up Correct-Prefix Recognition
Now that ⊨ and ⊢ are defined as inverses, albeit restricted, of ⇒_R and ⊣, respectively, the transition from General_RR to General_LR is completed by also inverting the direction in which an input string w ∈ T* is scanned. Accordingly, the essence of General_LR is that all of the reversed rightmost derivations of w ∈ T* are followed in parallel.
Once again, there are theoretical limits on the precision to which this task may be carried out; that is, it is not possible to pursue exclusively the reversed rightmost derivations of w in the general case. Instead, at the point where a prefix x of w has been processed, all reversed rightmost derivations (from ε) of all strings in xT* ∩ L(G) are followed (i.e., all sentences that have x as a prefix).
As in the top-down recognition schemes, regularity-preserving operations on regular subsets of VP(G) are the key to General_LR. Correct-prefix recognition is performed, i.e., the membership in PREFIX(G) of an incrementally longer prefix of w is ascertained as w is scanned from left to right. Given a prefix x of w, the inclusion of x in PREFIX(G) is determined from the set {α ∈ VP(G) | α ⇒_rm* x holds in G}. This set is nonempty if and only if x ∈ PREFIX(G), and it contains ω for some S→ω ∈ P if and only if x ∈ L(G). Figure 4.1 presents a high-level description of General_LR; a more detailed discussion follows.
function General_LR(G = (V, T, P, S); w ∈ T*)
// w = a_1 a_2 ··· a_n, n ≥ 0, each a_i ∈ T
  PVP_LR(G, ε) := {ε}
  for i := 0 to n−1 do
    VP_LR(G, i:w) := ⊨*(PVP_LR(G, i:w))
    PVP_LR(G, i+1:w) := ⊢_{a_{i+1}}(VP_LR(G, i:w))
    if PVP_LR(G, i+1:w) = ∅ then Reject(w) fi
  od
  VP_LR(G, w) := ⊨*(PVP_LR(G, w))
  if ω ∈ VP_LR(G, w) for some S→ω ∈ P then Accept(w) else Reject(w) fi
end
Figure 4.1 A General Bottom-Up Correct-Prefix Recognizer
Let x ∈ T* be an arbitrary string. The primitive LR-associates of x (in G) are defined by PVP_LR(G, x) = {α ∈ VP(G) | ε (⊨* ⊢)^x α holds in G}. Clearly, PVP_LR(G, ε) = {ε}. The LR-associates of x (in G) are defined by VP_LR(G, x) = {α ∈ VP(G) | ε (⊨* ⊢)^x ⊨* α holds in G}. By Lemma 4.2, this set is equivalent to {α ∈ VP(G) | α ⇒_rm* x holds in G}.
An input string w ∈ T* is recognized by General_LR through the computation of PVP_LR(G, i:w) and VP_LR(G, i:w) as i ranges from 0 to len(w). The process terminates when either an empty set is produced or the input string is exhausted. Analogous to the top-down recognition schemes, the relationships between VP_LR(G, x) and PVP_LR(G, x), and between PVP_LR(G, xa) and VP_LR(G, x) are significant. Specifically, for x ∈ T* and a ∈ T, VP_LR(G, x) = {β ∈ VP(G) | α ⊨* β holds in G for some α ∈ PVP_LR(G, x)} = ⊨*(PVP_LR(G, x)) and PVP_LR(G, xa) = {β ∈ VP(G) | α ⊢_a β holds in G for some α ∈ VP_LR(G, x)} = ⊢_a(VP_LR(G, x)).
The conditions for termination are analogous to those for General_RR and General_LL. Given an input string w ∈ T*, first suppose that w ∈ L(G). In this case, VP_LR(G, w) is the last set of LR-associates computed by General_LR; after it is completed, w is accepted based on the fact that ω ∈ VP_LR(G, w) for some S→ω ∈ P if and only if w ∈ L(G). Alternatively, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVP_LR(G, x) is the first empty set computed by the recognizer. On the other hand, suppose that w ∉ L(G) and w ∈ PREFIX(G) both hold. In this case, it is discovered that ω ∉ VP_LR(G, w) for any S→ω ∈ P. In either case, the input string is rejected by General_LR.
The correctness of General_LR is recorded more formally in the next two lemmas.
Lemma 4.12 Let w ∈ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR accepts w.
Proof. From earlier results, PVP_LR(G, i:w) and VP_LR(G, i:w), 0 ≤ i ≤ len(w), are all nonempty. Thus, the for loop of General_LR completes len(w) iterations. Since w ∈ L(G) by assumption, ω ∈ VP_LR(G, w) for some S→ω ∈ P. Therefore, w is accepted by General_LR in the second if statement.
Lemma 4.13 Let w ∉ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR rejects w.
Proof. There are two cases to consider according to whether or not w is in PREFIX(G).
Case (i): w ∈ PREFIX(G). In this case, PVP_LR(G, i:w) and VP_LR(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w), so the for loop of General_LR completes len(w) iterations. Since w ∉ L(G) by assumption, ω ∉ VP_LR(G, w) for any S→ω ∈ P. Therefore, w is rejected by General_LR in the if statement that follows the for loop.
Case (ii): w ∉ PREFIX(G). Let x ∈ T* be the unique string which is the longest prefix of w such that x ∈ PREFIX(G) holds. Let len(x) = m and note that 0 ≤ m < len(w). For all i, 0 ≤ i ≤ m, PVP_LR(G, i:x) and VP_LR(G, i:x) are nonempty, so the for loop of General_LR completes m iterations. During the (m+1)st iteration, PVP_LR(G, (m+1):w) = ∅ is computed. Therefore, w is rejected by General_LR in the if statement enclosed within the for loop.
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LR are identified in this section. Specifically, for an arbitrary string x ∈ T*, PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Lemma 4.14 Relation ⊨* is regularity-preserving.
Proof. Let G = (V, T, P, S) be an arbitrary grammar and let L be an arbitrary regular subset of VP(G). Define the regular canonical system C = (V, Π) such that Π = {(ξω, ξA) | A→ω ∈ P}. Since ⇒_C is defined on V* and ⊨ is defined on VP(G) ⊆ V*, ⊨ is a subrelation of ⇒_C. By Fact 3.1, L′ = r(L, C, {ε}) is a regular language. Since regular languages are closed under intersection, L′ ∩ VP(G) is regular. Clearly, ⊨*(L) ⊆ L′ ∩ VP(G) holds, since ⊨ is a subrelation of ⇒_C that is restricted to VP(G). The converse inclusion, viz., L′ ∩ VP(G) ⊆ ⊨*(L), is obtained by applying the corollary to Lemma 4.6. Specifically, for α ∈ L and β ∈ L′ ∩ VP(G), if α ⇒_C* β holds in C, then α ⊨* β holds in G. Thus, ⊨*(L) = L′ ∩ VP(G), so ⊨* is regularity-preserving.
Lemma 4.15 Relation ⊢_a is regularity-preserving.
Proof. Let G = (V, T, P, S) be a grammar, a a terminal symbol in T, and L an arbitrary regular subset of VP(G). Since regular languages are closed under concatenation, La is a regular language. However, La may contain some strings which are not viable prefixes of G. This is rectified by intersecting La with VP(G). Since regular languages are also closed under intersection, La ∩ VP(G) is regular. Clearly, αa ∈ La is contained in La ∩ VP(G) if and only if α ∈ L and αa ∈ VP(G) (i.e., α ⊢_a αa holds in G). Thus, ⊢_a(L) = La ∩ VP(G), so ⊢_a is regularity-preserving.
Theorem 4.16 Let G = (V, T, P, S) be an arbitrary grammar and let x be an arbitrary string over T. Then PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Proof. Applying Lemmas 4.14 and 4.15 and noting that PVP_LR(G, ε) = {ε} is regular, the theorem is proven by induction on len(x).
Discussion
A simple description of general left-to-right bottom-up recognition was presented. The General_LR recognition scheme was derived from General_RR by defining the inverses of ⇒_R and ⊣, restricting them to VP(G), reversing the direction in which the input string is scanned, and manipulating some relational expressions. The two inverse relations, ⊨ and ⊢, preserve regularity. Thus, the essence of general left-to-right bottom-up recognition was captured in terms of computing the images of regular subsets of VP(G) under these relations.
Together, the results in Chapters III and IV provide a succinct and elegant characterization of general context-free recognition. This was accomplished by starting from two binary relations on strings and applying basic set-theoretic concepts. There was no need to resort to automata, although automata are certainly useful for implementing the abstract recognizers. In short, the formal development contained in these two chapters provides a framework, founded on a minimal number of kernel concepts, within which the intrinsic properties of general canonical context-free recognizers may be further investigated.
The denotations "RR", "LL", and "LR" that pervade Chapters III and IV were suggested by Knuth [28], where the following deterministic context-free grammar classes and the methods of their analysis are enumerated:
RR(k): scan from right to left, deduce rightmost derivations;
LL(k): scan from left to right, deduce leftmost derivations;
LR(k): scan from left to right, deduce reversed rightmost derivations; and
RL(k): scan from right to left, deduce reversed leftmost derivations.
Here, k ≥ 0 indicates the length of the lookahead strings used. Note that the use of these denotations is meant to evince a generalization of the respective parsing methods rather than a generalization of the grammatical classes. A corresponding General_RL recognition scheme is not included here. To mesh with the other recognition schemes, it would utilize the ⊨ and ⊢ relations defined in terms of G^R. Images of regular subsets of VS(G) under these relations would be tracked by General_RL as an input string is scanned from right to left.
The General_RR recognition scheme was developed primarily as a stepping stone to General_LL and General_LR. General_RR is given little attention in the remaining chapters. Consequently, VP_LR(G, x) (resp. PVP_LR(G, x)) is hereafter denoted VP(G, x) (resp. PVP(G, x)). Similarly, VS(G, x) (resp. PVS(G, x)) is used to denote VS_LL(G, x) (resp. PVS_LL(G, x)).
CHAPTER V
ON EARLEY'S ALGORITHM
In this chapter, Earley's general context-free recognizer is examined and its relationship to the General_LR and General_LL recognition schemes is ascertained. In particular, a modified version of Earley's recognizer is presented which builds a state-transition graph in addition to the state sets that are constructed by Earley's original algorithm. Analyses of certain properties of the resulting STG reveal parallels between Earley's algorithm and the General_LR and General_LL recognizers. Throughout this chapter, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a_1 a_2 ··· a_{n+1}, n ≥ 0, where a_{n+1} = $ and a_i ∈ T \ {$} for 1 ≤ i ≤ n, are assumed.
Earley's General Recognizer
Recall that A→α·β is an item of G whenever there is a production of the form A→αβ in P. The bracketed pair [A→α·β, j], where A→α·β is an item of G and j is a natural number, is called an Earley state of G (or state, for short). Earley's algorithm, in recognizing w with respect to G, constructs a sequence of sets of Earley states S_i, 0 ≤ i ≤ n+1. The sets are constructed in order of increasing i beginning with S_0. Thus, set S_i is constructed only after all sets S_j with 0 ≤ j < i are in place.
Each S_i is initialized to a finite set of states which we denote by basis(S_i). For 0 ≤ i ≤ n+1, basis(S_i) is defined as follows.
basis(S_i) = {[S′→·S$, 0]} if i = 0; basis(S_i) = {[A→αa_i·β, j] | [A→α·a_iβ, j] ∈ S_{i−1}} if 1 ≤ i ≤ n+1.
The lone state in basis(S_0), [S′→·S$, 0], is called the initial state; it will be denoted by s_0. For i > 0, basis(S_i) is constructed by the Earley Scanner function.
A state-set closure function, informally called S_Closure, completes the construction of a set of Earley states. That is, for 0 ≤ i ≤ n, S_i = S_Closure(basis(S_i)). Since S_{n+1} = basis(S_{n+1}), there is no need to apply S_Closure to basis(S_{n+1}).
S_i = S_Closure(basis(S_i)) if 0 ≤ i ≤ n; S_i = basis(S_i) if i = n+1.
For 0 ≤ i ≤ n, S_Closure(basis(S_i)) is defined as the smallest set of states which satisfies the following three rules.
(1) Every state in basis(S_i) is in S_i.
(2) If [A→α·Bβ, j] is in S_i, then for all B→ω ∈ P, [B→·ω, i] is in S_i.
(3) If [B→ω·, j] is in S_i, then for all [A→α·Bβ, k] in S_j, [A→αB·β, k] is in S_i.
The states added to S_i by rules (2) and (3) above correspond to the states that are spawned by the Earley Predictor and Completer functions, respectively. Thus, S_Closure embodies both of these functions. The number of states added to S_i during its closure is finite; after all possible states are added, we say that S_i is closed.
Figure 5.1 presents Earley's general context-free recognizer in terms of the notation defined above. A Scanner function is assumed which computes basis(S_{i+1}) from S_i and a_{i+1}, 0 ≤ i ≤ n.
function Earley(G = (V, T, P, S); w ∈ T*)
// w = a_1 a_2 ··· a_{n+1}, n ≥ 0, a_{n+1} = $, a_i ∈ T \ {$}, 1 ≤ i ≤ n
  basis(S_0) := {[S′→·S$, 0]}
  for i := 0 to n do
    S_i := S_Closure(basis(S_i))
    basis(S_{i+1}) := Scanner(S_i, a_{i+1})
    if basis(S_{i+1}) = ∅ then Reject(w) fi
  od
  S_{n+1} := basis(S_{n+1})
  Accept(w)
end
Figure 5.1 Earley's General Recognizer
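A compact runnable rendering of Figure 5.1 follows. The $-augmented grammar is a hypothetical example, states are tuples (lhs, rhs, dot, j) standing for [A→α·β, j], and the worklist loop plays the role of S_Closure; an ε-free grammar is assumed, since nullable productions require the well-known extra care in the Completer.

```python
# Runnable sketch of the recognizer in Figure 5.1 for a toy $-augmented
# grammar (an assumption, not from the text): S' -> S$, S -> aSb | ab.
# A state (lhs, rhs, dot, j) encodes the Earley state [A -> alpha . beta, j].

GRAMMAR = {"S'": [("S", "$")], "S": [("a", "S", "b"), ("a", "b")]}

def earley(w):
    word = list(w) + ["$"]                        # w = a_1 ... a_n $
    sets = [set() for _ in range(len(word) + 1)]
    sets[0].add(("S'", ("S", "$"), 0, 0))         # initial state [S' -> .S$, 0]
    for i in range(len(word) + 1):
        work = list(sets[i])
        while work:                               # S_Closure of basis(S_i)
            lhs, rhs, dot, j = work.pop()
            if dot < len(rhs) and rhs[dot] in GRAMMAR:       # Predictor
                for prod in GRAMMAR[rhs[dot]]:
                    s = (rhs[dot], prod, 0, i)
                    if s not in sets[i]:
                        sets[i].add(s); work.append(s)
            elif dot == len(rhs):                            # Completer
                for l2, r2, d2, k in list(sets[j]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        s = (l2, r2, d2 + 1, k)
                        if s not in sets[i]:
                            sets[i].add(s); work.append(s)
        if i < len(word):                                    # Scanner
            sets[i + 1] = {(lhs, rhs, dot + 1, j)
                           for lhs, rhs, dot, j in sets[i]
                           if dot < len(rhs) and rhs[dot] == word[i]}
            if not sets[i + 1]:
                return False                  # basis(S_{i+1}) empty: reject
    return ("S'", ("S", "$"), 2, 0) in sets[-1]   # [S' -> S$., 0] present?
```

Rejection on an empty basis set reproduces the correct-prefix behavior discussed below: the recognizer stops at the shortest prefix that is not in PREFIX(G).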
For 0 ≤ i ≤ n+1, if [A→α·β, j] ∈ S_i, then 0 ≤ j ≤ i must hold. The second component of an Earley state is called a backpointer. Thus, note that in rule (2) of the S_Closure function, the states spawned by the Predictor carry backpointer i.
For 0 ≤ i ≤ n+1, S_i ≠ ∅ if and only if i:w ∈ PREFIX(G); on this basis, we may state that Earley's algorithm is a correct-prefix recognizer. Moreover, w ∈ L(G) if and only if S_{n+1} = {[S′→S$·, 0]}. Conversely, w ∉ L(G) if and only if either S_i = ∅ for some i, 0 ≤ i ≤ n+1, or S_{n+1} ≠ {[S′→S$·, 0]}; in the former case, the smallest such i identifies the shortest prefix of w that is not in PREFIX(G), while S_j ≠ ∅ for every j such that j:w ∈ PREFIX(G) holds.
The correctness of Earleys algorithm is based on the criteria which places a state in a
particular state set [6], In that regard, the following statements are made.
Fact 5.1 For 0
a=**a;+1a;+2 ai and S'=$*6A'y hold in G for some 6,76 V"* such that axa2 o;
holds in G.
Facts 5.2 and 5.3 below ascribe bottomup and topdown interpretations, respectively,
to Fact 5.1.
Fact 5.2 For 0 ≤ j ≤ i ≤ n+1, [A → α·β, j] ∈ S_i if and only if
α ⇒_r* a_{j+1}a_{j+2}...a_i and S' ⇒_r* δAy hold in G for some δ ∈ V* and y ∈ T* such that
δ ⇒_r* a_1a_2...a_j holds in G.
Note that δ ∈ VP(G, j:w) and δα ∈ VP(G, i:w). We say that [A → α·β, j] ∈ S_i is valid
for δα ∈ VP(G, i:w); in particular, [A → ·αβ, j] ∈ S_j is valid for δ ∈ VP(G, j:w). If α ≠ ε also
holds, then we say that [A → α·β, j] ∈ S_i properly cuts δα ∈ VP(G, i:w).
Fact 5.3 For 0 ≤ j ≤ i ≤ n+1, [A → α·β, j] ∈ S_i if and only if
α ⇒* a_{j+1}a_{j+2}...a_i and S' ⇒* a_1a_2...a_j Aδ hold in G for some δ ∈ V*.
In this case, note that (Aδ)^R ∈ VS(G, j:w) and (βδ)^R ∈ VS(G, i:w). We say that
[A → α·β, j] ∈ S_i is valid for (βδ)^R ∈ VS(G, i:w); in particular, [A → ·αβ, j] ∈ S_j is valid for
(Aδ)^R ∈ VS(G, j:w).
A Modified Earley Recognizer
A modified version of Earley's recognizer, called Earley', is described next. Earley'
differs from Earley's algorithm in that it constructs a state-transition graph. The STG con-
structed by Earley' is denoted by G_E' = (Q_E', V, δ_E'). The states in Q_E' are the Earley states
that are generated by Earley's algorithm. The state transitions in δ_E' are described below.
In recognizing w with respect to G, Earley' builds the same sequence of state sets as
Earley's algorithm. In addition, a sequence of state-transition sets, viz., E_i for 0 ≤ i ≤ n+1,
is constructed. These sets are also constructed in order of increasing i. In particular, E_i is
constructed concurrently with S_i. For 0 ≤ i ≤ n+1, each transition in E_i is a triple
(s, X, t) where s ∈ S_j for some j, 0 ≤ j ≤ i, t ∈ S_i, and X ∈ V ∪ {ε}.
A particular set of state transitions E_i is constructed analogously to S_i. That is, (1) E_i
is initialized to a finite set of transitions denoted by basis(E_i), and (2) a transition-set closure
function, called E_Closure, is applied to basis(E_i) to complete the construction of E_i. For
0 ≤ i ≤ n+1,

      basis(E_i) = ∅                                                               if i = 0
                 = {(s, a_i, t) | s = [A → α·a_iβ, j] ∈ S_{i−1}, t = [A → αa_i·β, j] ∈ S_i}   if 1 ≤ i ≤ n+1

Note that basis(E_i) where i > 0 is determined from S_{i−1}, S_i, and a_i; basis(E_0) is a special
case. For i > 0, the transitions in basis(E_i) may be installed by a slightly modified Earley
Scanner function.
For 0 ≤ i ≤ n, the E_Closure function completes the construction of E_i. Similar to
S_{n+1}, E_{n+1} = basis(E_{n+1}). That is,

      E_i = E_Closure(basis(E_i))   if 0 ≤ i ≤ n
          = basis(E_i)              if i = n+1

For 0 ≤ i ≤ n, E_Closure(basis(E_i)) is defined as the smallest set of transitions
which satisfies the following three rules.
(1) Every transition in basis(E_i) is in E_i.
(2) If s = [A → α·Bβ, j] is in S_i, then for all B → ω ∈ P, (s, ε, t) is in E_i where
t = [B → ·ω, i] ∈ S_i.
(3) If [B → ω·, j] is in S_i, then for all s = [A → α·Bβ, k] in S_j, (s, B, t) is in E_i where
t = [A → αB·β, k] ∈ S_i.
Transitions added to E_i by rules (2) and (3) above correlate closely with the states that are
generated by the Predictor and Completer functions, respectively.
A high-level description of Earley' is given in Figure 5.2. In that figure, we assume (1) a
generalized Closure function which concurrently constructs S_i and E_i, 0 ≤ i ≤ n, after they
are initialized to basis(S_i) and basis(E_i), respectively, and (2) a modified Scanner function
which computes basis(S_{i+1}) and basis(E_{i+1}), 0 ≤ i ≤ n. The correctness of Earley' follows
from the well-established correctness of Earley's original algorithm.
function Earley' (G = (V, T, P, S); w ∈ T*)
   // w = a_1a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $
   basis(S_0), basis(E_0) := {[S' → ·S$, 0]}, ∅
   for i := 0 to n do
      (S_i, E_i) := Closure(basis(S_i), basis(E_i))
      (basis(S_{i+1}), basis(E_{i+1})) := Scanner(S_i, a_{i+1})
      if basis(S_{i+1}) = ∅ then Reject(w) fi
   od
   S_{n+1}, E_{n+1} := basis(S_{n+1}), basis(E_{n+1})
   Accept(w)
end

Figure 5.2 A Modified Earley Recognizer
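Under the same illustrative (p, d, j) encoding used earlier, the modified recognizer of Figure 5.2 need only record a transition each time the Scanner, Predictor, or Completer spawns a state. The sketch below is ours; transition labels use None for ε, and, as before, completions of nullable nonterminals within a single state set are not handled with full generality.

```python
def earley_prime(grammar, start, w):
    """Figure 5.2 sketch: recognize w and build the Earley state graph.
    Returns (S, E) on acceptance, None on rejection."""
    nonterms = {lhs for lhs, _ in grammar}
    n1 = len(w)
    S = [set() for _ in range(n1 + 1)]
    E = set()                        # transitions (source, label, target)
    basis = {(p, 0, 0) for p, (lhs, _) in enumerate(grammar) if lhs == start}

    def closure(i, seed):
        states, work = set(seed), list(seed)
        while work:
            s = work.pop()
            p, d, j = s
            lhs, rhs = grammar[p]
            if d < len(rhs) and rhs[d] in nonterms:       # E_Closure rule (2)
                for q, (l2, _) in enumerate(grammar):
                    if l2 == rhs[d]:
                        t = (q, 0, i)
                        E.add((s, None, t))               # epsilon transition
                        if t not in states:
                            states.add(t); work.append(t)
            elif d == len(rhs):                           # E_Closure rule (3)
                for s2 in list(S[j] if j != i else states):
                    p2, d2, k = s2
                    if d2 < len(grammar[p2][1]) and grammar[p2][1][d2] == lhs:
                        t = (p2, d2 + 1, k)
                        E.add((s2, lhs, t))
                        if t not in states:
                            states.add(t); work.append(t)
        return states

    for i in range(n1):
        S[i] = closure(i, basis)
        basis = set()
        for s in S[i]:                                    # modified Scanner
            p, d, j = s
            rhs = grammar[p][1]
            if d < len(rhs) and rhs[d] == w[i]:
                t = (p, d + 1, j)
                basis.add(t)
                E.add((s, w[i], t))                       # basis(E_{i+1})
        if not basis:
            return None                                   # Reject(w)
    S[n1] = basis
    return S, E                                           # Accept(w)
```

The pair (S, E) is exactly the Earley state graph: the union of the S_i gives its states and E its (reversed-path-free) transitions, rooted at the initial state.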
The STG G_E' is informally called the Earley state graph. When the Earley state graph
is complete, G_E' = (Q_E', V, δ_E') where Q_E' = ∪_{0≤i≤n+1} S_i and δ_E' = ∪_{0≤i≤n+1} E_i.
As every state in G_E' is reachable from the initial state s_0 = [S' → ·S$, 0], s_0 is also called
the root of G_E'. A path in G_E' which begins at the root is called a rooted path in G_E'.
Earley's Algorithm and Viable Prefixes
Let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Earley' to
G and w. In this section, the strings over V that are spelled by rooted paths in G_E' are
analyzed. It transpires that the string spelled by an arbitrary rooted path in G_E' is a viable
prefix of G. Moreover, the string spelled by a rooted path in G_E' which terminates at a state
in S_i, 0 ≤ i ≤ n+1, is a member of VP(G, i:w). These results are in keeping with the
bottom-up interpretation of Earley's algorithm as exemplified by Fact 5.2.
Lemma 5.1 For 0 ≤ j ≤ i ≤ n+1, let s = [A → α·β, j] be a
state in S_i. Then every path of length len(α) to s in G_E' spells α.
Proof. The proof is by induction on len(α) = m.
Basis (m = 0). In this case α = ε, so j = i and s = [A → ·β, j]. The unique path of length 0 to
s in G_E' is denoted by (s). By definition, this trivial path spells ε.
Induction (m > 0). Since len(α) > 0, α = α'X for some α' ∈ V* and X ∈ V, i.e.,
s = [A → α'X·β, j]. Thus, s was added to S_i by either the Scanner or the Completer. In
either case, every transition to s in G_E' is of the form (r, X, s) such that
r = [A → α'·Xβ, j] ∈ S_{i'} for some i', j ≤ i' ≤ i. By the induction
hypothesis, every path of length len(α') to r in G_E' spells α'. Consequently, every path of
length len(α) to s in G_E' spells α'X = α.
Corollary For 0 ≤ j ≤ i ≤ n+1, let s = [A → α·β, j] be a
state in S_i. Then every path of length len(α) to s in G_E' begins at [A → ·αβ, j] ∈ S_j.
Lemma 5.2 Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that p spells
γ ∈ V* and s_m = [A → α·β, j] ∈ S_i for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then
γ ∈ VP(G, i:w) and [A → α·β, j] ∈ S_i is valid for γ.
Proof. By Lemma 5.1 and its Corollary, γ = δα for some δ ∈ V*; we show by induction on m
that S' ⇒_r* δAy and δ ⇒_r* a_1a_2...a_j hold in G for some y ∈ T*.
Basis (m = 0). Thus, i = 0, s_m = s_0 = [S' → ·S$, 0] ∈ S_0, and γ = ε. The consequent trivially
holds in this case.
Induction (m > 0). Two cases are analyzed based on whether or not α = ε.
Case (i): α = ε. In this case, j = i and s_m was added to S_i by the Predictor. Thus, s_{m−1} =
[B → σ·Aτ, j'] ∈ S_i for some B → σAτ ∈ P and j', 0 ≤ j' ≤ i; let p' = (s_0, s_1, ..., s_{m−1}).
Clearly p' also spells γ. By Fact 5.2, σ ⇒_r* a_{j'+1}a_{j'+2}...a_i holds in G. By the induction
hypothesis, γ = δσ for some δ ∈ V* such that S' ⇒_r* δBy and δ ⇒_r* a_1a_2...a_{j'} hold in G for
some y ∈ T*. That is, γ ∈ VP(G, i:w) and [B → σ·Aτ, j'] ∈ S_i is valid for γ. Since
δBy ⇒_r* δσAxy holds in G for some x ∈ T*, [A → ·β, i] ∈ S_i is also valid for γ ∈ VP(G, i:w).
Case (ii): α ≠ ε. Therefore, s_m was added to S_i by either the Scanner or the Completer, i.e.,
α = α'X for some α' ∈ V* and X ∈ V. Let s_{m−1} = [A → α'·Xβ, j] ∈ S_{i'} for some i', j ≤ i' ≤ i,
and let p' = (s_0, s_1, ..., s_{m−1}). By Lemma 5.1, p' spells δα' for some δ ∈ V* such that
δα'X = δα = γ. By Fact 5.2, α' ⇒_r* a_{j+1}a_{j+2}...a_{i'} holds in G. Therefore, by the induction
hypothesis, S' ⇒_r* δAy and δ ⇒_r* a_1a_2...a_j hold in G for some y ∈ T*. That is,
δα' ∈ VP(G, i':w) and [A → α'·Xβ, j] ∈ S_{i'} is valid for δα'. If X ∈ T, then X = a_i and i' = i−1.
If X ∈ N, then X ⇒* a_{i'+1}a_{i'+2}...a_i holds in G. Consequently, γ ∈ VP(G, i:w) and
[A → α·β, j] ∈ S_i is valid for γ.
Corollary Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that p spells
γ ∈ V* and s_m = [A → α·β, j] ∈ basis(S_i) for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1.
Then γ ∈ PVP(G, i:w).
Proof. If m = 0, then γ = ε and i = 0. By definition, ε ∈ PVP(G, 0:w). Otherwise, suppose
that m > 0. Since s_m ∈ basis(S_i), the last transition in p is on a_i ∈ T, i.e., i > 0 and γ = γ_1 a_i
for some γ_1 ∈ V*. Therefore, γ ∈ PVP(G, i:w).
The next lemma provides the converse to Lemma 5.2.
Lemma 5.3 Let γ be a string in VP(G, i:w) and let [A → α·β, j] ∈ S_i be a state which
is valid for γ for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there exists a rooted path
in G_E' to [A → α·β, j] ∈ S_i which spells γ.
Proof. This lemma appears to be rather more difficult than Lemma 5.2 to prove rigorously.
In lieu of a formal proof, an intuitive argument is given. First the following observations are
made.
(1) Every state which is valid for γ is in S_i. Otherwise, a contradiction of Fact 5.2
would result.
(2) If γ ≠ ε, then there is some state s ∈ S_i such that s is valid for γ and s properly
cuts γ. In particular, Earley states that are added by the Scanner or Completer
properly cut the viable prefixes that they are valid for.
(3) If γ ≠ ε, then for each state s ∈ S_i which is valid for γ there is a state r ∈ S_i such
that (i) r is also valid for γ, (ii) r properly cuts γ, and (iii) there exists a path in
G_E' from r to s which spells ε.
Given these observations, an informal inductive argument proceeds as follows, where the
induction is on len(γ).
Basis (len(γ) = 0). For each state s ∈ S_0 which is valid for ε ∈ VP(G, 0:w), there exists a
rooted path in G_E' to s which spells ε.
Induction (len(γ) > 0). Let γ = γ'X for some γ' ∈ V* and X ∈ V. By points (2) and (3) above,
we may assume that [A → α·β, j] ∈ S_i properly cuts γ, i.e., α = α'X for some α' ∈ V*. Let
s = [A → α'X·β, j] ∈ S_i. For every i', j ≤ i' ≤ i, such that α' ⇒_r* a_{j+1}a_{j+2}...a_{i'} and
X ⇒* a_{i'+1}a_{i'+2}...a_i hold in G, [A → α'·Xβ, j] is in S_{i'}. Pick one such i' (there must be at
least one) and let r = [A → α'·Xβ, j] ∈ S_{i'}. By the induction hypothesis, γ' ∈ VP(G, i':w), r is
valid for γ', and there exists a rooted path to r in G_E' which spells γ'. When s is added to S_i
by either the Scanner or Completer, the transition (r, X, s) is installed in G_E'. Therefore,
there exists a rooted path in G_E' to s which spells γ.
Theorem 5.4 For 0 ≤ i ≤ n+1, define G_{E',i} = (∪_{0≤j≤i} S_j, V, ∪_{0≤j≤i} E_j) and let
M_{E',i} = (G_{E',i}, s_0, S_i) denote an NFA. Then L(M_{E',i}) = VP(G, i:w).
Proof. This theorem follows from Lemmas 5.2 and 5.3.
Corollary For 0 ≤ i ≤ n+1, define
G_{E',i,b} = (∪_{0≤j<i} S_j ∪ basis(S_i), V, ∪_{0≤j<i} E_j ∪ basis(E_i)) and let
M_{E',i,b} = (G_{E',i,b}, s_0, basis(S_i)) denote an NFA. Then
L(M_{E',i,b}) = PVP(G, i:w).
Theorem 5.4 and its Corollary establish a direct relationship between Earley' and the
General_LR recognition scheme. Indeed, Earley' prescribes one possible approach to realizing
an implementation of General_LR. Note that the foregoing analysis of G_E' provides a
constructive proof that for arbitrary x ∈ T*, VP(G, x) and PVP(G, x) are regular languages.
Earley's Algorithm and Viable Suffixes
The last section considered strings in V* that are spelled by rooted paths in G_E'. The
string spelled by a path in G_E' is determined directly from the grammar symbols that label
the transitions in that path. In this section, another string over V is associated with a path
in G_E', viz., a string that is derived from the states in that path. Specifically, the state
derivative of a path in G_E' is defined recursively by the state_derivative function given in
Figure 5.3.
function state_derivative ((s_0, s_1, ..., s_m))
   // (s_0, s_1, ..., s_m), m ≥ 0, is a path in G_E'.
   if m = 0 then // Let s_0 = [A → α·β, j].
      return(β^R)
   else if s_0 = [A → α·Xβ, j] and s_1 = [A → αX·β, j] then // (s_0, X, s_1) ∈ δ_E'
      return(state_derivative((s_1, s_2, ..., s_m)))
   else // Let s_0 = [A → α·Bβ, j] and s_1 = [B → ·ω, i], i.e., (s_0, ε, s_1) ∈ δ_E'.
      return(β^R (state_derivative((s_1, s_2, ..., s_m))))
   fi
end

Figure 5.3 The Definition of the State Derivative of a Path
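The recursion of Figure 5.3 can be rendered directly. The sketch below is ours: it assumes the illustrative (p, d, j) state encoding over an indexed production list, and takes the grammar as an explicit argument since β must be read off the production of each state.

```python
def state_derivative(grammar, path):
    """Figure 5.3 sketch: the state derivative of a path of Earley states,
    each encoded as (production index, dot position, backpointer)."""
    p, d, j = path[0]
    beta = grammar[p][1][d:]                  # the part after the dot
    if len(path) == 1:
        return tuple(reversed(beta))          # return beta^R
    s1 = path[1]
    if s1[0] == p and s1[1] == d + 1 and s1[2] == j:
        # (s_0, X, s_1): the dot moved over one symbol X
        return state_derivative(grammar, path[1:])
    # otherwise (s_0, epsilon, s_1): s_0 = [A -> alpha . B beta', j],
    # so contribute beta'^R (drop the B at beta[0]) and recurse
    return tuple(reversed(beta[1:])) + state_derivative(grammar, path[1:])
```

For instance, with the $-augmented grammar S' → S$, S → (S) | (), the trivial path at the root [S' → ·S$, 0] has state derivative $S, as in the basis of Lemma 5.5 below.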
Again, let G_E' = (Q_E', V, δ_E') be the Earley state graph that results from applying Ear-
ley' to G and w. It transpires that the state derivative of an arbitrary rooted path in G_E' is
a viable suffix of G. Moreover, the state derivative of a rooted path in G_E' which terminates
at a state in S_i, 0 ≤ i ≤ n+1, is a member of VS(G, i:w). These results are in keeping with
the top-down interpretation given to Earley's algorithm by Fact 5.3.
Lemma 5.5 Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that γ ∈ V*
is the state derivative of p and s_m = [A → α·β, j] ∈ S_i for some A → αβ ∈ P and i, j,
0 ≤ j ≤ i ≤ n+1. Then γ ∈ VS(G, i:w) and [A → α·β, j] ∈ S_i is valid for γ.
Proof. We show that γ = (βδ)^R for some δ ∈ V* such that S' ⇒* a_1a_2...a_j Aδ holds in G.
The proof is by induction on m.
Basis (m = 0). Thus, i = 0, s_m = s_0 = [S' → ·S$, 0] ∈ S_0, and γ = $S. By definition,
$S ∈ VS(G, 0:w) and s_0 is clearly valid for $S.
Induction (m > 0). Two cases are analyzed, based on whether or not α = ε.
Case (i): α = ε. In this case, j = i and s_m = [A → ·β, i] was added to S_i by the Predictor. Let
s_{m−1} = [B → σ·Aτ, j'] ∈ S_i for some B → σAτ ∈ P and j', 0 ≤ j' ≤ i. The state
derivatives of p' = (s_0, s_1, ..., s_{m−1}) and p are (Aτδ)^R and (βτδ)^R = γ, respectively, for some
δ ∈ V*. By Fact 5.3, σ ⇒* a_{j'+1}a_{j'+2}...a_i holds in G. By the induction hypothesis,
S' ⇒* a_1a_2...a_{j'} Bδ holds in G. That is, (Aτδ)^R ∈ VS(G, i:w) and [B → σ·Aτ, j'] ∈ S_i is
valid for (Aτδ)^R. Clearly, S' ⇒* a_1a_2...a_i Aτδ also holds in G. Thus,
(βτδ)^R = γ ∈ VS(G, i:w) and [A → ·β, i] is valid for γ.
Case (ii): α ≠ ε. Thus, s_m was added to S_i by either the Scanner or the Completer, i.e.,
α = α'X for some α' ∈ V* and X ∈ V. Let s_{m−1} = [A → α'·Xβ, j] ∈ S_{i'} for some i', j ≤ i' ≤ i,
and let p' = (s_0, s_1, ..., s_{m−1}). The state derivatives of p' and p are (Xβδ)^R and (βδ)^R = γ,
respectively, for some δ ∈ V*. By Fact 5.3, α' ⇒* a_{j+1}a_{j+2}...a_{i'} holds in G. By the induc-
tion hypothesis, S' ⇒* a_1a_2...a_j Aδ holds in G, so (Xβδ)^R ∈ VS(G, i':w) and
[A → α'·Xβ, j] ∈ S_{i'} is valid for (Xβδ)^R. If X ∈ T, then X = a_i and i' = i−1. If X ∈ N, then
X ⇒* a_{i'+1}a_{i'+2}...a_i holds in G. In either case, α ⇒* a_{j+1}a_{j+2}...a_i holds in G. There-
fore, (βδ)^R = γ ∈ VS(G, i:w) and [A → α·β, j] ∈ S_i is valid for γ.
Corollary Let p = (s_0, s_1, ..., s_m), m ≥ 0, be a rooted path in G_E' such that γ ∈ V* is
the state derivative of p and s_m = [A → α·β, j] ∈ basis(S_i) for some A → αβ ∈ P and i, j,
0 ≤ j ≤ i ≤ n+1. Then γ ∈ PVS(G, i:w).
Proof. If m = 0, then i = 0 and s_m = s_0 = [S' → ·S$, 0] ∈ basis(S_0). Thus, the state deriva-
tive of p is $S = γ, which is in PVS(G, 0:w) by definition. If m > 0, then i > 0,
s_{m−1} = [A → α'·a_iβ, j] ∈ S_{i−1} and s_m = [A → α'a_i·β, j] for some α' ∈ V*, i.e., α = α'a_i. The
state derivatives of p' = (s_0, s_1, ..., s_{m−1}) and p are (a_iβδ)^R and (βδ)^R = γ, respectively, for
some δ ∈ V*. By Lemma 5.5, (a_iβδ)^R ∈ VS(G, i−1:w), so γ ∈ PVS(G, i:w).
The next lemma provides the converse to Lemma 5.5.
Lemma 5.6 Let γ be a string in VS(G, i:w) and let [A → α·β, j] ∈ S_i be a state which
is valid for γ for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n+1. Then there exists a rooted path
in G_E' to [A → α·β, j] with state derivative γ.
Proof. A rigorous proof of this lemma has so far eluded us. Consequently, a very informal
intuitive argument is given instead. A more convincing proof is left for future work.
Observe that the basic result provided by Lemmas 5.2 and 5.3 is a graphical interpreta-
tion of Fact 5.2 in terms of certain properties of G_E'. In turn, the goal of Lemmas 5.5 and
5.6 is a graphical interpretation of Fact 5.3 in terms of certain other properties of G_E'.
Consider VP(G, i:w) and VS(G, i:w) for some i, 0 ≤ i ≤ n+1. It has already been
established that γ ∈ V* is a member of VP(G, i:w) if and only if there is a rooted path in G_E'
to some state in S_i which spells γ. Lemma 5.5 showed that γ ∈ V* is a member of
VS(G, i:w) if there is a rooted path in G_E' to some state in S_i with state derivative γ. It
would be rather counterintuitive and at variance with Fact 5.3 if the converse to the previ-
ous statement did not also hold. In fact, such a result would appear to subvert the generality
of Earley's algorithm.
In contrast to the case with General_LR, Lemmas 5.5 and 5.6 establish a more covert
relationship between Earley' and General_LL. This is in keeping with the relative complex-
ity of the definitions of the spelling of a path and its state derivative.
Discussion
A graphical variant of Earley's algorithm was examined within the framework esta-
blished in the previous two chapters. In the process, some properties of Earley's algorithm
were identified and the efficacy of the General_LR and General_LL approaches to general
recognition was established. Earley's algorithm is an excellent vehicle for demonstrating the
effectiveness of General_LR and General_LL given that it is so well-known and highly
regarded.
The analyses contained in the previous two sections illustrated how the sets of viable
prefixes (resp. viable suffixes) tracked by General_LR (resp. General_LL) are explicitly
represented in the state-transition graph that is constructed by Earley'. As Earley' is a
direct descendant of Earley's algorithm, it is fair to conclude that these same sets are represented impli-
citly in the Earley state sets that are constructed by Earley's original algorithm. By viewing
Earley's algorithm from this novel perspective, its operation and correctness have been
explained at a level of abstraction that is closer to that necessary for capturing the essence of
general canonical recognition.
The structure of G_E' exhibits how Earley' subsumes both the General_LR and
General_LL recognition schemes. Clearly, Earley' embodies General_LR considerably more
directly than General_LL. In light of this, it is perhaps more apt to view Earley's algorithm
as a general bottom-up recognizer.
Practical aspects of the General_LR recognition scheme are examined further in the
next chapter, and Chapter VII extends it into a general parser. Thus, this chapter is transi-
tional in that it bridges the abstract treatment of general recognition presented in Chapters
III and IV with the concrete treatment of General_LR contained in Chapters VI and VII.
Attempts at deriving a general parser from General_LL were unsuccessful. Thus, an investi-
gation of the practical potential of General_LL is left for future work.
CHAPTER VI
A GENERAL BOTTOM-UP RECOGNIZER
In this chapter, a general bottom-up recognizer that is directly based on the
General_LR recognition scheme is presented. In particular, the algorithm constructs a graph
in such a way that the regular sets of viable prefixes manipulated by General_LR are
represented in this graph. Aside from complications that can arise due to nullable nontermi-
nals, the recognizer is extended into a general parser rather seamlessly (parsing is the subject
of the next chapter). Thus, in light of the algorithm's practical potential, several implemen-
tation issues are discussed. Throughout this chapter, an arbitrary reduced $-augmented
grammar G = (V, T, P, S) and an arbitrary string w = a_1a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$} for
1 ≤ i ≤ n, a_{n+1} = $, are assumed.
Control Automata and Recognition Graphs
The recognizer described in this chapter constructs a state-transition graph which we
call the recognition graph. The correctness of the algorithm is based on properties of this
graph. The recognition graph is constructed under the guidance of an FSA called the control
automaton. The control automaton is determined from the subject grammar G and is fixed
throughout the recognition process. In contrast, the recognition graph evolves during recog-
nition; its structure is derived from the control automaton and the input string w.
For simplicity, the LR(0) automaton of G is used as the control automaton for guiding
the recognition of w with respect to G; alternative control automata are suggested later.
The LR(0) automaton of G is a DFA which is based on the canonical collection of sets of
LR(0) items of G and the associated goto function [4,11]. Recall that each set is comprised
of kernel and closure items. The item S' → ·S$ is a kernel item, as are all items of the form
A → α·β such that α ≠ ε. With the exception of S' → ·S$, all items of the form A → ·ω are
closure items.
We denote the LR(0) automaton of G by M_C(G) = (I, V, goto, I_0, I) where
I = {I_0, I_1, ..., I_{m−1}} is the collection of sets of LR(0) items. The C subscript is a re-
minder that M_C(G) is used as the control automaton during recognition. For convenience,
we assume that S' → ·S$ ∈ I_0 and S' → S$· ∈ I_{m−1}; in fact, the latter assumption implies that
I_{m−1} = {S' → S$·}. A detailed accounting of M_C(G) is not needed to describe how it is used
to recognize strings. However, the following well-known facts about M_C(G) are useful.
(1) L(M_C(G)) = VP(G).
(2) Each I_j ∈ I\{I_0} has a unique entry symbol X ∈ V, i.e., the grammar symbol that
all transitions to I_j are made on. The entry symbol for I_j, j ≠ 0, is denoted by
entry(I_j). There are no transitions directed to I_0 in M_C(G), so entry(I_0) is not
defined.
(3) For I_j ∈ I, (i) if A → α·Xβ ∈ I_j, then A → αX·β ∈ I_k where I_k = goto(I_j, X);
(ii) if A → αX·β ∈ I_j, then A → α·Xβ ∈ I_k for all I_k ∈ pred(I_j, X); and
(iii) if A → α·β ∈ I_j and A ≠ S', then goto(I_k, A) is defined for all I_k ∈ pred(I_j, α).
Automaton M_C(G) is also denoted by M_C if G is understood.
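For reference, the canonical LR(0) construction underlying M_C(G) can be sketched as follows. The encoding and helper names are ours: an item A → α·β is a pair (p, d) of production index and dot position, and fact (2), the uniqueness of entry symbols, can be checked directly on the resulting goto function.

```python
def lr0_automaton(grammar, start):
    """Canonical collection of sets of LR(0) items and the goto function,
    for grammar given as a list of (lhs, rhs-tuple) pairs."""
    nonterms = {lhs for lhs, _ in grammar}

    def closure(items):
        items, work = set(items), list(items)
        while work:
            p, d = work.pop()
            rhs = grammar[p][1]
            if d < len(rhs) and rhs[d] in nonterms:   # closure items B -> . omega
                for q, (lhs, _) in enumerate(grammar):
                    if lhs == rhs[d] and (q, 0) not in items:
                        items.add((q, 0)); work.append((q, 0))
        return frozenset(items)

    I0 = closure({(p, 0) for p, (lhs, _) in enumerate(grammar) if lhs == start})
    collection, goto, work = {I0}, {}, [I0]
    while work:
        I = work.pop()
        for X in {grammar[p][1][d] for p, d in I if d < len(grammar[p][1])}:
            J = closure({(p, d + 1) for p, d in I
                         if d < len(grammar[p][1]) and grammar[p][1][d] == X})
            goto[(I, X)] = J
            if J not in collection:
                collection.add(J); work.append(J)
    return I0, collection, goto
```

Because item sets are returned as frozensets, they serve directly as hashable automaton states, which is also convenient for the recognition-graph states introduced next.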
The precise manner in which the recognition graph is constructed is the essence of the
algorithm described in the next section. Some general characteristics of recognition graphs
are described in the remainder of this section.
The recognition graph constructed under the guidance of M_C is denoted by
G_R(M_C) = (Q, V, δ). At the start of recognition, G_R(M_C) is set to an initial configuration.
Additional states and transitions are added to Q and δ, respectively, as the recognition
proceeds. The denotation G_R(M_C) is simplified to G_R whenever the intent is obvious.
Each state added to Q during recognition corresponds to a set of items I_j ∈ I of M_C,
0 ≤ j ≤ m−1, and to a position i in the input string w, 0 ≤ i ≤ n+1. The ith position of w immediately follows
a_i; the 0th position of w immediately precedes a_1. A subscript of j:i is used to denote the
state in Q that corresponds to I_j and input position i, e.g., q_{j:i}. The function ψ: Q → I is
defined to map a state in G_R to its associated set of items in M_C; thus, ψ(q_{j:i}) = I_j. For later
use, we define Q_i = {q_{j:i} ∈ Q}, 0 ≤ i ≤ n+1, i.e., Q_i consists of all states in Q that
correspond to input position i.
Similarly, each transition added to δ during recognition corresponds to a transition in
M_C. The members of δ are best described in terms of the goto function and the mapping ψ,
as follows: for p, q ∈ Q and X ∈ V, (p, X, q) ∈ δ only if goto(ψ(q), X) = ψ(p). Thus,
each transition in G_R corresponds to the reversal of a transition in M_C. Consequently, all of
the transitions out of a state p ∈ Q are on entry(ψ(p)). Valid transitions in G_R are also con-
strained by input position; specifically, (q_{k:i}, X, q_{j:h}) ∈ δ implies that h ≤ i. For later use,
we define δ_i = {(q_{j:i}, X, p) ∈ δ}, i.e., δ_i consists of all transitions in G_R that emanate from
states in Q_i.
The General_LR0 Recognizer
The general context-free recognizer, informally named General_LR0, is described in
this section. Concurrently, intuitive arguments for its correctness are presented. Establish-
ing the correctness of General_LR0 reduces to demonstrating that it is a faithful realization
of the General_LR recognition scheme, i.e., that the sets of viable prefix associates that
General_LR tracks are correctly represented in the graph constructed by General_LR0 as w
is scanned from left to right.
General_LR0 is described in terms of how it operates when it is applied to G and w.
Under the guidance of M_C(G), the LR(0) automaton of G, General_LR0 constructs a recog-
nition graph G_R(M_C). Some general notions about recognition graphs were introduced in the
last section. The description of General_LR0 that follows provides more specific details
about how G_R is derived from M_C and w. For reference, General_LR0 is rendered in pseu-
docode in Figure 6.1.
1.  function General_LR0 (G = (V, T, P, S); w ∈ T*)
2.     // w = a_1a_2 ... a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $.
3.     // Let M_C(G) = (I, V, goto, I_0, I) be the LR(0) automaton for G.
4.     // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
5.     Q, δ := {q_{0:0}}, ∅ // Initialize.
6.     // Let M_R = (G_R^{-1}, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
7.     for i := 0 to n do
8.        // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
9.        Reduce(i)
10.       // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.       Shift(i)
12.       // Let M_R = (G_R^{-1}, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.       if Q_{i+1} = ∅ then Reject(w) fi
14.    od
15.    // Let M_R = (G_R^{-1}, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.    Accept(w)
17. end

18. function Shift (i)
19.    Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.    while Q_subset ≠ ∅ do
21.       q := Remove(Q_subset) // Let goto(ψ(q), a_{i+1}) = I_j.
22.       if q_{j:i+1} ∉ Q then
23.          Q := Q ∪ {q_{j:i+1}}
24.       fi
25.       δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)} // Never redundant.
26.    od
27. end

Figure 6.1 The General_LR0 Recognizer
Throughout its evolution, the structure of G_R is paramount. Certain intermediate
stages in its construction hold particular interest. At each of these points, an FSA may be
defined in terms of G_R^{-1} which accepts one of the sets of viable prefix associates that is com-
puted by the General_LR recognition scheme. The FSA derived from G_R^{-1} is denoted by
M_R. The inverse of G_R is desired since each of its transitions is reversed from the orienta-
tion of the corresponding transition in M_C.
It is important to remember that G_R evolves continuously throughout the recognition
process. Consequently, G_R and M_R denote a different graph and automaton,
28. function Reduce (i)
29.    δ_subset := δ_i
30.    Traverse(Q_i, i)
31.    while δ_subset ≠ ∅ do
32.       (p, X, q) := Remove(δ_subset)
33.       for A → αX·β ∈ ψ(p) such that β ⇒* ε do
34.          for r ∈ succ(q, α^R) do // Let goto(ψ(r), A) = I_j.
35.             if q_{j:i} ∉ Q then
36.                Q := Q ∪ {q_{j:i}}
37.                Traverse({q_{j:i}}, i)
38.             fi
39.             if (q_{j:i}, A, r) ∉ δ then
40.                δ := δ ∪ {(q_{j:i}, A, r)}
41.                δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
42.             fi
43.          od
44.       od
45.    od
46. end

47. function Traverse (Q_subset, i)
48.    while Q_subset ≠ ∅ do
49.       q := Remove(Q_subset)
50.       for goto(ψ(q), A) = I_j such that A ⇒* ε do // A ∈ N
51.          if q_{j:i} ∉ Q then
52.             Q := Q ∪ {q_{j:i}}
53.             Q_subset := Q_subset ∪ {q_{j:i}}
54.          fi
55.          δ := δ ∪ {(q_{j:i}, A, q)} // Never redundant.
56.       od
57.    od
58. end

Figure 6.1 continued
respectively, at distinct stages of recognition. The makeup of G_R at any given time deter-
mines which regular set is recognized by M_R. The General_LR0 recognizer is best under-
stood through an appreciation of how it transforms G_R.
The General_LR0 recognizer is comprised of a main function (lines 1-17 in Figure 6.1)
and three auxiliary functions: Shift, Reduce, and Traverse. The Shift function (lines 18-27)
computes the shift relation whereas Reduce (lines 28-46) computes the closure of the reduce
relation. The Traverse function (lines 47-58) is called from within Reduce. It handles certain
transitions on nullable nonterminal symbols. A line-by-line description of the General_LR0
recognizer follows.
(Line 1) General_LR0 is supplied with two arguments, a reduced $-augmented grammar
G and a string w over the terminal alphabet of G.
(Lines 2-4) By assumption, w is terminated with $. For simplicity, we also assume that
the LR(0) automaton of G, M_C(G), is provided by some external agent.¹ Each of w, M_C,
and G_R is visible to the functions that require access to them.
(5-6) Graph G_R is initialized to contain the single state q_{0:0}. The comment in line 6
indicates that G_R^{-1} can be trivially embedded into an FSA that accepts PVP(G, ε) = {ε} at
this point. Henceforth, the following statement holds for G_R throughout the duration of
recognition. For q_{j:i} ∈ Q where 0 ≤ j ≤ m−1 and 0 ≤ i ≤ n+1, every path in G_R from q_{j:i} to
q_{0:0} (1) spells the reversal of a string in VP(G, i:w), and (2) corresponds to the reversal of a
path from I_0 to I_j in M_C. As seen below, even stronger statements may be made about G_R
at particular points during recognition.
(7) This for loop iterates once for each terminal symbol in w. Having i range from 0
to n rather than from 1 to n+1 yielded a cleaner expression of the algorithm. The rest of
the discussion primarily elaborates on the ith iteration of this for loop for some i, 0 ≤ i ≤ n.
(8-10) The comment in line 8 is both a loop invariant and a precondition of the Reduce
function. It clearly holds upon entry to the loop; the Reduce and Shift functions ensure that
it also holds at the start of each iteration. This condition can be alternately stated as fol-
lows. A string γ ∈ V* is a member of PVP(G, i:w) if and only if there is a path in G_R from
some state q ∈ Q_i to q_{0:0} which spells γ^R. The comment in line 10 is a postcondition of the
Reduce function and may be restated similarly; that is, a string γ ∈ V* is a member of
VP(G, i:w) if and only if there is a path in G_R from some state q ∈ Q_i to q_{0:0} which spells
γ^R. Assuming that the precondition holds when Reduce is called, the Reduce function
transforms G_R so that the postcondition holds.
¹ An alternative is for General_LR0 to construct M_C as an initial task.
(10-12) The postcondition of Reduce in line 10 is also a precondition of the Shift func-
tion. A postcondition of the Shift function is given in line 12 and is similar to the loop invari-
ant. However, in this case the following situation holds for G_R. A string γa_{i+1} ∈ V* is a
member of PVP(G, i+1:w) if and only if there is a path in G_R from some state q ∈ Q_{i+1} to
q_{0:0} which spells a_{i+1}γ^R. Assuming that the precondition holds when Shift is called, the Shift
function transforms G_R so that this postcondition holds.
(12-13) If Q_{i+1} = ∅ at this point, then M_R has no final states. Thus, PVP(G, i+1:w) =
∅ and i+1:w ∉ PREFIX(G). Consequently, w ∉ L(G), so General_LR0 rejects w.
(15-16) Line 15 expresses a postcondition of the for loop. It holds upon completion of
the nth iteration (i.e., when i = n) provided that the postcondition of Shift and Q_{i+1} ≠ ∅ both
hold at the end of that iteration. In this case, w ∈ L(G), so General_LR0 accepts w.
Before continuing with the description of General_LR0, the following important proper-
ties of LR(0) automata are reiterated. Let A → ω· ∈ I_j hold for some A → ω ∈ P with A ≠ S'
and I_j ∈ I. In addition, let δω be the spelling of an arbitrary path in M_C from I_0 to I_j for
some δ ∈ V*. Then δω reduces to δA in G. Now let A → α·aβ ∈ I_j hold for some A → αaβ ∈ P
and I_j ∈ I, and let δα be the spelling of an arbitrary path in M_C from I_0 to I_j. In this case
δα shifts on a to δαa in G. Based on the manner in which G_R is derived from M_C, these two
equivalence properties (i.e., the equivalence of paths from I_0 to I_j with respect to reduce and
shift actions) are preserved in G_R (i.e., all paths in G_R^{-1} from q_{0:0} to q_{j:i} are equivalent with
respect to shift and reduce actions). These equivalence properties are exploited by the Shift
and Reduce functions.
(11,18) The Shift function is called with i as an argument. This makes the relationship
between the values of i in General_LR0 and Shift explicit. The operation of the Shift func-
tion during its ith invocation from General_LR0 is described for some i, 0 ≤ i ≤ n.
(19) At this point, we know that Q_i cannot be empty. Otherwise, the input string
would have been rejected in an earlier iteration of the main for loop. The ith call to Shift
computes the shift on a_{i+1}.² Thus, we want to determine all states q ∈ Q_i for which there is
a transition on a_{i+1} from ψ(q) in M_C. The set variable called Q_subset is initialized to con-
tain these states.

² It is important to remember that i ranges from 0 to n.
(20) Each state in Q_subset is considered in turn. No additional states are added to
Q_subset within the while loop.
(21-25) A state q is removed from Q_subset. Since (ψ(q), a_{i+1}, I_j) is a transition in M_C,
we need to add q_{j:i+1} to Q and (q_{j:i+1}, a_{i+1}, q) to δ. It is possible that there is more than one
transition on a_{i+1} to I_j in M_C, so q_{j:i+1} may have been added to Q in an earlier iteration of
the while loop. This condition is checked in line 22 and q_{j:i+1} is added to Q only if it is
necessary. However, the transition (q_{j:i+1}, a_{i+1}, q) cannot already be in δ since there is only
one transition on a_{i+1} from ψ(q) in M_C. This transition is added to δ in line 25.
(27) By assumption, the precondition in line 10 holds when Shift is called. Based on the
manner in which certain paths in G_R are extended by the Shift function under the guidance
of M_C, the postcondition of Shift holds at this point.
The transformations of G_R made by Reduce are considerably more elaborate. This is
not unexpected, since Reduce computes the reflexive-transitive closure of a relation.
The operation of the Reduce function during its ith invocation from General_LR0 is
described for some i, 0 ≤ i ≤ n. During this invocation, Reduce adds states to Q_i and installs
transitions from states in Q_i to states in Q_j where 0 ≤ j ≤ i. In the general case, transitions
among states in Q_i warrant special treatment, as they can introduce cycles into the
recognition graph. These transitions, always made on nullable nonterminals, are handled
separately by the Traverse function.
(9,28) Like Shift, the Reduce function is supplied with i as an argument so that the
relationship between the values of i in General_LR0 and Reduce is explicit.
(29) At this point, each transition in δ_i may come from a state that calls for one or
more reductions. If i = 0, then there are no applicable transitions. If i > 0, the relevant
transitions were installed in G_R by Shift during the previous iteration of the main for loop of
General_LR0. In any case, a set variable called δ_subset is initialized to contain δ_i; it is cru-
cial that this assignment occur before Traverse is called.
(30) In short, Traverse creates certain paths to states in Q_i that spell strings of nullable
nonterminal symbols. Further discussion of the Traverse function is deferred until later.
The Reduce function can be understood independently of it.
(31) Each transition in δ_subset is considered in turn. All reductions relevant to the
source states of those transitions are performed. Additional transitions may be added to
δ_subset within this loop.
(32) A transition (p, X, q) is removed from δ_subset.
(33) The set of items ψ(p) determines what reductions, if any, are applicable to p. Any
kernel item of the form A → αX·β ∈ ψ(p) such that β ⇒* ε holds in G is relevant; that is, we
see through certain nullable suffixes of production right-hand sides. In effect, a reduction
from p by A → αXβ is performed. As described below, a path to p spelling β^R will have been
installed in G_R by an earlier call to the Traverse function. In this way, any cycles created in
Q_i by nullable nonterminals are left for Traverse to handle.
(34) At this point we are considering one particular reduction applicable to p, say
A → αX·β ∈ ψ(p) where β is nullable. This reduction is performed by traversing certain
paths in G_R from p that spell (αX)^R to locate the states in Q to which transitions on A
must be made. In particular, we want to traverse only those paths that start with the tran-
sition (p, X, q). Any other transition from p will have either already been reduced through
or else is in δ_subset waiting to be handled in a later iteration of the while loop. The states
of interest are given by succ(q, α^R). It is precisely this application of succ that motivates
reversing the transitions in G_R with respect to those in M_C.
(35-42) At this point we are dealing with one particular state r ∈ succ(q, α^R), and we assume that goto(ψ(r), A) = I_j for some I_j ∈ I. Thus, we need a state q_{j:i} in Q_i and a transition (q_{j:i}, A, r) in δ_i. Both of these objects may already exist in G_R, so they are conditionally created as indicated by the if statements. Incidentally, a transition is generated redundantly here as the result of an ambiguity. If the transition is indeed new, it is added to δ_subset; any relevant reductions from q_{j:i} are performed through this transition when it is removed from δ_subset in a later iteration of the while loop.
(46) The postcondition of Reduce holds at this point. To help establish this fact, a subset of VP(G, i:w), denoted by VP′(G, i:w), is defined as follows: (1) for i = 0, VP′(G, 0:w) = PVP(G, 0:w); (2) for 0 < i ≤ n, VP′(G, i:w) = PVP(G, i:w) ∪ {αX ∈ VP(G) | α ∈ VP′(G, j:w) for some 0 ≤ j < i and X ⇒+ a_{j+1}a_{j+2}···a_i holds in G}. For 0 ≤ i ≤ n, VP′(G, i:w) ⊆ VP(G, i:w) clearly holds. The states and transitions added to G_R directly by Reduce ensure that VP′(G, i:w) ⊆ L(M_R) holds. The contribution that Traverse makes to the transformation of G_R can be assessed by noting that VP(G, i:w) = {αβ ∈ VP(G) | α ∈ VP′(G, i:w), β ⇒* ε holds in G}. The Traverse function creates any additional states in Q_i and transitions among those states so that VP(G, i:w) \ VP′(G, i:w) ⊆ L(M_R) also holds. Together, the Reduce and Traverse functions guarantee that L(M_R) = VP(G, i:w).
(30,37,47) Traverse deals solely with nullable nonterminals and productions with nullable right-hand sides. In lines 30 and 37, Traverse is called with a nonempty subset of Q_i as an argument which becomes associated with the set variable called Q_subset. Traverse has the effect of transforming G_R as if all sequences of reductions by productions that have nullable right-hand sides are carried out from the states in Q_subset. However, a transformation of G_R that produces the same result can be derived from a simple traversal of M_C. By adopting this alternative approach, complications that can arise due to cycles in G_R are avoided. Consider the states I_k ∈ I such that ψ(q) = I_k for some q ∈ Q_subset, and traverse M_C beginning from these states along all transitions that are made on nullable nonterminals. The states and transitions encountered in this traversal are exactly those which would arise from performing the reduction sequences described above. Consequently, counterparts for all of the states and transitions encountered in this traversal are created in G_R. Thus, a particular subgraph of M_C is effectively embedded in Q_i by this process. The specific subgraph is determined by the composition of Q_subset when Traverse is called.
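The bounded traversal of M_C that underlies Traverse can be sketched directly. The following Python fragment is our illustration, not the dissertation's code; it assumes the control automaton is given as a set of (source, symbol, target) triples and the nullable nonterminals as a set of symbols:

```python
def traverse(goto_edges, nullable, start_states):
    """Visit the states of the control automaton reachable from
    start_states along transitions made on nullable nonterminals,
    considering each state at most once (so cycles are harmless)."""
    seen = set(start_states)
    work = list(start_states)
    while work:
        state = work.pop()
        for (src, sym, tgt) in goto_edges:
            if src == state and sym in nullable and tgt not in seen:
                seen.add(tgt)
                work.append(tgt)
    return seen
```

Because each state enters `seen` at most once, the traversal terminates even when the nullable transitions form a cycle, mirroring the argument made above.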
(48) Each state in Q_subset is considered in turn. Additional states may be added to Q_subset within the loop.
(49) A state q is removed from Q_subset.
(50) All transitions from ψ(q) in M_C that are made on some nullable nonterminal A are relevant. Let goto(ψ(q), A) = I_j be one such transition.
(51-55) We need a state q_{j:i} in Q_i and a transition (q_{j:i}, A, q) in δ_i. This state may already exist in Q_i, so it is conditionally created. If q_{j:i} is indeed new, it is added to Q_subset; the traversal will resume from q_{j:i} when it is removed from Q_subset in a later iteration of the while loop. However, the transition (q_{j:i}, A, q) is never generated redundantly; the discipline imposed by the graph traversal ensures that the transitions from each state encountered are considered at most once.
If the two calls to Traverse are removed from the Reduce function and the line 9a. Traverse(Q_i, i) is added to General_LR0 following line 9, an equivalent transformation of G_R results, i.e., one that satisfies the condition stated in line 10. In this way, Traverse becomes a postprocessor of Reduce. However, for the purposes of parsing it is more appropriate to call Traverse from within Reduce as we have done in Figure 6.1. This will become evident in the next chapter when General_LR0 is extended into a general parser.
That General_LR0 correctly implements the General_LR recognition scheme may be established by induction on i. This induction depends, in turn, on proving that the Reduce (resp. Shift) function correctly transforms G_R such that the postcondition in line 10 (resp. line 12) holds if the precondition in line 8 (resp. line 10) holds before the function is called. Although the Shift and Reduce functions are not formally proven correct, it is expected that the above detailed explanation of General_LR0 provides sufficient intuitive evidence toward that end.
Earley's Algorithm Revisited
A general recognizer that operates strikingly similarly to Earley′ is obtained by modifying General_LR0 to use a particular nondeterministic variant of the LR(0) automaton for G as a control automaton. The alternate control automaton, the modified algorithm, and its relationship to Earley′ are briefly discussed in this section.
Alternate Control Automata
The nondeterministic LR(0) (or NLR(0)) automaton of G [24, p. 250] is denoted here by M_NC(G) = (I, V, goto, I_0, F) where
(1) I = {I_0, I_1, ..., I_{m-1}} = {{A→α·β} | A→αβ ∈ P},
(2) goto({A→α·Xβ}, X) = {A→αX·β}, and
(3) {B→·ω} ∈ goto({A→α·Bβ}, ε) for each B→ω ∈ P.
In this case, we prescribe that I_0 = {S′→·S$} and I_{m-1} = {S′→S$·}. Again, M_NC(G) is simplified to M_NC when G is understood. If the standard subset construction algorithm for converting NFAs to DFAs is applied to M_NC, the (deterministic) LR(0) automaton of G is obtained, i.e., M_C(G).
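As a concrete illustration of clauses (1) through (3), the following sketch builds the item states and transitions of an NLR(0) automaton in Python. The grammar encoding is our assumption: productions as (lhs, rhs) pairs with rhs a tuple of symbols, and the empty string standing in for the ε label.

```python
def nlr0(productions, nonterminals):
    """States are single LR(0) items (lhs, rhs, dot); transitions are
    (item, label, item) triples, with "" standing for epsilon."""
    items = [(A, rhs, d)                    # clause (1): one state per item
             for (A, rhs) in productions
             for d in range(len(rhs) + 1)]
    goto = []
    for (A, rhs, d) in items:
        if d < len(rhs):
            X = rhs[d]
            goto.append(((A, rhs, d), X, (A, rhs, d + 1)))  # clause (2)
            if X in nonterminals:                           # clause (3)
                for (B, w) in productions:
                    if B == X:
                        goto.append(((A, rhs, d), "", (B, w, 0)))
    return items, goto
```

Applying the standard subset construction to the triples returned here would yield the deterministic M_C(G), as the text notes.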
Some functions related to succ and pred are needed for navigating through NLR(0) automata and the recognition graphs derived from them. Toward that end, let G_0 = (Q, Σ, δ) be an STG. The Σsucc and Σpred functions, both of type Q × Σ* → 2^Q, are defined recursively as follows.
(1) For q ∈ Q, Σsucc(q, ε) = Σpred(q, ε) = {q};
(2) for p ∈ Q, a ∈ Σ, and x ∈ Σ*,
Σsucc(p, xa) = {r ∈ Q | q ∈ Σsucc(p, x), (q, a, r) ∈ δ} and
Σpred(p, ax) = {r ∈ Q | q ∈ Σpred(p, x), (r, a, q) ∈ δ}.
Thus, Σsucc and Σpred effectively ignore ε-transitions. Note that if G_0 is ε-free, then Σsucc (resp. Σpred) is identical to succ (resp. pred). The esucc and epred functions, both of type Q → 2^Q, are defined for dealing with ε-transitions. For p ∈ Q, esucc(p) = {q ∈ Q | (p, ε, q) ∈ δ} and epred(p) = {q ∈ Q | (q, ε, p) ∈ δ}. All four of these functions extend to subsets of Q in the usual fashion.
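These definitions translate almost verbatim into code. Below is a minimal sketch of our own in Python, with the STG given as a set of (source, label, target) triples and the empty string standing for ε:

```python
def sigma_succ(delta, p, x):
    """Sigma-succ: states reached from p by spelling the label sequence x,
    following exactly the non-epsilon transitions that x names."""
    states = {p}                  # clause (1): Sigma-succ(q, epsilon) = {q}
    for a in x:                   # clause (2), applied one symbol at a time
        states = {r for (q, lbl, r) in delta if q in states and lbl == a}
    return states

def esucc(delta, p):
    """States reached from p by a single epsilon transition."""
    return {r for (q, lbl, r) in delta if q == p and lbl == ""}
```

Σpred and epred are symmetric, matching (r, a, q) instead of (q, a, r); extending any of the four to a subset of Q is just a union over its members.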
The following facts apply to the NLR(0) automaton M_NC(G).
(1) L(M_NC(G)) = VP(G).
(2) Each I_j ∈ I \ {I_0} has a unique entry symbol X ∈ V ∪ {ε}, again denoted by entry(I_j).
(3) For {A→α·β} ∈ I such that A ≠ S′, Σpred({A→α·β}, α^R) = {{A→·αβ}} and goto(I_j, A) is defined for each I_j ∈ epred({A→·αβ}).
An Alternate Recognizer
The General_LR0 recognizer is modified to employ the NLR(0) automaton of G as a control automaton in place of the LR(0) automaton. The resulting algorithm, called General_NLR0, is displayed in Figure 6.2. Only a small number of minor changes were required to derive General_NLR0 from General_LR0. The differences between the two recognizers are discussed next.
The lines in Figure 6.2 were numbered so as to emphasize the correlation between the General_LR0 and General_NLR0 recognizers. Consequently, the line numbers cited below reference code in both Figures 6.1 and 6.2.
(3-4) It is explicitly recorded that the NLR(0) automaton of G, M_NC(G), is used as the control automaton in General_NLR0. Thus, the recognition graph constructed by General_NLR0, G_R(M_NC), is derived from M_NC and the input string w.
(23) A state I_j of M_NC has more than one incoming transition only if entry(I_j) = ε. Therefore, q_{j:i+1} is unconditionally added to Q at this point, i.e., lines 22 and 24 are not needed in Figure 6.2.
(33) Each set of items in M_NC is a singleton, so at most one reduction can apply to ψ(p). Thus, an if construct is more appropriate here in place of the for loop of Figure 6.1.
1. function General_NLR0(G = (V, T, P, S); w ∈ T*)
2. // w = a_1 a_2 ··· a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_{n+1} = $.
3. // Let M_NC(G) = (I, V, goto, I_0, F) be the NLR(0) automaton for G.
4. // G_R(M_NC) = (Q, V, δ) is an STG, the recognition graph.
5. Q, δ := {q_{0:0}}, ∅ // Initialize G_R.
6. // Let M_R = (G_R^{-1}, q_{0:0}, Q_0). Then L(M_R) = PVP(G, ε) = {ε}.
7. for i := 0 to n do
8. // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
9. Reduce(i)
10. // Let M_R = (G_R^{-1}, q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11. Shift(i)
12. // Let M_R = (G_R^{-1}, q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13. if Q_{i+1} = ∅ then Reject(w) fi
14. od
15. // Let M_R = (G_R^{-1}, q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16. Accept(w)
17. end
18. function Shift(i)
19. Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20. while Q_subset ≠ ∅ do
21. q := Remove(Q_subset) // Let goto(ψ(q), a_{i+1}) = I_j.
23. Q := Q ∪ {q_{j:i+1}} // Never redundant.
25. δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q)} // Never redundant.
26. od
27. end
Figure 6.2 The General_NLR0 Recognizer
(34) The appropriate successors of p along paths in G_R that spell (Xα)^R are located using the Σsucc and esucc functions (instead of the succ function). This is necessitated by the presence of ε-transitions in G_R.
(50) Similar to General_LR0, the Traverse function of General_NLR0 effectively performs a certain traversal of M_NC. However, in this case we also want to step over ε-transitions. Traversing ε-transitions in this way mirrors the Earley Predictor function.
28. function Reduce(i)
29. δ_subset := δ_i
30. Traverse(Q_i, i)
31. while δ_subset ≠ ∅ do
32. (p, X, q) := Remove(δ_subset)
33. if {A→αX·β} = ψ(p) such that β ⇒* ε then
34. for r ∈ esucc(Σsucc(q, α^R)) do // Let goto(ψ(r), A) = I_j.
35. if q_{j:i} ∉ Q then
36. Q := Q ∪ {q_{j:i}}
37. Traverse({q_{j:i}}, i)
38. fi
39. if (q_{j:i}, A, r) ∉ δ then
40. δ := δ ∪ {(q_{j:i}, A, r)}
41. δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
42. fi
43. od
44. fi
45. od
46. end
47. function Traverse(Q_subset, i)
48. while Q_subset ≠ ∅ do
49. q := Remove(Q_subset)
50. for goto(ψ(q), X) = I_j such that X ⇒* ε do // X ∈ N ∪ {ε}
51. if q_{j:i} ∉ Q then
52. Q := Q ∪ {q_{j:i}}
53. Q_subset := Q_subset ∪ {q_{j:i}}
54. fi
55. δ := δ ∪ {(q_{j:i}, X, q)} // Never redundant.
56. od
57. od
58. end
Figure 6.2 continued
Relationship to Earley's Algorithm
A connection between Earley's algorithm and General_NLR0 is established. The link between these two algorithms is made indirectly through Earley′. Specifically, we describe a correspondence between the Earley state graph constructed by Earley′ and the recognition graph constructed by General_NLR0.
Let G_1 = (Q_1, Σ, δ_1) and G_2 = (Q_2, Σ, δ_2) be state-transition graphs. Graph G_1 is homomorphic (resp. isomorphic) to graph G_2 if there exists a surjection (resp. bijection) f: Q_1 → Q_2 which induces a surjection (resp. bijection) g: δ_1 → δ_2 defined by g((p, X, q)) = (f(p), X, f(q)), p, q ∈ Q_1, X ∈ Σ ∪ {ε}.
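The induced map g of this definition can be checked mechanically, which makes the surjectivity condition concrete. The sketch below is our own illustration: the state map is a dict and transition sets are sets of triples.

```python
def is_homomorphic_under(f, delta1, delta2):
    """True iff the state map f induces g((p, X, q)) = (f(p), X, f(q))
    carrying delta1 onto delta2, i.e., g is well defined into delta2
    and every transition of delta2 is the image of some transition."""
    image = {(f[p], X, f[q]) for (p, X, q) in delta1}
    return image == set(delta2)
```

Note this checks only the induced map on transitions; surjectivity of f itself on any isolated states would need a separate check.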
Let M_NC(G) = (I, V, goto, I_0, F) with I = {I_0, I_1, ..., I_{m-1}} be the NLR(0) automaton of G. Let G_E′ = (Q_E′, V, δ_E′) be the Earley state graph constructed by Earley′ when it is applied to G and w. Lastly, let G_R(M_NC) = (Q, V, δ) be the recognition graph constructed by General_NLR0 when it is applied to G and w. Graph G_E′ is homomorphic to G_R as follows. The function f_1: Q_E′ → Q defined by f_1([A→α·β, j] ∈ S_i) = q_{k:i}, where I_k = {A→α·β}, is a surjection which induces the surjection g_1: δ_E′ → δ defined by g_1((r, X, s)) = (f_1(s), X, f_1(r)), r, s ∈ Q_E′, X ∈ V ∪ {ε}.
If an STG G_1 is homomorphic to an STG G_2, then an STG G_1′ can be derived from G_1 such that G_1 is homomorphic to G_1′ and G_1′ is isomorphic to G_2. Our comparison of Earley′ and General_NLR0 is concluded by defining an STG Ĝ_E′ = (Q̂_E′, V, δ̂_E′) such that G_E′ is homomorphic to Ĝ_E′ and Ĝ_E′ is isomorphic to G_R.
For 0 ≤ k < m and 0 ≤ i ≤ n+1, let ŝ_{k:i} = {[A→α·β, j] ∈ S_i | I_k = {A→α·β}}. Then Q̂_E′ = {ŝ_{k:i} | 0 ≤ k < m, 0 ≤ i ≤ n+1, ŝ_{k:i} ≠ ∅}. The transitions of Ĝ_E′ are defined as follows. For r̂, ŝ ∈ Q̂_E′ and X ∈ V ∪ {ε}, (r̂, X, ŝ) ∈ δ̂_E′ if and only if there exist r, s ∈ Q_E′ such that r ∈ r̂, s ∈ ŝ, and (r, X, s) ∈ δ_E′. By construction, G_E′ is homomorphic to Ĝ_E′.
That Ĝ_E′ is isomorphic to G_R is established as follows. Define the function f_2: Q̂_E′ → Q by f_2(ŝ_{k:i}) = q_{k:i}. The function f_2 is a bijection which induces the bijection g_2: δ̂_E′ → δ defined by g_2((r̂, X, ŝ)) = (f_2(ŝ), X, f_2(r̂)), r̂, ŝ ∈ Q̂_E′, X ∈ V ∪ {ε}. Therefore, Ĝ_E′ is isomorphic to G_R.
Implementation Considerations
For the remainder of this chapter, we turn our attention back to the General_LR0 recognizer. In this section, some issues that are pertinent to implementing General_LR0 are addressed. Specifically, means for properly handling graph cycles and for efficiently implementing the relevant set operations and the succ function are discussed. A satisfactory resolution of these issues facilitates the complexity analyses undertaken in the next section.
Graph Cycles
In any application which involves graphs that are not necessarily acyclic, graph cycles are a matter of concern. Neither LR(0) automata nor the recognition graphs constructed by General_LR0 are guaranteed to be acyclic.
Let M_C(G) denote the LR(0) automaton of G and let G_R(M_C) denote the recognition graph constructed by General_LR0 when it is applied to G and w. Since all paths in G_R are reflected in M_C, albeit in reverse, G_R is cyclic only if M_C is also cyclic. However, the converse does not hold; M_C may have cycles that are not replicated in a recognition graph regardless of the input string.
Properties of context-free grammars that give rise to cycles of any kind in LR(0) automata are identified first. Since L(M_C) = VP(G), M_C is cyclic if and only if VP(G) contains strings of unbounded length. Thus, M_C is cyclic if and only if for some A ∈ N, α ∈ V* with α ≠ ε, and y ∈ T*, A ⇒+ αAy holds in G. That is, for some δ ∈ V*, δα^i A ∈ VP(G) for all i ≥ 0. Note that α may contain terminal symbols.
Grammatical properties which give rise to those cycles in M_C that can also be reproduced in G_R are considered next. Since the above conditions characterize all possible cycles in M_C, a restriction on those conditions is sought. Assume for the moment that G_R is cyclic. Given an arbitrary transition in G_R of the form (q_{j:i}, X, q_{k:h}), we know that h ≤ i. Thus, a particular cycle in G_R must consist solely of states in Q_i for some i, 0 ≤ i ≤ n. Moreover, every transition between any two states in Q_i is on some nullable nonterminal symbol. Consequently, the conditions given above are modified as follows. A control automaton M_C has a cycle which may be reproduced in G_R if and only if for some A ∈ N, α ∈ V* with α ≠ ε, and y ∈ T*, A ⇒+ αAy and α ⇒* ε hold in G. Of course, whether or not a cycle is actually introduced into a recognition graph depends on the input string as well as the subject grammar.
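Both cycle conditions hinge on knowing which nonterminals derive ε. The nullable set is computed by a standard least-fixpoint iteration; a sketch of our own, with productions encoded as (lhs, rhs-tuple) pairs:

```python
def nullable_nonterminals(productions):
    """A is nullable iff some production A -> w has every symbol of w
    already known nullable (vacuously so when w is the empty tuple)."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for (A, rhs) in productions:
            if A not in nullable and all(X in nullable for X in rhs):
                nullable.add(A)
                changed = True
    return nullable
```

With this set in hand, testing the restricted condition amounts to looking for a derivation A ⇒+ αAy whose α is a nonempty string of nullable symbols.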
A result by Soisalon-Soininen and Tarhio [40] relating to the concept of a looping LR parser was helpful in identifying the grammatical properties that give rise to cyclic recognition graphs. Looping LR parsers are discussed in conjunction with a method for constructing deterministic LR parsers for some non-LR(k) grammars [2]; this method involves disambiguating multiply-defined parse table entries. A looping LR parser is an LR parser that has a parsing configuration such that all subsequent actions are reductions. The non-LR(k) grammars for which looping LR parsers can be produced (i.e., for some set of disambiguation choices) can be characterized as follows.
Fact 6.1 A looping LR parser can be constructed for G if and only if for some A ∈ N and α, β ∈ V* the following three statements hold in G: (1) A ⇒+ αAβ, (2) α ⇒* ε, and (3) if α = ε, then β ⇒* ε.
Proof. This is the main result presented by Soisalon-Soininen and Tarhio [40].
In summary, a cycle in M_C is introduced into G_R only if it spells a nontrivial string of nullable nonterminal symbols. Paths spelling strings of nullable nonterminals which can cause cycles are introduced into G_R by the Traverse function. This is effectively carried out through a traversal of M_C where each state in M_C is considered at most once. Once cycles are present in G_R, they are traversed, if at all, in the Reduce function. Specifically, the computation of the succ function implies a traversal of certain paths in G_R, including those which contain cycles. An implementation of the succ function which properly deals with cycles in G_R is described in a later subsection. In either case, cyclic control automata and recognition graphs do not pose any particular difficulty to General_LR0.
Set Operations
Two sets are maintained by General_LR0 during recognition, viz., Q and δ. Two set operations are used in the process. One operation is that of determining if a particular object is an element of a set. The other operation is that of adding an object to a set. Efficient means for implementing these operations with respect to both Q and δ are described below.
The operations on Q are considered first. We assume that the states in Q_i are stored on a separate linked list for each value of i. Thus, whether or not q_{j:i} exists in Q can be determined by scanning a list of at most m items. A state is added to Q by simply linking it into the appropriate list. Thus, both set operations of interest can be performed with respect to Q in constant time.
Membership in Q can be resolved faster using the following scheme. A boolean flag is associated with each state in M_C. The flags are reset to false at the beginning of each iteration of the main for loop in General_LR0. When a state q is added to Q by either Reduce or Shift in the ith iteration, 0 ≤ i ≤ n, the flag associated with ψ(q) is set to true. In this way, the membership of q ∈ Q can be determined during the ith iteration by testing the flag associated with ψ(q).
The overhead associated with resetting m boolean flags each time through the loop can be avoided by using integer flags instead. The flags are initialized to -1. When a state q is added to Q in the ith iteration, 0 ≤ i ≤ n, the flag associated with ψ(q) is set to i. The membership of q in Q is resolved during the ith iteration by comparing i with the value of the flag for ψ(q). If the flag's value is less than i, then q ∉ Q. Otherwise, the flag's value is equal to i and q ∈ Q.
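The integer-flag scheme fits in a few lines. In this sketch (ours; m is a stand-in for the number of control-automaton states), flag[k] records the last iteration in which a state q_{k:i} was added to Q:

```python
m = 4                  # number of states in M_C (an assumed example value)
flag = [-1] * m        # one integer flag per control-automaton state

def add_state(k, i):
    """Record that q_{k:i} has just been added to Q during iteration i."""
    flag[k] = i

def in_Q(k, i):
    """During iteration i, q_{k:i} is in Q iff its flag equals i;
    any smaller value is left over from an earlier iteration."""
    return flag[k] == i
```

The one-time initialization to -1 replaces the per-iteration reset of m booleans, which is the whole point of the scheme.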
Managing the transition set δ is slightly more involved. We assume that all of the transitions out of q_{j:i}, with 0 ≤ j < m and 0 ≤ i ≤ n+1, are stored on a linked list associated with q_{j:i}. Thus, a new transition out of q_{j:i} can simply be linked into this list. However, this list may contain O(i+1) items, so it can be costly to scan the list in search of a transition. An efficient method for resolving membership with respect to δ is described as follows. Note that we need only be concerned with transitions on nonterminals since transitions on terminals are never generated redundantly. Thus, we assume that entry(I_j) = A for some A ∈ N.
Let (q_{j:i}, A, q_{k:h}), h ≤ i, be a transition that is about to be added to δ. When q_{k:h} was created, an integer flag was attached to it for each nonterminal transition out of I_k in M_C; these flags are initialized to -1. If (q_{j:i}, A, q_{k:h}) ∉ δ, then the flag attached to q_{k:h} that is associated with the transition out of I_k on A is less than i. When the transition is added to δ, this flag is set equal to i. The effectiveness of this scheme is a consequence of the order in which transitions are added to δ. Thus, both set operations can be performed with respect to δ in constant time as well.
The succ Function
The last significant aspect of General_LR0 that needs explication is its use of the succ function. This subsection proposes one approach to implementing succ. A revised Reduce function is presented which incorporates the method. The modified function is displayed in Figure 6.3.
Each use of succ in Reduce implies that a traversal of G_R is carried out. An auxiliary stack, the Succ_Stack, is used by Reduce to effect this traversal. Each entry in the stack records an intermediate stage in the traversal of G_R that is required to compute the succ function.
Consider the reference to the succ function in line 34 of Figure 6.1. Based on properties of control automata and recognition graphs, the following holds: succ(q, α^R) = {r ∈ Q | there is a path in G_R from q to r spelling α^R} = {r ∈ Q | there is a path in G_R from q to r of length len(α^R)}. Motivated by this observation, each entry in Succ_Stack is a triple (r′, A, d) where (1) r′ is a state in G_R to which some path traversal from q has progressed, (2) A is the left-hand side of the production being reduced, and (3) d is the distance left to go before a state in succ(q, α^R) is reached, where d ≤ len(α). The stepwise description that follows clarifies how Succ_Stack is used to compute the succ function.
(1-3) These three lines correspond to lines 28-30 of Figure 6.1.
(4) The Succ_Stack is initially empty.
1. function Reduce(i) // Revised to implement the succ function.
2. δ_subset := δ_i
3. Traverse(Q_i, i)
4. Succ_Stack := ∅
5. while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
6. if Succ_Stack = ∅ then
7. (p, X, q) := Remove(δ_subset)
8. for A→αX·β ∈ ψ(p) such that β ⇒* ε do
9. Push(Succ_Stack, (q, A, len(α)))
10. od
11. else // Succ_Stack ≠ ∅
12. (r, A, d) := Pop(Succ_Stack)
13. if d > 0 then // Let entry(ψ(r)) = X.
14. for r′ ∈ Q such that (r, X, r′) ∈ δ do
15. Push(Succ_Stack, (r′, A, d-1))
16. od
17. else // d = 0, let goto(ψ(r), A) = I_j.
18. if q_{j:i} ∉ Q then
19. Q := Q ∪ {q_{j:i}}
20. Traverse({q_{j:i}}, i)
21. fi
22. if (q_{j:i}, A, r) ∉ δ then
23. δ := δ ∪ {(q_{j:i}, A, r)}
24. δ_subset := δ_subset ∪ {(q_{j:i}, A, r)}
25. fi
26. fi
27. fi
28. od
29. end
Figure 6.3 A Modified Reduce Function
(5) This while loop corresponds to the while loop at line 31 in Figure 6.1. However, in this case there are two collections to exhaust before the loop terminates.
(6) The true branch of the if statement deals with items in δ_subset and the false branch deals with items in Succ_Stack. The if predicate is written so that items in Succ_Stack have priority over items in δ_subset. Clearly the predicate is true in the first iteration of the while loop.
(7-8) These two lines are the same as lines 32-33 of Figure 6.1.
(9) Instead of invoking the succ function as in line 34 of Figure 6.1, we initiate the graph traversal of G_R that is implied by that use of succ. Specifically, (q, A, len(α)) is pushed onto Succ_Stack to record that we want to find the successors of q which are located at the ends of paths of length len(α) from q; moreover, when each of these states is found, a transition on A will be made to it from an appropriate state in Q_i.
(11) The Succ_Stack is not empty, so one of its entries is processed.
(12) An item (r, A, d) is removed from Succ_Stack.
(13-16) If d > 0, then the stage in the traversal of G_R that is recorded by (r, A, d) has not progressed far enough. Let entry(ψ(r)) = X. Then every transition out of r is on X. For each state r′ ∈ Q such that (r, X, r′) is a transition in G_R, (r′, A, d-1) is pushed onto Succ_Stack. By effectively moving to r′, the length of the traversal has been increased by 1. Consequently, the distance remaining is decreased by 1.
(17-25) If d = 0, then r ∈ succ(q, α^R) for some q and α referred to in lines 7-9. Lines 18-25 are identical to lines 35-42 of Figure 6.1.
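Stripped of the surrounding reduction bookkeeping, the stack-driven computation above amounts to finding the states at a fixed path distance from q. The following Python sketch (ours) keeps only the (state, remaining distance) part of the stack entries:

```python
def succ_states(delta, q, length):
    """States r reachable from q along some path of exactly `length`
    transitions in an edge-list graph. Terminates on cyclic graphs
    because the remaining distance strictly decreases on every push."""
    found = set()
    stack = [(q, length)]
    while stack:
        r, d = stack.pop()
        if d == 0:
            found.add(r)
        else:
            for (p, lbl, s) in delta:
                if p == r:
                    stack.append((s, d - 1))
    return found
```

On a graph with a nullable self-loop, a path of length 2 may legitimately revisit a state, which is exactly why the traversal is bounded by distance rather than by a visited-set.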
The Complexity of Recognition
In this section, some worst-case complexity bounds are established for the General_LR0 recognizer. Specifically, we consider the amount of space and time required by General_LR0, in the worst case, when it is applied to G and w. In the following, it is convenient to assume that w ∈ L(G). In addition, the LR(0) automaton of G, M_C(G), is assumed to have m states.
Bounds on space requirements are derived first. They are useful in determining the time bounds. In both cases, bounds are established for arbitrary G and for arbitrary unambiguous G.
Space Bounds
The space complexity of General_LR0 is determined by placing an upper bound on the number of states and transitions in G_R at the point when w is accepted. The sizes of the auxiliary data structures, i.e., Q_subset, δ_subset, and Succ_Stack, are accounted for later.
First, we assume that G is arbitrary. For 0 ≤ i ≤ n, Q_i contains at most m states, and Q_{n+1} contains one state. Thus, there are at most m(n+1)+1 ∈ O(n) states in G_R.
Consider δ_i for some i, 0 ≤ i ≤ n. Every state in Q_i may have transitions to every state in ∪_{0≤j≤i} Q_j. The number of states in ∪_{0≤j≤i} Q_j is at most m(i+1). Consequently, since Q_i has at most m states, δ_i has at most m²(i+1) transitions. In addition, δ_{n+1} contains one transition. Thus, there are at most 1 + Σ_{i=0}^{n} m²(i+1) ∈ O(n²) transitions in G_R.
Summarizing, G_R contains at most O(n) states and O(n²) transitions. Therefore, the space complexity of General_LR0 for arbitrary G is O(n²). An ambiguous grammar that meets this worst-case space bound is the following: {S → S S | a | ε}.
The space complexity of General_LR0 remains O(n²) even if G is unambiguous. For example, the unambiguous grammar with production set {S → a S a | a | ε} meets this worst-case space bound.
Time Bounds
The time complexity of General_LR0 is determined by placing an upper bound on the time required to construct G_R. It transpires that the complexity of General_LR0 is dominated by the complexity of the Reduce function. The following remarks are made in light of the earlier observations regarding the efficiency of the set operations used by General_LR0.
The main function invokes the Shift and Reduce functions n+1 times each. Thus, the time complexity of General_LR0 is determined from the time spent in these two functions throughout the duration of recognition.
At most m states and m transitions are installed in G_R during any one invocation of the Shift function. Thus, over n+1 calls, O(n) time is spent within Shift.
In analyzing the complexity of the Reduce function, the time spent within Traverse is accounted for separately. In any one invocation of Reduce, the Traverse function is called at most m times. That is, in the worst case it is called once for each state in Q_i. Within any one invocation of Traverse, at most m states and m² transitions are added to the recognition graph. Thus, over n+1 calls to Reduce, O(n) time is spent within the Traverse function.
In assessing the contribution of the Reduce function to the time complexity of General_LR0, we first assume that G is unambiguous. For some i, 1 ≤ i ≤ n, the ith invocation of Reduce is analyzed.³ By an inspection of the while loop, the time spent within Reduce is based on the number of items that are cycled through δ_subset and Succ_Stack. From the analysis of the space complexity of General_LR0, there are at most m²i transitions from states in Q_i to states in ∪_{0≤j<i} Q_j at the completion of the ith call to Reduce. These are precisely the transitions that are cycled through δ_subset. Although at least one of these transitions must have been generated in the most recent invocation of Shift, for simplicity we assume that all O(i) of them are created by Reduce. Under this assumption, each transition in δ_i results from traversing some path in G_R that spells the reversal of some prefix of a production right-hand side. This traversal is effected through the use of the Succ_Stack. Let ρ = max({len(ω) | A→ω ∈ P}). Thus, at most m²iρ entries are cycled through Succ_Stack while all of the reductions relevant to the ith call to Reduce are performed. Together, at most m²i(ρ+1) items are cycled through δ_subset and Succ_Stack. Since Σ_{i=1}^{n} m²i(ρ+1) ∈ O(n²), the total time spent in Reduce over n+1 calls is O(n²). Accumulating the total time consumed by Shift, Traverse, and Reduce, we conclude that General_LR0 runs in O(n²) time in the worst case if G is unambiguous.
Now assume that G is arbitrary. Again, we want to determine the total number of items cycled through δ_subset and Succ_Stack during the ith call to Reduce for some i, 1 ≤ i ≤ n. The number of transitions cycled through δ_subset is still bounded by m²i. A bound on the number of entries cycled through Succ_Stack is given by the number of distinct paths that may be traversed when making all possible reductions back through those transitions. Consider one of the O(i) transitions in δ_i, say (p, X, q). Suppose that A→αX· ∈ ψ(p). Further suppose that len(αX) = ρ. While traversing all of the paths in G_R that emanate from p, pass through (p, X, q), and spell (Xα)^R, an upper bound on the number of items that are cycled through Succ_Stack is given by Σ_{j=0}^{ρ-1} i^j ∈ O(i^{ρ-1}). Since there are O(i) transitions in δ_i that may be reduced back through, O(i^ρ) entries may be cycled through Succ_Stack during the ith call to Reduce. Since Σ_{i=1}^{n} i^ρ ∈ O(n^{ρ+1}), General_LR0 runs in O(n^{ρ+1}) time in the worst case.
³ All of the work is done by Traverse when i = 0 since δ_0 = ∅ when Reduce is called in that instance.
The worst-case running time of General_LR0 does not compare favorably with Earley's recognizer. However, the parsing version of General_LR0 also runs in O(n^{ρ+1}) in the worst case. As shown in the next chapter, this bound more properly reflects the time required to construct a convenient representation of all the possible parses of an input string. In contrast, the O(n³) bound does not take into account the time required by Earley's algorithm to analyze its more indirectly represented parse forest.
We have not yet accounted for the maximum sizes potentially attained by the auxiliary data structures Q_subset, δ_subset, and Succ_Stack. The set variable Q_subset holds at most m states in either Shift or Traverse. In Reduce, the set variable δ_subset contains at most m²(i+1) transitions. Since access to Succ_Stack follows a LIFO discipline, it contains at most O(i) entries at any time. Therefore, the space required for these structures does not contradict the worst-case space bounds for General_LR0 that were derived above.
On Garbage Collection and Lookahead
Garbage collection and lookahead provide means for improving the efficiency of the General_LR0 recognizer. Garbage collection is relevant to reclaiming the space occupied by states and transitions in G_R when they become superfluous to the remainder of the recognition task. Lookahead is used for selectively generating only those states and transitions that are consistent with the current lookahead string. Some basic notions regarding the use of garbage collection and lookahead within General_LR0 are discussed briefly.
Recalling the set-theoretic foundation of General_LR0 helps to motivate the utility of garbage collection. Since G_R represents the sets of viable prefixes that are tracked by the recognizer, the notion of a dead state as it applies to M_R identifies nonessential states of the recognition graph. Whether G_R is considered at line 10 or line 12 of General_LR0, all states that are dead with respect to M_R at those points, as well as all transitions emanating from them, are no longer needed. Consequently, the space used by these states and transitions can be reclaimed for later use.
In order to determine an appropriate location within General_LR0 to invoke garbage collection, note that if M_R contains no dead states before Reduce is called, then it has no dead states when Reduce terminates. However, the same remark does not apply to the Shift function. In particular, states can become dead during the ith call to Shift, 0 ≤ i ≤ n, if only a proper subset of the states in Q_i have transitions generated to them. Thus, it is convenient to perform garbage collection in conjunction with the Shift function by anticipating the states that become dead as a result of it.
An appropriate place to perform garbage collection is immediately following line 19 in the Shift function. The following simple scheme is sufficient.
(1) Mark all states that are reached in a traversal of G_R that begins at the states in Q_subset.
(2) In a second traversal that starts from the states in Q_i \ Q_subset, delete from Q the states that were not marked in step (1) and delete from δ the transitions that emanate from those states.
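The two-step scheme amounts to mark-and-sweep. A sketch in Python (graph representation and names are ours, not the dissertation's):

```python
def collect(states, delta, roots):
    """Step (1): mark every state reachable from the root states.
    Step (2): sweep away unmarked states and the transitions
    that emanate from them."""
    marked, stack = set(), list(roots)
    while stack:
        q = stack.pop()
        if q not in marked:
            marked.add(q)
            stack.extend(r for (p, lbl, r) in delta if p == q)
    live_delta = {(p, lbl, r) for (p, lbl, r) in delta if p in marked}
    return states & marked, live_delta
```

Marking visits each state at most once, so cycles and self-references in the graph pose no problem for this approach.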
Note that a garbage collection scheme based on reference counts would be far less straightforward due to the self-references which arise from cycles in the recognition graph. Moreover, the simple mark-and-sweep garbage collection procedure outlined above applies readily to General_NLR0 as well.
Although garbage collection can improve the space efficiency of General_LR0, it obviously incurs a time penalty. For 0 ≤ i ≤ n, there are O((i+1)²) states and transitions in G_R prior to the ith call of Shift. Thus, the procedure outlined above may be performed in O((i+1)²) time. Observe that this is no worse than the worst-case time complexity of the Reduce function.
In practice, one would probably want to perform garbage collection less often than on every input symbol. Regardless, a similar procedure involving two graph traversals would still apply. The first traversal begins from certain states in the most recently completed state subset Q_i and marks all states reached in the process. In the second traversal, all unmarked states and their outgoing transitions are deleted from the recognition graph.
The basic goal of garbage collection is to periodically contract the size of the recognition graph. As a consequence, space taken up by nonessential states and transitions becomes eligible for reuse. In contrast, the aim of lookahead is to anticipate the states and transitions that are necessary to recognize the input string. In short, lookahead is used within Shift, Reduce, and Traverse to selectively generate those states and transitions that are consistent with the current lookahead string.
In order to make use of lookahead, the items in the control automaton are attributed with appropriate lookahead strings. The literature on the computation and use of lookahead in the context of LR parsers is quite extensive. The type of lookahead typically used in conjunction with LR(0) automata is either SLR(k) lookahead [12] or LALR(k) lookahead [8,11,29].⁴ Without going into detail, the use of k-symbol lookahead in General_LR0⁵ for some k > 0 impacts the following locations in Figure 6.1.
(Line 19) Q_subset is computed to contain only those states q ∈ Q_i such that the shift on a_{i+1} from ψ(q) is consistent with the lookahead string.
(33) Only those reductions are initiated from p that are consistent with the current k-symbol lookahead. This comment also applies to line 8 in Figure 6.3.
(50) Transitions on nullable nonterminal symbols are selectively made based on their consistency with the k-symbol lookahead string.
⁴ Almost invariably, k = 1.
⁵ This is somewhat of a misnomer when lookahead is employed.
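The lookahead computation itself is left to the cited literature. As one concrete possibility, SLR(1) lookahead would filter a reduction by A → α according to whether the next input symbol lies in FOLLOW(A). The sketch below computes FOLLOW by the standard fixpoint; it is not taken from the dissertation, and the names and grammar representation are assumptions.

```python
def follow_sets(productions, start, nullable, first):
    """Standard FOLLOW fixpoint. productions: list of (lhs, rhs) pairs with
    rhs a tuple of symbols; nullable: the set of nullable nonterminals;
    first: dict mapping each nonterminal to its FIRST set (a terminal is
    its own FIRST and needs no entry)."""
    nts = {lhs for lhs, _ in productions}
    follow = {A: set() for A in nts}
    follow[start].add("$")
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            trailer = set(follow[lhs])  # what can follow the suffix scanned so far
            for sym in reversed(rhs):
                if sym in nts:
                    if not trailer <= follow[sym]:
                        follow[sym] |= trailer
                        changed = True
                    if sym in nullable:
                        trailer = trailer | first[sym]
                    else:
                        trailer = set(first[sym])
                else:
                    trailer = {sym}     # a terminal hides everything behind it
    return follow

def reduction_consistent(lhs, lookahead, follow):
    # SLR(1) filter: initiate a reduction by lhs -> alpha only if the
    # current input symbol can follow lhs.
    return lookahead in follow[lhs]
```

For the toy grammar S → A b, A → a | ε, the filter admits a reduction by an A-production only on lookahead b.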
The costs of employing lookahead include the space that is needed for storing lookahead strings in the control automaton and the time associated with matching the k-symbol lookahead string from the input string with occurrences of it in the control automaton. If k = 1, as is generally the case, the overhead of using lookahead is not usually an issue.
On the other hand, the benefits of using lookahead can be substantial. Space is saved by reducing the number of states and transitions that are needlessly created. In addition, time is saved that would otherwise be spent generating unnecessary pieces of the recognition graph and traversing paths that would be called for by Reduce in the absence of lookahead. Most significantly, General_LR0 runs in linear space and time if G is an LR(k) grammar, provided that k-symbol lookahead is used.
Discussion
The Earley′ and General_LR0 recognizers both construct state-transition graphs. In each case, the STG is used for representing the sets of viable prefixes that are tracked by the General_LR recognition scheme. The graph constructed by Earley′, G_E′, is derived interpretively in the sense that the Earley states that are generated during recognition drive the construction of the graph. In contrast, G_R is constructed under the guidance of a precomputed control automaton. This distinction is obscured somewhat by the General_NLR0 recognizer. General_NLR0 constructs a state-transition graph that is quite similar to G_E′, but does so under the guidance of the NLR(0) automaton of G.
The General_LR0 and General_NLR0 recognizers illustrate extremal examples of a basic approach to general recognition that entails constructing a recognition graph under the guidance of a controlling automaton. In each case, (1) the structure of the recognition graph is mirrored in the control automaton, (2) the recognition graph is used to represent the sets of viable prefixes that are tracked by the General_LR recognition scheme, and (3) the control automaton accepts the viable prefixes of G. Other possible control automata are suggested by the fact that the LR(0) automaton of G can be obtained by applying the subset construction algorithm to the NLR(0) automaton of G. Any automaton intermediate between the NLR(0) and LR(0) automata that is built during subset construction provides a viable candidate for a control automaton. One main advantage of LR(0) automata is their determinism, whereas a favorable feature of NLR(0) automata is their comparatively smaller number of states. Automata that are intermediate between these two extremes can be tailored to balance both of these factors. The choice of possible control automata is broadened still further when lookahead is introduced. An investigation of alternate control automata is left for future work.
Of the known context-free recognition algorithms, General_LR0 is most like Tomita's algorithm without lookahead [42,43]. In this form, Tomita's algorithm interprets a parse table derived from the LR(0) automaton of G and maintains a so-called graph-structured stack that is similar in structure to our recognition graph. However, a transition of the form (p, A, q) is represented by two edges of the form (p, r_A) and (r_A, q), where p and q correspond to parse states and r_A is a symbol vertex. In effect, the symbol vertices play the role of our transition labels. Due to the use of these symbol vertices, the correspondence between the states and edges in the graph-structured stack and the states and transitions of the underlying LR(0) automaton is not as precise as in General_LR0. In addition, the symbol vertices needlessly increase the number of vertices and edges in the graph-structured stack, increase the lengths of paths that are traversed during reductions by a factor of 2, and complicate the operations which manage the stack.
Tomita's algorithm cannot handle cyclic grammars [42]. However, it also fails to handle some noncyclic grammars that contain ε-productions. In short, any grammar that may introduce a cycle into the graph-structured stack is troublesome. These grammars are exactly the grammars that can introduce cycles into our recognition graphs.
Tomita's algorithm independently keeps track of edges that may need to be reduced back through and states that have yet to be acted on (a state is acted on to determine what parse moves are relevant to it). In contrast, other than the special attention given certain nonterminal transitions, General_LR0 uniformly lets the transitions stored in δ_subset drive the reduction process.
The special handling required of nullable nonterminals is common to all general recognizers that allow ε-productions. The manner in which Tomita's algorithm deals with ε-productions is the cause for its limited coverage. For i = 0 to n+1, the states in U_i = ∪_j U_{i,j} are generated by Tomita's algorithm as follows (U_i corresponds to our Q_i).
(1) Let j = 0.
(2) If i = 0, then U_{0,0} contains only the start state; otherwise, U_{i,0} is comprised of the states that resulted from shift moves on a_i from states in U_{i−1}.
(3) If all of the states in U_{i,j} have been considered, then all of the reductions have been performed at stage i. The shift moves on a_{i+1} are performed next.
(4) Perform all pending reductions by non-ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j}.
(5) Perform all pending reductions by ε-productions from states in U_{i,j}; any new state that is created is placed in U_{i,j+1}.
(6) Let j = j+1 and return to step (3).
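The ordering imposed by steps (1) through (6) can be paraphrased in code. The sketch below is a hypothetical rendering of the control flow only: Tomita's algorithm actually operates on a graph-structured stack, which is abstracted here into two caller-supplied successor functions.

```python
def tomita_stage_subsets(u_i0, noneps_succ, eps_succ):
    """Group the states of stage i into subsets U[i][0], U[i][1], ... per
    steps (1)-(6): non-epsilon reductions keep their new states in the
    current subset, while epsilon reductions seed the next one.
    u_i0: states produced by the shifts on a_i (the subset U[i][0]);
    noneps_succ(s) / eps_succ(s): states created by pending reductions
    from s via non-epsilon / epsilon productions, respectively."""
    subsets = [set(u_i0)]
    seen = set(u_i0)
    while True:
        frontier = subsets[-1]
        # Step (4): exhaust non-epsilon reductions inside the current subset.
        work = list(frontier)
        while work:
            s = work.pop()
            for t in noneps_succ(s):
                if t not in seen:
                    seen.add(t)
                    frontier.add(t)
                    work.append(t)
        # Step (5): epsilon reductions are delayed; their new states open
        # the next subset U[i][j+1].
        nxt = {t for s in frontier for t in eps_succ(s) if t not in seen}
        if not nxt:
            return subsets
        seen |= nxt
        subsets.append(nxt)
```

With a toy successor structure in which state 2 reaches 3 only through an ε-reduction, state 3 (and its non-ε successor 4) lands in a later subset than 1 and 2, illustrating the delay the text describes.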
Thus, reductions by ε-productions are delayed until there are no other reductions to be made. As a consequence of this treatment of ε-productions, ψ : U_i → I is not necessarily one-to-one, where I represents the states in the underlying LR(0) automaton. This is an undesirable anomaly that further obfuscates the operation of the algorithm. By comparison, General_LR0 ensures that ψ is always one-to-one.
The fact that Tomita's algorithm fails to handle some noncyclic grammars with ε-productions was also observed by Nozohoor-Farshi [35]; in particular, grammars for which ∃A ∈ N such that A ⇒+ αAβ and α ⇒* ε hold in G, but β ⇒* ε does not hold, are focused on. In order to accept grammars of this kind, a modification to Tomita's algorithm is proposed which allows cycles in the graph-structured stack. The basic approach to handling such cycles is outlined as follows: when a nonterminal transition is installed from a state q ∈ U_i that already existed in the graph, all states in U_i which were previously acted on are reconsidered to see if any reductions from them pass through the new transition. This is apparently sufficient, but the details of how it is accomplished are not provided.
The worst-case time complexity of Tomita's algorithm is also O(n^{p+1}) [26]. In comparison, recall that the complexity of Earley's algorithm is not affected by the length of production right-hand sides. Accompanying the complexity analysis by Kipps [26] is a modified version of Tomita's algorithm that has a worst-case running time in O(n³). In short, additional interstate links are used for decreasing the number of paths that must be traversed when performing reductions. However, the plethora of set-union and set-membership operations contained in the algorithm does not make it clear that O(n³) time is obtained. In any case, this modification subverts the algorithm's ability to construct a parse forest, so it is only useful for recognition.
CHAPTER VII
A GENERAL BOTTOM-UP PARSER
The General_LR0 recognizer is extended into a general bottom-up parser in this chapter. The transformation from general recognizer to general parser is straightforward in all but one respect: some effort must be expended to parse arbitrary derivations of the empty string. Briefly, a parse of an input string is represented by appropriately annotating the transitions of the recognition graph. Ambiguity is accommodated by attaching multiple annotations to relevant transitions. As usual, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a₁a₂ ⋯ a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, and a_{n+1} = $, are assumed throughout.
From Recognition to Parsing
Implementations of deterministic bottom-up parsers, of which LR parsers are exemplary, are not obliged to build an explicit parse tree for the input string. Whether or not a parse tree is indeed constructed is primarily dictated by the requirements of the application to which the parser is applied. Other influential factors include memory constraints and the interface between the parser and other processing components.
In contrast, general bottom-up parsers typically cannot avoid explicit parse tree representations. When parsing against a nondeterministic grammar, a forest of parse trees rather than an identifiably unique tree is typically relevant to the input string. Due to theoretical limitations on the discrimination afforded by lookahead, this behavior is even observed with unambiguous grammars. In any case, some representation of the parse forest must be built during parsing so that a unique parse can eventually be produced.
In light of these observations, the parsing version of General_LR0, General_LR0′, overtly maintains a representation of a parse forest. The manner in which this is accomplished is a simple generalization of the following proposed scheme for explicitly constructing a parse tree within an LR parser.
Suppose that G is an LR grammar. We consider a hypothetical LR parser for G and describe one way to explicitly build a parse tree for an input string in conjunction with the parse stack. We may assume that the parser is based on some LR automaton for G, say M. At any point during a parse, the contents of the stack is a sequence of states from M. The parse tree that is synthesized during parsing is represented by associating a node in the tree with each state in the stack other than the bottommost state.
Let the contents of the stack at some point be s₀s₁ ⋯ s_m, m ≥ 0, where each s_i is a state of M; in particular, s₀ is the start state of M. For 1 ≤ i ≤ m, let X_i denote the entry symbol for state s_i. Thus, X₁X₂ ⋯ X_m is the viable prefix of G that is implicitly represented by the supposed stack contents. If m = 0, the relevant viable prefix is ε. For 1 ≤ i ≤ m, assume that some representation of a parse tree node labeled with X_i is attached to the entry for s_i in the stack. The shift and reduce actions of M generate additional tree nodes as follows.
as follows.
A shift action always creates a new leaf node. Suppose that the current input symbol is a and the next action of the parser is to shift a from s_m. As a result of this action, the contents of the stack becomes s₀s₁ ⋯ s_m t₁ where goto(s_m, a) = t₁. As a side effect, a new parse tree node is generated, labeled with a, and attached to t₁ in the stack.
A reduce action typically generates one internal node. However, when reducing by an ε-production, a leaf node is also created. Suppose that the next action called for by the parser is to reduce by production A → ε. This action transforms the contents of the stack to s₀s₁ ⋯ s_m t₂ where goto(s_m, A) = t₂. Two new tree nodes are generated as a side effect. One tree node is a leaf that is labeled with ε. The second is an internal tree node; it is labeled with A, set to point to the new leaf, and attached to t₂.
Lastly, suppose that the next action called for by the parser is to reduce by production A → X_{m−r} ⋯ X_m, r ≥ 0, i.e., the length of the right-hand side is strictly greater than 0. If goto(s_{m−r−1}, A) = t₃, then the contents of the stack becomes s₀s₁ ⋯ s_{m−r−1} t₃ and a new tree node labeled with A is attached to t₃. In addition, this new internal node is set to point to each of the nodes that were associated with the states s_{m−r}, …, s_{m−1}, s_m before the reduction was made.
Upon accepting the input string w, the contents of the stack is s₀s′s″ where goto(s₀, S) = s′ and goto(s′, $) = s″. At this point, the root of the parse tree for a₁a₂ ⋯ a_n is attached to s′.
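The bookkeeping just described can be sketched directly. The following hypothetical Python keeps the stack as a list of (state, node) pairs; the goto table, the Node class, and the function names are assumptions made for illustration, not part of the scheme as stated.

```python
class Node:
    """A parse tree node: a label and (for internal nodes) ordered children."""
    def __init__(self, label, children=()):
        self.label = label
        self.children = list(children)

def shift(stack, goto, a):
    # Shift terminal a: push goto(s_m, a) with a fresh leaf attached to it.
    state, _node = stack[-1]
    stack.append((goto[state, a], Node(a)))

def reduce(stack, goto, lhs, r):
    # Reduce by lhs -> X_1 ... X_r (r = 0 for lhs -> epsilon): pop r
    # entries and reuse their nodes as the children of a new internal node
    # attached to the goto state.
    if r == 0:
        children = [Node("eps")]          # the extra leaf for the empty string
    else:
        children = [node for _state, node in stack[-r:]]
        del stack[-r:]
    state, _node = stack[-1]
    stack.append((goto[state, lhs], Node(lhs, children)))
```

Shifting a and b and then reducing by a hypothetical S → a b leaves one internal S node on top whose children are the two leaves, exactly as in the prose above.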
A parse forest for the input string is synthesized by General_LR0′ in an analogous fashion. Specifically, the Shift and Reduce functions are modified to annotate the recognition graph with information sufficient for representing the parse forest. The parse annotations are attached to the transitions of the recognition graph since the connectivity of the graph, i.e., as exhibited through the transitions, reflects the structure of the parse forest.
Overlooking many of the details that are supplied later, General_LR0′ constructs a parse forest as follows. When a transition on a ∈ T is created by Shift, a leaf node labeled with a is attached to that transition. A transition that is created by Reduce corresponds to an internal node of the parse forest. The parse annotation attached to it includes pointers to the parse annotations associated with the transitions that were traversed along the way toward creating that transition (i.e., the transitions traversed in the computation of the succ function). The transitions created by Traverse are annotated so as to avoid creating circularities in the parse forest that arise due to unbounded derivations of the empty string. In short, Traverse resolves all ambiguous derivations of ε.
A transition that is multiply defined, i.e., due to ambiguity, can have a distinct parse annotation attached to it for each path in the recognition graph that reduced to that transition. In this way, the parse forest becomes a factored representation of all possible parse trees for the input string (excluding ambiguous derivations of ε). However, the presentation that follows is simplified by assuming that ambiguities are resolved as soon as they are detected. Of course, the ease with which ambiguities can actually be resolved is dictated by semantic properties of the language generated by G.
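The two policies, retaining every annotation of a multiply defined transition versus resolving each ambiguity as soon as it is detected, might be contrasted as follows; the representation and names are assumed for illustration.

```python
def install(delta, p, A, q, pi, disambiguate=None):
    """Attach parse annotation pi to transition (p, A, q); delta maps each
    (p, A, q) triple to a list of annotations.
    With disambiguate=None, every annotation is kept, yielding a factored
    representation of all parses; otherwise each ambiguity is resolved as
    soon as it is detected, as the presentation in the text assumes."""
    key = (p, A, q)
    if key not in delta:
        delta[key] = [pi]
    elif disambiguate is None:
        delta[key].append(pi)                    # multiply defined transition
    else:
        delta[key] = [disambiguate(delta[key][0], pi)]
    return delta
```

Installing two annotations on the same transition either accumulates both or keeps the one the (unspecified) Disambiguate policy selects.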
Parse Annotations
The parse forest built by General_LR0′ is maintained through information that is attached to the transitions of the recognition graph. These attachments have already been referred to as parse annotations. The notation that is used for denoting parse annotations is introduced next. For simplicity, only one parse annotation is ever attached to a given transition.
The Greek letter π, possibly with a subscript, is used regularly to denote parse annotations. All parse annotations are enclosed within square brackets. Thus, [π] is a simple example of the notation used to denote a parse annotation.
The parse annotation for a transition on a ∈ T that is generated by Shift is denoted by [a]. Conceptually, this annotation is some descriptor for the terminal symbol a. A transition on A ∈ N that is generated by Traverse as the result of a reduction by A → ε is annotated with [ε], i.e., a suitable descriptor for the empty string. The notion of an empty parse annotation, denoted by [], is also useful; note that this annotation is distinct from [ε].
The parse annotation of every other nonterminal transition, whether generated by Reduce or Traverse, consists of a list of pointers to other parse annotations. For this purpose, we let &π denote a pointer or reference to the parse annotation [π] (or equivalently, a pointer to the transition to which [π] is attached). Consider a transition on A ∈ N that is generated as the result of a reduction by production A → X₁X₂ ⋯ X_m ∈ P, m ≥ 1. Suppose that for 1 ≤ j ≤ m, [π_j] is the parse annotation of the transition on X_j in the reduction path. Then the parse annotation that is attached to this transition on A is [&π₁, &π₂, …, &π_m], i.e., an ordered list of pointers to the annotations associated with the transitions in the path in G_R that spells (X₁X₂ ⋯ X_m)^R.
In summary, a parse annotation is either (1) a descriptor of a terminal symbol, (2) a descriptor of the null string, or (3) a sequence of pointers to parse annotations. In order to reflect the close connection between parse annotations and recognition graph transitions, the notation used to specify transitions is modified slightly as follows. Currently, (p, X, q) denotes a transition in δ. In our discussion of General_LR0′, this transition will be denoted by the quadruple (p, X, q, [π]) where [π] is the parse annotation of (p, X, q). Thus, upon acceptance of the input string, a parse tree for it can be recovered from the grammar symbols and parse annotations that are associated with the transitions in G_R.
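One plausible encoding of the three forms of parse annotation, together with a routine that recovers a parse tree by chasing the pointers, is sketched below. Pairing each pointer with a grammar symbol stands in for the pointer to the transition to which the annotation is attached; all names are illustrative, not from the dissertation.

```python
class Term:
    """Annotation [a]: a descriptor for terminal a."""
    def __init__(self, a):
        self.a = a

class Eps:
    """Annotation [eps]: a descriptor for the empty string."""

class Refs:
    """Annotation [&pi_1, ..., &pi_m]: ordered references to other
    annotations; each reference is paired here with the grammar symbol of
    the transition it points to."""
    def __init__(self, *refs):
        self.refs = list(refs)

def to_tree(symbol, annot):
    """Recover the parse (sub)tree rooted at symbol from its annotation."""
    if isinstance(annot, Term):
        return symbol                          # a terminal leaf
    if isinstance(annot, Eps):
        return (symbol, ["eps"])               # symbol derives the empty string
    return (symbol, [to_tree(s, a) for s, a in annot.refs])
```

For a hypothetical production S → a A with A → ε, the S annotation holds references to the a and A annotations, and chasing them yields the expected two-child tree.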
Parsing the Empty String
As identified in Chapter VI, a transition in the recognition graph of the form (p, A, q) where p, q ∈ Q_i for some i, 0 ≤ i ≤ n+1, corresponds to a derivation of the empty string from A. Such transitions are handled in a particularly simple fashion by the Traverse function of General_LR0 since the steps in the derivation are not relevant to recognition. However, in order to fulfill its role as a general parser, General_LR0′ must be able to reconstruct a derivation of ε from A for this transition.
Some derivations of the empty string are especially troublesome, namely those which are unbounded in length. Unbounded derivations of ε are caused by those nonterminals A ∈ N for which A ⇒+ A ⇒* ε holds in G. General_LR0′ resolves this issue by disambiguating every ambiguous derivation of ε that occurs during a parse. The Traverse function is modified to accomplish this task. The details of the revised Traverse function are given in the next section. In the remainder of this section, we introduce some notions that are used in that later discussion of Traverse.
First, we define W = {A ∈ N | A ⇒* ε holds in G}. For each nonterminal symbol A ∈ W, Traverse minimizes the length of derivations of ε from A. Toward that end, a partition of W is defined as follows: (1) W₁ = {A ∈ W | A → ε ∈ P}, and (2) for i > 1, W_i = {A ∈ W | A ∉ W_j for 1 ≤ j < i, and there exists A → B₁B₂ ⋯ B_m ∈ P, m ≥ 1, with B_k ∈ W_{j_k}, j_k < i, for 1 ≤ k ≤ m and i = 1 + Σ_{1≤k≤m} j_k}. For each A ∈ W, A ∈ W_i if and only if i is the number of steps in a shortest derivation of ε from A. Of course, only those subsets W_i for which W_i ≠ ∅ are of interest. For each A ∈ W, define elength(A) = i if and only if A ∈ W_i. Thus, elength(A) denotes the length of a shortest derivation of ε from A.
In addition, a unique production is associated with each A ∈ W; this production is denoted by nuller(A). The intent is for nuller(A) to be used in the first step of any derivation of ε from A, or rather the last step in the complementary bottom-up parse of ε. By making use of nuller(A), ambiguous derivations of ε from A, if they are possible in G, are disambiguated by Traverse. For each A ∈ W, nuller(A) is defined by the first of the following two rules which applies.
(1) If A → ε ∈ P, then nuller(A) = A → ε.
(2) Otherwise, nuller(A) = A → B₁B₂ ⋯ B_m for some A → B₁B₂ ⋯ B_m ∈ P, m ≥ 1, such that elength(A) = 1 + Σ_{1≤j≤m} elength(B_j).
For each A ∈ W, there is a derivation of ε from A consisting of elength(A) steps in which the first step is an application of nuller(A). If nuller(A) is determined by rule (2) above, then more than one production may apply. In this case, an arbitrary choice can be made. Alternatively, some criteria may be applied toward making this choice more purposeful, e.g., that which minimizes m or the height of the resulting subparse tree.
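The partition W₁, W₂, … and the definitions of elength and nuller amount to a shortest-derivation fixpoint, which might be computed as follows (an illustrative sketch; the representation of productions is assumed). When several productions attain the minimum under rule (2), the first one encountered is kept, matching the arbitrary choice permitted above.

```python
def elength_and_nuller(productions):
    """Compute elength(A), the length of a shortest derivation of the empty
    string from A, and nuller(A) for every nullable nonterminal A, by
    iterating to a fixpoint over the productions.
    productions: list of (lhs, rhs) pairs with rhs a tuple of symbols;
    rhs == () encodes an epsilon production."""
    INF = float("inf")
    nts = {lhs for lhs, _ in productions}
    elength = {A: INF for A in nts}
    nuller = {}
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if rhs == ():                                   # rule (1): A -> epsilon
                cost = 1
            elif all(s in nts and elength[s] < INF for s in rhs):
                cost = 1 + sum(elength[s] for s in rhs)     # rule (2)
            else:
                continue                                    # not (yet) a nulling production
            if cost < elength[lhs]:
                elength[lhs] = cost
                nuller[lhs] = (lhs, rhs)
                changed = True
    return elength, nuller
```

For example, with A → ε | AA, B → AA, and C → BA, the fixpoint gives elength(A) = 1, elength(B) = 3, and elength(C) = 5, and rule (1) forces nuller(A) = A → ε despite the ambiguous A → AA.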
Before concluding this section, some motivation for disambiguating all derivations of ε is provided. Suppose that A ∈ W derives ε in more than one way. Then if some derivation of ε from A is a segment of a parse for the input string, any derivation of ε from A may be substituted for this segment. In particular, this substitution may be made independently of the context in which the segment occurs in the complete parse. If one derivation of ε from A is preferred in a given context, either the grammar must be modified to account for this or else the favored derivation must be specified by some context-sensitive means. Since context-sensitive extensions to context-free grammars are beyond the scope of this work, we choose to disambiguate all parses of ε so as to minimize derivation lengths.
The General_LR0′ Parser
The General_LR0′ parser is described next. For reference, the parser is rendered in pseudocode in Figure 7.1 (spanning three pages). The discussion focuses on the modifications made to the recognizer in deriving the parser. For the most part, the changes are rather minor. However, the Traverse function underwent substantial revision in order to correctly handle arbitrary derivations of the empty string.
1.  function General_LR0′(G = (V, T, P, S); w ∈ T*)
2.      // w = a₁a₂ ⋯ a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $.
3.      // Let M_C(G) = (I, V, goto, I₀, I_f) be the LR(0) automaton for G.
4.      // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
5.      Q, δ := {q_{0:0}}, ∅    // Initialize G_R.
6.      // Let M_R = (G_R; q_{0:0}, Q₀). Then L(M_R) = PVP(G, ε) = {ε}.
7.      for i := 0 to n do
8.          // Let M_R = (G_R; q_{0:0}, Q_i). Then L(M_R) = PVP(G, i:w).
9.          Reduce(i)
10.         // Let M_R = (G_R; q_{0:0}, Q_i). Then L(M_R) = VP(G, i:w).
11.         Shift(i)
12.         // Let M_R = (G_R; q_{0:0}, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.         if Q_{i+1} = ∅ then Reject(w) fi
14.     od
15.     // Let M_R = (G_R; q_{0:0}, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.     Accept(w)
17. end
18. function Shift(i)
19.     Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.     while Q_subset ≠ ∅ do
21.         q := Remove(Q_subset)    // Let goto(ψ(q), a_{i+1}) = I_j.
22.         if q_{j:i+1} ∉ Q then
23.             Q := Q ∪ {q_{j:i+1}}
24.         fi
25.         δ := δ ∪ {(q_{j:i+1}, a_{i+1}, q, [a_{i+1}])}    // Never redundant.
26.     od
27. end
Figure 7.1 The General_LR0′ Parser
(Line 1) The main function of the parser is named General_LR0′. In all other respects, this function is identical to the main function of the recognizer.
(25) The transitions installed by Shift are assigned appropriate parse annotations. As described earlier, the parse annotation for a_{i+1} ∈ T is denoted by [a_{i+1}].
(34) This line reflects the new form taken by the transitions of the recognition graph.
28. function Reduce(i)
29.     δ_subset := {(p, X, q, [π]) ∈ δ | p ∈ Q_i}
30.     Traverse(Q_i, i)
31.     Succ_Stack := ∅
32.     while Succ_Stack ≠ ∅ or δ_subset ≠ ∅ do
33.         if Succ_Stack = ∅ then
34.             (p, X, q, [π]) := Remove(δ_subset)
35.             for A → αX·β ∈ ψ(p) such that β ⇒* ε do
36.                 // Let [π_β] be the parse annotation for β.
37.                 Push(Succ_Stack, (q, A, len(α), [&π, π_β]))
38.             od
39.         else    // Succ_Stack ≠ ∅
40.             (r, A, d, [π₁]) := Pop(Succ_Stack)
41.             if d > 0 then    // Let X = entry(ψ(r)).
42.                 for r′ ∈ Q such that (r, X, r′, [π₂]) ∈ δ do
43.                     Push(Succ_Stack, (r′, A, d−1, [&π₂, π₁]))
44.                 od
45.             else    // d = 0, let goto(ψ(r), A) = I_j.
46.                 if q_{j:i} ∉ Q then
47.                     Q := Q ∪ {q_{j:i}}
48.                     Traverse({q_{j:i}}, i)
49.                 fi
50.                 if (q_{j:i}, A, r, [π]) ∉ δ for any [π] then
51.                     δ := δ ∪ {(q_{j:i}, A, r, [π₁])}
52.                     δ_subset := δ_subset ∪ {(q_{j:i}, A, r, [π₁])}
53.                 else    // Let (q_{j:i}, A, r, [π₂]) ∈ δ hold for some [π₂].
54.                     Disambiguate((q_{j:i}, A, r, [π₁]), (q_{j:i}, A, r, [π₂]))
55.                 fi
56.             fi
57.         fi
58.     od
59. end
Figure 7.1 continued
(35-38) As in the recognizer, we need to initiate all relevant reductions from p by pushing appropriate entries onto Succ_Stack. However, the computation of the succ function that is carried out here must also construct parse annotations for the transitions installed by Reduce. A fourth field is added to each entry in Succ_Stack for this purpose. In short, this
60. function Traverse(Q_subset, i)
61.     Q_subset′ := Q_subset
62.     while Q_subset ≠ ∅ do
63.         q := Remove(Q_subset)
64.         for goto(ψ(q), A) = I_j such that A ⇒* ε do    // A ∈ N.
65.             if q_{j:i} ∉ Q then
66.                 Q := Q ∪ {q_{j:i}}
67.                 Q_subset := Q_subset ∪ {q_{j:i}}
68.                 Q_subset′ := Q_subset′ ∪ {q_{j:i}}
69.             fi
70.             Insert(δ_sorted_list, (q_{j:i}, A, q))    // Never redundant.
71.         od
72.     od
73.     while δ_sorted_list ≠ ∅ do
74.         (p, A, q) := Remove_head(δ_sorted_list)
75.         if nuller(A) = A → ε then
76.             δ := δ ∪ {(p, A, q, [ε])}
77.         else    // Let nuller(A) = A → B₁B₂ ⋯ B_m, m ≥ 1.
78.             // ∃ a path (q_m, q_{m−1}, …, q₁, q) in G_R spelling B_m B_{m−1} ⋯ B₁,
79.             // i.e., (q_j, B_j, q_{j−1}, [π_j]), (q₁, B₁, q, [π₁]) ∈ δ, m ≥ j ≥ 2, for some π_j.
80.             δ := δ ∪ {(p, A, q, [&π₁, &π₂, …, &π_m])}
81.         fi
82.     od
83.     while Q_subset′ ≠ ∅ do
84.         q := Remove(Q_subset′)
85.         for A → αX·β ∈ ψ(q) such that β ⇒* ε do
86.             if β = ε then
87.                 Let the parse annotation for β be []
88.             else    // Let β = B₁B₂ ⋯ B_m, m ≥ 1.
89.                 // ∃ a path (q_m, …, q₁, q) in G_R spelling B_m B_{m−1} ⋯ B₁, i.e.,
90.                 // (q_j, B_j, q_{j−1}, [π_j]), (q₁, B₁, q, [π₁]) ∈ δ, m ≥ j ≥ 2, for some π_j.
91.                 Let the parse annotation for β be [&π₁, &π₂, …, &π_m].
92.             fi
93.         od
94.     od
95. end
Figure 7.1 continued
field is used for storing the parse annotation corresponding to the path traversed so far in the course of making a reduction. Consider the reduction from p on the production A → αXβ where β ⇒* ε holds in G. The parse annotation of every transition on A that results from this reduction will include a pointer to the parse annotation of the transition on X from p to q, namely &π. In addition, it must include the parse annotation relevant to the nullable suffix β, referred to here as [π_β]. One of the tasks of Traverse is to compute [π_β] and associate it with the item A → αX·β of ψ(p); in particular, Traverse will have done this by the time this reduction is made. Thus, the parse annotation [&π, π_β] is the fourth field of the entry pushed onto Succ_Stack that corresponds to this reduction.
(40) This line reflects the new form of the Succ_Stack entries. At this point, π₁ represents a nonempty sequence of pointers to parse annotations. These parse annotations correspond to the suffix of some production right-hand side that is being reduced to A.
(42-44) This loop demonstrates how parse annotations are built up during the course of computing the succ function. For every transition (r, X, r′, [π₂]) that is traversed within this loop, a pointer to [π₂] together with the parse annotation built up so far, [π₁], becomes part of the parse annotation for the transition on A that is eventually installed in the recognition graph. Thus, [&π₂, π₁] is the fourth field of the appropriate entry pushed onto Succ_Stack.
(50-55) If (q_{j:i}, A, r, [π]) ∉ δ for any parse annotation [π], we proceed as before. The transition (q_{j:i}, A, r, [π₁]) is installed in G_R and added to δ_subset to allow for subsequent reductions back through it. Note that at this point π₁ represents a nonempty sequence of pointers to parse annotations corresponding to the right-hand side of some production that has been reduced to A; more specifically, the sequence of pointers corresponds to a path in G_R that spells that right-hand side. On the other hand, if (q_{j:i}, A, r, [π₂]) ∈ δ for some parse annotation [π₂], then an ambiguity has been detected. The Disambiguate function is invoked, the details of which are not specified here, to decide which parse annotation out of [π₁] and [π₂] to retain with the transition.
It is apparent from Figure 7.1 that the Traverse function is substantially more extensive than before. It now consists of three while loops. Each loop is discussed in turn.
The first while loop is very similar to the single while loop contained in the version of Traverse used by the General_LR0 recognizer. Two new lines have been added and one line has been modified.
(61, 68) The set variable Q_subset′ is initialized to the contents of Q_subset in line 61. In line 68, each new state that is added to Q within the first while loop is also added to Q_subset′. The states contained in Q_subset′ after the first loop completes are processed later in the third while loop.
(70) The transitions on nullable nonterminals are not directly added to δ as before. Instead, they are entered into a list called δ_sorted_list. The elements of the form (p, A, q) in δ_sorted_list are sorted in order of increasing elength(A). The contents of δ_sorted_list are processed by the second while loop.
Within the second while loop, an appropriate parse annotation is determined for each element in δ_sorted_list and the annotated transitions are installed into the recognition graph. The parse annotation assigned to (p, A, q) is determined by nuller(A).
(73) Each element in δ_sorted_list is considered in turn. No additional elements are added to δ_sorted_list within this loop.
(74) The element (p, A, q) at the head of δ_sorted_list is removed. At this point, we know that elength(A) ≥ elength(A′) for each element (p′, A′, q′) removed from δ_sorted_list in an earlier iteration of the loop.
(75-76) Suppose that nuller(A) = A → ε. Then [ε] is the appropriate parse annotation for (p, A, q). Thus, the transition (p, A, q, [ε]) is added to δ.
(77-80) Otherwise, nuller(A) = A → B₁B₂ ⋯ B_m for some production A → B₁B₂ ⋯ B_m ∈ P where m ≥ 1. This implies that elength(B_i) < elength(A) holds for each B_i. Since δ_sorted_list was sorted in order of increasing elength, an annotated transition on each B_i has already been installed in G_R. In particular, there must be a path (q_m, q_{m−1}, …, q₁, q) in G_R which spells B_m B_{m−1} ⋯ B₁. The transitions in this path are of the form (q_j, B_j, q_{j−1}, [π_j]), (q₁, B₁, q, [π₁]) ∈ δ, m ≥ j ≥ 2, for some parse annotations [π_j]. In this case, (p, A, q, [&π₁, &π₂, …, &π_m]) is the appropriate transition to add to δ.
The third while loop processes the states contained in Q_subset′. In particular, for each state q in Q_subset′ and each item of the form A → αX·β ∈ ψ(q) such that β ⇒* ε holds in G, this loop determines an appropriate parse annotation to associate with the nullable suffix β. Thus, q is readied for any reductions that are initiated from it in the for loop at line 35 of the Reduce function. Note that at this point none of the states in Q_subset′ have had reductions made from them yet.
(83) Each state in Q_subset′ is considered in turn. No new states are added to Q_subset′ within the loop.
(84) A state q is removed from Q_subset′.
(85) For each A → αX·β ∈ ψ(q) such that β ⇒* ε holds in G, we want to associate a parse annotation with the nullable suffix β. This becomes the parse annotation [π_β] that is referred to in lines 36-37 of the Reduce function.
(86-87) If β = ε, then the appropriate parse annotation to associate with β is [].
(88-91) Otherwise, β = B₁B₂ ⋯ B_m for some B_j ∈ W and m ≥ 1. Due to the processing done in the second while loop, there is a path (q_m, q_{m−1}, …, q₁, q) in G_R which spells B_m B_{m−1} ⋯ B₁. Let (q_j, B_j, q_{j−1}, [π_j]), (q₁, B₁, q, [π₁]) ∈ δ, m ≥ j ≥ 2, for some parse annotations [π_j], be the transitions in that path. Then the appropriate parse annotation to associate with β in this case is [&π₁, &π₂, …, &π_m].
The Complexity of Parsing
Worst-case complexity bounds for the General_LR0′ parser are easily derived from the complexity bounds of the recognizer. In the following, we assume that General_LR0′ is applied to G and w and that w ∈ L(G) holds. Space bounds are examined first.
The size of a parse annotation is bounded by some constant, e.g., the length p of the longest production right-hand side. If, as assumed, ambiguities are resolved when they are first detected, only one parse annotation is ever attached to a given transition in G_R. Thus, the space complexity of the parser is the same as the space complexity of the recognizer. That is, the space complexity of General_LR0′ is O(n²) if G is arbitrary, or unambiguous but otherwise arbitrary, and it is O(n) if G is LR(k) and k-symbol lookahead is employed.
The L_sorted_list that is used by the parser's version of Traverse contains at most m²
entries at any time. Thus, its use does not affect the space complexity of parsing.
In the other extreme, the resolution of all ambiguities discovered by Reduce is delayed
until after the input string is accepted. Under this scenario, one parse annotation is attached
to a nonterminal transition for each path in GR that reduces to that transition. In this case,
the space complexity of the parser is the same as the time complexity of the recognizer, i.e.,
O(nᵖ⁺¹).
Next, the time complexity of the parser is considered. The most substantial differences
between the parser and the recognizer lie with the manufacture of parse annotations and the
Traverse function. The amount of work done within each invocation of Traverse is bounded
by constant factors that are related to the size of MC(G). Since Traverse is called at most
m times within any invocation of Reduce, the more complicated Traverse function used by
General_LR0' does not increase the time complexity of parsing with respect to recognition.
Moreover, the operations related to constructing parse annotations can clearly be done in a
constant amount of time. Therefore, the worst-case time complexity of the parser is O(nᵖ⁺¹)
if G is arbitrary and O(n²) if G is unambiguous. In addition, LR(k) grammars can be
parsed in linear time provided that k-symbol lookahead is used.
Since the Disambiguate function has not been specified, its impact on the time complexity
of parsing cannot be assessed. In that respect, the above analyses implicitly assume that
the Disambiguate function runs in constant time. However, if more costly mechanisms are
required for resolving ambiguity, the time consumed by them must be accounted for.
Garbage Collection Revisited
Lookahead can be employed within General_LR0' exactly as in General_LR0. However,
the garbage collection procedure proposed for General_LR0 is too simplistic for the
parser. The underlying reason for this lies with the manner in which the parse forest is
superimposed on the recognition graph.
Consider a point during the parse of an input string at which we would like to perform
garbage collection. If the garbage collection procedure proposed for General_LR0 is applied,
the recognition graph may be contracted more than is desired for parsing. Specifically, transitions
may be deleted from GR whose parse annotations are part of the parse forest relevant
to the prefix of the input string analyzed to that point. The marking phase of the garbage
collection procedure must be modified accordingly to correct for this.
Consider the recognition graph just prior to performing garbage collection. Informally,
we will refer to the states in GR that are not deleted by our original garbage collection procedure
as being essential to recognition. The states in GR that are essential to parsing are
defined inductively as follows.
(1) If p ∈ Q is essential to recognition, then p is essential to parsing.
(2) If p ∈ Q is essential to parsing and entry(p) = A for some A ∈ N, then for every
transition (p, A, q, [π]) ∈ δ where [π] = [&π1, &π2, …, &πm], m ≥ 1, let
(r, X, s, [πm]) ∈ δ be the rightmost transition referenced in [π]. Then r and all
states reachable from r are essential to parsing.
The marking phase of the garbage collection procedure must be modified so as to mark
all states in GR that are essential to parsing. In order to accomplish this, certain branches of
the parse forest must be traversed according to the inductive definition given above. The
second step of the garbage collection procedure, that which deletes unmarked states and
their outgoing transitions, remains unchanged.
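The modified marking phase might be sketched as follows. The adjacency map and the rightmost_source helper are hypothetical stand-ins for GR's transition store and for extracting the source state r of the rightmost transition referenced in an annotation.

```python
# Sketch of the modified marking phase: states essential to recognition seed
# the mark; each marked state keeps alive both the states reachable from it
# and, for every annotated transition, the source state of the rightmost
# transition referenced by the annotation (a branch of the parse forest).

def mark_essential_to_parsing(seed_states, out_transitions, rightmost_source):
    """seed_states: states essential to recognition.
    out_transitions: state -> [(symbol, dst, annotation)].
    rightmost_source: annotation -> source state r of the rightmost
    transition referenced in the annotation, or None for terminal moves."""
    marked, work = set(), list(seed_states)
    while work:
        p = work.pop()
        if p in marked:
            continue
        marked.add(p)
        for symbol, dst, ann in out_transitions.get(p, []):
            work.append(dst)          # states reachable from p stay alive
            r = rightmost_source(ann)
            if r is not None:
                work.append(r)        # the parse-forest branch stays alive
    return marked
```

The sweep step (deleting unmarked states and their outgoing transitions) is unchanged from the recognizer's procedure.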
Discussion
The General_LR0 recognizer was extended into a general context-free parser. The
parse forest constructed by General_LR0' is represented by attaching appropriate parse
annotations to the transitions of GR. In effect, the parse forest is superimposed on the
recognition graph.
Only minor modifications were required of the Shift and Reduce functions in order to
accommodate parsing. The Traverse function, on the other hand, was changed substantially.
It is important to note that Traverse can handle the most ill-formed grammars. For example,
consider the grammar with the production set P = {S0 → S0S0, S0 → a, S0 → S1,
S1 → S2, …, Sk−1 → S0} for some k ≥ 1. This grammar was submitted by Graham et al. [20, p.
429] as an example of a particularly bad worst case. Although this is a contrived example,
the ability to effectively deal with pathological conditions if and when they arise is valuable
from both a theoretical and practical standpoint. Toward that end, the Traverse function
handles the worst situations in a fairly straightforward manner. Nevertheless, Traverse can
be tailored to meet the specific requirements of the subject grammar if the generality it provides
is not needed.
A parse annotation for a nonterminal transition is manufactured as a sequence of
pointers to the parse annotations that are encountered while a path is traversed during a
reduction. Tomita's algorithm performs similar operations to construct a parse forest. In his
parsing algorithm, the symbol vertices of the recognizer are used for storing pointers to the
nodes of the parse forest. Of course, the complexity introduced into the recognizer by the
symbol vertices and the ad hoc manner in which productions are handled carry over to the
parser.
In Tomita's algorithm, the parse forest is built separately from the graph-structured
stack. General_LR0' constructs the parse forest more or less on top of the recognition graph,
but could just as easily build the parse forest separately as well. The choice that is made for
an actual implementation primarily has implications on garbage collection.
The worst-case time complexity of General_LR0' matches that of General_LR0. With
respect to General_LR0', the expression nᵖ⁺¹ reflects the time required, in the worst case, to
construct a direct representation of the parse forest. Thus, the relative inefficiency of
General_LR0 as compared to Earley's recognizer is offset by the benefits accrued by
General_LR0'. Specifically, the traversals that are required to produce a parse and to resolve
ambiguities are made convenient by the structure of the parse forest. In contrast, Earley's
parser produces a rather indirect representation of the parse forest. Little is said in the
literature of how this affects the ease with which a parse is produced or with which ambiguities
are resolved by Earley's parser.
The hypothetical Disambiguate function referred to in Figure 7.1 allowed us to keep
the specification of transitions simple. By assumption, Disambiguate resolved ambiguities at
the point where they were first detected, so only one parse annotation was ever attached to a
given transition in GR. Of course, this assumption is unrealistic in the general case. A substantive
treatment of ambiguity and its resolution is well beyond the scope of this work.
However, the following very basic observations may be made.
The task confronted by Disambiguate in line 54 of Figure 7.1 is to determine which
transition out of (qj−1, A, r, [π1]) and (qj−1, A, r, [π2]) to retain in GR. In order of increasing
complexity, a selection may be made based on the following strategies.
(1) Through a direct comparison of [π1] and [π2].
(2) A combination of (1) and an analysis of the subparse trees referred to by [π1] and
[π2], respectively.
(3) An analysis of the surrounding context in combination with (1) and (2).
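A strategy of type (1) admits a very small sketch. The selection policy shown (a caller-supplied ranking of annotations) is purely illustrative, since Disambiguate is deliberately left unspecified here.

```python
# A toy type (1) Disambiguate: retain one of two rival annotations for the
# same transition (q, A, r, _) by direct comparison. `rank` is a hypothetical
# caller-supplied total order on annotations; any such order would do.

def disambiguate(ann1, ann2, rank):
    """Return the annotation to retain in GR."""
    return ann1 if rank(ann1) <= rank(ann2) else ann2
```

For example, ranking annotations by their number of sub-annotations (rank=len) always retains the flatter of the two rival subparses.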
The Disambiguate function could conceivably resolve ambiguities that entailed analyses of
type (1) or (2) above. On the other hand, ambiguities requiring type (3) analysis would have
to be postponed until later in the parse if they depended on right context. Some simple
approaches to handling ambiguity are described by Aho et al. [2], Earley [15], Tarhio [41],
and Wharton [45].
CHAPTER VIII
CONCLUSION
Summary of Main Results
The first part of this work presented a framework for describing general canonical
context-free recognition. The framework has a structurally simple mathematical foundation.
The essence of general canonical recognition was captured using a small number of binary
relations and basic set-theoretic concepts. Each general recognition scheme that was
presented followed the same script while exploiting inherent properties of viable prefixes.
Specifically, general recognition was reduced to computing a sequence of regular sets in each
case. Regularity-preserving relations were applied to effect the set-to-set mappings. Our
characterization of general recognition is novel and rather elegant. Its clarity and simplicity
confirm that viable prefixes are especially suitable bases for general recognition. Moreover,
our framework offers a conceptual breakthrough toward a better understanding of the
quintessence of general canonical recognition.
Earley's algorithm proved an especially fitting vehicle for demonstrating the efficacy of
the General_LR and General_LL recognition schemes. In particular, our graphical variant of
Earley's recognizer, Earley', illustrated one way of realizing explicit representations for the
sets of viable prefixes and viable suffixes that are tracked by these two complementary
schemes. The fact that General_LR is directly manifested by Earley' led us to conclude that
it is more appropriate to interpret Earley's algorithm as a bottom-up method rather than a
top-down one. Regardless of which interpretation one favors, Earley' provided much new
insight into Earley's algorithm. Specifically, a deeper understanding of Earley's algorithm
was gained and its relationship with LR parsers was clarified.
The last two chapters were devoted to describing practical recognizers and parsers that
are derived from the General_LR recognition scheme. Automata-based versions of
General_LR are obtained by using an automaton that accepts VP(G) to guide the construction
of a state-transition graph, the recognition graph. The recognition graph explicitly
represents the sets of viable prefixes that are computed by General_LR. In the discussion of
the algorithms, LR(0) and NLR(0) automata were used as control automata. However, other
choices are possible such as automata that are intermediate between the LR(0) and NLR(0)
automata as well as automata that are attributed with lookahead. The General_LR0 parser
can process arbitrary reduced context-free grammars. To accommodate especially ill-designed
grammars, simple means for dealing with pathological grammar properties were
presented. Finally, the parse forest representation used by the General_LR0 parser is easy
to understand and convenient for handling ambiguity.
We have included some discussion of how the Earley and Tomita algorithms compare
to ours. Although the O(nᵖ⁺¹) worst-case time complexity of the General_LR0 recognizer
does not compare favorably with the O(n³) worst-case complexity of Earley's recognizer, it
is expected that General_LR0 would outperform Earley's algorithm in most practical situations.
Moreover, it is more convenient to work with the representation of the parse forest
that is used in our framework. The General_LR0 algorithm is in the same complexity class
as Tomita's algorithm. This is not a surprising result given the similarities between the two.
However, our algorithm can parse any reduced grammar. Thus, we have generalized
Tomita's algorithm; ironically, our general algorithm is also simpler than Tomita's. Lastly,
our framework provides some firm theoretical justification for Tomita-like parsers. Tomita's
algorithm is notably lacking in that respect in that it is more of an ad hoc generalization of
the standard LR parsing algorithm.
The general parsers derived in our framework, viz., the General_LR0 parser and its
variants, are appropriate to areas of application which require more flexible parsers than are
provided within the confines of LR parsing theory. In a more general sense, our work provides
a basis from which many issues relating to context-free recognition and parsing may be
further investigated. Most notably, our viable prefix-based model of recognition and parsing
offers a particularly appropriate framework within which a broad spectrum of related parsing
strategies (LR parsers, the Earley and Tomita algorithms, and our general parsers) may
be further studied and compared.
Directions for Future Research
Before concluding, we suggest some possible directions for further research. There are
several worthwhile prospects. Of course, it is assumed that the framework laid down herein
would be used as a starting point for the endeavors described below.
Several automata-based versions of the General_LR recognition scheme were considered.
Specifically, concrete realizations of General_LR were borne out by the Earley',
General_LR0, and General_NLR0 recognizers. The other left-to-right recognition scheme,
General_LL, was mimicked by Earley' in a rather obscure fashion. The automata-theoretic
aspects of General_LL should be investigated to determine more direct means for tracking
the sets of viable suffixes that are computed by it. Our preliminary findings along this line
indicate that an automata-based General_LL recognizer that runs in O(n³) time in the worst
case is indeed attainable. That is, the time complexity does not depend on the length of production
right-hand sides as is the case with General_LR0. However, we were unable to
extend this general viable suffix-based recognizer into a parser, so further study of this issue
was suspended.
It is expected that a pursuit of the following three topics would benefit from experimenting
with actual implementations.
(1) Ascertain a more precise characterization of the O(n²)-time and O(n)-time grammar
classes. It is well-known that Earley's algorithm recognizes grammars with
bounded ambiguity in quadratic time; moreover, even some ambiguous grammars
are recognized in linear time.
(2) Consider alternate control automata for implementing the General_LR recognition
scheme (including automata that are attributed with lookahead). We have
already suggested employing automata that are intermediate between NLR(0)
automata and LR(0) automata.
(3) Identify means for classifying ambiguity and investigate disambiguation strategies.
As described, the General_LR0 parser produces a parse of the input string only after
the string is accepted, i.e., like Earley's algorithm. It would be advantageous to be able to
obtain parse fragments as soon as they are known to be part of a final parse. The parser
would then behave more like an extended LR parser. The General_LR0 parser should be
modified to provide for such a piecemeal delivery of a parse. Note that such a mechanism
would have implications on garbage collection.
The O(nᵖ⁺¹) worst-case time complexity of General_LR0 compares unfavorably with
Earley's algorithm. The last topic that we suggest addresses this. A grammar is in canonical
two-form if its productions are of the forms A → BC, A → B, A → a, and A → ε [39].
Clearly, every canonical two-form grammar can be recognized in O(n³) time. One possible
approach to recognizing an arbitrary grammar in O(n³) time is to transform it into an
equivalent canonical two-form grammar and recognize the input string with respect to the
new grammar. A parse in the original grammar could then be reconstructed from the parse
that is obtained in the transformed canonical two-form grammar.
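The binarization half of such a transformation is mechanical: every production A → X1X2 ⋯ Xk with k > 2 is split by introducing fresh nonterminals. The sketch below (with a hypothetical fresh-name scheme) shows only this step; wrapping terminal occurrences that appear inside long right-hand sides is omitted.

```python
# Sketch: split long right-hand sides into canonical two-form productions.
# A -> X1 X2 ... Xk (k > 2) becomes A -> X1 N1, N1 -> X2 N2, ..., with N_i fresh.

def to_two_form(productions):
    """productions: list of (lhs, rhs) with rhs a tuple of symbols."""
    result, fresh = [], 0
    for lhs, rhs in productions:
        while len(rhs) > 2:
            new_nt = f"<{lhs}.{fresh}>"   # fresh nonterminal (hypothetical naming)
            fresh += 1
            result.append((lhs, (rhs[0], new_nt)))
            lhs, rhs = new_nt, rhs[1:]
        result.append((lhs, rhs))         # length 0, 1, or 2: already two-form
    return result
```

The transformed grammar generates the same language, and each introduced nonterminal corresponds to a suffix of an original right-hand side, which is what makes reconstructing a parse in the original grammar possible.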
REFERENCES
[1] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. The Design and Analysis of Computer
Algorithms. Addison-Wesley, Reading, Mass., 1974.
[2] Aho, A. V., Johnson, S. C., and Ullman, J. D. Deterministic parsing of ambiguous
grammars. Commun. ACM 18(8), pp. 441–52, Aug. 1975.
[3] Aho, A. V. and Peterson, T. G. A minimum distance error-correcting parser for
context-free languages. SIAM J. Comput. 1(4), pp. 305–12, Dec. 1972.
[4] Aho, A. V., Sethi, R., and Ullman, J. D. Compilers: Principles, Techniques, and Tools.
Addison-Wesley, Reading, Mass., 1986.
[5] Aho, A. V. and Ullman, J. D. Optimization of LR(k) parsers. J. Comput. Syst.
Sci. 6(6), pp. 573–602, Dec. 1972.
[6] Aho, A. V. and Ullman, J. D. The Theory of Parsing, Translation, and Compiling.
Volume I: Parsing, Prentice-Hall, Englewood Cliffs, N. J., 1972.
[7] Aho, A. V. and Ullman, J. D. The Theory of Parsing, Translation, and Compiling.
Volume II: Compiling, Prentice-Hall, Englewood Cliffs, N. J., 1973.
[8] Bermudez, M. E. and Logothetis, G. Simple computation of LALR(1) lookahead sets.
Inf. Process. Lett. 31(5), pp. 233–8, 12 June 1989.
[9] Bouckaert, M., Pirotte, A., and Snelling, M. Efficient parsing algorithms for general
context-free parsers. Inf. Sci. 8, pp. 1–26, Jan. 1975.
[10] Christopher, T. W., Hatcher, P. J., and Kukuk, R. C. Using dynamic programming to
generate optimized code in a Graham-Glanville style code generator. ACM SIGPLAN
Notices 19(6), pp. 25–36, June 1984.
[11] DeRemer, F. and Pennello, T. Efficient computation of LALR(1) lookahead sets.
ACM Trans. Program. Lang. Syst. 4(4), pp. 615–49, Oct. 1982.
[12] DeRemer, F. L. Simple LR(k) grammars. Commun. ACM 14(7), pp. 453–60, July
1971.
[13] Earley, J. An efficient context-free parsing algorithm. Ph. D. Thesis, Comput. Sci.
Dept., Carnegie-Mellon U., Pittsburgh, Pa., 1968.
[14] Earley, J. An efficient context-free parsing algorithm. Commun. ACM 13(2), pp. 94–
102, Feb. 1970.
[15] Earley, J. Ambiguity and precedence in syntax description. Acta Inf. 4(2), pp. 183–92,
1975.
[16] Ginsburg, S. and Greibach, S. Deterministic context-free languages. Inf. Control 9(6),
pp. 620–48, Dec. 1966.
[17] Glanville, R. S. and Graham, S. L. A new method for compiler code generation. In
Conference Record of the Fifth Annual ACM Symposium on Principles of Programming
Languages, pp. 231–40, Association for Computing Machinery, New York, N. Y.,
Jan. 1978.
[18] Gonzalez, R. C. and Thomason, M. G. Syntactic Pattern Recognition, An Introduction.
Addison-Wesley, Reading, Mass., 1978.
[19] Graham, S. L. and Harrison, M. A. Parsing of general context-free languages. In
Advances in Computers, ed. M. Rubinoff and M. C. Yovits, vol. 14, pp. 77–185,
Academic Press, New York, N. Y., 1976.
[20] Graham, S. L., Harrison, M. A., and Ruzzo, W. L. An improved context-free recognizer.
ACM Trans. Program. Lang. Syst. 2(3), pp. 415–62, July 1980.
[21] Greibach, S. A. A note on pushdown store automata and regular systems. Proc.
Amer. Math. Soc. 18(2), pp. 263–8, April 1967.
[22] Griffiths, T. and Petrick, S. On the relative efficiencies of context-free grammar recognizers.
Commun. ACM 8(5), pp. 289–300, May 1965.
[23] Heering, J., Klint, P., and Rekers, J. Incremental generation of parsers. ACM SIGPLAN
Notices 24(7), pp. 179–91, July 1989.
[24] Hopcroft, J. E. and Ullman, J. D. Introduction to Automata Theory, Languages, and
Computation. Addison-Wesley, Reading, Mass., 1979.
[25] Kasami, T. and Torii, K. A syntax analysis procedure for unambiguous context-free
grammars. J. ACM 16(3), pp. 423–31, July 1969.
[26] Kipps, J. R. Analysis of Tomita's algorithm for general context-free parsing. In
Proceedings of the International Workshop on Parsing Technologies, pp. 193–202,
Carnegie-Mellon U., Pittsburgh, Pa., 28–31 Aug. 1989.
[27] Knuth, D. E. On the translation of languages from left to right. Inf. Control 8(6), pp.
607–39, Oct. 1965.
[28] Knuth, D. E. Top-down syntax analysis. Acta Inf. 1(2), pp. 79–110, 1971.
[29] Kristensen, B. B. and Madsen, O. L. Methods for computing LALR(k) lookahead.
ACM Trans. Program. Lang. Syst. 3(1), pp. 60–82, Jan. 1981.
[30] Langmaack, H. Application of regular canonical systems to grammars translatable
from left to right. Acta Inf. 1(2), pp. 111–14, 1971.
[31] Lyon, G. Syntax-directed least-errors analysis for context-free languages: a practical
approach. Commun. ACM 17(1), pp. 3–14, Jan. 1974.
[32] Manacher, G. K. An improved version of the Cocke-Younger-Kasami algorithm. Comput.
Lang. 3, pp. 127–33, 1978.
[33] Mayer, O. On deterministic canonical bottom-up parsing. Inf. Control 43(3), pp. 280–
303, Dec. 1979.
[34] Nijholt, A. Computers and Languages, Theory and Practice. North-Holland, Amsterdam,
1988.
[35] Nozohoor-Farshi, R. Handling of ill-designed grammars in Tomita's parsing algorithm.
In Proceedings of the International Workshop on Parsing Technologies, pp. 182–92,
Carnegie-Mellon U., Pittsburgh, Pa., 28–31 Aug. 1989.
[36] Rosenkrantz, D. J. and Stearns, R. E. Properties of deterministic top-down grammars.
Inf. Control 17(3), pp. 226–56, Oct. 1970.
[37] Salomaa, A. Theory of Automata. Pergamon Press, Oxford, 1969.
[38] Sippu, S. and Soisalon-Soininen, E. On LL(k) parsing. Inf. Control 53(3), pp. 141–64,
June 1982.
[39] Sippu, S. and Soisalon-Soininen, E. Parsing Theory. Springer-Verlag, New York, N.
Y., 1988.
[40] Soisalon-Soininen, E. and Tarhio, J. Looping LR parsers. Inf. Process. Lett. 26(5), pp.
251–3, 11 Jan. 1988.
[41] Tarhio, J. LR parsing of some ambiguous grammars. Inf. Process. Lett. 14(3), pp.
101–3, 16 May 1982.
[42] Tomita, M. Efficient Parsing for Natural Language, A Fast Algorithm for Practical
Systems. Kluwer Academic Publishers, Boston, Mass., 1986.
[43] Tomita, M. An efficient augmented-context-free parsing algorithm. Comp.
Linguistics 13(1–2), pp. 31–46, Jan.–June 1987.
[44] Valiant, L. G. General context-free recognition in less than cubic time. J. Comput.
Syst. Sci. 10(2), pp. 308–15, April 1975.
[45] Wharton, R. M. Resolution of ambiguity in parsing. Acta Inf. 6(4), pp. 387–95, 1976.
[46] Younger, D. H. Recognition and parsing of context-free languages in time n³. Inf.
Control 10(2), pp. 189–208, Feb. 1967.
BIOGRAPHICAL SKETCH
The author spent his early formative years in Georgetown, Massachusetts. An undergraduate
education in Physics at Rensselaer Polytechnic Institute was followed by four years
with the General Electric Company in upstate New York. Thence the corporate chains were
shorn for the charms of the South Pacific. Not wanting too much of a good thing, the author
resumed his education as a graduate student at the University of Florida. In due time, an
M.S. in Computer Science was attained. Spurred on by the sage counseling of his mentor, a
Ph.D. in Computer Science was pursued as an afterthought.
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
Manuel E. Bermudez, Chairman
Assistant Professor of
Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
George Logothetis, Cochairman
Assistant Professor of
Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
Yuan-Chieh Chow
Professor of
Computer and Information Sciences
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
I certify that I have read this study and that in my opinion it conforms to acceptable
standards of scholarly presentation and is fully adequate, in scope and quality, as a dissertation
for the degree of Doctor of Philosophy.
David C. Wilson
Professor of
Mathematics
This dissertation was submitted to the Graduate Faculty of the College of Engineering
and to the Graduate School and was accepted for partial fulfillment of the requirements of
the degree of Doctor of Philosophy.
May 1990
Winfred M. Phillips
Dean, College of Engineering
Madelyn M. Lockhart
Dean, Graduate School
is actually introduced into a recognition graph depends on the input string as well as the subject
grammar.
A result by Soisalon-Soininen and Tarhio [40] relating to the concept of a looping LR
parser was helpful in identifying the grammatical properties that give rise to cyclic recognition
graphs. Looping LR parsers are discussed in conjunction with a method for constructing
deterministic LR parsers for some non-LR(k) grammars [2]; this method involves disambiguating
multiply-defined parse table entries. A looping LR parser is an LR parser that has a
parsing configuration such that all subsequent actions are reductions. The non-LR(k) grammars
for which looping LR parsers can be produced (i.e., for some set of disambiguation
choices) can be characterized as follows.
Fact 6.1 A looping LR parser can be constructed for G if and only if for some A ∈ N
and α, β ∈ V* the following three statements hold in G: (1) A ⇒+ αAβ, (2) α ⇒* ε, and (3) if
α = ε, then β ≠ ε.
Proof. This is the main result presented by Soisalon-Soininen and Tarhio [40].
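Conditions (1) and (2) of the fact can be tested mechanically. A sketch, assuming a precomputed nullable set: draw an edge B → Xj whenever B → X1⋯Xk is a production and X1⋯Xj−1 are all nullable; then A ⇒+ αAβ with α ⇒* ε holds exactly when A lies on a cycle of this edge relation.

```python
# Sketch: find the nonterminals A with A =>+ alpha A beta and alpha =>* eps,
# i.e. conditions (1)-(2) of Fact 6.1. `nullable` is assumed precomputed.

def loop_candidates(productions, nullable, nonterminals):
    # Edge B -> X_j when X_1 ... X_{j-1} are all nullable.
    edges = {A: set() for A in nonterminals}
    for lhs, rhs in productions:
        for sym in rhs:
            if sym in nonterminals:
                edges[lhs].add(sym)
            if sym not in nullable:
                break          # symbols after sym no longer have a nullable prefix

    def reaches(src, dst):     # dst reachable from src in one or more edge steps
        seen, work = set(), list(edges[src])
        while work:
            x = work.pop()
            if x in seen:
                continue
            seen.add(x)
            work.extend(edges.get(x, ()))
        return dst in seen

    return {A for A in nonterminals if reaches(A, A)}
```

For instance, with A → BC, B → ε, C → Aa, both A and C derive themselves behind a nullable prefix, so both are flagged.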
In summary, a cycle in Mc is introduced into GR only if it spells a nontrivial string of
nullable nonterminal symbols. Paths spelling strings of nullable nonterminals which can
cause cycles are introduced into GR by the Traverse function. This is effectively carried out
through a traversal of Mc where each state in Mc is considered at most once. Once cycles
are present in GR, they are traversed, if at all, in the Reduce function. Specifically, the computation
of the succ function implies a traversal of certain paths in GR, including those
which contain cycles. An implementation of the succ function which properly deals with
cycles in GR is described in a later subsection. In either case, cyclic control automata and
recognition graphs do not pose any particular difficulty to General_LR0.
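The cycle-safe ingredient of such an implementation is a closure computation in which each state is expanded at most once. The sketch below (graph representation hypothetical) collects the states reachable via transitions on nullable nonterminals; the visited set guarantees termination even when GR is cyclic.

```python
# Sketch: states reachable from `starts` along transitions labeled with
# nullable nonterminals. Marking states as seen before expanding them makes
# the traversal terminate on cyclic recognition graphs.

def nullable_closure(graph, starts, nullable):
    """graph: state -> [(symbol, dst)]; nullable: set of nullable symbols."""
    seen, work = set(starts), list(starts)
    while work:
        q = work.pop()
        for sym, dst in graph.get(q, ()):
            if sym in nullable and dst not in seen:
                seen.add(dst)
                work.append(dst)
    return seen
```

Even a two-state cycle on a nullable nonterminal is traversed exactly once, since each state enters the worklist at most one time.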
Set Operations
Two sets are maintained by General_LR0 during recognition, viz., Q and δ. Two set
operations are used in the process. One operation is that of determining if a particular
Proof. Assume that α ⇒* Aβ holds in G. By Fact 3.2, αR ⇒* (Aβ)R = βRA holds in GR.
Thus, αR ⇒r* βRA also holds in GR by Lemma 3.2.
Lemma 3.28 For α ∈ V* and a ∈ T, if α ⇒* aβ holds in G for some β ∈ V*, then
αR ⇒r* γa holds in GR for some γ ∈ V*.
Proof. If α ⇒* aβ holds in G for some β ∈ V*, then αR ⇒* (aβ)R = βRa holds in GR by Fact
3.2. By Lemma 3.3, it follows that αR ⇒r* γa holds in GR for some γ ∈ V*.
Lemma 3.29 For A ∈ N and X ∈ V, X is left-reachable from A in G if and only if X
is right-reachable from A in GR.
Proof. Assume that A ⇒* Xβ holds in G for some β ∈ V*. By Fact 3.2, A ⇒* (Xβ)R = βRX
holds in GR, so A ⇒r* αX holds in GR for some α ∈ V*. This latter conclusion follows from
Lemma 3.2 if X ∈ N, and from Lemma 3.3 otherwise. Conversely, suppose that A ⇒r* αX
holds in GR for some α ∈ V*. It follows from Lemma 3.1 that A ⇒* αX holds in GR. By
Fact 3.2, A ⇒* (αX)R = XαR holds in G.
Corollary For A ∈ N, A is left-recursive in G if and only if A is right-recursive in
GR.
Clearly, the nullability of vocabulary symbols is invariant with respect to grammar
reversal. Thus, the following statements are equivalent for X ∈ V: (1) X is nullable in G;
(2) X ⇒r* ε holds in G; (3) X ⇒r* ε holds in GR. This observation is easily generalized to
strings in V*.
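The invariance is easy to verify computationally with the standard fixed-point computation of nullable symbols, since reversing every right-hand side permutes, but does not change, the symbols it contains:

```python
# Sketch: the nullable set is unchanged by grammar reversal, because the
# fixed-point test "all symbols of the right-hand side are nullable" does not
# depend on the order of those symbols.

def nullable_set(productions):
    """productions: list of (lhs, rhs-tuple). Standard fixed-point iteration."""
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, rhs in productions:
            if lhs not in nullable and all(x in nullable for x in rhs):
                nullable.add(lhs)
                changed = True
    return nullable

def reverse_grammar(productions):
    return [(lhs, tuple(reversed(rhs))) for lhs, rhs in productions]

g = [("S", ("A", "b")), ("A", ("C", "D")), ("C", ()), ("D", ())]
assert nullable_set(g) == nullable_set(reverse_grammar(g)) == {"A", "C", "D"}
```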
Although Lemma 3.6 obviously applies to GR, it is restated below in terms of GR
because of its importance in showing how the ⇒r and chop relations cooperate.
Lemma 3.30 For α ∈ V*, at least one of the following two statements is true: (1)
α ⇒r* βa holds in GR for some β ∈ V* and a ∈ T; (2) α ⇒r* ε holds in GR.
Left Sentential Forms Revisited
The left sentential forms and sentences of G are defined in terms of the R-derives and
chop relations of GR. Similar to rightmost derivations, a leftmost derivation in G is ren-
Abstract of Dissertation Presented to the Graduate School
of the University of Florida in Partial Fulfillment of the
Requirements for the Degree of Doctor of Philosophy
GENERAL CONTEXTFREE RECOGNITION AND PARSING
BASED ON VIABLE PREFIXES
By
D. Clay Wilson
May 1990
Chairman: Dr. Manuel E. Bermudez
Major Department: Computer and Information Sciences
Viable prefixes play an important role in LR parsing theory. In the work presented
here, viable prefixes have a commensurately central role in a theory of general context-free
recognition and parsing.
A set-theoretic framework for describing general context-free recognition is presented.
The operators and operands in the framework are regularity-preserving relations and regular
sets of viable prefixes, respectively. A basic operation consists of computing the image of a
regular set of viable prefixes under one of the relations. By extension, general recognition is
characterized in terms of computing a sequence of regular sets.
For implementation purposes, finite-state automata are used to represent the regular
sets. A general bottom-up recognizer that constructs an appropriate sequence of automata is
described in detail. The regular languages accepted by these automata correspond to the
sets of viable prefixes computed by the recognizer's set-theoretic counterpart. The automata
are constructed under the guidance of a control automaton which accepts the viable prefixes
of the subject grammar. Ultimately, the automata-based recognizer is extended to a truly
general bottom-up parser.
Earley's algorithm is analyzed in the context of our viable prefix-based framework as it
provides a convenient vehicle for illustrating some of our ideas. We describe how Earley's
addressed. Specifically, means for properly handling graph cycles and for efficiently implementing
the relevant set operations and the succ function are discussed. A satisfactory resolution
of these issues facilitates the complexity analyses undertaken in the next section.
Graph Cycles
In any application which involves graphs that are not necessarily acyclic, graph cycles are a matter of concern. Neither LR(0) automata nor the recognition graphs constructed by General_LR0 are guaranteed to be acyclic.
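Whether a particular control automaton or recognition graph is in fact acyclic can be established by an ordinary depth-first search. The following Python sketch is purely illustrative; the adjacency-list encoding of a state-transition graph (a dict from state to a list of (symbol, successor) pairs) is an assumption, not the dissertation's representation.

```python
def has_cycle(transitions):
    """Return True iff the directed graph contains a cycle (iterative DFS)."""
    WHITE, GRAY, BLACK = 0, 1, 2          # unvisited / on DFS stack / finished
    color = {s: WHITE for s in transitions}
    for root in transitions:
        if color[root] != WHITE:
            continue
        stack = [(root, iter(transitions[root]))]
        color[root] = GRAY
        while stack:
            state, succs = stack[-1]
            advanced = False
            for _symbol, nxt in succs:
                if color.get(nxt, WHITE) == GRAY:
                    return True           # back edge found: a cycle exists
                if color.get(nxt, WHITE) == WHITE:
                    color[nxt] = GRAY
                    stack.append((nxt, iter(transitions.get(nxt, []))))
                    advanced = True
                    break
            if not advanced:
                color[state] = BLACK      # all successors explored
                stack.pop()
    return False
```

A graph with an edge back into a state still on the DFS stack is reported cyclic; acyclic control automata are reported as such.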
Let M_C(G) denote the LR(0) automaton of G and let G_R(M_C) denote the recognition graph constructed by General_LR0 when it is applied to G and w. Since all paths in G_R are reflected in M_C, albeit in reverse, G_R is cyclic only if M_C is also cyclic. However, the converse does not hold; M_C may have cycles that are not replicated in a recognition graph regardless of the input string.
Properties of context-free grammars that give rise to cycles of any kind in LR(0) automata are identified first. Since L(M_C) = VP(G), M_C is cyclic if and only if VP(G) contains strings of unbounded length. Thus, M_C is cyclic if and only if for some A ∈ N, α ∈ V* with α ≠ ε, and γ ∈ T*, A ⇒⁺ αAγ holds in G. That is, for all i > 0, δαⁱA ∈ VP(G) for some δ ∈ V*. Note that α may contain terminal symbols.
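Because the refinement developed next turns on which strings derive ε, it is worth recalling that the nullable nonterminals of a grammar are computable by a standard least-fixpoint iteration. The Python sketch below is illustrative; the encoding of productions as (left-hand side, right-hand-side tuple) pairs is an assumption.

```python
def nullable_set(productions):
    """Least fixpoint: A is nullable iff A -> X1...Xk with every Xi nullable."""
    nullable = {a for a, rhs in productions if rhs == ()}   # A -> epsilon
    changed = True
    while changed:
        changed = False
        for a, rhs in productions:
            if a not in nullable and all(x in nullable for x in rhs):
                nullable.add(a)
                changed = True
    return nullable
```

For example, with A → ε, B → AA, and C → b, the iteration yields {A, B}: B becomes nullable in the second pass because both occurrences of A already are.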
Grammatical properties which give rise to those cycles in M_C that can also be reproduced in G_R are considered next. Since the above conditions characterize all possible cycles in M_C, a restriction on those conditions is sought. Assume for the moment that G_R is cyclic. Given an arbitrary transition in G_R of the form (q_j:h′, X, q_k:h), we know that h ≤ h′. Thus, a particular cycle in G_R must consist solely of states in Q_i for some i, 0 ≤ i ≤ n. Moreover, every transition between any two states in Q_i is on some nullable nonterminal symbol. Consequently, the conditions given above are modified as follows. A control automaton M_C has a cycle which may be reproduced in G_R if and only if for some A ∈ N, α ∈ V* with α ≠ ε, and γ ∈ T*, A ⇒⁺ αAγ and α ⇒* ε hold in G. Of course, whether or not a cycle
algorithm implicitly tracks the sets of viable prefixes that arise in our model. Moreover, by modifying Earley's recognizer to construct a certain directed graph, the representation of these sets is made explicit.
Our set-theoretic framework yields elegant and succinct characterizations of general context-free recognition that appear to capture the essence of the task. On the practical front, a general bottom-up parser is described in sufficient detail to be readily implemented. Although its practical potential is not evaluated here, the parser is intended for use in problem areas that require more flexible parsers than are provided within the efficient but restricted LR framework. Regardless, our viable prefix-based treatment of recognition and parsing provides a particularly appropriate framework within which the continuum between LR parsers and our general parsers may be further investigated.
at a state in S_i, 0 ≤ i ≤ n, consistent with the top-down interpretation given to Earley's algorithm by Fact 5.3.
Lemma 5.5 Let p = (s₀, s₁, …, s_m), m ≥ 0, be a rooted path in G_E′ such that γ ∈ V* is the state derivative of p and s_m = [A → α•β, j] ∈ S_i for some A → αβ ∈ P and i, j, 0 ≤ j ≤ i ≤ n. Then γ ∈ VS(G, i:w) and s_m is valid for γ.
Proof. We show that γ = (βδ)^R for some δ ∈ V* such that S′ ⇒* a₁a₂⋯a_j Aδ holds in G. The proof is by induction on m.
Basis (m = 0). Thus, i = j = 0, s_m = s₀ = [S′ → •S$, 0] ∈ S₀, and γ = $S. By definition, $S ∈ VS(G, 0:w) and s₀ is clearly valid for $S.
Induction (m > 0). Two cases are analyzed, based on whether or not α = ε.
Case (i): α = ε. In this case, j = i and s_m = [A → •β, i] was added to S_i by the Predictor. Let s_{m−1} = [B → σ•Aτ, j′] ∈ S_i for some B → σAτ ∈ P and j′, 0 ≤ j′ ≤ i. The state derivatives of p′ = (s₀, s₁, …, s_{m−1}) and p are (Aτδ)^R and (βτδ)^R = γ, respectively, for some δ ∈ V*. By Fact 5.3, σ ⇒* a_{j′+1} a_{j′+2} ⋯ a_i holds in G. By the induction hypothesis, S′ ⇒* a₁a₂⋯a_{j′} Bδ holds in G. That is, (Aτδ)^R ∈ VS(G, i:w) and [B → σ•Aτ, j′] ∈ S_i is valid for (Aτδ)^R. Clearly, S′ ⇒* a₁a₂⋯a_i Aτδ also holds in G. Thus, (βτδ)^R = γ ∈ VS(G, i:w) and [A → •β, i] is valid for γ.
Case (ii): α ≠ ε. Thus, s_m was added to S_i by either the Scanner or the Completer, i.e., α = α′X for some α′ ∈ V* and X ∈ V. Let s_{m−1} = [A → α′•Xβ, j] ∈ S_{i′} for some i′, j ≤ i′ ≤ i, and let p′ = (s₀, s₁, …, s_{m−1}). The state derivatives of p′ and p are (Xβδ)^R and (βδ)^R = γ, respectively, for some δ ∈ V*. By Fact 5.3, α′ ⇒* a_{j+1} a_{j+2} ⋯ a_{i′} holds in G. By the induction hypothesis, S′ ⇒* a₁a₂⋯a_j Aδ holds in G, so (Xβδ)^R ∈ VS(G, i′:w) and [A → α′•Xβ, j] ∈ S_{i′} is valid for (Xβδ)^R. If X ∈ T, then X = a_i and i = i′+1. If X ∈ N, then X ⇒* a_{i′+1} a_{i′+2} ⋯ a_i holds in G. In either case, α ⇒* a_{j+1} a_{j+2} ⋯ a_i holds in G. Therefore, (βδ)^R = γ ∈ VS(G, i:w) and [A → α•β, j] ∈ S_i is valid for γ.
one invocation of Traverse, at most m states and m² transitions are added to the recognition graph. Thus, over n+1 calls to Reduce, O(n) time is spent within the Traverse function.
In assessing the contribution of the Reduce function to the time complexity of General_LR0, we first assume that G is unambiguous. For some i, 1 ≤ i ≤ n, the i-th invocation of Reduce is analyzed.³ By an inspection of the while loop, the time spent within Reduce is based on the number of items that are cycled through Q_subset and Succ_Stack. From the analysis of the space complexity of General_LR0, there are at most m²i transitions from states in Q_i to states in ∪_{0≤j<i} Q_j at the completion of the i-th call to Reduce. These are precisely the transitions that are cycled through Q_subset. Although at least one of these transitions must have been generated in the most recent invocation of Shift, for simplicity we assume that all O(i) of them are created by Reduce. Under this assumption, each such transition results from traversing some path in G_R that spells the reversal of some prefix of a production right-hand side. This traversal is effected through the use of the Succ_Stack. Let p = max({len(ω) | A → ω ∈ P}). Thus, at most m²ip entries are cycled through Succ_Stack while all of the reductions relevant to the i-th call to Reduce are performed. Together, at most m²i(p+1) items are cycled through Q_subset and Succ_Stack. Since Σ_{i=1}^{n} m²i(p+1) ∈ O(n²), the total time spent in Reduce over n+1 calls is O(n²). Accumulating the total time consumed by Shift, Traverse, and Reduce, we conclude that General_LR0 runs in O(n²) time in the worst case if G is unambiguous.
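The summation invoked above is elementary: for fixed m and p, Σ_{i=1}^{n} m²i(p+1) = m²(p+1)·n(n+1)/2, which is in O(n²). A quick computational check of this closed form (the function names are hypothetical, introduced only for this illustration):

```python
def reduce_work_bound(n, m, p):
    """Direct summation of the per-call bound m^2 * i * (p+1) over calls 1..n."""
    return sum(m * m * i * (p + 1) for i in range(1, n + 1))

def closed_form(n, m, p):
    """Closed form m^2 (p+1) n(n+1)/2 of the same sum."""
    return m * m * (p + 1) * n * (n + 1) // 2
```

Since m and p depend only on the grammar, both expressions grow quadratically in n alone.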
Now assume that G is arbitrary. Again, we want to determine the total number of items cycled through Q_subset and Succ_Stack during the i-th call to Reduce for some i, 1 ≤ i ≤ n. The number of transitions cycled through Q_subset is still bounded by m²i. A bound on the number of entries cycled through Succ_Stack is given by the number of distinct paths that may be traversed when making all possible reductions back through those transitions. Consider one of the O(i) transitions in δ, say (p, X, q). Suppose that
³ All of the work is done by Traverse when i = 0 since δ = ∅ when Reduce is called in that instance.
Proof. This lemma appears to be rather more difficult than Lemma 5.2 to prove rigorously. In lieu of a formal proof, an intuitive argument is given. First the following observations are made.
(1) Every state which is valid for γ is in S_i. Otherwise, a contradiction of Fact 5.2 would result.
(2) If γ ≠ ε, then there is some state s ∈ S_i such that s is valid for γ and s properly cuts γ. In particular, Earley states that are added by the Scanner or Completer properly cut the viable prefixes that they are valid for.
(3) If γ ≠ ε, then for each state s ∈ S_i which is valid for γ there is a state r ∈ S_i such that (i) r is also valid for γ, (ii) r properly cuts γ, and (iii) there exists a path in G_E′ from r to s which spells ε.
Given these observations, an informal inductive argument proceeds as follows where the induction is on len(γ).
Basis (len(γ) = 0). For each state s ∈ S_i which is valid for ε ∈ VP(G, i:w), there exists a rooted path in G_E′ to s which spells ε.
Induction (len(γ) > 0). Let γ = γ′X for some γ′ ∈ V* and X ∈ V. By points (2) and (3) above, we may assume that [A → α•β, j] ∈ S_i properly cuts γ, i.e., α = α′X for some α′ ∈ V*. Let s = [A → α′X•β, j] ∈ S_i. For every i′, j ≤ i′ ≤ i, such that α′ ⇒* a_{j+1}⋯a_{i′} and X ⇒* a_{i′+1} a_{i′+2} ⋯ a_i hold in G, [A → α′•Xβ, j] is in S_{i′}. Pick one such i′ (there must be at least one) and let r = [A → α′•Xβ, j] ∈ S_{i′}. By the induction hypothesis, γ′ ∈ VP(G, i′:w), r is valid for γ′, and there exists a rooted path to r in G_E′ which spells γ′. When s is added to S_i by either the Scanner or Completer, the transition (r, X, s) is installed in G_E′. Therefore, there exists a rooted path in G_E′ to s which spells γ.
Theorem 5.4 For 0 ≤ i ≤ n, let M_E′,i = (G_E′,i, s₀, S_i) denote an NFA. Then L(M_E′,i) = VP(G, i:w).
Proof. This theorem follows from Lemmas 5.2 and 5.3.
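The Predictor, Scanner, and Completer operations referred to in the lemmas above are those of the standard algorithm, which can be rendered compactly as follows. This Python sketch is illustrative only: it is the textbook recognizer, not the graph-building Earley′ variant analyzed in this chapter, and the tuple encoding of items as (lhs, rhs, dot, origin) is an assumption.

```python
def earley_recognize(grammar, start, tokens):
    """Standard Earley recognizer; grammar maps nonterminals to alternatives."""
    n = len(tokens)
    S = [set() for _ in range(n + 1)]
    S[0] = {(start, rhs, 0, 0) for rhs in grammar[start]}
    for i in range(n + 1):
        changed = True
        while changed:                  # run Predictor/Completer to a fixpoint
            changed = False
            for (lhs, rhs, dot, origin) in list(S[i]):
                if dot < len(rhs) and rhs[dot] in grammar:        # Predictor
                    for alt in grammar[rhs[dot]]:
                        if (rhs[dot], alt, 0, i) not in S[i]:
                            S[i].add((rhs[dot], alt, 0, i)); changed = True
                elif dot == len(rhs):                             # Completer
                    for (l2, r2, d2, o2) in list(S[origin]):
                        if d2 < len(r2) and r2[d2] == lhs:
                            if (l2, r2, d2 + 1, o2) not in S[i]:
                                S[i].add((l2, r2, d2 + 1, o2)); changed = True
        if i < n:                                                 # Scanner
            S[i + 1] = {(l, r, d + 1, o) for (l, r, d, o) in S[i]
                        if d < len(r) and r[d] == tokens[i]}
    return any(l == start and d == len(r) and o == 0
               for (l, r, d, o) in S[n])
```

Running the Predictor and Completer to a fixpoint within each state set is what lets ε-productions complete within the set in which they are predicted.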
REFERENCES
[1] Aho, A. V., Hopcroft, J. E., and Ullman, J. D. The Design and Analysis of Computer Algorithms. Addison-Wesley, Reading, Mass., 1974.
[2] Aho, A. V., Johnson, S. C., and Ullman, J. D. Deterministic parsing of ambiguous grammars. Commun. ACM 18(8), pp. 441-52, Aug. 1975.
[3] Aho, A. V. and Peterson, T. G. A minimum distance error-correcting parser for context-free languages. SIAM J. Comput. 1(4), pp. 305-12, Dec. 1972.
[4] Aho, A. V., Sethi, R., and Ullman, J. D. Compilers: Principles, Techniques, and Tools. Addison-Wesley, Reading, Mass., 1986.
[5] Aho, A. V. and Ullman, J. D. Optimization of LR(k) parsers. J. Comput. Syst. Sci. 6(6), pp. 573-602, Dec. 1972.
[6] Aho, A. V. and Ullman, J. D. The Theory of Parsing, Translation, and Compiling. Volume I: Parsing. Prentice-Hall, Englewood Cliffs, N. J., 1972.
[7] Aho, A. V. and Ullman, J. D. The Theory of Parsing, Translation, and Compiling. Volume II: Compiling. Prentice-Hall, Englewood Cliffs, N. J., 1973.
[8] Bermudez, M. E. and Logothetis, G. Simple computation of LALR(1) lookahead sets. Inf. Process. Lett. 31(5), pp. 233-8, 12 June 1989.
[9] Bouckaert, M., Pirotte, A., and Snelling, M. Efficient parsing algorithms for general context-free parsers. Inf. Sci. 8, pp. 1-26, Jan. 1975.
[10] Christopher, T. W., Hatcher, P. J., and Kukuk, R. C. Using dynamic programming to generate optimized code in a Graham-Glanville style code generator. ACM SIGPLAN Notices 19(6), pp. 25-36, June 1984.
[11] DeRemer, F. and Pennello, T. Efficient computation of LALR(1) lookahead sets. ACM Trans. Program. Lang. Syst. 4(4), pp. 615-49, Oct. 1982.
[12] DeRemer, F. L. Simple LR(k) grammars. Commun. ACM 14(7), pp. 453-60, July 1971.
[13] Earley, J. An efficient context-free parsing algorithm. Ph.D. Thesis, Comput. Sci. Dept., Carnegie-Mellon U., Pittsburgh, Pa., 1968.
[14] Earley, J. An efficient context-free parsing algorithm. Commun. ACM 13(2), pp. 94-102, Feb. 1970.
Given an arbitrary grammar G and an arbitrary string x over the terminal alphabet of G, VP(G, x) is a regular language. This fact can be established analytically. Alternatively, the graphical variant of Earley's algorithm mentioned above provides a constructive proof of this result.
In light of these observations, the primary thrust of this work is on the formal development of an approach to general context-free recognition and parsing that is based on explicitly computing VP(G, x). In particular, the viable prefix is the central concept upon which useful general recognizers and parsers are founded. The development is rigorous, yet we strive for clarity and elegance by resorting to basic principles wherever possible. In short, our approach to general recognition and parsing generalizes the role played by viable prefixes in LR parsers in order to accommodate arbitrary grammars.
This work consists of three logical divisions. In the first (Chapters III and IV), the mathematical foundation for our viable prefix-based approach to recognition and parsing is developed. The basic tools are a handful of binary relations on strings. General recognition is described using these relations and simple set-theoretic concepts. A key property of the relations is that they preserve regularity. Consequently, general top-down and bottom-up recognition schemes are defined in terms of computing the images of regular sets of viable prefixes under these relations. In short, general recognition is reduced to computing a sequence of regular sets.
In the second major division (Chapter V), Earley's algorithm is used as a vehicle for demonstrating the efficacy of our set-theoretic approach to general recognition. In particular, the graph-based variant of Earley's algorithm is presented there. This modified algorithm illustrates one way in which VP(G, i:w) may be computed for each prefix i:w of the input string. In the process of analyzing our Earley derivative, some subtle properties of Earley's original algorithm are also revealed and its relationship with LR parsers is clarified.
The General_LR0′ Parser
The General_LR0′ parser is described next. For reference, the parser is rendered in pseudocode in Figure 7.1 (spanning three pages). The discussion focuses on the modifications made to the recognizer in deriving the parser. For the most part, the changes are rather minor. However, the Traverse function underwent substantial revision in order to correctly handle arbitrary derivations of the empty string.
1.  function General_LR0′(G = (V, T, P, S); w ∈ T*)
2.  // w = a₁a₂⋯a_{n+1}, n ≥ 0, a_i ∈ T\{$}, 1 ≤ i ≤ n, a_{n+1} = $.
3.  // Let M_C(G) = (I, V, goto, I₀, I) be the LR(0) automaton for G.
4.  // G_R(M_C) = (Q, V, δ) is an STG, the recognition graph.
5.      Q, δ := {q₀:0}, ∅  // Initialize G_R.
6.      // Let M_R = (G_R, q₀:0, Q₀). Then L(M_R) = PVP(G, ε) = {ε}.
7.      for i := 0 to n do
8.          // Let M_R = (G_R, q₀:0, Q_i). Then L(M_R) = PVP(G, i:w).
9.          Reduce(i)
10.         // Let M_R = (G_R, q₀:0, Q_i). Then L(M_R) = VP(G, i:w).
11.         Shift(i)
12.         // Let M_R = (G_R, q₀:0, Q_{i+1}). Then L(M_R) = PVP(G, i+1:w).
13.         if Q_{i+1} = ∅ then Reject(w) fi
14.     od
15.     // Let M_R = (G_R, q₀:0, Q_{n+1}). Then L(M_R) = PVP(G, w) = {S$}.
16.     Accept(w)
17. end
18. function Shift(i)
19.     Q_subset := {q ∈ Q_i | goto(ψ(q), a_{i+1}) is defined}
20.     while Q_subset ≠ ∅ do
21.         q := Remove(Q_subset)  // Let goto(ψ(q), a_{i+1}) = I_j.
22.         if q_j:i+1 ∉ Q then
23.             Q := Q ∪ {q_j:i+1}
24.         fi
25.         δ := δ ∪ {(q_j:i+1, a_{i+1}, q, [a_{i+1}])}  // Never redundant.
26.     od
27. end
Figure 7.1 The General_LR0′ Parser
(Line 1) The main function of the parser is named General_LR0′. In all other respects, this function is identical to the main function of the recognizer.
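The Shift function of Figure 7.1 admits a near-direct transliteration. The Python sketch below is illustrative only; the dictionary-based encoding of the recognition graph, the name `psi` for the map from graph states to item sets, and the `(item_set, position)` state encoding are all assumptions rather than the dissertation's own representation, and parse annotations are omitted.

```python
def shift(i, Q, delta, goto, psi, a_next):
    """Extend the recognition graph on the next input symbol a_{i+1}."""
    candidates = [q for q in Q[i] if (psi[q], a_next) in goto]
    for q in candidates:
        j = goto[(psi[q], a_next)]      # target LR(0) item set I_j
        target = (j, i + 1)             # the graph state q_j:i+1
        if target not in Q[i + 1]:      # lines 22-24 of the figure
            Q[i + 1].add(target)
            psi[target] = j
        delta.add((target, a_next, q))  # edge directed back toward q_0:0
```

The new transition runs from the new state back to q, mirroring the fact that paths in the recognition graph spell reversed viable prefixes.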
Discussion
A graphical variant of Earley's algorithm was examined within the framework established in the previous two chapters. In the process, some properties of Earley's algorithm were identified and the efficacy of the General_LR and General_LL approaches to general recognition was established. Earley's algorithm is an excellent vehicle for demonstrating the effectiveness of General_LR and General_LL given that it is so well-known and highly regarded.
The analyses contained in the previous two sections illustrated how the sets of viable prefixes (resp. viable suffixes) tracked by General_LR (resp. General_LL) are explicitly represented in the state-transition graph that is constructed by Earley′. As Earley′ is a direct descendant of Earley's algorithm, it is fair to conclude that these same sets are represented implicitly in the Earley state sets that are constructed by Earley's original algorithm. By viewing Earley's algorithm from this novel perspective, its operation and correctness have been explained at a level of abstraction that is closer to that necessary for capturing the essence of general canonical recognition.
The structure of G_E′ exhibits how Earley′ subsumes both the General_LR and General_LL recognition schemes. Clearly, Earley′ embodies General_LR considerably more directly than General_LL. In light of this, it is perhaps more apt to view Earley's algorithm as a general bottom-up recognizer.
Practical aspects of the General_LR recognition scheme are examined further in the next chapter and Chapter VII extends it into a general parser. Thus, this chapter is transitional in that it bridges the abstract treatment of general recognition presented in Chapters III and IV with the concrete treatment of General_LR contained in Chapters VI and VII. Attempts at deriving a general parser from General_LL were unsuccessful. Thus, an investigation of the practical potential of General_LL is left for future work.
esucc(p) = {q ∈ Q | (p, ε, q) ∈ δ} and epred(p) = {q ∈ Q | (q, ε, p) ∈ δ}. All four of these functions extend to subsets of Q in the usual fashion.
The following facts apply to the NLR(0) automaton M_NC(G).
(1) L(M_NC(G)) = VP(G).
(2) Each I_j ∈ I\{I₀} has a unique entry symbol X ∈ V∪{ε}, again denoted by entry(I_j).
(3) For {A → •β} ∈ I such that A ≠ S′, entry({A → •β}) = ε and goto(I_j, A) is defined for each I_j ∈ epred({A → •β}).
An Alternate Recognizer
The General_LR0 recognizer is modified to employ the NLR(0) automaton of G as a control automaton in place of the LR(0) automaton. The resulting algorithm, called General_NLR0, is displayed in Figure 6.2. Only a small number of minor changes were required to derive General_NLR0 from General_LR0. The differences between the two recognizers are discussed next.
The lines in Figure 6.2 were numbered so as to emphasize the correlation between the General_LR0 and General_NLR0 recognizers. Consequently, the line numbers cited below reference code in both Figures 6.1 and 6.2.
(3-4) It is explicitly recorded that the NLR(0) automaton of G, M_NC(G), is used as the control automaton in General_NLR0. Thus, the recognition graph constructed by General_NLR0, G_R(M_NC), is derived from M_NC and the input string w.
(23) A state I_j of M_NC has more than one incoming transition only if entry(I_j) = ε. Therefore, q_j:i+1 is unconditionally added to Q at this point, i.e., lines 22 and 24 are not needed in Figure 6.2.
(33) Each set of items in M_NC is a singleton, so at most one reduction can apply to ψ(p). Thus, an if construct is more appropriate here in place of the for loop of Figure 6.1.
CHAPTER VI
A GENERAL BOTTOM-UP RECOGNIZER
In this chapter, a general bottom-up recognizer that is directly based on the General_LR recognition scheme is presented. In particular, the algorithm constructs a graph in such a way that the regular sets of viable prefixes manipulated by General_LR are represented in this graph. Aside from complications that can arise due to nullable nonterminals, the recognizer is extended into a general parser rather seamlessly (parsing is the subject of the next chapter). Thus, in light of the algorithm's practical potential, several implementation issues are discussed. Throughout this chapter, an arbitrary reduced $-augmented grammar G = (V, T, P, S) and an arbitrary string w = a₁a₂⋯a_{n+1}, n ≥ 0, a_i ∈ T\{$} for 1 ≤ i ≤ n, a_{n+1} = $, are assumed.
Control Automata and Recognition Graphs
The recognizer described in this chapter constructs a state-transition graph which we call the recognition graph. The correctness of the algorithm is based on properties of this graph. The recognition graph is constructed under the guidance of an FSA called the control automaton. The control automaton is determined from the subject grammar G and is fixed throughout the recognition process. In contrast, the recognition graph evolves during recognition; its structure is derived from the control automaton and the input string w.
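The control automaton used below is built from the canonical LR(0) item-set construction, which rests on the usual closure and goto operations. A standard Python sketch is given for reference; the tuple encoding of items as (lhs, rhs, dot) and the use of a hypothetical start symbol Z in place of S′ are assumptions of this illustration.

```python
def closure(items, grammar):
    """Add [B -> .w] for every nonterminal B appearing just after a dot."""
    items = set(items)
    work = list(items)
    while work:
        lhs, rhs, dot = work.pop()
        if dot < len(rhs) and rhs[dot] in grammar:
            for alt in grammar[rhs[dot]]:
                item = (rhs[dot], alt, 0)
                if item not in items:
                    items.add(item)
                    work.append(item)
    return frozenset(items)

def goto(items, X, grammar):
    """Move the dot over X in every applicable item, then close."""
    moved = {(l, r, d + 1) for (l, r, d) in items if d < len(r) and r[d] == X}
    return closure(moved, grammar) if moved else None
```

Iterating goto from the initial set closure({Z → •S$}) over all symbols yields the full canonical collection, and hence the states of the control automaton.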
For simplicity, the LR(0) automaton of G is used as the control automaton for guiding the recognition of w with respect to G; alternative control automata are suggested later. The LR(0) automaton of G is a DFA which is based on the canonical collection of sets of LR(0) items of G and the associated goto function [4,11]. Recall that each set is comprised of kernel and closure items. The item S′ → •S$ is a kernel item as are all items of the form
that follows is simplified by assuming that ambiguities are resolved as soon as they are
detected. Of course, the ease with which ambiguities can actually be resolved is dictated by
semantic properties of the language generated by G.
Parse Annotations
The parse forest built by General_LR0′ is maintained through information that is attached to the transitions of the recognition graph. These attachments have already been referred to as parse annotations. The notation that is used for denoting parse annotations is introduced next. For simplicity, only one parse annotation is ever attached to a given transition.
The Greek letter π, possibly with a subscript, is used regularly to denote parse annotations. All parse annotations are enclosed within square brackets. Thus, [π] is a simple example of the notation used to denote a parse annotation.
The parse annotation for a transition on a ∈ T that is generated by Shift is denoted by [a]. Conceptually, this annotation is some descriptor for the terminal symbol a. A transition on A ∈ N that is generated by Traverse as the result of a reduction by A → ε is annotated with [ε], i.e., a suitable descriptor for the empty string. The notion of an empty parse annotation, denoted by [], is also useful; note that this annotation is distinct from [ε].
The parse annotation of every other nonterminal transition, whether generated by Reduce or Traverse, consists of a list of pointers to other parse annotations. For this purpose, we let &π denote a pointer or reference to the parse annotation [π] (or equivalently, a pointer to the transition to which [π] is attached). Consider a transition on A ∈ N that is generated as the result of a reduction by production A → X₁X₂⋯X_m ∈ P, m ≥ 1. Suppose that for 1 ≤ k ≤ m, [π_k] is the parse annotation attached to the transition on X_k along the reduction path. Then the parse annotation that is attached to this transition on A is [&π₁, &π₂, …, &π_m], i.e., an ordered list of pointers to the annotations associated with the transitions in the path in G_R that spells (X₁X₂⋯X_m)^R.
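The annotation scheme just described may be modeled directly, with the pointer &π realized as an object reference. The class names in this Python sketch are hypothetical:

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Terminal:
    """[a]: a descriptor for a shifted terminal symbol."""
    symbol: str

@dataclass
class Empty:
    """[eps]: the annotation for a reduction by A -> epsilon."""
    pass

@dataclass
class Reduction:
    """[&pi1, ..., &pim]: references to the child annotations, in order."""
    children: List["Annotation"] = field(default_factory=list)

Annotation = Union[Terminal, Empty, Reduction]

# For a reduction by A -> X1 X2 X3 whose middle symbol derives epsilon,
# the transition on A would carry an ordered list such as:
pi = Reduction([Terminal("x1"), Empty(), Terminal("x3")])
```

Because a `Reduction` stores references rather than copies, shared subtrees of the parse forest are represented once, just as a shared transition in G_R carries a single annotation.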
Lemma 4.13 Let w ∉ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR rejects w.
Proof. There are two cases to consider according to whether or not w is in PREFIX(G).
Case (i): w ∈ PREFIX(G). In this case, PVP_LR(G, i:w) and VP_LR(G, i:w) are nonempty for all i, 0 ≤ i ≤ len(w), so the for loop of General_LR completes len(w) iterations. Since w ∉ L(G) by assumption, ω ∉ VP_LR(G, w) for every S → ω ∈ P. Therefore, w is rejected by General_LR in the if statement that follows the for loop.
Case (ii): w ∉ PREFIX(G). Let x ∈ T* be the unique string which is the longest prefix of w such that x ∈ PREFIX(G) holds. Let len(x) = m and note that 0 ≤ m < len(w). For all i, 0 ≤ i ≤ m, PVP_LR(G, i:w) and VP_LR(G, i:w) are nonempty, so the for loop of General_LR completes m iterations. During the (m+1)st iteration, PVP_LR(G, (m+1):w) = ∅ is computed. Therefore, w is rejected by General_LR in the if statement enclosed within the for loop.
Regularity Properties
The regularity properties inherent to all context-free grammars that are exploited by General_LR are identified in this section. Specifically, for an arbitrary string x ∈ T*, PVP_LR(G, x) and VP_LR(G, x) are regular languages.
Lemma 4.14 Relation ⊨ is regularity-preserving.
Proof. Let G = (V, T, P, S) be an arbitrary grammar and let L be an arbitrary regular subset of VP(G). Define the regular canonical system C = (V, Π) such that Π = {(ξω, ξA) | A → ω ∈ P}. Since ⇒_C is defined on V* and ⊨ is defined on VP(G) ⊆ V*, ⊨ is a subrelation of ⇒_C. By Fact 3.1, L′ = τ(L, C, {ε}) is a regular language. Since regular languages are closed under intersection, L′ ∩ VP(G) is regular. Clearly, ⊨(L) ⊆ L′ ∩ VP(G) holds, since ⊨ is a subrelation of ⇒_C that is restricted to VP(G). The converse inclusion, viz., L′ ∩ VP(G) ⊆ ⊨(L), is obtained by applying the corollary to Lemma 4.6. Specifically, for α ∈ L and β ∈ L′ ∩ VP(G), if α ⇒_C β holds in C, then α ⊨ β holds in G. Thus, ⊨(L) = L′ ∩ VP(G), so ⊨ is regularity-preserving.
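The relation in the lemma rewrites a suffix ω of a viable prefix to A, exactly as in the canonical system Π. Although the lemma concerns regular (generally infinite) sets, the operation itself is easy to exhibit on a finite sample. In this Python sketch, symbols are single characters, ε-productions are omitted, and the intersection with VP(G) is elided; all of these simplifications are assumptions made for the illustration.

```python
def reduce_image(language, productions):
    """Image of `language` under one application of the suffix-rewriting
    reduce relation: a string ending in w rewrites that suffix to A."""
    out = set()
    for s in language:
        for lhs, rhs in productions:
            if rhs and s.endswith(rhs):
                out.add(s[:len(s) - len(rhs)] + lhs)
    return out
```

With the single production A → ab, the strings xab and ab reduce to xA and A respectively, while a string not ending in ab has no image.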
The L_sorted_list that is used by the parser's version of Traverse contains at most m² entries at any time. Thus, its use does not affect the space complexity of parsing.
In the other extreme, the resolution of all ambiguities discovered by Reduce is delayed until after the input string is accepted. Under this scenario, one parse annotation is attached to a nonterminal transition for each path in G_R that reduces to that transition. In this case, the space complexity of the parser is the same as the time complexity of the recognizer, i.e., O(n^{p+1}).
Next, the time complexity of the parser is considered. The most substantial differences between the parser and the recognizer lie with the manufacture of parse annotations and the Traverse function. The amount of work done within each invocation of Traverse is bounded by constant factors that are related to the size of M_C(G). Since Traverse is called at most m times within any invocation of Reduce, the more complicated Traverse function used by General_LR0′ does not increase the time complexity of parsing with respect to recognition. Moreover, the operations related to constructing parse annotations can clearly be done in a constant amount of time. Therefore, the worst-case time complexity of the parser is O(n^{p+1}) if G is arbitrary and O(n²) if G is unambiguous. In addition, LR(k) grammars can be parsed in linear time provided that k-symbol lookahead is used.
Since the Disambiguate function has not been specified, its impact on the time complexity of parsing cannot be assessed. In that respect, the above analyses implicitly assume that the Disambiguate function runs in constant time. However, if more costly mechanisms are required for resolving ambiguity, the time consumed by them must be accounted for.
Garbage Collection Revisited
Lookahead can be employed within General_LR0′ exactly as in General_LR0. However, the garbage collection procedure proposed for General_LR0 is too simplistic for the parser. The underlying reason for this lies with the manner in which the parse forest is superimposed on the recognition graph.
Let x ∈ T* be an arbitrary string. The primitive LR-associates of x (in G) are defined by PVP_LR(G, x) = {α ∈ VP(G) | ε ⊨* ⊣_{a₁} ⊨* ⊣_{a₂} ⋯ ⊨* ⊣_{a_k} α holds in G, where x = a₁a₂⋯a_k}. Clearly, PVP_LR(G, ε) = {ε}. The LR-associates of x (in G) are defined by VP_LR(G, x) = {α ∈ VP(G) | ε ⊨* ⊣_{a₁} ⋯ ⊨* ⊣_{a_k} ⊨* α holds in G}. By Lemma 4.2, this set is equivalent to {α ∈ VP(G) | α ⇒* x holds in G}.
An input string w ∈ T* is recognized by General_LR through the computation of PVP_LR(G, i:w) and VP_LR(G, i:w) as i ranges from 0 to len(w). The process terminates when either an empty set is produced or the input string is exhausted. Analogous to the top-down recognition schemes, the relationships between VP_LR(G, x) and PVP_LR(G, x), and between PVP_LR(G, xa) and VP_LR(G, x) are significant. Specifically, for x ∈ T* and a ∈ T, VP_LR(G, x) = {β ∈ VP(G) | α ⊨* β holds in G for some α ∈ PVP_LR(G, x)} = ⊨*(PVP_LR(G, x)) and PVP_LR(G, xa) = {β ∈ VP(G) | α ⊣_a β holds in G for some α ∈ VP_LR(G, x)} = ⊣_a(VP_LR(G, x)).
The conditions for termination are analogous to those for General_RR and General_LL. Given an input string w ∈ T*, first suppose that w ∈ L(G). In this case, VP_LR(G, w) is the last set of LR-associates computed by General_LR; after it is completed, w is accepted based on the fact that ω ∈ VP_LR(G, w) for some S → ω ∈ P if and only if w ∈ L(G). Alternatively, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVP_LR(G, x) is the first empty set computed by the recognizer. On the other hand, suppose that w ∉ L(G) and w ∈ PREFIX(G) both hold. In this case, it is discovered that ω ∉ VP_LR(G, w) for every S → ω ∈ P. In either case, the input string is rejected by General_LR.
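The alternation of reduce-closure and shift steps described above can be exhibited concretely when VP(G) is finite, as it is for any grammar without recursion. The following toy Python sketch enumerates VP(G) outright for the $-augmented grammar Z → S$, S → ab (Z standing in for S′); the single-character symbol encoding and all names are hypothetical, and the sets are manipulated directly rather than via automata.

```python
PRODUCTIONS = [('S', 'ab'), ('Z', 'S$')]   # Z plays the role of S'
VP = {'', 'a', 'ab', 'S', 'S$'}            # the (finite) viable prefixes of G

def reduce_closure(prefixes):
    """VP_LR: close a set of viable prefixes under the reduce relation."""
    result = set(prefixes)
    work = list(prefixes)
    while work:
        s = work.pop()
        for lhs, rhs in PRODUCTIONS:
            t = s[:len(s) - len(rhs)] + lhs
            if s.endswith(rhs) and t in VP and t not in result:
                result.add(t)
                work.append(t)
    return result

def shift(prefixes, a):
    """PVP_LR: extend each viable prefix by the next input symbol."""
    return {s + a for s in prefixes if s + a in VP}

def recognize(w):
    """Alternate reduce-closure and shift over w (w includes the marker $)."""
    pvp = {''}
    for a in w:
        vp = reduce_closure(pvp)
        pvp = shift(vp, a)
        if not pvp:
            return False
    return 'S$' in reduce_closure(pvp)     # acceptance: S$ is an LR-associate
```

On input ab$ the successive VP_LR sets are {ε}, {a}, {ab, S}, and finally {S$}, at which point the acceptance condition of General_LR is met.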
The correctness of General_LR is recorded more formally in the next two lemmas.
Lemma 4.12 Let w ∈ L(G) be arbitrary. If General_LR is applied to G and w, then General_LR accepts w.
Proof. From earlier results, PVP_LR(G, i:w) and VP_LR(G, i:w), 0 ≤ i ≤ len(w), are all nonempty. Thus, the for loop of General_LR completes len(w) iterations. Since w ∈ L(G) by assumption, ω ∈ VP_LR(G, w) for some S → ω ∈ P. Therefore, w is accepted by General_LR in the second if statement.
dered as an alternation of strong rightmost derivations in G_R and rightmost chops of terminal symbols.
Lemma 3.31 For α, β ∈ V* and x ∈ T*, if α (⇒_R* ⊣)* ⇒_R* β holds in G_R, where x is the string of chopped terminal symbols, then α^R ⇒* (βx)^R = x^R β^R holds in G.
Proof. By assumption, α (⇒_R* ⊣)* ⇒_R* β holds in G_R. It follows from Lemma 3.9 that α ⇒_R* βx also holds in G_R. This implies that α^R ⇒* (βx)^R = x^R β^R holds in G by Fact 3.2.
Lemma 3.32 For α, β ∈ V*, let α ⇒* β hold in G. Write β as xγ for some x ∈ T* and γ ∈ V* such that γ ∈ NV* if β ∈ T*NV* and γ = ε otherwise (i.e., x is the longest prefix of β that is made up of only terminal symbols). Then α^R (⇒_R* ⊣)* ⇒_R* γ^R holds in G_R.
Proof. Assume that the conditions in the hypothesis of the lemma hold. From the assumption that α ⇒* β holds in G and Fact 3.2, α^R ⇒_R* β^R holds in G_R. Since β = xγ, β^R = (xγ)^R = γ^R x^R. Thus, x^R is the longest suffix of β^R that is made up of terminal symbols alone. We conclude from Lemma 3.12 that α^R (⇒_R* ⊣)* ⇒_R* γ^R holds in G_R.
Theorem 3.33 SF_L(G) = {γ ∈ V* | S (⇒_R* ⊣)* ⇒_R* α holds in G_R for some α ∈ V* and x ∈ T* such that γ = (αx)^R}.
Proof. First suppose that S (⇒_R* ⊣)* ⇒_R* α holds in G_R for some α ∈ V* and x ∈ T*. By Lemma 3.31, this implies that S ⇒* (αx)^R = x^R α^R holds in G, so (αx)^R is a left sentential form of G. Conversely, assume that S ⇒* γ holds in G for some γ ∈ V*. Let γ = x^R α^R = (αx)^R for x ∈ T* and α ∈ V* such that x^R is the longest prefix of γ contained in T*. This implies, by Lemma 3.32, that S (⇒_R* ⊣)* ⇒_R* α holds in G_R.
Corollary L(G) = {w ∈ T* | S (⇒_R* ⊣)* ⇒_R* ε holds in G_R}.
Corollary PREFIX(G) = {x ∈ T* | S (⇒_R* ⊣)* ⇒_R* α holds in G_R for some α ∈ V*}.
Viable Suffixes
A topdown complement to the class of LR(&) grammars is the class of LL(Ar) grammars
[28,36], A theory of LL(Ar) parsing that is a dual to the theory of LR(fc) parsing is developed
by Sippu and SoisalonSoininen [38]. In particular, the concept of a viable suffix is introduced
function General_LL(G^R = (V, T, P^R, S); w ∈ T*)
// w = a₁a₂⋯a_n, n ≥ 0, each a_i ∈ T
    PVS_LL(G, ε) := {ω | S → ω ∈ P^R}
    for i := 0 to n−1 do
        VS_LL(G, i:w) := ⇒_R*(PVS_LL(G, i:w))
        PVS_LL(G, i+1:w) := ⊣_{a_{i+1}}(VS_LL(G, i:w))
        if PVS_LL(G, i+1:w) = ∅ then Reject(w) fi
    od
    VS_LL(G, w) := ⇒_R*(PVS_LL(G, w))
    if ε ∈ VS_LL(G, w) then Accept(w) else Reject(w) fi
end
Figure 3.2 A General Top-Down Correct-Prefix Recognizer
defined by VS_LL(G, x) = {β ∈ V* | ω (⇒_R* ⊣)* ⇒_R* β holds in G_R for some S → ω ∈ P^R}. By Theorems 3.33 and 3.39, VS_LL(G, x) = {β ∈ VS(G) | S ⇒* xβ^R holds in G}, which is precisely the set described in the previous paragraph. Input string w is recognized by computing PVS_LL(G, i:w) and VS_LL(G, i:w) as i ranges from 0 to len(w).
The set VS_LL(G, x) is equivalently expressed as {β ∈ V* | α ⇒_R* β holds in G_R for some α ∈ PVS_LL(G, x)}; this form explicitly reflects that VS_LL(G, x) is the reflexive-transitive closure of PVS_LL(G, x) under the ⇒_R relation. Thus, VS_LL(G, x) is computed by applying ⇒_R* to PVS_LL(G, x).
Given VS_LL(G, x) and a ∈ T, PVS_LL(G, xa) is determined from VS_LL(G, x) through an application of the ⊣_a relation since PVS_LL(G, xa) = {β ∈ V* | α ⊣_a β holds in G_R for some α ∈ VS_LL(G, x)}. Clearly, PVS_LL(G, x) and VS_LL(G, x) are both nonempty if and only if x ∈ PREFIX(G). The initialization step entails computing the primitive LL-associates of ε, i.e., PVS_LL(G, ε) = {ω | S → ω ∈ P^R}.
The conditions under which General_LL terminates are analogous to those of General_RR. If w ∈ L(G), then VS_LL(G, w) is the last set of LL-associates computed; after it is known, w is accepted since ε ∈ VS_LL(G, w) if and only if w ∈ L(G). Conversely, suppose that w ∉ L(G). If w ∉ PREFIX(G) also holds, then there is a unique string x ∈ T* which is the shortest prefix of w such that x ∉ PREFIX(G) holds. In this case, PVS_LL(G, x) is the first
that already existed in the graph, all states in U_i which were previously acted on are reconsidered to see if any reductions from them pass through the new transition. This is apparently sufficient, but the details of how it is accomplished are not provided.
The worst-case time complexity of Tomita's algorithm is also O(n^{p+1}) [26]. In comparison, recall that the complexity of Earley's algorithm is not affected by the length of production right-hand sides. Accompanying the complexity analysis by Kipps [26] is a modified version of Tomita's algorithm that has a worst-case running time in O(n³). In short, additional inter-state links are used for decreasing the number of paths that must be traversed when performing reductions. However, the plethora of set-union and set-membership operations contained in the algorithm does not make it clear that O(n³) time is obtained. In any case, this modification subverts the algorithm's ability to construct a parse forest, so it is only useful for recognition.
tion algorithm to the NLR(0) automaton of G. Any automaton intermediate between the
NLR(0) and LR(0) automata that is built during subset construction provides a viable
candidate for a control automaton. One main advantage of LR(0) automata is their determinism,
whereas a favorable feature of NLR(0) automata is their comparatively smaller number of
states. Automata that are intermediate between these two extremes can be tailored to
balance both of these factors. The choice of possible control automata is broadened still further
when lookahead is introduced. An investigation of alternate control automata is left for
future work.
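The intermediate automata mentioned above arise naturally from the standard subset construction. The following is a minimal sketch under our own encoding (the function name and the toy NFA are illustrative, not taken from this dissertation): each partial stage of the construction, with some state sets merged and others still unexplored, corresponds to a candidate control automaton between the NLR(0) and LR(0) extremes.

```python
# Minimal subset construction over an abstract nondeterministic
# automaton, encoded as a dict mapping (state, symbol) -> set of
# successor states.  Running it to completion yields the deterministic
# (LR(0)-like) extreme; stopping early leaves an intermediate automaton.

from collections import deque

def subset_construction(nfa, start_states, alphabet):
    """Determinize `nfa`; returns dict: frozenset -> {symbol: frozenset}."""
    start = frozenset(start_states)
    dfa, work = {}, deque([start])
    while work:
        src = work.popleft()
        if src in dfa:
            continue
        dfa[src] = {}
        for sym in alphabet:
            dst = frozenset(q for s in src for q in nfa.get((s, sym), ()))
            if dst:
                dfa[src][sym] = dst
                work.append(dst)
    return dfa

# Toy NFA: two nondeterministic choices on 'a' merge into state {1, 2}.
nfa = {(0, 'a'): {1, 2}, (1, 'b'): {3}, (2, 'b'): {3}}
dfa = subset_construction(nfa, {0}, ['a', 'b'])
print(dfa[frozenset({0})]['a'])  # frozenset({1, 2})
```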
Of the known context-free recognition algorithms, General_LR0 is most like Tomita's
algorithm without lookahead [42,43]. In this form, Tomita's algorithm interprets a parse
table derived from the LR(0) automaton of G and maintains a so-called graph-structured
stack that is similar in structure to our recognition graph. However, a transition of the form
(p, A, q) is represented by two edges of the form (p, r_A) and (r_A, q), where p, q correspond to
parse states and r_A is a symbol vertex. In effect, the symbol vertices play the role of our
transition labels. Due to the use of these symbol vertices, the correspondence between the
states and edges in the graph-structured stack and the states and transitions of the underlying
LR(0) automaton is not as precise as in General_LR0. In addition, the symbol vertices
needlessly increase the number of vertices and edges in the graph-structured stack, increase
the lengths of paths that are traversed during reductions by a factor of 2, and complicate the
operations which manage the stack.
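The factor-of-2 claim can be seen directly by encoding both representations (state and vertex names below are invented for illustration): a labeled transition carries its grammar symbol on one edge, while the symbol-vertex form spends two unlabeled edges per symbol.

```python
# Illustrative comparison of the two stack representations.  For a
# reduction by a production A -> X Y Z, the direct form walks one edge
# per right-hand-side symbol; the symbol-vertex form walks two.

# Direct representation: labeled transitions (p, X, q).
direct = [('p0', 'X', 'p1'), ('p1', 'Y', 'p2'), ('p2', 'Z', 'p3')]

# Symbol-vertex representation: each transition (p, X, q) becomes the
# edge pair (p, r_X) and (r_X, q) through a symbol vertex r_X.
via_vertices = []
for p, sym, q in direct:
    r = 'r_' + sym                  # symbol vertex for this occurrence
    via_vertices += [(p, r), (r, q)]

print(len(direct), len(via_vertices))  # 3 6
```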
Tomita's algorithm cannot handle cyclic grammars [42]. However, it also fails to handle
some noncyclic grammars that contain ε-productions. In short, any grammar that may
introduce a cycle into the graph-structured stack is troublesome. These grammars are
exactly the grammars that can introduce cycles into our recognition graphs.
Tomita's algorithm independently keeps track of edges that may need to be reduced
back through and states that have yet to be acted on (a state is acted on to determine what
parse moves are relevant to it). In contrast, other than the special attention given certain
In summary, a parse annotation is either (1) a descriptor of a terminal symbol, (2) a
descriptor of the null string, or (3) a sequence of pointers to parse annotations. In order to
reflect the close connection between parse annotations and recognition graph transitions, the
notation used to specify transitions is modified slightly as follows. Currently, (p, X, q)
denotes a transition in δ. In our discussion of General_LR0', this transition will be denoted
by the quadruple (p, X, q, [π]), where [π] is the parse annotation of (p, X, q). Thus, upon
acceptance of the input string, a parse tree for it can be recovered from the grammar symbols
and parse annotations that are associated with the transitions in G_R.
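The three-way structure of parse annotations can be sketched directly (class and field names below are our own, not the dissertation's): an annotation is a terminal descriptor, a null-string descriptor, or a sequence of pointers to sub-annotations, and a tree's frontier is recovered by following the pointers.

```python
# A minimal sketch of the parse-annotation structure: cases (1)-(3)
# from the text, plus a function that reads the terminal yield back
# off an annotation by chasing pointers.

from dataclasses import dataclass
from typing import List, Union

@dataclass
class Terminal:          # (1) descriptor of a terminal symbol
    symbol: str

@dataclass
class Null:              # (2) descriptor of the null string
    pass

@dataclass
class Sequence:          # (3) sequence of pointers to annotations
    parts: List['Annotation']

Annotation = Union[Terminal, Null, Sequence]

def yield_of(pi: Annotation) -> str:
    """Recover the frontier (terminal yield) of an annotation."""
    if isinstance(pi, Terminal):
        return pi.symbol
    if isinstance(pi, Null):
        return ''
    return ''.join(yield_of(part) for part in pi.parts)

# Annotation for a derivation whose frontier is "ab".
pi = Sequence([Terminal('a'), Sequence([Terminal('b'), Null()])])
print(yield_of(pi))  # ab
```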
Parsing the Empty String
As identified in Chapter VI, a transition in the recognition graph of the form (p, A, q),
where p, q ∈ Q_i for some i, 0 ≤ i ≤ n, signifies a derivation of ε from A.
Such transitions are handled in a particularly simple fashion by the Traverse function of
General_LR0 since the steps in the derivation are not relevant to recognition. However, in
order to fulfill its role as a general parser, General_LR0' must be able to reconstruct a
derivation of ε from A for this transition.
Some derivations of the empty string are especially troublesome, namely those which
are unbounded in length. Unbounded derivations of ε are caused by those nonterminals
A ∈ N for which A ⇒+ A ⇒* ε holds in G. General_LR0' resolves this issue by disambiguating
every ambiguous derivation of ε that occurs during a parse. The Traverse function is
modified to accomplish this task. The details of the revised Traverse function are given in
the next section. In the remainder of this section, we introduce some notions that are used in
the later discussion of Traverse.
First, we define W = {A ∈ N | A ⇒* ε holds in G}. For each nonterminal symbol
A ∈ W, Traverse minimizes the length of derivations of ε from A. Toward that end, a partition
of W is defined as follows: (1) W_1 = {A ∈ W | A → ε ∈ P}, and (2) for j > 1,
W_j = {A ∈ W | A ∉ W_1 ∪ ⋯ ∪ W_{j-1}, A → B_1 B_2 ⋯ B_m ∈ P, m ≥ 1, and B_k ∈ W_1 ∪ ⋯ ∪ W_{j-1} for 1 ≤ k ≤ m}.
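The stratification just described is easily computed; the sketch below uses our own grammar encoding (a dict from nonterminal to its list of right-hand sides, with [] encoding an ε right-hand side), which is not the dissertation's notation.

```python
# Stratify the nullable nonterminals W into W_1, W_2, ...:
# W_1 holds nonterminals with an explicit A -> epsilon production, and
# W_j holds those with a nonempty right-hand side built entirely from
# nonterminals in lower strata.

def stratify_nullable(productions):
    """productions: dict nonterminal -> list of right-hand sides."""
    strata = []
    placed = set()
    # W_1: nonterminals with an epsilon production.
    current = {a for a, rhss in productions.items() if [] in rhss}
    while current:
        strata.append(current)
        placed |= current
        current = {a for a, rhss in productions.items()
                   if a not in placed
                   and any(rhs and all(b in placed for b in rhs)
                           for rhs in rhss)}
    return strata

# A -> B C, B -> epsilon, C -> B: W_1 = {B}, W_2 = {C}, W_3 = {A}.
prods = {'A': [['B', 'C']], 'B': [[]], 'C': [['B']]}
print(stratify_nullable(prods))  # [{'B'}, {'C'}, {'A'}]
```

A nonterminal's stratum index bounds the length of its shortest derivation of ε, which is what lets Traverse pick a minimal derivation.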
CHAPTER IV
GENERAL BOTTOM-UP RECOGNITION: A FORMAL FRAMEWORK
A formal framework for describing general bottom-up recognition is developed next. In
particular, a general bottom-up recognition scheme that scans input strings from left to right
is presented. The bottom-up left-to-right character of the recognition scheme, called
General_LR, intimates that it is an inverse of General_RR. Indeed, General_LR is directly
derived from General_RR through inverses of the R-derives and chop relations. Consequently,
General_LR also exploits certain regularity properties of context-free grammars.
In keeping with Chapter III, some formal aspects of general bottom-up recognition are
examined in a set-theoretic framework. Later chapters adopt a less abstract character;
specifically, General_LR is cast into concrete terms, viz., state-transition graphs and finite-state
automata. Ultimately, a general bottom-up parser based on General_LR is described.
An arbitrary reduced grammar G = (V, T, P, S) is assumed throughout this chapter.
Bottom-Up Left-to-Right Recognition
In a bottom-up approach to recognition, an attempt is made to construct a parse tree
for an input string, perhaps implicitly, by starting from the leaves and working toward the
root. A basic step in the upward synthesis of a parse tree involves grafting together the
roots of one or more subtrees into a larger subtree. Suppose that the collection of these
subtrees is represented by the string of grammar symbols which label their roots. A grafting
operation may be described in terms of applying the inverse of the ⇒ relation to this linearized
form of the partially constructed parse tree. That is, the occurrence of a production
right-hand side in this string is replaced by (or reduced to) the corresponding left-hand side
nonterminal symbol; this symbol labels the root of the subtree produced by the grafting
