OLD | NEW |
| (Empty) |
1 <html> | |
2 <head> | |
3 <title>The Lemon Parser Generator</title> | |
4 </head> | |
5 <body bgcolor=white> | |
6 <h1 align=center>The Lemon Parser Generator</h1> | |
7 | |
8 <p>Lemon is an LALR(1) parser generator for C or C++. | |
9 It does the same job as ``bison'' and ``yacc''. | |
10 But lemon is not another bison or yacc clone. It | |
11 uses a different grammar syntax which is designed to | |
12 reduce the number of coding errors. Lemon also uses a more | |
13 sophisticated parsing engine that is faster than yacc and | |
14 bison and which is both reentrant and thread-safe. | |
15 Furthermore, Lemon implements features that can be used | |
16 to eliminate resource leaks, making is suitable for use | |
17 in long-running programs such as graphical user interfaces | |
18 or embedded controllers.</p> | |
19 | |
20 <p>This document is an introduction to the Lemon | |
21 parser generator.</p> | |
22 | |
23 <h2>Theory of Operation</h2> | |
24 | |
25 <p>The main goal of Lemon is to translate a context free grammar (CFG) | |
26 for a particular language into C code that implements a parser for | |
27 that language. | |
28 The program has two inputs: | |
29 <ul> | |
30 <li>The grammar specification. | |
31 <li>A parser template file. | |
32 </ul> | |
33 Typically, only the grammar specification is supplied by the programmer. | |
34 Lemon comes with a default parser template which works fine for most | |
35 applications. But the user is free to substitute a different parser | |
36 template if desired.</p> | |
37 | |
38 <p>Depending on command-line options, Lemon will generate between | |
39 one and three files of outputs. | |
40 <ul> | |
41 <li>C code to implement the parser. | |
42 <li>A header file defining an integer ID for each terminal symbol. | |
43 <li>An information file that describes the states of the generated parser | |
44 automaton. | |
45 </ul> | |
46 By default, all three of these output files are generated. | |
47 The header file is suppressed if the ``-m'' command-line option is | |
48 used and the report file is omitted when ``-q'' is selected.</p> | |
49 | |
50 <p>The grammar specification file uses a ``.y'' suffix, by convention. | |
51 In the examples used in this document, we'll assume the name of the | |
52 grammar file is ``gram.y''. A typical use of Lemon would be the | |
53 following command: | |
54 <pre> | |
55 lemon gram.y | |
56 </pre> | |
57 This command will generate three output files named ``gram.c'', | |
58 ``gram.h'' and ``gram.out''. | |
59 The first is C code to implement the parser. The second | |
60 is the header file that defines numerical values for all | |
61 terminal symbols, and the last is the report that explains | |
62 the states used by the parser automaton.</p> | |
63 | |
64 <h3>Command Line Options</h3> | |
65 | |
66 <p>The behavior of Lemon can be modified using command-line options. | |
67 You can obtain a list of the available command-line options together | |
68 with a brief explanation of what each does by typing | |
69 <pre> | |
70 lemon -? | |
71 </pre> | |
72 As of this writing, the following command-line options are supported: | |
73 <ul> | |
74 <li><tt>-b</tt> | |
75 <li><tt>-c</tt> | |
76 <li><tt>-g</tt> | |
77 <li><tt>-m</tt> | |
78 <li><tt>-q</tt> | |
79 <li><tt>-s</tt> | |
80 <li><tt>-x</tt> | |
81 </ul> | |
82 The ``-b'' option reduces the amount of text in the report file by | |
83 printing only the basis of each parser state, rather than the full | |
84 configuration. | |
85 The ``-c'' option suppresses action table compression. Using -c | |
86 will make the parser a little larger and slower but it will detect | |
87 syntax errors sooner. | |
88 The ``-g'' option causes no output files to be generated at all. | |
89 Instead, the input grammar file is printed on standard output but | |
90 with all comments, actions and other extraneous text deleted. This | |
91 is a useful way to get a quick summary of a grammar. | |
92 The ``-m'' option causes the output C source file to be compatible | |
93 with the ``makeheaders'' program. | |
94 Makeheaders is a program that automatically generates header files | |
95 from C source code. When the ``-m'' option is used, the header | |
96 file is not output since the makeheaders program will take care | |
97 of generated all header files automatically. | |
98 The ``-q'' option suppresses the report file. | |
99 Using ``-s'' causes a brief summary of parser statistics to be | |
100 printed. Like this: | |
101 <pre> | |
102 Parser statistics: 74 terminals, 70 nonterminals, 179 rules | |
103 340 states, 2026 parser table entries, 0 conflicts | |
104 </pre> | |
105 Finally, the ``-x'' option causes Lemon to print its version number | |
106 and then stops without attempting to read the grammar or generate a parser.</p> | |
107 | |
108 <h3>The Parser Interface</h3> | |
109 | |
110 <p>Lemon doesn't generate a complete, working program. It only generates | |
111 a few subroutines that implement a parser. This section describes | |
112 the interface to those subroutines. It is up to the programmer to | |
113 call these subroutines in an appropriate way in order to produce a | |
114 complete system.</p> | |
115 | |
116 <p>Before a program begins using a Lemon-generated parser, the program | |
117 must first create the parser. | |
118 A new parser is created as follows: | |
119 <pre> | |
120 void *pParser = ParseAlloc( malloc ); | |
121 </pre> | |
122 The ParseAlloc() routine allocates and initializes a new parser and | |
123 returns a pointer to it. | |
124 The actual data structure used to represent a parser is opaque -- | |
125 its internal structure is not visible or usable by the calling routine. | |
126 For this reason, the ParseAlloc() routine returns a pointer to void | |
127 rather than a pointer to some particular structure. | |
128 The sole argument to the ParseAlloc() routine is a pointer to the | |
129 subroutine used to allocate memory. Typically this means ``malloc()''.</p> | |
130 | |
131 <p>After a program is finished using a parser, it can reclaim all | |
132 memory allocated by that parser by calling | |
133 <pre> | |
134 ParseFree(pParser, free); | |
135 </pre> | |
136 The first argument is the same pointer returned by ParseAlloc(). The | |
137 second argument is a pointer to the function used to release bulk | |
138 memory back to the system.</p> | |
139 | |
140 <p>After a parser has been allocated using ParseAlloc(), the programmer | |
141 must supply the parser with a sequence of tokens (terminal symbols) to | |
142 be parsed. This is accomplished by calling the following function | |
143 once for each token: | |
144 <pre> | |
145 Parse(pParser, hTokenID, sTokenData, pArg); | |
146 </pre> | |
147 The first argument to the Parse() routine is the pointer returned by | |
148 ParseAlloc(). | |
149 The second argument is a small positive integer that tells the parse the | |
150 type of the next token in the data stream. | |
151 There is one token type for each terminal symbol in the grammar. | |
152 The gram.h file generated by Lemon contains #define statements that | |
153 map symbolic terminal symbol names into appropriate integer values. | |
154 (A value of 0 for the second argument is a special flag to the | |
155 parser to indicate that the end of input has been reached.) | |
156 The third argument is the value of the given token. By default, | |
157 the type of the third argument is integer, but the grammar will | |
158 usually redefine this type to be some kind of structure. | |
159 Typically the second argument will be a broad category of tokens | |
160 such as ``identifier'' or ``number'' and the third argument will | |
161 be the name of the identifier or the value of the number.</p> | |
162 | |
163 <p>The Parse() function may have either three or four arguments, | |
164 depending on the grammar. If the grammar specification file request | |
165 it, the Parse() function will have a fourth parameter that can be | |
166 of any type chosen by the programmer. The parser doesn't do anything | |
167 with this argument except to pass it through to action routines. | |
168 This is a convenient mechanism for passing state information down | |
169 to the action routines without having to use global variables.</p> | |
170 | |
171 <p>A typical use of a Lemon parser might look something like the | |
172 following: | |
173 <pre> | |
174 01 ParseTree *ParseFile(const char *zFilename){ | |
175 02 Tokenizer *pTokenizer; | |
176 03 void *pParser; | |
177 04 Token sToken; | |
178 05 int hTokenId; | |
179 06 ParserState sState; | |
180 07 | |
181 08 pTokenizer = TokenizerCreate(zFilename); | |
182 09 pParser = ParseAlloc( malloc ); | |
183 10 InitParserState(&sState); | |
184 11 while( GetNextToken(pTokenizer, &hTokenId, &sToken) ){ | |
185 12 Parse(pParser, hTokenId, sToken, &sState); | |
186 13 } | |
187 14 Parse(pParser, 0, sToken, &sState); | |
188 15 ParseFree(pParser, free ); | |
189 16 TokenizerFree(pTokenizer); | |
190 17 return sState.treeRoot; | |
191 18 } | |
192 </pre> | |
193 This example shows a user-written routine that parses a file of | |
194 text and returns a pointer to the parse tree. | |
195 (We've omitted all error-handling from this example to keep it | |
196 simple.) | |
197 We assume the existence of some kind of tokenizer which is created | |
198 using TokenizerCreate() on line 8 and deleted by TokenizerFree() | |
199 on line 16. The GetNextToken() function on line 11 retrieves the | |
200 next token from the input file and puts its type in the | |
201 integer variable hTokenId. The sToken variable is assumed to be | |
202 some kind of structure that contains details about each token, | |
203 such as its complete text, what line it occurs on, etc. </p> | |
204 | |
205 <p>This example also assumes the existence of structure of type | |
206 ParserState that holds state information about a particular parse. | |
207 An instance of such a structure is created on line 6 and initialized | |
208 on line 10. A pointer to this structure is passed into the Parse() | |
209 routine as the optional 4th argument. | |
210 The action routine specified by the grammar for the parser can use | |
211 the ParserState structure to hold whatever information is useful and | |
212 appropriate. In the example, we note that the treeRoot field of | |
213 the ParserState structure is left pointing to the root of the parse | |
214 tree.</p> | |
215 | |
216 <p>The core of this example as it relates to Lemon is as follows: | |
217 <pre> | |
218 ParseFile(){ | |
219 pParser = ParseAlloc( malloc ); | |
220 while( GetNextToken(pTokenizer,&hTokenId, &sToken) ){ | |
221 Parse(pParser, hTokenId, sToken); | |
222 } | |
223 Parse(pParser, 0, sToken); | |
224 ParseFree(pParser, free ); | |
225 } | |
226 </pre> | |
227 Basically, what a program has to do to use a Lemon-generated parser | |
228 is first create the parser, then send it lots of tokens obtained by | |
229 tokenizing an input source. When the end of input is reached, the | |
230 Parse() routine should be called one last time with a token type | |
231 of 0. This step is necessary to inform the parser that the end of | |
232 input has been reached. Finally, we reclaim memory used by the | |
233 parser by calling ParseFree().</p> | |
234 | |
235 <p>There is one other interface routine that should be mentioned | |
236 before we move on. | |
237 The ParseTrace() function can be used to generate debugging output | |
238 from the parser. A prototype for this routine is as follows: | |
239 <pre> | |
240 ParseTrace(FILE *stream, char *zPrefix); | |
241 </pre> | |
242 After this routine is called, a short (one-line) message is written | |
243 to the designated output stream every time the parser changes states | |
244 or calls an action routine. Each such message is prefaced using | |
245 the text given by zPrefix. This debugging output can be turned off | |
246 by calling ParseTrace() again with a first argument of NULL (0).</p> | |
247 | |
248 <h3>Differences With YACC and BISON</h3> | |
249 | |
250 <p>Programmers who have previously used the yacc or bison parser | |
251 generator will notice several important differences between yacc and/or | |
252 bison and Lemon. | |
253 <ul> | |
254 <li>In yacc and bison, the parser calls the tokenizer. In Lemon, | |
255 the tokenizer calls the parser. | |
256 <li>Lemon uses no global variables. Yacc and bison use global variables | |
257 to pass information between the tokenizer and parser. | |
258 <li>Lemon allows multiple parsers to be running simultaneously. Yacc | |
259 and bison do not. | |
260 </ul> | |
261 These differences may cause some initial confusion for programmers | |
262 with prior yacc and bison experience. | |
263 But after years of experience using Lemon, I firmly | |
264 believe that the Lemon way of doing things is better.</p> | |
265 | |
266 <h2>Input File Syntax</h2> | |
267 | |
268 <p>The main purpose of the grammar specification file for Lemon is | |
269 to define the grammar for the parser. But the input file also | |
270 specifies additional information Lemon requires to do its job. | |
271 Most of the work in using Lemon is in writing an appropriate | |
272 grammar file.</p> | |
273 | |
274 <p>The grammar file for lemon is, for the most part, free format. | |
275 It does not have sections or divisions like yacc or bison. Any | |
276 declaration can occur at any point in the file. | |
277 Lemon ignores whitespace (except where it is needed to separate | |
278 tokens) and it honors the same commenting conventions as C and C++.</p> | |
279 | |
280 <h3>Terminals and Nonterminals</h3> | |
281 | |
282 <p>A terminal symbol (token) is any string of alphanumeric | |
283 and underscore characters | |
284 that begins with an upper case letter. | |
285 A terminal can contain lowercase letters after the first character, | |
286 but the usual convention is to make terminals all upper case. | |
287 A nonterminal, on the other hand, is any string of alphanumeric | |
288 and underscore characters than begins with a lower case letter. | |
289 Again, the usual convention is to make nonterminals use all lower | |
290 case letters.</p> | |
291 | |
292 <p>In Lemon, terminal and nonterminal symbols do not need to | |
293 be declared or identified in a separate section of the grammar file. | |
294 Lemon is able to generate a list of all terminals and nonterminals | |
295 by examining the grammar rules, and it can always distinguish a | |
296 terminal from a nonterminal by checking the case of the first | |
297 character of the name.</p> | |
298 | |
299 <p>Yacc and bison allow terminal symbols to have either alphanumeric | |
300 names or to be individual characters included in single quotes, like | |
301 this: ')' or '$'. Lemon does not allow this alternative form for | |
302 terminal symbols. With Lemon, all symbols, terminals and nonterminals, | |
303 must have alphanumeric names.</p> | |
304 | |
305 <h3>Grammar Rules</h3> | |
306 | |
307 <p>The main component of a Lemon grammar file is a sequence of grammar | |
308 rules. | |
309 Each grammar rule consists of a nonterminal symbol followed by | |
310 the special symbol ``::='' and then a list of terminals and/or nonterminals. | |
311 The rule is terminated by a period. | |
312 The list of terminals and nonterminals on the right-hand side of the | |
313 rule can be empty. | |
314 Rules can occur in any order, except that the left-hand side of the | |
315 first rule is assumed to be the start symbol for the grammar (unless | |
316 specified otherwise using the <tt>%start</tt> directive described below.) | |
317 A typical sequence of grammar rules might look something like this: | |
318 <pre> | |
319 expr ::= expr PLUS expr. | |
320 expr ::= expr TIMES expr. | |
321 expr ::= LPAREN expr RPAREN. | |
322 expr ::= VALUE. | |
323 </pre> | |
324 </p> | |
325 | |
326 <p>There is one non-terminal in this example, ``expr'', and five | |
327 terminal symbols or tokens: ``PLUS'', ``TIMES'', ``LPAREN'', | |
328 ``RPAREN'' and ``VALUE''.</p> | |
329 | |
330 <p>Like yacc and bison, Lemon allows the grammar to specify a block | |
331 of C code that will be executed whenever a grammar rule is reduced | |
332 by the parser. | |
333 In Lemon, this action is specified by putting the C code (contained | |
334 within curly braces <tt>{...}</tt>) immediately after the | |
335 period that closes the rule. | |
336 For example: | |
337 <pre> | |
338 expr ::= expr PLUS expr. { printf("Doing an addition...\n"); } | |
339 </pre> | |
340 </p> | |
341 | |
342 <p>In order to be useful, grammar actions must normally be linked to | |
343 their associated grammar rules. | |
344 In yacc and bison, this is accomplished by embedding a ``$$'' in the | |
345 action to stand for the value of the left-hand side of the rule and | |
346 symbols ``$1'', ``$2'', and so forth to stand for the value of | |
347 the terminal or nonterminal at position 1, 2 and so forth on the | |
348 right-hand side of the rule. | |
349 This idea is very powerful, but it is also very error-prone. The | |
350 single most common source of errors in a yacc or bison grammar is | |
351 to miscount the number of symbols on the right-hand side of a grammar | |
352 rule and say ``$7'' when you really mean ``$8''.</p> | |
353 | |
354 <p>Lemon avoids the need to count grammar symbols by assigning symbolic | |
355 names to each symbol in a grammar rule and then using those symbolic | |
356 names in the action. | |
357 In yacc or bison, one would write this: | |
358 <pre> | |
359 expr -> expr PLUS expr { $$ = $1 + $3; }; | |
360 </pre> | |
361 But in Lemon, the same rule becomes the following: | |
362 <pre> | |
363 expr(A) ::= expr(B) PLUS expr(C). { A = B+C; } | |
364 </pre> | |
365 In the Lemon rule, any symbol in parentheses after a grammar rule | |
366 symbol becomes a place holder for that symbol in the grammar rule. | |
367 This place holder can then be used in the associated C action to | |
368 stand for the value of that symbol.<p> | |
369 | |
370 <p>The Lemon notation for linking a grammar rule with its reduce | |
371 action is superior to yacc/bison on several counts. | |
372 First, as mentioned above, the Lemon method avoids the need to | |
373 count grammar symbols. | |
374 Secondly, if a terminal or nonterminal in a Lemon grammar rule | |
375 includes a linking symbol in parentheses but that linking symbol | |
376 is not actually used in the reduce action, then an error message | |
377 is generated. | |
378 For example, the rule | |
379 <pre> | |
380 expr(A) ::= expr(B) PLUS expr(C). { A = B; } | |
381 </pre> | |
382 will generate an error because the linking symbol ``C'' is used | |
383 in the grammar rule but not in the reduce action.</p> | |
384 | |
385 <p>The Lemon notation for linking grammar rules to reduce actions | |
386 also facilitates the use of destructors for reclaiming memory | |
387 allocated by the values of terminals and nonterminals on the | |
388 right-hand side of a rule.</p> | |
389 | |
390 <h3>Precedence Rules</h3> | |
391 | |
392 <p>Lemon resolves parsing ambiguities in exactly the same way as | |
393 yacc and bison. A shift-reduce conflict is resolved in favor | |
394 of the shift, and a reduce-reduce conflict is resolved by reducing | |
395 whichever rule comes first in the grammar file.</p> | |
396 | |
397 <p>Just like in | |
398 yacc and bison, Lemon allows a measure of control | |
399 over the resolution of paring conflicts using precedence rules. | |
400 A precedence value can be assigned to any terminal symbol | |
401 using the %left, %right or %nonassoc directives. Terminal symbols | |
402 mentioned in earlier directives have a lower precedence that | |
403 terminal symbols mentioned in later directives. For example:</p> | |
404 | |
405 <p><pre> | |
406 %left AND. | |
407 %left OR. | |
408 %nonassoc EQ NE GT GE LT LE. | |
409 %left PLUS MINUS. | |
410 %left TIMES DIVIDE MOD. | |
411 %right EXP NOT. | |
412 </pre></p> | |
413 | |
414 <p>In the preceding sequence of directives, the AND operator is | |
415 defined to have the lowest precedence. The OR operator is one | |
416 precedence level higher. And so forth. Hence, the grammar would | |
417 attempt to group the ambiguous expression | |
418 <pre> | |
419 a AND b OR c | |
420 </pre> | |
421 like this | |
422 <pre> | |
423 a AND (b OR c). | |
424 </pre> | |
425 The associativity (left, right or nonassoc) is used to determine | |
426 the grouping when the precedence is the same. AND is left-associative | |
427 in our example, so | |
428 <pre> | |
429 a AND b AND c | |
430 </pre> | |
431 is parsed like this | |
432 <pre> | |
433 (a AND b) AND c. | |
434 </pre> | |
435 The EXP operator is right-associative, though, so | |
436 <pre> | |
437 a EXP b EXP c | |
438 </pre> | |
439 is parsed like this | |
440 <pre> | |
441 a EXP (b EXP c). | |
442 </pre> | |
443 The nonassoc precedence is used for non-associative operators. | |
444 So | |
445 <pre> | |
446 a EQ b EQ c | |
447 </pre> | |
448 is an error.</p> | |
449 | |
450 <p>The precedence of non-terminals is transferred to rules as follows: | |
451 The precedence of a grammar rule is equal to the precedence of the | |
452 left-most terminal symbol in the rule for which a precedence is | |
453 defined. This is normally what you want, but in those cases where | |
454 you want to precedence of a grammar rule to be something different, | |
455 you can specify an alternative precedence symbol by putting the | |
456 symbol in square braces after the period at the end of the rule and | |
457 before any C-code. For example:</p> | |
458 | |
459 <p><pre> | |
460 expr = MINUS expr. [NOT] | |
461 </pre></p> | |
462 | |
463 <p>This rule has a precedence equal to that of the NOT symbol, not the | |
464 MINUS symbol as would have been the case by default.</p> | |
465 | |
466 <p>With the knowledge of how precedence is assigned to terminal | |
467 symbols and individual | |
468 grammar rules, we can now explain precisely how parsing conflicts | |
469 are resolved in Lemon. Shift-reduce conflicts are resolved | |
470 as follows: | |
471 <ul> | |
472 <li> If either the token to be shifted or the rule to be reduced | |
473 lacks precedence information, then resolve in favor of the | |
474 shift, but report a parsing conflict. | |
475 <li> If the precedence of the token to be shifted is greater than | |
476 the precedence of the rule to reduce, then resolve in favor | |
477 of the shift. No parsing conflict is reported. | |
478 <li> If the precedence of the token it be shifted is less than the | |
479 precedence of the rule to reduce, then resolve in favor of the | |
480 reduce action. No parsing conflict is reported. | |
481 <li> If the precedences are the same and the shift token is | |
482 right-associative, then resolve in favor of the shift. | |
483 No parsing conflict is reported. | |
484 <li> If the precedences are the same the shift token is | |
485 left-associative, then resolve in favor of the reduce. | |
486 No parsing conflict is reported. | |
487 <li> Otherwise, resolve the conflict by doing the shift and | |
488 report the parsing conflict. | |
489 </ul> | |
490 Reduce-reduce conflicts are resolved this way: | |
491 <ul> | |
492 <li> If either reduce rule | |
493 lacks precedence information, then resolve in favor of the | |
494 rule that appears first in the grammar and report a parsing | |
495 conflict. | |
496 <li> If both rules have precedence and the precedence is different | |
497 then resolve the dispute in favor of the rule with the highest | |
498 precedence and do not report a conflict. | |
499 <li> Otherwise, resolve the conflict by reducing by the rule that | |
500 appears first in the grammar and report a parsing conflict. | |
501 </ul> | |
502 | |
503 <h3>Special Directives</h3> | |
504 | |
505 <p>The input grammar to Lemon consists of grammar rules and special | |
506 directives. We've described all the grammar rules, so now we'll | |
507 talk about the special directives.</p> | |
508 | |
509 <p>Directives in lemon can occur in any order. You can put them before | |
510 the grammar rules, or after the grammar rules, or in the mist of the | |
511 grammar rules. It doesn't matter. The relative order of | |
512 directives used to assign precedence to terminals is important, but | |
513 other than that, the order of directives in Lemon is arbitrary.</p> | |
514 | |
515 <p>Lemon supports the following special directives: | |
516 <ul> | |
517 <li><tt>%code</tt> | |
518 <li><tt>%default_destructor</tt> | |
519 <li><tt>%default_type</tt> | |
520 <li><tt>%destructor</tt> | |
521 <li><tt>%extra_argument</tt> | |
522 <li><tt>%include</tt> | |
523 <li><tt>%left</tt> | |
524 <li><tt>%name</tt> | |
525 <li><tt>%nonassoc</tt> | |
526 <li><tt>%parse_accept</tt> | |
527 <li><tt>%parse_failure </tt> | |
528 <li><tt>%right</tt> | |
529 <li><tt>%stack_overflow</tt> | |
530 <li><tt>%stack_size</tt> | |
531 <li><tt>%start_symbol</tt> | |
532 <li><tt>%syntax_error</tt> | |
533 <li><tt>%token_destructor</tt> | |
534 <li><tt>%token_prefix</tt> | |
535 <li><tt>%token_type</tt> | |
536 <li><tt>%type</tt> | |
537 </ul> | |
538 Each of these directives will be described separately in the | |
539 following sections:</p> | |
540 | |
541 <h4>The <tt>%code</tt> directive</h4> | |
542 | |
543 <p>The %code directive is used to specify addition C/C++ code that | |
544 is added to the end of the main output file. This is similar to | |
545 the %include directive except that %include is inserted at the | |
546 beginning of the main output file.</p> | |
547 | |
548 <p>%code is typically used to include some action routines or perhaps | |
549 a tokenizer as part of the output file.</p> | |
550 | |
551 <h4>The <tt>%default_destructor</tt> directive</h4> | |
552 | |
553 <p>The %default_destructor directive specifies a destructor to | |
554 use for non-terminals that do not have their own destructor | |
555 specified by a separate %destructor directive. See the documentation | |
556 on the %destructor directive below for additional information.</p> | |
557 | |
558 <p>In some grammers, many different non-terminal symbols have the | |
559 same datatype and hence the same destructor. This directive is | |
560 a convenience way to specify the same destructor for all those | |
561 non-terminals using a single statement.</p> | |
562 | |
563 <h4>The <tt>%default_type</tt> directive</h4> | |
564 | |
565 <p>The %default_type directive specifies the datatype of non-terminal | |
566 symbols that do no have their own datatype defined using a separate | |
567 %type directive. See the documentation on %type below for addition | |
568 information.</p> | |
569 | |
570 <h4>The <tt>%destructor</tt> directive</h4> | |
571 | |
572 <p>The %destructor directive is used to specify a destructor for | |
573 a non-terminal symbol. | |
574 (See also the %token_destructor directive which is used to | |
575 specify a destructor for terminal symbols.)</p> | |
576 | |
577 <p>A non-terminal's destructor is called to dispose of the | |
578 non-terminal's value whenever the non-terminal is popped from | |
579 the stack. This includes all of the following circumstances: | |
580 <ul> | |
581 <li> When a rule reduces and the value of a non-terminal on | |
582 the right-hand side is not linked to C code. | |
583 <li> When the stack is popped during error processing. | |
584 <li> When the ParseFree() function runs. | |
585 </ul> | |
586 The destructor can do whatever it wants with the value of | |
587 the non-terminal, but its design is to deallocate memory | |
588 or other resources held by that non-terminal.</p> | |
589 | |
590 <p>Consider an example: | |
591 <pre> | |
592 %type nt {void*} | |
593 %destructor nt { free($$); } | |
594 nt(A) ::= ID NUM. { A = malloc( 100 ); } | |
595 </pre> | |
596 This example is a bit contrived but it serves to illustrate how | |
597 destructors work. The example shows a non-terminal named | |
598 ``nt'' that holds values of type ``void*''. When the rule for | |
599 an ``nt'' reduces, it sets the value of the non-terminal to | |
600 space obtained from malloc(). Later, when the nt non-terminal | |
601 is popped from the stack, the destructor will fire and call | |
602 free() on this malloced space, thus avoiding a memory leak. | |
603 (Note that the symbol ``$$'' in the destructor code is replaced | |
604 by the value of the non-terminal.)</p> | |
605 | |
606 <p>It is important to note that the value of a non-terminal is passed | |
607 to the destructor whenever the non-terminal is removed from the | |
608 stack, unless the non-terminal is used in a C-code action. If | |
609 the non-terminal is used by C-code, then it is assumed that the | |
610 C-code will take care of destroying it if it should really | |
611 be destroyed. More commonly, the value is used to build some | |
612 larger structure and we don't want to destroy it, which is why | |
613 the destructor is not called in this circumstance.</p> | |
614 | |
615 <p>By appropriate use of destructors, it is possible to | |
616 build a parser using Lemon that can be used within a long-running | |
617 program, such as a GUI, that will not leak memory or other resources. | |
618 To do the same using yacc or bison is much more difficult.</p> | |
619 | |
620 <h4>The <tt>%extra_argument</tt> directive</h4> | |
621 | |
622 The %extra_argument directive instructs Lemon to add a 4th parameter | |
623 to the parameter list of the Parse() function it generates. Lemon | |
624 doesn't do anything itself with this extra argument, but it does | |
625 make the argument available to C-code action routines, destructors, | |
626 and so forth. For example, if the grammar file contains:</p> | |
627 | |
628 <p><pre> | |
629 %extra_argument { MyStruct *pAbc } | |
630 </pre></p> | |
631 | |
632 <p>Then the Parse() function generated will have an 4th parameter | |
633 of type ``MyStruct*'' and all action routines will have access to | |
634 a variable named ``pAbc'' that is the value of the 4th parameter | |
635 in the most recent call to Parse().</p> | |
636 | |
637 <h4>The <tt>%include</tt> directive</h4> | |
638 | |
639 <p>The %include directive specifies C code that is included at the | |
640 top of the generated parser. You can include any text you want -- | |
641 the Lemon parser generator copies it blindly. If you have multiple | |
642 %include directives in your grammar file the value of the last | |
643 %include directive overwrites all the others.</p. | |
644 | |
645 <p>The %include directive is very handy for getting some extra #include | |
646 preprocessor statements at the beginning of the generated parser. | |
647 For example:</p> | |
648 | |
649 <p><pre> | |
650 %include {#include <unistd.h>} | |
651 </pre></p> | |
652 | |
653 <p>This might be needed, for example, if some of the C actions in the | |
654 grammar call functions that are prototyed in unistd.h.</p> | |
655 | |
656 <h4>The <tt>%left</tt> directive</h4> | |
657 | |
658 The %left directive is used (along with the %right and | |
659 %nonassoc directives) to declare precedences of terminal | |
660 symbols. Every terminal symbol whose name appears after | |
661 a %left directive but before the next period (``.'') is | |
662 given the same left-associative precedence value. Subsequent | |
663 %left directives have higher precedence. For example:</p> | |
664 | |
665 <p><pre> | |
666 %left AND. | |
667 %left OR. | |
668 %nonassoc EQ NE GT GE LT LE. | |
669 %left PLUS MINUS. | |
670 %left TIMES DIVIDE MOD. | |
671 %right EXP NOT. | |
672 </pre></p> | |
673 | |
674 <p>Note the period that terminates each %left, %right or %nonassoc | |
675 directive.</p> | |
676 | |
677 <p>LALR(1) grammars can get into a situation where they require | |
678 a large amount of stack space if you make heavy use or right-associative | |
679 operators. For this reason, it is recommended that you use %left | |
680 rather than %right whenever possible.</p> | |
681 | |
682 <h4>The <tt>%name</tt> directive</h4> | |
683 | |
684 <p>By default, the functions generated by Lemon all begin with the | |
685 five-character string ``Parse''. You can change this string to something | |
686 different using the %name directive. For instance:</p> | |
687 | |
688 <p><pre> | |
689 %name Abcde | |
690 </pre></p> | |
691 | |
692 <p>Putting this directive in the grammar file will cause Lemon to generate | |
693 functions named | |
694 <ul> | |
695 <li> AbcdeAlloc(), | |
696 <li> AbcdeFree(), | |
697 <li> AbcdeTrace(), and | |
698 <li> Abcde(). | |
699 </ul> | |
700 The %name directive allows you to generator two or more different | |
701 parsers and link them all into the same executable. | |
702 </p> | |
703 | |
704 <h4>The <tt>%nonassoc</tt> directive</h4> | |
705 | |
706 <p>This directive is used to assign non-associative precedence to | |
707 one or more terminal symbols. See the section on precedence rules | |
708 or on the %left directive for additional information.</p> | |
709 | |
710 <h4>The <tt>%parse_accept</tt> directive</h4> | |
711 | |
712 <p>The %parse_accept directive specifies a block of C code that is | |
713 executed whenever the parser accepts its input string. To ``accept'' | |
714 an input string means that the parser was able to process all tokens | |
715 without error.</p> | |
716 | |
717 <p>For example:</p> | |
718 | |
719 <p><pre> | |
720 %parse_accept { | |
721 printf("parsing complete!\n"); | |
722 } | |
723 </pre></p> | |
724 | |
725 | |
726 <h4>The <tt>%parse_failure</tt> directive</h4> | |
727 | |
728 <p>The %parse_failure directive specifies a block of C code that | |
729 is executed whenever the parser fails complete. This code is not | |
730 executed until the parser has tried and failed to resolve an input | |
731 error using is usual error recovery strategy. The routine is | |
732 only invoked when parsing is unable to continue.</p> | |
733 | |
734 <p><pre> | |
735 %parse_failure { | |
736 fprintf(stderr,"Giving up. Parser is hopelessly lost...\n"); | |
737 } | |
738 </pre></p> | |
739 | |
740 <h4>The <tt>%right</tt> directive</h4> | |
741 | |
742 <p>This directive is used to assign right-associative precedence to | |
743 one or more terminal symbols. See the section on precedence rules | |
744 or on the %left directive for additional information.</p> | |
745 | |
746 <h4>The <tt>%stack_overflow</tt> directive</h4> | |
747 | |
748 <p>The %stack_overflow directive specifies a block of C code that | |
749 is executed if the parser's internal stack ever overflows. Typically | |
750 this just prints an error message. After a stack overflow, the parser | |
751 will be unable to continue and must be reset.</p> | |
752 | |
753 <p><pre> | |
754 %stack_overflow { | |
755 fprintf(stderr,"Giving up. Parser stack overflow\n"); | |
756 } | |
757 </pre></p> | |
758 | |
759 <p>You can help prevent parser stack overflows by avoiding the use | |
760 of right recursion and right-precedence operators in your grammar. | |
761 Use left recursion and and left-precedence operators instead, to | |
762 encourage rules to reduce sooner and keep the stack size down. | |
763 For example, do rules like this: | |
764 <pre> | |
765 list ::= list element. // left-recursion. Good! | |
766 list ::= . | |
767 </pre> | |
768 Not like this: | |
769 <pre> | |
770 list ::= element list. // right-recursion. Bad! | |
771 list ::= . | |
772 </pre> | |
773 | |
774 <h4>The <tt>%stack_size</tt> directive</h4> | |
775 | |
776 <p>If stack overflow is a problem and you can't resolve the trouble | |
777 by using left-recursion, then you might want to increase the size | |
778 of the parser's stack using this directive. Put an positive integer | |
779 after the %stack_size directive and Lemon will generate a parse | |
780 with a stack of the requested size. The default value is 100.</p> | |
781 | |
782 <p><pre> | |
783 %stack_size 2000 | |
784 </pre></p> | |
785 | |
786 <h4>The <tt>%start_symbol</tt> directive</h4> | |
787 | |
788 <p>By default, the start-symbol for the grammar that Lemon generates | |
789 is the first non-terminal that appears in the grammar file. But you | |
790 can choose a different start-symbol using the %start_symbol directive.</p> | |
791 | |
792 <p><pre> | |
793 %start_symbol prog | |
794 </pre></p> | |
795 | |
796 <h4>The <tt>%token_destructor</tt> directive</h4> | |
797 | |
798 <p>The %destructor directive assigns a destructor to a non-terminal | |
799 symbol. (See the description of the %destructor directive above.) | |
800 This directive does the same thing for all terminal symbols.</p> | |
801 | |
802 <p>Unlike non-terminal symbols which may each have a different data type | |
803 for their values, terminals all use the same data type (defined by | |
804 the %token_type directive) and so they use a common destructor. Other | |
805 than that, the token destructor works just like the non-terminal | |
806 destructors.</p> | |
807 | |
808 <h4>The <tt>%token_prefix</tt> directive</h4> | |
809 | |
810 <p>Lemon generates #defines that assign small integer constants | |
811 to each terminal symbol in the grammar. If desired, Lemon will | |
812 add a prefix specified by this directive | |
813 to each of the #defines it generates. | |
814 So if the default output of Lemon looked like this: | |
815 <pre> | |
816 #define AND 1 | |
817 #define MINUS 2 | |
818 #define OR 3 | |
819 #define PLUS 4 | |
820 </pre> | |
821 You can insert a statement into the grammar like this: | |
822 <pre> | |
823 %token_prefix TOKEN_ | |
824 </pre> | |
825 to cause Lemon to produce these symbols instead: | |
826 <pre> | |
827 #define TOKEN_AND 1 | |
828 #define TOKEN_MINUS 2 | |
829 #define TOKEN_OR 3 | |
830 #define TOKEN_PLUS 4 | |
831 </pre> | |
832 | |
833 <h4>The <tt>%token_type</tt> and <tt>%type</tt> directives</h4> | |
834 | |
835 <p>These directives are used to specify the data types for values | |
836 on the parser's stack associated with terminal and non-terminal | |
837 symbols. The values of all terminal symbols must be of the same | |
838 type. This turns out to be the same data type as the 3rd parameter | |
839 to the Parse() function generated by Lemon. Typically, you will | |
840 make the value of a terminal symbol by a pointer to some kind of | |
841 token structure. Like this:</p> | |
842 | |
843 <p><pre> | |
844 %token_type {Token*} | |
845 </pre></p> | |
846 | |
847 <p>If the data type of terminals is not specified, the default value | |
848 is ``int''.</p> | |
849 | |
850 <p>Non-terminal symbols can each have their own data types. Typically | |
851 the data type of a non-terminal is a pointer to the root of a parse-tree | |
852 structure that contains all information about that non-terminal. | |
853 For example:</p> | |
854 | |
855 <p><pre> | |
856 %type expr {Expr*} | |
857 </pre></p> | |
858 | |
859 <p>Each entry on the parser's stack is actually a union containing | |
860 instances of all data types for every non-terminal and terminal symbol. | |
861 Lemon will automatically use the correct element of this union depending | |
862 on what the corresponding non-terminal or terminal symbol is. But | |
863 the grammar designer should keep in mind that the size of the union | |
864 will be the size of its largest element. So if you have a single | |
865 non-terminal whose data type requires 1K of storage, then your 100 | |
866 entry parser stack will require 100K of heap space. If you are willing | |
867 and able to pay that price, fine. You just need to know.</p> | |
868 | |
869 <h3>Error Processing</h3> | |
870 | |
871 <p>After extensive experimentation over several years, it has been | |
872 discovered that the error recovery strategy used by yacc is about | |
873 as good as it gets. And so that is what Lemon uses.</p> | |
874 | |
875 <p>When a Lemon-generated parser encounters a syntax error, it | |
876 first invokes the code specified by the %syntax_error directive, if | |
877 any. It then enters its error recovery strategy. The error recovery | |
878 strategy is to begin popping the parsers stack until it enters a | |
879 state where it is permitted to shift a special non-terminal symbol | |
880 named ``error''. It then shifts this non-terminal and continues | |
881 parsing. But the %syntax_error routine will not be called again | |
882 until at least three new tokens have been successfully shifted.</p> | |
883 | |
884 <p>If the parser pops its stack until the stack is empty, and it still | |
885 is unable to shift the error symbol, then the %parse_failed routine | |
886 is invoked and the parser resets itself to its start state, ready | |
887 to begin parsing a new file. This is what will happen at the very | |
888 first syntax error, of course, if there are no instances of the | |
889 ``error'' non-terminal in your grammar.</p> | |
890 | |
891 </body> | |
892 </html> | |
OLD | NEW |