.\" Copyright (c) 2007 by Ian Piumarta .\" All rights reserved. .\" .\" Permission is hereby granted, free of charge, to any person obtaining a .\" copy of this software and associated documentation files (the 'Software'), .\" to deal in the Software without restriction, including without limitation .\" the rights to use, copy, modify, merge, publish, distribute, and/or sell .\" copies of the Software, and to permit persons to whom the Software is .\" furnished to do so, provided that the above copyright notice(s) and this .\" permission notice appear in all copies of the Software. Acknowledgement .\" of the use of this Software in supporting documentation would be .\" appreciated but is not required. .\" .\" THE SOFTWARE IS PROVIDED 'AS IS'. USE ENTIRELY AT YOUR OWN RISK. .\" .\" Last edited: 2012-04-29 16:58:44 by piumarta on emilia .\" .TH PEG 1 "April 2012" "Version 0.1" .SH NAME peg, leg \- parser generators .SH SYNOPSIS .B peg .B [\-hvV \-ooutput] .I [filename ...] .sp 0 .B leg .B [\-hvV \-ooutput] .I [filename ...] .SH DESCRIPTION .I peg and .I leg are tools for generating recursive-descent parsers: programs that perform pattern matching on text. They process a Parsing Expression Grammar (PEG) [Ford 2004] to produce a program that recognises legal sentences of that grammar. .I peg processes PEGs written using the original syntax described by Ford; .I leg processes PEGs written using slightly different syntax and conventions that are intended to make it an attractive replacement for parsers built with .IR lex (1) and .IR yacc (1). Unlike .I lex and .IR yacc , .I peg and .I leg support unlimited backtracking, provide ordered choice as a means for disambiguation, and can combine scanning (lexical analysis) and parsing (syntactic analysis) into a single activity. .PP .I peg reads the specified .IR filename s, or standard input if no .IR filename s are given, for a grammar describing the parser to generate. .I peg then generates a C source file that defines a function .IR yyparse(). This C source file can be included in, or compiled and then linked with, a client program. Each time the client program calls .IR yyparse () the parser consumes input text according to the parsing rules, starting from the first rule in the grammar. .IR yyparse () returns non-zero if the input could be parsed according to the grammar; it returns zero if the input could not be parsed. .PP The prefix 'yy' or 'YY' is prepended to all externally-visible symbols in the generated parser. This is intended to reduce the risk of namespace pollution in client programs. (The choice of 'yy' is historical; see .IR lex (1) and .IR yacc (1), for example.) .SH OPTIONS .I peg and .I leg provide the following options: .TP .B \-h prints a summary of available options and then exits. .TP .B \-ooutput writes the generated parser to the file .B output instead of the standard output. .TP .B \-v writes verbose information to standard error while working. .TP .B \-V writes version information to standard error then exits. .SH A SIMPLE EXAMPLE The following .I peg input specifies a grammar with a single rule (called 'start') that is satisfied when the input contains the string "username". .nf start <- "username" .fi (The quotation marks are .I not part of the matched text; they serve to indicate a literal string to be matched.) In other words, .IR yyparse () in the generated C source will return non-zero only if the next eight characters read from the input spell the word "username". If the input contains anything else, .IR yyparse () returns zero and no input will have been consumed. (Subsequent calls to .IR yyparse () will also return zero, since the parser is effectively blocked looking for the string "username".) To ensure progress we can add an alternative clause to the 'start' rule that will match any single character if "username" is not found. .nf start <- "username" / . .fi .IR yyparse () now always returns non-zero (except at the very end of the input). To do something useful we can add actions to the rules. These actions are performed after a complete match is found (starting from the first rule) and are chosen according to the 'path' taken through the grammar to match the input. (Linguists would call this path a 'phrase marker'.) .nf start <- "username" { printf("%s\\n", getlogin()); } / < . > { putchar(yytext[0]); } .fi The first line instructs the parser to print the user's login name whenever it sees "username" in the input. If that match fails, the second line tells the parser to echo the next character on the input the standard output. Our parser is now performing useful work: it will copy the input to the output, replacing all occurrences of "username" with the user's account name. .PP Note the angle brackets ('<' and '>') that were added to the second alternative. These have no effect on the meaning of the rule, but serve to delimit the text made available to the following action in the variable .IR yytext . .PP If the above grammar is placed in the file .BR username.peg , running the command .nf peg -o username.c username.peg .fi will save the corresponding parser in the file .BR username.c . To create a complete program this parser could be included by a C program as follows. .nf #include /* printf(), putchar() */ #include /* getlogin() */ #include "username.c" /* yyparse() */ int main() { while (yyparse()) /* repeat until EOF */ ; return 0; } .fi .SH PEG GRAMMARS A grammar consists of a set of named rules. .nf name <- pattern .fi The .B pattern contains one or more of the following elements. .TP .B name The element stands for the entire pattern in the rule with the given .BR name . .TP .BR \(dq characters \(dq A character or string enclosed in double quotes is matched literally. The ANSI C esacpe sequences are recognised within the .IR characters . .TP .BR ' characters ' A character or string enclosed in single quotes is matched literally, as above. .TP .BR [ characters ] A set of characters enclosed in square brackets matches any single character from the set, with escape characters recognised as above. If the set begins with an uparrow (^) then the set is negated (the element matches any character .I not in the set). Any pair of characters separated with a dash (-) represents the range of characters from the first to the second, inclusive. A single alphabetic character or underscore is matched by the following set. .nf [a-zA-Z_] .fi Similarly, the following matches any single non-digit character. .nf [^0-9] .fi .TP .B . A dot matches any character. Note that the only time this fails is at the end of file, where there is no character to match. .TP .BR ( \ pattern\ ) Parentheses are used for grouping (modifying the precendence of the operators described below). .TP .BR { \ action\ } Curly braces surround actions. The action is arbitray C source code to be executed at the end of matching. Any braces within the action must be properly nested. Any input text that was matched before the action and delimited by angle brackets (see below) is made available within the action as the contents of the character array .IR yytext . The length of (number of characters in) .I yytext is available in the variable .IR yyleng . (These variable names are historical; see .IR lex (1).) .TP .B < An opening angle bracket always matches (consuming no input) and causes the parser to begin accumulating matched text. This text will be made available to actions in the variable .IR yytext . .TP .B > A closing angle bracket always matches (consuming no input) and causes the parser to stop accumulating text for .IR yytext . .PP The above .IR element s can be made optional and/or repeatable with the following suffixes: .TP .RB element\ ? The element is optional. If present on the input, it is consumed and the match succeeds. If not present on the input, no text is consumed and the match succeeds anyway. .TP .RB element\ + The element is repeatable. If present on the input, one or more occurrences of .I element are consumed and the match succeeds. If no occurrences of .I element are present on the input, the match fails. .TP .RB element\ * The element is optional and repeatable. If present on the input, one or more occurrences of .I element are consumed and the match succeeds. If no occurrences of .I element are present on the input, the match succeeds anyway. .PP The above elements and suffixes can be converted into predicates (that match arbitray input text and subsequently succeed or fail .I without consuming that input) with the following prefixes: .TP .BR & \ element The predicate succeeds only if .I element can be matched. Input text scanned while matching .I element is not consumed from the input and remains available for subsequent matching. .TP .BR ! \ element The predicate succeeds only if .I element cannot be matched. Input text scanned while matching .I element is not consumed from the input and remains available for subsequent matching. A popular idiom is .nf !. .fi which matches the end of file, after the last character of the input has already been consumed. .PP A special form of the '&' predicate is provided: .TP .BR & {\ expression\ } In this predicate the simple C .I expression .RB ( not statement) is evaluated immediately when the parser reaches the predicate. If the .I expression yields non-zero (true) the 'match' succeeds and the parser continues with the next element in the pattern. If the .I expression yields zero (false) the 'match' fails and the parser backs up to look for an alternative parse of the input. .PP Several elements (with or without prefixes and suffixes) can be combined into a .I sequence by writing them one after the other. The entire sequence matches only if each individual element within it matches, from left to right. .PP Sequences can be separated into disjoint alternatives by the alternation operator '/'. .TP .RB sequence-1\ / \ sequence-2\ / \ ...\ / \ sequence-N Each sequence is tried in turn until one of them matches, at which time matching for the overall pattern succeeds. If none of the sequences matches then the match of the overall pattern fails. .PP Finally, the pound sign (#) introduces a comment (discarded) that continues until the end of the line. .PP To summarise the above, the parser tries to match the input text against a pattern containing literals, names (representing other rules), and various operators (written as prefixes, suffixes, juxtaposition for sequencing and and infix alternation operator) that modify how the elements within the pattern are matched. Matches are made from left to right, 'descending' into named sub-rules as they are encountered. If the matching process fails, the parser 'back tracks' ('rewinding' the input appropriately in the process) to find the nearest alternative 'path' through the grammar. In other words the parser performs a depth-first, left-to-right search for the first successfully-matching path through the rules. If found, the actions along the successful path are executed (in the order they were encountered). .PP Note that predicates are evaluated .I immediately during the search for a successful match, since they contribute to the success or failure of the search. Actions, however, are evaluated only after a successful match has been found. .SH PEG GRAMMAR FOR PEG GRAMMARS The grammar for .I peg grammars is shown below. This will both illustrate and formalise the above description. .nf Grammar <- Spacing Definition+ EndOfFile Definition <- Identifier LEFTARROW Expression Expression <- Sequence ( SLASH Sequence )* Sequence <- Prefix* Prefix <- AND Action / ( AND | NOT )? Suffix Suffix <- Primary ( QUERY / STAR / PLUS )? Primary <- Identifier !LEFTARROW / OPEN Expression CLOSE / Literal / Class / DOT / Action / BEGIN / END Identifier <- < IdentStart IdentCont* > Spacing IdentStart <- [a-zA-Z_] IdentCont <- IdentStart / [0-9] Literal <- ['] < ( !['] Char )* > ['] Spacing / ["] < ( !["] Char )* > ["] Spacing Class <- '[' < ( !']' Range )* > ']' Spacing Range <- Char '-' Char / Char Char <- '\\\\' [abefnrtv'"\\[\\]\\\\] / '\\\\' [0-3][0-7][0-7] / '\\\\' [0-7][0-7]? / '\\\\' '-' / !'\\\\' . LEFTARROW <- '<-' Spacing SLASH <- '/' Spacing AND <- '&' Spacing NOT <- '!' Spacing QUERY <- '?' Spacing STAR <- '*' Spacing PLUS <- '+' Spacing OPEN <- '(' Spacing CLOSE <- ')' Spacing DOT <- '.' Spacing Spacing <- ( Space / Comment )* Comment <- '#' ( !EndOfLine . )* EndOfLine Space <- ' ' / '\\t' / EndOfLine EndOfLine <- '\\r\\n' / '\\n' / '\\r' EndOfFile <- !. Action <- '{' < [^}]* > '}' Spacing BEGIN <- '<' Spacing END <- '>' Spacing .fi .SH LEG GRAMMARS .I leg is a variant of .I peg that adds some features of .IR lex (1) and .IR yacc (1). It differs from .I peg in the following ways. .TP .BI %{\ text... \ %} A declaration section can appear anywhere that a rule definition is expected. The .I text between the delimiters '%{' and '%}' is copied verbatim to the generated C parser code .I before the code that implements the parser itself. .TP .IB name\ = \ pattern The 'assignment' operator replaces the left arrow operator '<-'. .TP .B rule-name Hyphens can appear as letters in the names of rules. Each hyphen is converted into an underscore in the generated C source code. A single single hyphen '-' is a legal rule name. .nf - = [ \\t\\n\\r]* number = [0-9]+ - name = [a-zA-Z_][a-zA_Z_0-9]* - l-paren = '(' - r-paren = ')' - .fi This example shows how ignored whitespace can be obvious when reading the grammar and yet unobtrusive when placed liberally at the end of every rule associated with a lexical element. .TP .IB seq-1\ | \ seq-2 The alternation operator is vertical bar '|' rather than forward slash '/'. The .I peg rule .nf name <- sequence-1 / sequence-2 / sequence-3 .fi is therefore written .nf name = sequence-1 | sequence-2 | sequence-3 ; .fi in .I leg (with the final semicolon being optional, as described next). .TP .IB pattern\ ; A semicolon punctuator can optionally terminate a .IR pattern . .TP .BI %% \ text... A double percent '%%' terminates the rules (and declarations) section of the grammar. All .I text following '%%' is copied verbatim to the generated C parser code .I after the parser implementation code. .TP .BI $$\ = \ value A sub-rule can return a semantic .I value from an action by assigning it to the pseudo-variable '$$'. All semantic values must have the same type (which defaults to 'int'). This type can be changed by defining YYSTYPE in a declaration section. .TP .IB identifier : name The semantic value returned (by assigning to '$$') from the sub-rule .I name is associated with the .I identifier and can be referred to in subsequent actions. .PP The desk calclator example below illustrates the use of '$$' and ':'. .SH LEG EXAMPLE: A DESK CALCULATOR The extensions in .I leg described above allow useful parsers and evaluators (including declarations, grammar rules, and supporting C functions such as 'main') to be kept within a single source file. To illustrate this we show a simple desk calculator supporting the four common arithmetic operators and named variables. The intermediate results of arithmetic evaluation will be accumulated on an implicit stack by returning them as semantic values from sub-rules. .nf %{ #include /* printf() */ #include /* atoi() */ int vars[26]; %} Stmt = - e:Expr EOL { printf("%d\\n", e); } | ( !EOL . )* EOL { printf("error\\n"); } Expr = i:ID ASSIGN s:Sum { $$ = vars[i] = s; } | s:Sum { $$ = s; } Sum = l:Product ( PLUS r:Product { l += r; } | MINUS r:Product { l -= r; } )* { $$ = l; } Product = l:Value ( TIMES r:Value { l *= r; } | DIVIDE r:Value { l /= r; } )* { $$ = l; } Value = i:NUMBER { $$ = atoi(yytext); } | i:ID !ASSIGN { $$ = vars[i]; } | OPEN i:Expr CLOSE { $$ = i; } NUMBER = < [0-9]+ > - { $$ = atoi(yytext); } ID = < [a-z] > - { $$ = yytext[0] - 'a'; } ASSIGN = '=' - PLUS = '+' - MINUS = '-' - TIMES = '*' - DIVIDE = '/' - OPEN = '(' - CLOSE = ')' - - = [ \\t]* EOL = '\\n' | '\\r\\n' | '\\r' | ';' %% int main() { while (yyparse()) ; return 0; } .fi .SH LEG GRAMMAR FOR LEG GRAMMARS The grammar for .I leg grammars is shown below. This will both illustrate and formalise the above description. .nf grammar = - ( declaration | definition )+ trailer? end-of-file declaration = '%{' < ( !'%}' . )* > RPERCENT trailer = '%%' < .* > definition = identifier EQUAL expression SEMICOLON? expression = sequence ( BAR sequence )* sequence = prefix+ prefix = AND action | ( AND | NOT )? suffix suffix = primary ( QUERY | STAR | PLUS )? primary = identifier COLON identifier !EQUAL | identifier !EQUAL | OPEN expression CLOSE | literal | class | DOT | action | BEGIN | END identifier = < [-a-zA-Z_][-a-zA-Z_0-9]* > - literal = ['] < ( !['] char )* > ['] - | ["] < ( !["] char )* > ["] - class = '[' < ( !']' range )* > ']' - range = char '-' char | char char = '\\\\' [abefnrtv'"\\[\\]\\\\] | '\\\\' [0-3][0-7][0-7] | '\\\\' [0-7][0-7]? | !'\\\\' . action = '{' < [^}]* > '}' - EQUAL = '=' - COLON = ':' - SEMICOLON = ';' - BAR = '|' - AND = '&' - NOT = '!' - QUERY = '?' - STAR = '*' - PLUS = '+' - OPEN = '(' - CLOSE = ')' - DOT = '.' - BEGIN = '<' - END = '>' - RPERCENT = '%}' - - = ( space | comment )* space = ' ' | '\\t' | end-of-line comment = '#' ( !end-of-line . )* end-of-line end-of-line = '\\r\\n' | '\\n' | '\\r' end-of-file = !. .fi .SH CUSTOMISING THE PARSER The following symbols can be redefined in declaration sections to modify the generated parser code. .TP .B YYSTYPE The semantic value type. The pseudo-variable '$$' and the identifiers 'bound' to rule results with the colon operator ':' should all be considered as being declared to have this type. The default value is 'int'. .TP .B YYPARSE The name of the main entry point to the parser. The default value is 'yyparse'. .TP .B YYPARSEFROM The name of an alternative entry point to the parser. This function expects one argument: the function corresponding to the rule from which the search for a match should begin. The default is 'yyparsefrom'. Note that yyparse() is defined as .nf int yyparse() { return yyparsefrom(yy_foo); } .fi where 'foo' is the name of the first rule in the grammar. .TP .BI YY_INPUT( buf , \ result , \ max_size ) This macro is invoked by the parser to obtain more input text. .I buf points to an area of memory that can hold at most .I max_size characters. The macro should copy input text to .I buf and then assign the integer variable .I result to indicate the number of characters copied. If no more input is available, the macro should assign 0 to .IR result . By default, the YY_INPUT macro is defined as follows. .nf #define YY_INPUT(buf, result, max_size) \\ { \\ int yyc= getchar(); \\ result= (EOF == yyc) ? 0 : (*(buf)= yyc, 1); \\ } .fi .TP .B YY_DEBUG If this symbols is defined then additional code will be included in the parser that prints vast quantities of arcane information to the standard error while the parser is running. .TP .B YY_BEGIN This macro is invoked to mark the start of input text that will be made available in actions as 'yytext'. This corresponds to occurrences of '<' in the grammar. These are converted into predicates that are expected to succeed. The default definition .nf #define YY_BEGIN (yybegin= yypos, 1) .fi therefore saves the current input position and returns 1 ('true') as the result of the predicate. .TP .B YY_END This macros corresponds to '>' in the grammar. Again, it is a predicate so the default definition saves the input position before 'succeeding'. .nf #define YY_END (yyend= yypos, 1) .fi .TP .BI YY_PARSE( T ) This macro declares the parser entry points (yyparse and yyparsefrom) to be of type .IR T . The default definition .nf #define YY_PARSE(T) T .fi leaves yyparse() and yyparsefrom() with global visibility. If they should not be externally visible in other source files, this macro can be redefined to declare them 'static'. .nf #define YY_PARSE(T) static T .fi .TP .BI YY_CTX_LOCAL If this symbol is defined during compilation of a generated parser then global parser state will be kept in a structure of type 'yycontext' which can be declared as a local variable. This allows multiple instances of parsers to coexist and to be thread-safe. The parsing function .IR yyparse () will be declared to expect a first argument of type 'yycontext *', an instance of the structure holding the global state for the parser. This instance must be allocated and initialised to zero by the client. A trivial but complete example is as follows. .nf #include #define YY_CTX_LOCAL #include "the-generated-parser.peg.c" int main() { yycontext ctx; memset(&ctx, 0, sizeof(yycontext)); while (yyparse(&ctx)); return 0; } .fi Note that if this symbol is undefined then the compiled parser will statically allocate its global state and will be neither reentrant nor thread-safe. .TP .BI YY_CTX_MEMBERS If YY_CTX_LOCAL is defined (see above) then the macro YY_CTX_MEMBERS can be defined to expand to any additional member field declarations that the client would like included in the declaration of the 'yycontext' structure type. These additional members are otherwise ignored by the generated parser. The instance of 'yycontext' associated with the currently-active parser is available in actions through the pointer variable .IR yyctx . .PP The following variables can be reffered to within actions. .TP .B char *yybuf This variable points to the parser's input buffer used to store input text that has not yet been matched. .TP .B int yypos This is the offset (in yybuf) of the next character to be matched and consumed. .TP .B char *yytext The most recent matched text delimited by '<' and '>' is stored in this variable. .TP .B int yyleng This variable indicates the number of characters in 'yytext'. .TP .B yycontext *yyctx This variable points to the instance of 'yycontext' associated with the currently-active parser. .SH DIAGNOSTICS .I peg and .I leg warn about the following conditions while converting a grammar into a parser. .TP .B syntax error The input grammar was malformed in some way. The error message will include the text about to be matched (often backed up a huge amount from the actual location of the error) and the line number of the most recently considered character (which is often the real location of the problem). .TP .B rule 'foo' used but not defined The grammar referred to a rule named 'foo' but no definition for it was given. Attempting to use the generated parser will likely result in errors from the linker due to undefined symbols associated with the missing rule. .TP .B rule 'foo' defined but not used The grammar defined a rule named 'foo' and then ignored it. The code associated with the rule is included in the generated parser which will in all other respects be healthy. .TP .B possible infinite left recursion in rule 'foo' There exists at least one path through the grammar that leads from the rule 'foo' back to (a recursive invocation of) the same rule without consuming any input. .PP Left recursion, especially that found in standards documents, is often 'direct' and implies trivial repetition. .nf # (6.7.6) direct-abstract-declarator = LPAREN abstract-declarator RPAREN | direct-abstract-declarator? LBRACKET assign-expr? RBRACKET | direct-abstract-declarator? LBRACKET STAR RBRACKET | direct-abstract-declarator? LPAREN param-type-list? RPAREN .fi The recursion can easily be eliminated by converting the parts of the pattern following the recursion into a repeatable suffix. .nf # (6.7.6) direct-abstract-declarator = direct-abstract-declarator-head? direct-abstract-declarator-tail* direct-abstract-declarator-head = LPAREN abstract-declarator RPAREN direct-abstract-declarator-tail = LBRACKET assign-expr? RBRACKET | LBRACKET STAR RBRACKET | LPAREN param-type-list? RPAREN .fi .SH BUGS The 'yy' and 'YY' prefixes cannot be changed. .PP Left recursion is detected in the input grammar but is not handled correctly in the generated parser. .PP Diagnostics for errors in the input grammar are obscure and not particularly helpful. .PP Several commonly-used .IR lex (1) features (yywrap(), yyin, etc.) are completely absent. .PP The generated parser foes not contain '#line' directives to direct C compiler errors back to the grammar description when appropriate. .IR lex (1) features (yywrap(), yyin, etc.) are completely absent. .SH SEE ALSO D. Val Schorre, .I META II, a syntax-oriented compiler writing language, 19th ACM National Conference, 1964, pp.\ 41.301--41.311. Describes a self-implementing parser generator for analytic grammars with no backtracking. .PP Alexander Birman, .I The TMG Recognition Schema, Ph.D. dissertation, Princeton, 1970. A mathematical treatment of the power and complexity of recursive-descent parsing with backtracking. .PP Bryan Ford, .I Parsing Expression Grammars: A Recognition-Based Syntactic Foundation, ACM SIGPLAN Symposium on Principles of Programming Languages, 2004. Defines PEGs and analyses them in relation to context-free and regular grammars. Introduces the syntax adopted in .IR peg . .PP The standard Unix utilies .IR lex (1) and .IR yacc (1) which influenced the syntax and features of .IR leg . .PP The source code for .I peg and .I leg whose grammar parsers are written using themselves. .PP The latest version of this software and documentation: .nf http://piumarta.com/software/peg .fi .SH AUTHOR .IR peg , .I leg and this manual page were written by Ian Piumarta (first-name at last-name dot com) while investigating the viablility of regular- and parsing-expression grammars for efficiently extracting type and signature information from C header files. .PP Please send bug reports and suggestions for improvements to the author at the above address.