.\" Copyright (c) 2007 by Ian Piumarta
.\" All rights reserved.
.\"
.\" Permission is hereby granted, free of charge, to any person obtaining a
.\" copy of this software and associated documentation files (the 'Software'),
.\" to deal in the Software without restriction, including without limitation
.\" the rights to use, copy, modify, merge, publish, distribute, and/or sell
.\" copies of the Software, and to permit persons to whom the Software is
.\" furnished to do so, provided that the above copyright notice(s) and this
.\" permission notice appear in all copies of the Software. Acknowledgement
.\" of the use of this Software in supporting documentation would be
.\" appreciated but is not required.
.\"
.\" THE SOFTWARE IS PROVIDED 'AS IS'. USE ENTIRELY AT YOUR OWN RISK.
.\"
.\" Last edited: 2012-04-29 16:58:44 by piumarta on emilia
.\"
.TH PEG 1 "April 2012" "Version 0.1"
.SH NAME
peg, leg \- parser generators
.SH SYNOPSIS
.B peg
.B [\-hvV \-ooutput]
.I [filename ...]
.sp 0
.B leg
.B [\-hvV \-ooutput]
.I [filename ...]
.SH DESCRIPTION
.I peg
and
.I leg
are tools for generating recursive-descent parsers: programs that
perform pattern matching on text. They process a Parsing Expression
Grammar (PEG) [Ford 2004] to produce a program that recognises legal
sentences of that grammar.
.I peg
processes PEGs written using the original syntax described by Ford;
.I leg
processes PEGs written using slightly different syntax and conventions
that are intended to make it an attractive replacement for parsers
built with
.IR lex (1)
and
.IR yacc (1).
Unlike
.I lex
and
.IR yacc ,
.I peg
and
.I leg
support unlimited backtracking, provide ordered choice as a means for
disambiguation, and can combine scanning (lexical analysis) and
parsing (syntactic analysis) into a single activity.
.PP
.I peg
reads the specified
.IR filename s,
or standard input if no
.IR filename s
are given, for a grammar describing the parser to generate.
.I peg
then generates a C source file that defines a function
.IR yyparse ().
This C source file can be included in, or compiled and then linked
with, a client program. Each time the client program calls
.IR yyparse ()
the parser consumes input text according to the parsing rules,
starting from the first rule in the grammar.
.IR yyparse ()
returns non-zero if the input could be parsed according to the
grammar; it returns zero if the input could not be parsed.
.PP
The prefix 'yy' or 'YY' is prepended to all externally-visible symbols
in the generated parser. This is intended to reduce the risk of
namespace pollution in client programs. (The choice of 'yy' is
historical; see
.IR lex (1)
and
.IR yacc (1),
for example.)
.SH OPTIONS
.I peg
and
.I leg
provide the following options:
.TP
.B \-h
prints a summary of available options and then exits.
.TP
.B \-ooutput
writes the generated parser to the file
.B output
instead of the standard output.
.TP
.B \-v
writes verbose information to standard error while working.
.TP
.B \-V
writes version information to standard error then exits.
.SH A SIMPLE EXAMPLE
The following
.I peg
input specifies a grammar with a single rule (called 'start') that is
satisfied when the input contains the string "username".
.nf

    start <- "username"

.fi
(The quotation marks are
.I not
part of the matched text; they serve to indicate a literal string to
be matched.) In other words,
.IR yyparse ()
in the generated C source will return non-zero only if the next eight
characters read from the input spell the word "username". If the
input contains anything else,
.IR yyparse ()
returns zero and no input will have been consumed. (Subsequent calls
to
.IR yyparse ()
will also return zero, since the parser is effectively blocked looking
for the string "username".) To ensure progress we can add an
alternative clause to the 'start' rule that will match any single
character if "username" is not found.
.nf

    start <- "username"
           / .

.fi
.IR yyparse ()
now always returns non-zero (except at the very end of the input). To
do something useful we can add actions to the rules. These actions
are performed after a complete match is found (starting from the first
rule) and are chosen according to the 'path' taken through the grammar
to match the input. (Linguists would call this path a 'phrase
marker'.)
.nf

    start <- "username"    { printf("%s\\n", getlogin()); }
           / < . >         { putchar(yytext[0]); }

.fi
The first line instructs the parser to print the user's login name
whenever it sees "username" in the input. If that match fails, the
second line tells the parser to echo the next character on the input
to the standard output. Our parser is now performing useful work: it
will copy the input to the output, replacing all occurrences of
"username" with the user's account name.
.PP
Note the angle brackets ('<' and '>') that were added to the second
alternative. These have no effect on the meaning of the rule, but
serve to delimit the text made available to the following action in
the variable
.IR yytext .
.PP
If the above grammar is placed in the file
.BR username.peg ,
running the command
.nf

    peg -o username.c username.peg

.fi
will save the corresponding parser in the file
.BR username.c .
To create a complete program this parser could be included by a C
program as follows.
.nf

    #include <stdio.h>      /* printf(), putchar() */
    #include <unistd.h>     /* getlogin() */

    #include "username.c"   /* yyparse() */

    int main()
    {
      while (yyparse())     /* repeat until EOF */
        ;
      return 0;
    }
.fi
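.PP
As a rough illustration of what the generated parser does for the
grammar above, the following hand-written C sketch applies the same
two alternatives, in the same order, to a string held in memory. It is
not the code that
.I peg
emits: the functions start_rule() and expand() and the 'login'
parameter are assumptions made for this example only.
.nf

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: one yyparse()-like step over an in-memory string.
 * Tries the alternatives of 'start' in order, appends the output of the
 * chosen action to 'out', and returns the number of input characters
 * consumed (0 means no alternative matched, i.e. end of input). */
static size_t start_rule(const char *in, const char *login, char *out)
{
    static const char word[] = "username";
    size_t n = sizeof word - 1;

    if (strncmp(in, word, n) == 0) {   /* first alternative: "username" */
        strcat(out, login);
        return n;
    }
    if (*in != 0) {                    /* second alternative: < . >     */
        strncat(out, in, 1);
        return 1;
    }
    return 0;                          /* both alternatives failed      */
}

/* Repeats the rule until it fails, in the manner of 'while (yyparse());'.
 * The caller must make 'out' large enough for the expanded text. */
static void expand(const char *in, const char *login, char *out)
{
    size_t used;

    out[0] = 0;
    while ((used = start_rule(in, login, out)) != 0)
        in += used;
}
```

.fi
With a large enough buffer, expand("hi username!", "alice", out)
leaves the text "hi alice!" in out. Unlike the grammar above, the
sketch does not append a newline after the login name.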
.SH PEG GRAMMARS
A grammar consists of a set of named rules.
.nf

    name <- pattern

.fi
The
.B pattern
contains one or more of the following elements.
.TP
.B name
The element stands for the entire pattern in the rule with the given
.BR name .
.TP
.BR \(dq characters \(dq
A character or string enclosed in double quotes is matched literally.
The ANSI C escape sequences are recognised within the
.IR characters .
.TP
.BR ' characters '
A character or string enclosed in single quotes is matched literally, as above.
.TP
.BR [ characters ]
A set of characters enclosed in square brackets matches any single
character from the set, with escape characters recognised as above.
If the set begins with an uparrow (^) then the set is negated (the
element matches any character
.I not
in the set). Any pair of characters separated with a dash (-)
represents the range of characters from the first to the second,
inclusive. A single alphabetic character or underscore is matched by
the following set.
.nf

    [a-zA-Z_]

.fi
Similarly, the following matches any single non-digit character.
.nf

    [^0-9]

.fi
.TP
.B .
A dot matches any character. Note that the only time this fails is at
the end of file, where there is no character to match.
.TP
.BR ( \ pattern\ )
Parentheses are used for grouping (modifying the precedence of the
operators described below).
.TP
.BR { \ action\ }
Curly braces surround actions. The action is arbitrary C source code
to be executed at the end of matching. Any braces within the action
must be properly nested. Any input text that was matched before the
action and delimited by angle brackets (see below) is made available
within the action as the contents of the character array
.IR yytext .
The length of (number of characters in)
.I yytext
is available in the variable
.IR yyleng .
(These variable names are historical; see
.IR lex (1).)
.TP
.B <
An opening angle bracket always matches (consuming no input) and
causes the parser to begin accumulating matched text. This text will
be made available to actions in the variable
.IR yytext .
.TP
.B >
A closing angle bracket always matches (consuming no input) and causes
the parser to stop accumulating text for
.IR yytext .
.PP
The above
.IR element s
can be made optional and/or repeatable with the following suffixes:
.TP
.RB element\ ?
The element is optional. If present on the input, it is consumed and
the match succeeds. If not present on the input, no text is consumed
and the match succeeds anyway.
.TP
.RB element\ +
The element is repeatable. If present on the input, one or more
occurrences of
.I element
are consumed and the match succeeds. If no occurrences of
.I element
are present on the input, the match fails.
.TP
.RB element\ *
The element is optional and repeatable. If present on the input, one or more
occurrences of
.I element
are consumed and the match succeeds. If no occurrences of
.I element
are present on the input, the match succeeds anyway.
.PP
The above elements and suffixes can be converted into predicates (that
match arbitrary input text and subsequently succeed or fail
.I without
consuming that input) with the following prefixes:
.TP
.BR & \ element
The predicate succeeds only if
.I element
can be matched. Input text scanned while matching
.I element
is not consumed from the input and remains available for subsequent
matching.
.TP
.BR ! \ element
The predicate succeeds only if
.I element
cannot be matched. Input text scanned while matching
.I element
is not consumed from the input and remains available for subsequent
matching. A popular idiom is
.nf

    !.

.fi
which matches the end of file, after the last character of the input
has already been consumed.
.PP
A special form of the '&' predicate is provided:
.TP
.BR & {\ expression\ }
In this predicate the simple C
.I expression
.RB ( not
statement) is evaluated immediately when the parser reaches the
predicate. If the
.I expression
yields non-zero (true) the 'match' succeeds and the parser continues
with the next element in the pattern. If the
.I expression
yields zero (false) the 'match' fails and the parser backs up to look
for an alternative parse of the input.
.PP
Several elements (with or without prefixes and suffixes) can be
combined into a
.I sequence
by writing them one after the other. The entire sequence matches only
if each individual element within it matches, from left to right.
.PP
Sequences can be separated into disjoint alternatives by the
alternation operator '/'.
.TP
.RB sequence-1\ / \ sequence-2\ / \ ...\ / \ sequence-N
Each sequence is tried in turn until one of them matches, at which
time matching for the overall pattern succeeds. If none of the
sequences matches then the match of the overall pattern fails.
.PP
Finally, the pound sign (#) introduces a comment (discarded) that
continues until the end of the line.
.PP
To summarise the above, the parser tries to match the input text
against a pattern containing literals, names (representing other
rules), and various operators (written as prefixes, suffixes,
juxtaposition for sequencing and an infix alternation operator) that
modify how the elements within the pattern are matched. Matches are
made from left to right, 'descending' into named sub-rules as they are
encountered. If the matching process fails, the parser 'back tracks'
('rewinding' the input appropriately in the process) to find the
nearest alternative 'path' through the grammar. In other words the
parser performs a depth-first, left-to-right search for the first
successfully-matching path through the rules. If found, the actions
along the successful path are executed (in the order they were
encountered).
.PP
Note that predicates are evaluated
.I immediately
during the search for a successful match, since they contribute to the
success or failure of the search. Actions, however, are evaluated
only after a successful match has been found.
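.PP
The matching strategy summarised above (sequences, ordered choice,
rewinding on failure, and predicates that consume nothing) can be
sketched by hand in C. This is only an illustration, not the generated
parser: the 'Input' structure, the helpers lit(), dot() and not_dot(),
and the rule 'word' are all hypothetical names chosen for the example.
.nf

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical matcher state: the input text and the current position. */
typedef struct { const char *text; size_t pos; } Input;

/* Literal element: consumes 's' if it is next on the input. */
static int lit(Input *in, const char *s)
{
    size_t n = strlen(s);
    if (strncmp(in->text + in->pos, s, n) != 0) return 0;
    in->pos += n;
    return 1;
}

/* Dot element: consumes any one character; fails only at end of input. */
static int dot(Input *in)
{
    if (in->text[in->pos] == 0) return 0;
    in->pos++;
    return 1;
}

/* The '!.' idiom: succeeds only at end of input and consumes nothing. */
static int not_dot(Input *in)
{
    size_t saved = in->pos;
    int matched = dot(in);
    in->pos = saved;            /* predicates never consume input */
    return !matched;
}

/* Sketch of the rule:  word <- "hel" "lo" !. / "hel" "p" !.
 * Each alternative is a sequence.  When a sequence fails part-way the
 * input is rewound before the next alternative is tried. */
static int word(Input *in)
{
    size_t saved = in->pos;

    if (lit(in, "hel") && lit(in, "lo") && not_dot(in)) return 1;
    in->pos = saved;            /* back track */
    if (lit(in, "hel") && lit(in, "p") && not_dot(in)) return 1;
    in->pos = saved;            /* overall failure consumes nothing */
    return 0;
}
```

.fi
Given the input "help", the first alternative consumes "hel" and then
fails on "lo", so the position is restored before the second
alternative succeeds. Given "helpful", both alternatives fail at the
final predicate and the position is left at zero.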
.SH PEG GRAMMAR FOR PEG GRAMMARS
The grammar for
.I peg
grammars is shown below. This will both illustrate and formalise
the above description.
.nf

    Grammar         <- Spacing Definition+ EndOfFile

    Definition      <- Identifier LEFTARROW Expression
    Expression      <- Sequence ( SLASH Sequence )*
    Sequence        <- Prefix*
    Prefix          <- AND Action
                     / ( AND / NOT )? Suffix
    Suffix          <- Primary ( QUERY / STAR / PLUS )?
    Primary         <- Identifier !LEFTARROW
                     / OPEN Expression CLOSE
                     / Literal
                     / Class
                     / DOT
                     / Action
                     / BEGIN
                     / END

    Identifier      <- < IdentStart IdentCont* > Spacing
    IdentStart      <- [a-zA-Z_]
    IdentCont       <- IdentStart / [0-9]
    Literal         <- ['] < ( !['] Char )* > ['] Spacing
                     / ["] < ( !["] Char )* > ["] Spacing
    Class           <- '[' < ( !']' Range )* > ']' Spacing
    Range           <- Char '-' Char / Char
    Char            <- '\\\\' [abefnrtv'"\\[\\]\\\\]
                     / '\\\\' [0-3][0-7][0-7]
                     / '\\\\' [0-7][0-7]?
                     / '\\\\' '-'
                     / !'\\\\' .
    LEFTARROW       <- '<-' Spacing
    SLASH           <- '/' Spacing
    AND             <- '&' Spacing
    NOT             <- '!' Spacing
    QUERY           <- '?' Spacing
    STAR            <- '*' Spacing
    PLUS            <- '+' Spacing
    OPEN            <- '(' Spacing
    CLOSE           <- ')' Spacing
    DOT             <- '.' Spacing
    Spacing         <- ( Space / Comment )*
    Comment         <- '#' ( !EndOfLine . )* EndOfLine
    Space           <- ' ' / '\\t' / EndOfLine
    EndOfLine       <- '\\r\\n' / '\\n' / '\\r'
    EndOfFile       <- !.
    Action          <- '{' < [^}]* > '}' Spacing
    BEGIN           <- '<' Spacing
    END             <- '>' Spacing

.fi
.SH LEG GRAMMARS
.I leg
is a variant of
.I peg
that adds some features of
.IR lex (1)
and
.IR yacc (1).
It differs from
.I peg
in the following ways.
.TP
.BI %{\ text... \ %}
A declaration section can appear anywhere that a rule definition is
expected. The
.I text
between the delimiters '%{' and '%}' is copied verbatim to the
generated C parser code
.I before
the code that implements the parser itself.
.TP
.IB name\ = \ pattern
The 'assignment' operator replaces the left arrow operator '<-'.
.TP
.B rule-name
Hyphens can appear as letters in the names of rules. Each hyphen is
converted into an underscore in the generated C source code. A
single hyphen '-' is a legal rule name.
.nf

    -       = [ \\t\\n\\r]*
    number  = [0-9]+                 -
    name    = [a-zA-Z_][a-zA-Z_0-9]* -
    l-paren = '('                    -
    r-paren = ')'                    -

.fi
This example shows how ignored whitespace can be obvious when reading
the grammar and yet unobtrusive when placed liberally at the end of
every rule associated with a lexical element.
.TP
.IB seq-1\ | \ seq-2
The alternation operator is vertical bar '|' rather than forward
slash '/'. The
.I peg
rule
.nf

    name <- sequence-1
          / sequence-2
          / sequence-3

.fi
is therefore written
.nf

    name = sequence-1
         | sequence-2
         | sequence-3
         ;

.fi
in
.I leg
(with the final semicolon being optional, as described next).
.TP
.IB pattern\ ;
A semicolon punctuator can optionally terminate a
.IR pattern .
.TP
.BI %% \ text...
A double percent '%%' terminates the rules (and declarations) section of
the grammar. All
.I text
following '%%' is copied verbatim to the generated C parser code
.I after
the parser implementation code.
.TP
.BI $$\ = \ value
A sub-rule can return a semantic
.I value
from an action by assigning it to the pseudo-variable '$$'. All
semantic values must have the same type (which defaults to 'int').
This type can be changed by defining YYSTYPE in a declaration section.
.TP
.IB identifier : name
The semantic value returned (by assigning to '$$') from the sub-rule
.I name
is associated with the
.I identifier
and can be referred to in subsequent actions.
.PP
The desk calculator example below illustrates the use of '$$' and ':'.
.SH LEG EXAMPLE: A DESK CALCULATOR
The extensions in
.I leg
described above allow useful parsers and evaluators (including
declarations, grammar rules, and supporting C functions such
as 'main') to be kept within a single source file. To illustrate this
we show a simple desk calculator supporting the four common arithmetic
operators and named variables. The intermediate results of arithmetic
evaluation will be accumulated on an implicit stack by returning them
as semantic values from sub-rules.
.nf

    %{
    #include <stdio.h>     /* printf() */
    #include <stdlib.h>    /* atoi() */
    int vars[26];
    %}

    Stmt    = - e:Expr EOL                  { printf("%d\\n", e); }
            | ( !EOL . )* EOL               { printf("error\\n"); }

    Expr    = i:ID ASSIGN s:Sum             { $$ = vars[i] = s; }
            | s:Sum                         { $$ = s; }

    Sum     = l:Product
                    ( PLUS  r:Product       { l += r; }
                    | MINUS r:Product       { l -= r; }
                    )*                      { $$ = l; }

    Product = l:Value
                    ( TIMES  r:Value        { l *= r; }
                    | DIVIDE r:Value        { l /= r; }
                    )*                      { $$ = l; }

    Value   = i:NUMBER                      { $$ = atoi(yytext); }
            | i:ID !ASSIGN                  { $$ = vars[i]; }
            | OPEN i:Expr CLOSE             { $$ = i; }

    NUMBER  = < [0-9]+ >    -               { $$ = atoi(yytext); }
    ID      = < [a-z]  >    -               { $$ = yytext[0] - 'a'; }
    ASSIGN  = '='           -
    PLUS    = '+'           -
    MINUS   = '-'           -
    TIMES   = '*'           -
    DIVIDE  = '/'           -
    OPEN    = '('           -
    CLOSE   = ')'           -

    -       = [ \\t]*
    EOL     = '\\n' | '\\r\\n' | '\\r' | ';'

    %%

    int main()
    {
      while (yyparse())
        ;
      return 0;
    }

.fi
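.PP
To show what the semantic values in the Sum, Product and Value rules
compute, the following is a hand-written C sketch of the same three
rules. It is an assumption-laden illustration rather than the output of
.IR leg :
the functions value(), product(), sum() and evaluate() and the 'cursor'
variable are hypothetical, variables and assignment are omitted, there
is no check for division by zero, and values are computed while
matching instead of being deferred until the whole match has succeeded.
.nf

```c
#include <ctype.h>

static const char *cursor;      /* hypothetical: next unmatched character */

/* Counterpart of the '-' rule: skips spaces and tabs. */
static void spacing(void)
{
    while (isblank((unsigned char)*cursor)) cursor++;
}

static int sum(int *result);

/* Value = NUMBER | OPEN Expr CLOSE   (identifiers omitted) */
static int value(int *result)
{
    if (isdigit((unsigned char)*cursor)) {
        int n = 0;
        while (isdigit((unsigned char)*cursor))
            n = n * 10 + (*cursor++ - '0');
        spacing();
        *result = n;            /* plays the role of '$$' */
        return 1;
    }
    if (*cursor == '(') {
        const char *saved = cursor;
        cursor++; spacing();
        if (sum(result) && *cursor == ')') { cursor++; spacing(); return 1; }
        cursor = saved;         /* rewind on failure */
    }
    return 0;
}

/* Product = l:Value ( TIMES r:Value | DIVIDE r:Value )* */
static int product(int *result)
{
    int l, r;
    if (!value(&l)) return 0;
    for (;;) {
        const char *saved = cursor;
        char op = *cursor;
        if (op != '*' && op != '/') break;
        cursor++; spacing();
        if (!value(&r)) { cursor = saved; break; }
        if (op == '*') l *= r; else l /= r;
    }
    *result = l;
    return 1;
}

/* Sum = l:Product ( PLUS r:Product | MINUS r:Product )* */
static int sum(int *result)
{
    int l, r;
    if (!product(&l)) return 0;
    for (;;) {
        const char *saved = cursor;
        char op = *cursor;
        if (op != '+' && op != '-') break;
        cursor++; spacing();
        if (!product(&r)) { cursor = saved; break; }
        if (op == '+') l += r; else l -= r;
    }
    *result = l;
    return 1;
}

/* Returns 1 and stores the value if all of 'text' is one expression. */
static int evaluate(const char *text, int *result)
{
    cursor = text;
    spacing();
    return sum(result) && *cursor == 0;
}
```

.fi
For example, evaluate("2 + 3 * 4", &v) returns 1 and stores 14 in v,
while evaluate("2 +", &v) returns 0 because the trailing operator is
left unmatched.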
.SH LEG GRAMMAR FOR LEG GRAMMARS
The grammar for
.I leg
grammars is shown below. This will both illustrate and formalise the
above description.
.nf

    grammar     = -
                  ( declaration | definition )+
                  trailer? end-of-file

    declaration = '%{' < ( !'%}' . )* > RPERCENT

    trailer     = '%%' < .* >

    definition  = identifier EQUAL expression SEMICOLON?

    expression  = sequence ( BAR sequence )*

    sequence    = prefix+

    prefix      = AND action
                | ( AND | NOT )? suffix

    suffix      = primary ( QUERY | STAR | PLUS )?

    primary     = identifier COLON identifier !EQUAL
                | identifier !EQUAL
                | OPEN expression CLOSE
                | literal
                | class
                | DOT
                | action
                | BEGIN
                | END

    identifier  = < [-a-zA-Z_][-a-zA-Z_0-9]* > -

    literal     = ['] < ( !['] char )* > ['] -
                | ["] < ( !["] char )* > ["] -

    class       = '[' < ( !']' range )* > ']' -

    range       = char '-' char | char

    char        = '\\\\' [abefnrtv'"\\[\\]\\\\]
                | '\\\\' [0-3][0-7][0-7]
                | '\\\\' [0-7][0-7]?
                | !'\\\\' .

    action      = '{' < [^}]* > '}' -

    EQUAL       = '=' -
    COLON       = ':' -
    SEMICOLON   = ';' -
    BAR         = '|' -
    AND         = '&' -
    NOT         = '!' -
    QUERY       = '?' -
    STAR        = '*' -
    PLUS        = '+' -
    OPEN        = '(' -
    CLOSE       = ')' -
    DOT         = '.' -
    BEGIN       = '<' -
    END         = '>' -
    RPERCENT    = '%}' -

    -           = ( space | comment )*
    space       = ' ' | '\\t' | end-of-line
    comment     = '#' ( !end-of-line . )* end-of-line
    end-of-line = '\\r\\n' | '\\n' | '\\r'
    end-of-file = !.

.fi
.SH CUSTOMISING THE PARSER
The following symbols can be redefined in declaration sections to
modify the generated parser code.
.TP
.B YYSTYPE
The semantic value type. The pseudo-variable '$$' and the
identifiers 'bound' to rule results with the colon operator ':' should
all be considered as being declared to have this type. The default
value is 'int'.
.TP
.B YYPARSE
The name of the main entry point to the parser. The default value
is 'yyparse'.
.TP
.B YYPARSEFROM
The name of an alternative entry point to the parser. This function
expects one argument: the function corresponding to the rule from
which the search for a match should begin. The default
is 'yyparsefrom'. Note that yyparse() is defined as
.nf

    int yyparse() { return yyparsefrom(yy_foo); }

.fi
where 'foo' is the name of the first rule in the grammar.
.TP
.BI YY_INPUT( buf , \ result , \ max_size )
This macro is invoked by the parser to obtain more input text.
.I buf
points to an area of memory that can hold at most
.I max_size
characters. The macro should copy input text to
.I buf
and then assign the integer variable
.I result
to indicate the number of characters copied. If no more input is available,
the macro should assign 0 to
.IR result .
By default, the YY_INPUT macro is defined as follows.
.nf

    #define YY_INPUT(buf, result, max_size)        \\
    {                                              \\
      int yyc= getchar();                          \\
      result= (EOF == yyc) ? 0 : (*(buf)= yyc, 1); \\
    }

.fi
.TP
.B YY_DEBUG
If this symbol is defined then additional code will be included in
the parser that prints vast quantities of arcane information to the
standard error while the parser is running.
.TP
.B YY_BEGIN
This macro is invoked to mark the start of input text that will be
made available in actions as 'yytext'. This corresponds to
occurrences of '<' in the grammar. These are converted into
predicates that are expected to succeed. The default definition
.nf

    #define YY_BEGIN (yybegin= yypos, 1)

.fi
therefore saves the current input position and returns 1 ('true') as
the result of the predicate.
.TP
.B YY_END
This macro corresponds to '>' in the grammar. Again, it is a
predicate so the default definition saves the input position
before 'succeeding'.
.nf

    #define YY_END (yyend= yypos, 1)

.fi
.TP
.BI YY_PARSE( T )
This macro declares the parser entry points (yyparse and yyparsefrom)
to be of type
.IR T .
The default definition
.nf

    #define YY_PARSE(T) T

.fi
leaves yyparse() and yyparsefrom() with global visibility. If they
should not be externally visible in other source files, this macro can
be redefined to declare them 'static'.
.nf

    #define YY_PARSE(T) static T

.fi
.TP
.BI YY_CTX_LOCAL
If this symbol is defined during compilation of a generated parser
then global parser state will be kept in a structure of
type 'yycontext' which can be declared as a local variable. This
allows multiple instances of parsers to coexist and to be thread-safe.
The parsing function
.IR yyparse ()
will be declared to expect a first argument of type 'yycontext *', an
instance of the structure holding the global state for the parser.
This instance must be allocated and initialised to zero by the client.
A trivial but complete example is as follows.
.nf

    #include <stdio.h>
    #include <string.h>    /* memset() */

    #define YY_CTX_LOCAL

    #include "the-generated-parser.peg.c"

    int main()
    {
      yycontext ctx;
      memset(&ctx, 0, sizeof(yycontext));
      while (yyparse(&ctx));
      return 0;
    }

.fi
Note that if this symbol is undefined then the compiled parser will
statically allocate its global state and will be neither reentrant nor
thread-safe.
.TP
.BI YY_CTX_MEMBERS
If YY_CTX_LOCAL is defined (see above) then the macro YY_CTX_MEMBERS
can be defined to expand to any additional member field declarations
that the client would like included in the declaration of
the 'yycontext' structure type. These additional members are
otherwise ignored by the generated parser. The instance
of 'yycontext' associated with the currently-active parser is
available in actions through the pointer variable
.IR yyctx .
.PP
The following variables can be referred to within actions.
.TP
.B char *yybuf
This variable points to the parser's input buffer used to store input
text that has not yet been matched.
.TP
.B int yypos
This is the offset (in yybuf) of the next character to be matched and
consumed.
.TP
.B char *yytext
The most recent matched text delimited by '<' and '>' is stored in this variable.
.TP
.B int yyleng
This variable indicates the number of characters in 'yytext'.
.TP
.B yycontext *yyctx
This variable points to the instance of 'yycontext' associated with
the currently-active parser.
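.PP
As one possible use of YY_INPUT, the sketch below feeds the parser from
a string in memory instead of the standard input. It follows the
three-argument form of the macro documented above; the variable
input_text and the helper read_input() are hypothetical names
introduced for this example, and the definitions are assumed to be
placed in a declaration section (or before the generated parser is
included).
.nf

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical input source: the text that remains to be parsed. */
static const char *input_text = "";

/* Copies up to max_size characters of the remaining text into buf and
 * returns how many were copied; 0 signals that no input is left. */
static int read_input(char *buf, int max_size)
{
    size_t left = strlen(input_text);
    size_t n = left < (size_t)max_size ? left : (size_t)max_size;

    memcpy(buf, input_text, n);
    input_text += n;
    return (int)n;
}

/* Same contract as the default definition: fill buf, then set result. */
#define YY_INPUT(buf, result, max_size) { result = read_input(buf, max_size); }
```

.fi
A client would point input_text at the text to parse before calling
the parser. Because the helper copies as many characters as fit, the
parser refills its buffer less often than with the one-character
default.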
.SH DIAGNOSTICS
.I peg
and
.I leg
warn about the following conditions while converting a grammar into a parser.
.TP
.B syntax error
The input grammar was malformed in some way. The error message will
include the text about to be matched (often backed up a huge amount
from the actual location of the error) and the line number of the most
recently considered character (which is often the real location of the
problem).
.TP
.B rule 'foo' used but not defined
The grammar referred to a rule named 'foo' but no definition for it
was given. Attempting to use the generated parser will likely result
in errors from the linker due to undefined symbols associated with the
missing rule.
.TP
.B rule 'foo' defined but not used
The grammar defined a rule named 'foo' and then ignored it. The code
associated with the rule is included in the generated parser which
will in all other respects be healthy.
.TP
.B possible infinite left recursion in rule 'foo'
There exists at least one path through the grammar that leads from the
rule 'foo' back to (a recursive invocation of) the same rule without
consuming any input.
.PP
Left recursion, especially that found in standards documents, is
often 'direct' and implies trivial repetition.
.nf

    # (6.7.6)
    direct-abstract-declarator =
        LPAREN abstract-declarator RPAREN
    |   direct-abstract-declarator? LBRACKET assign-expr? RBRACKET
    |   direct-abstract-declarator? LBRACKET STAR RBRACKET
    |   direct-abstract-declarator? LPAREN param-type-list? RPAREN

.fi
The recursion can easily be eliminated by converting the parts of the
pattern following the recursion into a repeatable suffix.
.nf

    # (6.7.6)
    direct-abstract-declarator =
        direct-abstract-declarator-head?
        direct-abstract-declarator-tail*

    direct-abstract-declarator-head =
        LPAREN abstract-declarator RPAREN

    direct-abstract-declarator-tail =
        LBRACKET assign-expr? RBRACKET
    |   LBRACKET STAR RBRACKET
    |   LPAREN param-type-list? RPAREN

.fi
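.PP
The same head-plus-repeated-tail transformation can be seen in a much
smaller, hypothetical rule. The left-recursive form
"suffixes = suffixes '[' ']' | name" would call itself before
consuming any input, whereas the equivalent form
"suffixes = name ( '[' ']' )*" becomes a simple loop. The C function
below is a hand-written sketch of the second form; its name and return
convention are assumptions made for this example.
.nf

```c
#include <ctype.h>

/* Sketch of:  suffixes = name ( '[' ']' )*
 * Returns the number of '[]' pairs that follow the name, or -1 if the
 * whole string does not match the rule. */
static int count_suffixes(const char *s)
{
    int pairs = 0;

    if (!isalpha((unsigned char)*s)) return -1;     /* head: name      */
    while (isalpha((unsigned char)*s)) s++;

    while (s[0] == '[' && s[1] == ']') {            /* tail, repeated  */
        s += 2;
        pairs++;
    }
    return *s == 0 ? pairs : -1;
}
```

.fi
Here count_suffixes("a[][]") returns 2 and count_suffixes("a[")
returns -1. Each pass through the loop corresponds to one application
of the recursive alternative in the original rule.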
.SH BUGS
The 'yy' and 'YY' prefixes cannot be changed.
.PP
Left recursion is detected in the input grammar but is not handled
correctly in the generated parser.
.PP
Diagnostics for errors in the input grammar are obscure and not
particularly helpful.
.PP
Several commonly-used
.IR lex (1)
features (yywrap(), yyin, etc.) are completely absent.
.PP
The generated parser does not contain '#line' directives to direct C
compiler errors back to the grammar description when appropriate.
.SH SEE ALSO
D. Val Schorre,
.I META II, a syntax-oriented compiler writing language,
19th ACM National Conference, 1964, pp.\ 41.301--41.311. Describes a
self-implementing parser generator for analytic grammars with no
backtracking.
.PP
Alexander Birman,
.I The TMG Recognition Schema,
Ph.D. dissertation, Princeton, 1970. A mathematical treatment of the
power and complexity of recursive-descent parsing with backtracking.
.PP
Bryan Ford,
.I Parsing Expression Grammars: A Recognition-Based Syntactic Foundation,
ACM SIGPLAN Symposium on Principles of Programming Languages, 2004.
Defines PEGs and analyses them in relation to context-free and regular
grammars. Introduces the syntax adopted in
.IR peg .
.PP
The standard Unix utilities
.IR lex (1)
and
.IR yacc (1)
which influenced the syntax and features of
.IR leg .
.PP
The source code for
.I peg
and
.I leg
whose grammar parsers are written using themselves.
.PP
The latest version of this software and documentation:
.nf

    http://piumarta.com/software/peg

.fi
.SH AUTHOR
.IR peg ,
.I leg
and this manual page were written by Ian Piumarta (first-name at
last-name dot com) while investigating the viability of regular- and
parsing-expression grammars for efficiently extracting type and
signature information from C header files.
.PP
Please send bug reports and suggestions for improvements to the author
at the above address.