The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Parse::Eyapp::eyappintro - An introduction to Parse::Eyapp

SYNOPSIS

  # File 'calc.eyp': translates infix expressions to postfix
  # Compile it with:  eyapp -o calc.pl -C Postfix.eyp
  # Execution:        ./calc.pl -c 'a = 2*3+b'
  %token NUM = /([0-9]+(?:\.[0-9]+)?)/
  %token VAR = /([A-Za-z][A-Za-z0-9_]*)/

  %right  '='
  %left   '-' '+'
  %left   '*' '/'
  %left   NEG

  %defaultaction { "$left $right $op"; }

  %%
  line: $exp  { print "$exp\n" }
  ;

  exp:        $NUM  { $NUM }            
          |   $VAR  { $VAR }            
          |   VAR.left '='.op exp.right         
          |   exp.left '+'.op exp.right         
          |   exp.left '-'.op exp.right        
          |   exp.left '*'.op exp.right       
          |   exp.left '/'.op exp.right      
          |   '-' $exp %prec NEG { "$exp NEG" }
          |   '(' $exp ')' { $exp }      
  ;

  %%

INTRODUCTION TO PARSING WITH Parse::Eyapp

Introduction

Parsing is the activity of producing a syntax tree from an input stream. The program example in the synopsis section shows an example of a translation scheme. It translates infix arithmetic expressions like

   a = 2 * 3 + 4 * b

into postfix expressions like

   a 2 3 * 4 b * + =

The program contains a context free eyapp grammar defining the language of arithmetic infix expressions. A context free grammar is a mathematical device to define languages. To better see the grammar for the example above we have to eliminate the semantic actions (We call semantic actions to the sections delimited by curly brackets containing Perl code). This can be done calling eyapp with options -vc:

  $ eyapp -vc Postfix.eyp 
  %token NUM =/([0-9]+(?:\.[0-9]+)?)/ 
  %token VAR =/([A-Za-z][A-Za-z0-9_]*)/ 
  %right '=' 
  %left '-' '+' 
  %left '*' '/' 
  %left NEG 

  %%

  line:
        exp 
  ;
  exp:
        NUM
      | VAR
      | VAR '=' exp
      | exp '+' exp
      | exp '-' exp
      | exp '*' exp
      | exp '/' exp
      | '-' exp %prec NEG
      | '(' exp ')'  
  ;

  %%

A grammar generates a language. A grammar is defined by a set of production rules. A production rule has two components: a left hand side which is a syntactic variable or non terminal and a right hand side which is a phrase made of syntactic variables and terminals. The left hand side (lhs) and the right hand side (rhs) are usually separated by an arrow like in:

                                    exp -> VAR = exp

A note: when you see a production rule like:

                        line: exp <+ ';'>

is not really a production rule but an abbreviation for two productions. It stands for:

                        line : exp
                             | line ';' exp
                        ;

A terminal or token never appears on the left hand side of a production rule.

Ambiguity

The phrases of the language are those obtained successively applying the production rules of the grammar until no more rules can be applied. The successive substitutions must start from the start symbol of the grammar (line in the example). Such legal sequence of substitutions is known as a derivation. The following is an example of a legal derivation (the big arrow => is read derives):

  line => exp => VAR = exp => VAR = exp + exp => VAR = exp + NUM => VAR = VAR + NUM

thus the phrase VAR = VAR + NUM belongs to the language generated by the former grammar. A derivation like this can be seen as a tree. For instance, the former derivation is equivalent (has the same information) than the following tree:

                             +----+
                             |line|
                             +----+
                                |
                              +---+
                              |exp|
                              +---+
                    .-----.-----^-----.
                  +---+ +---+        +---+
                  |VAR| | = |        |exp|
                  +---+ +---+        +---+
                                 .-----+-----.
                               +---+ +---+ +---+
                               |exp| | + | |exp|
                               +---+ +---+ +---+
                                 |           |
                               +---+       +---+
                               |VAR|       |NUM|
                               +---+       +---+

which can be written more succinctly:

                        line(exp(VAR, '=', exp(exp(VAR), '+',  exp(NUM))))

or even more briefly:

                                      VAR = (VAR + NUM)

Such a tree is called a syntax tree for the input VAR = VAR + NUM. A grammar is said to be ambiguous if there are phrases in the generated language that have more than one syntax tree. The grammar in the synopsis example is ambiguous. Here is an alternative tree for the same phrase VAR = VAR + NUM:

                                    +----+
                                    |line|
                                    +----+
                                       |
                                     +---+
                                     |exp|
                                     +---+
                            .----------^-----.-----.
                          +---+            +---+  +---+
                          |exp|            | + |  |exp|
                          +---+            +---+  +---+
                      .-----+-----.                 |
                    +---+ +---+ +---+             +---+
                    |VAR| | = | |exp|             |NUM|
                    +---+ +---+ +---+             +---+
                                  |
                                +---+
                                |VAR|
                                +---+

or

                        line(exp(exp(VAR, '=', exp(VAR)), '+', exp(NUM)))

or

                                     (VAR = VAR) + NUM

Semantic Actions and Attributes

Parse::Eyapp analyzes your grammar and produce a LALR parser. Actually the SYNOPSIS example is more than a context free grammar: is a translation scheme. A translation scheme scheme is a context free grammar where the right hand sides of the productions have been augmented with semantic actions (i.e. with chunks of Perl code):

                                A -> alpha { action(@_) } beta

The analyzer generated by Eyapp executes { action(@_) } after all the semantic actions associated with alpha have been executed and before the execution of any of the semantic actions associated with beta.

In a translation scheme each symbol occurrence has an associated attribute. The embedded actions modify the attributes associated with the symbols of the grammar:

                        A -> alpha { action(@_) } beta

Each symbol on the right hand side of a production rule has an associated scalar attribute. In eyapp the attributes of the symbols to the left of action are passed as arguments to action (in the example, those of alpha). These arguments are preceded by a reference to the syntax analyzer object. Therefore, you can access to the attributes associated with the first, second, etc. symbols in the right hand side using the notation:

               $_[1], $_[2], ...

However it is better to refer to the attributes by names. This is the purpose of the dot and dollar notations as in:

  exp:        $NUM  { $NUM }            
          |   $VAR  { $VAR }            
          |   VAR.left '='.op exp.right         
          |   exp.left '+'.op exp.right         
          |   exp.left '-'.op exp.right        
          |   exp.left '*'.op exp.right       
          |   exp.left '/'.op exp.right      
          |   '-' $exp %prec NEG { "$exp NEG" }
          |   '(' $exp ')' { $exp }      
  ;

By prefixing the symbol NUM by a $ we can refer to the associated attribute inside the semantic action as $NUM:

  exp:        $NUM  { $NUM }            

By postfixing the two appearances of expr with .left and .right and the appearance of '+' with .op we can refer to the associates attributes as $left, $right and $op instead of $_[1], $_[3] and $_[2]:

  %defaultaction { "$left $right $op"; }

There is no way inside an ordinary eyapp program for an intermediate action to access the attributes of the symbols on its right, i.e. those associated with the symbols of beta. This restriction is lifted if you use the %metatree directive to build a full translation scheme. See Parse::Eyapp::translationschemestut to know more about full translation schemes.

Actions on the right hand side counts as symbols and so they can be referenced by its positional argument in later actions in the same production rule. For intermediate actions, the value returned by the action is the attribute associated with such action. For an action at the end of the rule:

                        A -> alpha { lastaction(@_) } 

the returned value constitutes the attribute of the left hand side of the rule (the attribute of A in this case). The action at the end of the right hand side is called the action associated with the production rule. When no explicit action has been associated with a production rule the default action applies. In Parse::Eyapp the programmer can define what is the default action through the %defaultaction directive:

                        %defaultaction { "$left $right $op"; }

Actually, intermediate actions are implemented via a trick. When eyapp sees an intermediate action like:

                        A -> alpha { action(@_) } beta

it creates a new auxiliary syntactic variable Temp:

                      Temp -> /* empty */ { action(@_) }

Solving Ambiguities via Precedence and Associativity Declarations

Notice that ambiguous grammars produce ambiguous translation schemes: since a phrase may have two syntactic trees it will be more than one tree-traversing and consequently more than one way to execute the embedded semantic actions. Certainly different execution orders will usually produce different results. Thus, syntactic ambiguities translate onto semantic ambiguities. That is why it is so important to resolve all the ambiguities and conflicts that may arise in our grammars. This is the function of the %left and %right declarations on the header section:

      %right  '='     # Lowest precedence
      %left   '-' '+' # + and - have more precedence than = Disambiguate a-b-c as (a-b)-c
      %left   '*' '/' # * and / have more precedence than + Disambiguate a/b/c as (a/b)/c
      %left   NEG     # Disambiguate -a-b as (-a)-b and not as -(a-b)

Priority can be assigned to tokens by using the %left and %right declarations. Tokens in lines below have more precedence than tokens in line above. The idea behind this notation is this: Any ambiguity can be seen as a parenthesizing problem. You can parenthesize left (in the jargon this is called reduce) or parenthesize right (in the jargon, shift). Recall the main points of yacc-like parsers related to priorities:

  • The directives

                %left
                %right
                %nonassoc

    can be used in the head section to declare the priority of a token

  • The later the declaration line the higher the priority

  • Tokens in the same line have the same priority. Ties will be solved using the token associativity (whether they were declared %left or %right)

  • The precedence of a production rule (right hand side) is the precedence of the last token in the right hand side

  • If the parser emits a warning announcing a shift-reduce conflict or a reduce-reduce conflict in your grammar, it likely means that your grammar is ambiguous or not LALR. In such case, recompile the grammar with eyapp -v and carefully study the .output file generated. Detect which token and which rules are involved in the conflict.

  • In a shift-reduce conflict the default action is to shift (i.e. associate right). This action can be changed if the production and the token involved have explicit priorities

  • Most of the time the presence of a reduce-reduce conflict means that your grammar is ambiguous. Rewrite your grammar. Alternatively, use the %conflict and %PREC directives (see example debuggintut/pascalenumeratedvsrangesolvedviadyn.eyp). The default action is to reduce by the first production.

  • If the precedence of the production rule is higher the shift-reduce conflict is solved in favor of the reduction (i.e. associate left)

  • If the precedence of the token is higher the shift-reduce conflict is solved in favor of the shift (i.e. associate right).

  • If the precedence of the token is the same than the precedence of the rule, and is left the shift-reduce conflict is solved in favor of the reduction (i.e. associate left)

  • If the precedence of the token is the same than the precedence of the rule, and is right the shift-reduce conflict is solved in favor of the shift

  • If the precedence of the token is the same than the precedence of the rule, and is nonassoc the presence of a shift-reduce conflict means an error. This is used to describe operators, like the operator .LT. in FORTRAN, that may not associate with themselves. That is, because

                                 A .LT. B .LT. C

    is invalid in FORTRAN, .LT. would be described with the keyword %nonassoc in eyapp.

  • The default precedence of a production can be changed using the %prec TOKEN directive. Now the rule has the precedence and associativity of the specified TOKEN.

Examples

  • By giving token '+' more precedence than token '=' we solve the ambiguity in VAR = VAR + NUM in favor of VAR = (VAR + NUM). The conflict occurs between the productions

                                exp -> exp . '+' exp 
                                exp -> VAR '=' exp .

    Where the dot means:

    If I have seen VAR '=' exp and I am in the presence of a token '+' I can associate left, i.e. reduce VAR '=' exp to exp or to associate right, that is, to shift to the right to reduce exp '+' exp to exp later.

  • How it works when two tokens are declared in the same line? Consider the phrase NUM - NUM - NUM. It will be interpreted as (NUM - NUM) - NUM if the token '-' is declared %left '-' and will be interpreted as NUM - (NUM - NUM) if the token '-' is declared %right '-'. By saying '-' is left we are saying we prefer between the two trees in dispute the one that deepens to the left:

                                           +---+
                                           |exp|
                                           +---+
                                    .--------^--.-----.
                                  +---+       +---+ +---+
                                  |exp|       | - | |exp|
                                  +---+       +---+ +---+
                              .-----+-----.           |
                            +---+ +---+ +---+       +---+
                            |exp| | - | |exp|       |NUM|
                            +---+ +---+ +---+       +---+
                              |           |
                            +---+       +---+
                            |NUM|       |NUM|
                            +---+       +---+

    By saying '-' is right we are saying we prefer between the two trees in dispute the one that deepens to the right:

                                            +---+
                                            |exp|
                                            +---+
                                  .-----.-----^-----.
                                +---+ +---+       +---+
                                |exp| |MIN|       |exp|
                                +---+ +---+       +---+
                                  |           .-----+-----.
                                +---+       +---+ +---+ +---+
                                |NUM|       |exp| |MIN| |exp|
                                +---+       +---+ +---+ +---+
                                               |           |
                                             +---+       +---+
                                             |NUM|       |NUM|
                                             +---+       +---+

    Since priority means earlier evaluation and the evaluation by eyapp of semantic actions is bottom up, the deeper the associated subtree the higher the priority.

  • Consider now the phrase -NUM-NUM. There are two interpretations: one as -(NUM-NUM) and the other as (-NUM)-NUM. The conflict occurs between the productions

                                exp -> exp . '-' exp 
                                exp -> '-' exp.

    Both productions have the precedence of the token '-'. But we prefer the interpretation (-NUM)-NUM to win. We do that by explicitly changing the precedence associated with the unary minus production via the %prec directive.

Lexical Analysis

Parsers created by eyapp do not deal directly with the input. Instead they expect the input to be processed by a lexical analyzer. The lexical analyzer parses the input and produces the next token. A token is a pair. The first component is the name of the token (like NUM or VAR) and the second is its attribute (i.e. the information associated with the token, like that the value is 4 for a NUM or the identifier is temperature for a VAR). Tokens are usually defined using regular expressions. Thus the token NUM is characterized by /[0-9]+(?:\.[0-9]+)?/ and the token VAR by /[A-Za-z][A-Za-z0-9_]*/. The eyapp compiler automatically generates a lexical analyzer from your token definitions. The tokens NUM and VAR were defined using the %token directives:

  %token NUM = /([0-9]+(?:\.[0-9]+)?)/
  %token VAR = /([A-Za-z][A-Za-z0-9_]*)/

The order in which the tokens are defined is important. The input will be matched against the regular expression for NUM before the regular expression for VAR is tried, and all the literal tokens that appear between quotes inside the body of the grammar, like '+' or '-, are tried before any explicitly defined token.

You can, alternatively, define the lexical analyzer explicitly. There are many ways to do it. Here is an example of a definition of a lexical analyzer using the %lexer directive:

  %lexer  {
    m{\G[ \t]*}gc;
    m{\G(\n)+}gc                    and $self->tokenline($1 =~ tr/\n//);
    m{\G([0-9]+(?:\.[0-9]+)?)}gc    and return ('NUM',   $1);
    m{\G([A-Za-z_][A-Za-z0-9_]*)}gc and return ('VAR',   $1);
    m{\G(.)}gc                      and return ($1,      $1);
  }

The %lexer directive is followed by the code defining the lexical analyzer. When called, the variable $_ is an alias of the input. The input can also be set and accessed via the input method of the $parser object.

To catch the next pattern we use the anchor \G. The \G anchor matches at the point where the previous /g match left off. Normally, when a scalar m{}g match fails, the match position is reset and \G will start matching at the beginning of the string. The c option causes the match position to be retained following an unsuccessful match. The couple ('',undef) which signals the end of the input is automatically inserted by eyapp.

By default, the lexers generated by eyapp emit the end-of-input token ('', undef) when the end of the current string is reached. A incremental lexer differs from these behavior: when the end is reached it reads more input from the current file, which was set by

                 $parser->YYInputFile

See the following variant of the synopsis example:

  ~/LEyapp/examples/eyappintro$ cat InputFromStream.eyp 
  %whites /([ \t]+)/
  %token NUM = /([0-9]+(?:\.[0-9]+)?)/
  %token VAR = /([A-Za-z][A-Za-z0-9_]*)/

  %right '='
  %left   '-' '+'
  %left   '*' '/'
  %left   NEG

  %defaultaction { "$_[1] $_[3] $_[2]" }

  # example of incremental lexer
  %incremental lexer  'Write an arithmetic expression: '

  %%
  input:                  {}
          |   input line  {}
  ;

  line:     '\n'       {}
          | exp '\n'   { print "$_[1]\n" } 
          | error '\n'   {}
  ;

  exp:        NUM                { $_[1] }
          |   VAR                { $_[1] }
          |   VAR '=' exp         
          |   exp '+' exp         
          |   exp '-' exp        
          |   exp '*' exp       
          |   exp '/' exp      
          |   '-' exp %prec NEG  { "$_[2] NEG" }
          |   '(' exp ')'        { $_[2] } 
  ;

  %%

Now, after the grammar is compiled

  ~/LEyapp/examples/eyappintro$ eyapp -C InputFromStream.eyp 

When the generated modulino is executed, each time the end of the input string is reached, it asks for more input until we press the end-of-file (^D in Unix) key:

  ~/LEyapp/examples/eyappintro$ ./InputFromStream.pm -noslurp
  Write an arithmetic expression: a=2+3*b
  a 2 3 b * + =
  Write an arithmetic expression: a=-b*2
  a b NEG 2 * =
  Write an arithmetic expression: ^D
  ~/LEyapp/examples/eyappintro$ 

The main and error subroutines

If you compile your grammar with option -C, eyapp will insert a line like this as the first line of the generated .pm file:

                #!/usr/bin/perl

It will also append a line like this as the last line of the .pm file:

          unless (caller) { exit !__PACKAGE__->main(''); }

This allows the alternative use of the module as a script. Unless a main subroutine was defined, the one provided by Parse::Eyapp::Driver will be called. It also provides a default subroutine for the handling of error messages.

The default main accepts a few arguments from the command line. Here are some:

  • -f filename input from filename

  • -c 'string' input from 'string'

  • -noslurp when input is from STDIN don't wait for end of file

SEE ALSO

REFERENCES

CONTRIBUTORS

AUTHOR

Casiano Rodriguez-Leon (casiano@ull.es)

ACKNOWLEDGMENTS

This work has been supported by CEE (FEDER) and the Spanish Ministry of Educacion y Ciencia through Plan Nacional I+D+I number TIN2005-08818-C04-04 (ULL::OPLINK project http://www.oplink.ull.es/). Support from Gobierno de Canarias was through GC02210601 (Grupos Consolidados). The University of La Laguna has also supported my work in many ways and for many years.

A large percentage of code is verbatim taken from Parse::Yapp 1.05. The author of Parse::Yapp is Francois Desarmenien.

I wish to thank Francois Desarmenien for his Parse::Yapp module, to my students at La Laguna and to the Perl Community. Thanks to the people who have contributed to improve the module (see "CONTRIBUTORS" in Parse::Eyapp). Thanks to Larry Wall for giving us Perl. Special thanks to Juana.

LICENCE AND COPYRIGHT

Copyright (c) 2006-2008 Casiano Rodriguez-Leon (casiano@ull.es). All rights reserved.

Parse::Yapp copyright is of Francois Desarmenien, all rights reserved. 1998-2001

These modules are free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.