The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

yagg - generate a string generator from a grammar

SYNOPSIS

  yagg -u user_files_dir grammar.yg

DESCRIPTION

Given YACC-like and LEX-like input files, yagg generates a C++ program that generates all strings of a user-specified length. The YACC-like language grammar file provides the grammar productions for string generation, along with optional action blocks that can perform context-sensitive checks in order to limit the generated strings. The LEX-like terminal generator file provides specifications that instruct the program how to generate strings for terminals in the grammar.

If the programmer already has a YACC or Bison parser file, he or she only needs to add "unaction" blocks to allow the recursive generator to undo the side effects of the action blocks. If the programmer already has a LEX or FLEX lexer input file, he or she only needs to remove extraneous code and replace any regular expressions with one of the terminal generator specifications.

OPTIONS AND ARGUMENTS

.yg grammar file

This is the main grammar file, which is very similar (if not identical) to the YACC file that a programmer might write to parse an input file. There are one or two differences--see the section "INPUT FILE SYNTAX" below for details.

.lg terminal generator file

A terminal specification file that defines productions for nonterminals which represent tokens in the language grammar. This is analogous to, and replaces, the LEX file that a programmer might write to parse an input file. See the section "INPUT FILE SYNTAX" below for details.

This input file is not necessary if you use constant strings exclusively in your grammar (i.e. if the grammar has no terminals).

-C "compiler flags"

Set the compiler flags in the GNUmakefile for the generated code. This is "-Wall -pedantic -O3" by default. Be sure to quote the argument to -C if you have multiple compiler flags. e.g.:

  yagg -C "-g -Wall" foo.yg foo.lg

There are a number of flags for printing debug information during generation. See the generated GNUmakefile for a list. (-DSHORT_RULE_TRACE is especially useful if you want to trace the recursive generation of strings from the grammar.)

Remember to force a rebuild if you change the compiler flags. You can do this with the -f flag, or by running make clean in the output directory.

-d

Output debugging information to STDERR.

-f

Delete the contents of the output directory before generating. By default, new files overwrite old files if they are different. (Non-generated files that exist in the output directory are not removed, so that any compiler intermediate files can be reuse.)

-L "linker flags"

Set the linker flags in the GNUmakefile for the generated code. This is empty by default.

-m

Run "make" in the output directory after completing the generation.

If -o specifies a remote directory, this command will be run remotely using ssh, or whatever the RSYNC_RSH environment variable is set to.

-M

Pass these additional arguments to "make". This is useful if you want to set a control what is built on the remote machine. For example, you can send "LD=g++" to set g++ as the linker, or send "test" to run the tests.

-r <length>

Automatically run the generator program using the specified length. Implies -m.

If -o specifies a remote directory, this command will be run remotely using ssh, or whatever the RSYNC_RSH environment variable is set to. If you set up your public and private keys correctly, you should be able to run the generator without having to type in any passwords. That way you can transparently leverage the greater computational resources of a remote computer.

WARNING: The makefile uses a set of programs such as ln, g++, etc. The remote build will use whatever program paths happen to be in your default path.

-o directory

Specify the output directory. This is passed directly to rsync, so you can specify a remote directory using any of the formats it supports (assuming your remote maching has rsync).

  yagg -o userid@example.com:output foo.yg foo.lg
-u directory

Specify a directory with user-supplied files for the generator. Any files and subdirectories are copied into the output directory, overriding any generated files. You can use this to provide custom versions of any file, such as the src/progs/generate.cc file for the main program.

-X

Don't put any #line preprocessor commands in the generated files. Ordinarily yagg puts them in the files so that the C++ compiler and debuggers will associate errors with your source file, the grammar file. This option causes them to associate errors with the generated code, treating it as an independent source file in its own right.

NOTE: This flag can save you some compile time, because changing the input grammar file changes the line numbers, which changes the #line directives, which forces a recompile of the generated code.

--help

Print the usage message and exit.

--version

Print the version and exit.

INPUT FILE SYNTAX

This section provides a brief overview of the input file syntax. See the yagg::Tutorial for more discussion.

Language Grammar File

The language grammar file syntax is based on that of YACC. yagg should be able to process your .y file unchanged. (Please report a bug if it can not.) Then for any actions that have side-effects, you will need to add "unaction" blocks to reverse those side effects (described in more detail shortly). Otherwise, you should not have to make any other changes to the file.

A couple of things must be kept in mind. First, make very sure that your code does not contain memory leaks. While you may not notice them for a single execution of a parser, the generator will run your action code many times, in which case memory leaks will accrue. (Valgrind on Linux is a good leak checker.)

If your grammar actions have side effects, you must provide an unaction block that will reverse the side effects. This is because the generator needs to be able to backtrack in its search, and can't figure out how to undo your changes automatically. An example rule with an unaction block follows:

  // nonterminal and rule_1 have a return type of string*
  nonterminal : rule_1 "constant string"
  {
    // Normal action block
    global_count += $2;
    $$ = new string(*$1);
    delete $1;
  }
  {
    // New unaction block
    global_count -= $2; // <--- To restore global state
  };

First, notice that I am careful to delete the $1 positional parameter in the action block. Failing to do so would cause a memory leak. In the unaction block, I decrement the global_count variable. Note that you do not have to worry about deallocating or otherwise restoring the state of $$--that is handled automatically. Any positional parameters such as $2 that are used in the unaction block are automatically initialized with copies. (This means that any pointers to user-defined types must have a copy constructor defined.) Copies will not be made in the unaction block if you do not use a positional parameter, so you only need to delete $1 if you use it.

In an action block, any call to yyerror() is interpreted by the generator as a string that is invalid, and should not be generated. In the unaction block, m_error_occurred will be true if the action block resulted in an invalid string. Here's how you might use this to add a constraint on the generated strings:

  // nonterminal and rule_1 have a return type of string*
  nonterminal : rule_1 "constant string"
  {
    if (*$1 == "foo")
      yyerror("foo is not allowed!");
    else
    {
      global_count += $2;  // <--- Only increment for valid strings
      $$ = new string(*$1);
    }

    delete $1;
  }
  {
    if (!m_error_occurred) // <--- Only decrement for valid strings
      global_count -= $2;
  };

Terminal Generator File

The terminal generator file specifies the productions for terminals (which would be tokens in LEX). The generator supports a number of features for limiting the generation, as described below. The format is loosely based on that of LEX. The major change is that the only code that can be in the {...} blocks is a return statement for the token.

For obvious reasons, generating an unbounded number of possible strings for a regular expression is infeasible. Therefore, the programmer must provide one of several specifications for each terminal that tell the generator how to generate its strings.

Simple

The most simple specification is a constant string, which will replace the terminal wherever it appears in the generated string. For example:

  "=" return EQUAL;

You may also use constant strings in the language grammar file. They will be automatically replaced with "virtual terminals" having a simple string specification.

Alternation

If there are several possibilities for a terminal, you can use the syntax "(alt1|alt2)" to specify them. for example:

  ( "+" | "-" | "*" | "/" ) {
    return OPERATOR;
  }

This example also demonstrates an alternative form for the return statement. During generation, the OPERATOR terminal will be replaced with each of the alternatives, creating four times as many output strings.

Equivalence Classes

If an alternation is enclosed in square brackets, then the alternatives are considered to be interchangeable. This means that strings which differ only in terms of which alternative was chosen will not be printed. However, strings which utilize multiple alternatives will still be generated.

This is useful when generating terminals such as variable names:

  [ "x" | "y" ] return VARIABLE;

Consider a language grammar containing the following:

  SUM : VARIABLE "+" VARIABLE ;

Without the equivalence class, the following strings would be generated:

  x+x
  x+y
  y+x
  y+y

With the equivalence class, the following strings will be generated:

  x+x
  x+y

Since x and y are part of the same equivalence class, x+x is the same as y+y. Similarly, x+y is the same as y+x.

Equivalence Generators

If the terminal specification is an equivalence containing one literal string containing one "#" character, then the generator will create strings as needed, replacing the "#" character with 1, 2, 3, etc. Use "\#" if you want a literal "#" character in the produced string.

This is useful when generating an unlimited number of terminals such as variable names:

  [ "var_#" ] return VARIABLE;

Consider a language grammar containing the following:

  SUM : VARIABLE "+" VARIABLE ;

With the equivalence generator, the following strings will be generated:

  var_1+var_1
  var_1+var_2

You can think of this feature as an "infinite equivalence class".

BUGS, LIMITATIONS, POSSIBLE IMPROVEMENTS

Not fully tested

I still need to test it for grammars found "in the wild." I also need to validate the steps in the second example of the tutorial. I have not tested it on many different platforms.

Not optimized for speed or memory

It's very recursive, so we can use memoization or other techniques to speed things up. There's probably plenty of opportunity to make the generated C++ code faster. I'm hoping someone with optimization experience can help out.

%{ ... %} and code block parsing is error-prone

Instead of doing real parsing of the user-provided C/C++ code, I try to parse it in Perl. We need to do this in order to identify declarations to extern, and to move definitions in order to avoid multiply defined symbols. Some real code may confuse the simple Perl parser, causing the utility files to fail to compile.

Other bugs and planned changes

See the TODO file.

AUTHOR

David Coppit, <david@coppit.org>, http://coppit.org/

SEE ALSO

Run perldoc yagg::Tutorial for a tutorial. Also see the examples/ directory in the distribution.

Parse::Yapp, YACC, Bison, LEX, FLEX