NAME

App::Framework::Extension::Filter - Script filter application object

SYNOPSIS

  use App::Framework '::Filter' ;

DESCRIPTION

Application that filters one or more files to produce some other output.

Application Subroutines

This extension modifies the normal call flow for the application subroutines. The extension calls the subroutines for each input file being filtered. Also, the main 'app' subroutine is called for each of the lines of text in the input file.

The pseudo-code for the extension is:

    FOREACH input file
        <init variables, state HASH>
        call 'app_start' subroutine
        FOREACH input line
            call 'app' subroutine
        END
        call 'app_end' subroutine
    END
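
The call flow above can be sketched in plain Perl. This is only an illustration: run_filter is a hypothetical stand-in for the extension's main loop, not its real code, and the "files" are in-memory lists of lines so that the sketch runs without App::Framework installed. The $app and $opts_href arguments are passed as placeholders.

```perl
use strict ;
use warnings ;

# Hypothetical stand-in for the extension's main loop (not the real code).
# $files_href maps a file name to an ARRAY ref of its lines.
sub run_filter
{
    my ($files_href, $subs_href) = @_ ;

    my @names = sort keys %$files_href ;
    my $file_number = 0 ;
    foreach my $file (@names)
    {
        # a fresh state HASH is created for every input file
        my $state_href = {
            num_files    => scalar(@names),
            file_number  => ++$file_number,
            file_list    => \@names,
            file         => $file,
            vars         => {},
            line_num     => 0,
            output_lines => [],
        } ;

        $subs_href->{app_start}->(undef, {}, $state_href) ;
        foreach my $line (@{ $files_href->{$file} })
        {
            $state_href->{line_num}++ ;
            $state_href->{line}   = $line ;
            $state_href->{output} = undef ;     # reset before each 'app' call
            $subs_href->{app}->(undef, {}, $state_href, $line) ;
        }
        $subs_href->{app_end}->(undef, {}, $state_href) ;
    }
    return ;
}

# Record the order in which the three subroutines are called
my @trace ;
run_filter(
    { 'a.txt' => ['one', 'two'], 'b.txt' => ['three'] },
    {
        app_start => sub { push @trace, "start $_[2]{file}" },
        app       => sub { push @trace, "line $_[2]{line_num}: $_[3]" },
        app_end   => sub { push @trace, "end $_[2]{file}" },
    },
) ;
print "$_\n" foreach @trace ;
```

The trace shows 'app_start' and 'app_end' bracketing the per-line 'app' calls for each file in turn.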

For each input file, a state HASH is created and passed as a reference to the application subroutines. The state HASH contains various values maintained by the extension, but the application may add its own values to the HASH. These values are passed unmodified to each of the application subroutine calls.

The state HASH contains the following fields:

  • num_files

    Total number of input files.

  • file_number

    Current input file number (1 to num_files)

  • file_list

    ARRAY ref. List of input filenames.

  • vars

    HASH ref. Empty HASH created so that any application-specific variables may be stored here.

  • line_num

    Current line number of line being processed (1 to N).

  • output_lines

    ARRAY ref. List of the output lines that are to be written to the output file (maintained by the extension).

  • file

    Current file name of the file being processed.

  • line

    String of line being processed.

  • output

    Special variable used by application to tell extension what to output (see "Output").
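
As a small illustration of reading these fields, the following 'app' subroutine prefixes every line with the current file name and line number. It is a sketch that uses only the file, line_num and output fields listed above; whether $line still carries its trailing newline is not stated here, so the sketch leaves the line text untouched.

```perl
use strict ;
use warnings ;

# Prefix every output line with "file:line_num: ", using only fields that
# the extension maintains in the state HASH.
sub app
{
    my ($app, $opts_href, $state_href, $line) = @_ ;

    $state_href->{output} = sprintf "%s:%d: %s",
        $state_href->{file}, $state_href->{line_num}, $line ;
    return ;
}
```

In a real script this subroutine would sit below a "use App::Framework '::Filter'" line and a call to go(), as in the SYNOPSIS.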

The state HASH reference is passed to all three application subroutines. The main 'app' subroutine is also passed the input line of text. The interface for the subroutines is:

app_start($app, $opts_href, $state_href)

Called once for each input file, at the start of processing. Allows any setting up of variables stored in the state HASH.

Arguments are:

$app - The application object
$opts_href - HASH ref to the command line options (see App::Framework::Feature::Options and "ADDITIONAL COMMAND LINE OPTIONS")
$state_href - HASH ref to state

app($app, $opts_href, $state_href, $line)

Called once for each line of each input file. Processes the line and, if anything is to be written, sets the output field of the state HASH (see "Output").

Arguments are:

$app - The application object
$opts_href - HASH ref to the command line options (see App::Framework::Feature::Options and "ADDITIONAL COMMAND LINE OPTIONS")
$state_href - HASH ref to state
$line - Text of input line

app_end($app, $opts_href, $state_href)

Called once for each input file, at the end of processing. Allows for any end-of-file tidying up, data sorting etc.

Arguments are:

$app - The application object
$opts_href - HASH ref to the command line options (see App::Framework::Feature::Options and "ADDITIONAL COMMAND LINE OPTIONS")
$state_href - HASH ref to state
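
The three subroutines can cooperate through the vars field of the state HASH. The sketch below counts words per file: 'app_start' resets the counters, 'app' updates them for each line, and 'app_end' sorts the result. The count and report keys are names invented for this example, and printing the report to STDERR is an arbitrary choice made to keep it apart from the filter output.

```perl
use strict ;
use warnings ;

sub app_start
{
    my ($app, $opts_href, $state_href) = @_ ;

    # word => occurrences, reset for every input file
    $state_href->{vars}{count} = {} ;
    return ;
}

sub app
{
    my ($app, $opts_href, $state_href, $line) = @_ ;

    $state_href->{vars}{count}{lc $_}++ foreach ($line =~ m/(\w+)/g) ;

    # 'output' is left unset, so nothing is written for individual lines
    return ;
}

sub app_end
{
    my ($app, $opts_href, $state_href) = @_ ;

    my $count_href = $state_href->{vars}{count} ;

    # most frequent first, ties broken alphabetically
    my @words = sort { $count_href->{$b} <=> $count_href->{$a} || $a cmp $b }
                keys %$count_href ;

    $state_href->{vars}{report} = [ map { "$_ $count_href->{$_}" } @words ] ;
    print STDERR "$state_href->{file}: $_\n"
        foreach @{ $state_href->{vars}{report} } ;
    return ;
}
```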

Output

By default, each time the extension calls the 'app' subroutine it sets the output field of the state HASH to undef. The 'app' subroutine must set this field to some value for the extension to write anything to the output file.

For example, the following simple 'app' subroutine causes the contents of all input files to be output in upper case:

        sub app
        {
                my ($app, $opts_href, $state_href, $line) = @_ ;
                
                # uppercase
                $state_href->{output} = uc $line ;      
        }
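
Because the output field starts out as undef for every line, a line can be dropped simply by not setting it. The following sketch acts as a crude inverse grep, removing lines whose first non-space character is '#' (the pattern is just an example):

```perl
use strict ;
use warnings ;

sub app
{
    my ($app, $opts_href, $state_href, $line) = @_ ;

    # 'output' is undef on entry; leaving it untouched drops the line
    return if $line =~ m/^\s*#/ ;

    # any other line is passed through unchanged
    $state_href->{output} = $line ;
    return ;
}
```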

If no "outfile" option is specified, then all output will be written to STDOUT. Also, normally the output is written line-by-line after each line has been processed. If the "buffer" option has been specified, then all output lines are buffered (into the state variable "output_lines") then written out at the end of processing all input. Similarly, if the "inplace" option is specified, then buffering is used to process the complete input file then overwrite it with the output.

Outfile option

The "outfile" option may be used to set the output filename. This may include variables that are specific to the Filter extension, where each variable's value is updated for each input file being processed. The following Filter-specific variables may be used:

    $filter{'filter_file'} = $state_href->{file} ;
    $filter{'filter_filenum'} = $state_href->{file_number} ;
    my ($base, $path, $ext) = fileparse($file, '\..*') ;
    $filter{'filter_name'} = $base ;
    $filter{'filter_base'} = $base ;
    $filter{'filter_path'} = $path ;
    $filter{'filter_ext'} = $ext ;

filter_file - Input full file path
filter_base - Basename of input file (excluding extension)
filter_name - Alias for "filter_base"
filter_path - Directory path of input file
filter_ext - Extension of input file
filter_filenum - Input file number (starting from 1)

NOTE: Specifying these variables for options at the command line will require you to escape the variables per the operating system you are using (e.g. use single quotes ' ' around the value in Linux).

For example, with the command line arguments:

    -outfile '/tmp/$filter_name-$filter_filenum.txt' afile.doc /doc/bfile.text

the extension processes './afile.doc' into '/tmp/afile-1.txt', and '/doc/bfile.text' into '/tmp/bfile-2.txt'.
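
The expansion can be reproduced outside the extension with File::Basename. The helper below, expand_outfile, is hypothetical (it is not part of the extension's API) and the substitution rule, replacing each "$name" that matches a known variable, is an assumption about how the template is interpreted; only the fileparse call and the variable names are taken from the description above.

```perl
use strict ;
use warnings ;
use File::Basename qw(fileparse) ;

# Hypothetical helper, not part of the extension's API: expand the
# Filter-specific variables in an "outfile" template for one input file.
sub expand_outfile
{
    my ($template, $file, $file_number) = @_ ;

    my ($base, $path, $ext) = fileparse($file, '\..*') ;
    my %filter = (
        filter_file    => $file,
        filter_filenum => $file_number,
        filter_name    => $base,
        filter_base    => $base,
        filter_path    => $path,
        filter_ext     => $ext,
    ) ;

    # replace each known $variable; leave unknown ones untouched
    (my $outfile = $template) =~
        s/\$(\w+)/exists $filter{$1} ? $filter{$1} : '$' . $1/ge ;
    return $outfile ;
}

my $template = '/tmp/$filter_name-$filter_filenum.txt' ;
print expand_outfile($template, 'afile.doc', 1), "\n" ;        # /tmp/afile-1.txt
print expand_outfile($template, '/doc/bfile.text', 2), "\n" ;  # /tmp/bfile-2.txt
```

Note that the suffix pattern '\..*' treats everything from the first dot onwards as the extension, so 'archive.tar.gz' has the base name 'archive'.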

Example

As an example, here is a script that filters one or more HTML files to strip out unwanted sections (they are actually Doxygen HTML files that I wanted to convert into a PDF book):

    #!/usr/bin/perl
    #
    use strict ;
    use App::Framework '::Filter' ;
    
    # VERSION
    our $VERSION = '1.00' ;
    
    ## Create app
    go() ;
    
    #----------------------------------------------------------------------
    sub app_start
    {
        my ($app, $opts_href, $state_href) = @_ ;
    
        # force in-place editing
        $app->set(inplace => 1) ;
    
        # set to start state
        $state_href->{vars} = {
            'state'        => 'start',
        } ;
    }
    
    #----------------------------------------------------------------------
    # Main execution
    #
    sub app
    {
        my ($app, $opts_href, $state_href, $line) = @_ ;
    
        my $ok = 1 ;
        if ($state_href->{'vars'}{'state'} eq 'start')
        {
            if ($line =~ m/<!-- Generated by Doxygen/i)
            {
                $ok = 0 ;
                $state_href->{'vars'}{'state'} = 'doxy-head' ;
            }
        }
        elsif ($state_href->{'vars'}{'state'} eq 'doxy-head')
        {
            $ok = 0 ;
            if ($line =~ m/<div class="contents">/i)
            {
                $ok = 1 ;
                $state_href->{'vars'}{'state'} = 'contents' ;
            }
        }
        elsif ($state_href->{'vars'}{'state'} eq 'contents')
        {
            if ($line =~ m/<hr size="1"><address style="text-align: right;"><small>Generated/i)
            {
                $ok = 0 ;
                $state_href->{'vars'}{'state'} = 'doxy-foot' ;
            }
        }
        elsif ($state_href->{'vars'}{'state'} eq 'doxy-foot')
        {
            $ok = 0 ;
            if ($line =~ m%</body>%i)
            {
                $ok = 1 ;
                $state_href->{'vars'}{'state'} = 'end' ;
            }
        }
    
        # only output if ok to do so
        $state_href->{'output'} = $line if $ok ;
    }
    
    
    #=================================================================================
    # SETUP
    #=================================================================================
    __DATA__
    
    [SUMMARY]
    Filter Doxygen created html removing frames etc.
    
    [DESCRIPTION]
    B<$name> does some stuff.

The script takes in HTML of the form:

    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN">
    <html><head><meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
    <title>rctu4_test: File Index</title>
    <link href="doxygen.css" rel="stylesheet" type="text/css">
    <link href="tabs.css" rel="stylesheet" type="text/css">
    </head><body>
    **<!-- Generated by Doxygen 1.5.5 -->
    **<div class="navigation" id="top">
    **  <div class="tabs">
    **    <ul>
    ..
    **  </div>
    **</div>
    <div class="contents">
    <h1>File List</h1>Here is a list of all files with brief descriptions:<table>
      <tr><td class="indexkey">src/<a class="el" href="rctu4__tests_8c.html">rctu4_tests.c</a></td><td class="indexvalue"></td></tr>
      <tr><td class="indexkey">src/common/<a class="el" href="ate__general_8c.html">ate_general.c</a></td><td class="indexvalue"></td></tr>
    ...
    
      <tr><td class="indexkey">src/tests/<a class="el" href="test__star__daisychain__specific_8c.html">test_star_daisychain_specific.c</a></td><td class="indexvalue"></td></tr>
      <tr><td class="indexkey">src/tests/<a class="el" href="test__version__functions_8c.html">test_version_functions.c</a></td><td class="indexvalue"></td></tr>
    </table>
    
    </div>
    
    **<hr size="1"><address style="text-align: right;"><small>Generated on Fri Jun 5 13:43:31 2009 for rctu4_test by&nbsp;
    **<a href="http://www.doxygen.org/index.html">
    **<img src="doxygen.png" alt="doxygen" align="middle" border="0"></a> 1.5.5 </small></address>
    </body>
    </html>

and removes the lines marked here with a leading '**'.

The script does in-place updating of the HTML files and can be run as:

    filter-script *.html

ADDITIONAL COMMAND LINE OPTIONS

This extension adds the following additional command line options to any application:

-skip_empty - Skip blanks

Do not process empty lines (lines that contain only whitespace)

-trim_space - Trim spaces

Remove spaces from start and end of lines

-trim_comment - Trim comments

Remove any comments from the line, starting from the comment string to the end of the line

-inplace - In-place filter

Read file, process, then overwrite original input file with processed output

-outdir - Specify output directory

Write file(s) into specified directory rather than into same directory as input file

-outfile - Specify output file

Specify the output filename, which may include variables (see "Outfile option")

-comment - Specify comment string

Specify the comment start string. Used in conjunction with "-trim_comment".
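
The combined effect of the line-oriented options can be pictured with the sketch below. It is not the extension's code: preprocess_line is a hypothetical function, and both the order of the steps (comment removal, then space trimming, then the empty-line check) and the treatment of the comment string as literal text are assumptions made for the example.

```perl
use strict ;
use warnings ;

# Hypothetical model of the line-oriented options. Returns the processed
# line, or undef when the line would be skipped.
sub preprocess_line
{
    my ($line, $opts_href) = @_ ;

    my $comment = $opts_href->{comment} ;
    if ($opts_href->{trim_comment} && defined $comment && length $comment)
    {
        # remove from the comment string to the end of the line
        my $start = index($line, $comment) ;
        $line = substr($line, 0, $start) if $start >= 0 ;
    }

    if ($opts_href->{trim_space})
    {
        $line =~ s/^\s+// ;
        $line =~ s/\s+$// ;
    }

    # lines that contain only whitespace count as empty
    return undef if $opts_href->{skip_empty} && $line !~ m/\S/ ;

    return $line ;
}

my %opts = (trim_comment => 1, comment => '#', trim_space => 1, skip_empty => 1) ;
foreach my $line ('  value = 1  # note', '# only a comment', 'plain')
{
    my $result = preprocess_line($line, \%opts) ;
    print defined $result ? "[$result]\n" : "(skipped)\n" ;
}
```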

COMMAND LINE ARGUMENTS

This extension sets the following additional command line arguments for any application:

file - Input file(s)

Specify one or more input files to be processed. If no files are specified on the command line, input is read from STDIN.

FIELDS

Note that the fields match with the command line options.

skip_empty - Skip blanks

Do not process empty lines (lines that contain only whitespace)

trim_space - Trim spaces

Remove spaces from start and end of lines

trim_comment - Trim comments

Remove any comments from the line, starting from the comment string to the end of the line

inplace - In-place filter

Read file, process, then overwrite original input file with processed output

buffer - Buffer output

Store output lines into a buffer, then write out file at end of processing

outdir - Specify output directory

Write file(s) into specified directory rather than into same directory as input file

outfile - Specify output file

Specify the output filename, which may include variables (see "Outfile option")

comment - Specify comment string

Specify the comment start string. Used in conjunction with "trim_comment".

out_fh - Output file handle

Read only. File handle of current output file.

CONSTRUCTOR METHODS

new([%args])

Create a new App::Framework::Extension::Filter.

The %args are specified as they would be in the set method, for example:

        'mmap_handler' => $mmap_handler

The full list of possible arguments is:

        'fields'        => Either ARRAY list of valid field names, or HASH of field names with default values 

CLASS METHODS

init_class([%args])

Initialises the object class variables.

OBJECT METHODS

filter_run($app, $opts_href, $args_href)

Filter the specified file(s) one at a time.

write_output($output)

Application interface for writing out extra lines
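
A possible use is to append a summary line that does not correspond to any input line. The sketch below assumes that write_output can be called on the $app object passed to the application subroutines and that it takes the text to write as its only argument, as the signature above suggests; whether a trailing newline must be supplied is not specified here. The kept key in vars is a name chosen for the example.

```perl
use strict ;
use warnings ;

sub app_start
{
    my ($app, $opts_href, $state_href) = @_ ;

    $state_href->{vars}{kept} = 0 ;
    return ;
}

sub app
{
    my ($app, $opts_href, $state_href, $line) = @_ ;

    return unless $line =~ m/\S/ ;      # drop blank lines

    $state_href->{vars}{kept}++ ;
    $state_href->{output} = $line ;
    return ;
}

sub app_end
{
    my ($app, $opts_href, $state_href) = @_ ;

    # extra line that does not correspond to any input line
    $app->write_output("# kept $state_href->{vars}{kept} line(s)") ;
    return ;
}
```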

_start_output($state_href, $opts_href)

Start of output file

_handle_output($state_href, $opts_href)

Write out line (if required)

_end_output($state_href, $opts_href)

End of output file

_open_output($state_href, $opts_href)

Open the file (or STDOUT) depending on settings

_close_output($state_href, $opts_href)

Close the file if open

_wr_output($state_href, $opts_href, $line)

Write the line to the output file

DIAGNOSTICS

Setting the debug flag to level 1 prints some debug messages to STDOUT; setting it to level 2 prints more verbose messages.

AUTHOR

Steve Price <sdprice at cpan.org>

BUGS

None that I know of!