NAME

Text::Scraper - Structured data from (un)structured text

SYNOPSIS

    use Text::Scraper;

    use LWP::Simple;
    use Data::Dumper;

    #
    # 1. Get our template and source text
    #
    my $tmpl = Text::Scraper->slurp(\*DATA);
    my $src  = get('http://search.cpan.org/recent') or die "Could not fetch page\n";
    
    #
    # 2. Extract data from source
    #
    my $obj  = Text::Scraper->new(tmpl => $tmpl);
    my $data = $obj->scrape($src);

    #
    # 3. Do something really neat...(left as exercise)
    #
    print "Newest Submission: ", $data->[0]{submissions}[0]{name},  "\n\n";
    print "Scraper model:\n",    Dumper($obj),                      "\n\n";
    print "Parsed  model:\n",    Dumper($data) ,                    "\n\n";

    __DATA__

    <div class=path><center><table><tr>
    <?tmpl stuff pre_nav ?>
    <td class=datecell><span><big><b> <?tmpl var date_string ?> </b></big></span></td>
    <?tmpl stuff post_nav ?>
    </tr></table></center></div>

    <ul>
    <?tmpl loop submissions ?>
     <li><a href="<?tmpl var link ?>"><?tmpl var name ?></a>
      <?tmpl if has_description ?>
      <small> -- <?tmpl var description ?></small>
      <?tmpl end has_description ?>
     </li>
    <?tmpl end submissions ?>
     </ul>

ABSTRACT

Text::Scraper provides a fully functional base class to quickly develop screen scrapers and other text extraction tools. Programmatically generated text, such as dynamic webpages, is trivially reverse engineered.

Using templates, the programmer is freed from staring at fragile, heavily escaped regular expressions, mapping capture groups to named variables or wrestling with the DOM and badly formed HTML. In addition, extracted data can be hierarchical, which is beyond the capabilities of vanilla regular expressions.

Text::Scraper's functionality overlaps some existing CPAN modules - Template::Extract and WWW::Scraper.

Text::Scraper is much more lightweight than either and has a more general application domain than the latter. It has no dependencies on other frameworks, modules or design-decisions. On average, Text::Scraper benchmarks around 250% faster than Template::Extract - and uses significantly less memory.

Unlike both existing modules, Text::Scraper generalizes its functionality to allow the programmer to refine template capture groups beyond (.*?), fully redefine the template syntax and introduce new template constructs bound to custom classes.

BACKGROUND

Using templates is a popular method of separating visual presentation from programming logic - particularly popular in programs generating dynamic webpages. Text::Scraper reverses this process, using templates to extract the data back out of the surrounding presentation.

If you are familiar with templating concepts, then the SYNOPSIS should be sufficient to get you started. If not, I would recommend reading the documentation for HTML::Template - a module whose syntax and terminology are very similar to Text::Scraper's.
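To make the idea of running a template backwards concrete, here is a minimal sketch in plain Perl. It is not Text::Scraper's implementation: the helper template_to_regex is hypothetical, it handles only var tags, and the sample text and version number are invented. It only illustrates the general technique of quoting the literal text and turning each tag into a (.*?) capture group.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Scraper): turn a template that
# contains only "var" tags into a regex, remembering tag names in order.
sub template_to_regex {
    my ($tmpl) = @_;
    my @names;
    my $pattern = '';

    # split with a capturing group returns literal chunks interleaved
    # with the captured tag names.
    my @parts = split /<\?tmpl\s+var\s+(\w+)\s*\?>/, $tmpl;
    while (@parts) {
        my $literal = shift @parts;
        $pattern .= quotemeta $literal;      # literal text must match exactly
        if (@parts) {
            push @names, shift @parts;       # tag name
            $pattern .= '(.*?)';             # default capture group
        }
    }
    return (qr/$pattern/s, \@names);
}

my ($re, $names) = template_to_regex(
    '<b><?tmpl var name ?></b> (<?tmpl var version ?>)'
);

# Apply the generated regex and map captures back to tag names.
my %data;
if (my @values = '<b>Text-Scraper</b> (0.02)' =~ $re) {
    @data{@$names} = @values;
}

print "$_ => $data{$_}\n" for sort keys %data;
```

A real scraper also has to deal with nested and repeated regions, which is where the Branch tags described below come in.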

DESCRIPTION

Template Tags are classed as Leaves or Branches. As in XML, Branches must have an associated closing tag; Leaves must not. By default, Leaf nodes return SCALARs and Branch nodes return ARRAYs of HASHes - each array element mapping to a matched sub-sequence. Blessing or filtering this data is left as an exercise for subclasses.

The default syntax is based on the XML preprocessor syntax:

    <?tmpl TYPE NAME [ATTRIBUTES] ?>
    

and for Branches:

    <?tmpl TYPE NAME [ATTRIBUTES] ?>  
        ...  
    <?tmpl end NAME ?>    

By default, Tags must be named and any closing tag must include the name of the opening tag it is closing. Attributes have the same syntax as XML attributes - but (similar to Perl regular expressions) can use any non-bracket punctuation character as quotation delimiters:

    <?tmpl var foo bar="baz" blah=/But dont "quote" me on that!/ ?> 
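The delimiter rule can be illustrated with a short regular expression. This is only a sketch of one way such attributes could be parsed; parse_attributes is a hypothetical helper and is not taken from the module's source.

```perl
use strict;
use warnings;

# Hypothetical parser for NAME=<delim>VALUE<delim> pairs.
sub parse_attributes {
    my ($text) = @_;
    my %attr;

    # The delimiter is any single character that is not a word character,
    # whitespace or a bracket; \2 requires the same character to close
    # the value.
    while ($text =~ /(\w+)=([^\w\s\[\](){}<>])(.*?)\2/g) {
        $attr{$1} = $3;
    }
    return \%attr;
}

my $attr = parse_attributes(q{bar="baz" blah=/But dont "quote" me on that!/});

print "$_ => $attr->{$_}\n" for sort keys %$attr;
```

Because the closing delimiter must be the same character as the opening one, the double quotes inside the slash-delimited value are kept as ordinary text.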

The only attribute acted on by the default tag classes is regex - used to refine how the Tag is translated into a regular-expression capture group:

    <?tmpl var naiveEmailAddress  regex="([\w\d\.]+\@[\w\d\.]+)"  ?>

This can be used to further filter the parsed data - similar to using grep:

    <?tmpl var onlyFoocomEmailAddresses regex="([\w\d\.]+@(?:foo\.com))" ?>

Each tag should create only one capture group - but it is fine to make the outer group non-capturing:

    <?tmpl var dateJustMonth regex="(?:\d+ (\S+) \d+)" ?>

The above would capture only the month field in dates formatted as 02 July 1979.
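The patterns used as regex attributes are ordinary Perl regular expressions, so they can be tried out on their own before being placed in a template. The snippet below is a stand-alone check using only core Perl; the sample strings and the address are made up for illustration.

```perl
use strict;
use warnings;

# The dateJustMonth pattern: the outer group is non-capturing, so the
# match yields exactly one capture.
my $date_re  = qr/(?:\d+ (\S+) \d+)/;
my @captures = 'Born on 02 July 1979 in Scotland' =~ $date_re;
my ($month)  = @captures;

# The naiveEmailAddress pattern, applied to an invented address.
my $email_re = qr/([\w\d\.]+\@[\w\d\.]+)/;
my ($email)  = 'Contact: someone@example.com today' =~ $email_re;

print "captures: ", scalar(@captures), "\n";
print "month:    $month\n";
print "email:    $email\n";
```

Checking the number of captures this way is a quick guard against accidentally introducing a second capturing group.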

Default Tags

The default tags provided by Text::Scraper are typical for basic scraping but can be subclassed for additional functionality. All the default tags are demonstrated in the SYNOPSIS:

var

Vars represent strings of text in a template. They are instances of Text::Scraper::Leaf.

stuff

Stuff tags represent spans of text that are of no interest in the extracted data, but can ease parsing in certain situations. They are instances of Text::Scraper::Ignorable - a subclass of Text::Scraper::Leaf.

loop

Loops represent repeated information in a template and are extracted as an array of hashes. They are instances of Text::Scraper::Branch.

if

A conditional region in the template. If not present, the parent scope will contain a false value under the tag's name. Otherwise the value will be true and any tags inside the if's scope will also be exported to its parent scope.

These are instances of Text::Scraper::Conditional.
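Putting the tag descriptions together, the SYNOPSIS template might yield a structure shaped roughly like the one below. This is a hand-written stand-in rather than real scraper output: every value is invented, and the placement of date_string and the exact true and false values used for has_description are assumptions. Use Data::Dumper on a real result to confirm the layout.

```perl
use strict;
use warnings;

# Hand-written stand-in for the return value of scrape(); values are invented.
my $data = [
    {
        date_string => '2nd March 2005',
        submissions => [
            {   # "if" region present: flag is true, inner var is exported
                link            => '/~author/Foo-Bar-0.01/',
                name            => 'Foo-Bar-0.01',
                has_description => 1,
                description     => 'An example distribution',
            },
            {   # "if" region absent: flag is false, no description key
                link            => '/~author/Baz-0.02/',
                name            => 'Baz-0.02',
                has_description => 0,
            },
        ],
    },
];

# Walk the loop data, using the conditional flag to guard optional fields.
my @lines;
for my $item (@{ $data->[0]{submissions} }) {
    my $line = $item->{name};
    $line .= " -- $item->{description}" if $item->{has_description};
    push @lines, $line;
}

print "$_\n" for @lines;
```

Note that the stuff tags (pre_nav and post_nav) do not appear, since their matches are ignored.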

User API

These methods alone are sufficient for a basic scraping session:

my $string = Text::Scraper->slurp( STRING|GLOBREF )

Static utility method that returns the contents of a file as a string, given either a filename or a filehandle (glob reference).

my $object = Text::Scraper->new(HASH)

Returns a new Text::Scraper object. Optional parameters are:

tmpl

A template as a string

syntax

A Text::Scraper::Syntax instance. See "Defining a custom syntax".

$obj->compile(STRING)

Only required for recompilation or if no tmpl parameter is passed to the constructor.

my $data = $obj->scrape(STRING)

Extract data from STRING based on compiled template.

Subclass API

Text::Scraper allows its users to define custom tags and bless captured data into custom classes. Because Text::Scraper objects are prototype based, a subclass can both inherit the scraping logic and also encapsulate any particular instance of the scraped data.

During template compilation, a single instance of each tag type is created as the prototype object. Its attributes will be related to the tag, any supplied tag attributes, etc. During scraping, each prototype is invoked to scrape the relevant sub-text against its sub-template.

$subclass->on_create()

General construction callback. Text::Scraper objects are prototype based so overriding the constructor is not recommended. Objects are hash based; any constructor arguments become attributes of the new instance before invoking this method.

$subclass->on_destroy()

General destruction callback. Text::Scraper uses the DESTROY hook so any custom functionality is best implemented here.

$subclass->on_data(SCALAR)

This is the subclass's opportunity to bless or otherwise process any parsed data. The return value from on_data is added to the generated output data-structure. By default these values are just returned unblessed.

The SCALAR argument depends on the class of tag. For Text::Scraper::Leaf subclasses, SCALAR will be the matched text. For Text::Scraper::Branch subclasses, SCALAR will be a reference to an array of hashes. Below is an example of two custom tag classes that bless captured data into the same class:

    package MyLeaf;
    use base "Text::Scraper::Leaf";
    sub on_data
    {
        my ($self, $match) = @_;
        return $self->new(value => $match);
    }

    package MyBranch; 
    use base "Text::Scraper::Branch";
    sub on_data
    {
        my ($self, $matches) = @_;
        @$matches = map {  $self->new(%$_)  } @$matches;
        return $matches;
    }

my $regex = $subclass->to_regex()

Returns this node's representation as a regular expression, to be used in a compiled template. If you find yourself using a particular regex attribute a lot, it might be easier to define a custom tag that overrides this method.

my $boolean = $subclass->ignore()

Returns a boolean value stating whether the parser should ignore the data captured by this object.

$subclass->proto() $subclass->proto(SCALAR)

Utility method to allow Tag instances to access (attributes of) their prototype. This can be safely called from a prototype object, which just points to itself.

my @children = $subclass->nodes()

Returns instance data in-order, including any present conditional data.

Defining a custom syntax

The two areas of customization are Tag Syntax and Tag Classes. The defaults are encapsulated in the Text::Scraper::Syntax class.

The interested reader is encouraged to copy the source of the default syntax class and play around with changes. All the overridable methods begin with define_* and are fairly self-explanatory or well commented.

Any new Tag classes should be subclassed from either Text::Scraper::Leaf, Text::Scraper::Branch, Text::Scraper::Ignorable or Text::Scraper::Conditional.

BUGS & CAVEATS

Rather than write a slow parser in pure Perl, Text::Scraper farms a lot of the work out to Perl's optimized regular-expression engine. This works well in general but, unfortunately, doesn't allow for a lot of error feedback during scraping. A fair understanding of the pros and cons of using regular expressions in this manner can be beneficial, but is outside the scope of this documentation.

Data::Dumper can be indispensable in following the success of your scraping. It can be safely applied to a Text::Scraper instance to analyze the parser's object model, or to the return value from a scrape() invocation to analyze what was parsed.

Bug reports and suggestions welcome.

AUTHOR

Copyright (C) 2005 Chris McEwan - All rights reserved.

Chris McEwan <mcewan@cpan.org>

LICENSE

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.