NAME

scrape - command-line frontend to HTML::ListScraper

SYNOPSIS

 scrape --core=all sample.html

 scrape --core=list [ --min-count=10 ] [ --detail=all ] [ --shapeless ]
        [ --ignore=b,i,em,strong,wbr ] [ --export=seq.txt ] sample.html

 scrape --core=item --import=seq.txt sample.html

 scrape --whole sample.html

 scrape --core=all --detail=all --acquire=Perl.html
        'http://search.yahoo.com/search?p=Perl'

DESCRIPTION

This script processes an HTML page with HTML::ListScraper and prints the result as YAML (down to the tag sequences, which are YAML scalars formatted by HTML::ListScraper::Interactive). It is meant for interactive exploration of HTML::ListScraper results and for fine-tuning its settings for a specific scraping application.

Every invocation requires exactly one source, either a file or a URL. URLs are recognized by their http scheme: source names that don't start with http:// are interpreted as file names, unless the --acquire switch is used. All other command-line switches are optional and mutually independent. Note that with no switches, the script doesn't output anything. The switches are as follows:

core

Show found repeats. The value is a string, one of:

item (or just "i")

Show only the first sequence instance.

list (or just "l")

Show all instances of the first sequence.

all (or just "a")

Show all instances of all found sequences.

By default, no matches are shown. When they are shown, the YAML document corresponding to an HTML::ListScraper::Sequence has the sequence length in the field len, the repeat count in the field count, and a YAML sequence whose items correspond to HTML::ListScraper::Instance objects. Each item starts with a field keyed by the value of HTML::ListScraper::Instance::match, whose value is the start position, followed by score (for approximate matches only) and inst with the actual tag sequence. The tag sequence is formatted by HTML::ListScraper::Interactive::format_tags, with formatting options depending on the value of the --detail command-line switch.

shapeless

Boolean switch, sets HTML::ListScraper::shapeless to true.

min-count

The value is an integer greater than 1, used to set HTML::ListScraper::min_count.
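When tuning this threshold, it can help to run the same page at several values and compare only the counts. The following is a minimal sketch, not part of scrape: sweep_min_count is a hypothetical helper name, and it assumes scrape is on the PATH. It uses --core=all so that something is printed, and --detail=none so that only the sequence and instance summary appears.

```shell
# Hypothetical helper (not part of scrape): run scrape once per min-count
# value, rejecting values the documentation rules out (non-integers, <= 1).
sweep_min_count() {
    page=$1
    shift
    for n in "$@"; do
        case "$n" in
            ''|*[!0-9]*) echo "not an integer: $n" >&2; return 1 ;;
        esac
        [ "$n" -gt 1 ] || { echo "min-count must be greater than 1: $n" >&2; return 1; }
        echo "# --min-count=$n"
        # --detail=none keeps the output down to lengths, counts and positions.
        scrape --core=all --detail=none --min-count="$n" "$page"
    done
}
```

A call such as sweep_min_count sample.html 2 5 10 would then print one labelled result per threshold.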

detail

Specifies formatting of found tag sequences. The value is a string, one of:

none

Don't show the matched tag sequences at all. This is useful for seeing just how many sequences were found, how many instances they have, and where.

tags

Show just the tags, without text and links. This is the default value.

text

Show tags and text.

attributes

Show tags with links.

all

Show all content fields of HTML::ListScraper::Tag: tags, text and links.
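To see how the five values differ on a given page, one option is to run them back to back. This is only a sketch: compare_details is a hypothetical wrapper name, and it assumes scrape is on the PATH.

```shell
# Hypothetical wrapper (not part of scrape): run the same query once per
# documented --detail value so the outputs can be compared side by side.
compare_details() {
    page=$1
    for level in none tags text attributes all; do
        echo "# --detail=$level"
        scrape --core=list --detail="$level" "$page"
    done
}
```

Calling compare_details sample.html prints each variant under a comment line naming the level used.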

whole

Boolean switch. When used, scrape first outputs a YAML document containing a single YAML scalar: the whole sequence maintained by HTML::ListScraper. Note that the sequence is formatted without attributes, without text and with tag positions, irrespective of the value of --detail.

ignore

A comma-separated list of tags the HTML parser should ignore. The list items must not contain slashes or angle brackets. For every name in the list, both the opening and the closing tag are ignored. The default is b, i, em, strong; when specifying the value explicitly, you probably want to include these tags in it.
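Because an explicit value replaces the default rather than extending it, a small helper can keep the defaults in place. The sketch below is not part of scrape: ignore_list is a hypothetical name, and the validation simply mirrors the restriction on slashes and angle brackets stated above.

```shell
# Hypothetical helper (not part of scrape): start from the documented
# defaults and append extra tag names, joined with commas for --ignore.
ignore_list() {
    list="b,i,em,strong"
    for tag in "$@"; do
        case "$tag" in
            */*|*'<'*|*'>'*)
                # Bare names only: "b", not "<b>" or "/b".
                echo "invalid tag name: $tag" >&2
                return 1 ;;
        esac
        list="$list,$tag"
    done
    echo "$list"
}

ignore_list wbr    # prints b,i,em,strong,wbr
```

The result can be passed straight to the switch, for example scrape --core=list --ignore="$(ignore_list wbr)" sample.html.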

export

Instructs scrape to dump the first found sequence into the file specified by the option's value. If the file already exists, it's overwritten. When no sequence is found, nothing is dumped. Note that the sequence is formatted with just tags, irrespective of the value of --detail.

import

Instructs scrape to call HTML::ListScraper::find_known_sequence instead of HTML::ListScraper::find_sequences, with arguments read from the file specified by the option's value. Lines of that file are converted to tag names by HTML::ListScraper::Interactive::canonicalize_tags.
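The --export and --import switches are meant to be used as a pair, as the SYNOPSIS suggests. A possible two-step workflow is sketched below; lock_sequence is a hypothetical helper name, seq.txt is an arbitrary file name, and scrape is assumed to be on the PATH.

```shell
# Hypothetical helper (not part of scrape): discover a sequence once,
# then search for that known sequence instead of rediscovering it.
lock_sequence() {
    page=$1
    seq_file=${2:-seq.txt}
    # Remove any stale file so an old sequence is never reused by accident.
    rm -f "$seq_file"
    # Step 1: find repeats and dump the first found sequence.
    scrape --core=list --export="$seq_file" "$page" || return 1
    # --export writes nothing when no sequence is found.
    [ -s "$seq_file" ] || { echo "no sequence exported" >&2; return 1; }
    # Step 2: look for the exported sequence only.
    scrape --core=item --import="$seq_file" "$page"
}
```

Between the two steps, the exported file can be inspected or edited by hand, since its lines are read back as tag names.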

acquire

Instructs scrape to save the downloaded HTML into the file specified by the option's value. If the file already exists, it's overwritten. Using this switch causes scrape to interpret the source as a URL, irrespective of its scheme, and pass it to LWP.
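The interplay between the source name and --acquire can be summarized in a few lines of shell. This is a sketch of the documented rule only, not the script's actual code; source_kind and fetch_then_explore are hypothetical names, and scrape is assumed to be on the PATH.

```shell
# Hypothetical helper: report how a source name would be interpreted
# according to the documentation. Pass "acquire" as the second argument
# to model a run that uses the --acquire switch.
source_kind() {
    if [ "${2:-}" = "acquire" ]; then
        echo "url"      # --acquire forces URL handling, whatever the scheme
        return
    fi
    case "$1" in
        http://*) echo "url" ;;
        *)        echo "file" ;;
    esac
}

# Hypothetical helper: download a page once, then explore the saved copy.
fetch_then_explore() {
    url=$1
    saved=$2
    scrape --core=all --acquire="$saved" "$url" || return 1
    # The second run reads the local file, so the page is not fetched again.
    scrape --core=list --detail=all "$saved"
}

source_kind sample.html            # prints file
source_kind sample.html acquire    # prints url
```

For example, fetch_then_explore 'http://search.yahoo.com/search?p=Perl' Perl.html saves the page as Perl.html and then runs a second query against that file.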

AUTHOR

Vaclav Barta, <vbar@comp.cz>

COPYRIGHT & LICENSE

Copyright 2007 Vaclav Barta, all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.