scrape - command-line frontend to HTML::ListScraper
scrape --core=all sample.html
scrape --core=list [ --min-count=10 ] [ --detail=all ] [ --shapeless ] [ --ignore=b,i,em,strong,wbr ] [ --export=seq.txt ] sample.html
scrape --core=item --import=seq.txt sample.html
scrape --whole sample.html
scrape --core=all --detail=all --acquire=Perl.html 'http://search.yahoo.com/search?p=Perl'
This script processes an HTML page with HTML::ListScraper and shows the result as YAML (down to the tag sequences, which are YAML scalars formatted by HTML::ListScraper::Interactive). It's meant for interactive exploration of HTML::ListScraper results and for fine-tuning its settings for a specific scraping application.
For every invocation, the single source file or URL is mandatory. URLs are recognized by their http scheme - source names that don't start with http:// are normally interpreted as file names (the --acquire switch overrides this). All command-line switches are optional and mutually independent. Note that with no switches, the script doesn't output anything. The switches are as follows:
==head2 core
Show found repeats. Value is a string, one of:
item: Show only the first sequence instance.
list: Show all instances of the first sequence.
all: Show all instances of all found sequences.
By default, no matches are shown. When they are shown, each YAML document corresponds to an HTML::ListScraper::Sequence: it has the sequence length as the YAML field len, the repeat count as count, and a YAML sequence whose items correspond to HTML::ListScraper::Instance objects. Each item starts with a field keyed by the value of HTML::ListScraper::Instance::match, whose value is the start position, followed by score (for approximate matches only) and inst, which holds the actual tag sequence. The tag sequence is formatted by HTML::ListScraper::Interactive::format_tags, with formatting options that depend on the value of the --detail command-line switch.
==head2 shapeless
Boolean switch, sets HTML::ListScraper::shapeless to true.
==head2 min-count
Value is an integer greater than 1, used to set HTML::ListScraper::min_count.
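The constraint on --min-count can be illustrated with a small argument-parsing sketch. It is written in Python with argparse purely for illustration and says nothing about how scrape itself parses its command line; the program name scrape-sketch and the helper min_count are made up for this example.

```python
import argparse


def min_count(text):
    """Hypothetical validator: accept only integers greater than 1."""
    value = int(text)  # a non-numeric value raises ValueError, which argparse reports
    if value <= 1:
        raise argparse.ArgumentTypeError("min-count must be an integer greater than 1")
    return value


# Illustrative parser covering two of the switches described in the text.
parser = argparse.ArgumentParser(prog="scrape-sketch")
parser.add_argument("--min-count", type=min_count)
parser.add_argument("--shapeless", action="store_true")  # boolean switch
parser.add_argument("source")  # the mandatory file name or URL

args = parser.parse_args(["--min-count=10", "--shapeless", "sample.html"])
```

With these arguments, args.min_count holds the integer 10 and args.shapeless is true; a value such as --min-count=1 would be rejected by the validator.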
==head2 detail
Specifies formatting of found tag sequences. Value is a string, one of
Don't show the matches at all. This is useful to see just how many sequences were found, how many instances they have and where.
Show just the tags, without text and links. This is the default value.
Show tags and text.
Show tags with links.
Show all content fields of HTML::ListScraper::Tag: tags, text and links.
==head2 whole
Boolean switch. When used, scrape outputs the whole sequence maintained by HTML::ListScraper as the first YAML document, which contains a single YAML scalar. Note that the sequence is formatted without attributes, without text and with tag positions, irrespective of the value of --detail.
==head2 ignore
A comma-separated list of tags the HTML parser should ignore. The list items shouldn't contain any slashes or angle brackets. For every name in the list, both the opening and the closing tag are ignored. The default is b, i, em, strong; when specifying the value explicitly, you probably want to include these tags in it.
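The rules for the ignore list can be sketched as follows. This is an illustration in Python, not the script's own code: the function name parse_ignore is hypothetical, and both skipping empty items and raising an error for forbidden characters are assumptions, since the text only says the items shouldn't contain them.

```python
DEFAULT_IGNORE = ("b", "i", "em", "strong")  # default stated in the text


def parse_ignore(value=None):
    """Turn a comma-separated --ignore value into a list of bare tag names."""
    if value is None:
        # No explicit value: fall back to the documented default.
        return list(DEFAULT_IGNORE)
    # Assumption: empty items (e.g. from a trailing comma) are skipped.
    names = [name for name in value.split(",") if name]
    for name in names:
        # Items are bare names: no slashes and no angle brackets.
        if any(ch in name for ch in "/<>"):
            raise ValueError("tag name must not contain '/', '<' or '>': " + name)
    return names


# An explicit value replaces the default, so the default tags are repeated
# here and wbr is added, as in the synopsis.
extended = parse_ignore("b,i,em,strong,wbr")
```

Because an explicit value replaces the default rather than extending it, the call above lists the four default tags again before adding wbr.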
==head2 export
Instructs scrape to dump the first found sequence into the file specified by the option's value. If the file already exists, it's overwritten. When no sequence is found, nothing is dumped. Note that the sequence is formatted with just tags, irrespective of the value of --detail.
==head2 import
Instructs scrape to call HTML::ListScraper::find_known_sequence instead of HTML::ListScraper::find_sequences, with arguments read from the file specified by the option's value. Lines of that file are converted to tag names by HTML::ListScraper::Interactive::canonicalize_tags.
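The --export and --import switches form a round trip through a plain file. The sketch below models that round trip in Python under stated assumptions: one tag per line in the file (suggested by "lines of that file are converted to tag names"), whitespace stripping as a stand-in for canonicalize_tags, whose actual rules are not described here, and made-up tag strings. The function names are hypothetical.

```python
import os
import tempfile


def export_sequence(path, tags):
    """Write tags one per line; write nothing when no sequence was found."""
    if not tags:
        return False  # nothing found, nothing dumped
    with open(path, "w") as handle:  # mode "w" overwrites an existing file
        for tag in tags:
            handle.write(tag + "\n")
    return True


def import_sequence(path):
    """Read the file back, one tag per non-empty line."""
    with open(path) as handle:
        # Stripping whitespace is only a placeholder for canonicalize_tags.
        return [line.strip() for line in handle if line.strip()]


with tempfile.TemporaryDirectory() as workdir:
    path = os.path.join(workdir, "seq.txt")
    # Hypothetical tag sequence; real output comes from format_tags.
    export_sequence(path, ["<li>", "<a>", "</a>", "</li>"])
    round_trip = import_sequence(path)
```

In this model the imported list equals the exported one, which is the property that makes a sequence found in one run reusable with --import in a later run.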
==head2 acquire
Instructs scrape to save the downloaded HTML into the file specified by the option's value. If the file already exists, it's overwritten. Using this switch causes scrape to interpret the source as a URL, irrespective of its scheme, and pass it to LWP.
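How the source argument is interpreted, with and without --acquire, can be summarized in a short sketch. It is a literal Python reading of the two rules stated in this document, not the script's implementation; the function name classify_source is hypothetical.

```python
def classify_source(source, acquire=None):
    """Return "url" or "file" for a source argument, per the documented rules."""
    # --acquire forces URL handling, irrespective of the scheme.
    if acquire is not None:
        return "url"
    # Otherwise only names starting with http:// are treated as URLs.
    if source.startswith("http://"):
        return "url"
    return "file"


plain = classify_source("sample.html")
remote = classify_source("http://search.yahoo.com/search?p=Perl")
forced = classify_source("sample.html", acquire="Perl.html")
```

Under this reading, a source that does not start with http:// is fetched as a URL only when --acquire is given.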
Vaclav Barta, <vbar@comp.cz>
Copyright 2007 Vaclav Barta, all rights reserved.
This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.