HTML::Content::Extractor - Recieving a main text of publication from HTML page and main media content that is bound to the text
my $obj = HTML::Content::Extractor->new(); $obj->analyze($html, {class => ["comment", "tags", "blog", "theme", "footer", "head"]}); my $main_text = $obj->get_main_text(); my $main_images = $obj->get_main_images(1, {src => "logo", alt => ["logo", "crazy"]}, 150); my $raw_text = $obj->get_raw_text(); my $main_text_we = $obj->get_main_text_with_elements(1, ["a", "b", "br", "strike", ...]); print $main_text, "\n\n"; print "Images:\n"; foreach my $elem (@$main_images) { print $elem->{prop}->{src}, "\n"; } # html elements my $obj = HTML::Content::Extractor->new(); $obj->build_tree($html); my $tree = $obj->get_tree(); my $i = -1; while( my $element = $obj->get_element_by_name("div", ++$i) ) { print "<", $element->{name}; foreach my $key (keys %{$element->{prop}}) { print " ", $key, '="', $element->{prop}->{$key}, '"'; } print ">\n"; }
This module analyzes an HTML document and extracts the main text (for example front page article contents on the news site) and all related images.
my $obj = HTML::Content::Extractor->new();
Creates and prepares the structure for the subsequent analysis and parsing HTML.
$obj->analyze($html, [hashref]);
Creates an HTML document tree and analyzes it. [hashref] - optional parameter which may (or may not) contain key-value pairs, where key is name of html tag attribute and its value is stop word, which will be ignored with all child tags. Useful for common tags for header/footer/logo etc.
# UTF-8 my $main_text = $obj->get_main_text(1); # or not my $main_text = $obj->get_main_text(0); # default UTF-8 is on
Return plain text.
# UTF-8 my $raw_text = $obj->get_raw_text(1); # or not my $raw_text = $obj->get_raw_text(0); # default UTF-8 is on
Return the main text without post-processing (saving all html tags)
# UTF-8 my $main_text_we = $obj->get_main_text_with_elements(1, ["span", ...]); # or not my $main_text_we = $obj->get_main_text_with_elements(0, ["span", ...]); # default UTF-8 is on
Returns the main text while saving selected html tags. Post-processing is skipped
# UTF-8 my $main_images = $obj->get_main_images(1, [hashref], [min_width]); # or not my $main_images = $obj->get_main_images(0, [hashref], [min_width]); # default UTF-8 is on
[hashref] - optional parameter which may (or may not) contain key-value pairs, where key is name of html tag attribute and its value is stop word, which will be ignored with all child tags. Useful for common tags for header/footer/logo etc. [min_width] - optional parameter
my $res = $obj->build_tree($html);
Build flat html tree and returns 1
my $res = $obj->build_tree($html); my $tree = $obj->get_tree(1);
Returns ARRAY with flat html tree
my $element_tree = $obj->get_tree_by_element_id($element->{id}, 1);
Returns ARRAY with flat html tree by element id
my $element = $obj->get_element_by_name("div", 0);
Returns HASH or undef with element by tag name. ARGS: 1) tag name 2) offset
Structure of this element:
$element = { id => <number>, name => <text>, tag_id => <number>, prop => <HASH>, level => <number>, start => <number>, stop => <number>, bstart => <number>, bstop => <number> };
my $element = $obj->get_stat_by_element_id($element->{id});
Returns HASH with element stats by element id. HASH included: count, all, words, AI_TEXT, AI_LINK, AI_IMG, all_AI_LINK, all_AI_LINK, all_AI_IMG
my $element = $obj->get_child(0);
my $element = $obj->get_parent();
my $element = $obj->get_curr_element();
my $element = $obj->get_prev_element();
my $element = $obj->get_next_element_curr_level();
my $element = $obj->get_prev_element_curr_level();
my $element = $obj->set_position($element);
Set position by element. Returns this element or undef if something is wrong
undef $obj;
Cleaning of all internal structures (HTML tree and other)
Alexander Borisov <lex.borisov@gmail.com>
This software is copyright (c) 2013 by Alexander Borisov.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install HTML::Content::Extractor, copy and paste the appropriate command in to your terminal.
cpanm
cpanm HTML::Content::Extractor
CPAN shell
perl -MCPAN -e shell install HTML::Content::Extractor
For more information on module installation, please visit the detailed CPAN module installation guide.