HTML::Content::Extractor - Extracts the main text of a publication from an HTML page, along with the main media content bound to that text
    use HTML::Content::Extractor;

    my $obj = HTML::Content::Extractor->new();
    $obj->analyze($html);

    my $main_text   = $obj->get_main_text();
    my $main_images = $obj->get_main_images();

    print $main_text, "\n\n";

    print "Images:\n";
    foreach my $url (@$main_images) {
        print $url, "\n";
    }
This module analyzes an HTML document and extracts its main text (for example, the body of a front-page article on a news site) together with all images related to that text.
my $obj = HTML::Content::Extractor->new();
Creates the object and prepares the internal structures used for the subsequent parsing and analysis of HTML.
$obj->analyze($html);
Creates an HTML document tree and analyzes it.
    # UTF-8
    my $main_text = $obj->get_main_text(1);
    # or not
    my $main_text = $obj->get_main_text(0);
    # default UTF-8 is on
Returns the main text as plain text.
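Whether the text comes back flagged as UTF-8 matters when it is printed or written to a file. The sketch below shows one way to normalize either form to bytes before output. It assumes that `get_main_text(1)` returns a character string (UTF-8 flag on) and `get_main_text(0)` returns raw bytes; this reading of the flag is an assumption, and `to_output_bytes` is a hypothetical helper that is not part of this module.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not part of HTML::Content::Extractor).
# Assumption: get_main_text(1) yields a character string (UTF-8 flag on),
# while get_main_text(0) yields raw bytes.
sub to_output_bytes {
    my ($text) = @_;
    return '' unless defined $text;

    # Character strings are encoded to UTF-8 bytes; byte strings pass through.
    return utf8::is_utf8($text) ? encode('UTF-8', $text) : $text;
}
```

With an analyzed object this could be used as `print to_output_bytes($obj->get_main_text(1)), "\n";`, which avoids "Wide character" warnings on a handle that has no encoding layer.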
    # UTF-8
    my $main_images = $obj->get_main_images(1);
    # or not
    my $main_images = $obj->get_main_images(0);
    # default UTF-8 is on
Returns an ARRAY reference containing the URLs of the images.
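Because the result is an ARRAY reference, it has to be dereferenced before iterating, and a page may well reference the same picture more than once. The following is a minimal sketch of post-processing such a list; `unique_image_urls` is a hypothetical helper, not part of this module, and whether the module itself ever returns duplicates or empty entries is not stated in this documentation.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Content::Extractor).
# Takes an ARRAY reference such as the one returned by get_main_images()
# and returns a new ARRAY reference without undefined, empty, or repeated
# URLs, keeping the first occurrence of each URL in its original position.
sub unique_image_urls {
    my ($images) = @_;
    return [] unless ref($images) eq 'ARRAY';

    my %seen;
    return [ grep { defined($_) && length($_) && !$seen{$_}++ } @$images ];
}
```

A typical call would be `my $urls = unique_image_urls($obj->get_main_images());` followed by `print "$_\n" for @$urls;`.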
undef $obj;
Frees all internal structures (the HTML tree and others).
Alexander Borisov <lex.borisov@gmail.com>
This software is copyright (c) 2013 by Alexander Borisov.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install HTML::Content::Extractor, copy and paste the appropriate command into your terminal.
cpanm
cpanm HTML::Content::Extractor
CPAN shell
    perl -MCPAN -e shell
    install HTML::Content::Extractor