HTML::AsText::Fix - extends HTML::Element::as_text() to render text properly
version 0.002
use HTML::AsText::Fix;
use HTML::TreeBuilder::XPath;

# fix individual objects
my $tree = HTML::TreeBuilder::XPath->new_from_content($html);
my $guard = HTML::AsText::Fix::object($tree);

# fix deeply nested objects
use URI;
use Web::Scraper;

# First, create your scraper block
my $tweets = scraper {
    process "li.status", "tweets[]" => scraper {
        process ".entry-content", body => 'TEXT';
        process ".entry-date", when => 'TEXT';
        process 'a[rel="bookmark"]', link => '@href';
    };
};

my $res;
{
    # the fix is active only while $guard is alive
    my $guard = HTML::AsText::Fix::global();
    $res = $tweets->scrape( URI->new("http://twitter.com/creaktive") );
}
Consider the following HTML sample:
<p> <span>AAA</span> BBB </p> <h2>CCC</h2> DDD <br> EEE
The HTML::Element::as_text() method stringifies it as AAABBBCCCDDDEEE. Although technically correct, this is far from how a "real" browser renders it. links(1), lynx(1) and w3m(1) break lines this way:
AAABBB
CCC
DDD
EEE
This module tries to reproduce that behavior in the "as_text" method of HTML::Element. By default, the value of $/ is inserted in place of line breaks, and "\x{200b}" (Unicode zero-width space) separates text from adjacent inline elements.
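As a minimal sketch of the difference, the sample above can be run through both the stock method and the hooked one. It assumes HTML::TreeBuilder is installed alongside this module; the variable names are illustrative only.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::AsText::Fix;

my $html = '<p> <span>AAA</span> BBB </p> <h2>CCC</h2> DDD <br> EEE';
my $tree = HTML::TreeBuilder->new_from_content($html);

# Stock behavior: text nodes are concatenated without separators.
my $plain = $tree->as_text;

# Hook this one tree; the fix stays active while $guard is alive.
my $fixed;
{
    my $guard = HTML::AsText::Fix::object($tree);
    $fixed = $tree->as_text;
}

# The fixed text may contain "\x{200b}", so emit UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "plain: $plain\n";
print "fixed: $fixed\n";

$tree->delete;
```

Printing both strings side by side makes it easy to see where $/ and the zero-width space were inserted.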
"span", for instance, is an inline node:
<p><span>A</span>pple</p>
In that case, there should be no space between "A" and "pple". To handle inline nodes properly, only block nodes are separated by line breaks. The following nodes are currently treated as blocks:
p
h1 h2 h3 h4 h5 h6
dl dt dd
ol ul li
dir
address
blockquote
center
del
div
hr
ins
noscript script
pre
br (just to make sense)
(source: http://en.wikipedia.org/wiki/HTML_element#Block_elements)
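The block/inline distinction can be pictured as a simple set lookup. The sketch below is an illustration only, not the module's actual code: %is_block and separator_for() are hypothetical names, and the tag list is copied from the list above.

```perl
use strict;
use warnings;

# Tags treated as blocks, copied from the list above.
# A hash gives constant-time membership tests.
my %is_block = map { $_ => 1 } qw(
    p h1 h2 h3 h4 h5 h6 dl dt dd ol ul li dir address
    blockquote center del div hr ins noscript script pre br
);

# Hypothetical helper: pick the separator to emit around a tag.
# Block tags get the line-break string, everything else the inline one.
sub separator_for {
    my ($tag, $lf_char, $zwsp_char) = @_;
    return $is_block{ lc $tag } ? $lf_char : $zwsp_char;
}

for my $tag (qw(p span br a)) {
    my $kind = $is_block{$tag} ? 'block' : 'inline';
    print "$tag: $kind\n";
}
```

Any tag not in the set, such as "span" or "a", falls through to the inline separator.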
The replacement for as_text(). It is not meant to be called directly; it is injected into HTML::Element.
HTML::AsText::Fix::global() hooks into every HTML::Element within the lexical scope. It returns a guard object; destroying the guard unhooks safely.
Accepts the following options:
lf_char: character inserted between block nodes (by default, $/);
zwsp_char: character inserted between inline nodes (by default, "\x{200b}", Unicode zero-width space);
trim: trim leading/trailing spaces (considers "\x{A0}" a space!);
extra_chars: extra characters to trim;
skip_dels: if true, then text content under "del" nodes is not included in what's returned.
For example, to completely get rid of separation between inline nodes:
my $guard = HTML::AsText::Fix::global(zwsp_char => '');
HTML::AsText::Fix::object() hooks a single object instance. It accepts the same options as "global":
my $guard = HTML::AsText::Fix::object($tree, zwsp_char => '');
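The guard's lifetime controls when the hook is active. The following sketch assumes HTML::TreeBuilder is installed and combines several of the options listed above; the sample markup and variable names are made up for illustration.

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::AsText::Fix;

my $tree = HTML::TreeBuilder->new_from_content(
    '<div>one</div><div>two <del>old</del></div>'
);

my ($inside, $outside);
{
    # The tree is patched only while $guard is in scope.
    my $guard = HTML::AsText::Fix::object(
        $tree,
        lf_char   => "\n",   # separator between block nodes
        zwsp_char => '',     # no separator between inline nodes
        trim      => 1,      # strip leading/trailing spaces
        skip_dels => 1,      # leave out text under "del" nodes
    );
    $inside = $tree->as_text;
}

# $guard has been destroyed, so this call uses the stock method again.
$outside = $tree->as_text;

print "inside:  $inside\n";
print "outside: $outside\n";

$tree->delete;
```

Keeping the guard in the smallest possible block limits the patch to the calls that need it.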
HTML::Element
HTML::Tree
HTML::FormatText
Monkey::Patch
Αριστοτέλης Παγκαλτζής
Toby Inkster
Stanislaw Pusep <stas@sysd.org>
This software is copyright (c) 2012 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.