untemplate - analyze several HTML documents based on the same template
version 0.017
untemplate [options] HTML1 HTML2 [HTML3] [...]
Takes multiple HTML documents generated from the same template and attempts to extract only the data inserted into the original template.
Accepts URLs if AnyEvent::Net::Curl::Queued is installed.
This.
Specify the HTML document encoding (latin1, utf8). UTF-8 is assumed by default.
Enable syntax highlighting for XPath. By default, it is enabled automatically on interactive terminals.
Use the 16 system colors. By default, the 256-color ANSI palette is tried.
Disables the --color option and highlights using HTML/CSS.
Enable the display of "partial" templates, that is, nodes present in only some of the documents. By default, only the nodes present in all documents are displayed.
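The difference between the default output and the "partial" one can be pictured with plain set operations. This is only a sketch of the idea, not how untemplate is implemented; it assumes each document has already been reduced to a set of XPath strings, and the sample paths are made up for illustration.

```python
# Hypothetical per-document sets of XPaths (illustrative data only).
doc_paths = [
    {"/html/head/title", "/html/body/h1", "/html/body/div[@id='ad']"},
    {"/html/head/title", "/html/body/h1"},
    {"/html/head/title", "/html/body/h1", "/html/body/div[@id='ad']"},
]

# Nodes present in every document: what is displayed by default.
in_all = set.intersection(*doc_paths)

# Nodes present in at least one document but not in all: the "partial" ones.
in_some = set.union(*doc_paths) - in_all

print(sorted(in_all))
print(sorted(in_some))
```

Here the `div` node is missing from the second document, so it would only show up when partial templates are displayed.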
Shrink the XPath to the minimal unique identifier. For example:
/html/body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']
could be shortened to:
//input[@name='query']
The shrinking is enabled by default.
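To see why the shortened form is an equally valid identifier, the following sketch checks that both expressions select the same single node. It uses Python's standard `xml.etree.ElementTree` (which supports only a subset of XPath) on a small, hypothetical page; it is not part of untemplate itself.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for the page (hypothetical markup).
page = """<html>
  <body id="cpansearch">
    <form class="searchbox">
      <input name="query"/>
      <input name="submit"/>
    </form>
  </body>
</html>"""

root = ET.fromstring(page)  # root is the <html> element

# ElementTree paths are relative to the root element, so the leading
# "/html" of the full XPath is implied here.
full = root.findall(
    "./body[@id='cpansearch']/form[@class='searchbox']/input[@name='query']"
)
short = root.findall(".//input[@name='query']")

# Both expressions select exactly one node, and it is the same node.
same_node = len(full) == 1 and len(short) == 1 and full[0] is short[0]
print(same_node)
```

The short form stays unique only as long as no other `input` element carries `name='query'`, which is why the shrinking has to be computed per document set.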
Strict mode disables grouping by id, class or name attributes. The grouping is enabled by default.
Specify regex(es) to unmangle id/class attributes. Some CMSs (e.g. WordPress) insert unique identifiers into HTML elements, like:
<body class="post-id-12345">
This tends to break HTML tree analysis. To fix the above case, use --unmangle 'post-id-\d+'. Multiple unmanglers are accepted (--unmangle a --unmangle b).
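The effect of an unmangler can be illustrated outside the tool. The sketch below assumes that the part of the attribute value matched by the regex is simply removed before documents are compared; the actual behavior of untemplate may differ in detail, and the `unmangle` helper and the sample class values are hypothetical.

```python
import re

# One compiled pattern per --unmangle argument (here just one).
unmanglers = [re.compile(r"post-id-\d+")]

def unmangle(value: str) -> str:
    """Strip every unmangler match from an attribute value (assumed behavior)."""
    for rx in unmanglers:
        value = rx.sub("", value)
    # Collapse the whitespace left behind by removed class names.
    return " ".join(value.split())

# Two documents from the same template, differing only in the post id.
a = unmangle("single post-id-12345")
b = unmangle("single post-id-67890")

print(a == b)
```

Once the per-document identifier is gone, both `body` elements carry the same class value and can be treated as the same template node.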
untemplate --color 'http://bash.org/?1839' 'http://bash.org/?2486' | less -R
When the HTML documents are not based on the same template, the results will be empty.
Unfortunately, using any kind of document identifier as part of an element's class/id (a common practice in WordPress themes) is enough for the documents to count as "not the same template".
See the --unmangle option for a work-around.
Stanislaw Pusep <stas@sysd.org>
This software is copyright (c) 2013 by Stanislaw Pusep.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.
To install HTML::Untemplate, copy and paste the appropriate command into your terminal.
cpanm
cpanm HTML::Untemplate
CPAN shell
perl -MCPAN -e shell
install HTML::Untemplate
For more information on module installation, please visit the detailed CPAN module installation guide.