NAME

HTML::Laundry - Perl module to clean HTML by the piece

VERSION

Version 0.0103

SYNOPSIS

    #!/usr/bin/perl -w
    use strict;
    use HTML::Laundry;
    my $laundry = HTML::Laundry->new();
    my $snippet = q{
        <P STYLE="font-size: 300%"><BLINK>"You may get to touch her<BR>
        If your gloves are sterilized<BR></BR>
        Rinse your mouth with Listerine</BR>
        Blow disinfectant in her eyes"</BLINK><BR>
        -- X-Ray Spex, <I>Germ-Free Adolescents<I>
        <SCRIPT>alert('!!');</SCRIPT>
    };
    my $germfree = $laundry->clean($snippet);
    # $germfree is now:
    #   <p>&quot;You may get to touch her<br />
    #   If your gloves are sterilized<br />
    #   Rinse your mouth with Listerine<br />
    #   Blow disinfectant in her eyes&quot;<br />
    #   -- X-Ray Spex, <i>Germ-Free Adolescents</i></p>
        

DESCRIPTION

HTML::Laundry is an HTML::Parser-based HTML normalizer, meant for small pieces of HTML, such as user comments, Atom feed entries, and the like, rather than full pages. Laundry takes these and returns clean, sanitary, UTF-8-based XHTML. The parser's behavior may be changed with callbacks, and the whitelist of acceptable elements and attributes may be updated on the fly.

A snippet is cleaned several ways:

  • Normalized, using HTML::Parser: attributes and elements will be lowercased, empty elements such as <img /> and <br /> will be forced into the empty tag syntax if needed, and unknown attributes and elements will be stripped.

  • Sanitized, using an extensible whitelist of valid attributes and elements based on Mark Pilgrim and Aaron Swartz's work on sanitize.py: tags and attributes which are known to be possible attack vectors are removed.

  • Tidied, using HTML::Tidy or HTML::Tidy::libXML (as available): unclosed tags will be closed and the output generally neatened; future versions may also use tidying to deal with character encoding issues.

  • Optionally rebased, to turn relative URLs in attributes into absolute ones.

HTML::Laundry provides mechanisms to extend the list of known allowed (and disallowed) tags, along with callback methods to allow scripts using HTML::Laundry to extend the behavior in various ways. Future versions may provide additional options for altering the rules used to clean snippets.

Out of the box, HTML::Laundry does not currently know about the <head> tag and its children. For sanitizing full HTML pages, consider using HTML::Scrubber or HTML::Defang.

FUNCTIONS

new

Create an HTML::Laundry object.

    my $l = HTML::Laundry->new();

Takes an optional anonymous hash of arguments:

  • base_uri

    This turns relative URIs, as in <img src="surly_otter.png">, into absolute URIs, which is useful when parsing feeds.

        my $l = HTML::Laundry->new({ base_uri => 'http://example.com/foo/' });
        
  • notidy

    Disable use of HTML::Tidy or HTML::Tidy::libXML, even if they are available on your system.

        my $l = HTML::Laundry->new({ notidy => 1 });
        

initialize

Instantiates the Laundry object properties based on an HTML::Laundry::Rules module.

add_callback

Adds a callback of type "start_tag", "end_tag", "text", "uri", or "output" to the appropriate internal array.

    $l->add_callback('start_tag', sub {
        my ($laundry, $tagref, $attrhashref) = @_;
        # Now, perform actions and return
    });

start_tag, end_tag, text, and uri callbacks that return a false value suppress the output of the element they are processing; this allows additional checks to be made (for instance, images can be allowed only from whitelisted source domains).
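As an illustration of the image check mentioned above, here is a minimal sketch of a start_tag callback that keeps <img> tags only when their src points at a trusted host. The sub name only_trusted_images and the %allowed_host list are hypothetical, and the host is extracted with a deliberately simple regex; a real script would probably prefer the URI module for parsing.

```perl
use strict;
use warnings;

# Hypothetical whitelist of hosts that images may be loaded from.
my %allowed_host = map { $_ => 1 } qw( example.com images.example.com );

# start_tag callback: receives the Laundry object, a reference to the tag
# name, and a hash reference of attributes (as in the example above).
# Returns true to keep the tag and false to suppress it.
sub only_trusted_images {
    my ( $laundry, $tagref, $attrhashref ) = @_;

    # Tags other than <img> pass through untouched.
    return 1 unless lc( ${$tagref} ) eq 'img';

    my $src = $attrhashref->{src};
    return 0 unless defined $src;

    # Take the authority part of an absolute http(s) URL; anything else
    # (relative paths, other schemes) is rejected.
    my ($authority) = $src =~ m{\Ahttps?://([^/?\#]+)}i;
    return 0 unless defined $authority;

    # Reject user-info tricks such as http://example.com:80@evil.test/
    return 0 if index( $authority, '@' ) >= 0;

    # Drop an optional port before looking the host up.
    ( my $host = lc $authority ) =~ s/:\d*\z//;
    return $allowed_host{$host} ? 1 : 0;
}

# Registration, given a Laundry object in $l:
#   $l->add_callback( 'start_tag', \&only_trusted_images );
```

Because relative sources are rejected by this sketch, it is best suited to snippets whose image URLs are already absolute.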

clear_callback

Removes all callbacks of given type.

    $l->clear_callback('start_tag');

clean

Cleans a snippet of HTML, using the ruleset and object creation options given to the Laundry object. The snippet should be passed as a scalar.

    $output1 =  $l->clean( '<p>The X-rays were penetrating' );
    $output2 =  $l->clean( $snippet );

base_uri

Used to get or set the base_uri property, used in URI rebasing.

    my $base_uri = $l->base_uri; # returns current base_uri
    $l->base_uri(q{http://example.com}); # returns 'http://example.com'
    $l->base_uri(''); # unsets base_uri

gen_output

Used to generate the final XHTML output from the internal stack of text and tag tokens. Generally meant to be used internally, but potentially useful for callbacks that require a snapshot of what the output would look like before the cleaning process is complete.

    my $xhtml = $l->gen_output;

empty_elements

Returns a list of the Laundry object's known empty elements: elements such as <img /> or <br /> which must not contain any children.

remove_empty_element

Removes an element (or, if given an array reference, multiple elements) from the "empty elements" list maintained by the Laundry object.

    $l->remove_empty_element(['img', 'br']); # Let's break XHTML!
    

This will not affect the acceptable/unacceptable status of the elements.

acceptable_elements

Returns a list of the Laundry object's known acceptable elements, which will not be stripped during the sanitizing process.

add_acceptable_element

Adds an element (or, if given an array reference, multiple elements) to the "acceptable elements" list maintained by the Laundry object. Items added in this manner will automatically be removed from the "unacceptable elements" list if they are present.

    $l->add_acceptable_element('style');

Elements which are empty may be flagged as such with an optional argument. If this flag is set, all elements provided by the call will be added to the "empty element" list.

    $l->add_acceptable_element(['applet', 'script'], { empty => 1 });

remove_acceptable_element

Removes an element (or, if given an array reference, multiple elements) from the "acceptable elements" list maintained by the Laundry object. These items (although not their child elements) will now be stripped during parsing.

    $l->remove_acceptable_element(['img', 'h1', 'h2']);
    $l->clean(q{<h1>The Day the World Turned Day-Glo</h1>});
    # returns 'The Day the World Turned Day-Glo'
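The acceptable_elements and remove_acceptable_element methods can be combined to shrink the whitelist to a handful of tags. The sketch below assumes only the behavior documented here (acceptable_elements returns a list; remove_acceptable_element accepts an array reference); the helper name restrict_elements_to is hypothetical and not part of HTML::Laundry.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Laundry): shrink the whitelist of
# the Laundry object in $l to just the elements named in @keep.
sub restrict_elements_to {
    my ( $l, @keep ) = @_;
    my %keep = map { lc($_) => 1 } @keep;

    # Everything currently acceptable that is not in the keep list.
    my @drop = grep { !$keep{ lc $_ } } $l->acceptable_elements;

    $l->remove_acceptable_element( \@drop ) if @drop;
    return scalar @drop;    # number of elements removed
}

# Usage, given a Laundry object in $l:
#   restrict_elements_to( $l, qw( p br a em strong ) );
```

As noted above, removed elements are stripped but their child content is kept, so text inside a dropped tag still appears in the output.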

unacceptable_elements

Returns a list of the Laundry object's unacceptable elements, which will be stripped -- including child objects -- during the cleaning process.

add_unacceptable_element

Adds an element (or, if given an array reference, multiple elements) to the "unacceptable elements" list maintained by the Laundry object.

    $l->add_unacceptable_element(['h1', 'h2']);
    $l->clean(q{<h1>The Day the World Turned Day-Glo</h1>});
    # returns null string

remove_unacceptable_element

Removes an element (or, if given an array reference, multiple elements) from the "unacceptable elements" list maintained by the Laundry object. Note that this does not automatically add the element to the "acceptable elements" list.

    $l->clean(q{<script>alert('!')</script>});
    # returns null string
    $l->remove_unacceptable_element( q{script} );
    $l->clean(q{<script>alert('!')</script>});
    # returns "alert('!')"

acceptable_attributes

Returns a list of the Laundry object's known acceptable attributes, which will not be stripped during the sanitizing process.
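Since empty_elements, acceptable_elements, unacceptable_elements, and acceptable_attributes each return a plain list, a script can capture the current rule set in one structure, for example to log it or to compare it before and after a change. The following is a minimal sketch; the helper name rules_snapshot is hypothetical and not part of HTML::Laundry.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of HTML::Laundry): collect the current rules
# of the Laundry object in $l into a hash of sorted array references.
sub rules_snapshot {
    my ($l) = @_;
    return {
        empty_elements        => [ sort { $a cmp $b } $l->empty_elements ],
        acceptable_elements   => [ sort { $a cmp $b } $l->acceptable_elements ],
        unacceptable_elements => [ sort { $a cmp $b } $l->unacceptable_elements ],
        acceptable_attributes => [ sort { $a cmp $b } $l->acceptable_attributes ],
    };
}

# Usage, given a Laundry object in $l:
#   my $before = rules_snapshot($l);
#   $l->add_acceptable_attribute('austen:id');
#   my $after  = rules_snapshot($l);
```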

add_acceptable_attribute

Adds an attribute (or, if given an array reference, multiple attributes) to the "acceptable attributes" list maintained by the Laundry object.

    my $snippet = q{ <p austen:id="3">"My dear Mr. Bennet," said his lady to 
        him one day, "have you heard that <span austen:footnote="netherfield">
        Netherfield Park</span> is let at last?"</p>
    };
    $l->clean( $snippet );
    # returns:
    #   <p>&quot;My dear Mr. Bennet,&quot; said his lady to him one day, 
    #   &quot;have you heard that <span>Netherfield Park</span> is let at 
    #   last?&quot;</p>
    $l->add_acceptable_attribute(['austen:id', 'austen:footnote']);
    $l->clean( $snippet );
    # returns:
    #   <p austen:id="3">&quot;My dear Mr. Bennet,&quot; said his lady to him
    #   one day, &quot;have you heard that <span austen:footnote="netherfield">
    #   Netherfield Park</span> is let at last?&quot;</p>
    

remove_acceptable_attribute

Removes an attribute (or, if given an array reference, multiple attributes) from the "acceptable attributes" list maintained by the Laundry object.

    $l->clean(q{<p id="plugh">plover</p>});
    # returns '<p id="plugh">plover</p>'
    $l->remove_acceptable_attribute( q{id} );
    $l->clean(q{<p id="plugh">plover</p>});
    # returns '<p>plover</p>'

SEE ALSO

There are a number of tools designed for sanitizing HTML, some of which may be better suited than HTML::Laundry to particular circumstances. In addition to HTML::Scrubber, you may want to consider HTML::StripScripts::Parser, an HTML::Parser-based module designed solely for the purposes of sanitizing HTML from potential XSS attack vectors; HTML::Defang, a whitelist-based, pure-Perl module; or HTML::Restrict, an HTML tag whitelist using HTML::Parser.

AUTHOR

Steve Cook, <scook at sixapart.com>

BUGS

Please report any bugs or feature requests on the GitHub page for this project, http://github.com/snark/html-laundry.

ACKNOWLEDGMENTS

Thanks to Dave Cross and Vera Tobin.

SUPPORT

You can find documentation for this module with the perldoc command.

    perldoc HTML::Laundry

COPYRIGHT & LICENSE

Copyright 2009 Six Apart, Ltd., all rights reserved.

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.