NAME

HTML::ParagraphSplit - Change text containing HTML into a formatted HTML fragment

SYNOPSIS

  use HTML::ParagraphSplit qw( split_paragraphs_to_text split_paragraphs );

  # Read in from a file handle, output text
  print split_paragraphs_to_text(\*ARGV);

  # Convert text to nicely split text
  print split_paragraphs_to_text(<<END_OF_MARKUP);
  This is one paragraph.

  This is another paragraph.
  END_OF_MARKUP

  # Convert to an HTML::Element object instead
  my $tree = split_paragraphs($html_input);
  print $tree->as_HTML;

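  # Create your own HTML::Element object and split it
  use HTML::TreeBuilder;

  my $own_tree = HTML::TreeBuilder->new;
  $own_tree->no_space_compacting(1);  # keep double line breaks intact
  $own_tree->parse($text);
  $own_tree->eof;

  split_paragraphs($own_tree);

  my $html_fragment = $own_tree->guts->as_HTML;
  $own_tree->delete;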

DESCRIPTION

The purpose of this library is to provide methods for converting double line-breaks in text into HTML paragraphs (i.e., wrapping the text in <P></P> tags). It can also convert single line breaks into <BR> tags. Markup can be mixed into the text as well, and this library will DoTheRightThing(tm). A number of additional options modify how the paragraph splits are performed.

For example, given this input (the initial text was generated by DadaDodo http://www.jwz.org/dadadodo/dadadodo.cgi, btw):

  I see over the <strong>noise</strong> but I don't understand sometimes.

  <ol><li>One</li><li>Two</li><li>Three</li></ol>

  Fortunately, we've traded the club you can't skimp on the do because This
  week! Presented by code Lounge: except, for controlling Knox video cameras
  Linux well that the reason, the runlevel to run some reason number of coming 
  back next server; sees you Control <a href="blah.html">display</a> a steep 
  and I tagged with specifications of six feet, moving to Code, flyer main room
  motel balcony, <p>and airflow in which define the ability to run a common. We
  need to current in a manner <pre>than six months and that already gotten a
  webcast</pre> is roughly long and bulk: and up the src page: and updates on a:
  user will probably does this.

This would be converted into the following:

  <p>I see over the <strong>noise</strong> but I don't understand sometimes.</p>

  <ol><li>One</li><li>Two</li><li>Three</li></ol>

  <p>Fortunately, we've traded the club you can't skimp on the do because This
  week! Presented by code Lounge: except, for controlling Knox video cameras
  Linux well that the reason, the runlevel to run some reason number of coming 
  back next server; sees you Control <a href="blah.html">display</a> a steep 
  and I tagged with specifications of six feet, moving to Code, flyer main room
  motel balcony,</p>
  <p>and airflow in which define the ability to run a common. We need to
  current in a manner</p>
  <pre>than six months and that already gotten a
  webcast</pre>
  <p>is roughly long and bulk: and up the src page: and updates on a: user will 
  probably does this.</p>

This allows authors to use some HTML markup without having to cope with getting their paragraph tags right.

This library depends upon HTML::TreeBuilder and HTML::Tagset. You may wish to see the documentation for those libraries for additional details.

METHODS

The primary method of this library is split_paragraphs(). An additional method, split_paragraphs_to_text(), is provided to simplify the task of generating output without having to fuss with HTML::TreeBuilder.

split_paragraphs

$element = split_paragraphs($handle, \%options)
$element = split_paragraphs($text, \%options)
$element = split_paragraphs($element, \%options)

This method has three forms, which vary only in the input they receive. If the first argument is a file handle, $handle, then that handle will be read, parsed, and split. If the first argument is a scalar, $text, then that text will be parsed and split. If the first argument is an HTML::Element object (or an object of a subclass), $element, then the tree represented by the node will be traversed and split.

If you use the third form, your tree will be modified in place and the same tree will be returned. You will want to clone the tree ahead of time if you need to preserve the old tree.
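
A minimal sketch of preserving the original tree, assuming the clone() method that HTML::Element provides. The variable names and the sample text are illustrative only:

```perl
use strict;
use warnings;
use HTML::TreeBuilder;
use HTML::ParagraphSplit qw( split_paragraphs );

# Build a tree by hand, keeping whitespace so double line breaks survive.
my $original = HTML::TreeBuilder->new;
$original->no_space_compacting(1);
$original->parse("First paragraph.\n\nSecond paragraph.\n");
$original->eof;

my $before = $original->as_HTML;

# Split a copy; the original tree is left untouched.
my $copy = $original->clone;
split_paragraphs($copy);

print "original: ", $original->as_HTML, "\n";
print "split:    ", $copy->as_HTML, "\n";

my $after = $original->as_HTML;
$_->delete for $copy, $original;
```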

All forms take an optional second parameter, \%options, which is a reference to a hash of options which modify the default behavior. See below for details.

The first two forms perform an extra step, but are handled essentially the same after the input is parsed into an HTML::Element using HTML::TreeBuilder. This is done using the defaults, except that no_space_compacting() is set to a true value (otherwise, we lose any double returns that were in the original text). If you parse your own trees, you'll probably want to do the same.

This method will search down the element tree and find the first node with non-implicit child nodes and use that as the root of operations.

The split_paragraphs() method then walks the tree and wraps any undecorated text node in a paragraph. Any double line break discovered will result in multiple paragraphs. Any paragraph content elements (as defined by %is_Possible_Strict_P_Content of HTML::Tagset) will be inserted into the paragraph elements as if they were text. Any block level tags (i.e., not in %is_Possible_Strict_P_Content) cause a paragraph break immediately before and after such elements.
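
To get a feel for which tags stay inside a paragraph and which force a break, you can look them up in HTML::Tagset yourself. This is only a sketch of the lookup described above, not the module's internal code, and the list of tags is just a sample:

```perl
use strict;
use warnings;
use HTML::Tagset;

# Tags found in %is_Possible_Strict_P_Content are treated like text;
# anything else is block level and breaks the paragraph around it.
for my $tag (qw( strong a ol pre )) {
    my $inline = $HTML::Tagset::is_Possible_Strict_P_Content{$tag};
    printf "%-6s %s\n", $tag,
        $inline ? 'kept inside the paragraph' : 'forces a paragraph break';
}
```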

Any text found within a block-level node may also be paragraphified. Those blocks of text will not be wrapped in paragraphs unless they contain a double-line break (that way we're not inserting P-tags without an explicit need for them).

Note also that this will insert P-tags conservatively. If more than two line-breaks are present, even if they are mixed with other white space, all of that whitespace will be treated as the same paragraph break. No empty P-tags or P-tags containing only whitespace will be inserted (mostly). The only exception is when the white space is created by white space entities, such as &nbsp;.
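
The conservative rule can be illustrated with plain Perl. The regular expression below is an assumption made for illustration and is not taken from the module's source; it simply treats any run of whitespace containing two or more line breaks as a single paragraph break:

```perl
use strict;
use warnings;

my $text = "First paragraph.\n\n \t\n\n\nSecond paragraph.\nStill the second.\n";

# A whitespace run with at least two line breaks is one split point,
# no matter how many breaks or spaces it contains.
my @paragraphs = grep { /\S/ } split /\s*\n\s*\n\s*/, $text;

# Drop trailing whitespace so no paragraph ends in a stray newline.
s/\s+\z// for @paragraphs;

print "<p>$_</p>\n" for @paragraphs;
```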

All of that is the default behavior, which may be modified by the options passed in the second parameter.

Here's the list of options and what they do:

p_on_breaks_only => 1

If this option is used, then paragraphs will not be added to your text unless there is at least one double-line break. This option is used internally to make sure nested elements do not have extra P-tags unnecessarily.

single_line_breaks_to_br => 1

If this option is given, then single line breaks will also be converted to BR-tags.

br_only_if_can_tighten => 1

This option modifies the single_line_breaks_to_br option by specifying that BR-tags are not added within blocks that cannot be tightened (i.e., aren't set in %canTighten of HTML::Tagset). This can be useful for preventing double-line breaks from appearing inside PRE-tags or TEXTAREA-tags because of added BR-tags.
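
A short sketch of combining the two options. The sample input is made up; the exact markup returned depends on HTML::Element's output routines, so no expected output is shown:

```perl
use strict;
use warnings;
use HTML::ParagraphSplit qw( split_paragraphs_to_text );

my $input = "Roses are red,\nviolets are blue.\n\n"
          . "<pre>line one\nline two</pre>\n";

# Single line breaks in ordinary text become BR-tags, while the
# PRE block is left alone because it cannot be tightened.
my $html = split_paragraphs_to_text($input, {
    single_line_breaks_to_br => 1,
    br_only_if_can_tighten   => 1,
});

print $html, "\n";
```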

use_br_instead_of_p => 1

As an alternative to using P-tags at all, this option places BR-tags instead. Whenever a double line-break is encountered, two BR-tags are inserted rather than a P-tag.

This option is independent of single_line_breaks_to_br, as single line-breaks are not dealt with unless that option is turned on. Also note that, like P-tag insertion, it inserts BR-tags conservatively. Multiple consecutive line-breaks (even mixed with whitespace) will be treated just as if they were only two. Thus, given the default stylesheet of your typical browser, the rendered output will appear pretty much the same in most circumstances.
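
For example, a minimal sketch using made-up input text:

```perl
use strict;
use warnings;
use HTML::ParagraphSplit qw( split_paragraphs_to_text );

my $input = "First block of text.\n\n\n\nSecond block of text.\n";

# The run of four line breaks counts as a single double line-break.
my $html = split_paragraphs_to_text($input, {
    use_br_instead_of_p => 1,
});

print $html, "\n";
```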

add_attrs_to_p => \%attrs

This can be used to add a static set of attributes to each inserted P-element. For example:

  # Give each newly added paragraph the "generated" class.
  split_paragraphs($tree, {
      add_attrs_to_p => { class => 'generated' },
  });

add_attrs_to_br => \%attrs

Same as above, but for the inserted BR-tags.
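
A sketch that pairs this option with single_line_breaks_to_br so that there are BR-tags to decorate. The "generated" class name and the sample text are illustrative:

```perl
use strict;
use warnings;
use HTML::ParagraphSplit qw( split_paragraphs_to_text );

my $input = "Line one\nline two\n\nNext paragraph.\n";

# Give each newly added line break the "generated" class.
my $html = split_paragraphs_to_text($input, {
    single_line_breaks_to_br => 1,
    add_attrs_to_br          => { class => 'generated' },
});

print $html, "\n";
```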

filter_added_nodes => \&sub

This can be used to run a small subroutine on each added paragraph or line-break tag as it is added. For example:

  # Give each newly added paragraph a unique ID
  split_paragraphs($tree, {
      filter_added_nodes => sub {
          my ($element) = @_;
          $element->idf();
      },
  });

Many, if not all, of the other options can be simulated using this method, by the way.
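
As a sketch of that idea, the following filter does roughly what add_attrs_to_p does. It relies only on the tag() and attr() methods of HTML::Element; the class name is illustrative:

```perl
use strict;
use warnings;
use HTML::ParagraphSplit qw( split_paragraphs_to_text );

my $text = "First paragraph.\n\nSecond paragraph.\n";

my $html = split_paragraphs_to_text($text, {
    filter_added_nodes => sub {
        my ($element) = @_;

        # The filter may also be handed BR elements, so check the tag.
        $element->attr(class => 'generated') if $element->tag eq 'p';
    },
});

print $html, "\n";
```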

use_instead_of_p => $tag

Rather than using P-tags to break everything, use a different tag. This example uses DIV-tags instead of P-tags:

  split_paragraphs($tree, {
      use_instead_of_p => 'div',
  });

split_paragraphs_to_text

$html_text = split_paragraphs_to_text($handle, \%options)
$html_text = split_paragraphs_to_text($text, \%options)
$html_text = split_paragraphs_to_text($element, \%options)

This method performs the exact same operation as the split_paragraphs() method, but returns the text as a scalar value. This is helpful if you just want a quick method that takes in text and outputs text and you don't really need the HTML formatted in any particular way and don't need to modify the tree at all.

I created this method primarily as a way of outputting the tree to make testing easier. If the output isn't what you want, use split_paragraphs() instead and use the output methods available in HTML::Element directly to get the desired output.
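
For instance, here is a sketch that asks HTML::Element for indented output with explicit end tags. The arguments to as_HTML() are the entity set, the indent string, and the hash of optional end tags; the sample text is made up:

```perl
use strict;
use warnings;
use HTML::ParagraphSplit qw( split_paragraphs );

my $tree = split_paragraphs("First paragraph.\n\nSecond paragraph.\n");

# undef: default entity encoding; '  ': indent by two spaces;
# {}: treat no end tags as optional, so every </p> is written out.
my $pretty = $tree->as_HTML(undef, '  ', {});
print $pretty, "\n";

$tree->delete;
```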

SEE ALSO

HTML::TreeBuilder, HTML::Tagset

BUGS AND TODO

I don't really have any explicit plans for this module, but if you find a bug or would like an additional feature or have another contribution, send me email at <hanenkamp@cpan.org>.

NOTES

I tried to name this library HTML::Paragraphify first. After typing that a dozen times and looking at it for a few hours, my eyes felt like they were starting to bleed so I changed it to HTML::ParagraphSplit.

AUTHOR

Andrew Sterling Hanenkamp, <hanenkamp@cpan.org>

LICENSE AND COPYRIGHT

Copyright 2006 Andrew Sterling Hanenkamp <hanenkamp@cpan.org>. All Rights Reserved.

This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself. See perlartistic.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.