
NAME

KSx::Highlight::Summarizer - KinoSearch Highlighter subclass that provides more comprehensive summaries

VERSION

0.06 (beta)

SYNOPSIS

  use KSx::Highlight::Summarizer;
  my $summarizer = KSx::Highlight::Summarizer->new(
      searchable => $searcher,
      query      => 'foo bar',
      field      => 'content',

      # optional:
      pre_tag        => '<b>',
      post_tag       => '</b>',
      encoder        => sub {
          my $str = shift;
          # replace HTML-significant characters with numeric entities
          $str =~ s/([&'"<])/'&#'.ord($1).';'/eg;
          $str
      },
      page_handler   => sub { "<h3>Page $_[1]:</h3>" },
      ellipsis       => "\x{2026}", # default: ' ... '
      excerpt_length => 150,        # default: 200
      summary_length => 400,
  );

  my $excerpt = $summarizer->create_excerpt( $hit );

DESCRIPTION

This module extends KinoSearch::Highlight::Highlighter (which provides an excerpt for a search result, with search words highlighted) to provide various customisations, especially summaries, i.e., multiple excerpts joined together with ellipses.

The superclass finds the best location within the text of a search result, takes a single piece of text surrounding it, and then formats it, highlighting words as appropriate. This module also takes the second-best location and creates an excerpt for that (removing overlap), and so on until the summary_length is reached or exceeded.
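The selection loop described above can be sketched roughly as follows. This is only an illustration, not the module's own code: the function name pick_excerpts and the [ offset, score ] input format are assumptions, and overlap is handled here by simply skipping a window that collides with one already chosen, which is cruder than trimming the overlap away.

```perl
use strict;
use warnings;

# Hypothetical helper, for illustration only.
# $candidates is an array ref of [ $offset, $score ] pairs (assumed format).
sub pick_excerpts {
    my ( $candidates, $excerpt_length, $summary_length ) = @_;
    $summary_length //= 0;    # omitted: stop after the first excerpt

    my @chosen;
    my $total = 0;
    for my $cand ( sort { $b->[1] <=> $a->[1] } @$candidates ) {
        my $start = $cand->[0];
        my $end   = $start + $excerpt_length;

        # Simplification: skip any window that overlaps a chosen one.
        next if grep { $start < $_->[1] && $_->[0] < $end } @chosen;

        push @chosen, [ $start, $end ];
        $total += $excerpt_length;    # lengths are counted before trimming
        last if $total >= $summary_length;
    }

    # Present the excerpts in document order.
    return sort { $a->[0] <=> $b->[0] } @chosen;
}

my @windows = pick_excerpts(
    [ [ 500, 3.0 ], [ 520, 2.5 ], [ 40, 2.0 ], [ 900, 1.0 ] ],
    100,    # excerpt_length
    200,    # summary_length
);
print join( ', ', map { "$_->[0]-$_->[1]" } @windows ), "\n";
```

In this sketch the candidate at offset 520 is dropped because it overlaps the better-scoring one at 500, and collection stops once two excerpts of 100 characters reach the requested 200.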

METHODS

new

This is the constructor. It takes hash-style arguments, as shown in the "SYNOPSIS". The various arguments are as follows:

searchable

A reference to an object that isa KinoSearch::Search::Searchable (e.g., a KinoSearch::Searcher)

query

A query string or object

field

The name of the field for which to make a summary

pre_tag, post_tag

These two are strings of text to be inserted around highlighted words, such as HTML tags. The defaults are '<strong>' and '</strong>'.

encoder

A code ref that is expected to encode the text fed to it, e.g., with HTML entities

page_handler

A coderef. If this is provided, it will be called for every page break (form feed; ASCII character 12) in the summary, and its return value substituted for that page break. The arguments will be (0) the hit (a KinoSearch::Doc::HitDoc object) and (1) the page number.
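As a rough picture of what the substitution amounts to, the sketch below replaces each form feed in a piece of text with the handler's return value. It is not the module's implementation: apply_page_handler and its $first_page argument are hypothetical, and the assumption that the page number passed is that of the page beginning after the break is mine, not something the documentation states.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the substitution step, for illustration only.
sub apply_page_handler {
    my ( $hit, $excerpt, $first_page, $handler ) = @_;

    # Assumed: $first_page is the number of the page the excerpt starts on.
    my $page = $first_page;

    # Call the handler with (0) the hit and (1) the page number for each break.
    $excerpt =~ s/\f/ $handler->( $hit, ++$page ) /ge;
    return $excerpt;
}

my $handler = sub { "<h3>Page $_[1]:</h3>" };

# A real call would pass a KinoSearch::Doc::HitDoc; undef stands in here.
my $out = apply_page_handler( undef, "end of one\fstart of next", 1, $handler );
print "$out\n";
```

The handler here is the same one used in the "SYNOPSIS", so it ignores the hit and uses only the page number in $_[1].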

ellipsis

The ellipsis mark to use. The default is three ASCII dots surrounded by spaces: ' ... '

excerpt_length

The length of each excerpt (default is 200), not including ellipses. Actually, an excerpt may end up being shorter than this, because the start is trimmed to the nearest sentence boundary or page break, and the end is trimmed to the nearest word boundary.

summary_length

The approximate length of the summary, not including ellipses. Excerpts are collected together until the lengths of the excerpts (before trimming) equal or exceed the number passed to this argument. If this is omitted, only one excerpt will be made.

create_excerpt

This requires a KinoSearch::Doc::HitDoc object as its sole argument. It creates and returns a summary.

BUGS

A very long custom ellipsis, or two page breaks a few characters apart, can break the page-counting algorithm.

SINE QUIBUS NON

This module requires perl and the following modules, which are available from the CPAN:

Number::Range

Hash::Util::FieldHash::Compat

The development version of KinoSearch available at http://www.rectangular.com/svn/kinosearch/trunk, revision 4604 or later. It has only been tested with revision 4625.

AUTHOR & COPYRIGHT

Copyright (C) 2008-9 Father Chrysostomos <sprout at, um, cpan.org>

This program is free software; you may redistribute or modify it (or both) under the same terms as perl.

ACKNOWLEDGEMENTS

Much of the code in this module is based on revision 3122 of Marvin Humphrey's KinoSearch::Highlight::Highlighter, of which this is a subclass.