
NAME

Plucene::SearchEngine::Index - A higher level abstraction for Plucene

SYNOPSIS

    my $indexer = Plucene::SearchEngine::Index->new(
        dir => "/var/lib/plucene" 
    );

    my @documents = map { $_->document } 
        Plucene::SearchEngine::Index::File->examine("foo.html");

    $indexer->index($_) for @documents;

DESCRIPTION

This module makes it easy to write to Plucene indexes. It does so by providing an interface to the index writer which, in terms of complexity, sits between Plucene::Index::Writer and Plucene::Simple; it also provides a framework of modules for turning data into Plucene::Document objects, so that you don't necessarily have to parse them yourself. See "Document Frontends and Backends" for more on this.

This module is designed to be used with Plucene::SearchEngine::Query; together, the two aim to make it easy for anyone to write search engines based on Plucene.

METHODS

new

    my $indexer = Plucene::SearchEngine::Index->new(
        dir      => "/var/plucene/foo",
        analyzer => "Plucene::Analysis::SimpleAnalyzer",
    );

This creates a new indexer; you must specify the directory to contain the index, and you may specify an analyzer to tokenize the data.

index

This adds a Plucene::Document to the index.
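As a rough illustration, a document can be built by hand with Plucene::Document and Plucene::Document::Field and then passed to index. This is only a sketch: the temporary directory and the field names "id" and "text" are choices made for the example, not something the indexer requires.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);
use Plucene::Document;
use Plucene::Document::Field;
use Plucene::SearchEngine::Index;

# Throwaway index directory for the example; use a real path in practice.
my $dir     = tempdir(CLEANUP => 1);
my $indexer = Plucene::SearchEngine::Index->new(dir => $dir);

# Build a document by hand. The field names are illustrative only.
my $doc = Plucene::Document->new;
$doc->add(Plucene::Document::Field->Keyword(id => "note-1"));
$doc->add(Plucene::Document::Field->UnStored(text => "A short note about Plucene"));

# Hand the finished document to the indexer.
$indexer->index($doc);
```

Keyword fields are stored and indexed as a single term, while UnStored fields are tokenized and indexed but not kept in the index, which suits large bodies of text.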

Document Frontends and Backends

So far so good, but how do you create these Plucene::Documents? You can, of course, do so manually, but the easiest way is to use the supplied Plucene::SearchEngine::Index::File or Plucene::SearchEngine::Index::URL modules.

These two modules are frontends which gather metadata about a file or URL and then hand the data off to one of the backend modules; backends are supplied for PDF, HTML and plain text files. The backend in turn returns a list of objects, one for each document found in the file or URL. In most cases there'll only be one, but a Unix mbox, for instance, should produce one object for each email in the box. These objects can be turned into Plucene::Document objects by calling the document method on them. This isn't done by default because you may wish to modify the underlying hash yourself, serialize it, and so on.

Creating your own backend

If you want to handle a different type of file, it's relatively easy to do. All you need to do is create a module called Plucene::SearchEngine::Index::Whatever; this should inherit from Plucene::SearchEngine::Index::Base and supply a gather_data_from_file method. It should also call the register_handler method to state which MIME types and file extensions it can handle.

For instance, suppose we want to create a backend which grabs metadata from an image and indexes that. (Not unlike Plucene::SearchEngine::Index::Image...) We'd start off like this:

    package Plucene::SearchEngine::Index::Image;
    use strict;
    use warnings;
    use base 'Plucene::SearchEngine::Index::Base';
    use Image::Info qw(image_info html_dim);
    use Time::Piece;
    use Date::Parse qw(str2time);

Now we register the MIME types and file extensions we can handle:

    __PACKAGE__->register_handler(qw( 
        image/bmp           .bmp 
        image/gif           .gif
        image/jpeg          .jpeg .jpg .jpe
        ...
    ));

And our gather_data_from_file method will call add_data for each bit of metadata it can find:

    sub gather_data_from_file {
        my ($self, $filename) = @_;
        my $info = image_info($filename);
        # image_info reports failure through an "error" key
        return if $info->{error};
        # html_dim returns a 'width="..." height="..."' string
        $self->add_data("size", "UnStored", scalar html_dim($info));
        $self->add_data("text", "UnStored", $info->{Comment});
        $self->add_data("subtype", "UnStored", $info->{file_ext});
        # convert the timestamp string to an epoch, then to a Time::Piece
        $self->add_data("created", "Date", Time::Piece->new(
            str2time($info->{LastModificationTime})));
    }

See Plucene::SearchEngine::Index::Base for an explanation of add_data.

Because Plucene::SearchEngine::Index uses a plugin architecture, once this module is installed, it will automatically be called upon to handle those image types it can deal with, without any additional action by the user.
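To make the idea of handler registration concrete, here is a small self-contained sketch of how a registry keyed on MIME types and file extensions might dispatch to a backend class. It is not the module's actual implementation: the My::Backend::* packages and the handler_for_file method are hypothetical names invented for this illustration.

```perl
use strict;
use warnings;

# Hypothetical registry; the real module's internals may differ.
package My::Backend::Base;

my (%by_mime, %by_ext);

# Record which class handles each MIME type or file extension.
# Entries starting with a dot are treated as extensions.
sub register_handler {
    my ($class, @types) = @_;
    for my $type (@types) {
        if ($type =~ /^\./) { $by_ext{ lc $type }  = $class }
        else                { $by_mime{ lc $type } = $class }
    }
}

# Look up a handler class from a filename's extension; undef if none.
sub handler_for_file {
    my ($class, $filename) = @_;
    my ($ext) = $filename =~ /(\.[^.\/]+)$/;
    return undef unless defined $ext;
    return $by_ext{ lc $ext };
}

package My::Backend::Image;
our @ISA = ('My::Backend::Base');
__PACKAGE__->register_handler(qw( image/gif .gif image/jpeg .jpeg .jpg ));

package main;

my $image_handler = My::Backend::Base->handler_for_file("photo.JPG");
my $text_handler  = My::Backend::Base->handler_for_file("notes.txt");

print "photo.JPG: ", (defined $image_handler ? $image_handler : "no handler"), "\n";
print "notes.txt: ", (defined $text_handler  ? $text_handler  : "no handler"), "\n";
```

Because each backend registers itself when it is loaded, the dispatching code never needs a hard-coded list of backends; loading a new backend module is enough to make it available.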

Creating your own frontend

For certain types of data, such as emails, news articles, or instant messages, you may not want to use the file or URL frontends. Alternatively, if you have a simple piece of data which isn't file-based, you may just want to do everything yourself. Even then, Plucene::SearchEngine::Index::Base can help you to create Plucene::Documents - just inherit from it, and use add_data to add fields to the document in your examine method. See Plucene::SearchEngine::Index::Base for more details.
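A frontend along these lines might look like the following sketch. The package name My::Index::Message and the shape of the message hash are hypothetical, and the sketch assumes that the base class constructor can be called without arguments and that "Keyword", "Text" and "UnStored" are accepted field types; check Plucene::SearchEngine::Index::Base before relying on either assumption.

```perl
package My::Index::Message;    # hypothetical frontend class
use strict;
use warnings;
use base 'Plucene::SearchEngine::Index::Base';

# Turn one in-memory message (a plain hash reference here) into an
# object carrying the fields we want indexed.
sub examine {
    my ($class, $message) = @_;
    my $self = $class->new;    # assumes new() needs no arguments
    $self->add_data("id",   "Keyword",  $message->{id});
    $self->add_data("from", "Text",     $message->{from});
    $self->add_data("text", "UnStored", $message->{body});
    return $self;
}

package main;

my $object = My::Index::Message->examine({
    id   => "msg-42",
    from => "alice",
    body => "Lunch at noon?",
});

# Convert to a Plucene::Document, ready to pass to $indexer->index.
my $document = $object->document;
```

Returning the object rather than a finished Plucene::Document mirrors the behaviour of the supplied frontends, so callers can still adjust the data before calling document.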

SEE ALSO

Plucene::SearchEngine::Index::File, Plucene::SearchEngine::Index::URL, Plucene::SearchEngine::Index::Base, Plucene::SearchEngine::Query, Plucene::Simple.

AUTHOR

Simon Cozens, simon@cpan.org.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.