
NAME

MMM::Text::Search - Perl module for indexing and searching text files and web objects

SYNOPSIS

  use MMM::Text::Search;
          
  my $srch = MMM::Text::Search->new({    # for indexing...
        # main index file location
        IndexPath      => "/tmp/myindex.db",
        # local files (optional)
        FileMask       => '(?i)(\.txt|\.htm.?)$',
        Dirs           => [ "/usr/doc", "/tmp" ],
        FollowSymLinks => 0|1,                      # default = 0
        # web objects (optional)
        URLs           => [ "http://localhost/", ... ],
        Level          => recursion-level,          # 0 = unlimited
        # common options
        IgnoreLimit    => 0.3,                      # default = 2/3
        Verbose        => 0|1,
  });
  
  $srch->start_indexing_session();
        
  $srch->commit_indexing_session();
  
  $srch->index_default_locations();
        
  $srch->index_content( { title =>   '...', 
                          content=>  '...', 
                          id =>      '...'  } );
         
  $srch->makeindex;     # obsolete

  my $srch = MMM::Text::Search->new(     # for searching...
                  "/tmp/myindex.db", verbose_flag );

  my $hashref = $srch->query("pizza", "ciao", "-pasta");
  my $hashref = $srch->advanced_query("(pizza OR ciao) AND NOT pasta");

  $srch->errstr();      # returns last error
                        # (only query syntax errors for the time being)

  $srch->dump_word_stats(\*FH);

DESCRIPTION

  • Indexing

    When a session is closed the following files will have been created (assuming IndexPath = /path/myindex.db, see constructor):

            /path/myindex.db             word index database
            /path/myindex-locations.db   filename/URL database
            /path/myindex-titles.db      HTML title database
            /path/myindex.stopwords      stop-words list
            /path/myindex.filelist       readable list of indexed files/URLs
            /path/myindex.deadlinks      broken http links
    
            [... lots of important things missing ... ]
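    The file list above can be reproduced from an IndexPath value, for example in a backup or cleanup script. The following is only a sketch: index_files() is a hypothetical helper, not part of the module, and it assumes the base name is simply IndexPath with a trailing ".db" removed, as the listing suggests.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the companion file names listed above
# from an IndexPath value.
# Assumption: the base name is IndexPath with a trailing ".db" removed.
sub index_files {
    my ($index_path) = @_;
    (my $base = $index_path) =~ s/\.db$//;
    return (
        $index_path,             # word index database
        "$base-locations.db",    # filename/URL database
        "$base-titles.db",       # HTML title database
        "$base.stopwords",       # stop-words list
        "$base.filelist",        # readable list of indexed files/URLs
        "$base.deadlinks",       # broken http links
    );
}

print "$_\n" for index_files("/path/myindex.db");
```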

    start_indexing_session() starts a new indexing session.

    commit_indexing_session() commits and closes current session.

    index_default_locations() indexes all files and URLs specified on construction.
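    Which local files are picked up presumably depends on FileMask. The sketch below shows what the FileMask value from the SYNOPSIS accepts, assuming the module applies it as an ordinary Perl regular expression to each file name (this is not confirmed here). matching_files() and the file names are hypothetical and serve only as illustration.

```perl
use strict;
use warnings;

# FileMask value copied from the SYNOPSIS.
my $mask = '(?i)(\.txt|\.htm.?)$';

# Hypothetical helper: return the names that the mask accepts.
sub matching_files {
    my ($pattern, @names) = @_;
    my $re = qr/$pattern/;
    return grep { $_ =~ $re } @names;
}

# Made-up file names, for illustration only.
my @names = qw(README.TXT index.html page.htm notes.md archive.html.gz);

# Prints README.TXT, index.html and page.htm, one per line.
print "$_\n" for matching_files($mask, @names);
```

    The (?i) prefix makes the match case-insensitive, and the trailing ".?" lets ".htm" be followed by at most one more character, so ".html" is accepted while ".html.gz" is not.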

    index_content() pushes content into the indexing engine. Its argument must be a hash reference with the following structure:

     { title =>   '...', content=>  '...', id =>      '...'  }

    makeindex() is obsolete. It is equivalent to:

     $srch->start_indexing_session();
     $srch->index_default_locations();
     $srch->commit_indexing_session();

    dump_word_stats(\*FH) dumps all words, sorted by occurrence frequency, to the file handle FH (or to STDOUT if no parameter is specified). Stop-words get a frequency value of 1.

  • Searching

    Both query() and advanced_query() return a reference to a hash with the following structure:

            {
             ignored  => [ string, string, ... ],    # ignored words
             searched => [ string, string, ... ],    # words searched for
             entries  => [ hashref, hashref, ... ]   # list of records found
            }

    The 'entries' element is a reference to an array of hashes, each having the following structure:

            {
             location => string,  # file path or URL or anything
             score    => number,  # score
             title    => string   # HTML title
            }
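    A minimal sketch of walking this structure follows. The $result hash is a hand-written stand-in for what query() or advanced_query() would return, with made-up values, and format_entries() is a hypothetical helper. Whether the module returns entries already ordered by score is not stated above, so the sketch sorts them explicitly.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the hash reference returned by query() or
# advanced_query(); all values are made up for illustration.
my $result = {
    ignored  => [ 'the' ],
    searched => [ 'pizza', 'ciao' ],
    entries  => [
        { location => '/usr/doc/menu.txt',       score => 2, title => 'Menu'  },
        { location => 'http://localhost/a.html', score => 5, title => 'Pizza' },
    ],
};

# Hypothetical helper: one tab-separated line per entry, highest score first.
sub format_entries {
    my ($res) = @_;
    my @lines;
    for my $entry (sort { $b->{score} <=> $a->{score} } @{ $res->{entries} }) {
        push @lines, join("\t", $entry->{score}, $entry->{location}, $entry->{title});
    }
    return @lines;
}

print "searched: @{ $result->{searched} }\n";
print "ignored:  @{ $result->{ignored} }\n";
print "$_\n" for format_entries($result);
```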

NOTES

Note on implementation: The indexing technique is largely derived from the one described by Tim Kientzle in Dr. Dobb's Journal.

BUGS

Many, I guess.

AUTHOR

Max Muzi <maxim@comm2000.it>

SEE ALSO

perl(1).
