MMM::Text::Search - Perl module for indexing and searching text files and web objects
    use MMM::Text::Search;

    my $srch = new MMM::Text::Search {      # for indexing...
        # index main file location...
        IndexPath      => "/tmp/myindex.db",
        # local files... (optional)
        FileMask       => '(?i)(\.txt|\.htm.?)$',
        Dirs           => [ "/usr/doc", "/tmp" ],
        FollowSymLinks => 0|1,               # (default = 0)
        # web objects... (optional)
        URLs           => [ "http://localhost/", ... ],
        Level          => recursion-level,   # (0 = unlimited)
        # common options...
        IgnoreLimit    => 0.3,               # (default = 2/3)
        Verbose        => 0|1
    };

    $srch->start_indexing_session();
    $srch->commit_indexing_session();
    $srch->index_default_locations();
    $srch->index_content( { title => '...', content => '...', id => '...' } );
    $srch->makeindex;                        # (Obsolete.)

    my $srch = new MMM::Text::Search (      # for searching...
        "/tmp/myindex.db", verbose_flag );

    my $hashref = $srch->query("pizza", "ciao", "-pasta");
    my $hashref = $srch->advanced_query("(pizza OR ciao) AND NOT pasta");

    $srch->errstr()    # returns last error
                       # (only query syntax errors for the time being)
    $srch->dump_word_stats(\*FH)
Indexing
When a session is closed the following files will have been created (assuming IndexPath = /path/myindex.db, see constructor):
    /path/myindex.db            word index database
    /path/myindex-locations.db  filename/URL database
    /path/myindex-titles.db     html title database
    /path/myindex.stopwords     stop-words list
    /path/myindex.filelist      readable list of indexed files/URLs
    /path/myindex.deadlinks     broken http links

[... lots of important things missing ... ]
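The file names listed above all appear to share the IndexPath stem. As a minimal sketch, assuming that this pattern (strip a trailing .db, then append each suffix) holds for any IndexPath, the helper below lists the expected companion files so that a script can check which ones exist after a session. The name index_files is hypothetical and not part of MMM::Text::Search.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of MMM::Text::Search: derive the companion
# file names from IndexPath, assuming the naming pattern listed above.
sub index_files {
    my ($index_path) = @_;
    (my $stem = $index_path) =~ s/\.db\z//;   # "/tmp/myindex.db" -> "/tmp/myindex"
    return (
        $index_path,
        "$stem-locations.db",
        "$stem-titles.db",
        "$stem.stopwords",
        "$stem.filelist",
        "$stem.deadlinks",
    );
}

# Report which of the expected files are present on disk.
for my $file (index_files("/tmp/myindex.db")) {
    printf "%-28s %s\n", $file, (-e $file ? "found" : "missing");
}
```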
start_indexing_session() starts session.
commit_indexing_session() commits and closes current session.
index_default_locations() indexes all files and URLs specified on construction.
index_content() pushes content into the indexing engine. The argument must be a hash reference with the following structure:
{ title => '...', content=> '...', id => '...' }
makeindex() is obsolete. Equivalent to:
    $srch->start_indexing_session();
    $srch->index_default_locations();
    $srch->commit_indexing_session();
dump_word_stats(\*FH) dumps all words, sorted by occurrence frequency, to the file handle FH (or to STDOUT if no parameter is specified). Stop-words get a frequency value of 1.
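Taken together, the session methods suggest a simple pattern for indexing content that lives neither in files nor on the web, such as database rows. The sketch below uses only the methods documented above. The helper name index_records and the records passed to it are illustrative assumptions, and the return values of the module's methods are not relied on because they are not documented here.

```perl
use strict;
use warnings;

# Hypothetical helper: index a list of in-memory records in one session.
# $srch is expected to be an MMM::Text::Search object created with the
# indexing form of the constructor (a hash reference with IndexPath etc.).
sub index_records {
    my ($srch, @records) = @_;

    $srch->start_indexing_session();
    for my $rec (@records) {
        # Each record is passed in the { title, content, id } shape
        # that index_content() expects.
        $srch->index_content({
            title   => $rec->{title},
            content => $rec->{content},
            id      => $rec->{id},
        });
    }
    $srch->commit_indexing_session();

    return scalar @records;   # number of records handed to the indexer
}
```

A call might look like index_records($srch, { title => 'Menu', content => 'pizza and pasta', id => 'doc-1' }), where the record values are made up for illustration.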
Searching
Both query() and advanced_query() return a reference to a hash with the following structure:
    (
        ignored  => [ string, string, ... ],    # ignored words
        searched => [ string, string, ... ],    # words searched for
        entries  => [ hashref, hashref, ... ]   # list of records found
    )
The 'entries' element is a reference to an array of hashes, each having the following structure:
    (
        location => string,   # file path or URL or anything
        score    => number,   # score
        title    => string    # HTML title
    )
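As an illustration of how this structure might be consumed, the sketch below formats a result hash into printable lines. It assumes that a higher score means a better match, which the documentation does not state. The helper name format_results and the sample data are made up for the example; in real use the hash reference would come from query() or advanced_query().

```perl
use strict;
use warnings;

# Hypothetical helper: turn a query result into printable lines.
# Assumption: a higher score means a better match.
sub format_results {
    my ($result) = @_;
    my @lines;

    push @lines, "searched: " . join(", ", @{ $result->{searched} || [] });
    push @lines, "ignored:  " . join(", ", @{ $result->{ignored}  || [] });

    # Sort entries by descending score.
    my @entries = sort { $b->{score} <=> $a->{score} }
                  @{ $result->{entries} || [] };

    for my $entry (@entries) {
        my $title = $entry->{title};
        $title = "(untitled)" unless defined($title) && length($title);
        push @lines, sprintf("%6.2f  %s  [%s]",
                             $entry->{score}, $title, $entry->{location});
    }
    return @lines;
}

# Made-up sample data in the documented shape, for illustration only.
my $sample = {
    ignored  => [ "the" ],
    searched => [ "pizza", "ciao" ],
    entries  => [
        { location => "/usr/doc/a.txt",          score => 1.5, title => "" },
        { location => "http://localhost/b.html", score => 3,   title => "Menu" },
    ],
};
print "$_\n" for format_results($sample);
```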
Note on implementation: The indexing technique is largely derived from the one described by Tim Kientzle in Dr. Dobb's Journal.
Bugs
Many, I guess.
Author
Max Muzi <maxim@comm2000.it>
See also
perl(1).