NAME

Text::Categorize::Util - Method to get keywords and phrases of text.

SYNOPSIS

  use strict;
  use warnings;
  use Text::Categorize::Textrank::En;
  use Text::Categorize::Util qw(getKeywordsAndPhrases);
  use Data::Dump qw(dump);
  my $textrankerEn = Text::Categorize::Textrank::En->new();
  my $text         = Text::Categorize::Util::getTestText();
  print $text;
  my $textrankInfo = $textrankerEn->getTextrankInfoOfText(listOfText => [$text]);
  my $keywordInfo = getKeywordsAndPhrases(
    %$textrankInfo,
    listOfStemmedTaggedDocuments => [ $textrankInfo->{listOfStemmedTaggedSentences} ],
    numberOfKeywords             => 9
  );
  dump $keywordInfo;
  my %phrases = map { ($_->{phrase}, 1) } map { (@$_) } @{ $keywordInfo->{keyphrases} };
  dump [ sort keys %phrases ];

DESCRIPTION

Text::Categorize::Util provides a routine to select the keywords and related phrases from the results of the routine getTextrankInfoOfText in Text::Categorize::Textrank::En.

ROUTINES

getKeywordsAndPhrases

getKeywordsAndPhrases takes the results of the routine getTextrankInfoOfText in Text::Categorize::Textrank::En and selects the keywords for the text, each with its most common instance in the text (keywordOrderInstance), plus the keyphrases in the text associated with the selected keywords (keyphrases).

More precisely, if $results is the returned hash, then $results->{keywordOrderInstance} contains an array reference of the selected keywords in their descending order of importance within the text; each item in the list is {keyword => '', instance => ''}, where keyword is the identifier used for the keyword and instance is the most common form or instance of the keyword in the text.

$results->{keyphrases} contains an array reference of hashes of the form {wordsOfPhrase => [], keywordsOfPhrase => [], phrase => ''} where wordsOfPhrase is a list of the words from listOfStemmedTaggedSentences that comprise the phrase, keywordsOfPhrase is a list of the keywords that occur in the phrase, and phrase is the string of the phrase words.
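As a rough illustration of the structure described above, the following sketch walks a hand-built hash of that shape. The sample data is invented and is not output of the module, and the words in wordsOfPhrase are shown as plain strings for simplicity. The flat list of keyphrase hashes follows the description above; the SYNOPSIS dereferences one more level of array nesting, so it may be worth checking the real output with Data::Dump before relying on either shape.

```perl
use strict;
use warnings;

# Hand-built sample in the documented shape; all values are invented.
my $results = {
  keywordOrderInstance => [
    { keyword => 'network', instance => 'networks' },
    { keyword => 'graph',   instance => 'graph' },
  ],
  keyphrases => [
    {
      wordsOfPhrase    => [ 'neural', 'networks' ],
      keywordsOfPhrase => ['network'],
      phrase           => 'neural networks',
    },
  ],
};

# Keywords are listed in descending order of importance.
my @instances = map { $_->{instance} } @{ $results->{keywordOrderInstance} };
print join(', ', @instances), "\n";

# Index the phrase strings by the keywords that occur in them.
my %phrasesOfKeyword;
foreach my $keyphrase (@{ $results->{keyphrases} }) {
  foreach my $keyword (@{ $keyphrase->{keywordsOfPhrase} }) {
    push @{ $phrasesOfKeyword{$keyword} }, $keyphrase->{phrase};
  }
}
```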

listOfStemmedTaggedDocuments
  listOfStemmedTaggedDocuments => [...]

listOfStemmedTaggedDocuments is an array reference where each item in the array is a list of stemmed and part-of-speech tagged sentences from Text::StemTagPos. If listOfStemmedTaggedDocuments is not defined, then the text to be processed should be provided via listOfText.

hashOfTextrankValues
  hashOfTextrankValues => {}

hashOfTextrankValues holds the hash of the textrank values computed by getTextrankOfListOfTokens. Selected phrases will only begin and end with tokens for which hashOfTextrankValues is defined and positive.
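The boundary rule above can be pictured with a small sketch. The helper trimToRankedTokens is hypothetical and is not part of Text::Categorize::Util; it only shows one way to make a candidate phrase begin and end with tokens that have a defined, positive textrank value. Tokens are assumed to be plain strings here, and the rank values are invented.

```perl
use strict;
use warnings;

# Hypothetical helper, not the module's code: drop tokens from both ends of
# a candidate phrase until each end has a defined, positive textrank value.
sub trimToRankedTokens {
  my ($tokens, $hashOfTextrankValues) = @_;
  my @phrase   = @$tokens;
  my $isRanked = sub {
    my $value = $hashOfTextrankValues->{ $_[0] };
    return defined($value) && $value > 0;
  };
  shift @phrase while @phrase && !$isRanked->($phrase[0]);
  pop @phrase   while @phrase && !$isRanked->($phrase[-1]);
  return @phrase;
}

my %hashOfTextrankValues = (neural => 0.4, network => 0.6);    # invented values
my @trimmed = trimToRankedTokens([qw(the neural network of)], \%hashOfTextrankValues);
print "@trimmed\n";
```

Tokens inside the phrase are left untouched; only the two ends are checked.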

useStemmedWords
  useStemmedWords => 1

useStemmedWords should be set to the same value that was used when computing the textrank with the routine getTextrankInfoOfText in Text::Categorize::Textrank::En. The default is true.

numberOfKeywords
  numberOfKeywords => 10

numberOfKeywords sets the number of keywords to select for the text. If it is greater than the number of values in hashOfTextrankValues, it is reduced to that number. The default is 10.
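A minimal sketch of that limit, using invented textrank values. Choosing the highest-valued tokens as keywords is an assumption made for illustration; the module's actual selection logic is not shown here.

```perl
use strict;
use warnings;

my %hashOfTextrankValues = (graph => 0.5, rank => 0.3, text => 0.2);    # invented values
my $numberOfKeywords     = 10;                                          # the documented default

# Never ask for more keywords than there are textrank values.
my $totalValues = scalar keys %hashOfTextrankValues;
$numberOfKeywords = $totalValues if $numberOfKeywords > $totalValues;

# Assumption: take the tokens with the largest values, ties broken by name.
my @keywords = (
  sort { $hashOfTextrankValues{$b} <=> $hashOfTextrankValues{$a} || $a cmp $b }
    keys %hashOfTextrankValues
)[ 0 .. $numberOfKeywords - 1 ];
print "@keywords\n";
```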

INSTALLATION

To install the module run the following commands:

  perl Makefile.PL
  make
  make test
  make install

If you are on Windows, use 'nmake' rather than 'make'.

BUGS

Please email bug reports or feature requests to bug-text-categorize-util@rt.cpan.org, or submit them through the web interface at http://rt.cpan.org/NoAuth/ReportBug.html?Queue=Text-Categorize-Util. The author will be notified, and you can be automatically notified of progress on the bug fix or feature request.

AUTHOR

 Jeff Kubina <jeff.kubina@gmail.com>

COPYRIGHT

Copyright (c) 2009 Jeff Kubina. All rights reserved. This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself.

The full text of the license can be found in the LICENSE file included with this module.

KEYWORDS

categorize, keywords, keyphrases, nlp, textrank

SEE ALSO

Log::Log4perl, Text::Categorize::Textrank::En