NAME

Lingua::EN::Bigram - Extract n-grams from a text and list them according to frequency and/or T-Score

SYNOPSIS

  # initialize
  use Lingua::EN::Bigram;
  $ngrams = Lingua::EN::Bigram->new;
  $ngrams->text( 'All men by nature desire to know. An indication of this...' );

  # calculate t-score for bigrams; t-score is only available for bigrams
  $tscore = $ngrams->tscore;
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

  # list trigrams according to frequency
  @trigrams = $ngrams->ngram( 3 );
  $count = $ngrams->ngram_count( \@trigrams );
  foreach my $trigram ( sort { $$count{ $b } <=> $$count{ $a } } keys %$count ) {

    print $$count{ $trigram }, "\t$trigram\n";

  }

DESCRIPTION

This module is designed to: 1) pull out all of the ngrams (multi-word phrases) in a given text, and 2) list these phrases according to their frequency. Using this module it is possible to create lists of the most common phrases in a text as well as order them by their statistical occurrence, thus implying significance. This process is useful for the purposes of textual analysis and "distant reading".

METHODS

new

Create a new, empty Lingua::EN::Bigram object:

  # initialize
  $ngrams = Lingua::EN::Bigram->new;

text

Set or get the text to be analyzed:

  # fill Lingua::EN::Bigram object with content 
  $ngrams->text( 'All good things must come to an end...' );

  # get the Lingua::EN::Bigram object's content 
  $text = $ngrams->text;

words

Return a list of all the tokens in a text. Each token will be a word or punctuation mark:

  # get words
  @words = $ngrams->words;

word_count

Return a reference to a hash whose keys are tokens and whose values are the number of times each token occurs in the text:

  # get word count
  $word_count = $ngrams->word_count;

  # list the words according to frequency
  foreach ( sort { $$word_count{ $b } <=> $$word_count{ $a } } keys %$word_count ) {

    print $$word_count{ $_ }, "\t$_\n";

  }

bigrams

Return a list of all bigrams in the text. Each item will be a pair of tokens and the tokens may consist of words or punctuation marks:

  # get bigrams
  @bigrams = $ngrams->bigrams;

This is a convenience wrapper around the ngram method, described below. It is identical to $ngrams->ngram( 2 ). In fact, that is exactly what is called within the module itself.

bigram_count

Return a reference to a hash whose keys are a bigram and whose values are the frequency of the bigram in the text:

  # get bigram count
  $count = $ngrams->bigram_count;

  # list the bigrams according to frequency
  foreach ( sort { $$count{ $b } <=> $$count{ $a } } keys %$count ) {

    print $$count{ $_ }, "\t$_\n";

  }

tscore

Return a reference to a hash whose keys are bigrams and whose values are T-Scores -- a probabilistic calculation determining the significance of the bigram occurring in the text:

  # get t-score
  $tscore = $ngrams->tscore;

  # list bigrams according to t-score
  foreach ( sort { $$tscore{ $b } <=> $$tscore{ $a } } keys %$tscore ) {

    print "$$tscore{ $_ }\t" . "$_\n";

  }

T-Score can only be computed against bigrams.
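
To make the calculation concrete, here is a minimal, self-contained sketch of a commonly used approximation of the T-Score, where C() is a frequency and N is the total number of tokens: t = ( C(w1 w2) - C(w1) * C(w2) / N ) / sqrt( C(w1 w2) ). The sketch does not use the module; the helper name tscore_from_tokens and the sample tokens are hypothetical, and it is an assumption that the module computes exactly this value:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Bigram).
# Approximate T-Score for each bigram in a list of tokens:
#   t = ( C(w1 w2) - C(w1) * C(w2) / N ) / sqrt( C(w1 w2) )
sub tscore_from_tokens {
    my @tokens = @_;
    my $n      = scalar @tokens;

    # count single tokens and adjacent pairs of tokens
    my ( %word_count, %bigram_count );
    $word_count{ $_ }++ for @tokens;
    $bigram_count{ join ' ', @tokens[ $_, $_ + 1 ] }++ for 0 .. $#tokens - 1;

    # observed bigram frequency minus the frequency expected by chance,
    # scaled by the square root of the observed frequency
    my %tscore;
    for my $bigram ( keys %bigram_count ) {
        my ( $first, $second ) = split / /, $bigram;
        $tscore{ $bigram } =
            ( $bigram_count{ $bigram } - $word_count{ $first } * $word_count{ $second } / $n )
            / sqrt( $bigram_count{ $bigram } );
    }
    return \%tscore;
}

# sample tokens, written out by hand for illustration
my $tscore = tscore_from_tokens( qw( new york is big and new york is old ) );
printf "%.3f\t%s\n", $$tscore{ $_ }, $_
    for sort { $$tscore{ $b } <=> $$tscore{ $a } || $a cmp $b } keys %$tscore;
```

A bigram scores highly when it occurs more often than the individual frequencies of its two tokens would suggest, which is why the ordering can differ from a plain frequency count.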

trigrams

Return a list of all trigrams (three-word phrases) in the text. Each item will include three tokens and the tokens may consist of words or punctuation marks:

  # get trigrams
  @trigrams = $ngrams->trigrams;

This is a convenience wrapper around the ngram method, described below. It is identical to $ngrams->ngram( 3 ). In fact, that is exactly what is called within the module itself.

trigram_count

Return a reference to a hash whose keys are a trigram and whose values are the frequency of the trigram in the text:

  # get trigram count
  $count = $ngrams->trigram_count;

  # list the trigrams according to frequency
  foreach ( sort { $$count{ $b } <=> $$count{ $a } } keys %$count ) {

    print $$count{ $_ }, "\t$_\n";

  }

quadgrams

Return a list of all quadgrams (four-word phrases) in the text. Each item will include four tokens and the tokens may consist of words or punctuation marks:

  # get quadgrams
  @quadgrams = $ngrams->quadgrams;

This is a convenience wrapper around the ngram method, described below. It is identical to $ngrams->ngram( 4 ). In fact, that is exactly what is called within the module itself.

quadgram_count

Return a reference to a hash whose keys are a quadgram and whose values are the frequency of the quadgram in the text:

  # get quadgram count
  $count = $ngrams->quadgram_count;

  # list the quadgrams according to frequency
  foreach ( sort { $$count{ $b } <=> $$count{ $a } } keys %$count ) {

    print $$count{ $_ }, "\t$_\n";

  }

ngram

Return a list of ngrams where the length of each ngram is denoted by the method's parameter:

  # create a list of trigrams
  @trigrams = $ngrams->ngram( 3 );

This method requires a single parameter and that parameter must be an integer.
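
To illustrate what an ngram list looks like, here is a minimal, self-contained sketch of sliding-window extraction. It does not use the module; the helper name ngrams_from_tokens, the tokenizing pattern, and the lowercasing are assumptions made for illustration and may differ from what the module does internally:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Lingua::EN::Bigram): return every run of
# $n consecutive tokens, each run joined by single spaces
sub ngrams_from_tokens {
    my ( $n, @tokens ) = @_;
    die "n must be a positive integer\n" unless defined $n && $n =~ /^[1-9]\d*$/;

    my @ngrams;
    for my $i ( 0 .. $#tokens - $n + 1 ) {
        push @ngrams, join ' ', @tokens[ $i .. $i + $n - 1 ];
    }
    return @ngrams;
}

# assumed tokenizer: words (with internal apostrophes) or single punctuation marks
my @tokens   = lc( 'All men by nature desire to know.' ) =~ /\w+(?:'\w+)*|[[:punct:]]/g;
my @trigrams = ngrams_from_tokens( 3, @tokens );
print "$_\n" for @trigrams;
```

A text of T tokens yields T - n + 1 ngrams of length n, and none at all when the text is shorter than n tokens.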

ngram_count

Given a reference to an array of ngrams, return a reference to a hash whose keys are ngrams and whose values are the frequency of each ngram in the text:

  # count ngram frequency
  $counts = $ngrams->ngram_count( \@trigrams );
  foreach ( sort { $$counts{ $b } <=> $$counts{ $a } } keys %$counts ) {

    print $$counts{ $_ }, "\t$_\n";

  }

DISCUSSION

Given the increasing availability of full text materials, this module is intended to help "digital humanists" apply mathematical methods to the analysis of texts. For example, the developer can extract the high-frequency words using the word_count method and allow the user to search for those words in a concordance. The bigram_count method simply returns the frequency of each bigram, but the tscore method can rank bigrams by statistical significance rather than by raw frequency.

Consider using T-Score-weighted bigrams as classification terms to supplement the "aboutness" of texts. Concatenate many texts together and look for common phrases written by the author. Compare these commonly used phrases to the commonly used phrases of other authors.

Each bigram, trigram, quadgram, or ngram includes punctuation. This is intentional. Developers may want to remove bigrams, trigrams, quadgrams, or ngrams containing such values from the output. Similarly, no effort has been made to remove commonly used words -- stop words -- from the methods. Consider the use of Lingua::StopWords, Lingua::EN::StopWords, or the creation of your own stop word list to make output more meaningful. The distribution comes with a script (bin/ngrams.pl) demonstrating how to remove punctuation and stop words from the displayed output.
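
The filtering described above can be sketched as follows. This is a minimal, self-contained example and not the script shipped with the distribution; the helper name is_meaningful, the tiny stop word list, and the hand-written sample bigrams are all assumptions, as is the premise that the tokens of an ngram are separated by spaces:

```perl
use strict;
use warnings;

# assumption: a tiny hand-made stop word list; a real list would be much longer
my %stop = map { $_ => 1 } qw( a an and by of the this to );

# Hypothetical helper (not part of Lingua::EN::Bigram): keep an ngram only if
# none of its tokens is purely punctuation or a stop word
sub is_meaningful {
    my ( $ngram ) = @_;
    for my $token ( split ' ', $ngram ) {
        return 0 if $token =~ /^[[:punct:]]+$/;
        return 0 if $stop{ lc $token };
    }
    return 1;
}

# sample bigrams, written out by hand in place of the output of $ngrams->bigrams
my @bigrams = ( 'all men', 'men by', 'by nature', 'nature desire', 'desire to', 'to know', 'know .' );
my @kept    = grep { is_meaningful( $_ ) } @bigrams;
print "$_\n" for @kept;
```

The same test can be applied to the keys of a count or T-Score hash before sorting, so that only the filtered phrases are displayed.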

Finally, this is not the only module supporting bigram extraction. See also Text::NSP which supports n-gram extraction.

TODO

There are probably a number of ways the module can be improved:

    * the constructor method could take a scalar as input, thus reducing the need for the text method

    * the distribution's license should probably be changed to the Perl Artistic License

    * the addition of alternative T-Score calculations would be nice

    * make sure the module works with character sets beyond ASCII

CHANGES

    * August 23, 2010 (version 0.03) - added ngram and ngram_count methods

    * August 22, 2010 (version 0.02) - added trigrams and quadgrams; tweaked documentation; removed bigrams.pl from the distribution and substituted it with n-grams.pl

    * June 19, 2009 (version 0.01) - initial release

ACKNOWLEDGEMENTS

T-Score, as well as a number of the module's methods, is calculated as per Nugues, P. M. (2006). An introduction to language processing with Perl and Prolog: An outline of theories, implementation, and application with special consideration of English, French, and German. Cognitive technologies. Berlin: Springer.

AUTHOR

Eric Lease Morgan <eric_morgan@infomotions.com>