
NAME

Text::Positional::Ngram

SYNOPSIS

This document provides a general introduction to the Text::Positional::Ngram module.

DESCRIPTION

1. Introduction

The Text::Positional::Ngram module retrieves variable-length ngrams. An ngram is defined as a sequence of 'n' tokens that occur within a window of at least 'n' tokens in the text. What constitutes a 'token' can be defined by the user.
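
To make the window idea concrete, the following sketch lists the bigrams that can be formed when the two tokens only need to fall within a window of a given size. It is plain Perl and does not call the module. The helper name windowed_bigrams, and the rule that each bigram starts at the first token of its window, are assumptions made for illustration rather than a description of the module's internals.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Positional::Ngram): pair each token
# with every later token that falls inside a window of $window tokens.
# With $window == 2 this reduces to ordinary contiguous bigrams.
sub windowed_bigrams {
    my ($window, @tokens) = @_;
    my @bigrams;
    for my $i (0 .. $#tokens) {
        # the window covers positions $i .. $i + $window - 1
        my $last = $i + $window - 1;
        $last = $#tokens if $last > $#tokens;
        for my $j ($i + 1 .. $last) {
            push @bigrams, "$tokens[$i]<>$tokens[$j]<>";
        }
    }
    return @bigrams;
}

my @tokens = qw(to be or not);
print join("  ", windowed_bigrams(3, @tokens)), "\n";
```

With a window of 3, "to" can pair with "be" or "or", which a contiguous bigram count would never produce.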

2. Ngrams

An ngram is a sequence of n tokens. The tokens in the ngrams are delimited by the diamond symbol, "<>". Therefore "to<>be<>" is a bigram whose tokens consist of "to" and "be". Similarly, "or<>not<>to<>" is a trigram whose tokens consist of "or", "not", and "to".

Given a piece of text, ngrams are usually formed from contiguous tokens. For example, if we take the phrase:

    to     be     or     not     to     be

The bigrams for this phrase would be:

    to<>be<>     be<>or<>     or<>not<>
    not<>to<>    to<>be<>

The trigrams for this phrase would be:

    to<>be<>or<>     be<>or<>not<>     
    or<>not<>to<>    not<>to<>be<>
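
The contiguous case can be sketched in a few lines of plain Perl. The helper name contiguous_ngrams is hypothetical and the code does not use the module. It only shows how a sliding slice over the token list yields the "<>"-delimited strings shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: build every contiguous n-gram of size $n from a
# token list, written in the "<>"-delimited form used in this document.
sub contiguous_ngrams {
    my ($n, @tokens) = @_;
    my @ngrams;
    # the last valid start index is (number of tokens - $n)
    for my $start (0 .. @tokens - $n) {
        my @slice = @tokens[$start .. $start + $n - 1];
        push @ngrams, join('', map { "$_<>" } @slice);
    }
    return @ngrams;
}

my @tokens = qw(to be or not to be);
print "$_\n" for contiguous_ngrams(2, @tokens);
print "$_\n" for contiguous_ngrams(3, @tokens);
```

Repeated ngrams such as "to<>be<>" are returned once per occurrence, so a frequency count is a matter of tallying the returned list.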

3. Tokens

We define a token as a contiguous sequence of characters that matches one of a set of regular expressions. These regular expressions may be user-provided, or, if not provided, are assumed to be the following two regular expressions:

 \w+        -> this matches a contiguous sequence of alphanumeric characters (including the underscore)

 [\.,;:\?!] -> this matches a single punctuation mark

For example, assume the following is a line of text:

"to be or not to be, that is the question!"

Then, using the above regular expressions, we get the following tokens:

    to           be           or          not       
    to           be           ,           that      
    is           the          question    !

If we assume that the user provides the following regular expression:

 [a-zA-Z]+  -> this matches a contiguous sequence of alphabetic characters

Then, we get the following tokens:

    to           be           or          not       
    to           be           that        is      
    the          question 
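
The two tokenizations above can be reproduced with a short, module-independent sketch. The helper name tokenize is hypothetical. Trying the expressions in the order given, and skipping characters that match none of them, is an assumption about how the token definitions are applied.

```perl
use strict;
use warnings;

# Hypothetical helper: split $text into tokens, where a token is any match
# of one of the supplied regular expressions. The alternation tries the
# patterns in the order given; unmatched characters are skipped.
sub tokenize {
    my ($text, @patterns) = @_;
    my $alternation = join '|', map { "(?:$_)" } @patterns;
    my @tokens = $text =~ /($alternation)/g;
    return @tokens;
}

my $line = 'to be or not to be, that is the question!';

# the two default token definitions
my @default = tokenize($line, '\w+', '[\.,;:\?!]');

# the user-provided definition from the example
my @alpha = tokenize($line, '[a-zA-Z]+');

print join(' ', @default), "\n";
print join(' ', @alpha), "\n";
```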
    

4. Usage

    use Text::Positional::Ngram;

Text::Positional::Ngram Requirements

   use Text::Positional::Ngram;

   #  create an instance of Text::Positional::Ngram
   my $text = Text::Positional::Ngram->new();

   #  create the files needed and specify which
   #  file you would like to get the ngrams from
   $text->create_files("my_file.txt");

   #  get the ngrams
   $text->get_ngrams();
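
A slightly fuller sketch of the same flow follows. It uses only methods named in this document, but the chosen sizes, the file names, and the decision to call every setter before create_files are assumptions. Placing the setters first is one simple way to satisfy both ordering notes in the function list (before create_files and before get_ngrams). The explicit remove_files call may be redundant if the module already cleans up on completion.

```perl
use strict;
use warnings;

# Sketch only: every method below is named in this document, but the
# argument values and the position of set_window_size and
# set_destination_file in the call order are assumptions.
sub run_ngram_pipeline {
    my ($input_file, $output_file) = @_;

    # loaded at run time so the script compiles even without the module
    require Text::Positional::Ngram;
    my $text = Text::Positional::Ngram->new();

    # setters first, so they precede both create_files and get_ngrams
    $text->set_ngram_size(3);
    $text->set_window_size(4);
    $text->set_destination_file($output_file);

    $text->create_files($input_file);
    $text->get_ngrams();

    my $count = $text->get_ngram_count();
    $text->remove_files();
    return $count;
}

# Placeholder invocation: perl script.pl my_file.txt my_ngrams
if (@ARGV == 2) {
    print run_ngram_pipeline(@ARGV), " ngrams\n";
}
```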

Text::Positional::Ngram Functions

1. create_files(@FILE)
    Takes an array of files from which the ngrams are
    to be obtained. This function creates the files
    that are required for the ngrams to be created.
    These files are named after the first file in the
    FILE array and timestamped.

    1. vocabulary file : converts tokens to integers
    2. snt file        : integer representation of corpus
    3. sntngram file   : integer representation of the ngrams
                         and their frequency counts
    4. ngram file      : ngrams and their frequencies
2. get_ngrams()
    Obtains ngrams of size two and their frequencies
    storing them in the given ngram file.
3. create_stop_list(FILE)
    Removes n-grams containing at least one (in OR mode) 
    stop word or all stop words (in AND mode). The default 
    is OR mode. Each stop word should be a regular expression 
    in this FILE and should be on a line of its own. These 
    should be valid Perl regular expressions, which means that 
    any occurrence of the forward slash '/' within the regular 
    expression must be 'escaped'. 
4. set_stop_mode(MODE)
    OR mode removes n-grams containing at least 
    one stop word and AND mode removes n-grams 
    that consist entirely of stop words. 
    Default:  AND
5. set_token_file(FILE)
    Each regular expression in this FILE should be on a line
    of its own, and should be delimited by the forward slash 
    '/'. These should be valid Perl regular expressions, which 
    means that any occurrence of the forward slash '/' within 
    the regular expression must be 'escaped'. 
    
    NOTE: This function should be called before the 
    function that creates the main files ie before 
    create_files(FILE).
        
6. set_nontoken_file(FILE)
    The set_nontoken_file function can be used when there 
    are predictable sequences of characters that you know 
    should not be included as tokens.

    NOTE: This function should be called before the 
    function that creates the main files ie before 
    create_files(FILE).
7. set_remove(N)
    Ignores n-grams that occur fewer than N times. 
    Ignored n-grams are not counted and so do not affect 
    counts and frequencies.

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
           
8. set_marginals()
    The marginal frequencies consist of the frequencies of 
    the individual tokens in their respective positions in
    the n-gram. 

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
9. set_newline()
    Prevents n-grams from spanning across the new-line
    character
        
10. set_frequency(N)
    Does not display n-grams that occur less than N times

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
11. set_min_ngram_size(N)
    Finds n-grams greater than or equal to size N.
    Default: 2

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
12. set_max_ngram_size(N)
    Finds n-grams less than or equal to size N
    Default: 2

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
13. set_ngram_size(N)
    Finds ngrams equal to size N
    Default : 2

    NOTE:  Should be set before you retrieve the ngrams, 
    ie before you call the get_ngrams() function.
14. set_destination_file(FILE)
    Prints the ngrams to FILE. 

    The hidden files that get erased when the program 
    completes are named <FILE>.<ext>.

    If this is not set the files will be named
    default.<ext>
    
15. get_ngram_count()
    Returns the number of n-grams.
16. remove_files()
    Removes the snt, sntngram and the vocab file.
    
17. set_window_size(N)
    Sets the size of the window in which a positional 
    ngram can be found.
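
The difference between the OR and AND stop modes described under create_stop_list and set_stop_mode can be illustrated with a small, module-independent sketch. The helper names is_stop_word and should_remove, and the anchored stop expressions, are hypothetical. The module's own filtering may differ in detail.

```perl
use strict;
use warnings;

# Hypothetical helper: true if $token matches any stop expression.
sub is_stop_word {
    my ($token, $stop_regexes) = @_;
    for my $re (@$stop_regexes) {
        return 1 if $token =~ $re;
    }
    return 0;
}

# Hypothetical helper: decide whether an n-gram (passed as its tokens)
# would be removed. OR: at least one token is a stop word.
# AND: every token is a stop word.
sub should_remove {
    my ($mode, $stop_regexes, @tokens) = @_;
    my $hits = grep { is_stop_word($_, $stop_regexes) } @tokens;
    return $mode eq 'OR' ? $hits > 0 : $hits == @tokens;
}

# anchored so that "to" does not also match inside a longer token
my @stop = (qr/^to$/, qr/^be$/, qr/^or$/);

for my $ngram ([qw(to be)], [qw(be question)], [qw(that question)]) {
    printf "%-16s OR: %-6s AND: %s\n",
        join('<>', @$ngram),
        (should_remove('OR',  \@stop, @$ngram) ? 'remove' : 'keep'),
        (should_remove('AND', \@stop, @$ngram) ? 'remove' : 'keep');
}
```

An ngram such as "be<>question" is removed in OR mode but kept in AND mode, which is why the choice of mode can change the resulting counts considerably.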

AUTHOR

Bridget Thomson McInnes, bthomson@d.umn.edu

BUGS

Limitations of this package are:

1. Only a partial set of marginal counts is found in this package. The frequencies of the individual tokens in the n-gram are recorded. For example, given the trigram w1 w2 w3, the marginal counts that would be returned are: the number of times w1 occurs in position one of an ngram, the number of times that w2 occurs in the second position of an ngram, and the number of times that w3 occurs in the third position of an ngram.
2. The size of the corpus that this package can retrieve ngrams from is limited to approximately 75 million tokens. Please note that this number may vary depending on what options are used.
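
The positional marginal counts described in the first limitation can be sketched as follows. This is an illustration in plain Perl with a hypothetical helper name, positional_marginals. It is not the module's implementation, and it assumes each occurrence of an ngram contributes one count per position.

```perl
use strict;
use warnings;

# Hypothetical helper: given n-grams as array references of tokens, count
# how often each token appears in each position of an n-gram.
# The result is an array reference: $result->[$position]{$token} = count.
sub positional_marginals {
    my (@ngrams) = @_;
    my @marginals;
    for my $ngram (@ngrams) {
        for my $pos (0 .. $#$ngram) {
            $marginals[$pos]{ $ngram->[$pos] }++;
        }
    }
    return \@marginals;
}

# the contiguous bigrams of "to be or not to be", one entry per occurrence
my @bigrams = (
    [qw(to be)], [qw(be or)], [qw(or not)], [qw(not to)], [qw(to be)],
);
my $m = positional_marginals(@bigrams);
print "'to' in position one: $m->[0]{to}\n";
print "'to' in position two: $m->[1]{to}\n";
```

Counts for combinations of positions (for example, w1 and w3 together in a trigram) are not covered by this kind of table, which is the partial coverage the limitation refers to.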

SEE ALSO

COPYRIGHT

Copyright (C) 2004-2007, Bridget Thomson McInnes

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 59 Temple Place - Suite 330, Boston, MA 02111-1307, USA.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.

perl(1)
