
NAME

clusterlabeling.pl - Label discovered clusters based on their content

SYNOPSIS

 clusterlabeling.pl [OPTIONS] INPUTFILE

DESCRIPTION

Assigns labels to each cluster using the significant word pairs found in the cluster contexts. Also writes each cluster to a separate file, which is particularly useful for the web interface.

Two types of labels are assigned to each cluster: Descriptive and Discriminating. Descriptive labels are the top n significant word pairs for the cluster. Discriminating labels are those of the cluster's top n significant word pairs that are unique to that cluster, i.e. that do not appear among the top n word pairs of any other cluster.
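The relationship between the two label types can be illustrated with a minimal Python sketch. It is not the script's own code: the function name `label_clusters` and the input layout (a dict mapping each cluster id to its top n word pairs) are assumptions made for illustration only.

```python
def label_clusters(top_pairs):
    """Derive both label types from each cluster's top-n word pairs.

    top_pairs maps a cluster id to the list of its top-n significant
    word pairs (hypothetical input layout, assumed for this sketch).
    """
    labels = {}
    for cid, pairs in top_pairs.items():
        # Collect the word pairs that appear in any *other* cluster.
        others = set()
        for other_id, other_pairs in top_pairs.items():
            if other_id != cid:
                others.update(other_pairs)
        labels[cid] = {
            # Descriptive: all of the cluster's top-n pairs.
            "descriptive": list(pairs),
            # Discriminating: only the pairs unique to this cluster.
            "discriminating": [p for p in pairs if p not in others],
        }
    return labels


top_pairs = {
    0: ["Bill Clinton", "Mariana Islands", "World Cup"],
    2: ["Bill Clinton", "Erik wrote", "Lyle Menendez"],
}
labels = label_clusters(top_pairs)
print(labels[0]["discriminating"])  # ['Mariana Islands', 'World Cup']
```

Here "Bill Clinton" is descriptive for both clusters but discriminating for neither, because both clusters share it.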

Required Arguments:

INPUTFILE

File created by Toolkit/evaluate/format_clusters.pl with the --context option.

Optional Arguments:

--token TOKEN

A file containing one or more Perl regexes that define the tokenization scheme used in INPUTFILE.

If --token is not specified, the program looks for the default token regex file, token.regex, in the current directory.

--prefix PRE

Specifies a prefix for the names of the cluster files. For example, if the specified prefix is PRE, the cluster with id=0 is written to the file PRE.cluster.0

If no prefix is specified, one is created by concatenating a time stamp to the string "expr".
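The naming scheme can be sketched in Python as follows. The helper `cluster_file_name` is hypothetical, and the time stamp format (whole seconds since the epoch) is an assumption; the documentation only says that a time stamp is concatenated to "expr".

```python
import time


def cluster_file_name(cluster_id, prefix=None):
    """Build the file name for one cluster file (illustrative sketch)."""
    if prefix is None:
        # Assumption: the time stamp is the current epoch time in seconds.
        prefix = "expr" + str(int(time.time()))
    return f"{prefix}.cluster.{cluster_id}"


print(cluster_file_name(0, "PRE"))  # PRE.cluster.0
```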

--stop STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

STOPFILE can be specified in one of two modes:

  • AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE

  • OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE [Default]

AND mode ignores word pairs in which both words are stop words.

OR mode ignores word pairs in which either word is a stop word.
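The difference between the two modes can be sketched in Python. This is only an illustration under stated assumptions: the function names are hypothetical, the stop file is assumed to hold one regex per line (optionally wrapped in slashes), and Python's `re` module stands in for Perl's regex engine.

```python
import re


def parse_stop_file(lines):
    """Return (mode, compiled regexes) from the lines of a stop file."""
    mode = "OR"  # OR is the documented default
    lines = [ln.strip() for ln in lines if ln.strip()]
    if lines and lines[0].startswith("@stop.mode="):
        mode = lines[0].split("=", 1)[1].strip().upper()
        lines = lines[1:]
    if mode not in ("AND", "OR"):
        raise ValueError(f"unknown stop mode: {mode}")
    # Assumption: one regex per line, optionally written as /regex/.
    patterns = [re.compile(ln.strip("/")) for ln in lines]
    return mode, patterns


def keep_pair(pair, mode, patterns):
    """Decide whether a word pair survives the stop list."""
    is_stop = [any(p.search(word) for p in patterns) for word in pair]
    if mode == "AND":
        return not all(is_stop)  # drop only if both words are stop words
    return not any(is_stop)      # drop if either word is a stop word


pairs = [("the", "of"), ("the", "president"), ("Bill", "Clinton")]
mode, patterns = parse_stop_file(["@stop.mode=AND", "/^the$/", "/^of$/"])
print([p for p in pairs if keep_pair(p, mode, patterns)])
# [('the', 'president'), ('Bill', 'Clinton')]
```

With the same regexes and no mode line, the default OR mode would also drop ("the", "president").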

--ngram n

Sets the size of the ngrams that will be used for the labels. Valid values are 2, 3, and 4.

Default value for this option is 2 (i.e. bigrams, the default feature selection).

--remove N

Removes bigrams that occur fewer than N times.

Default value for this option is 5
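A frequency cutoff of this kind can be sketched in Python as below. The function name `remove_rare` and the representation of bigrams as tuples are assumptions for illustration, not the script's internals.

```python
from collections import Counter


def remove_rare(bigrams, n=5):
    """Keep only the bigrams that occur at least n times."""
    counts = Counter(bigrams)
    return {bigram: c for bigram, c in counts.items() if c >= n}


observed = [("World", "Cup")] * 5 + [("per", "hour")] * 4
print(remove_rare(observed))  # {('World', 'Cup'): 5}
```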

--window W

Specifies the window size for bigrams. Pairs of words that co-occur within the specified window of each other (a window of W allows at most W-2 intervening words) will form the bigram features.

Default window size is 2, which allows only consecutive word pairs.
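The effect of the window size can be shown with a small Python sketch. The function `window_pairs` is hypothetical, and it assumes that pairs are ordered (the first word precedes the second in the text).

```python
def window_pairs(tokens, window=2):
    """List the ordered word pairs that fall within the given window."""
    if window < 2:
        raise ValueError("window must be at least 2")
    pairs = []
    for i, first in enumerate(tokens):
        # The second word lies at most window-1 positions to the right,
        # which leaves at most window-2 words between the two.
        for second in tokens[i + 1 : i + window]:
            pairs.append((first, second))
    return pairs


tokens = ["Northern", "Mariana", "Islands"]
print(window_pairs(tokens, window=2))
# [('Northern', 'Mariana'), ('Mariana', 'Islands')]
print(window_pairs(tokens, window=3))
# [('Northern', 'Mariana'), ('Northern', 'Islands'), ('Mariana', 'Islands')]
```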

--stat STAT

Specifies the statistical measure of association used to score the word pairs. The following are available:

                ll              -       Log Likelihood Ratio [default]
                pmi             -       Point-Wise Mutual Information
                tmi             -       True Mutual Information
                x2              -       Chi-Squared Test
                phi             -       Phi Coefficient
                tscore          -       T-Score
                dice            -       Dice Coefficient
                odds            -       Odds Ratio
                leftFisher      -       Left Fisher's Test
                rightFisher     -       Right Fisher's Test
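As an illustration of what such a score measures, the following Python sketch computes the log-likelihood ratio in its standard G² form for a bigram's 2x2 contingency table. The function name and argument names are assumptions, and the script's own implementation may differ in details.

```python
import math


def log_likelihood(n11, n1p, np1, npp):
    """G^2 log-likelihood ratio for a bigram.

    n11: times the two words occur together as a bigram
    n1p: times the first word occurs in the first position
    np1: times the second word occurs in the second position
    npp: total number of bigrams
    """
    # Fill in the remaining observed cells of the 2x2 table.
    observed = [
        [n11, n1p - n11],
        [np1 - n11, npp - n1p - np1 + n11],
    ]
    rows = [n1p, npp - n1p]
    cols = [np1, npp - np1]
    total = 0.0
    for i in range(2):
        for j in range(2):
            n = observed[i][j]
            expected = rows[i] * cols[j] / npp
            if n > 0:  # a zero cell contributes nothing to the sum
                total += n * math.log(n / expected)
    return 2.0 * total


# Observed counts equal to the expected counts: no association.
independent = log_likelihood(10, 100, 100, 1000)
# The two words only ever occur together: strong association.
associated = log_likelihood(10, 10, 10, 1000)
print(independent < associated)  # True
```

A score near zero means the pair occurs about as often as chance predicts; larger scores indicate a stronger association.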

--rank R

Word pairs ranking below R when arranged in descending order of their test scores are ignored.

Default value for this option is 10
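A rank cutoff can be sketched in Python as follows. The function `top_ranked` is hypothetical, and the sketch assumes that every word pair occupies its own position in the ordering; the actual program may treat tied scores differently.

```python
def top_ranked(scores, rank=10):
    """Return the word pairs holding the top `rank` positions by score."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [pair for pair, _ in ordered[:rank]]


scores = {"Bill Clinton": 42.0, "World Cup": 57.5, "per hour": 12.3}
print(top_ranked(scores, rank=2))  # ['World Cup', 'Bill Clinton']
```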

--newLine

If turned on, the word pair selection process will not span across newlines.

By default this option is turned off, that is, word pair selection spans across lines.
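The difference can be illustrated with a short Python sketch. It assumes whitespace tokenization and consecutive word pairs only, and the function name `consecutive_pairs` is hypothetical.

```python
def consecutive_pairs(text, new_line=False):
    """Pair up adjacent words, optionally stopping at line boundaries."""
    # With new_line on, each line is processed on its own, so no pair
    # can combine the last word of one line with the first of the next.
    segments = text.splitlines() if new_line else [text]
    pairs = []
    for segment in segments:
        tokens = segment.split()  # assumption: whitespace tokenization
        pairs.extend(zip(tokens, tokens[1:]))
    return pairs


text = "Bill Clinton\nWorld Cup"
print(consecutive_pairs(text))
# [('Bill', 'Clinton'), ('Clinton', 'World'), ('World', 'Cup')]
print(consecutive_pairs(text, new_line=True))
# [('Bill', 'Clinton'), ('World', 'Cup')]
```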

Other Options:

--help

Displays a quick summary of program options.

--version

Displays the version information.

--verbose

Displays the current program status on STDERR.

OUTPUT

1. Cluster ids followed by the assigned labels are directed to STDOUT:

 Cluster 0 (Descriptive): Bill Clinton, Mariana Islands, Northern Mariana, Pacific island, World Cup, per hour

 Cluster 0 (Discriminating): Mariana Islands, Northern Mariana, Pacific island, World Cup, per hour

 Cluster 2 (Descriptive): Bill Clinton, Erik wrote, Inc Within, Jersey And, Lyle Menendez

 Cluster 2 (Discriminating): Erik wrote, Inc Within, Jersey And, Lyle Menendez

 Cluster 1: 

 Cluster 3:

 Cluster -1 (Descriptive): York Times, Undated _
 
 Cluster -1 (Discriminating): York Times, Undated _

2. Cluster files, named with the specified prefix or the generated prefix.

SYSTEM REQUIREMENTS

Input to this program should be created by format_clusters.pl.

BUGS

AUTHOR

 Anagha Kulkarni, Carnegie-Mellon University

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2004-2008,2013 Anagha Kulkarni and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.