NAME

discriminate.pl - Wrapper program to run SenseClusters in a single command

SYNOPSIS

Discriminates among the given text instances based on their contextual similarities.

USAGE

discriminate.pl [OPTIONS] TEST

INPUT

Required Arguments:

TEST

Senseval-2 formatted TEST instance file that contains the instances to be clustered.

Optional Arguments:

DATA OPTIONS :

--training TRAIN

Training file in plain text format that can be used to select features. If this is not specified, features are selected from the given TEST file.

--split N

Splits the given TEST file into two portions: N% for use as the TRAIN data and (100-N)% as the TEST data. The value for N is a percentage and should be an integer between 1 and 99 (inclusive). The instances from the original TEST file are not picked in any particular order; they are split randomly into the TRAIN and TEST portions while maintaining the ratio of N/(100-N).

Note: This option cannot be used when --training option is also used.
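The random split described for --split can be sketched as follows. This is a minimal illustration in Python, not the code discriminate.pl runs: the function name split_instances, the seed parameter and the rounding rule are all assumptions made for the example.

```python
import random


def split_instances(instances, n_percent, seed=None):
    """Randomly split instances into (train, test), with about n_percent% in train."""
    if not 1 <= n_percent <= 99:
        raise ValueError("N must be an integer between 1 and 99")
    rng = random.Random(seed)
    shuffled = list(instances)
    rng.shuffle(shuffled)  # instances are not taken in any particular order
    # Rounding to the nearest whole instance is an assumption of this sketch.
    n_train = round(len(shuffled) * n_percent / 100)
    return shuffled[:n_train], shuffled[n_train:]


train, test = split_instances(range(200), 80, seed=1)
print(len(train), len(test))
```

With 200 instances and N = 80, the two portions keep the 80/20 ratio regardless of which instances land in each.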

--token TOKEN

A file containing Perl regex/s that define the tokenization scheme in the TRAIN and TEST files. If --token is not specified, the default token regex file token.regex is looked for in the current directory.

--target TARGET

A file containing Perl regex/s for identifying the target word. A sample target.regex file containing regex:

    /<head>\w+</head>/

is provided with this distribution. If --target is not specified, the default target regex file target.regex is looked for in the current directory. If this file doesn't exist, target.regex is automatically created by finding all instances of <head> tags in the TEST data. If there are no <head> tags in TEST, the given data is assumed to be global and the target word is not searched for in either TRAIN or TEST.

 Note: --target cannot be specified with headless input data
       i.e. test file without head/target word(s).

--prefix PRE

Specify a prefix to be used in all output file names. e.g. context vector file will have name 'PRE.vectors', features file will have name 'PRE.features' and so on ... By default, a random prefix is created using the time stamp.

--format f16.XX

The default format for floating point numbers is f16.06. This means that there is room for 6 digits to the right of the decimal point and 9 to the left. You may change XX to any value between 0 and 15; however, the format must remain 16 characters wide due to the formatting requirements of SVDPACKC.
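To make the field layout concrete, here is a small Python sketch of a 16-character fixed-point field. The helper format_f16 is hypothetical and only mirrors the f16.XX description; it is not part of SenseClusters or SVDPACKC.

```python
def format_f16(value, decimals=6):
    """Format value in a 16-character field with `decimals` digits after the point."""
    if not 0 <= decimals <= 15:
        raise ValueError("XX must be between 0 and 15")
    # Width stays fixed at 16; only the number of decimal digits changes.
    return f"{value:16.{decimals}f}"


field = format_f16(3.14159)
print(repr(field), len(field))
```

Note that Python widens the field instead of failing when a value needs more than 16 characters, so this sketch illustrates only the layout, not overflow handling.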

--wordclust

Discriminates and clusters each word based upon its direct and indirect co-occurrence with other words (when used without the --lsa switch) or clusters words or features based upon their occurrences in different contexts (when used with the --lsa switch).

 Note: 1. Separate (--training) TRAIN data should not be used with word 
          clustering.
       2. Starting with Version 0.93, word clustering is no longer 
          restricted to using only headless data. However, options 
          specific to headed data such as --scope_test and target 
          co-occurrence features (see below) cannot be used.

--lsa

Uses Latent Semantic Analysis (LSA) style representation for clustering features or contexts. LSA representation is the transpose of the context-by-feature matrix created using the native SenseClusters order1 context representation.

This option can be used only in the following two combinations of the --context and the --wordclust options:

1. --context o1 --wordclust --lsa

Performs feature clustering by grouping together features based on the contexts that they occur in. Features can be unigrams, bigrams or co-occurrences. Feature vectors are the rows of the transposed context-by-feature representation created by order1vec.pl.

2. --context o2 --lsa

Performs context clustering by creating context vectors by averaging the feature vectors from the transposed context-by-feature representation of order1vec.pl.

FEATURE OPTIONS :

--feature TYPE

Specify the feature type to be used for representing contexts. Possible options for feature type with first order context representation:

        bi      -   bigrams  [default]
        tco     -   target co-occurrences       
        co      -   co-occurrences
        uni     -   unigrams

Possible options for feature type with second order context representation:

        bi      -   bigrams  [default]
        co      -   co-occurrences
        tco     -   target co-occurrences

 Note: The tco feature type (target co-occurrences) cannot be used with 
       headless data i.e. test/train file without head/target word(s).

--scope_train S1

Limits the scope of the training contexts to S1 words around (on both sides of) the TARGET word. Thus, it allows selection of local features. If --scope_train is used, each training instance is expected to include the target word as specified by the --target option or default target.regex.

 Note: --scope_train cannot be used with headless data i.e. train files
       without head/target word(s).

--scope_test S2

Limits the scope of the test contexts to S2 words around (on both sides of) the TARGET word. Thus, it allows local features to be matched and used in the context vectors.

 Note: --scope_test cannot be used with headless data i.e. test files
       without head/target word(s).

--stop STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

STOPFILE could be specified with two modes -

AND mode - declared by including '@stop.mode=AND' on the first line of the STOPFILE; ignores word pairs in which both words are stop words.

OR mode - declared by including '@stop.mode=OR' on the first line of the STOPFILE; ignores word pairs in which either word is a stop word.

Both modes exclude stop words from unigram features.

Default is OR mode.
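The difference between the two stop modes can be illustrated with a short Python sketch. The patterns below are made-up examples, Python regexes stand in for the Perl regexes of a real STOPFILE, and matching the whole word is an assumption of the sketch.

```python
import re


def is_stop(word, stop_patterns):
    """True if any stop pattern matches the whole word (an assumption here)."""
    return any(p.fullmatch(word) for p in stop_patterns)


def keep_pair(pair, stop_patterns, mode="OR"):
    """Decide whether a word pair survives stop-list filtering."""
    first, second = (is_stop(w, stop_patterns) for w in pair)
    if mode == "AND":
        return not (first and second)  # drop only if both words are stop words
    return not (first or second)       # OR mode: drop if either word is a stop word


# Hypothetical stop patterns, for illustration only.
stop_patterns = [re.compile(p) for p in (r"the", r"of", r"an?")]
pairs = [("the", "bank"), ("of", "the"), ("river", "bank")]
kept_or = [p for p in pairs if keep_pair(p, stop_patterns, "OR")]
kept_and = [p for p in pairs if keep_pair(p, stop_patterns, "AND")]
print(kept_or)
print(kept_and)
```

OR mode is the stricter of the two: it keeps only pairs in which neither word is a stop word.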

--remove F

Removes features that occur fewer than F times in the training corpus.

--window W

Specifies the window size for bigram/co-occurrence features. Pairs of words that co-occur within the specified window from each other (window W allows at most W-2 intervening words) will form the bigram/co-occurrence features.

Default window size is 2 which allows only consecutive word pairs.

Not applicable to unigram features.
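As an illustration of how the window size bounds the number of intervening words, the following Python sketch lists the word pairs found in a token sequence. The function window_pairs is hypothetical, and it emits pairs in text order, which corresponds to the bigram case.

```python
def window_pairs(tokens, window=2):
    """Return word pairs with at most window-2 intervening words."""
    if window < 2:
        raise ValueError("window size must be at least 2")
    pairs = []
    for i, first in enumerate(tokens):
        # The second word may sit up to window-1 positions after the first.
        for j in range(i + 1, min(i + window, len(tokens))):
            pairs.append((first, tokens[j]))
    return pairs


tokens = ["a", "b", "c", "d"]
print(window_pairs(tokens, 2))
print(window_pairs(tokens, 3))
```

With the default window of 2 only consecutive words are paired; a window of 3 also pairs words that are one word apart.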

--stat STAT

Bigrams and co-occurrences can be selected based on their statistical scores of association as specified by this option. If --context o2 and --stat are used together, the word association matrix will use the scores computed by the specified statistical test instead of simple joint frequency counts of the word pairs.

Available tests of association are :

        dice            -       Dice Coefficient
        ll              -       Log Likelihood Ratio
        odds            -       Odds Ratio
        phi             -       Phi Coefficient
        pmi             -       Point-Wise Mutual Information
        tmi             -       True Mutual Information
        x2              -       Chi-Squared Test
        tscore          -       T-Score
        leftFisher      -       Left Fisher's Test
        rightFisher     -       Right Fisher's Test

By default, features are selected and represented using their frequency counts.

--stat_rank N

Word pairs ranking below N when arranged in descending order of their test scores are ignored.

--stat_rank has no effect unless --stat is specified.

--stat_score S

Selects word pairs with scores greater than S after performing the selected test of association. The score can be any real number that gives a reasonable number of features for the requested test.

--stat_score has no effect unless --stat is specified.

VECTOR OPTIONS :

--context ORD

Specifies the context representation to be used. Set ORD to 'o1' to use 1st order context vectors, and to 'o2' to select 2nd order context vectors. Default context representation is o2.

--binary

Creates binary feature and context vectors. By default, feature vectors show the joint frequency scores of the associated word pairs while the context vectors show the average of the feature vectors of words that occur in the context. With --binary turned ON, feature vectors show mere presence or absence of the particular word pair (co-occurrence/bigram) in TRAIN, while the context vectors will represent a binary 'OR' operation on the corresponding vectors of contextual features.
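The two ways of combining feature vectors into a context vector can be sketched in Python as follows. This is a simplified illustration of the averaging and binary OR operations described above; the function name and the list-of-lists representation are assumptions.

```python
def context_vector(feature_vectors, binary=False):
    """Combine the feature vectors of the words in one context."""
    columns = list(zip(*feature_vectors))
    if binary:
        # Binary OR: a dimension is 1 if any feature vector has it set.
        return [int(any(col)) for col in columns]
    # Default: average of the feature vectors, dimension by dimension.
    n = len(feature_vectors)
    return [sum(col) / n for col in columns]


averaged = context_vector([[2, 0, 1], [0, 0, 3]])
ored = context_vector([[1, 0, 1], [0, 0, 1]], binary=True)
print(averaged, ored)
```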

SVD OPTIONS :

--svd

Reduces the feature space dimensions by performing Singular Value Decomposition (SVD). By default, all feature dimensions are retained.

--k K

Reduces the dimensions of the feature space to K. Default K = 300

--rf RF

Specifies the scaling factor for reducing feature space dimensions such that feature space with N dimensions is reduced down to N/RF. Default RF = 4. RF should be an integer greater than 1.

If both --k and --rf are specified, dimensions are reduced to min(k,N/RF).

 Note: If the reduced number of dimensions ( min(k,N/RF) ) turns out to be 
       less than or equal to 10, then SVD is not performed.
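The interaction of --k and --rf can be summarized in a few lines of Python. This sketch covers only the case where both values are in effect (the documented defaults are used as stand-ins), and the use of integer division for N/RF is an assumption.

```python
def reduced_dimensions(n, k=300, rf=4):
    """Return the target number of dimensions, or None if SVD is skipped."""
    if rf <= 1:
        raise ValueError("RF should be an integer greater than 1")
    target = min(k, n // rf)  # integer division is an assumption of this sketch
    # SVD is not performed when the result is 10 dimensions or fewer.
    return target if target > 10 else None


print(reduced_dimensions(2000))  # K is the smaller value
print(reduced_dimensions(400))   # N/RF is the smaller value
print(reduced_dimensions(40))    # too few dimensions, SVD skipped
```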

--iter I

Specifies the number of iterations of SVD. Recommended value is 3 times the desired K.

CLUSTER-STOPPING OPTIONS:

--cluststop CS

Specifies the cluster stopping measure to be used to predict the number of clusters.

   The possible option values:
   pk1 - Use PK1 measure [ PK1[m] = (crfun[m] - mean(crfun[1...deltaM])) / std(crfun[1...deltaM]) ]
   pk2 - Use PK2 measure [ PK2[m] = (crfun[m]/crfun[m-1]) ]
   pk3 - Use PK3 measure [ PK3[m] = ((2 * crfun[m])/(crfun[m-1] + crfun[m+1])) ]
   gap - Use Adapted Gap Statistic. 
   pk  - Use all the PK measures.
   all - Use all the four cluster stopping measures.

More about these measures can be found in the documentation of Toolkit/clusterstop/clusterstopping.pl
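The PK formulas listed above can be tried out with the following Python sketch. It is only an illustration of the arithmetic: the crfun values are invented, the list is treated as crfun[1...deltaM], and the use of the population standard deviation for PK1 is an assumption (the documentation of clusterstopping.pl is the authority on the details).

```python
from statistics import mean, pstdev


def pk1(crfun):
    """PK1[m] = (crfun[m] - mean) / std over the given crfun values."""
    mu, sigma = mean(crfun), pstdev(crfun)  # population std is an assumption
    return [(value - mu) / sigma for value in crfun]


def pk2(crfun):
    """PK2[m] = crfun[m] / crfun[m-1], defined from the second value on."""
    return [crfun[i] / crfun[i - 1] for i in range(1, len(crfun))]


def pk3(crfun):
    """PK3[m] = 2 * crfun[m] / (crfun[m-1] + crfun[m+1]), interior values only."""
    return [
        2 * crfun[i] / (crfun[i - 1] + crfun[i + 1])
        for i in range(1, len(crfun) - 1)
    ]


# Invented criterion function values for m = 1..4.
crfun = [10.0, 16.0, 20.0, 22.0]
pk1_scores, pk2_scores, pk3_scores = pk1(crfun), pk2(crfun), pk3(crfun)
print(pk1_scores, pk2_scores, pk3_scores)
```

PK2 and PK3 are undefined at the ends of the range, which is why they return fewer values than the input.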

NOTE: Options --cluststop and --clusters (described under Clustering options) cannot be used together.

--delta INT

NOTE: The delta value must be a non-negative integer.

Specify 0 to stop the iterative clustering process when two consecutive crfun values are exactly equal. This is the default setting when the crfun values are integer/whole numbers.

Specify a non-zero positive integer to stop the iterative clustering process when the difference between two consecutive crfun values is less than or equal to this value. Note, however, that when the crfun values are fractional, the specified integer is internally shifted to capture the difference in the least significant digit of the crfun values. For example:

    For crfun = 1.23e-02, delta = 1 is transformed to 0.0001
    For crfun = 2.45e-01, delta = 5 is transformed to 0.005

The default delta value when the crfun values are fractional is 1.

However, if the crfun values are integer/whole numbers (exponent >= 2), then the specified delta value is internally shifted only as far as the least significant digit in the scientific notation. For example:

    For crfun = 1.23e+04, delta = 2 is transformed to 200
    For crfun = 2.45e+02, delta = 5 is transformed to 5
    For crfun = 1.44e+03, delta = 1 is transformed to 10
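The delta examples above are all consistent with multiplying delta by 10 raised to (exponent - 2), where 2 is the number of digits after the decimal point in the scientific notation. The Python sketch below encodes that reading; it is inferred from the examples only, and the function name and the mantissa_decimals parameter are assumptions.

```python
from decimal import Decimal


def effective_delta(delta, exponent, mantissa_decimals=2):
    """Shift an integer delta to the least significant digit of a crfun value.

    `exponent` is the power of ten in the scientific notation of crfun,
    e.g. -2 for 1.23e-02.
    """
    if delta < 0:
        raise ValueError("delta must be a non-negative integer")
    # Decimal avoids binary floating-point noise in results such as 0.0001.
    return Decimal(delta).scaleb(exponent - mantissa_decimals)


print(effective_delta(1, -2))  # crfun = 1.23e-02
print(effective_delta(2, 4))   # crfun = 1.23e+04
```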

--threspk1 NUM

Specifies the threshold value that should be used by the PK1 measure to predict the k value. Default = -0.7

NOTE: This option should be used only when --cluststop option is also used with option value of "all" or "pk1".

CLUSTER-STOPPING: ADAPTED GAP STATISTIC OPTIONS:

--B NUM

The number of replicates/references to be generated. Default: 1

--typeref TYP

Specifies whether to generate B replicates from a reference or to generate B references.

The possible option values:

        rep - replicates [default]
        ref - references

--percentage NUM

Specifies the percentage confidence to be reported in the log file. Since Gap Statistic uses parametric bootstrap method for reference distribution generation, it is critical to understand the interval around the sample mean that could contain the population ("true") mean and with what certainty. Default: 90

--seed NUM

The seed to be used with the random number generator. Default: No seed is set.

CLUSTERING OPTIONS :

--clusters N

Specifies number of clusters to be created. Default is set to 2.

--space SPACE

Specifies whether clustering is to be performed in vector or similarity space. Set the value of SPACE to 'vector' to perform clustering in vector space i.e. to cluster the context vectors directly. To cluster in similarity space by explicitly finding the pair-wise similarities among the contexts, set SPACE to 'similarity'.

By default, clustering is performed in vector space.

--clmethod CL

Specifies the clustering method.

Possible option values are :

        rb - Repeated Bisections [Default]
        rbr - Repeated Bisections followed by k-way refinement
        direct - Direct k-way clustering
        agglo  - Agglomerative clustering
        graph  - Graph partitioning-based clustering
        bagglo - Partitional biased Agglomerative clustering

For large amounts of data, 'rb', 'rbr' or 'direct' are recommended.

--crfun CR

Selects the criteria function for Clustering. The meanings of these criteria functions are explained in Cluto's manual.

The possible values are:

        i1      -  I1  Criterion function
        i2      -  I2  Criterion function [default for partitional]
        e1      -  E1  Criterion function
        g1      -  G1  Criterion function
        g1p     -  G1' Criterion function
        h1      -  H1  Criterion function
        h2      -  H2  Criterion function
        slink   -  Single link merging scheme
        wslink  -  Single link merging scheme weighted w.r.t. cluster sim
        clink   -  Complete link merging scheme
        wclink  -  Complete link merging scheme weighted w.r.t. cluster sim
        upgma   -  Group average merging scheme [default for agglomerative]

Note that for cluster stopping, only the i1, i2, e1, h1 and h2 criterion functions can be used. If any other crfun is selected, cluster stopping uses the default crfun (i2), while the final clustering of contexts is performed using the crfun specified.

--sim SIM

Specifies the similarity measure to be used for either vector or similarity space clustering.

When --space = vector (or default), possible values of SIM are :

        cos      -  Cosine [default]
        corr     -  Correlation Coefficient
        dist     -  Euclidean distance
        jacc     -  Extended Jaccard Coefficient

When --space = similarity and --binary is ON, possible values of SIM are -

        cos     - Cosine [default]
        mat     - Match
        jac     - Jaccard
        ovr     - Overlap
        dic     - Dice

Otherwise, only cosine measure is available and is default.

The following table summarizes the availability of similarity measures for the 2 clustering approaches, vector (vcl) and similarity (scl), on the 2 types of context vectors, binary (bin) and frequency (freq):

        vcl+bin         vcl+freq        scl+bin         scl+freq
 cos     Y               Y               Y               Y
 mat     N               N               Y               N
 jacc    Y               Y               Y               N
 dice    N               N               Y               N
 ovr     N               N               Y               N
 dist    Y               Y               N               N
 corr    Y               Y               N               N

The reasons are purely implementation issues and in future, we plan to support more consistent measures across these combinations.

--rowmodel RMOD

This option specifies the model used to scale every column of each row. (For further details, please refer to the Cluto manual.)

The possible values for RMOD:

        none  - no scaling is performed [default]
        maxtf - after scaling, the values are between 0.5 and 1.0
        sqrt  - square root of the actual values
        log   - log of the actual values

--colmodel CMOD

This option specifies the model used to (globally) scale each column across all rows. (For further details, please refer to the Cluto manual.)

The possible values for CMOD:

        none - no scaling is performed [default]
        idf  - scaling according to inverse document frequency

LABELING OPTIONS :

Note: Labeling options cannot be used with word-clustering (--wordclust).

--label_stop LABEL_STOPFILE

A file of Perl regexes that define the stop list of words to be excluded from the features.

LABEL_STOPFILE could be specified with two modes -

AND mode - declared by including '@stop.mode=AND' on the first line of the LABEL_STOPFILE; ignores word pairs in which both words are stop words.

OR mode - declared by including '@stop.mode=OR' on the first line of the LABEL_STOPFILE; ignores word pairs in which either word is a stop word.

Default is OR.

--label_ngram LABEL_NGRAM

Specifies the value of n in 'n-gram' for the feature selection. The supported values for n are 2, 3 and 4.

Default value is 2 i.e. bigram.

--label_remove LABEL_N

Removes ngrams that occur fewer than LABEL_N times.

--label_window LABEL_W

Specifies the window size for bigrams. Pairs of words that co-occur within the specified window from each other (window LABEL_W allows at most LABEL_W-2 intervening words) will form the bigram features. Default window size is 2 which allows only consecutive word pairs.

--label_stat LABEL_STAT

Specifies the statistical scores of association.

Available tests of association are :

        dice            -       Dice Coefficient
        ll              -       Log Likelihood Ratio
        odds            -       Odds Ratio
        phi             -       Phi Coefficient
        pmi             -       Point-Wise Mutual Information
        tmi             -       True Mutual Information
        x2              -       Chi-Squared Test
        tscore          -       T-Score
        leftFisher      -       Left Fisher's Test
        rightFisher     -       Right Fisher's Test

--label_rank LABEL_R

Word pairs ranking below LABEL_R when arranged in descending order of their test scores are ignored.

OTHER OPTIONS :

--eval

Evaluates clustering performance by computing precision and recall for the maximally accurate assignment of sense tags to clusters. A maximal assignment is one in which clusters are given sense labels such that the maximum number of instances are attached to their true sense tags.

TEST instances tagged with multiple senses are automatically attached with the single sense-tag that is the most frequent among the attached tags.

Note: This option can be used only if the answer tags are provided in the TEST file.
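The idea of a maximal assignment can be sketched with a brute-force search in Python. This is an illustration only, not the procedure --eval implements: it assumes that each sense labels at most one cluster, that there are at least as many senses as clusters, and that the confusion counts are given as a dictionary. The sense names are invented.

```python
from itertools import permutations


def best_assignment(confusion, senses):
    """Find the cluster-to-sense mapping that tags the most instances correctly.

    confusion[c][s] is the number of instances in cluster c with true sense s.
    """
    clusters = list(confusion)
    best_map, best_correct = {}, -1
    # Try every one-to-one assignment of senses to clusters (small inputs only).
    for perm in permutations(senses, len(clusters)):
        correct = sum(confusion[c].get(s, 0) for c, s in zip(clusters, perm))
        if correct > best_correct:
            best_map, best_correct = dict(zip(clusters, perm)), correct
    return best_map, best_correct


# Invented counts for two clusters and two senses.
confusion = {
    0: {"bank_fin": 8, "bank_river": 2},
    1: {"bank_fin": 3, "bank_river": 7},
}
mapping, correct = best_assignment(confusion, ["bank_fin", "bank_river"])
print(mapping, correct)
```

Exhaustive search grows factorially with the number of clusters, so it is suitable only for illustrating the definition on small examples.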

--rank_filter R

Removes low-frequency senses during evaluation. This will remove the senses that rank below R when the senses in TEST are arranged in descending order of their frequencies. In other words, it selects the top R most frequent senses. An instance will be removed if all of its sense tags rank below R.

--percent_filter P

Removes low-frequency senses based on their percentage frequencies. This will remove senses whose frequency is below P% in the TEST data.

If rank or percent filters are specified, they are applied after removing the multiple sense tags.
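Both filters can be illustrated with a short Python sketch operating on one sense tag per instance (that is, after multiple tags have been reduced to a single tag). The function names and the tie-breaking behaviour for equally frequent senses are assumptions of the sketch.

```python
from collections import Counter


def rank_filter(tags, r):
    """Keep instances whose sense is among the top r most frequent senses."""
    top = {sense for sense, _ in Counter(tags).most_common(r)}
    return [t for t in tags if t in top]


def percent_filter(tags, p):
    """Keep instances whose sense accounts for at least p% of the data."""
    counts = Counter(tags)
    total = len(tags)
    return [t for t in tags if 100 * counts[t] / total >= p]


# Invented sense tags: 60% "a", 30% "b", 10% "c".
tags = ["a"] * 6 + ["b"] * 3 + ["c"]
by_rank = rank_filter(tags, 2)
by_percent = percent_filter(tags, 20)
print(Counter(by_rank), Counter(by_percent))
```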

--help

Displays the quick summary of program options.

--version

Displays the version information.

--verbose

Displays to STDERR the current program status.

--showargs

Displays to STDOUT values of compulsory and required parameters. [NOT SUPPORTED IN THIS VERSION]

OUTPUT

discriminate.pl creates several output files. The discrimination of contexts performed by discriminate.pl, (i.e., a cluster assigned to each context) is given by the file $PREFIX.clusters if the number of clusters was set manually, otherwise by the file $PREFIX.clusters.$CLUSTSTOP where the $CLUSTSTOP specifies the cluster stopping measure that was used to predict the number of clusters.

In addition, discriminate.pl also creates the following files:

NOTE: If a cluster stopping measure was used then it is indicated in the names of several output files by appending the cluster stopping measure name with the file name. Represented below as filename[.$CLUSTSTOP]

  • $PREFIX.clusters_context[.$CLUSTSTOP] - File containing all the input instances grouped by the cluster-id assigned to them.

  • $PREFIX[.$CLUSTSTOP].cluster.CLUSTERID - All the identified clusters and their instances are separated into different files. The filenames end with the cluster-id. e.g.: File containing instances of cluster 0 will be named as $PREFIX.cluster.0

  • $PREFIX.report[.$CLUSTSTOP] - Confusion table if --eval is ON

  • $PREFIX.cluster_labels[.$CLUSTSTOP] - List of labels (word-pairs) assigned to each cluster.

  • $PREFIX[.$CLUSTSTOP].dendogram.ps - Dendrograms + some information.

  • $PREFIX.features - Features file

  • $PREFIX.regex - File containing regular expressions for identifying the features listed in $PREFIX.features file.

  • $PREFIX.testregex - File containing only those regular expressions from the $PREFIX.regex file above, which match at least once in the test contexts, only created in second order context clustering mode (SC native as well as LSA) and LSA feature clustering mode

  • $PREFIX.wordvec - Word Vectors if --context = o2

  • $PREFIX.vectors - Context Vectors

  • $PREFIX.rlabel - Row Labels of $PREFIX.vectors

  • $PREFIX.clabel - Column Labels of $PREFIX.vectors

  • $PREFIX.rclass - Class Ids of $PREFIX.vectors if --eval is ON

  • $PREFIX.cluster_solution[.$CLUSTSTOP] - Cluster ids of $PREFIX.vectors

  • $PREFIX.cluster_output[.$CLUSTSTOP] - Clustering program output

  • $PREFIX.pk1 - crfun[k] values, delta values, PK1[k] values and predicted k value

  • $PREFIX.pk2 - crfun[k] values, delta values, PK2[k] values and predicted k value

  • $PREFIX.pk3 - crfun[k] values, delta values, PK3[k] values and predicted k value

  • $PREFIX.gap - crfun[k] values, delta values and predicted k value

  • $PREFIX.gap.log - Gap(k), Obs(crfun(k)), Exp(crfun(k)) values etc.

The following files are created to facilitate creation of plots, if needed:

  • $PREFIX.cr.dat - value-pairs :- k-value crfun-value

  • $PREFIX.pk1.dat - value-pairs :- k-value PK1[k] value

  • $PREFIX.pk2.dat - value-pairs :- k-value PK2[k] value

  • $PREFIX.pk3.dat - value-pairs :- k-value PK3[k] value

  • $PREFIX.gap.dat - value-pairs :- k-value Gap[k] value

  • $PREFIX.exp.dat - value-pairs :- k-value Exp(crfun[k]) value

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

 Anagha Kulkarni, Carnegie-Mellon University

 Mahesh Joshi, Carnegie-Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni, Mahesh Joshi

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.