The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

README.Toolkit - SenseClusters Toolkit directory structure with links to all program documentation

DIRECTORY STRUCTURE

This briefly describes the structure of the Toolkit directory, and gives a brief idea of what each program does. Directories are indicated with a / at the end of their name (preprocess/) while programs end with the .pl suffix. All of this is contained in the Toolkits/ directory. Note that these are organized roughly in the order in which they will be used by SenseClusters.

Please review the flowcharts found in doc/Flowcharts for additional information.

preprocess/ (text preprocessing programs)

  • plain/ (processes input in plain text format)

    • text2sval.pl - Convert simple plain text into Senseval2 format

  • sval2/ (processes input in Senseval-2 format)

    • balance.pl - Balances sense distribution in a Senseval-2 input file by removing some instances

    • filter.pl - Removes instances associated with low frequency sense tags from Senseval-2 input

    • frequency.pl - Displays frequency distribution of senses

    • keyconvert.pl - Convert KEY file from Senseval-2 format to SenseCluster's format

    • maketarget.pl - Create a Perl regex for the target word by spotting all <head> tags in the given file

    • prepare_sval2.pl - Prepare Senseval-2 data for experiments

    • preprocess.pl - Tokenize and optionally split Senseval-2 input into training and test portions

    • sval2plain.pl - Convert a Senseval-2 input file to plain text format

    • windower.pl - Cut a window of context W words big around a target word in a given Senseval-2 input file

count/ (Modify count.pl output from Text-NSP)

  • reduce-count.pl - Reduce the size of the Text-NSP output created with huge training data

matrix/ - (Similarity matrix constructors)

  • bitsimat.pl - Create a similarity matrix for given bit vectors

  • simat.pl - Create a similarity matrix for given non-binary (integer or real) vectors

vector/ (Represent contexts as vectors to be clustered)

  • nsp2regex.pl - Creates regular expressions from Text-NSP output to represent features

  • order1vec.pl - Creates first order context vectors

  • order2vec.pl - Creates second order context vectors

  • wordvec.pl - Creates word vectors from Text-NSP output

svd/ (SVDPACKC interface)

  • mat2harbo.pl - Convert matrices from SenseClusters format to Harwell-Boeing format

  • svdpackout.pl - Reconstruct a matrix from its singular vectors as found by by SVDPACKC

clusterstopping/ (Cluster Stopping program)

  • clusterstopping.pl - Predicts the number of clusters that a given data should be divided into. Provides three such cluster stopping measures.

evaluate/ (Evaluate the results of SenseClusters by comparing to gold standard data)

  • cluto2label.pl - Convert clustering output of Cluto to a cluster by sense confusion matrix for evaluation

  • format_clusters.pl - Display contexts that were clustered with assigned sense id, or display senseval-2 format with assigned sense id

  • label.pl - Assign sense tags to the discovered clusters for evaluation

  • report.pl - Report performance in terms of the precision, recall, and F-Measure, and show a confusion matrix

clusterlabel/ (Cluster Labeling programs)

  • clusterlabeling.pl - Selects significant word-pairs from the contents/instances of the clusters and assigns them as the labels to the clusters. Also creates separate file for each cluster.

AUTHOR

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2003-2008, Ted Pedersen

Permission is granted to copy, distribute and/or modify this document under the terms of the GNU Free Documentation License, Version 1.2 or any later version published by the Free Software Foundation; with no Invariant Sections, no Front-Cover Texts, and no Back-Cover Texts.

Note: a copy of the GNU Free Documentation License is available on the web at http://www.gnu.org/copyleft/fdl.html and is included in this distribution as FDL.txt.