NAME

format_clusters.pl - Map Cluto output to Senseval-2 format input file

SYNOPSIS

 format_clusters.pl [OPTIONS] CLUTO_SOLUTION RLABEL

DESCRIPTION

This program maps Cluto's clustering solution file onto a Senseval2 input file to produce more legible output.

INPUT

Required Arguments:

CLUTO_SOLUTION

This is an output file from Cluto that shows which cluster each context is assigned to. It is referred to as *.cluster_solution by the SenseClusters Web interface, or can be specified via the -clustfile option in Cluto. It consists of N lines, where N is the number of contexts, and each line contains an integer value indicating the cluster to which the context represented by that line is assigned.

Each line of this file gives the cluster id assigned to the instance whose id appears on the same line number of the *.rlabel file. The number of lines in the CLUTO_SOLUTION file should be the same as in the RLABEL file.

RLABEL

The row label file lists the instance ids. The instance id on a given line is assigned the cluster id found on the same line number of the *.cluster_solution file. The file name has the extension .rlabel.
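The line-by-line correspondence between the two required files can be sketched as follows. This is only an illustration in Python, not the script's own Perl code: the function name pair_labels is hypothetical, and the sketch assumes that the RLABEL file holds exactly one instance id per line.

```python
def pair_labels(solution_lines, rlabel_lines):
    """Pair each instance id with the cluster id found on the same line number.

    solution_lines: lines of the CLUTO_SOLUTION file (one integer per line).
    rlabel_lines:   lines of the RLABEL file (assumed: one instance id per line).
    """
    clusters = [int(line.strip()) for line in solution_lines]
    instances = [line.strip() for line in rlabel_lines]
    # The two files must describe the same number of contexts.
    if len(clusters) != len(instances):
        raise ValueError(
            "CLUTO_SOLUTION has %d lines but RLABEL has %d"
            % (len(clusters), len(instances))
        )
    return list(zip(instances, clusters))
```

With real files, the two arguments could be obtained by calling readlines() on each open file handle.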

Other Options:

--context SENSEVAL2

SENSEVAL2 should be a file of contexts formatted in the Senseval2 format. These are the contexts that have been clustered. The --context option causes the contexts to be reorganized such that those that occur in the same cluster are grouped together.

--senseval2 SENSEVAL2

SENSEVAL2 should be a file of contexts formatted in the Senseval2 format. These are the contexts that have been clustered. The --senseval2 option causes each context to be tagged with the cluster value assigned to it by Cluto. This cluster value is put into the answer tag. The contexts are displayed in their original order.

--help

Displays the summary of command line options.

--version

Displays the version information.

OUTPUT

If neither --context nor --senseval2 is specified, the default behavior is that contexts are identified by instance id *only* and grouped together by cluster. The actual text of the contexts is not displayed in this case.

Each cluster is formatted as:

 <cluster id="CID"> 
   [<instance id="IID"/>]+ 
 </cluster>
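As a rough illustration of this default output, the Python sketch below groups instance ids by cluster id and prints them in the layout shown above. It is not the script's actual implementation: the function name format_clusters is hypothetical, and emitting clusters in ascending id order with a two-space indent is an assumption.

```python
from collections import defaultdict


def format_clusters(pairs):
    """Render (instance_id, cluster_id) pairs as <cluster> blocks.

    Clusters are emitted in ascending id order (an assumption); instances
    keep the order in which they appear in the input.
    """
    groups = defaultdict(list)
    for instance_id, cluster_id in pairs:
        groups[cluster_id].append(instance_id)

    lines = []
    for cluster_id in sorted(groups):
        lines.append('<cluster id="%s">' % cluster_id)
        for instance_id in groups[cluster_id]:
            lines.append('  <instance id="%s"/>' % instance_id)
        lines.append("</cluster>")
    return "\n".join(lines)
```

Calling print(format_clusters([("art.1", 0), ("art.2", 1)])) would write one cluster block per cluster id to STDOUT.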

If the --context option is used, all instances are displayed along with their actual context data, grouped by cluster. The output, sent to STDOUT, looks like:

 <cluster id="CID"> 
   [<instance id="IID"><context>DATA</context></instance>]+ 
 </cluster>

If the --senseval2 option is used, the output is a copy of the input Senseval2 file, except that the answer tags now contain the cluster id assigned to each instance. The output is sent to STDOUT.
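The effect of --senseval2 can be approximated with the Python sketch below. It is a hedged illustration only, not the script's real logic. It assumes that each answer tag has the form <answer instance="IID" senseid="SID"/> and that the cluster id replaces the senseid value; the exact tag layout written by format_clusters.pl may differ. The names tag_answers and cluster_of are hypothetical.

```python
import re

# Assumed answer-tag layout: <answer instance="IID" senseid="SID"/>
ANSWER_TAG = re.compile(r'<answer\s+instance="([^"]+)"\s+senseid="[^"]*"\s*/>')


def tag_answers(senseval2_text, cluster_of):
    """Return a copy of the Senseval2 text with cluster ids in the answer tags.

    cluster_of maps an instance id to the cluster id assigned by Cluto.
    Answer tags for instances missing from the mapping are left unchanged.
    """
    def replace(match):
        instance_id = match.group(1)
        if instance_id not in cluster_of:
            return match.group(0)
        return '<answer instance="%s" senseid="%s"/>' % (
            instance_id,
            cluster_of[instance_id],
        )

    return ANSWER_TAG.sub(replace, senseval2_text)
```

Because only the answer tags are rewritten, the instances stay in their original order, which matches the behavior described for this option.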

Note: --context and --senseval2 cannot be used together.

BUGS

SYSTEM REQUIREMENTS

Cluto - http://www-users.cs.umn.edu/~karypis/cluto/

AUTHORS

Ted Pedersen, University of Minnesota, Duluth

Amruta Purandare, University of Pittsburgh

Anagha Kulkarni, Carnegie-Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Anagha Kulkarni

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.