NAME

cluto2label.pl - Convert Cluto output to a confusion matrix

SYNOPSIS

 cluto2label.pl [OPTIONS] CLUTO KEY

DESCRIPTION

Converts Cluto's clustering solution file into a cluster-by-sense distribution matrix, which can then be given as input to label.pl, the SenseClusters evaluation program.

INPUT

Required Arguments:

CLUTO

The 1st argument should be a clustering solution file (described in section 3.4.1 on page 34 of Cluto's manual) as created by Cluto's scluster and vcluster programs.

For N instances, the CLUTO file will have exactly N lines; the ith line shows the cluster number (starting from 0) to which the ith instance belongs.

e.g.

Cluto's clustering solution file =>

 0
 1
 1
 2
 0
 0
 1
 2

shows the cluster id of each of the 8 instances clustered by Cluto:

 1st, 5th and 6th instances belong to the 1st cluster (Cluster No 0)

 2nd, 3rd and 7th instances belong to the 2nd cluster (Cluster No 1)

 4th and 8th instances belong to the 3rd cluster (Cluster No 2)

Note: a cluster id may be -1, which means the corresponding instance is not assigned to any cluster.
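The grouping described above can be sketched as follows. This is an illustrative Python sketch, not the script's own Perl code; the function name group_by_cluster is hypothetical, and it assumes the solution file has already been read into a list of lines.

```python
from collections import defaultdict


def group_by_cluster(lines):
    """Map each cluster id to the 1-based positions of its instances.

    Instances whose cluster id is -1 are returned separately as unclustered.
    """
    clusters = defaultdict(list)
    unclustered = []
    for position, line in enumerate(lines, start=1):
        cluster_id = int(line.strip())
        if cluster_id == -1:
            unclustered.append(position)
        else:
            clusters[cluster_id].append(position)
    return dict(clusters), unclustered


# The 8-line example solution file shown above.
solution = ["0", "1", "1", "2", "0", "0", "1", "2"]
clusters, unclustered = group_by_cluster(solution)
print(clusters)      # {0: [1, 5, 6], 1: [2, 3, 7], 2: [4, 8]}
print(unclustered)   # []
```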

KEY

The 2nd argument should be a KEY file (in SenseClusters format) showing the true sense class labels of the instances listed in CLUTO.

KEY should have exactly N lines, one for each of the N lines in CLUTO. The ith line in KEY should show, at minimum, a space-separated list of the true sense labels of the ith instance, in the following format:

        <sense id="S"/>+

e.g.

 <sense id="art2"/> <sense id="art4"/>
 <sense id="art1"/>
 <sense id="art3"/><sense id="art4"/>
 <sense id="art3"/>
 <sense id="art4"/> <sense id="art1"/>
 <sense id="art1"/>
 <sense id="art5"/> <sense id="art2"/> <sense id="art3"/>
 <sense id="art2"/> <sense id="art4"/>

shows the true sense ids of the instances in the example CLUTO file above.

If KEY is an actual KEY file created by SenseClusters programs, each line will also begin with the instance id of the corresponding instance.

e.g.

 <instance id="line-n.w7_098:6515:"/> <sense id="art2"/> <sense id="art4"/>
 <instance id="line-n.w8_083:14771:"/> <sense id="art1"/>
 <instance id="line-n.art} aphb 02700649:"/> <sense id="art3"/><sense id="art4"/>
 <instance id="line-n.art} aphb 53900889:"/> <sense id="art3"/>
 <instance id="line-n.w7_066:11025:"/> <sense id="art4"/> <sense id="art1"/>
 <instance id="line-n.art} aphb 42100373:"/> <sense id="art1"/>
 <instance id="line-n.w8_109:8774:"/> <sense id="art5"/> <sense id="art2"/> <sense id="art3"/>
 <instance id="line-n.w7_004:10784:"/> <sense id="art2"/> <sense id="art4"/>
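As a rough illustration of the KEY format, a line can be split into its optional instance id and its sense ids with two regular expressions. This is a Python sketch under assumptions: the helper name parse_key_line and both patterns are hypothetical, written only to match the examples shown here, and are not taken from the script itself.

```python
import re

# Patterns written to match the tags in the examples above (an assumption).
SENSE_TAG = re.compile(r'<sense\s+id="([^"]*)"\s*/>')
INSTANCE_TAG = re.compile(r'<instance\s+id="([^"]*)"\s*/>')


def parse_key_line(line):
    """Return (instance_id or None, list of sense ids) for one KEY line."""
    instance = INSTANCE_TAG.search(line)
    instance_id = instance.group(1) if instance else None
    return instance_id, SENSE_TAG.findall(line)


full = '<instance id="line-n.w7_098:6515:"/> <sense id="art2"/> <sense id="art4"/>'
minimal = '<sense id="art3"/><sense id="art4"/>'
print(parse_key_line(full))     # ('line-n.w7_098:6515:', ['art2', 'art4'])
print(parse_key_line(minimal))  # (None, ['art3', 'art4'])
```

Because the patterns look for tags rather than splitting on whitespace, a line with no space between two sense tags is handled the same way as one with spaces.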

Optional Arguments:

--numthrow N

Ignores clusters containing fewer than N instances.

--perthrow P

Ignores clusters containing fewer than P percent of the instances.

Instances contained in the thrown clusters are counted as unclustered instances.
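The effect of the two thresholds can be sketched as below. This is a hypothetical Python illustration, not the script's code. It assumes that the percentage is taken over the total of the cluster sizes passed in, and that a cluster is thrown if it falls below either threshold when both are given; neither detail is confirmed by this page.

```python
def drop_small_clusters(cluster_sizes, numthrow=None, perthrow=None):
    """Return (kept_sizes, thrown_count) after applying the thresholds.

    cluster_sizes maps a cluster id to its number of instances.
    """
    total = sum(cluster_sizes.values())
    kept = {}
    thrown = 0
    for cluster_id, size in cluster_sizes.items():
        below_count = numthrow is not None and size < numthrow
        below_percent = (
            perthrow is not None and total > 0 and 100.0 * size / total < perthrow
        )
        if below_count or below_percent:
            thrown += size  # these instances are reported as unclustered
        else:
            kept[cluster_id] = size
    return kept, thrown


# Cluster sizes from the 8-instance example: 3, 3 and 2 instances.
sizes = {0: 3, 1: 3, 2: 2}
print(drop_small_clusters(sizes, numthrow=3))   # ({0: 3, 1: 3}, 2)
print(drop_small_clusters(sizes, perthrow=30))  # ({0: 3, 1: 3}, 2)
```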

--help

Displays this message.

--version

Displays the version information.

OUTPUT

This will show

  • Number of unclustered instances on 1st line.

  • Sense labels of the corresponding columns in the Cluster Sense Matrix on the 2nd line, starting with the marker //

  • Cluster Sense Matrix from the 3rd line onwards. The matrix shows the distribution of instances from each sense class (represented by the column labels) in each of the clusters (represented by the rows).

    Each cell entry at [i][j] in the Cluster Sense Matrix shows the number of instances in the ith cluster whose true sense class label is the label of the jth column.

    e.g.

     0
     // art1 art2 art3 art4 art5
     2 1 0 2 0
     1 1 2 1 1
     0 1 1 1 0

    This shows that there are no unclustered instances.

    The 1st cluster contains 2 instances each of sense ids art1 and art4, 1 instance of sense id art2, and no instances of sense ids art3 and art5.

    A similar description applies to the 2nd and 3rd clusters.
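The example output above can be reproduced from the CLUTO and KEY examples with a short Python sketch. It is illustrative only and not the script's implementation: the function name cluster_sense_matrix is hypothetical, the KEY lines are assumed to be already reduced to lists of sense ids, and sorting the column labels alphabetically is an assumption that happens to match the example.

```python
from collections import Counter


def cluster_sense_matrix(cluster_ids, key_senses):
    """Return (unclustered_count, column_labels, matrix_rows)."""
    labels = sorted({sense for senses in key_senses for sense in senses})
    counts = {}
    unclustered = 0
    for cluster_id, senses in zip(cluster_ids, key_senses):
        if cluster_id == -1:
            unclustered += 1
            continue
        # An instance with several sense tags is counted once under each tag.
        counts.setdefault(cluster_id, Counter()).update(senses)
    matrix = [[counts[cid][label] for label in labels] for cid in sorted(counts)]
    return unclustered, labels, matrix


cluster_ids = [0, 1, 1, 2, 0, 0, 1, 2]
key_senses = [
    ["art2", "art4"], ["art1"], ["art3", "art4"], ["art3"],
    ["art4", "art1"], ["art1"], ["art5", "art2", "art3"], ["art2", "art4"],
]
unclustered, labels, matrix = cluster_sense_matrix(cluster_ids, key_senses)
print(unclustered)
print("// " + " ".join(labels))
for row in matrix:
    print(" ".join(str(cell) for cell in row))
```

Run on the two example files, this prints 0, then the // label line, then the three matrix rows shown above.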

SYSTEM REQUIREMENTS

Cluto - http://www-users.cs.umn.edu/~karypis/cluto/

AUTHORS

 Amruta Purandare, University of Pittsburgh

 Ted Pedersen,  University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.