NAME

order2vec.pl - Convert Senseval-2 contexts into second order context vectors in Cluto format

SYNOPSIS

 order2vec.pl [OPTIONS] SVAL2 WORDVEC FEATURE_REGEX

Type order2vec.pl --help for a quick summary of options.

DESCRIPTION

Creates second order context vectors by averaging word or feature vectors of the contextual features.

INPUT

Required Arguments:

SVAL2

A tokenized, preprocessed and well formatted Senseval-2 instance file showing instances whose context vectors are to be generated.

order2vec creates a context vector for each instance in the given SVAL2 file by averaging the word or feature vectors of the features that appear in the context.

WORDVEC

Should be one of the following types of files:

  1. A file containing word vectors as created by program wordvec.pl

  2. A file containing feature vectors as created by order1vec.pl, using its --transpose option.

Each line in WORDVEC should show a word or feature vector of the feature represented by the corresponding line in the FEATURE_REGEX file.

order2vec accepts WORDVEC in both sparse and dense formats. If WORDVEC is in dense format, the --dense switch must be selected.

FEATURE_REGEX

Should be one of the following types of files:

  1. The output file generated by running nsp2regex.pl on the FEATURE file as generated by program wordvec.pl while creating the WORDVEC file.

  2. The TEST_REGEX file created by order1vec.pl using its --testregex option, while creating the feature-by-context output file using the --transpose option.

Each line in the FEATURE_REGEX file should show a regular expression for the feature whose feature vector appears on the corresponding line in the WORDVEC file. FEATURE_REGEX should be formatted like the output of the nsp2regex.pl program.

Sample FEATURE_REGEX files:

  1. A file output by nsp2regex.pl when it is run on the file produced by --feats option of wordvec.pl:

     /\s(<[^>]*>)*details(<[^>]*>)*\s/ @name = details
     /\s(<[^>]*>)*weather(<[^>]*>)*\s/ @name = weather
     /\s(<[^>]*>)*test(<[^>]*>)*\s/ @name = test
     /\s(<[^>]*>)*cloth(<[^>]*>)*\s/ @name = cloth
     /\s(<[^>]*>)*health(<[^>]*>)*\s/ @name = health
     /\s(<[^>]*>)*art(<[^>]*>)*\s/ @name = art

  2. A TEST_REGEX file output by order1vec.pl using its --testregex option:

     /\s(<[^>]*>)*polygonal(<[^>]*>)*\s/ @name = polygonal
     /\s(<[^>]*>)*ectoderm(<[^>]*>)*\s/ @name = ectoderm
     /\s(<[^>]*>)*fluid(<[^>]*>)*\s/ @name = fluid
     /\s(<[^>]*>)*CEMx174(<[^>]*>)*\s/ @name = CEMx174
     /\s(<[^>]*>)*adjacent(<[^>]*>)*\s/ @name = adjacent
     /\s(<[^>]*>)*mutant(<[^>]*>)*\s/ @name = mutant
     /\s(<[^>]*>)*progenitor(<[^>]*>)*\s/ @name = progenitor
     /\s(<[^>]*>)*Ganglion(<[^>]*>)*\s/ @name = Ganglion
     /\s(<[^>]*>)*MLS(<[^>]*>)*\s/ @name = MLS
     /\s(<[^>]*>)*male(<[^>]*>)*\s/ @name = male
     /\s(<[^>]*>)*mother(<[^>]*>)*\s/ @name = mother
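The line layout shown in the samples above (a slash-delimited regular expression followed by "@name = FEATURE") can be read and applied to a context roughly as follows. This is only an illustrative Python sketch, not the code order2vec.pl itself uses; the helper names load_feature_regexes and count_features are hypothetical, and padding the context with one space on each side is an assumption made so that features at the very start or end can match.

```python
import re

# Assumed line layout, taken from the samples: /REGEX/ @name = FEATURE
LINE_RE = re.compile(r'^\s*/(.*)/\s*@name\s*=\s*(.+?)\s*$')

def load_feature_regexes(lines):
    """Return a list of (feature_name, compiled_regex), one per non-blank line."""
    features = []
    for line in lines:
        if not line.strip():
            continue
        m = LINE_RE.match(line)
        if m is None:
            raise ValueError("unrecognized FEATURE_REGEX line: %r" % line)
        pattern, name = m.group(1), m.group(2)
        features.append((name, re.compile(pattern)))
    return features

def count_features(context, features):
    """Count non-overlapping matches of every feature regex in one context."""
    # Normalize whitespace and pad both ends (an assumption of this sketch).
    padded = " " + " ".join(context.split()) + " "
    return [len(list(rx.finditer(padded))) for _, rx in features]

sample = [
    r"/\s(<[^>]*>)*to(<[^>]*>)*\s/ @name = to",
    r"/\s(<[^>]*>)*kill(<[^>]*>)*\s/ @name = kill",
]
features = load_feature_regexes(sample)
context = "someone has to kill him to defeat him and that s <head>HARD</head> to do"
counts = count_features(context, features)
print(list(zip([name for name, _ in features], counts)))
```

Because each regex consumes the whitespace on both sides of a feature, two occurrences separated by a single space would be counted once in this sketch; how order2vec.pl treats that case is not stated here.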

Optional Arguments:

--binary

Select this switch to create binary context vectors. Binary vectors are computed by taking the binary OR of the word vectors of the features found in the context. By default, order2vec creates frequency score vectors that show the arithmetic average of the word vectors of the contextual features.
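As a rough illustration of the binary OR described above, the following Python sketch combines sparse vectors stored as {column index: value} dictionaries. The function name binary_or and the rule that any non-zero entry counts as 1 are assumptions of this sketch, not a statement about the actual implementation.

```python
def binary_or(vectors):
    """Element-wise OR of sparse vectors given as {index: value} dicts.

    Assumption: a column is set to 1 if any input vector is non-zero there.
    """
    result = {}
    for vec in vectors:
        for index, value in vec.items():
            if value != 0:
                result[index] = 1
    return dict(sorted(result.items()))

a = {1: 1, 4: 1}          # hypothetical word vector of one feature
b = {1: 1, 3: 1, 9: 0}    # hypothetical word vector of another feature
print(binary_or([a, b]))  # {1: 1, 3: 1, 4: 1}
```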

--dense

By default, word vectors in WORDVEC are assumed to be in sparse format. Also, the context vectors displayed by order2vec are in sparse format.

Select --dense if the word vectors are in dense format. This will automatically create output vectors in dense format as well.

 ****************     IMPORTANT NOTE    ************

 Dense word vectors (when --dense is ON) must be uniformly formatted, i.e.
 every entry of WORDVEC must use the same numeric format and occupy exactly
 the same number of character positions. Use the --format option to specify
 the format of dense word vectors.

--rlabel RLABELFILE

Creates a RLABELFILE containing row labels for Cluto's --rlabelfile option. Each line in RLABELFILE shows an instance id of the instance whose context vector appears on the corresponding line on STDOUT.

Instance ids are extracted from the SVAL2 file by matching regex

                /<instance id\s*=\s*"IID"/>/

where 'IID' is an instance id of the <context> that follows this <instance> tag.

--rclass RCLASSFILE

Creates RCLASSFILE for Cluto's --rclassfile option. Each line in the RCLASSFILE shows the true sense id of the instance whose context vector appears on the corresponding line on STDOUT.

Sense ids are extracted from the SVAL2 file by matching regex

                /sense\s*id\s*=\s*"SID"\/>/

where SID is the true sense tag of the instance whose IID was most recently extracted by matching

                /<instance id\s*=\s*"IID"/>/
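The pairing of instance ids and sense ids described for --rlabel and --rclass can be sketched in Python as below. The patterns are simplified versions of the ones quoted above (they capture the quoted value and do not require a trailing "/>"), the function name row_labels_and_classes is hypothetical, and keeping only the first sense id per instance is an assumption of this sketch.

```python
import re

# Simplified patterns modelled on the regexes quoted above.
INSTANCE_RE = re.compile(r'<instance id\s*=\s*"([^"]*)"')
SENSE_RE = re.compile(r'sense\s*id\s*=\s*"([^"]*)"')

def row_labels_and_classes(sval2_text):
    """Return (instance_ids, sense_ids) in the order instances appear."""
    labels, classes = [], []
    for line in sval2_text.splitlines():
        m = INSTANCE_RE.search(line)
        if m:
            labels.append(m.group(1))
            classes.append(None)       # filled in by the next sense id seen
            continue
        m = SENSE_RE.search(line)
        if m and classes and classes[-1] is None:
            classes[-1] = m.group(1)   # first sense id only (assumption)
    return labels, classes

sval2 = """<instance id="hard-a.sjm-098_3:">
<answer instance="hard-a.sjm-098_3:" senseid="HARD1"/>
<instance id="hard-a.w8_038:">
<answer instance="hard-a.w8_038:" senseid="HARD3"/>
"""
labels, classes = row_labels_and_classes(sval2)
for label, sense in zip(labels, classes):
    print(label, sense)
```

Writing labels one per line would then give the content of an RLABELFILE, and classes the content of an RCLASSFILE, each aligned with the vectors on STDOUT.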

--showkey

Displays the name of the system generated KEY file on the first line of STDOUT. KEY file preserves the instance ids and sense tags of the instances in the given SVAL2 file. This information will be automatically used by some of the clustering and evaluation programs in SenseClusters that operate on purely numeric instance formats. The option should be selected if the user is planning to run SenseClusters' clustering code.

Other Options:

--format FORM

If --dense is ON, the input word vectors must be uniformly formatted, i.e. every entry must use the same numeric format and occupy the same number of digit spaces. If wordvec.pl was run with its --format option, the value of --format given to order2vec.pl should be the same as the one given to wordvec.pl.

Format should be represented as

 iN   -> integer format where each entry occupies total N bytes/digits 

 fN.M -> floating point format where each entry occupies total N bytes/digits
         of which last M digits show the fractional part

When --binary is ON, default format is i2 that assumes 2 digit space for each entry. When --binary is OFF, default format is f16.10 that assumes each entry is fractional occupying total 16 digit equivalent spaces of which last 10 digits show the fractional part.

Output context vectors (sparse or dense) are written using the format given with --format or, if none is given, the default format described above.
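One way to read the iN and fN.M notation is as a fixed-width printf-style conversion. The Python sketch below makes that mapping explicit; it assumes N is the total field width and M the number of fractional digits, as described above, and the function name to_printf is hypothetical.

```python
import re

def to_printf(form):
    """Translate 'iN' or 'fN.M' into a printf-style conversion string."""
    m = re.fullmatch(r'i(\d+)', form)
    if m:
        return "%" + m.group(1) + "d"
    m = re.fullmatch(r'f(\d+)\.(\d+)', form)
    if m:
        return "%" + m.group(1) + "." + m.group(2) + "f"
    raise ValueError("format must look like iN or fN.M, got %r" % form)

print(repr(to_printf("f9.4") % 4.977))   # '   4.9770'
print(repr(to_printf("i2") % 1))         # ' 1'
```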

--help

Displays this message.

--version

Displays the version information.

OUTPUT

Output shows a single context vector on each line. Context vectors represent instances in the same order as they appear in the given SVAL2 file i.e. each ith vector on STDOUT shows a context vector of the ith instance in the SVAL2 file.

Each context vector is an average of the WORD VECtors of the features that are found in the context using FEATURE_REGEX.

Sample Sparse Output

Input Sval2 file => test.sval2

 <corpus lang="english">
 <lexelt item="LEXELT">
 <instance id="hard-a.sjm-098_3:">
 <answer instance="hard-a.sjm-098_3:" senseid="HARD1"/>
 <context>
 someone has to kill him to defeat him and that s <head>HARD</head> to do
 </context>
 </instance>
 <instance id="hard-a.w8_038:">
 <answer instance="hard-a.w8_038:" senseid="HARD3"/>
 <context>
 I find it <head>HARD</head> to believe that you don't believe me
 </context>
 </instance>
 <instance id="hard-a.sjm-255_13:">
 <answer instance="hard-a.sjm-255_13:" senseid="HARD3"/>
 <context>
 when you get bad credit data or are confused with another person your life gets  <head>HARD</head>
 </context>
 </instance>
 <instance id="hard-a.sjm-231_3:">
 <answer instance="hard-a.sjm-231_3:" senseid="HARD2"/>
 <context>
 Our life is <head>HARDER</head> now yes but it is better to live hungry and free life
 </context>
 </instance>
 <instance id="hard-a.sjm-096_2:">
 <answer instance="hard-a.sjm-096_2:" senseid="HARD1"/>
 <context>
 Ray who told his colleagues We have to face the <head>HARD</head> facts of life due to bad credit
 </context>
 </instance>
 </lexelt>
 </corpus>

Input FEATURE_REGEX file => test.regex

 /\s(<[^>]*>)*<head>HARD<\/head>(<[^>]*>)*\s/ @name = <head>HARD</head>
 /\s(<[^>]*>)*to(<[^>]*>)*\s/ @name = to
 /\s(<[^>]*>)*defeat(<[^>]*>)*\s/ @name = defeat
 /\s(<[^>]*>)*believe(<[^>]*>)*\s/ @name = believe
 /\s(<[^>]*>)*credit(<[^>]*>)*\s/ @name = credit
 /\s(<[^>]*>)*life(<[^>]*>)*\s/ @name = life
 /\s(<[^>]*>)*facts(<[^>]*>)*\s/ @name = facts
 /\s(<[^>]*>)*kill(<[^>]*>)*\s/ @name = kill

Input Sparse Word Vectors => test.sparse_wordvec

 8 10 41
 1 4.977 4 7.813 8 9.114 10 1.431
 1 5.944 2 5.728 3 2.978 5 5.604 7 9.444 9 3.680
 2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609
 2 9.147 4 3.086 5 0.325 9 1.456
 1 0.741 4 3.450 6 2.363
 1 9.549 2 3.921 3 8.131 4 4.301 5 9.059 6 8.607 10 1.138
 2 8.203 4 7.297 5 1.095 7 4.362 8 2.963 10 7.264
 2 4.296 4 9.802 7 9.268 9 8.856 10 9.723

Command =>

 order2vec.pl --format f9.4 test.sval2 test.sparse_wordvec test.regex

Output =>

 5 10 45
 1   3.8015 2   4.2440 3   2.8733 4   4.0412 5   3.5543 7   6.5642 8   1.5190 9   4.5842 10   1.8590
 1   2.7302 2   6.0055 3   0.7445 4   3.4962 5   1.5635 7   2.3610 8   2.2785 9   1.6480 10   0.3578
 1   5.0890 2   1.3070 3   2.7103 4   5.1880 5   3.0197 6   3.6567 8   3.0380 10   0.8563
 1   8.3473 2   4.5233 3   6.4133 4   2.8673 5   7.9073 6   5.7380 7   3.1480 9   1.2267 10   0.7587
 1   4.5258 2   3.9300 3   2.3478 4   3.8102 5   3.5603 6   1.8283 7   3.8750 8   2.0128 9   1.2267 10   1.6388

Explanation =>

First instance <hard-a.sjm-098_3:> contains features

 '<head>HARD</head>' once,
 'to' thrice,
 'defeat' once,
 'kill' once.

Hence, the context vector of instance <hard-a.sjm-098_3:> shown on Line 2 on STDOUT is an average of sparse word vectors ->

 [1 4.977 4 7.813 8 9.114 10 1.431]
 3 * [1 5.944 2 5.728 3 2.978 5 5.604 7 9.444 9 3.680]
 [2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609]
 [2 4.296 4 9.802 7 9.268 9 8.856 10 9.723]

OR

 [1 4.977 4 7.813 8 9.114 10 1.431]
 [1 17.832 2 17.184 3 8.934 5 16.812 7 28.332 9 11.04]
 [2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609]
 [2 4.296 4 9.802 7 9.268 9 8.856 10 9.723]

The Sum of above vectors is a sparse vector =>

 [1 22.809 2 25.464 3 17.24 4 24.247 5 21.326 7 39.385 8 9.114 9 27.505 10 11.154]

And the average, obtained by dividing the sum by the 6 feature occurrences in the context (1 + 3 + 1 + 1), is =>

 [1 3.8015 2 4.2440 3 2.8733 4 4.0412 5 3.5543 7 6.5642 8 1.5190 9 4.5842 10 1.8590]

Similarly, all context vectors are computed by averaging the word vectors of features that match in the context.
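The arithmetic of the explanation above can be checked with a short Python sketch. It weights each word vector by the number of times its feature occurs in the context and divides by the total number of occurrences; the function names parse_sparse and average_context_vector are hypothetical and are not taken from order2vec.pl.

```python
def parse_sparse(line):
    """Parse 'index value index value ...' into an {index: value} dict."""
    tokens = line.split()
    return {int(tokens[i]): float(tokens[i + 1])
            for i in range(0, len(tokens), 2)}

def average_context_vector(word_vectors, counts):
    """Occurrence-weighted average of sparse word vectors."""
    occurrences = sum(counts)
    if occurrences == 0:
        return {}                      # no feature matched in the context
    total = {}
    for vec, count in zip(word_vectors, counts):
        for index, value in vec.items():
            total[index] = total.get(index, 0.0) + count * value
    return {index: total[index] / occurrences for index in sorted(total)}

head = parse_sparse("1 4.977 4 7.813 8 9.114 10 1.431")
to = parse_sparse("1 5.944 2 5.728 3 2.978 5 5.604 7 9.444 9 3.680")
defeat = parse_sparse("2 3.984 3 8.306 4 6.632 5 4.514 7 1.785 9 7.609")
kill = parse_sparse("2 4.296 4 9.802 7 9.268 9 8.856 10 9.723")

# '<head>HARD</head>' once, 'to' thrice, 'defeat' once, 'kill' once
context_vector = average_context_vector([head, to, defeat, kill], [1, 3, 1, 1])
print(" ".join("%d %.4f" % (i, v) for i, v in context_vector.items()))
```

The printed values should agree, to four decimal places, with the average shown above for the first instance.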

Sample Dense Output

In the above example, if WORDVEC is dense => test.dense_wordvec

   8 10
   4.9770   0.0000   0.0000   7.8130   0.0000   0.0000   0.0000   9.1140   0.0000   1.4310
   5.9440   5.7280   2.9780   0.0000   5.6040   0.0000   9.4440   0.0000   3.6800   0.0000
   0.0000   3.9840   8.3060   6.6320   4.5140   0.0000   1.7850   0.0000   7.6090   0.0000
   0.0000   9.1470   0.0000   3.0860   0.3250   0.0000   0.0000   0.0000   1.4560   0.0000
   0.7410   0.0000   0.0000   3.4500   0.0000   2.3630   0.0000   0.0000   0.0000   0.0000
   9.5490   3.9210   8.1310   4.3010   9.0590   8.6070   0.0000   0.0000   0.0000   1.1380
   0.0000   8.2030   0.0000   7.2970   1.0950   0.0000   4.3620   2.9630   0.0000   7.2640
   0.0000   4.2960   0.0000   9.8020   0.0000   0.0000   9.2680   0.0000   8.8560   9.7230

Command =>

 order2vec.pl --format f9.4 --dense test.sval2 test.dense_wordvec test.regex

Output =>

   5 10
   3.8015   4.2440   2.8733   4.0412   3.5543   0.0000   6.5642   1.5190   4.5842   1.8590
   2.7302   6.0055   0.7445   3.4962   1.5635   0.0000   2.3610   2.2785   1.6480   0.3578
   5.0890   1.3070   2.7103   5.1880   3.0197   3.6567   0.0000   3.0380   0.0000   0.8563
   8.3473   4.5233   6.4133   2.8673   7.9073   5.7380   3.1480   0.0000   1.2267   0.7587
   4.5258   3.9300   2.3478   3.8102   3.5603   1.8283   3.8750   2.0128   1.2267   1.6388

These are the same context vectors as in the Sample Sparse Output section, written in dense format because --dense is ON.
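The relation between the sparse and dense samples can be illustrated with a small Python sketch that expands one sparse row into a fixed-width dense row. It assumes that sparse column indices are 1-based, as the samples suggest, and that f9.4 corresponds to the printf conversion "%9.4f"; the function name sparse_to_dense is hypothetical.

```python
def sparse_to_dense(line, num_columns, cell="%9.4f"):
    """Expand 'index value ...' into a dense row of fixed-width cells."""
    tokens = line.split()
    row = [0.0] * num_columns
    for i in range(0, len(tokens), 2):
        # Sparse indices are assumed to start at 1.
        row[int(tokens[i]) - 1] = float(tokens[i + 1])
    return "".join(cell % value for value in row)

# First row of test.sparse_wordvec, 10 columns
print(sparse_to_dense("1 4.977 4 7.813 8 9.114 10 1.431", 10))
```

For this input the result has the same ten fixed-width entries as the first data row of test.dense_wordvec.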

Note that if --dense is ON, --format must be used to specify the format of the dense word vectors.

SYSTEM REQUIREMENTS

PDL - http://search.cpan.org/dist/PDL/

AUTHORS

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

 Amruta Purandare, University of Pittsburgh

 Mahesh Joshi, Carnegie-Mellon University

COPYRIGHT

Copyright (c) 2002-2008, Ted Pedersen, Amruta Purandare, Mahesh Joshi

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.