The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

bitsimat.pl - Build a similarity matrix from binary context vectors

SYNOPSIS

 bitsimat.pl [OPTIONS] VECTOR

The input file represents 5 vectors, each with 4 possible features. The format of the input file is sparse, so if a feature has no value it is not listed.

 cat input

Output =>

 5 4 12
 1 1 3 1 4 1
 2 1 3 1
 1 1 3 1 4 1
 2 1 3 1
 2 1 3 1

Compute the pairwise similarities between all 5 binary vectors and display them in a 5x5 matrix.

 bitsimat.pl input --format f4.2

Output =>

 5 25
 1 1.00 2 0.41 3 1.00 4 0.41 5 0.41
 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
 1 1.00 2 0.41 3 1.00 4 0.41 5 0.41
 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00
 1 0.41 2 1.00 3 0.41 4 1.00 5 1.00

Type bitsimat.pl --help for a quick summary of options

DESCRIPTION

Constructs a similarity matrix for the given binary context vectors. A similarity matrix shows the pairwise similarities between all the different vectors. Vectors are represented in an N x M matrix, where N is the number of vectors and M is the number of features. All NxN combinations of vector pairs will be measured for similarity and stored in a matrix.

INPUT

Required Arguments:

VECTOR

A binary vector file as created by vector constructor programs in Toolkit/vector.

Sparse Format (default)

By default, VECTOR is assumed to be in sparse format.

For sparse vectors, the first line of the VECTOR file should show 3 numbers separated in spaces as -

 N M NNZ

where

 N = Number of vectors/rows 
 M = Number of dimensions/columns
 NNZ = Total number of non-zero values

Each line after this line should show a single sparse vector on each line. A sparse vector is a list of pairs of numbers separated by space such that the first number in a pair is the column index of a non-zero value in the vector and the second number is the non-zero value itself corresponding to that index. bitsimat treats all non-zero values as 1s in the bit vectors.

Column indices start with 1.

Sample Sparse Input

 9 12 19
 4 1 7 1
 1 1
 1 1 4 1 8 1 11 1
 1 1 7 1
 4 1
 5 1 10 1
 1 1
 5 1 8 1 12 1
 1 1 2 1 8 1

Explanation :

  1. The first line shows that there are total 9 sparse vectors represented in 12 dimensions. Hence, bitsimat will consider next 9 lines in the VECTOR file as 9 sparse vectors and will allow the column indices only in the range [1-12]. Here, the total non-zero values are 19.

  2. Note that, each line shows a single sparse vector as a list of space separated 'index value' pairs. e.g. 2nd line shows that the 1st vector is non-zero (with value 1) at column indices 4 and 7. 3rd line shows that the 2nd sparse vector is non-zero only at index 1. And so on ...

Dense Format

If VECTOR uses dense format, switch --dense should be selected.

The 1st line in the dense VECTOR file should show -

N M

for N vectors represented using M columns.

Each line thereafter should show a single dense vector. All dense vectors should have M space separated numbers that indicate the values at the corresponding columns in the vector.

All non-zero values are treated as 1s in the binary vectors.

Sample Dense Input

 9 12
 0 0 0 1 0 0 1 0 0 0 0 0
 1 0 0 0 0 0 0 0 0 0 0 0
 1 0 0 1 0 0 0 1 0 0 1 0
 1 0 0 0 0 0 1 0 0 0 0 0
 0 0 0 1 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 0 0 1 0 0
 1 0 0 0 0 0 0 0 0 0 0 0
 0 0 0 0 1 0 0 1 0 0 0 1
 1 1 0 0 0 0 0 1 0 0 0 0

shows same vectors as shown in Sample Sparse Input in dense format.

VECTOR file could also optionally show the KEY file name on the first line. If the VECTOR file starts with the <keyfile> name on the 1st line, the above information should begin from the 2nd line onwards.

Optional Arguments:

--dense

This switch should be selected if the given VECTORs are in dense format. This will also create the output similarity matrix in the dense format. By default, sparse format is assumed for both input vectors and output similarity matrix.

--measure SIM

Specify the similarity measure to be used to construct the similarity matrix.

Possible option values for --sim are

        match                   Match Coefficient
        dice                    Dice Coefficient
        overlap                 Overlap Coefficient
        jaccard                 Jaccard Coefficient
        cosine                  Cosine Coefficient

The following table shows how the similarity values are computed by each of these measures.

        match                   |intersection(X,Y)|
        dice                    2*|intersection(X,Y)|/(|X|+|Y|)
        overlap                 |intersection(X,Y)|/(min(|X|,|Y|))
        jaccard                 |intersection(X,Y)|/|union(X,Y)|
        cosine                  |intersection(X,Y)|/sqrt(|X|*|Y|)

where X and Y represent any two binary vectors |X| shows the norm or number of bits set in vector X

--format FORM

Specifies numeric format for representing output similarity values.

Possible values of FORM are

 iN   -> integer format allocating total N bytes/digits for each entry

 fN.M -> floating point format allocating total N bytes/digits for each entry of which last M digits show the fractional part.

For matching coefficient, default is i8 and for other measures, default is f16.10.

Other Options :

--help

Displays this message.

--version

Displays the version information.

OUTPUT

If the input VECTORs are in sparse format (default), output is also created in sparse format, while, if the input vectors are in dense format, output is also in dense format.

Sparse Output (Default)

The 1st line in the sparse output shows 2 space separated numbers -

 N NNZ1

where

 N = Number of vectors (same as the N in the VECTOR file)
 NNZ1 = Number of non-zero entries in the output similarity matrix

Each line after this will show the corresponding row of the similarity matrix in sparse format. Specifically, each i'th line after the above line shows the list of 'j SIM' pairs separated by space such that SIM is the non-zero similarity value between the i'th and j'th vectors in the given VECTOR file.

Note that, only those pairs are listed in which the SIM value is non-zero.

Sample Sparse Output

 9 43
 1 1.000 3 0.354 4 0.500 5 0.707
 2 1.000 3 0.500 4 0.707 7 1.000 9 0.577
 1 0.354 2 0.500 3 1.000 4 0.354 5 0.500 7 0.500 8 0.289 9 0.577
 1 0.500 2 0.707 3 0.354 4 1.000 7 0.707 9 0.408
 1 0.707 3 0.500 5 1.000
 6 1.000 8 0.408
 2 1.000 3 0.500 4 0.707 7 1.000 9 0.577
 3 0.289 6 0.408 8 1.000 9 0.333
 2 0.577 3 0.577 4 0.408 7 0.577 8 0.333 9 1.000

Shows the actual full similarity matrix -

 9
 1.000 0.000 0.354 0.500 0.707 0.000 0.000 0.000 0.000
 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
 0.354 0.500 1.000 0.354 0.500 0.000 0.500 0.289 0.577
 0.500 0.707 0.354 1.000 0.000 0.000 0.707 0.000 0.408
 0.707 0.000 0.500 0.000 1.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.408 0.000
 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
 0.000 0.000 0.289 0.000 0.000 0.408 0.000 1.000 0.333
 0.000 0.577 0.577 0.408 0.000 0.000 0.577 0.333 1.000

Both these matrices show the pairwise cosine similarities among the same 9 vectors shown in the Sample Sparse and Dense Input sections.

Dense Output

If --dense is selected, output is created in dense format and shows all similarity values including 0s.

Dense output shows the number of vectors (N) on the first line. Each i'th line after this shows N space separated numbers such that a number at column index j shows the pairwise similarity among the i'th and j'th vectors.

Sample Dense Output

 9
 1.000 0.000 0.354 0.500 0.707 0.000 0.000 0.000 0.000
 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
 0.354 0.500 1.000 0.354 0.500 0.000 0.500 0.289 0.577
 0.500 0.707 0.354 1.000 0.000 0.000 0.707 0.000 0.408
 0.707 0.000 0.500 0.000 1.000 0.000 0.000 0.000 0.000
 0.000 0.000 0.000 0.000 0.000 1.000 0.000 0.408 0.000
 0.000 1.000 0.500 0.707 0.000 0.000 1.000 0.000 0.577
 0.000 0.000 0.289 0.000 0.000 0.408 0.000 1.000 0.333
 0.000 0.577 0.577 0.408 0.000 0.000 0.577 0.333 1.000

Shows the pairwise similarities among the 9 vectors shown in the Sample Dense Input section.

Note that the similarity matrix will always be square and symmetric.

If the first line of the input VECTOR file shows the <keyfile> tag, bitsimat.pl will display the same <keyfile> tag on the first line of output.

SYSTEM REQUIREMENTS

bitsimat is dependent on the following CPAN modules :

Bit::Vector - http://search.cpan.org/dist/Bit-Vector/
Set::Scalar - http://search.cpan.org/dist/Set-Scalar/

AUTHORS

 Amruta Purandare, University of Pittsburgh

 Ted Pedersen, University of Minnesota, Duluth
 tpederse at d.umn.edu

COPYRIGHT

Copyright (c) 2002-2008, Amruta Purandare and Ted Pedersen

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.