NAME

umls-targetword-senserelate.pl - This program performs target word disambiguation and determines the correct sense of an ambiguous term using semantic similarity measures.

SYNOPSIS

This program assigns senses from the UMLS or a given sense file to ambiguous terms using semantic similarity or relatedness measures from the UMLS::Similarity package.

USAGE

Usage: umls-targetword-senserelate.pl [OPTIONS] INPUTFILE

OUTPUT

The output files will be stored in the directory "log" or the directory defined by the --log option.

Required Options

INPUTFILE

Input file in either sval2 or plain format, as indicated by the --sval2 or --plain option respectively. The --plain option is the default.

General Options:

--plain

The input format is in plain text. This is the default format.

In plain format, each line of the text file contains a single context in which the ambiguous word is identified by:

<head item="target word" instance="id" sense="sense">word</head>.

For example:

Paul was named <head item="art" instance="art.30002" sense="art">Art</head> magazine's top collector.

The sense information is optional. If you do not have the sense information, either leave it blank, such as:

Paul was named <head item="art" instance="art.30002" sense="">Art</head> magazine's top collector.

or do not include the sense tag such as:

Paul was named <head item="art" instance="art.30002">Art</head> magazine's top collector.

If the sense information is not there, then you cannot use the --key option.

We also added a candidate attribute, so if you would like to specify the possible senses for the target word in the instance, you can include a candidate attribute in which the candidate senses are separated by commas. For example:

<head item="target word" instance="id" sense="sense" candidate="s1,s2">word</head>.

You must use the --candidates option to actually use the candidate information; otherwise it will be ignored.
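The plain format described above can be read with a small regular expression. The following Python sketch is only an illustration and is not how the program itself parses its input; the function name parse_plain_line and the returned dictionary layout are hypothetical.

```python
import re

# Matches <head ...>word</head>; attribute order is not assumed.
HEAD_RE = re.compile(r'<head\s+([^>]*)>(.*?)</head>')
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')

def parse_plain_line(line):
    """Return the target word and its attributes from one plain-format line, or None."""
    match = HEAD_RE.search(line)
    if match is None:
        return None
    attrs = dict(ATTR_RE.findall(match.group(1)))
    # The candidate attribute is optional; senses are comma separated.
    candidates = [c for c in attrs.get("candidate", "").split(",") if c]
    return {
        "word": match.group(2),
        "item": attrs.get("item"),
        "instance": attrs.get("instance"),
        "sense": attrs.get("sense") or None,  # blank or missing sense becomes None
        "candidates": candidates,
    }

line = ('Paul was named <head item="art" instance="art.30002" sense="">'
        "Art</head> magazine's top collector.")
info = parse_plain_line(line)
```

Here a blank sense attribute and a missing sense attribute are treated the same way, which mirrors the two optional forms shown above.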

--sval2

The input is in sval2 format.

--mmxml

The input is in MetaMap XML (mmxml) format, in which each target word is identified by a <Target></Target> tag similar to the <Token></Token> tags.

We have a conversion program in the converters/ directory, plain2mm-xml.pl, which converts plain text into what we refer to as mm-xml tagged text.

--candidates

This option uses the candidate senses as identified by metamap for the target word. This option can only be used with the --mmxml option.

--cuis

This option uses the CUIs tagged by MetaMap, not the terms.

--senses DIR|File

This is the directory that contains the candidate sense file for each target word you are going to disambiguate, or just the file itself.

The file for each target word contains the possible senses of that target word.

This may be temporary, but right now this is how I have it because the possible senses often change depending on the version of the UMLS that you are using. I felt this allowed the most flexibility.

The naming convention for this is a file called: <target word>.choices

The format for this file is:

    <tag>|<target word name>|semantic type|CUI

This format is based on the choice files in the NLM-WSD dataset, which we use for our experiments. If you are using the NLM-WSD dataset, you can download these choice files from NLM's site. Both the 1999 tagset and the 2007 tagset are available.
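As an illustration of the choices file format, the following Python sketch splits each line into its four pipe-separated fields. The function name parse_choices and the sample line are hypothetical; the sample values are placeholders and are not taken from the NLM-WSD dataset.

```python
def parse_choices(lines):
    """Parse lines of the form <tag>|<target word name>|semantic type|CUI."""
    senses = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # ignore blank lines
        tag, name, semantic_type, cui = line.split("|", 3)
        senses.append({
            "tag": tag,
            "name": name,
            "semantic_type": semantic_type,
            "cui": cui,
        })
    return senses

# Placeholder content, for illustration only.
sample = ["M1|Example Sense|Example Semantic Type|C0000001"]
senses = parse_choices(sample)
```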

--log DIR

Directory in which the output files will be stored. Default: log

--compound

Use the compounds in the input text. For the plain and sval2 formats these are indicated by an underscore. For example:

    white_house
    blood_pressure

--key

Stores the gold standard information in the <target word>.key file to be used in the evaluation programs. This file is stored in the log directory.

--window NUMBER

The window in which to obtain the context surrounding the ambiguous term.

Default: 2
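The following Python sketch shows one way a context window could be taken around a target word. It assumes that the window size counts tokens on each side of the target and that the text is split on whitespace; both are assumptions made for illustration, and the program's own windowing may differ. The function name context_window is hypothetical.

```python
def context_window(tokens, target_index, window=2):
    """Return up to `window` tokens on each side of the target (assumed semantics)."""
    left = tokens[max(0, target_index - window):target_index]
    right = tokens[target_index + 1:target_index + 1 + window]
    return left + right

tokens = "Paul was named Art magazine's top collector.".split()
ctx = context_window(tokens, tokens.index("Art"))
```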

--aggregator AGGREGATOR

The aggregator method to be used to combine the similarity scores. The available aggregators are:

  1. max - the maximum similarity score
  2. avg - the average similarity score (default)
  3. orness - \frac{1}{n-1} \sum_{i=1}^{n} (n-i) w_{i}
  4. andness - 1 - orness
  5. disp
  6. closeness
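The orness formula above can be evaluated directly. The following Python sketch assumes that the w_i are aggregation weights that sum to 1 and are indexed from 1 to n; how the program derives these weights is not stated here, so the sketch only shows the arithmetic. The function names are hypothetical.

```python
def orness(weights):
    """Compute 1/(n-1) * sum_{i=1..n} (n - i) * w_i for a list of weights."""
    n = len(weights)
    if n < 2:
        raise ValueError("orness needs at least two weights")
    return sum((n - i) * w for i, w in enumerate(weights, start=1)) / (n - 1)

def andness(weights):
    """andness is defined above as 1 - orness."""
    return 1.0 - orness(weights)

all_on_first = orness([1.0, 0.0, 0.0])  # all weight on the first position
all_on_last = orness([0.0, 0.0, 1.0])   # all weight on the last position
```

With all of the weight on the first position the value is 1, and with all of the weight on the last position it is 0; uniform weights give 0.5.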

--restrict

This restricts the window to context terms that map to the UMLS, rather than any term.

--measure MEASURE

Use the MEASURE module to calculate the semantic similarity. The available measures are:

  1. Leacock and Chodorow (1998) referred to as lch
  2. Wu and Palmer (1994) referred to as wup
  3. The basic path measure referred to as path
  4. Rada, et al. (1989) referred to as cdist
  5. Nguyen and Al-Mubaid (2006) referred to as nam
  6. Resnik (1996) referred to as res
  7. Lin (1998) referred to as lin
  8. Jiang and Conrath (1997) referred to as jcn
  9. The vector measure referred to as vector

--weight

Weight the scores based on the distance the content term is from the target word. This option can currently only be used with the --window option.

--stoplist FILE

A file containing a list of words to be excluded. This is used in the UMLS::SenseRelate::TargetWord module as well as the vector and lesk measures in the UMLS::Similarity package. The required format is one stopword per line, with each word given in regular expression format.

For example:

  /\b[a-zA-Z]\b/
  /\b[aA]board\b/
  /\b[aA]bout\b/
  /\b[aA]bove\b/
  /\b[aA]cross\b/
  /\b[aA]fter\b/
  /\b[aA]gain\b/

The sample file, stoplist-nsp.regex, is under the samples directory. We might change this to require two different stoplists in the future; one for the senserelate program and the other for the relatedness measures.
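To show how a stoplist in this format might be applied, the following Python sketch strips the surrounding slashes from each entry and combines the patterns into one regular expression. It assumes the patterns are simple enough to be valid in Python's re module (as the ones above are) and that a token is dropped when the whole token matches; the program's own matching may differ. The function name load_stoplist is hypothetical.

```python
import re

def load_stoplist(lines):
    """Combine /regex/ stoplist entries into a single compiled pattern."""
    patterns = []
    for raw in lines:
        entry = raw.strip()
        if not entry:
            continue
        # Drop the surrounding /.../ delimiters of the Perl-style regex.
        if len(entry) >= 2 and entry.startswith("/") and entry.endswith("/"):
            entry = entry[1:-1]
        patterns.append(entry)
    return re.compile("|".join(f"(?:{p})" for p in patterns))

stop_re = load_stoplist([r"/\b[a-zA-Z]\b/", r"/\b[aA]bout\b/", r"/\b[aA]fter\b/"])
words = ["walk", "about", "a", "After", "dinner"]
kept = [w for w in words if not stop_re.fullmatch(w)]
```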

--trace FILE

This stores the trace information in FILE for debugging purposes.

--loadcache FILE

Preloads cache. The expected format is:

    score<>cui1<>cui2
    score<>cui3<>cui4
    ...

--getcache FILE

Outputs cache to FILE after run.
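The cache format is simple enough to read and write by hand. The following Python sketch parses score<>cui1<>cui2 lines into a dictionary and writes them back out; the function names and the CUIs in the sample are placeholders chosen for illustration, and the sketch does not reflect how the program stores its cache internally.

```python
def parse_cache(lines):
    """Map (cui1, cui2) pairs to scores from score<>cui1<>cui2 lines."""
    cache = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        score, cui1, cui2 = line.split("<>")
        cache[(cui1, cui2)] = float(score)
    return cache

def format_cache(cache):
    """Produce score<>cui1<>cui2 lines from a cache dictionary."""
    return [f"{score}<>{cui1}<>{cui2}" for (cui1, cui2), score in cache.items()]

# Placeholder CUIs and scores, for illustration only.
cache = parse_cache(["0.5<>C0000001<>C0000002", "0.25<>C0000003<>C0000004"])
lines_out = format_cache(cache)
```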

--version

Displays the version information.

--help

Displays the help information.

UMLS-Interface General Options:

--config FILE

This is the configuration file. There are six configuration options that can be used depending on which measure you are using. The path, wup, lch, lin, jcn and res measures require the SAB and REL options to be set while the vector and lesk measures require the SABDEF and RELDEF options.

The SAB and REL options are used to determine which sources and relations the path information is to be obtained from. The format of the configuration file is as follows:

 SAB :: <include|exclude> <source1, source2, ... sourceN>
 REL :: <include|exclude> <relation1, relation2, ... relationN>

For example, if we wanted to use the MSH vocabulary with only the RB/RN relations, the configuration file would be:

 SAB :: include MSH
 REL :: include RB, RN

or

 SAB :: include MSH
 REL :: exclude PAR, CHD

The SABDEF and RELDEF options are used to determine the sources and relations the extended definition is to be obtained from. We call the definition used by the measure the extended definition because it may include definitions from related concepts.

The format of the configuration file is as follows:

 SABDEF :: <include|exclude> <source1, source2, ... sourceN>
 RELDEF :: <include|exclude> <relation1, relation2, ... relationN>

The possible relations that can be included in RELDEF are:

  1. all of the possible relations in MRREL such as PAR, CHD, ...
  2. CUI which refers to the concept's definition
  3. ST which refers to the concept's semantic type's definition
  4. TERM which refers to the concept's associated terms

For example, if we wanted to use the definitions from the MSH vocabulary and we only wanted the definition of the CUI and the definitions of the CUI's SIB relation, the configuration file would be:

 SABDEF :: include MSH
 RELDEF :: include CUI, SIB

Note: RELDEF takes any of the MRREL relations and two special 'relations':

      1. CUI which refers to the CUI's definition

      2. TERM which refers to the terms associated with the CUI

If you go to the configuration file directory, there will be example configuration files for the different runs that you have performed.

For more information about the configuration options (including the RELA and RELADEF options) please see the README.
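The KEY :: <include|exclude> <list> lines described above can be read with a few string operations. The following Python sketch is an illustration only and is not the parser used by UMLS::Interface; the function name parse_config and the returned structure are hypothetical, and only the line shape shown in this section is handled.

```python
def parse_config(lines):
    """Parse lines such as 'SAB :: include MSH' into {key: (mode, [items])}."""
    config = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, _, rest = line.partition("::")
        mode, _, items = rest.strip().partition(" ")
        if mode not in ("include", "exclude"):
            raise ValueError(f"expected include or exclude, got {mode!r}")
        values = [item.strip() for item in items.split(",") if item.strip()]
        config[key.strip()] = (mode, values)
    return config

cfg = parse_config([
    " SAB :: include MSH",
    " REL :: include RB, RN",
])
```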

--realtime

With this option, the program does not create a database of the path information for all of the concepts in the set of sources and relations specified in the config file; instead it obtains the information for just the input concept.

--forcerun

This option will bypass any command prompts such as asking if you would like to continue with the index creation.

--loadcache FILE

FILE containing similarity scores of cui pairs in the following format:

  score<>CUI1<>CUI2

--getcache FILE

Outputs the cache into FILE once the program has finished.

UMLS-Interface Debug Options:

--debug

Sets the UMLS-Interface debug flag on for testing.

UMLS-Interface Database Options:

--username STRING

Username is required to access the umls database on mysql

--password STRING

Password is required to access the umls database on mysql

--hostname STRING

Hostname where mysql is located. DEFAULT: localhost

--database STRING

Database containing the UMLS. DEFAULT: umls

UMLS-Similarity IC Measure Options:

--icpropagation FILE

FILE containing the propagation counts of the CUIs. This file must be in the following format:

    CUI<>probability

where probability is the probability of the concept occurring.

See create-icpropagation.pl for more information.
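As an illustration of how such a file might be used, the following Python sketch reads CUI<>probability lines and converts each probability to an information content value using IC(c) = -log P(c), the usual definition behind measures such as res, lin and jcn. The natural logarithm, the function name load_ic and the placeholder CUIs are assumptions made for this sketch; the program's own computation may differ.

```python
import math

def load_ic(lines):
    """Map each CUI to -log(probability) from CUI<>probability lines."""
    ic = {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        cui, prob = line.split("<>")
        p = float(prob)
        if not 0.0 < p <= 1.0:
            raise ValueError(f"probability out of range for {cui}: {p}")
        ic[cui] = -math.log(p)  # assumption: natural log
    return ic

# Placeholder CUIs and probabilities, for illustration only.
ic = load_ic(["C0000001<>1.0", "C0000002<>0.5"])
```

A concept with probability 1 carries no information, and rarer concepts receive larger values.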

--intrinsic [seco|sanchez]

Uses the intrinsic information content of the CUIs as defined by Sanchez and Batet 2011 or Seco, et al. 2004.

UMLS-Similarity Vector Measure Options:

--vectormatrix FILE

This is the matrix file that contains the vector information to use with the vector measure.

If you do not want to use the default, this file is generated by the vector-input.pl program. An example of this file can be found in the samples/ directory and is called matrix.

--vectorindex FILE

This is the index file that contains the vector information to use with the vector measure.

If you do not want to use the default, this file is generated by the vector-input.pl program. An example of this file can be found in the samples/ directory and is called index.

--debugfile FILE

This prints the vector information to file, FILE, for debugging purposes.

UMLS-Similarity vector and lesk Options:

--vectorstoplist FILE

A file containing a list of words to be excluded from the vector measure calculation. This is the same format as the --stoplist option.

--leskstoplist FILE

A file containing a list of words to be excluded from the lesk measure calculation. This is the same format as the --stoplist option.

--dictfile FILE

This is a dictionary file for the vector or lesk measure. It contains the 'definitions' of a concept or term, which are used rather than the definitions from the UMLS. If you would like to use the dictfile as an augmentation of the UMLS definitions, then use the --config option in conjunction with the --dictfile option.

The expected format for the --dictfile file is:

 CUI: <definition>
 CUI: <definition>
 TERM: <definition> 
 TERM: <definition>
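The following Python sketch reads a file in this format into two dictionaries, one for CUIs and one for terms. It assumes that CUI and TERM above are placeholders for an actual concept identifier or term, and that a CUI is the letter C followed by seven digits; the function name parse_dictfile and the sample definitions are made up for illustration.

```python
import re

CUI_RE = re.compile(r"^C\d{7}$")  # assumption: 'C' followed by seven digits

def parse_dictfile(lines):
    """Split 'key: definition' lines into CUI and term definition dictionaries."""
    cui_defs, term_defs = {}, {}
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        key, sep, definition = line.partition(":")
        if not sep:
            continue  # skip lines without a 'key: definition' shape
        key, definition = key.strip(), definition.strip()
        target = cui_defs if CUI_RE.match(key) else term_defs
        target[key] = definition
    return cui_defs, term_defs

# Made-up entries, for illustration only.
cui_defs, term_defs = parse_dictfile([
    "C0000001: a placeholder concept definition",
    "hand: a placeholder term definition",
])
```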

There are three ways to use the --dictfile option:

1. No --dictfile - which will use the UMLS definitions

  umls-targetword-senserelate.pl --measure lesk INPUTFILE

2. --dictfile - which will just use the dictfile definitions

  umls-targetword-senserelate.pl --measure lesk --dictfile samples/dictfile INPUTFILE

3. --dictfile + --config - which will use both the UMLS and dictfile definitions

  umls-targetword-senserelate.pl --measure lesk --dictfile samples/dictfile --config configuration INPUTFILE

Keep in mind that when using this file with the --config option, if one of the CUIs or terms that you are obtaining the similarity for does not exist in the file, the vector will be empty, which will lead to strange similarity scores.

An example of this file can be found in the samples/ directory and is called dictfile.

--defraw

This is a flag for the vector measures. By default the definitions used are 'cleaned'. If the --defraw flag is set, they will not be cleaned.

--stem

This is a flag for the vector and lesk measures. If the --stem flag is set, definition words are stemmed using the Lingua::Stem::En module.

--compoundfile FILE

This is a compound word file for the vector and lesk measures. It contains the compound words that we want to treat as a single word when we compare relatedness. Each compound word is on its own line in the file, and the words within a compound are separated by a space. When using this option with vector, make sure the vectormatrix and vectorindex files are based on a corpus preprocessed by replacing the compound words using the Text-NSP package. An example is under /sample/compoundword.txt

SYSTEM REQUIREMENTS

  • Perl (version 5.8.5 or better) - http://www.perl.org

  • UMLS::Interface - http://search.cpan.org/dist/UMLS-Interface

  • UMLS::Similarity - http://search.cpan.org/dist/UMLS-Similarity

CONTACT US

  If you have any trouble installing and using UMLS-Similarity, 
  please contact us via the users mailing list :
    
      umls-similarity@yahoogroups.com
     
  You can join this group by going to:
    
      http://tech.groups.yahoo.com/group/umls-similarity/
     
  You may also contact us directly if you prefer :
    
      Bridget T. McInnes: bthomson at umn.edu 

      Ted Pedersen : tpederse at d.umn.edu

AUTHOR

 Bridget T. McInnes, University of Minnesota

COPYRIGHT

Copyright (c) 2010,

 Bridget T. McInnes, University of Minnesota Twin Cities
 bthomson at umn.edu
    
 Ted Pedersen, University of Minnesota Duluth
 tpederse at d.umn.edu
 
 Serguei Pakhomov, University of Minnesota Twin Cities
 pakh0002 at umn.edu

 Ying Liu, University of Minnesota Twin Cities
 liux0395 at umn.edu

This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 2 of the License, or (at your option) any later version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with this program; if not, write to:

 The Free Software Foundation, Inc.,
 59 Temple Place - Suite 330,
 Boston, MA  02111-1307, USA.