NAME

Lingua::Wordnet - Perl extension for accessing and manipulating Wordnet databases.

SYNOPSIS

  use Lingua::Wordnet;
  use Lingua::Wordnet::Analysis;

  $wn = new Lingua::Wordnet;
  $wn->unlock();
  $synset = $wn->lookup_synset("canary","n",4);
  $synset2 = $wn->lookup_synset("small","a",1);
  $synset->add_attributes($synset2);
  $synset->write();
  print $synset, "\n";
  $wn->close();

DESCRIPTION

Wordnet is a lexical reference system inspired by current psycholinguistic theories of human lexical memory. This module allows access to the Wordnet lexicon from Perl applications, as well as manipulation and extension of the lexicon. Lingua::Wordnet::Analysis provides numerous high-level extensions to the system.

Version 0.1 was a complete rewrite of the module in pure Perl, whereas the old module embedded the Wordnet C API functions. In order to use the module, the database files must first be converted to Berkeley DB files using the 'scripts/convertdb.pl' file. Why did I do that?

- The Wordnet API consists mostly of searching and text manipulation functions, something Perl is, um .. well suited for.

- Data retrieval is faster with hash lookups than with binary searches.

- Converting the databases allows optional manipulation of the data, including adding and editing synsets, as well as extension of the system to allow for more pointer types (including noun attributes and 'functions').

- Developers can use the Wordnet databases without needing to compile the Wordnet API and browsers, allowing Wordnet to run on any Perl/Berkeley DB-capable platform (the database files are still needed for the conversion, of course).

- A pure Perl implementation allows easier debugging and modification for people who want to experiment or alter the processing.

With that said, there are actually two modules. Lingua::Wordnet mirrors the basic Wordnet API functions for searching and retrieving data, and adds functions for adding, editing, and deleting synsets. Lingua::Wordnet::Analysis brings the interface up a level, allowing questions like "is 'yellow' an attribute of any 'birds'?" and taking care of the recursive analysis.

Lingua::Wordnet functions

$wn = new Lingua::Wordnet( [DATA_DIR] );

Creates and assigns a new object of class Lingua::Wordnet. DATA_DIR is optional, and indicates the location of the index and data files.

$wn->unlock()

Allows files to be written to when data is added/edited/deleted.

$wn->lock()

Locks files to prohibit write permissions (default).

$wn->grep(TEXT)

Returns an array of compound words matching TEXT.

@synsets = $wn->lookup_synset( TEXT, POS [,NUMBER] )

Assigns a list of synset objects (Lingua::Wordnet::Synset) matching TEXT within POS, where POS is 'n', 'v', 'a', 's' or 'r'. Without NUMBER, lookup_synset() will return all matches in POS. NUMBER is the sequential order of the desired synset within POS.

$synset = $wn->lookup_synset_offset(SYNSET_OFFSET)

Returns the synset object stored at SYNSET_OFFSET. The EXAMPLES section passes the offset in the form "00333350%n".

$synset = $wn->new_synset(WORD,POS);

Creates a new (empty) synset entry in the database. Both WORD and POS are required. An offset will be assigned when write() is called.

Lingua::Wordnet::Synset functions

@words = $synset->words([TEXT ...])

Retrieves or sets the list of words for this synset. add_words() should be used if you are only adding an entry, rather than setting all entries. Each word is in the format: TEXT%SENSE, where TEXT is the word, and SENSE is the sense number for the word. If SENSE is not supplied when assigning words to a synset, Lingua::Wordnet will assign the appropriate sense numbers to the words when $synset->write() is called (since they must be unique). In this case, the word list should consist only of the word text, without the '%'. The new words will be written to the data and index files.
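
The TEXT%SENSE format described above is easy to take apart in plain Perl. The following is a minimal sketch; split_word_entry() and display_word() are hypothetical helper names used for illustration and are not part of Lingua::Wordnet:

```perl
use strict;
use warnings;

# Split a word entry of the form TEXT%SENSE into its two parts.
# Returns (TEXT, SENSE); SENSE is undef when the entry has no '%' suffix,
# as is the case for words supplied without a sense number.
sub split_word_entry {
    my ($entry) = @_;
    if ($entry =~ /^(.*)%(\d+)$/) {
        return ($1, $2);
    }
    return ($entry, undef);
}

# Produce display text: drop the sense number and turn underscores into spaces.
sub display_word {
    my ($entry) = @_;
    my ($text) = split_word_entry($entry);
    $text =~ tr/_/ /;
    return $text;
}

print display_word("softball_game%0"), "\n";    # prints "softball game"
```

The same helpers work on the list returned by $synset->words(), for example: map { display_word($_) } $synset->words.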

$wn->familiarity(WORD, POS [, POLY_CNT])

Returns an integer of the familiarity/polysemy count for WORD in POS. Given a third value POLY_CNT, sets the polysemy count for WORD in POS. In Lingua::Wordnet, this is a value which must be updated by the user, and is not automatically modified. This makes it useful for recording familiarity or frequency counts outside of the Wordnet lexicons. Note that polysemy within Lingua::Wordnet can be identified for a given word by counting the synsets returned by lookup_synset().
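
Counting the synsets returned by lookup_synset(), as the note above suggests, might look like the sketch below. polysemy_count() is a hypothetical helper, not a module function, and it assumes that lookup_synset() returns an empty list when nothing matches:

```perl
use strict;
use warnings;

# Count the synsets that lookup_synset() returns for WORD in POS.
# $wn is expected to be a Lingua::Wordnet object, or any object that
# provides a compatible lookup_synset(TEXT, POS) method.
sub polysemy_count {
    my ($wn, $word, $pos) = @_;
    my @synsets = $wn->lookup_synset($word, $pos);
    return scalar @synsets;
}
```

A call such as polysemy_count($wn, "canary", "n") then gives the number of noun senses found, independent of whatever value has been stored with familiarity().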

$wn->morph(WORD, POS)

Returns an array containing the base form(s) of WORD in POS as found in the Wordnet morph files. The lookup_synset() function performs morphological conversion automatically, so a call to morph() is not required. Changes: This function now returns an array, because one WORD may have more than one base form.

$synset->overview()

Returns the terms and gloss for the synset in a format for printing. This method is also used to overload a print performed on the synset. Note that this is different from the "overview" parameter of the 'wn' executable, since it only returns information about the current synset.

$synset->write()

Writes any changes made to $synset to the database and updates all affected synset data and indexes. If $synset passes out of scope before write() is called, the changes are lost.

All of the following functions retrieve data in synsets. Each has two corresponding functions, which are called by prepending 'add_' or 'delete_' to the function name. These functions accept one or more synset objects as input. Unless noted otherwise, any returned data is a synset object or an array of synset objects. See below for example usage.

 $synset->antonyms()
 $synset->add_antonyms($synset2[, ...])
 $synset->delete_antonyms($synset2[, ...])

Returns, adds, or deletes antonyms for $synset. WARNING: When adding or deleting synset pointers in Wordnet, it is important to make the matching change in the corresponding synset in order to maintain database accuracy. Earlier versions of this module planned to automate this. That plan has been abandoned in favor of keeping control over database writes with the 'write()' function, and the reciprocal update is now considered functionality which belongs outside of the module. Thus, after adding '$synset2' as an antonym of '$synset' as in the above example, your program must also add '$synset' as an antonym of '$synset2'.
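
One way a program might keep both sides of an antonym pair in step is sketched below. link_antonyms() is a hypothetical helper, not part of the module; it uses only add_antonyms() and write() as documented here, and it assumes $wn->unlock() has already been called:

```perl
use strict;
use warnings;

# Record an antonym relation in both directions and save both synsets.
# Both arguments are expected to be Lingua::Wordnet::Synset objects.
sub link_antonyms {
    my ($synset_a, $synset_b) = @_;
    $synset_a->add_antonyms($synset_b);    # a -> b
    $synset_b->add_antonyms($synset_a);    # b -> a (the reciprocal pointer)
    $synset_a->write();
    $synset_b->write();
    return;
}
```

The same pattern should carry over to other paired pointers, such as hypernyms with hyponyms, by swapping in the matching add_ functions.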

$synset->hypernyms()

Returns hypernyms for $synset.

$synset->hyponyms()

Returns hyponyms for $synset.

$synset->entailment()

Returns verb entailment pointers.

$synset->synonyms()

Returns a list of words within $synset.

$synset->comp_meronyms()

Returns component-object meronyms for $synset.

$synset->member_meronyms()

Returns member-collection meronyms for $synset.

$synset->stuff_meronyms()

Returns stuff-object meronyms for $synset (a.k.a. substance-object).

$synset->portion_meronyms()

Returns portion-mass meronyms for $synset.

$synset->feature_meronyms()

Returns feature-activity meronyms for $synset.

$synset->place_meronym()

Returns place-area meronyms for $synset.

$synset->phase_meronym()

Sets or returns phase-process meronyms for $synset.

$synset->all_meronyms()

Returns an array of synset objects for all meronym types of $synset.

$synset->all_holonyms()

Returns an array of synset objects for all holonyms of $synset.

The following seven functions mirror the above functionality for holonyms, and accordingly have corresponding add_ and delete_ functions which update any set values to the corresponding meronym pointers:

 $synset->comp_holonyms()
 $synset->member_holonyms()
 $synset->stuff_holonyms()
 $synset->portion_holonyms()
 $synset->feature_holonyms()
 $synset->place_holonyms()
 $synset->phase_holonyms()

$synset->gloss([TEXT])

Returns the gloss for $synset. If TEXT is present, the gloss for $synset will be assigned that value.

$synset->attributes()

Returns a list of synset objects of attribute pointers for $synset.

$synset->functions()

Returns a list of synset objects of function pointers for $synset.

$synset->causes()

Returns the 'cause to' pointers for verbs.

$synset->pertainyms()

Returns the 'pertains to' pointers for adjectives and adverbs.

$synset->frames()

Returns a text array of verb frames for $synset. The add_frames() and delete_frames() functions accept only integers corresponding to the frames. The list of frames can be edited in Wordnet.pm directly, but probably shouldn't be.

$synset->lex_info([INT])

Returns a string containing lexicographer file information. The optional INT assigns the lexicographer file information, and should correspond to the file list in Wordnet.pm.

$synset->offset()

Returns the synset offset of $synset.

EXAMPLES

Extensive examples can be found in the 'scripts/' directory; here I will summarize the basic functionality. There are also some examples in the pod documentation for Lingua::Wordnet::Analysis.

This will display a hypernym tree for $synset:

 my $synset = $wn->lookup_synset_offset("00333350%n");
 my $i = 0;    # current indentation depth
 while ($synset = ($synset->hypernyms)[0]) {
    $i++;
    print " " x $i, "-> ", $synset->words, "\n";
 }

Outputting the following for synset "baseball":

 -> field_game%0
  -> outdoor_game%0
   -> athletic_game%0
    -> sport%0athletics%0
     -> diversion%0recreation%0
      -> activity%0
       -> act%0human_action%0human_activity%0
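
The same walk can be split into a step that collects the chain and a step that formats it, which makes the result reusable. This is only a sketch: hypernym_chain(), format_chain(), and the $max_depth guard are hypothetical names introduced for illustration, and the code assumes that hypernyms() returns an empty list at the top of the tree:

```perl
use strict;
use warnings;

# Follow the first hypernym of each synset upward and return the chain
# as a list of synset objects, nearest ancestor first.
# $max_depth is a safety limit against cyclic data (default 50).
sub hypernym_chain {
    my ($synset, $max_depth) = @_;
    $max_depth = 50 unless defined $max_depth;
    my @chain;
    while (@chain < $max_depth) {
        my ($parent) = $synset->hypernyms;
        last unless defined $parent;
        push @chain, $parent;
        $synset = $parent;
    }
    return @chain;
}

# Format a chain as indented lines, one per ancestor, words comma-separated.
sub format_chain {
    my @chain = @_;
    my @lines;
    my $depth = 0;
    for my $ancestor (@chain) {
        $depth++;
        push @lines, (" " x $depth) . "-> " . join(", ", $ancestor->words);
    }
    return @lines;
}
```

With a real synset, print "$_\n" for format_chain(hypernym_chain($synset)); prints one ancestor per line. Unlike the loop above, the words of each synset are joined with ", " here.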

The example below will create a synset object and print a list of the hyponyms for that object:

 use Lingua::Wordnet;
 my $wn = new Lingua::Wordnet;
 my $synset = $wn->lookup_synset("baseball","n",1);
 print "The following are kinds of baseball games:\n";
 foreach $bb_synset ($synset->hyponyms) {
     my $words;
     foreach $word ($bb_synset->words) {
         $word =~ s/\%\d+$//; $word =~ s/\_/ /g;
         $words .= "$word, ";
     }
     $words =~ s/\,\s*$//;
     print "  $words\n";
 }
 $wn->close();

This will output:

 The following are kinds of baseball games:
   professional baseball
   hardball
   perfect game
   no-hit game, no-hitter
   one-hitter, 1-hitter
   two-hitter, 2-hitter
   three-hitter, 3-hitter
   four-hitter, 4-hitter
   five-hitter, 5-hitter
   softball, softball game
   rounders
   stickball, stickball game

And an assignment example. This will create a new synset and add it to the kinds of baseball games. We unlock the Wordnet files to enable changes to the database:

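 use Lingua::Wordnet;
 my $wn = new Lingua::Wordnet;
 $wn->unlock();
 my $synset = $wn->lookup_synset("baseball","n",1);
 my $newsynset = $wn->new_synset("fooball","n");
 $newsynset->gloss("A baseball game in which a foo is used.");
 $newsynset->write();                 # an offset is assigned when write() is called
 $synset->add_hyponyms($newsynset);
 $newsynset->add_hypernyms($synset);  # reciprocal pointer, see the WARNING above
 $synset->write();                    # changes are lost unless written
 $newsynset->write();
 $wn->close();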

Remember, preceding most synset function names with "add_" appends the supplied data to the corresponding field, rather than replacing its value.

We could add an attribute 'fun' to "fooball" thus (not necessarily a recommended pointer, but it will suffice for an example):

 $fun_synset = $wn->lookup_synset("fun","a",1);
 $newsynset->add_attributes($fun_synset);
 $newsynset->write();

See the Lingua::Wordnet::Analysis documentation for examples of retrieving and searching entire trees, and of the inheritance functions.

BUGS/TODO

Please send bugs and suggestions/requests to dbrian@brians.org. Development on this module is active as of Spring 2001.

Clean up code, put references where beneficial.

AUTHOR

Dan Brian <dbrian@brians.org>

SEE ALSO

Lingua::Wordnet::Analysis.