
NAME

Lingua::ResourceAdequacy - Measures to estimate the adequacy of a terminology given a text

SYNOPSIS

    use Lingua::ResourceAdequacy;

    my $RA = Lingua::ResourceAdequacy->new("word_list" => \@words, "term_list" => \@terms,
              "UP" => \@UP, "DUP" => \@DUP);
    $RA->term_list_stats();
    $RA->word_list_stats();
    $RA->AdequacyMeasures();
    $RA->print_AdequacyMeasures();

DESCRIPTION

Lingua::ResourceAdequacy provides measures to estimate the adequacy of a terminological resource with respect to a textual corpus, i.e. whether the terminological resource can be used on a specialised textual corpus.

Given a textual document collection, a terminological resource (i.e. a term list) and its useful part (i.e. the terms found in the texts), the module provides four measures to estimate the adequacy of the resource with respect to the document collection: Contribution, Recognition, Coverage and Density.

Four lists are required as input: the term list, its useful part (the terms that matched in the texts), the decomposed useful part (each matched term segmented into words) and the word list of the document collection.
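The following is a minimal sketch of how the four input lists could be prepared from a toy corpus. The sample text, the sample terms, the whitespace tokenisation and the regex-based matching are all assumptions made for illustration; the module itself does not prescribe how the lists are built.

```perl
use strict;
use warnings;

# Hypothetical toy data: a one-sentence corpus and a three-term resource.
my $text  = "the gene expression level depends on the gene promoter";
my @terms = ("gene expression", "gene promoter", "protein kinase");

# Word list of the document collection (naive whitespace tokenisation).
my @words = split /\s+/, $text;

# Useful part (UP): one entry per occurrence of a term in the text.
my @UP;
for my $term (@terms) {
    my $count = () = $text =~ /\b\Q$term\E\b/g;
    push @UP, ($term) x $count;
}

# Decomposed useful part (DUP): each matched term split into its words.
my @DUP = map { split /\s+/, $_ } @UP;

print "word list size: ", scalar(@words), "\n";
print "UP:  ", join(" | ", @UP),  "\n";
print "DUP: ", join(" | ", @DUP), "\n";
```

The arrays @words, @terms, @UP and @DUP can then be passed by reference to new() or to the set_ methods described below.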

As output, the adequacy measures are stored in the AdequacyMeasures field. Two complementary measures are also provided: excess, the complement of the contribution, and ignorance, the complement of the recognition.

METHODS

new()

    $RA = Lingua::ResourceAdequacy->new("word_list" => \@words, "term_list" => \@terms, 
              "UP_list" => \@UP_list, "DUP_list" => \@DUP_list, );

This method creates a new Lingua::ResourceAdequacy object. The following optional key/value parameters set the internal fields. Each key has a corresponding set_ method that can be used to change the value later on, so the parameters can be omitted when the object is created.

word_list: this key sets the word list of the corpus. The array containing the word list is copied into an internal array.
term_list: this key sets the term list. The array containing the term list is copied into an internal array.
UP_list: this key sets the useful part of the term list. The array containing this list is copied into an internal array.
DUP_list: this key sets the useful part of the term list, segmented into words. The array containing this list is copied into an internal array.

set_word_list()

   $RA->set_word_list(\@word_list);

This method sets the internal field containing the word list of the corpus. The parameter is an array reference.

set_term_list()

   $RA->set_term_list(\@term_list);

This method sets the internal field containing the term list, i.e. the whole terminological resource. The parameter is an array reference.

set_DUP_list()

   $RA->set_DUP_list(\@DUP_list);

This method sets the internal field containing the useful part of the term list, each term being segmented into words. Each array element contains one word of a term. A word can appear several times. The parameter is an array reference.

set_UP_list()

   $RA->set_UP_list(\@UP_list);

This method sets the internal field containing the useful part of the term list, i.e. all the terms matching in the corpus. Each array element contains a term. A term can appear several times. The parameter is an array reference.

word_list_stats()

     $RA->word_list_stats();

This method computes the basic statistics associated with the word list: the size of the list, the frequency of each word, the size of the vocabulary (the word list without duplicates) and the average frequency of the words.
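As an illustration of these four statistics, here is a minimal standalone sketch. The helper list_stats and its hash keys are hypothetical and are not part of the module's API. The average frequency is assumed to be the list size divided by the vocabulary size, which is the mean of the per-element frequencies.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's implementation): computes the
# size, per-element frequency, vocabulary size and average frequency.
sub list_stats {
    my ($list) = @_;
    my %freq;
    $freq{$_}++ for @$list;
    my $size  = scalar @$list;
    my $vocab = scalar keys %freq;
    return {
        size              => $size,
        frequency         => \%freq,
        vocabulary_size   => $vocab,
        # ASSUMPTION: average frequency = list size / vocabulary size.
        average_frequency => $vocab ? $size / $vocab : 0,
    };
}

my $stats = list_stats([qw(the gene expression the gene promoter)]);
printf "size=%d vocabulary=%d average=%.2f\n",
    $stats->{size}, $stats->{vocabulary_size}, $stats->{average_frequency};
```

The same computation applies to any of the four lists, which is presumably why the module exposes a generic internal method parameterised by the field name.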

term_list_stats()

     $RA->term_list_stats();

This method computes the basic statistics associated with the term list: the size of the list, the frequency of each term, the number of distinct terms and the average frequency of the terms.

     $RA->print_word_list_stats();

This method prints the statistics associated with the word list.

     $RA->print_term_list_stats();

This method prints the statistics associated with the term list.

_average_Frequency()

     $RA->_average_Frequency($field_name);

This internal method computes the average frequency of the elements from the list defined by $field_name.

_list_stats()

     $RA->_list_stats($field_name);

This internal method computes the basic statistics associated with the list defined by $field_name: the size of the list, the frequency of each element, the number of distinct elements and the average frequency of the elements.

get_Vocabulary_size()

     $RA->get_Vocabulary_size($field_name);

This method returns the vocabulary size, i.e. the number of distinct elements, of the list defined by $field_name. If the list doesn't exist, the method returns -1.

get_List_size()

     $RA->get_List_size($field_name);

This method returns the list size, i.e. the number of elements of the list defined by $field_name. If the list doesn't exist, the method returns -1.
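The -1 return value lets a caller distinguish a missing list from an empty one (size 0). The sketch below mimics that convention with a hypothetical stand-in: the %lists hash and the list_size function are assumptions for illustration, since the module's internal layout is not described here.

```perl
use strict;
use warnings;

# Hypothetical stand-in for an object holding named lists.
my %lists = ( word_list => [qw(the gene expression the gene promoter)] );

# Mimics the documented convention: -1 when the named list is missing.
sub list_size {
    my ($field_name) = @_;
    return -1 unless exists $lists{$field_name};
    return scalar @{ $lists{$field_name} };
}

for my $name (qw(word_list term_list)) {
    my $size = list_size($name);
    if ($size < 0) {
        print "$name: no such list\n";
    }
    else {
        print "$name: $size elements\n";
    }
}
```

Callers of the real getters can apply the same test on the returned value before using it in further computations.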

get_Average_frequency()

     $RA->get_Average_frequency($field_name);

This method returns the average frequency of the list elements. The list name is defined by $field_name. If the list doesn't exist, the method returns -1.

get_FrequencyLength()

     $RA->get_FrequencyLength($field_name);

This method returns the sum, over the elements of the list defined by $field_name, of the frequency of the element multiplied by its length. If the list doesn't exist, the method returns -1.
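A minimal sketch of this sum follows. The documentation does not state the unit of the length, so the sketch assumes it is the number of words in the element; the helper frequency_length is hypothetical and is not the module's code.

```perl
use strict;
use warnings;

# Hypothetical helper: sum over distinct elements of frequency * length.
# ASSUMPTION: the length of an element is its number of words.
sub frequency_length {
    my ($list) = @_;
    my %freq;
    $freq{$_}++ for @$list;
    my $sum = 0;
    for my $element (keys %freq) {
        my @parts = split ' ', $element;
        $sum += $freq{$element} * scalar @parts;
    }
    return $sum;
}

my @UP_list = ("gene expression", "gene expression", "tumor necrosis factor");
print frequency_length(\@UP_list), "\n";
```

Under the word-count assumption, "gene expression" contributes 2 x 2 and "tumor necrosis factor" contributes 1 x 3, so the sum is also the number of words obtained when the same list is decomposed into words.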

_print_list_stats()

     $RA->_print_list_stats($field_name);

This internal method prints the statistics associated with the list defined by $field_name.

UP_list_stats()

     $RA->UP_list_stats();

This method computes the basic statistics associated with the useful part of the term list: the size of the list, the frequency of each term, the number of distinct terms, the average frequency of the terms and the sum, over the elements of the useful part, of the frequency multiplied by the length.

     $RA->print_UP_list_stats();

This method prints the statistics associated with the useful part of the term list.

DUP_list_stats()

     $RA->DUP_list_stats();

This method computes the basic statistics associated with the decomposed useful part of the term list: the size of the list, the frequency of each word, the number of distinct words and the average frequency of the words.

     $RA->print_DUP_list_stats();

This method prints the statistics associated with the decomposed useful part of the term list.

get_UP_VocabularySize()

     $RA->get_UP_VocabularySize();

This method returns the size of the useful part of the term list with duplicated terms removed. If the list doesn't exist, the method returns -1.

get_UP_ListSize()

     $RA->get_UP_ListSize();

This method returns the size of the useful part of the term list. If the list doesn't exist, the method returns -1.

get_UP_AverageFrequency()

     $RA->get_UP_AverageFrequency();

This method returns the average frequency of the useful part of the term list. If the list doesn't exist, the method returns -1.

get_UP_FrequencyLength()

     $RA->get_UP_FrequencyLength();

This method returns the sum, over the elements of the useful part of the term list, of the frequency of the element multiplied by its length. If the list doesn't exist, the method returns -1.

get_DUP_VocabularySize()

     $RA->get_DUP_VocabularySize();

This method returns the size of the word-segmented useful part of the term list with duplicated words removed. If the list doesn't exist, the method returns -1.

get_DUP_ListSize()

     $RA->get_DUP_ListSize();

This method returns the size of the word-segmented useful part of the term list. If the list doesn't exist, the method returns -1.

get_DUP_AverageFrequency()

     $RA->get_DUP_AverageFrequency();

This method returns the average frequency of the word-segmented useful part of the term list. If the list doesn't exist, the method returns -1.

get_term_list_VocabularySize()

     $RA->get_term_list_VocabularySize();

This method returns the size of the term list with duplicated terms removed. If the list doesn't exist, the method returns -1.

get_term_list_ListSize()

     $RA->get_term_list_ListSize();

This method returns the size of the term list. If the list doesn't exist, the method returns -1.

get_term_list_AverageFrequency()

     $RA->get_term_list_AverageFrequency();

This method returns the average frequency of the term list. If the list doesn't exist, the method returns -1.

get_word_list_VocabularySize()

     $RA->get_word_list_VocabularySize();

This method returns the size of the word list with duplicated words removed. If the list doesn't exist, the method returns -1.

get_word_list_ListSize()

     $RA->get_word_list_ListSize();

This method returns the size of the word list. If the list doesn't exist, the method returns -1.

get_word_list_AverageFrequency()

     $RA->get_word_list_AverageFrequency();

This method returns the average frequency of the word list. If the list doesn't exist, the method returns -1.

AdequacyMeasures()

     $RA->AdequacyMeasures();

This method computes the measures that estimate the adequacy of the terminological resource with respect to the textual corpus: Contribution, Recognition, Coverage and Density.

     $RA->print_AdequacyMeasures();

This method prints the adequacy measures.

SEE ALSO

Goritsa Ninova, Adeline Nazarenko, Thierry Hamon and Sylvie Szulman. "Comment mesurer la couverture d'une ressource terminologique pour un corpus ?" TALN 2005, pages 293-302. 6-12 June 2005, Dourdan, France.

AUTHORS

Thierry Hamon <thierry.hamon@lipn.univ-paris13.fr>

LICENSE

Copyright (C) 2007 by Thierry Hamon

This program is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.8.6 or, at your option, any later version of Perl 5 you may have available.