adaptAlSetToBilCorpus.pl - Checks whether the Alignment Set sentence pairs are present in another bilingual corpus and, for each sentence pair that is not in the corpus, searches for the corpus sentence pair with the best longest common subsequence (LCS) ratio. Finally, it detects the edits (word insertions, deletions, and substitutions) needed to go from the Alignment Set sentences to the corpus sentences with the best LCS ratio, prints the edit list, and carries these edits over to the output links file.
perl adaptAlSetToBilCorpus.pl [options] required_arguments
See description in the manual (-man option).
Required arguments:
    -ist FILENAME            Input source-to-target links file
    -if BLINKER|GIZA|NAACL   Input file(s) format (required if not TALP)
    -cs FILENAME             New corpus source text file
    -ct FILENAME             New corpus target text file
    -ost FILENAME            Output source-to-target links file
    -of BLINKER|GIZA|NAACL   Output file(s) format (required if not TALP)
Options:
    -pdiff FLOAT     Percentage difference in number of words allowed to calculate LCS [default 20]
    -mindiff INT     LCS is still calculated when the word number difference is below mindiff [default 3]
    -maxdiff INT     Maximum difference in number of words allowed to calculate LCS [default 6]
    -wfirst INT      Number of words to consider in the first LCS calculation [default 5]
    -is FILENAME     Input source words file
    -it FILENAME     Input target words file
    -its FILENAME    Input target-to-source links file
    -os FILENAME     Output source words file
    -ot FILENAME     Output target words file
    -ots FILENAME    Output target-to-source links file
    -range BEGIN-END Input Alignment Set range
    -alignMode as-is|null-align|no-null-align
                     Alignment mode
    -help|?          Prints the help and exits
    -man             Prints the manual and exits
    -v 0-3           0: silent, 1: verbose mode, 2,3: debug
-ist FILENAME
    Input source-to-target (i.e. links) file name (or directory, in case of BLINKER format).
-if BLINKER|GIZA|NAACL
    Input Alignment Set format (required if different from the default, TALP).
-cs FILENAME
    New corpus source text file.
-ct FILENAME
    New corpus target text file.
-ost FILENAME
    Output (new format) source-to-target (i.e. links) file name (or directory, in case of BLINKER format).
-of BLINKER|GIZA|NAACL
    Output (new) Alignment Set format (required if different from the default, TALP).
-pdiff FLOAT
    Percentage difference in number of words allowed to calculate LCS [default 20].
-mindiff INT
    LCS is still calculated when the word number difference is below mindiff [default 3].
-maxdiff INT
    Maximum difference in number of words allowed to calculate LCS [default 6].
-wfirst INT
    Number of words to consider in the first LCS calculation [default 5].
-os FILENAME
    Output (new format) source (words) file name. Not applicable in GIZA format.
-ot FILENAME
    Output (new format) target (words) file name. Not applicable in GIZA format.
-ots FILENAME
    Output (new format) target-to-source (i.e. links) file name (or directory, in case of BLINKER format).
-alignMode as-is|null-align|no-null-align
    Take the alignment "as-is", or force NULL alignment or NO-NULL alignment (see the AlignmentSet.pm documentation).
-help|?
    Prints a help message and exits.
This script checks whether the Alignment Set sentence pairs are present in the provided bilingual corpus. For each sentence pair that is not in the corpus, it searches for the corpus sentence pair with the best longest common subsequence (LCS) ratio at the character level. Because this can be extremely slow for a large corpus, several options are provided to avoid calculating the LCS for most sentence pairs. First, sentences of very different lengths cannot have a large LCS ratio, so the calculation is skipped in those cases. Second, if the beginnings of the sentences are totally different (the LCS ratio of the beginnings at the word level is zero), the sentences are unlikely to have a large LCS ratio either, and they are skipped as well. If the LCS ratio of the beginnings is not zero, the LCS ratio of the whole sentences is calculated. To go faster, it is first calculated at the word level, and then at the character level for the best matching pairs only.
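The word-level and character-level comparison described above can be sketched as follows. This is only an illustration, not the script's actual code: the function names are hypothetical, and the ratio definition (LCS length divided by the length of the longer sequence) is an assumption, since the exact normalisation used by the script is not stated here.

```perl
use strict;
use warnings;
use List::Util qw(max);

# Length of the longest common subsequence of two array refs,
# computed with the classic dynamic-programming recurrence.
# Only two rows of the table are kept in memory.
sub lcs_length {
    my ($x, $y) = @_;
    my $n = scalar @$x;
    my $m = scalar @$y;
    my @prev = (0) x ($m + 1);
    for my $i (1 .. $n) {
        my @cur = (0);
        for my $j (1 .. $m) {
            $cur[$j] = $x->[$i - 1] eq $y->[$j - 1]
                ? $prev[$j - 1] + 1
                : max($prev[$j], $cur[$j - 1]);
        }
        @prev = @cur;
    }
    return $prev[-1];
}

# ASSUMPTION: ratio = LCS length / length of the longer sequence.
# Two empty sequences are treated as identical (ratio 1).
sub lcs_ratio {
    my ($x, $y) = @_;
    my $longest = max(scalar @$x, scalar @$y);
    return $longest ? lcs_length($x, $y) / $longest : 1;
}

# Word level: cheap first pass over whitespace-separated tokens.
sub word_lcs_ratio {
    my ($s, $t) = @_;
    return lcs_ratio([split ' ', $s], [split ' ', $t]);
}

# Character level: finer but slower, reserved for the best candidates.
sub char_lcs_ratio {
    my ($s, $t) = @_;
    return lcs_ratio([split //, $s], [split //, $t]);
}
```

Under these assumptions, a caller would rank corpus candidates with word_lcs_ratio first and run char_lcs_ratio only on the few best ones, which mirrors the two-stage strategy described above.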
The pdiff option sets the percentage difference in number of words allowed for the LCS calculation to go ahead. For example, if pdiff=20 and the alignment set sentences are 10 and 15 words long respectively, the LCS is calculated only for corpus sentences of 8-12 words and 12-18 words respectively.
The mindiff option guarantees that the LCS is calculated whenever the length difference is less than this threshold, even if the pdiff criterion alone would reject the pair.
The maxdiff option skips the LCS calculation if the length difference is more than this threshold.
The nfirst option sets the number of leading words considered for the first LCS calculation (if nfirst=5, the LCS is calculated for the whole sentences only if the LCS of the first 5 words is not zero).
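The first-words test can be sketched as below. The LCS of two word lists is non-zero exactly when the lists share at least one word, so the check reduces to a membership test. The function name and its calling convention are hypothetical and are not taken from the script.

```perl
use strict;
use warnings;

# Returns true when the first $n words of the two sentences have at
# least one word in common, i.e. when their word-level LCS is non-zero.
# $n defaults to 5, the documented default of the option.
sub prefixes_share_word {
    my ($sent_a, $sent_b, $n) = @_;
    $n = 5 unless defined $n;
    my @a = split ' ', $sent_a;
    my @b = split ' ', $sent_b;
    # Slices past the end of a short sentence yield undef; drop those.
    my @head_a = grep { defined($_) } @a[0 .. $n - 1];
    my @head_b = grep { defined($_) } @b[0 .. $n - 1];
    my %seen = map { ($_ => 1) } @head_a;
    my $common = grep { $seen{$_} } @head_b;
    return $common > 0;
}
```

A pair that fails this test is skipped without computing the LCS of the whole sentences.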
The final allowed length difference is max( min(al_set_length*pdiff/100,maxdiff) , mindiff )
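As a worked illustration of this formula, the following sketch computes the allowed difference and applies it as a length filter. The function names are hypothetical; the defaults are the documented ones (pdiff 20, mindiff 3, maxdiff 6). Whether the script rounds the percentage term, and whether its comparison is strict, is not stated here, so this sketch leaves the value unrounded and uses "less than or equal" as an assumption.

```perl
use strict;
use warnings;
use List::Util qw(min max);

# Allowed word-count difference:
#   max( min(al_set_length * pdiff / 100, maxdiff), mindiff )
sub allowed_length_diff {
    my ($al_set_length, %opt) = @_;
    my $pdiff   = defined $opt{pdiff}   ? $opt{pdiff}   : 20;
    my $mindiff = defined $opt{mindiff} ? $opt{mindiff} : 3;
    my $maxdiff = defined $opt{maxdiff} ? $opt{maxdiff} : 6;
    return max(min($al_set_length * $pdiff / 100, $maxdiff), $mindiff);
}

# True when a corpus sentence is close enough in length to the
# alignment set sentence to be worth an LCS calculation.
# ASSUMPTION: the boundary value itself is accepted (<=).
sub length_compatible {
    my ($al_set_length, $corpus_length, %opt) = @_;
    my $diff = abs($al_set_length - $corpus_length);
    return $diff <= allowed_length_diff($al_set_length, %opt);
}
```

With the defaults, a 10-word sentence gives max(min(2, 6), 3) = 3, a 20-word sentence gives max(min(4, 6), 3) = 4, and a 50-word sentence gives max(min(10, 6), 3) = 6.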
Finally, the script detects the edits (word insertions, deletions, and substitutions) needed to go from the Alignment Set sentences to the corpus sentences with the best LCS ratio, prints the edit list, and carries these edits over to the output links file.
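One way to detect such word edits is a standard edit-distance table with a backtrace, sketched below. This is an illustrative assumption, not the script's own procedure: the function name and the edit representation are hypothetical, and the update of the links file is not modelled.

```perl
use strict;
use warnings;

# Returns a list of edits turning @$from into @$to with a minimal
# number of operations. Each edit is [op, old_word, new_word], where
# op is 'ins', 'del' or 'sub' (undef marks the missing side).
sub word_edits {
    my ($from, $to) = @_;
    my ($n, $m) = (scalar @$from, scalar @$to);

    # $d[$i][$j]: edit distance between the first $i and $j words.
    my @d;
    $d[$_][0] = $_ for 0 .. $n;
    $d[0][$_] = $_ for 0 .. $m;
    for my $i (1 .. $n) {
        for my $j (1 .. $m) {
            my $cost = $from->[$i - 1] eq $to->[$j - 1] ? 0 : 1;
            my $best = $d[$i - 1][$j - 1] + $cost;
            $best = $d[$i - 1][$j] + 1 if $d[$i - 1][$j] + 1 < $best;
            $best = $d[$i][$j - 1] + 1 if $d[$i][$j - 1] + 1 < $best;
            $d[$i][$j] = $best;
        }
    }

    # Walk back from the bottom-right corner, collecting the edits.
    my @edits;
    my ($i, $j) = ($n, $m);
    while ($i > 0 || $j > 0) {
        my $same = ($i > 0 && $j > 0 && $from->[$i - 1] eq $to->[$j - 1]);
        if ($i > 0 && $j > 0
            && $d[$i][$j] == $d[$i - 1][$j - 1] + ($same ? 0 : 1)) {
            unshift @edits, ['sub', $from->[$i - 1], $to->[$j - 1]] unless $same;
            $i--;
            $j--;
        }
        elsif ($i > 0 && $d[$i][$j] == $d[$i - 1][$j] + 1) {
            unshift @edits, ['del', $from->[$i - 1], undef];
            $i--;
        }
        else {
            unshift @edits, ['ins', undef, $to->[$j - 1]];
            $j--;
        }
    }
    return @edits;
}
```

Called as word_edits([split ' ', $al_set_sentence], [split ' ', $corpus_sentence]), it returns the edits in sentence order, which could then be printed as an edit list.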
perl adaptAlSetToBilCorpus.pl -ist alignref-1.0/sum/tagged.engspa.naacl -is alignref-1.0/tagged.eng.iso.naacl -it alignref-1.0/tagged.spa.iso.naacl -cs euparl05may.tagged/train.eng.iso -ct euparl05may.tagged/train.spa.iso -os euparl05may.tagged/alignref-1.0/eng.iso -ot euparl05may.tagged/alignref-1.0/spa.iso
Patrik Lambert <lambert@gps.tsc.upc.es>
Copyright 2005 by Patrik Lambert
This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License (version 2 or any later version).