nat-pre - A pre-processor for parallel texts, counting words, checking sentence numbers, and creating auxiliary files.
nat-pre <crp-text1> <crp-text2> <lex1> <lex2> <crp1> <crp2>
This tool is integrated with the nat-these command and is not intended to be used directly by the user. It is an independent command so that it can be used inside other programs and projects.
The tool's purpose is to pre-process parallel corpus texts and create auxiliary files that give direct access to corpus and lexicon information.
The crp-text1 and crp-text2 files should be plain text and sentence-aligned. Each text should use lines containing the single character $ as sentence separators. Because the texts are aligned, both files must contain the same number of sentences.
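The sentence-count requirement can be checked before running nat-pre. The following is a minimal Python sketch, not part of NATools: the function names are hypothetical, and it assumes that a separator is any line whose only non-whitespace content is `$`.

```python
def count_sentences(lines):
    """Count the lines that consist solely of the '$' sentence separator."""
    return sum(1 for line in lines if line.strip() == "$")


def sentence_counts_match(lines1, lines2):
    """Return (ok, n1, n2) for two iterables of text lines.

    ok is True when both texts contain the same number of separators,
    which is what sentence-aligned input requires.
    """
    n1 = count_sentences(lines1)
    n2 = count_sentences(lines2)
    return n1 == n2, n1, n2
```

Open file handles can be passed directly, for example `sentence_counts_match(open("txt_PT"), open("txt_EN"))`.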
For example, given the aligned text files txt_PT and txt_EN, you would run:
nat-pre txt_PT txt_EN txt_PT.lex txt_EN.lex txt_PT.crp txt_EN.crp
Here the .lex files are lexical files and the .crp files are corpus files.
If you process more than one pair of files using the same lexical file names, existing identifiers are reused and the lexical files are extended.
Corpus and lexical files are written in binary format and can be accessed using the NATools source code. Here is a brief description of their formats:
Lexical files describe the words used in the corpus. Each distinct word (compared case-insensitively) is associated with an integer identifier, and the file records this mapping.
The format for this file (binary) is:
    number of words (unsigned integer, 32 bits)
    'number of words' times:
        word identifier (unsigned integer, 32 bits)
        word occurrences (unsigned integer, 32 bits)
        word (character sequence, ending with a null)
If you need to access these files directly, download the NATools source and use the functions in src/words.[ch].
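For illustration only, the lexical-file layout described above could be decoded along the following lines. This is a hedged Python sketch, not the NATools implementation: `parse_lexicon` is a hypothetical name, and the byte order (little-endian), the absence of padding between fields, and the UTF-8 word encoding are all assumptions that the documentation does not confirm. The functions in src/words.[ch] remain the authoritative reader.

```python
import struct


def parse_lexicon(data, byteorder="<"):
    """Decode lexical-file bytes into {word_id: (occurrences, word)}.

    ASSUMPTIONS: little-endian 32-bit integers, no padding between
    fields, words stored as UTF-8 and terminated by a null byte.
    """
    (count,) = struct.unpack_from(byteorder + "I", data, 0)
    offset = 4
    entries = {}
    for _ in range(count):
        # Two 32-bit unsigned integers: identifier, then occurrences.
        word_id, occurrences = struct.unpack_from(byteorder + "II", data, offset)
        offset += 8
        # The word runs up to (not including) the next null byte.
        end = data.index(b"\x00", offset)
        word = data[offset:end].decode("utf-8", errors="replace")
        entries[word_id] = (occurrences, word)
        offset = end + 1
    return entries
```

If real files do not decode sensibly, the byte order or the padding assumption is the first thing to revisit.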
Corpus files describe the corpus texts, with each word replaced by its corresponding integer identifier.
The binary format for this gzipped file is:
    corpus size: number of words (unsigned integer, 32 bits)
    'corpus size' times:
        word identifier (unsigned integer, 32 bits)
        flags set (character, 8 different flags)
If you need to access these files directly, download the NATools source and use the functions in src/corpus.[ch].
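As a rough illustration of the corpus-file layout, the Python sketch below decompresses a file and decodes the records. It is not the NATools implementation: `parse_corpus` and `read_corpus` are hypothetical names, and it assumes little-endian integers and tightly packed 5-byte records (a 32-bit identifier followed by one flags byte). If the C code writes padded structures, the record size will differ, so src/corpus.[ch] should be treated as authoritative.

```python
import gzip
import struct


def parse_corpus(raw, byteorder="<"):
    """Decode decompressed corpus bytes into a list of (word_id, flags).

    ASSUMPTIONS: little-endian integers and no padding, so each record
    is 4 bytes of identifier plus 1 byte of flags.
    """
    (size,) = struct.unpack_from(byteorder + "I", raw, 0)
    record = struct.Struct(byteorder + "IB")
    offset = 4
    tokens = []
    for _ in range(size):
        word_id, flags = record.unpack_from(raw, offset)
        tokens.append((word_id, flags))
        offset += record.size
    return tokens


def read_corpus(path):
    """Read a gzipped corpus file from disk and decode it."""
    with gzip.open(path, "rb") as handle:
        return parse_corpus(handle.read())
```

The flags byte is returned undecoded, since the bit assigned to each flag is not stated here.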
The flags used are:
the word appeared all in UPPERCASE;
the word appeared Capitalized;
Two additional files with the .crp.index extension are also created; they index the sentence offsets in the corpus files.
NATools documentation;
Copyright (C)2002-2009 Alberto Simoes and Jose Joao Almeida
Copyright (C)1998 Djoerd Hiemstra
GNU GENERAL PUBLIC LICENSE (LGPL) Version 2 (June 1991)