Uplug::PreProcess::Tokenizer
    my $tokenizer = Uplug::PreProcess::Tokenizer->new( lang => 'en' );
    my @tokens = $tokenizer->tokenize( 'Mr. Smith says: "What is a text anyway?"' );
    my $text = $tokenizer->detokenize( '" Big improvement ! " says Mr. Smith .' );
tokenize
Tokenize a given text. Returns a list of tokens.
detokenize
De-tokenize a space-separated text or a list of tokens. Returns plain text.
load_prefixes
Load language specific abbreviations and other non-breaking prefixes.
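To make the role of non-breaking prefixes concrete, here is a minimal, self-contained sketch of the general technique. It is not Uplug's or Moses's actual implementation: the subroutine names sketch_tokenize and sketch_detokenize and the hard-coded %NONBREAKING table are hypothetical, and the rules are far simpler than those of the real module.

```perl
use strict;
use warnings;

# Hypothetical, hard-coded stand-in for a language-specific prefix list.
# The real module obtains such prefixes through load_prefixes.
my %NONBREAKING = map { $_ => 1 } qw(Mr Mrs Dr Prof);

# Split text into tokens. Punctuation becomes a token of its own, except
# a period that directly follows a known non-breaking prefix.
sub sketch_tokenize {
    my ($text) = @_;
    my @tokens;
    for my $chunk ( split ' ', $text ) {
        # Peel leading punctuation off the chunk, one character at a time.
        while ( $chunk =~ s/^([[:punct:]])// ) {
            push @tokens, $1;
        }
        # Peel trailing punctuation, keeping it in its original order.
        my @tail;
        while ( $chunk =~ /^(.*?)([[:punct:]])$/s ) {
            my ( $head, $punct ) = ( $1, $2 );
            last if $punct eq '.' && $NONBREAKING{$head};
            unshift @tail, $punct;
            $chunk = $head;
        }
        push @tokens, $chunk if length $chunk;
        push @tokens, @tail;
    }
    return @tokens;
}

# Join tokens back into plain text. Sentence punctuation and closing
# double quotes attach to the left; opening double quotes attach to the right.
sub sketch_detokenize {
    my (@tokens) = @_;
    my $text       = '';
    my $quote_open = 0;    # true while inside a double-quoted span
    my $glue_next  = 1;    # true if no space goes before the next token
    for my $tok (@tokens) {
        if ( $tok =~ /^[.,;:!?]$/ ) {
            $text .= $tok;
            $glue_next = 0;
        }
        elsif ( $tok eq '"' ) {
            if ($quote_open) {
                $text .= $tok;
                $glue_next = 0;
            }
            else {
                $text .= ( $glue_next ? '' : ' ' ) . $tok;
                $glue_next = 1;
            }
            $quote_open = !$quote_open;
        }
        else {
            $text .= ( $glue_next ? '' : ' ' ) . $tok;
            $glue_next = 0;
        }
    }
    return $text;
}

my @tokens = sketch_tokenize('Mr. Smith says: "What is a text anyway?"');
print join( ' ', @tokens ), "\n";

my $plain = sketch_detokenize( split ' ', '" Big improvement ! " says Mr. Smith .' );
print "$plain\n";
```

Because "Mr" is in the prefix table, the sketch keeps "Mr." as a single token instead of splitting off the period, which is the behaviour a prefix list is meant to provide.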
This module relies heavily on the tokenizer and detokenizer implementations from the Moses toolkit for statistical machine translation (SMT). All credit goes to the original authors, Josh Schroeder and Philipp Koehn.