Lingua::Stem::Any - Unified interface to any stemmer on CPAN
This document describes Lingua::Stem::Any v0.02.
    use Lingua::Stem::Any;

    # create German stemmer using the default source module
    $stemmer = Lingua::Stem::Any->new(language => 'de');

    # create German stemmer explicitly using Lingua::Stem::Snowball
    $stemmer = Lingua::Stem::Any->new(
        language => 'de',
        source   => 'Lingua::Stem::Snowball',
    );

    # get stem for word
    $stem = $stemmer->stem($word);

    # get list of stems for list of words
    @stems = $stemmer->stem(@words);
This module aims to provide a simple unified interface to any stemmer on CPAN. When a language is requested without a source, the first available source module that supports that language is used.
The following language codes are currently supported.
┌────────────┬────┐
│ Bulgarian  │ bg │
│ Czech      │ cs │
│ Danish     │ da │
│ Dutch      │ nl │
│ English    │ en │
│ Finnish    │ fi │
│ French     │ fr │
│ Galician   │ gl │
│ German     │ de │
│ Hungarian  │ hu │
│ Italian    │ it │
│ Latin      │ la │
│ Norwegian  │ no │
│ Persian    │ fa │
│ Portuguese │ pt │
│ Romanian   │ ro │
│ Russian    │ ru │
│ Spanish    │ es │
│ Swedish    │ sv │
│ Turkish    │ tr │
└────────────┴────┘
Language codes use the two-letter ISO 639-1 format. They are accepted case-insensitively and are always returned in lowercase.
    # instantiate a stemmer object
    $stemmer = Lingua::Stem::Any->new(language => $language);

    # get current language
    $language = $stemmer->language;

    # change language
    $stemmer->language($language);
Country codes such as cz for the Czech Republic are not supported, nor are IETF language tags such as pt-PT or pt-BR.
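The rules above can be sketched as a simple lookup. This is only an illustration in plain Perl, not the module's own code: the helper name normalize_language is hypothetical, and the list of codes is copied from the table above.

```perl
use strict;
use warnings;

# Language codes copied from the table of supported languages.
my %supported = map { $_ => 1 } qw(
    bg cs da nl en fi fr gl de hu it la no fa pt ro ru es sv tr
);

# Hypothetical helper (not part of Lingua::Stem::Any): lowercase the
# code and return it when it is a supported two-letter language code,
# otherwise return undef.
sub normalize_language {
    my ($code) = @_;
    my $language = lc $code;
    return $supported{$language} ? $language : undef;
}

my $german    = normalize_language('DE');     # accepted, lowercased
my $country   = normalize_language('cz');     # country code, rejected
my $ietf_tag  = normalize_language('pt-BR');  # IETF tag, rejected
```

Under these assumptions, a caller holding an IETF tag would need to reduce it to its primary language subtag (pt) before requesting a stemmer.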
The following source modules are currently supported.
┌────────────────────────┬──────────────────────────────────────────────┐
│ Module                 │ Languages                                    │
├────────────────────────┼──────────────────────────────────────────────┤
│ Lingua::Stem::Snowball │ da nl en fi fr de hu it no pt ro ru es sv tr │
│ Lingua::Stem::UniNE    │ bg cs fa                                     │
│ Lingua::Stem           │ da de en fr gl it no pt ru sv                │
└────────────────────────┴──────────────────────────────────────────────┘
A module name is used to specify the source. If no source is specified, the first available source in the above list with support for the current language is used.
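The default selection described above might look roughly like the following. This is a minimal sketch in plain Perl, not the module's implementation: the helper default_source_for is hypothetical, the data is copied from the table above, and the sketch does not check whether a source module is actually installed.

```perl
use strict;
use warnings;

# Sources in the documented order, each with the languages it supports
# (copied from the table of source modules).
my @sources = (
    [ 'Lingua::Stem::Snowball' => [qw( da nl en fi fr de hu it no pt ro ru es sv tr )] ],
    [ 'Lingua::Stem::UniNE'    => [qw( bg cs fa )] ],
    [ 'Lingua::Stem'           => [qw( da de en fr gl it no pt ru sv )] ],
);

# Hypothetical helper: return the first listed source that supports the
# given language, or undef when none does. Installation checks are
# deliberately left out of this sketch.
sub default_source_for {
    my ($language) = @_;
    $language = lc $language;

    for my $entry (@sources) {
        my ($module, $languages) = @$entry;
        return $module if grep { $_ eq $language } @$languages;
    }

    return;
}

my $german_source   = default_source_for('de');  # listed first: Snowball
my $galician_source = default_source_for('gl');  # only Lingua::Stem lists gl
```

German appears under two sources, so the order of the list decides which one is picked; Galician appears under only one.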
    # get current source
    $source = $stemmer->source;

    # change source
    $stemmer->source('Lingua::Stem::UniNE');
Boolean value specifying whether to apply Unicode casefolding to words before stemming them. This is enabled by default and is performed before normalization when also enabled.
Boolean value specifying whether to apply Unicode NFC normalization to words before stemming them. This is enabled by default and is performed after casefolding when also enabled.
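The order of the two steps (casefolding first, then NFC normalization) can be reproduced with core Perl. The sketch below is an illustration only and does not show how the module performs these steps internally; the helper prepare_word is hypothetical, and fc requires Perl 5.16 or later.

```perl
use strict;
use warnings;
use feature 'fc';                    # fc() is available from Perl 5.16
use Unicode::Normalize qw( NFC );

# Hypothetical helper (not part of Lingua::Stem::Any): apply Unicode
# casefolding first, then NFC normalization, matching the documented order.
sub prepare_word {
    my ($word) = @_;
    return NFC(fc($word));
}

# "E" followed by COMBINING ACUTE ACCENT, then "COLE": six characters.
my $decomposed = "E\x{301}COLE";

# Casefolding lowercases the letters; NFC then composes "e" and the
# combining accent into the single character U+00E9.
my $prepared = prepare_word($decomposed);
```

After both steps the word is five characters long and begins with the precomposed U+00E9, so differently cased or differently composed spellings of the same word reach the stemmer in one form.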
Accepts a list of strings, stems each string, and returns a list of stems. The list returned will always have the same number of elements in the same order as the list provided. When no stemming rules apply to a word, the original word is returned.
    @stems = $stemmer->stem(@words);

    # get the stem for a single word
    $stem = $stemmer->stem($word);
The words should be provided as character strings and the stems are returned as character strings. Byte strings in arbitrary character encodings are not supported.
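Input read as raw bytes therefore needs decoding before it is stemmed, and stems need encoding before they are written out. The following sketch uses the core Encode module and assumes the input is UTF-8; the variable names are illustrative, and the stemming call itself is left as a comment.

```perl
use strict;
use warnings;
use Encode qw( decode encode );

# UTF-8 encoded bytes for the German word "grüßen" (eight bytes),
# for example as read from a file opened without an encoding layer.
my $bytes = "gr\xC3\xBC\xC3\x9Fen";

# Decode to a character string (six characters) before stemming.
my $word = decode('UTF-8', $bytes);

# ... $word is now suitable for passing to $stemmer->stem ...

# Encode a character string back to UTF-8 bytes for output.
my $output = encode('UTF-8', $word);
```

Opening filehandles with an encoding layer such as :encoding(UTF-8) achieves the same result without explicit decode and encode calls.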
Accepts an array reference, stems each element, and replaces them with the resulting stems.
$stemmer->stem_in_place(\@words);
This method is provided for potential optimization when a large array of words is to be stemmed. The return value is not defined.
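The in-place pattern itself can be illustrated in plain Perl. The sketch below is not the module's implementation: apply_in_place is a hypothetical helper, and lc stands in for a real stemming function so the example runs without any stemmer installed.

```perl
use strict;
use warnings;

# Hypothetical helper: replace every element of the referenced array
# with the result of calling $code on it. Nothing useful is returned,
# mirroring a method whose return value is not defined.
sub apply_in_place {
    my ($code, $words) = @_;

    # $_ is aliased to each array element, so assigning to it
    # modifies the caller's array directly without copying the list.
    $_ = $code->($_) for @$words;

    return;
}

my @words = qw( Cats DOGS birds );

# lc is only a stand-in for a stemming function.
apply_in_place(sub { lc $_[0] }, \@words);
```

Because the array is passed by reference and modified through aliases, no second list of results is built, which is where the potential saving for large arrays comes from.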
Returns a list of supported two-letter language codes using lowercase letters.
    # all languages
    @languages = $stemmer->languages;

    # languages supported by Lingua::Stem::Snowball
    @languages = $stemmer->languages('Lingua::Stem::Snowball');
Returns a list of supported source module names.
    # all sources
    @sources = $stemmer->sources;

    # sources that support English
    @sources = $stemmer->sources('en');
optional stem caching
custom stemming exceptions
Lingua::Stem::Snowball, Lingua::Stem::UniNE, Lingua::Stem
This module is brought to you by Shutterstock (@ShutterTech). Additional open source projects from Shutterstock can be found at code.shutterstock.com.
Nick Patch <patch@cpan.org>
© 2013 Nick Patch
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.