Treex::Tool::Segment::RuleBased - Rule based pseudo language-independent sentence segmenter
version 2.20151102
Sentence boundaries are detected by regex rules that look for sentence-final punctuation ([.?!]) followed by an uppercase letter. This class is implemented in a pseudo language-independent way, but it can also serve as an ancestor for language-specific segmenters, either by overriding the method segment_text (using around; see Moose::Manual::MethodModifiers) or simply by overriding the methods unbreakers, openings and closings.
See Treex::Block::W2A::EN::Segment
Returns a list of sentences.
Performs the segmentation, taking use_paragraphs and use_lines into account.
Adds newlines after terminal punctuation followed by an uppercase letter.
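A minimal, dependency-free sketch of this rule, under stated assumptions: the helper name sketch_segment and the sample opening and closing character sets are hypothetical, and periods after single uppercase letters are shielded with a <<<DOT>>> placeholder before the break is inserted. It illustrates the idea only and is not the module's actual source.

```perl
use strict;
use warnings;

# Hypothetical stand-alone helper; not part of Treex.
sub sketch_segment {
    my ($text) = @_;

    my $unbreakers = qr/\p{Upper}/;    # single uppercase letters (initials)
    my $openings   = q{"'(};           # assumed sample set
    my $closings   = q{"')};           # assumed sample set

    # 1. Shield periods that follow an unbreaker with a placeholder.
    $text =~ s/\b($unbreakers)\./$1<<<DOT>>>/g;

    # 2. Insert a newline after terminal punctuation (plus any closing
    #    characters) when an uppercase letter follows, optionally
    #    preceded by one opening character.
    $text =~ s/([.?!][\Q$closings\E]*)\s+([\Q$openings\E]?\p{Upper})/$1\n$2/g;

    # 3. Restore the shielded periods.
    $text =~ s/<<<DOT>>>/./g;

    return split /\n/, $text;
}

my @sentences = sketch_segment(
    'I met J. R. Smith. He said hello! Did you see him? Yes.'
);
print "$_\n" for @sentences;
# I met J. R. Smith.
# He said hello!
# Did you see him?
# Yes.
```

The initials "J." and "R." survive because their periods are hidden from step 2, while the period after "Smith" is still treated as a sentence boundary.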
Adds unbreakers (<<<DOT>>>) and hard breaks (\n) using the whole context, not just a single word.
The unbreakers method returns a regex that should match tokens that usually do not end a sentence even if they are followed by a period and a capital letter:
* single uppercase letters, which usually serve as first-name initials
* in language-specific descendants, consider adding:
  * period-ending items that never indicate sentence breaks
  * titles before names of persons, etc.
The openings method returns a string of characters that can appear before the first word of a sentence.
The closings method returns a string of characters that can appear after a period (or other sentence-final symbol).
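To illustrate how a language-specific descendant might override one of these methods, here is a hedged, self-contained sketch. The packages My::BaseSegmenter and My::EnglishSegmenter are hypothetical stand-ins written in plain Perl so the example runs without Treex or Moose; a real descendant would extend Treex::Tool::Segment::RuleBased instead, and the abbreviation list is only an assumed sample.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the base segmenter; it mirrors only the
# three overridable methods and a simplified segment_text.
package My::BaseSegmenter;

sub new        { my ($class) = @_; return bless {}, $class; }
sub unbreakers { return qr/\p{Upper}/; }
sub openings   { return q{"'(}; }
sub closings   { return q{"')}; }

sub segment_text {
    my ( $self, $text ) = @_;
    my $unbr  = $self->unbreakers;
    my $open  = $self->openings;
    my $close = $self->closings;
    $text =~ s/\b($unbr)\./$1<<<DOT>>>/g;    # shield unbreakers
    $text =~ s/([.?!][\Q$close\E]*)\s+([\Q$open\E]?\p{Upper})/$1\n$2/g;
    $text =~ s/<<<DOT>>>/./g;                # restore periods
    return $text;
}

# Hypothetical language-specific descendant: keeps the inherited
# initials rule and adds a few English titles and abbreviations.
package My::EnglishSegmenter;
our @ISA = ('My::BaseSegmenter');

sub unbreakers {
    my ($self) = @_;
    my $inherited = $self->SUPER::unbreakers();
    return qr/(?:$inherited|Dr|Mr|Mrs|Prof|etc)/;
}

package main;

my $text = 'Dr. Smith met Prof. Jones. They talked.';
print My::BaseSegmenter->new->segment_text($text),    "\n---\n";
print My::EnglishSegmenter->new->segment_text($text), "\n";
```

With the base rules alone, the text is split after "Dr." and "Prof." as well as after "Jones."; the descendant splits only after "Jones." because the titles are now treated as unbreakers.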
Martin Popel <popel@ufal.mff.cuni.cz>
Ondřej Dušek <odusek@ufal.mff.cuni.cz>
Copyright © 2011-2012 by Institute of Formal and Applied Linguistics, Charles University in Prague
This module is free software; you can redistribute it and/or modify it under the same terms as Perl itself.