NAME

jspell - Command line interface to Jspell morphological analyzer

SYNOPSIS

jspell [-dfile | -pfile | -wchars | -Wn | -t | -n | -x | -b | -S | -B | -C | -P | -m | -Lcontext | -M | -N | -Ttype | -V | -o format | -g | -y | -u] file .....

jspell [-dfile | -pfile | -wchars | -Wn | -t | -n | -Ttype | -o format] -l

jspell [-dfile | -pfile | -ffile | -Wn | -t | -n | -B | -C | -P | -m | -Ttype] {-a | -A}

jspell [-dfile] [-wchars | -Wn] [-o format] -c

jspell [-dfile] [-wchars] [-o format] -e[1-4]

jspell [-dfile] [-wchars] -D

jspell -v [v]

DESCRIPTION

jspell is a morphological analyzer. It can be used in four different ways:

  • as a standard C library;

  • as a non-buffered command-line application;

  • as a command interpreter;

  • as an interactive program.

Interactive Application

jspell should be invoked with the name of a text file. The text is checked as follows: each word that is not in the dictionary is shown in reverse video at the top of the screen, together with its surrounding context. The user should then choose one of the suggested corrections (if any exist).

The suggestions can be formed in two ways:

  • detection of approximate words (words that are missing a letter or have some letters changed; these are usually called near misses);

  • using formation rules, starting from a known root (even when no flag marks the derivation as correct, it is shown as well).

One of the last rows of the screen shows a mini-menu with some options:

<number>

type the number of the chosen suggestion to replace the original text;

Space

accept the word this time only (nothing is changed);

R

replace the word with text typed by the user;

E

replace all occurrences of the word in the text;

A

accept the word throughout the rest of the text;

I,U

accept the word (I keeps the case of the original word, U converts it to lowercase) and update the personal dictionary. Note that the dictionary stores more than the word itself, so the user is prompted for a classification, flags and a short comment; alternatively, the user can choose one of the suggestions that jspell builds from the AFF file rules.

L

search for words in the system dictionary (this is controlled by the compile-time variable WORDS);

X

write the rest of the file as it is, ignoring all erroneous words, and start correcting the next file (if there is one);

Q

exit immediately and leave file without changes;

!

shell escape;

^L

redraw the screen;

^Z

suspend jspell;

?

show help screen.

Command line options

-M

activates the mini-menu at the bottom of the screen;

-N

deactivates the mini-menu at the bottom of the screen;

-L

sets the number of lines of context to be shown. The number must be attached to the flag, with no space in between;

-V

shows characters that use more than 7 bits in the cat -v style. This option can be useful on older terminals that cannot display some characters;

-t

the input file is written in TeX or LaTeX. This mode is activated automatically if the file extension is .tex;

-n

input file is in nroff/troff format;

-b

forces the creation of a backup file (using the extension .bak);

-x

disables the creation of the backup file;

-B

treats two words run together without a space between them as an error;

-C

treats two correct words run together as a correct word too. This option can be useful for languages like German, where some words are formed by concatenation;

-P

do not suggest root/affix combinations to be added to the personal dictionary;

-m

allow root/affix combinations that are not in the dictionary;

-S

sort the suggestion list by probability of being correct instead of alphabetically;

-d file

specify an alternative dictionary;

-p file

specify an alternative personal dictionary. If file does not start with a slash, the $HOME prefix is assumed. If you specify one of the default hash files of the library dictionary and there is a file .jspell_hashfile, that file is used as the personal dictionary. If neither of these conditions is true, the .jspell_words file is used.

Without this option, jspell searches for personal dictionaries in the current directory and in the home directory. If both exist, both are loaded.

-w chars

specify additional characters that can be used inside words. For example, -w "&" makes "AT&T" a valid word;
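As a rough illustration of what -w changes, the sketch below treats a character as part of a word if it is alphabetic or appears in an extra set, the way "&" is added for "AT&T". The function names are hypothetical and this is not jspell's actual tokenizer, only a minimal model of the idea.

```c
#include <ctype.h>
#include <string.h>

/* Hypothetical helper: 1 if c may appear inside a word, given the
   extra characters that would be passed with -w. */
int is_word_char(char c, const char *extra)
{
    if (c == '\0')
        return 0;   /* strchr would otherwise match the terminator */
    return isalpha((unsigned char)c) || strchr(extra, c) != NULL;
}

/* Hypothetical helper: 1 if the whole string is a single word. */
int is_single_word(const char *s, const char *extra)
{
    if (*s == '\0')
        return 0;
    for (; *s != '\0'; s++)
        if (!is_word_char(*s, extra))
            return 0;
    return 1;
}
```

With an empty extra set, "AT&T" is split at the ampersand; with "&" in the set it is accepted as one word.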

-W n

specify the maximum size of legal words. To check all words, regardless of their size, use -W 0;

-T type

assume a given formatting type for all files. The type argument can be one of the unique names defined in the affix file (for example, nroff) or a file suffix including the dot (for example, .tex);

-l

produces a list of misspelled words from the standard input;

-a

this mode is intended to be used through pipes. It is a command interpreter.

If the word is found in the dictionary, either directly or through one of its flags, jspell prints the word root together with the features of the root and of the prefix and suffixes. The layout of this information can be defined by the user.

If the word is not in the dictionary, the output line starts with an ampersand (&), a space, the original word, a space, the number of characters between the start of the line and the word, a colon, and a list of approximate words, each given as the word, an equals sign and its classification in the specified format. If the word can be formed by an illegal addition of affixes to a known root, a list of such suggestions is presented as well.

If there are no approximate words, only formations using invalid affixes, the line uses a similar format but starts with a question mark instead of an ampersand.

In summary:

If the word exists in the dictionary, the output will be:

  * <original> <offset>: <solution>, <solution>, ...

If it is NOT in the dictionary:

  & <original> <offset>: <err>, <err>, ..., <affix sugst>, ...

where err and affix sugst have the following format:

  <word> = format(<root>,<root fea>,<prefix fea>,
                       <suffix fea>,<suffix2 fea>)

This format is defined by the user; the default is:

  lex(<root>, [<root fea>], [<prefix fea>],
              [<suffix fea>], [<suffix2 fea>])

The separators ",", "=" and ":" are set with #define directives, so they can be changed at compile time.
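A program reading jspell -a output through a pipe can tell the three cases apart from the first character of each line. The sketch below assumes the line layout described above (a marker, a space, the original word, a space, the offset and a colon) and the default ":" separator; the enum and function names are hypothetical and not part of jspell.

```c
#include <stdio.h>

/* Hypothetical labels for the three kinds of result line. */
enum line_kind { LINE_FOUND, LINE_NEAR_MISS, LINE_AFFIX_ONLY, LINE_OTHER };

/* Classify a result line by its leading marker character. */
enum line_kind classify_line(const char *line)
{
    if (line[0] == '\0' || line[1] != ' ')
        return LINE_OTHER;
    switch (line[0]) {
    case '*': return LINE_FOUND;       /* word is in the dictionary */
    case '&': return LINE_NEAR_MISS;   /* approximate words listed */
    case '?': return LINE_AFFIX_ONLY;  /* only invalid-affix formations */
    default:  return LINE_OTHER;
    }
}

/* Extract the original word and its offset from "<marker> <word> <offset>:".
   Returns 1 on success, 0 if the line does not match that layout. */
int parse_header(const char *line, char *word, size_t wordsz, long *offset)
{
    char fmt[32];

    if (classify_line(line) == LINE_OTHER || wordsz < 2)
        return 0;
    /* Build a bounded "%<n>s %ld:" format so word cannot overflow. */
    snprintf(fmt, sizeof fmt, "%%%lus %%ld:", (unsigned long)(wordsz - 1));
    return sscanf(line + 2, fmt, word, offset) == 2;
}
```

The rest of the line, after the colon, holds the solutions or suggestions and can be split on the SEP1 separator.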

Using the -a flag, there are a set of commands starting with these characters: *, @, &, +, -, ~, #, !, %, $ or ^.

*

Add to personal dictionary. You can add the class, flags and comments using the dictionary separator.

@

Accept the word, but do not add it to the dictionary;

&

Add the lowercase converted word to the personal dictionary;

#

Save current personal dictionary;

~

Set the parameters based on the file;

+

Enter in TeX mode;

-

Exit from TeX mode;

!

Enter terse mode;

%

Exit terse mode;

$ flag

Changes the operating mode, in the same way as the init_modes function (see the library section);

^

Check the rest of the line.

Note that in terse mode the information about correct words is hidden. This can be used to make some programs faster.

-A

works like the -a option, except that if a line starts with a string like &Include_File&, the rest of the line is taken as the name of a file from which words are read;
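A front end that prepares input for -A may want to recognize such lines itself. The sketch below is only an illustration: the marker is copied from the description above, while the function name and the skipping of blanks before the file name are assumptions, not documented jspell behaviour.

```c
#include <string.h>

/* Marker text as given in the description of -A. */
#define INCLUDE_MARKER "&Include_File&"

/* Hypothetical helper: if line starts with the marker, return a pointer
   to the file name that follows (blanks skipped, an assumption);
   otherwise return NULL. */
const char *include_target(const char *line)
{
    size_t n = strlen(INCLUDE_MARKER);

    if (strncmp(line, INCLUDE_MARKER, n) != 0)
        return NULL;
    line += n;
    while (*line == ' ' || *line == '\t')
        line++;
    return line;
}
```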

-s

if used, jspell stops itself with the SIGTSTP signal after reading a line of input, and reads the next line only when it receives the SIGCONT signal.

This is only valid if the -a or -A option is also active, and only on BSD-derived systems;

-f

specifies the name of a file where jspell should write its results, instead of the standard output. Only valid in conjunction with the -a or -A option;

-v

makes jspell print its current version. If the option is doubled (-vv), the compilation options are printed too;

-c

reads words from the standard input and, for each one, writes a list of possible roots, each with its classification and the classification of the original word derived from it, as well as the flags used. Note that the generated roots may not exist in the dictionary.

For example, the (Portuguese) input 'batatas' produces:

  batatas lex(batata, [CAT=adj_nc], [N=p], []),
          lex(batatar, [CAT=v], [CAT=v,P=2,N=s,T=p], [])

-z

also prints the flag used:

  lex(batata, [CAT=adj_nc], [N=p], {})/p

-e

is the inverse of -c. Starting from a word and a flag, generates all the derived words allowed by the flag rules.

For example, batata/p generates:

  batata batatas= lex(batata, [], [N=p], [])

-D

writes the dictionary affix tables to the standard output;

-o format

defines the output format. It should be a string containing five %s, filled in order with the word root, the root features, and the prefix, suffix and second-suffix features. The default, as seen before, is lex(%s, [%s], [%s], [%s], [%s])
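Since the format is a printf-style string with five %s slots, its effect can be previewed with snprintf. This sketch only mimics the substitution: it is not jspell's own code, the function name is hypothetical, and the feature values in the usage note are illustrative.

```c
#include <stdio.h>

/* Hypothetical helper: fill a five-slot format with the root, the root
   features, and the prefix, suffix and second-suffix features.
   fmt must contain exactly five %s conversions and no others. */
int format_solution(char *out, size_t outsz, const char *fmt,
                    const char *root, const char *root_fea,
                    const char *prefix_fea, const char *suffix_fea,
                    const char *suffix2_fea)
{
    return snprintf(out, outsz, fmt, root, root_fea,
                    prefix_fea, suffix_fea, suffix2_fea);
}
```

Called with the default format and the values "batata", "CAT=adj_nc", "", "N=p" and "", this fills the buffer with lex(batata, [CAT=adj_nc], [], [N=p], []). A custom format such as "%s|%s|%s|%s|%s" would give a pipe-separated record that is easier to split in scripts.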

-g

show only solutions, not suggestions. This improves performance.

-y

return only the suggestions built with flags that are not defined for the word. No near misses are computed.

-u

ignore punctuation. The DEFAULT_SIGNS define lists all punctuation marks.

The output of the -a, -e and -c options uses separators that are defined under the following names:

  SEP1 ","
  SEP2 ";"
  SEP3 "="
  SEP4 "\n"

SEP1 separates solution hypotheses. SEP2 is used when near misses are shown, to mark the end of that type of solution. SEP3 separates the original word from its information. SEP4 ends the word record.

Using the C library

Programs using jspell as a library should include jslib.h and link with jspell.a or jspell.so.

Such programs must first initialize the library by calling init_jspell("..."); after that, the other API functions can be called.

init_jspell(char *options)

Initializes jspell with the flags given in the options string. The -a option is normally always used. Example:

  init_jspell("-d dic-pe -W 0 -a -cf")

word_info(char* word, sols_type solutions, sols_type near_misses)

This function gives information about the word looked up in the dictionary. If the word is found, the possible ways of forming it are returned in the solutions array, where each element is a string containing the word root, its classification and the classification that makes the word possible.

If the word is not found in the dictionary, the near_misses array contains the possible solutions in the format specified with the -o flag.

If solutions[i] or near_misses[i] is an empty string, there are no more solutions or suggestions, respectively.

void init_modes(char* modes)

Used to change the suggestion output mode. There are two types of suggestions: those made by small changes to the original word (called near misses) and those built by adding affixes not provided for that word.

The available flags are:

g

don't give suggestions from other words (disable near misses);

G

inverse of g: enable near misses;

P

don't give suggestions built by combining the word with affixes not provided for it;

m

turns off P option;

y

don't give suggestions by swapping characters in the original word;

Y

turns off y option;

z

show flags used for the suggestion;

Z

turn off z option;

char* get_next_word(char *buf, char *next_word)

Given the buffer buf, puts the next valid word found into next_word. Returns a pointer to the position in buf just after the end of that word, or NULL if no word is found.

get_roots(char *word, sols_type solutions, char in_dic[MAXPOSSIBLE])

Given the word, searches for its possible origins even if they are not in the dictionary. The various possibilities are returned in the solutions array; each position contains the root, its classification and the classification related to the flag used. This information is a string in the usual output format. Each entry in the in_dic array shows whether or not the corresponding root is in the dictionary.

If solutions[i] is an empty string, then there aren't more solutions.

insert_word(char *word, char *class, char *flags, char *comm)

Inserts the word with its classification (class), flags and comment (comm) in the personal dictionary.

accept_word(char *word, char *class, char *flags, char *comm)

Accepts the word with its classification (class), flags and comment (comm) until the program stops using the library.

char* replace_word(char *start, char* word, char* curchar)

Replaces the word found in the text at position start with word; curchar indicates where the replaced word ended.

Returns the position in the buffer where the new word ends.

save_pers_dic()

Saves the personal dictionary in the present state.

ID_TYPE word_it(char* word, char* feats, int* status)

Returns a unique identifier for a word.

char *word_f_id(ID_TYPE id)

Given an identifier, returns a pointer to the position of the respective word.

char *class_f_id(ID_TYPE id)

Given an identifier, returns a pointer to the position of the respective classification.

char *flags_f_id(ID_TYPE id)

Given a word identifier, returns a string identifying its flags.

Example

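  #include <stdio.h>
  #include "jslib.h"

  int main(void) {
    int i;
    char X[BUFSIZ], w[100], *p;
    sols_type solutions, near_misses;

    /* -a selects the command interpreter mode normally used by the library */
    init_jspell("-d dict -W 0 -a");

    /* read one line at a time; fgets replaces the unsafe gets */
    while (fgets(X, sizeof X, stdin)) {
       p = X;

       /* get_next_word returns NULL when the line has no more words */
       while ((p = get_next_word(p, w)) != NULL) {
           word_info(w, solutions, near_misses);
           puts("solutions");
           i = 0;
           while (solutions[i][0])      /* an empty string ends the list */
              puts(solutions[i++]);
           puts("near misses");
           i = 0;
           while (near_misses[i][0])
              puts(near_misses[i++]);
       }
    }
    return 0;
  }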

If you save this file with the name exp-lib.c, you can compile it with:

  gcc -o exp-lib exp-lib.c -ljspell

THANKS

We thank Pace Willisson and Geoff Kuenning for releasing ispell as an open source application, from which much of this application's code was borrowed.

AUTHOR

 Ulisses Pinto
 J.Joao Almeida  <jj@di.uminho.pt>

SEE ALSO

 See the following man pages: jspell(3), jspell-aff(1), perl(1), agrep(1)

BUGS

 Please report them to any of the authors by e-mail.