
NAME

Algorithm::AM - Classify data with Analogical Modeling

VERSION

version 3.10

SYNOPSIS

 use Algorithm::AM;
 # Load a training set from a file in the AM "nocommas" format
 my $dataset = dataset_from_file(path => 'finnverb', format => 'nocommas');
 my $am = Algorithm::AM->new(training_set => $dataset);
 # Classify the first item in the training set
 my $result = $am->classify($dataset->get_item(0));
 print @{ $result->winners };              # highest-scoring class label(s)
 print ${ $result->statistical_summary };  # formatted summary of the scores

DESCRIPTION

This module provides an object-oriented interface for classifying single items using the analogical modeling algorithm. To work with sets of items needing to be classified, see Algorithm::AM::Batch. To run classification from the command line without writing your own Perl code, see analogize.

This module logs information using Log::Any, so if you want automatic print-outs you need to set a Log::Any adapter. See the "classify" method for more information on logged data.
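A minimal sketch of one way to do this, using the stock Stderr adapter that ships with Log::Any::Adapter:

  # Route this module's Log::Any messages to STDERR
  use Log::Any::Adapter ('Stderr');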

BACKGROUND AND TERMINOLOGY

Analogical Modeling (or AM) was developed as an exemplar-based approach to modeling language usage, and has also been found useful in modeling other "sticky" phenomena. AM is especially suited to this because it predicts probabilistic occurrences instead of assigning static labels for instances.

AM was not designed to be a classifier, but as a cognitive theory explaining variation in human behavior. As such, though in practice it is often used like any other machine learning classifier, there are fine theoretical points in which it differs. As a theory of human behavior, much of the value in its predictions lies in matching observed human behavior, including non-determinism and degradations in accuracy caused by paucity of data.

The AM algorithm could be called a probabilistic, instance-based classifier. However, the probabilities given for each classification are not degrees of certainty, but actual probabilities of occurring in real usage. AM models "sticky" phenomena as being intrinsically sticky, not as deterministic phenomena that just require more data to be predicted perfectly.

Though it is possible to choose an outcome probabilistically, in practice users are generally interested in either the full predicted probability distribution or the outcome with the highest probability. The entire outcome probability distribution can be retrieved via "scores_normalized" in Algorithm::AM::Result. The highest probability outcome can be retrieved via "winners" in Algorithm::AM::Result. If you're only interested in classification accuracy based on the highest probability outcome (treating AM like any other classification algorithm), use "result" in Algorithm::AM::Result. See Algorithm::AM::Result for other types of information available after classification. See Algorithm::AM::algorithm for details on the actual mechanism of classification.
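For illustration, a rough sketch of pulling these values out of a result follows; it assumes that "scores_normalized" returns a hash reference keyed by class label and that "winners" returns an array reference (as in the synopsis above).

  my $result = $am->classify($dataset->get_item(0));
  # Assumed shape: class label => predicted probability
  my %distribution = %{ $result->scores_normalized };
  # Class label(s) with the highest probability
  my @winners = @{ $result->winners };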

AM practitioners often use specialized terminology, but most of this terminology has more common machine learning equivalents. This software tries to use the specialized terminology for end-user-facing tasks like reports or command-line APIs.

AM uses the term "exemplar" where ML uses "training instance". Historically the AM software used the word "item" to refer to either training or test instances, and that term is retained here. AM has "outcomes" and ML has "class labels" (we use the latter). Finally, AM practitioners refer to "variables", and we use the ML term "feature" here.

EXPORTS

When this module is imported, it also imports the following:

Algorithm::AM::Result

Algorithm::AM::DataSet

Also imports "dataset_from_file" in Algorithm::AM::DataSet.

Algorithm::AM::DataSet::Item

Also imports "new_item" in Algorithm::AM::DataSet::Item.

Algorithm::AM::BigInt

Also imports "bigcmp" in Algorithm::AM::BigInt.

METHODS

new

Creates a new instance of an analogical modeling classifier. This method takes named parameters that set the state described in the documentation for the corresponding accessor methods below. The only required parameter is "training_set", which should be an instance of Algorithm::AM::DataSet and which defines the set of items used for training during classification. All of the accepted parameters are listed below:

"training_set"
"exclude_nulls"
"exclude_given"
"linear"

training_set

Returns (but will not set) the dataset used for training. This is an instance of Algorithm::AM::DataSet.

exclude_nulls

Get/set a boolean value indicating whether features with null values in the test item should be ignored. If false, they will be treated as having a specific value representing null. Defaults to true.

exclude_given

Get/set a boolean value indicating whether the test item should be removed from the training set if it is found there during classification. Defaults to true.

linear

Get/set a boolean value indicating whether the analogical set should be computed using occurrences (linearly) or pointers (quadratically). See Algorithm::AM::algorithm for an explanation of the difference. A false value indicates quadratic counting. Defaults to false.
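Because these are get/set accessors, the classifier's configuration can also be changed after construction, as in this small sketch:

  # Keep null features in the comparison instead of ignoring them
  $am->exclude_nulls(0);
  # Report which counting method is currently in effect
  print $am->linear ? "linear counting\n" : "quadratic counting\n";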

classify

  $am->classify(new_item(features => ['a','b','c']));

Using the analogical modeling algorithm, this method classifies the input test item and returns a Result object.

Log::Any is used for logging. The full classification configuration is logged at the info level. A warning is logged if no training items can be compared with the test item, in which case no classification is possible.

HISTORY

Initially, Analogical Modeling was implemented as a Pascal program. Subsequently, it was ported to Perl, with substantial improvements made in 2000. In 2001, the core of the algorithm was rewritten in C, while the parsing, printing, and statistical routines remained in Perl; this was accomplished by embedding a Perl interpreter into the C code.

In 2004, the algorithm was again rewritten, this time in order to handle more features and large data sets. The algorithm breaks the supracontextual lattice into the direct product of four smaller ones, which the algorithm manipulates individually before recombining. These lattices can be manipulated in parallel when using the right hardware, and so the module was named AM::Parallel. This implementation was written with the core lattice-filling algorithm in XS, and hooks were provided to help the user create custom reports and control classification dynamically.

The present version has been renamed to Algorithm::AM, which seemed a better fit for CPAN. While the XS has largely remained intact, the Perl code has been completely reorganized and updated to be both more "modern" and modular. Most of the functionality of AM::Parallel remains.

SEE ALSO

The Analogical Modeling home page (http://humanities.byu.edu/am/) includes information about current research and publications, as well as sample data sets.

The Wikipedia article has details and even illustrations on analogical modeling.

SUPPORT

Bugs / Feature Requests

Please report any bugs or feature requests through the issue tracker at https://github.com/garfieldnate/Algorithm-AM/issues. You will be notified automatically of any progress on your issue.

Source Code

This is open source software. The code repository is available for public review and contribution under the terms of the license.

https://github.com/garfieldnate/Algorithm-AM

  git clone https://github.com/garfieldnate/Algorithm-AM.git

AUTHOR

Theron Stanford <shixilun@yahoo.com>, Nathan Glenn <garfieldnate@gmail.com>

CONTRIBUTORS

  • garfieldnate <garfieldnate@gmail.com>

  • Nathan Glenn <garfieldnate@gmail.com>

  • Nick Logan <nlogan@gmail.com>

COPYRIGHT AND LICENSE

This software is copyright (c) 2013 by Royal Skousen.

This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.