The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

NAME

Data::FeatureFactory - evaluate features normally or numerically

SYNOPSIS

 # in the module that defines features
 package MyFeatures;
 use base qw(Data::FeatureFactory);
 
 our @features = (
    { name => 'no_of_letters', type => 'int', range => '0 .. 5' },
    { name => 'first_letter',  type => 'cat', 'values' => ['a' .. 'z'] },
 );
 
 sub no_of_letters {
    my ($word) = @_;
    return length $word
 }
 
 sub first_letter {
    my ($word) = @_;
    return substr $word, 0, 1
 }

 # in the main script
 package main;
 use MyFeatures;
 my $f = MyFeatures->new;
 
 # evaluate all the features on all your data and format them numerically
 open FILEHANDLE, '>my_features.txt';
 print FILEHANDLE join(' ', $f->names), "\n";   # prepend a header
 for my $record (@data) {
     my @values = $f->evaluate('ALL', 'numeric', $record);
     print FILEHANDLE join(' ', @values);
 }
 close FILEHANDLE;
 
 # specify the features to evaluate and gather the result in binary form
 my @vector = $f->evaluate([qw(no_of_letters first_letter)], 'binary', 'foo');

 # translate the once evaluated features to other formats
 open SOURCE, 'my_features.txt';
 open SINK,  '>my_features.csv';
 $f->translate(SOURCE, SINK, {
    from_format => 'numeric', to_format => 'normal',
    FS => ' ', OFS => ',',  # fields from space-separated to comma-separated
    header => 1,    # the names of the features are in the first row of SOURCE
    # names => 'ALL',    # header specified, so we don't need this    
    from_NA => 0, to_NA => 'N/A'    # interpret zeroes as N/A's and substitute
 });

DESCRIPTION

Data::FeatureFactory automates evaluation of features of data samples and optionally encodes them as numbers or as binary vectors.

Defining features

The features are defined as subroutines in a package inheriting from Data::FeatureFactory. A subroutine is declared to be a feature by being mentioned in the package array @features. Options for the features are also specified in this array. Its minimum structure is as follows:

 @features = (
    { name => "name of feature 1" },
    { name => "name of feature 2" },
    ...
 )

The elements of the array must be hashrefs and each of them must have a name field. Other fields can specify options for the features. These are:

type

Specifies if the feature is categorial, numeric, integer or boolean. Only the first three characters, case insensitive, are considered, so you can as well say cat, Num, integral or Boo!. The default type is categorial.

Integer and numeric features will have values forced to numbers. Boolean ones will have values converted to 1/0 depending on Perl's notion of True/False. If you use warnings, you'll get one if your numeric feature returns a non-numeric string.

values

Lists the acceptable values for the feature to return. If a different value is returned by the subroutine, the whole feature vector is discarded. Alternatively, a default value can be specified. Whenever the order of the values matters, it is honored (as in transfer to numeric format). The values can be specified as an arrayref (in which case the order is regarded) or as a hashref, in which case the values are pseudo-randomly ordered, but the loading time is faster and transfer to numeric or binary format is faster as well. If the values are specified as a hashref, then keys of the hash shall contain the values of the feature and values of the hash should be 1's.

default

Specifies a default value to be substituted when the feature returns something not listed in values.

values_file

The values can either be listed directly or in a file. This option specifies its name. This option must not appear in combination with the values option. Each value shall be on one line, with no headers, no intervening whitespace no comments and no empty lines.

The file is expected to be encoded in UTF-8 on perls supporting the :encoding discipline for the open function.

range

In case of integer and numeric features, an allowed range can be specified instead of the values. This option cannot appear together with the values or values_file option. The behavior is the same as with the values option. The interval specified is closed, so returning the limit value is OK. The range shall be specified by two numeric expressions separated by two or more dots with optional surrounding whitespace - for example 2..5 or -0.5 ...... +1.000_005. The stuff around the dots are not checked to be valid numeric expressions. But you should get a warning if you use them when you supply something nonsensical.

You can also specify a range for numeric (non-integer) features. The return value will be checked against it but unlike integer features, this will not generate a list of acceptible values. Therefore, range is not enough to specify for a numeric feature if you want to have it converted to binary. (though converting floating-point values to binary vectors seems rather quirky by itself)

postproc

This option defines a subroutine that is to be used as a filter for the feature's return value. It comes in handy when you, for example, have a feature returning UTF-8 encoded text and you need it to appear ASCII-encoded but you need to specify the acceptable values in UTF-8. As this use-case suggests, the postprocessing takes place after the value is checked against the list of acceptable values. The value for this option shall either be a coderef or the name of the preprocessing function. If the function is not available in the current namespace, Data::FeatureFactory will attempt to find it.

The postprocessing only takes place when the feature is evaluated normally - that is, when its output is not being transformed to numeric or binary format.

code

Normally, the features are defined as subroutines in the package that inherits from Data::FeatureFactory. However, the definition can also be provided as a coderef in this option or in the %features hash of the package. The priority is: 1) the code option, 2) the %features hash, and 3) the package subroutine.

format

Features can be output in different ways - see below. The format in which the features are evaluated is normally specified for all features in the call to evaluate. You can override it for specific features with this option.

You'll mostly use this to prevent the target (to-predict) feature from being numified or binarified: { name => 'target', format => 'normal' }.

label

The value of this field can either be a string or an arrayref specifying a list of labels for the feature. It's usable when you want to evaluate a group of features without having to list them.

See the evaluate method for details.

Notice

Both the feature and the optional postprocessing routine are evaluated in scalar context.

When the N/A option is used, then undef is treated specially but if N/A is not specified, then it is not. Assume you have a feature with values specified, a default value and the feature returns undef. Then if you use the N/A option, the N/A value is substituted, but if you don't use the N/A option, the default value is substituted instead (or it is left as an empty string if it's a valid value).

The postprocessing subroutine, if specified, can be called several times during the construction of the object and within any methods. So it's highly advisable for postprocessing subroutines to have no side-effects.

Creating the features object

The new method creates an object that can then be used to evaluate features. Please do *not* override the new method. If you do, then be sure that it calls Data::FeatureFactory::new properly. This method accepts an optional argument - a hashref with options. Currently, only the 'N/A' option is supported. See below for details.

Getting the list of defined features

The names method returns a list of names of all the features defined.

Evaluating features

The evaluate method of Data::FeatureFactory takes these arguments: 1) names of the features to evaluate, 2) the format in which they should be output and 3) arguments for the features themselves.

The first argument can be an arrayref with the names of the features, or it can be a string. In case it's a string containing lowercase letters, then it's interpreted as the name of the only feature to evaluate. If all the letters in the string are UPPERCASE, then it's interpreted as a whitespace-separated list of labels. Each label can be prefixed by a - sign or a + sign. No prefix is the same as the + prefix. The features evaluated are then those, who have at least one +label but no -label. The presence of a negative label overrides the presence of a positive one in case of collisions. There is a special label ALL, which you should never define for a feature and which will match any feature. It must not be used with the minus sign. Only specifying negative labels implies the ALL label, so you can write -TARGET to get all but the target features. The features are sorted by how they appear in the @features array - the order of labels has no effect whatsoever and no feature is added twice. ALL ALL ALL is the same as just ALL. Since labels can only be specified in upper case for the evaluate method, they are matched case-insensitive.

The second argument is normal, numeric or binary. normal means that the features' return values should be left alone (but postprocessed if such option is set). numeric and binary mean that the features' return values should be converted into numbers or binary vectors, as for support vector machines or neural networks to like them.

The return value is the list of what the features returned. In case of binary, there can be a different (typically greater) number of elements in the returned list than there were features to evaluate.

During evaluation, the features can access the $Data::FeatureFactory::CURRENT_FEATURE variable, which holds the name of the feature evaluated.

Transfer to numeric / binary form

When you have the features output in numeric format, then integer and numeric features are left alone and categorial ones have a natural number (starting with 1) assigned to every distinct value. If you use this feature, it is highly recommended to specify the values for the feature. If you don't then Data::FeatureFactory will attempt to create a mapping from the categories to numbers dynamically as then feature is evaluated. The mapping is being saved to a file whose name is .FeatureFactory.package_name__feature_name and is located in the directory where Data::FeatureFactory resides if possible, or in your home directory or in /tmp - wherever the script can write. If none works, then you get a fatal error. The mapping is restored and extended upon subsequent runs with the same package and feature name, if read/write permissions don't change.

Binary format is such that the return value is converted to a vector of all 0's and one 1. The positions in the vector represent the possible values of the feature and 1 is on the position that the feature actually has in that particular case. The values always need to be specified for this feature to work and it is highly recommended that they be specified with a fixed order (not by a hash), because else the order can change with different versions of perl and when you change the set of values for the feature. And when the order changes, then the meaning of the vectors change.

N/A values

You can specify a value to be substituted when a feature returns nothing (an undefined value). This is passed as an argument to the new method.

 $f = MyFeatures->new({ 'N/A' => '_' }); # MyFeatures inherits from Data::FeatureFactory
 $v = $f->evaluate('feature1', 'normal', 'unexpected_argument');

If feature1 returns an undefined value, then $v will contain the string '_'. When evaluating in binary format, a vector of the usual length is returned, with all values being the specified N/A. That is, if feature1 has 3 possible values, then

 @v = $f->evaluate('feature1', 'binary', 'unexpected_argument');

will result in @v being ('_', '_', '_'). If feature1 returns undef, that is.

N/A values don't get postprocessed in case a postprocessing function is specified.

Conversion between formats

Once you evaluate the features on a million observations and save it in a file, you might want to get the values in another format without having to evaluate the features all over again (which can be time consuming). This is where the method translate comes in handy.

translate accepts three arguments: Source filehandle, destination filehandle and a hash with options (some of which aren't actually optional). The options are:

from_format
to_format

normal, numeric or binary.

names

The names of the features that are in the source file, in order. This can be anything that the evaluate method accepts: An arrayref with the actual names of the features present, or a label expression (see "Evaluating features").

If the names of the features are in the first line of the source file, don't specify the names option but set the header option to a true value instead.

The names of the features in the header shall be separated with the same string that separates the values on the following lines. There can be any number of separators between the feature names and Data::FeatureFactory will treat name1,name2 exactly the same as name1,,,name2 (assuming you use comma as separator). This only applies in the header and has a reason:

When the header is in the source file, then it's translated to the output as well. And since in the binary format, the features span usually more than one column, Data::FeatureFactory::translate will put so many separators after each feature name as there are columns to its value. This is so you need not use the module to figure out how many digits each feature has. For example, if you have feature feat1 with three possible values, then its name in the header will be followed by three separators: feat1,,,feat2.

When reading the header, this is discarded because 1) You may want to write the header yourself or use one from a non-binary version and 2) Data::FeatureFactory has all information it needs in the @features array.

FS

The field separator that delimits the values in the files.

OFS

The output field separator: If you want the values separated with a different character / string in the destination file than in the source file, then this is the option to use.

from_NA
to_NA

The value specified for from_NA is interpreted as denoting the N/A values in the source file. These values will be converted to to_NA in the destination file. If the Data::FeatureFactory object has the N/A option specified, then that value is assumed for any of these two options implicitly.

Required options are: from_format, to_format, FS and either names or header.

Note that when translating a categorial feature without values specified to/from numeric format, then the dynamic mapping of values created by Data::FeatureFactory must get resumed successfully. Otherwise you'll get an error about unexpected value as soon as the translation is attempted.

Translating rows (not files)

There's also a lower-level method available: translate_row. Unlike the translate method, it doesn't accept filehandles but accepts an arrayref with values and returns an array with the translated values. The arguments are:

names

This time a required argument, same as the names option to translate.

values

The arrayref of values to convert.

options

Same as with the translate method, except the names and header options are not accepted.

This method has the slight difference over translate (beside only translating one row per call) that if the to_NA option is specified but neither from_NA option to the method nor the N/A option to the object are specified, then undef's are interpreted as N/A values.

Other low-level methods

There are some other subroutines defined in Data::FeatureFactory. One of those that might be of use to you is the expand_names method, which you can give a label expression as an argument and it will give you the list of feature names that this label expression represents. For a description of how labels work, see "Evaluating features".

COPYRIGHT

Copyright (c) 2008 Oldrich Kruza. All rights reserved.

This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself.