NAME

Microarray::DataMatrix::AnySizeDataMatrix - abstraction to DataMatrix

Abstract

anySizeDataMatrix.pm provides an abstraction layer over a dataMatrix: it supplies methods for manipulating or querying the contents of a dataMatrix. anySizeDataMatrix is an abstract class, so anySizeDataMatrix objects themselves cannot be instantiated; only objects of concrete subclasses can be.

Overall Logic

This is for programmers only. Do not rely on any of these details when programming clients of the concrete subclasses of anySizeDataMatrix, or when programming the subclasses themselves, as the underlying implementation is subject to change at any time without notice. Just stick to using the API (described below).

The implementation of an anySizeDataMatrix depends on its size. Upon construction of a concrete subclass, anySizeDataMatrix determines whether the matrix is too big to fit into memory. It then dynamically inherits from either smallDataMatrix or bigDataMatrix. If anySizeDataMatrix inherits from smallDataMatrix, all data are read into memory, and the actual data matrix from the file (used to instantiate the concrete subclass) is stored as a 2 dimensional array. The indexes of the array that are still valid (following some filtering step) are stored in internal hashes, so that subsequent manipulations of the data only consider the data that have not been filtered out. As rows or columns are filtered out by some of the methods, the entries for these rows or columns are deleted from the hashes that track valid data. Thus when data are redumped to a file, only those data that have not been filtered out are printed.
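
The bookkeeping described above can be pictured with a small standalone sketch. It is not the module's actual code: the variable names and the dumpValid helper are hypothetical, and the real class keeps this state inside the object.

```perl
use strict;
use warnings;

# Hypothetical in-memory layout: the full matrix plus hashes of the
# row and column indexes that are still valid.
my @matrix = (
    [ 1.0, 2.0, 3.0 ],
    [ 4.0, 5.0, 6.0 ],
    [ 7.0, 8.0, 9.0 ],
);
my %validRows    = map { $_ => undef } 0 .. $#matrix;
my %validColumns = map { $_ => undef } 0 .. $#{ $matrix[0] };

# Filtering never touches @matrix; it only deletes index entries.
delete $validRows{1};       # row 1 filtered out
delete $validColumns{0};    # column 0 filtered out

# Dumping walks only the indexes that survive.
sub dumpValid {
    my @lines;
    for my $row ( sort { $a <=> $b } keys %validRows ) {
        push @lines, join "\t",
            map { $matrix[$row][$_] } sort { $a <=> $b } keys %validColumns;
    }
    return @lines;
}

print "$_\n" for dumpValid();
```

Because the underlying array is never modified, a filter costs only a hash deletion, at the price of holding the whole matrix in memory.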

When anySizeDataMatrix instead inherits from bigDataMatrix, because a matrix is too large to be read into memory, all operations are carried out on disk versions of the matrix. This is less efficient, as it often requires the file to be read multiple times and intermediate results to be stored in tmp files.

Whether the matrix fits into memory or not, all reading and writing of files is carried out by methods that have to be implemented in the concrete subclass, though the matrix class itself (small or big) actually deals with calling them appropriately in its own methods.

All of the data transformation / filtering methods are actually autoloaded in anySizeDataMatrix, then redispatched to protected versions of the method, and then dispatched again to protected methods within small- and bigDataMatrix. This is done so that they can all be wrapped by a generic eval clause, which only needs to appear once, in the AUTOLOAD method.
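
As a rough illustration of that pattern, the sketch below uses a hypothetical ToyMatrix package with a single level of redispatch (the real classes add a further hop into small- or bigDataMatrix). The method list and the error handling are assumptions chosen to match the documented return values and errstr behaviour.

```perl
use strict;
use warnings;

package ToyMatrix;    # hypothetical stand-in, not the real class

our $AUTOLOAD;

sub new { return bless { errstr => '' }, shift }

# Public names that may be autoloaded (hypothetical list).
my %autoloadable = map { $_ => 1 } qw(center logTransformData);

sub AUTOLOAD {
    my ( $self, @args ) = @_;
    ( my $name = $AUTOLOAD ) =~ s/.*:://;
    return if $name eq 'DESTROY';
    die "No such method: $name\n" unless $autoloadable{$name};

    my $protected = "_$name";

    # The single eval that guards every transformation method.
    my $ok = eval { $self->$protected(@args); 1 };
    if ( !$ok ) {
        $self->{errstr} = $@ || 'unknown error';
        return 0;
    }
    return 1;
}

sub _center           { return }                            # succeeds
sub _logTransformData { die "value <= 0 encountered\n" }    # fails

sub errstr {
    my $self = shift;
    my $err  = $self->{errstr};
    $self->{errstr} = '';    # reading clears the message
    return $err;
}

package main;

my $m = ToyMatrix->new;
print "center ok\n" if $m->center( rows => 'mean' );
print "log failed: ", $m->errstr unless $m->logTransformData( base => 2 );
```

Keeping the eval in one place means each protected method can simply die on error, and the public call still returns 0 with a retrievable message.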

Construction of objects is up to the client subclasses to implement, as anySizeDataMatrix has no new() method. Upon object construction, a concrete subclass must call the _init method of anySizeDataMatrix, giving it the name of the tmp directory that can be used for storage of any temp files. _init is defined thus:

_init

This method initializes the dataMatrix, such that it will inherit from the correct superclass, based on its size. Any concrete subclass of anySizeDataMatrix MUST call this method in its constructor, after blessing the object reference, but before returning it to the client. During initialization, the _init method will call the _numDataRows and _numDataColumns methods that the subclass must implement (see below). Thus the concrete subclass must take care of any of its own initialization that these methods require, prior to calling anySizeDataMatrix's _init() method.

A single argument, a directory that can be used to write tmp files if necessary, must be provided to _init.

Usage :

      $self->_init($tmpDir)

or

      $self->SUPER::_init($tmpDir);
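
The ordering requirement can be sketched as follows. So that the example runs on its own, StubAnySizeDataMatrix stands in for the real superclass and only mimics the documented callbacks; MyMatrix, its constructor arguments, and the hash keys are all hypothetical.

```perl
use strict;
use warnings;

# Stand-in for the abstract superclass so the sketch runs on its own;
# it only mimics the documented calls back into the subclass.
package StubAnySizeDataMatrix;

sub _init {
    my ( $self, $tmpDir ) = @_;
    $self->{tmpDir} = $tmpDir;

    # _init calls back into the subclass, so these must already work.
    $self->{size} = $self->_numDataRows * $self->_numDataColumns;
    return;
}

package MyMatrix;    # hypothetical concrete subclass
our @ISA = ('StubAnySizeDataMatrix');

sub new {
    my ( $class, %args ) = @_;
    my $self = bless {}, $class;      # 1. bless first
    $self->{rows} = $args{rows};      # 2. the subclass's own set-up
    $self->{cols} = $args{cols};
    $self->_init( $args{tmpDir} );    # 3. then _init
    return $self;                     # 4. only now return the object
}

sub _numDataRows    { return $_[0]{rows} }
sub _numDataColumns { return $_[0]{cols} }

package main;

my $matrix = MyMatrix->new( rows => 3, cols => 4, tmpDir => '/tmp' );
print "cells: $matrix->{size}\n";
```

If step 2 were moved after step 3, the callbacks would see an uninitialized object, which is the failure the ordering rule is meant to prevent.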

Methods that must be implemented by concrete subclasses.

_numDataRows

This method should return the number of rows that can be expected to be read from the file with which the object was instantiated.

Usage:

        my $numRows = $self->_numDataRows;

_numDataColumns

This method should return the number of data columns that can be expected from the file with which the object was instantiated.

Usage:

        my $numColumns = $self->_numDataColumns;

_dataLine

This method should return the requested row of data from the file whose name is available from the _fileForReading() method. It should return a reference to an array of the data. Null values in the line of data should be dealt with as blanks in the array. A requested row number of zero means the first line of data in the file. If no line of data corresponds to the requested row number, then undef should be returned.

Usage:

        my $dataArrayRef = $self->_dataLine($lineNo);
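
A minimal sketch of a reader that honours this contract is shown below. It assumes a tab-delimited file in which every line is a data row, and it is written as a plain function taking the file name; a real subclass would obtain the name from _fileForReading and would probably keep a handle open rather than rescanning the file.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical reader: tab-delimited file, every line is a data row.
sub dataLine {
    my ( $file, $lineNo ) = @_;
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $current = 0;    # row zero is the first line of data
    while ( my $line = <$fh> ) {
        next if $current++ != $lineNo;
        chomp $line;

        # A limit of -1 keeps trailing empty fields as blanks.
        return [ split /\t/, $line, -1 ];
    }
    return undef;       # no such row
}

# Write a two-row example file with some null values.
my ( $out, $file ) = tempfile( UNLINK => 1 );
print {$out} "1.5\t\t0.2\n", "0.7\t1.1\t\n";
close $out;

my $row = dataLine( $file, 1 );
print scalar(@$row), " fields\n";
```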

_printLeadingMeta

This method should print the leading meta data that corresponds to the file from which the object was instantiated.

Usage:

        $self->_printLeadingMeta($fh, $validColumnsArrayRef,
                                 $hasExtraInfo, $extraInfoName);

where:

$fh is a file handle to which the information should be printed.

$validColumnsArrayRef is a reference to an array of column numbers of the columns that are still valid in the dataMatrix. The column numbers are with respect to the columns in the original file with which the object was instantiated.

$hasExtraInfo is a boolean to indicate whether extra information should be interleaved with the data.

$extraInfoName is the name of the extra information.

Note these last two are to support the interleaving of percentile data into a datafile. This will probably be removed, such that a separate percentiles file, or any other type of extra information file is created instead. This is currently in here as legacy support.

_printRow

This method prints a row of data to the passed in file handle. It will only print information for the valid columns. If $hasExtraInfo is true (i.e. a non-zero value), it will print the extra info interleaved between each column of data. The extra info comes from the 2D hash whose reference is passed in. The relevant piece of extra information is accessed using the row as the key of the first hash dimension, and the column as the key of the second hash dimension.

Usage :

      $self->_printRow($fh, $row, $dataRef, $validColumnsArrayRef, 
                       $hasExtraInfo, $extraInfoHashRef);

_printTrailingMeta

This method prints out any trailing meta data to the passed in file handle.

Usage:

        $self->_printTrailingMeta($fh, $validColumnsArrayRef,
                                  $hasExtraInfo, $extraInfoName);

Protected methods that are used by both the superclass and subclasses

_setFile

This protected method simply takes a fully qualified filename as an argument, checks if it exists, and if so stores it in the object. If not, it will die with a usage message.

_fileForReading

This method should return the fully qualified name of the file that is currently holding the data. This may change from the original file that was used to instantiate the object, as the abstract superclass of anySizeDataMatrix may use temp files to hold the contents of the matrix. If the file that the subclass is supposed to be reading data from is changed, then it is up to the subclass to deal with that appropriately, i.e. it should read from the correct file when it needs to.

Usage:

        my $file = $self->_fileForReading;

_setFileForReading

This method should allow the super class to set the file which the concrete subclass should be using to get data.

Usage:

        $self->_setFileForReading($newFile);

_fileHandle

This method returns a file handle on the matrix file that was used to instantiate the object, or on a file that it has been told is for reading. This second reason is so that the superclass can communicate with the subclass, to indicate that a new tmp file exists that should be used as the data source. If it receives a true value for the 'reset' argument, it will close any open file handle, then reopen a handle on the file. In doing so, it shall reset the current file data row to a value of -1.

Usage:

    my $fh = $self->_fileHandle(reset=>$reset);

Public methods implemented in the size independent anySizeDataMatrix

file

This method returns the name of the file that was used to construct the object.

Usage:

        my $file = $matrix->file;

returns: a scalar

Public methods, implemented in the size dependent super classes of anySizeDataMatrix.

dumpData

This method dumps the current contents of the dataMatrix object to a file: either the file whose name is provided as a single argument, or, if no argument is given, the file whose name was used to construct the object. If the data have been filtered based on columnPercentiles, and these were elected to be shown, then these will be dumped out too (see below).

Usage:

    $dataMatrix->dumpData($file);

or:

    $dataMatrix->dumpData;

Transformation and Filtering Methods

Developer note : The initial 'front-end' common parts to these methods are implemented in anySizeDataMatrix, but the full, 'back-end' nuts and bolts of each method is implemented in the relevant size dependent super-classes, small- and bigDataMatrix, as the way in which each deals with filtering is fundamentally different, depending on whether they are memory constrained or not. This does mean that any new filtering/transformation methods that are implemented MUST be added to both small and bigDataMatrix, potentially with some shared common code being implemented in anySizeDataMatrix.

General note on methods that transform the data : If autodumping is on, then by default, they will overwrite the file that was used to create the object of the concrete subclass, unless a new filename is passed in. If a new filename is passed in (as an argument named 'file'), and autodumping is on, then further operations on the dataMatrix of filtered data will operate on the already filtered data. Note, the program MUST have permissions to overwrite the original file, if no new filename is provided.

All of the transformation and filtering methods return 1 upon success. If an error is encountered, the method returns 0, and the error message associated with the problem can be retrieved using the errstr() method, e.g.:

      $dataMatrix->methodX(%args) || die "An error occurred: ".$dataMatrix->errstr."\n";

All of the transformation and filtering methods allow a verbose argument to be passed in, with valid values for the verbose argument being either 'text' or 'html'. For text, \n will be used as an end of line character after every line of reporting is printed. For html, \n<br> will be used, eg:

      $dataMatrix->center(rows=>'mean',
                          verbose=>'html') || die $dataMatrix->errstr;

center

This method allows either rows or columns of the dataMatrix to be centered using either means or medians (centering is when the average - mean or median - is set to zero, by subtracting the average from every value for that row/column). If centering both rows and columns, centering will be done iteratively, until no datapoint changes by more than 0.01. Alternatively, the maxNumIterations can be specified, or the maxAllowableChange can be specified. If used in combination, the first one that is met will terminate centering. The defaults are:

    maxAllowableChange  0.01 
    maxNumIterations      10

Usage: eg:

        $dataMatrix->center(rows=>'mean',
                            columns=>'median') || die $dataMatrix->errstr;

returns : 1 upon success, or 0 otherwise
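
The iterative scheme can be sketched in standalone form. This is only an approximation of the described behaviour under stated assumptions: the matrix is complete (no missing values), rows are centred by mean and columns by median, and the change is measured as the largest average subtracted in a pass. The centerBoth, mean and median helpers are hypothetical names.

```perl
use strict;
use warnings;
use List::Util qw(sum max);

sub mean { return sum(@_) / @_ }

sub median {
    my @s   = sort { $a <=> $b } @_;
    my $mid = int( @s / 2 );
    return @s % 2 ? $s[$mid] : ( $s[ $mid - 1 ] + $s[$mid] ) / 2;
}

# Centre rows by mean and columns by median, alternating until no
# subtraction exceeds $maxChange or $maxIter passes have been made.
# Returns the number of passes performed.
sub centerBoth {
    my ( $m, $maxChange, $maxIter ) = @_;
    my $lastCol = $#{ $m->[0] };
    for my $iter ( 1 .. $maxIter ) {
        my $biggest = 0;
        for my $row (@$m) {
            my $avg = mean(@$row);
            $_ -= $avg for @$row;
            $biggest = max( $biggest, abs($avg) );
        }
        for my $c ( 0 .. $lastCol ) {
            my $avg = median( map { $_->[$c] } @$m );
            $_->[$c] -= $avg for @$m;
            $biggest = max( $biggest, abs($avg) );
        }
        return $iter if $biggest <= $maxChange;
    }
    return $maxIter;
}

my @data = ( [ 1, 2, 6 ], [ 2, 4, 9 ], [ 0, 1, 2 ] );
my $passes = centerBoth( \@data, 0.01, 10 );
printf "stopped after %d pass(es)\n", $passes;
```

Centring columns can disturb the row averages and vice versa, which is why the two steps are repeated rather than applied once each.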

filterByPercentPresentData

This method allows for filtering out of rows or columns which do not have greater than the specified percentage of data available. Note, if filtering by both rows and columns, filtering will be done sequentially, firstly by rows. To override this, make two separate calls to the method, in the opposite order. There is no fancy algorithm to maximize the amount of retained data. For example, when filtering by rows and then by columns, the removal of a column may mean that some rows thrown out in the first step would have had greater than 80% good data for the remaining columns; this method does not consider this.

        $dataMatrix->filterByPercentPresentData(rows=>80,
                                                columns=>80);

        $dataMatrix->filterByPercentPresentData(rows=>90,
                                                file=>$filename);

returns: 1 upon success, or 0 otherwise
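
The sequential behaviour (rows first, then columns over whatever rows remain) can be sketched as below. The sketch assumes undef marks a missing value and uses a hypothetical percentPresent helper; the 70% cutoff is just an example value.

```perl
use strict;
use warnings;

# undef marks a missing value in this hypothetical matrix.
my @matrix = (
    [ 1,  2,     3,     4 ],
    [ 5,  undef, undef, undef ],
    [ 9,  undef, 11,    12 ],
    [ 13, 14,    15,    16 ],
);

sub percentPresent {
    my @values  = @_;
    my $present = grep { defined $_ } @values;
    return 100 * $present / @values;
}

my $minPercent = 70;

# Step 1: rows, judged over all columns.
my @rows =
    grep { percentPresent( @{ $matrix[$_] } ) > $minPercent } 0 .. $#matrix;

# Step 2: columns, judged only over the rows that survived step 1.
my @cols = grep {
    my $c = $_;
    percentPresent( map { $matrix[$_][$c] } @rows ) > $minPercent
} 0 .. $#{ $matrix[0] };

print "rows kept: @rows\ncolumns kept: @cols\n";
```

Rows dropped in step 1 are not reconsidered after step 2, which matches the caveat that the retained data are not maximized.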

filterRowsOnColumnPercentile

This method will filter out rows whose values do not have a percentile rank for their particular column above a specified percentile rank, in at least numColumns columns. In addition, this method will accept a 'showPercentile' argument, which if set to a non-zero value, will result in the percentiles of the datapoints being dumped out with the data, when the data are subsequently dumped to a file. Columns of percentiles are interleaved with the data columns, so the resulting file cannot be clustered.

Note: This method has not yet been implemented for matrices that do not fit into memory, so calling it on such a matrix will produce an error (which of course, you are always checking for).

Usage:

        $dataMatrix->filterRowsOnColumnPercentile(percentile=>95,
                                                  numColumns=>1,
                                                  showPercentiles=>1);

returns : 1 upon success, or 0 otherwise

filterRowsOnColumnDeviation

This method will filter out rows whose values do not deviate from the column mean by a specified number of standard deviations, in at least numColumns columns.

Usage:

        $dataMatrix->filterRowsOnColumnDeviation(deviations=>2,
                                                 numColumns=>1);

returns : 1 upon success, or 0 otherwise
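
One way to read this criterion is sketched below. The details are assumptions rather than the module's confirmed behaviour: the matrix is complete, the sample (n - 1) standard deviation is used, and a value exactly at the cutoff counts as deviating. The rowsPassingDeviation function is a hypothetical name.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Returns the indexes of the rows that deviate from the column mean by at
# least $deviations standard deviations in at least $numColumns columns.
sub rowsPassingDeviation {
    my ( $m, $deviations, $numColumns ) = @_;
    my %hits;
    for my $c ( 0 .. $#{ $m->[0] } ) {
        my @col  = map { $_->[$c] } @$m;
        my $mean = sum(@col) / @col;
        my $sd =
            sqrt( sum( map { ( $_ - $mean )**2 } @col ) / ( @col - 1 ) );
        for my $r ( 0 .. $#col ) {
            $hits{$r}++ if abs( $col[$r] - $mean ) >= $deviations * $sd;
        }
    }
    return grep { ( $hits{$_} // 0 ) >= $numColumns } 0 .. $#$m;
}

my @data = ( [ 0, 2 ], [ 0, 4 ], [ 0, 4 ], [ 4, 6 ] );
my @kept = rowsPassingDeviation( \@data, 1, 1 );
print "rows kept: @kept\n";
```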

filterRowsOnValues

This method filters out rows whose values do not pass a specified criterion, in at least numColumns columns. To specify the criterion, a value, and an operator must be specified. The valid operators are:

  "absolute value >"  also aliased by "absgt"   and "|>|" 
  "absolute value >=" also aliased by "absgteq" and "|>=|"
  "absolute value ="  also aliased by "abseq"   and "|=|"
  "absolute value <"  also aliased by "abslt"   and "|<|"
  "absolute value <=" also aliased by "abslteq" and "|<=|"
  ">"                 also aliased by "gt"
  ">="                also aliased by "gteq"
  "="                 also aliased by "eq"      and "=="
  "<="                also aliased by "lteq"
  "<"                 also aliased by "lt"
  "not equal"         also aliased by "ne"      and "!="

Usage:

        $dataMatrix->filterRowsOnValues(operator=>"absolute value >",
                                        value=>2,
                                        numColumns=>1);

returns : 1 upon success, or 0 otherwise
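
The operator table lends itself to a dispatch table keyed by the canonical name, with aliases resolved first. The following sketch is an assumption about how such a lookup could be organized, not the module's own code; the rowPasses helper is hypothetical, and all comparisons are treated as numeric.

```perl
use strict;
use warnings;

# Canonical operator names mapped to tests:
# $_[0] is the datum, $_[1] is the cutoff value.
my %test = (
    'absolute value >'  => sub { abs( $_[0] ) > $_[1] },
    'absolute value >=' => sub { abs( $_[0] ) >= $_[1] },
    'absolute value ='  => sub { abs( $_[0] ) == $_[1] },
    'absolute value <'  => sub { abs( $_[0] ) < $_[1] },
    'absolute value <=' => sub { abs( $_[0] ) <= $_[1] },
    '>'                 => sub { $_[0] > $_[1] },
    '>='                => sub { $_[0] >= $_[1] },
    '='                 => sub { $_[0] == $_[1] },
    '<='                => sub { $_[0] <= $_[1] },
    '<'                 => sub { $_[0] < $_[1] },
    'not equal'         => sub { $_[0] != $_[1] },
);

my %alias = (
    'absgt'   => 'absolute value >',  '|>|'  => 'absolute value >',
    'absgteq' => 'absolute value >=', '|>=|' => 'absolute value >=',
    'abseq'   => 'absolute value =',  '|=|'  => 'absolute value =',
    'abslt'   => 'absolute value <',  '|<|'  => 'absolute value <',
    'abslteq' => 'absolute value <=', '|<=|' => 'absolute value <=',
    'gt'      => '>',                 'gteq' => '>=',
    'eq'      => '=',                 '=='   => '=',
    'lteq'    => '<=',                'lt'   => '<',
    'ne'      => 'not equal',         '!='   => 'not equal',
);

# True if at least $numColumns values in the row satisfy the criterion.
sub rowPasses {
    my ( $row, $operator, $value, $numColumns ) = @_;
    my $check = $test{ $alias{$operator} // $operator }
        or die "Unknown operator '$operator'\n";
    my $passing = grep { $check->( $_, $value ) } @$row;
    return $passing >= $numColumns;
}

my @row = ( 0.5, -3, 1 );
print "kept by |>| 2\n" if rowPasses( \@row, '|>|', 2, 1 );
print "dropped by gt 2\n" unless rowPasses( \@row, 'gt', 2, 1 );
```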

filterRowsOnVectorLength

This method filters out rows based on whether the vector that their values define has a length greater than the specified length.

Usage:

        $dataMatrix->filterRowsOnVectorLength(length=>2);

returns : 1 upon success, or 0 otherwise
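
As a sketch, assuming that "length" means the Euclidean norm of the row's values and that rows longer than the cutoff are the ones retained (both are assumptions, as is the vectorLength helper name):

```perl
use strict;
use warnings;
use List::Util qw(sum);

# Euclidean length of the vector defined by a row's values.
sub vectorLength {
    return sqrt( sum( map { $_**2 } @_ ) );
}

my @matrix = ( [ 3, 4 ], [ 1, 1 ] );
my $cutoff = 2;

# Assumed direction: rows longer than the cutoff are retained.
my @kept = grep { vectorLength( @{ $matrix[$_] } ) > $cutoff } 0 .. $#matrix;
print "rows kept: @kept\n";
```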

logTransformData

This method log transforms the contents of the data matrix, using the specified base. If any values less than or equal to zero are encountered, then the transformation will fail. The matrix may be left in an indeterminate state if the operation fails, so the object should not be used further if the transformation is unsuccessful.

Usage :

    $dataMatrix->logTransformData(base=>2);

returns : 1 upon success, or 0 otherwise
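
The arithmetic is a change of base, log_b(x) = ln(x) / ln(b). The sketch below uses a hypothetical logTransform function that works in place and stops at the first value that is less than or equal to zero, so earlier values stay transformed; that partial state is one way the indeterminate result described above can arise.

```perl
use strict;
use warnings;

# Log transform every value in place; returns 1 on success, 0 on failure.
sub logTransform {
    my ( $m, $base ) = @_;
    my $logBase = log($base);
    for my $row (@$m) {
        for my $value (@$row) {
            return 0 if $value <= 0;    # values already done stay transformed
            $value = log($value) / $logBase;
        }
    }
    return 1;
}

my @good = ( [ 1, 2, 8 ] );
my @bad  = ( [ 4, 0 ] );

print "good: @{ $good[0] }\n" if logTransform( \@good, 2 );
print "bad matrix could not be transformed\n" unless logTransform( \@bad, 2 );
```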

scaleColumnData

This method scales the data for particular columns as specified by the client, by dividing the values by specified factors. It could, for instance, be used to renormalize the data. Note it is only appropriate to normalize ratio data, not log transformed data.

The client passes in a hash, by reference, of the column numbers (starting from zero) as the keys, and the scaling factors as the values.

If a column number which is invalid is specified, then a warning to STDERR will be printed. Also, if a scaling factor of zero (or undef) is supplied for a column, a warning will also be printed to STDERR, and the column data for that column will not be scaled.

Usage:

    $dataMatrix->scaleColumnData(columns=>{0=>1.2,
                                           2=>0.8});

returns : 1 upon success, or 0 otherwise
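
The division and the two warning cases can be sketched as follows. The scaleColumns function and the wording of its warnings are hypothetical; only the behaviour (divide by the factor, warn and skip on an invalid column or an unusable factor) follows the description above.

```perl
use strict;
use warnings;

# Divide each listed column by its factor, warning about bad input.
sub scaleColumns {
    my ( $m, $factors ) = @_;
    my $lastCol = $#{ $m->[0] };
    for my $c ( sort keys %$factors ) {
        my $factor = $factors->{$c};
        if ( $c !~ /^\d+$/ || $c > $lastCol ) {
            warn "Column $c is not valid; skipping\n";
            next;
        }
        if ( !$factor ) {
            warn "No usable scaling factor for column $c; skipping\n";
            next;
        }
        $_->[$c] /= $factor for @$m;
    }
    return 1;
}

my @data = ( [ 12, 5, 8 ], [ 24, 5, 16 ] );

# Column 1 has a zero factor and column 7 does not exist: both only warn.
scaleColumns( \@data, { 0 => 1.2, 1 => 0, 2 => 0.8, 7 => 2 } );
print join( "\t", @$_ ), "\n" for @data;
```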

Accessor Methods

Developer note : The following methods are actually implemented in the dataMatrix class, which is a superclass of both small- and bigDataMatrix.

numRows

This method returns the number of rows that are currently valid in the data matrix.

Usage:

        my $numRows = $dataMatrix->numRows;

returns: a scalar

numColumns

This method returns the number of columns that are currently valid in the data matrix.

Usage:

        my $numColumn = $dataMatrix->numColumns;

returns: a scalar

errstr

This method returns an error string that is associated with the last failed call to a data transformation/filtering method. Calling this method will clear the contents of the error string.

Setter Methods

setNumColumnsToReport

This method accepts a positive integer, that indicates the number of columns that have been processed during a filtering/transformation method that is carried out on a column basis, after which progress should be indicated. If a client has not set this value, then it defaults to 50.

Usage :

    $matrix->setNumColumnsToReport(50);

setNumRowsToReport

This method accepts a positive integer, that indicates the number of rows that have been processed during a filtering/transformation method that is carried out on a row basis, after which progress should be indicated. If a client has not set this value, then it defaults to 5000.

Usage :

    $matrix->setNumRowsToReport(5000);

allowedOperators

This public method returns an array of all the allowed operators that may be used by methods (in subclasses) that employ the operators for whatever reason (their interface should indicate that they employ such operators).

Usage :

    my @operators = $matrix->allowedOperators;

AUTHOR

Gavin Sherlock

sherlock@genome.stanford.edu