NAME

Microarray::DataMatrix::BigDataMatrix - abstraction to matrix that won't fit in memory

Abstract

bigDataMatrix is an abstract class that provides an abstraction to a matrix of data that is too large to fit into memory. It should not be subclassed directly by concrete subclasses. Instead, its subclass anySizeDataMatrix can be subclassed by concrete subclasses, which provide abstractions to dataMatrices stored in particular file formats, such as pcl files.

Overall Logic

Internally, bigDataMatrix simply keeps track of which rows and columns are still valid. As it runs filters or transformations on the data, it creates temp files that contain the result of each operation. It keeps track of which rows and columns in the latest temp file are valid, and of the rows and columns in the original file (the one used for object construction) to which they map. Then, when it comes time to dump data to disk, it can tell a concrete subclass what the original index of a row or column was, so that the concrete subclass can print out the appropriate meta data.
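
The bookkeeping described above can be sketched as follows. This is only an illustration that assumes a plain hash and arrays; the names %validOrigRow, @currentToOrig and @origRowNames are hypothetical and are not the module's actual internal data structures.

```perl
use strict;
use warnings;

# Hypothetical meta data for the rows of the original file.
my @origRowNames = qw(geneA geneB geneC geneD);

# Every original row starts out valid.
my %validOrigRow = map { $_ => 1 } 0 .. $#origRowNames;

# Suppose a filter removed original row 1 and wrote a new temp file.
# The temp file now has three rows; each maps back to an original index.
delete $validOrigRow{1};
my @currentToOrig = (0, 2, 3);

# At dump time, meta data is looked up through the original index.
sub names_for_current_rows {
    return map { $origRowNames[ $currentToOrig[$_] ] } 0 .. $#currentToOrig;
}

print join(",", names_for_current_rows()), "\n";
```

The point of the indirection is that the temp file only needs to hold data values, while row and column annotations stay with the original file.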

Construction

As bigDataMatrix is an abstract class, it has no constructor. However, the subclass, anySizeDataMatrix, once it has determined that a matrix will not fit into memory, MUST call the _init method, which is described below.

_init

This protected method determines how many rows and columns there are in the initial matrix. It MUST be called during initialization of a subclass object, before any other methods can be called by the client (in practice it is called from anySizeDataMatrix).

Usage:

    $self->_init;

or:

    $self->SUPER::_init;

private utility methods

__validateRows

This private method receives the number of rows that the matrix has, and simply records each one of them as valid. In addition it sets up a map of the current row positions in a temp file, to their original positions in the file used for construction.

Usage:

    $self->__validateRows($numRows);

__validateColumns

This method receives the number of columns that the matrix has, and simply records each one of them as valid. In addition it sets up a map of the column positions in the original file to their current positions, and a reverse map of the columns' current positions to their positions in the original matrix file used for object construction.

Usage :

    $self->__validateColumns($numColumns);

__invalidateMatrixRow

This private mutator method makes a row invalid. The invalidation itself is done in the superclass, but this method must be used to call the superclass method, so that the row index from the original file is marked as invalid, rather than the row's current index. It does this by translating the current row number to its original row index.

Usage :

    $self->__invalidateMatrixRow($num);

__origRow

This private method returns the original row number to which a row in the current file now corresponds.

Usage :

    my $origRow = $self->__origRow($row);

__invalidateMatrixColumn

This private mutator method makes a column invalid. The invalidation itself is done in the superclass, but this method must be used to call the superclass method, so that the column's index from the original file is marked as invalid, rather than the column's current index. It does this by translating the current column number to its original column index.

Usage :

    $self->__invalidateMatrixColumn($column);

__origColumn

This method returns the original column number to which a column in the current file now corresponds.

Usage :

    my $column = $self->__origColumn($column);

__remapRow

This private method remaps a line in a tmp file to the line it corresponds to in the original file. It receives the current tmp file line, and the number of lines that have been removed before it. It then remaps the current row to point to the original row that the row $numFilteredLines ahead of it is pointing to.

Usage :

    $self->__remapRow($numRowsPrinted, $numFilteredLines);
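
One way to read the remapping step is sketched below. It assumes the map is a simple array whose index is the row in the old temp file and whose value is the original row; compact_row_map and its arguments are hypothetical names, and the module's own storage may differ.

```perl
use strict;
use warnings;

# Compact a current-to-original row map while a new temp file is written.
# $map is an array ref: index = row in the old temp file, value = original row.
# $isFiltered is a hash ref keyed by old temp-file rows that were removed.
sub compact_row_map {
    my ($map, $isFiltered) = @_;
    my @newMap = @$map;    # work on a copy of the caller's map
    my ($numRowsPrinted, $numFilteredLines) = (0, 0);
    for my $oldRow (0 .. $#$map) {
        if ($isFiltered->{$oldRow}) {
            $numFilteredLines++;
            next;
        }
        # The row about to be printed sat $numFilteredLines further along
        # in the old file, so take over the original index stored there.
        $newMap[$numRowsPrinted] = $newMap[$numRowsPrinted + $numFilteredLines];
        $numRowsPrinted++;
    }
    $#newMap = $numRowsPrinted - 1;    # drop the now-unused tail
    return \@newMap;
}

my $newMap = compact_row_map([0, 1, 2, 3, 4], { 1 => 1, 3 => 1 });
print "@$newMap\n";
```

Because rows are processed in ascending order, each slot is read before it can be overwritten, so the same assignment would also be safe if done in place.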

__remapColumn

This private method remaps a column in a tmp file to the column it corresponds to in the original file. It receives the current file column index, and the number of columns that have been removed before it. It then remaps the current column to point to the original column that the column $numFilteredColumns ahead of it is pointing to.

Usage :

    $self->__remapColumn($numColumnsPrinted, $numFilteredColumns);

__remapColumns

This method remaps all of the columns in the current tmp file with respect to their column index in the original file. This method must be called after carrying out any operation that may reduce the number of columns in the matrix.

Usage :

    $self->__remapColumns;

__currentRowIsValid

This private accessor returns a boolean to indicate whether a given row in the tmp file is still valid (i.e. has not been filtered out). It does this by mapping the row to its original index in the file used to instantiate the matrix, and then determining whether that index is still valid.

Usage :

    if ($self->__currentRowIsValid($row)) { ... }

__currentColumnIsValid

This private accessor returns a boolean to indicate whether a given column in the tmp file is still valid (i.e. has not been filtered out). It does this by mapping the column to its original index in the file used to instantiate the matrix, and then determining whether that index is still valid.

Usage :

    if ($self->__currentColumnIsValid($column)) { ... }

__currentValidColumnsArrayRef

This private method returns a reference to an array that contains the indices of the columns that are currently valid in the present file.

Usage :

    my $validColumnsArrayRef = $self->__currentValidColumnsArrayRef;

__currentValidRowsArrayRef

This private method returns a reference to an array that contains the indices of the rows that are currently valid in the present file.

Usage :

    foreach (@{$self->__currentValidRowsArrayRef}) { ... }

__currentColumnForOrig

This method returns the index in the current temp file to which a column in the original file now maps.

Usage :

    my $col = $self->__currentColumnForOrig($column);

__tmpFileHandle

This private method returns a handle to a tmpfile. If given a 'new' argument, which will be passed to the tmpFile method, the file will have its associated number increased by one - ie a new file will be used. Otherwise, a previously created file will be opened.

Usage :

    my $fh = $self->__tmpFileHandle(new=>1);

__tmpFile

This private method returns the name of a tmpfile. If given the 'new' argument then the name will be of a new file. Otherwise it will be of the last generated tmpfile.

Usage :

    my $tmpFile = $self->__tmpFile(new=>1);
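
The relationship between the 'new' argument and the temp file number can be sketched with a closure around a counter. The "$base.$num.tmp" naming scheme and the make_tmp_namer helper are assumptions made for this illustration, not the names the module actually generates.

```perl
use strict;
use warnings;

# Build a temp-file namer around a counter (hypothetical helper).
sub make_tmp_namer {
    my ($base) = @_;
    my $tmpNum = 0;
    return sub {
        my (%args) = @_;
        $tmpNum++ if $args{new};    # 'new' moves on to the next file
        return "$base.$tmpNum.tmp";
    };
}

my $tmpFile = make_tmp_namer("matrix");
print $tmpFile->(new => 1), "\n";    # a fresh name
print $tmpFile->(), "\n";            # the same name again
print $tmpFile->(new => 1), "\n";    # the next name
```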

__tmpNum

This method returns the number of the temp file. Generally this number should be incremented by 1 each time a new tmp file is created.

Usage :

    my $num = $self->__tmpNum;

__setTmpNum

This method allows the tmp num to be set.

Usage :

    $self->__setTmpNum($num);

__centerColumns_mean

This private method is used for centering the columns of a dataset by the mean value.

Usage :

    my $largestVal = $self->__centerColumns_mean($lineEnding, $numColumnsToReport);

__centerColumns_median

This method is used for centering the columns of a dataset by the median value.

Because a sorted list of values is needed to calculate the median, we may have to read through all the data multiple times, to avoid holding too much data in memory.

Usage :

    my $largestVal = $self->__centerColumns_median($lineEnding, $numColumnsToReport);

__subtractColumnAverages

This method retrieves data from the dataMatrix through the subclass, subtracts the column average from each value, and then, through the subclass methods, writes out a new file.

Usage :

    $self->__subtractColumnAverages(\@medians, $lineEnding, $numColumnsToReport); 
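
A rough sketch of centering columns without holding the matrix in memory: one pass accumulates per-column sums, and a second pass subtracts the averages while writing a new file. It assumes tab-delimited data with empty cells for missing values, and uses in-memory filehandles to stay self-contained; column_means and subtract_column_averages are hypothetical stand-ins, not the module's methods.

```perl
use strict;
use warnings;

# Pass 1: accumulate per-column sums and counts, one line at a time.
sub column_means {
    my ($fh) = @_;
    my (@sum, @n);
    while (my $line = <$fh>) {
        chomp $line;
        my @values = split /\t/, $line, -1;
        for my $col (0 .. $#values) {
            next if $values[$col] eq '';    # treat empty cells as missing
            $sum[$col] += $values[$col];
            $n[$col]++;
        }
    }
    return [ map { $n[$_] ? $sum[$_] / $n[$_] : undef } 0 .. $#sum ];
}

# Pass 2: re-read the data and write centered values to a new handle.
sub subtract_column_averages {
    my ($in, $out, $averages) = @_;
    while (my $line = <$in>) {
        chomp $line;
        my @values = split /\t/, $line, -1;
        for my $col (0 .. $#values) {
            next if $values[$col] eq '' or !defined $averages->[$col];
            $values[$col] -= $averages->[$col];
        }
        print {$out} join("\t", @values), "\n";
    }
}

my $data = "1\t10\n3\t\n5\t30\n";
open my $in, '<', \$data or die $!;
my $means = column_means($in);
seek $in, 0, 0;    # rewind for the second pass
my $centered = '';
open my $out, '>', \$centered or die $!;
subtract_column_averages($in, $out, $means);
close $out;
print $centered;
```

Only one line and two small per-column arrays are in memory at any time, which is the property that matters for a matrix too large to load.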

__filterRowsByCount

This private method filters out rows that do not have a count for some particular property above or equal to a threshold. It accepts a hash reference, that hashes the row number to a count, and a threshold value. Note that not all rows are necessarily entered into the hash, so this method iterates over all rows, and checks each valid one for its count in the hash, then invalidates those with too low a count.

Usage :

    $self->__filterRowsByCount(\%count, $numColumns);
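
The selection logic can be sketched as below, assuming Perl 5.10 or later for the defined-or operator. rows_failing_count is a hypothetical helper that only returns the rows to invalidate; treating a row that is absent from the hash as having a count of zero is this sketch's reading of the description above.

```perl
use strict;
use warnings;

# Return the rows whose count falls below the threshold.
# $validRows is an array ref of row indices; $count maps row => count.
sub rows_failing_count {
    my ($validRows, $count, $threshold) = @_;
    # A row missing from the hash is treated as having a count of zero.
    return grep { ($count->{$_} // 0) < $threshold } @$validRows;
}

my %count = (0 => 3, 2 => 1);
my @toInvalidate = rows_failing_count([0, 1, 2], \%count, 2);
print "@toInvalidate\n";
```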

__validColumnsStdDevAndMeanHashRefs

This method calculates the standard deviations for each valid column, and returns references to two hashes. Both have the column index as the key, and one has the standard deviation as the values, the other has the column means as the values.

    mean    = (sum of x) / n
    std dev = square root(((n * sum of (x^2)) - (sum of x)^2) / (n * (n - 1)))

Usage :

    my ($stddevHashRef, $meansHashRef) = $self->__validColumnsStdDevAndMeanHashRefs($lineEnding);
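
The formula above only needs three running totals (n, the sum of x, and the sum of x^2), so a column can be processed one value at a time. A minimal sketch for a single column follows; mean_and_stddev is a hypothetical helper, and the guard against a slightly negative variance is an addition of this sketch.

```perl
use strict;
use warnings;

# Mean and sample standard deviation from running sums, so that values can
# be consumed one at a time without sorting or storing them.
sub mean_and_stddev {
    my ($n, $sum, $sumSq) = (0, 0, 0);
    for my $x (@_) {
        $n++;
        $sum   += $x;
        $sumSq += $x * $x;
    }
    return (undef, undef) unless $n;
    my $mean = $sum / $n;
    return ($mean, undef) if $n < 2;    # std dev needs at least two values
    my $variance = ($n * $sumSq - $sum**2) / ($n * ($n - 1));
    $variance = 0 if $variance < 0;     # guard against rounding error
    return ($mean, sqrt $variance);
}

my ($mean, $stddev) = mean_and_stddev(1, 2, 3);
print "$mean $stddev\n";
```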

Protected data transformation/filtering methods

Note: These methods provide the backend nuts and bolts for a transformation or filtering. They should only be called by the immediate subclass, anySizeDataMatrix, and not directly by the concrete subclasses of anySizeDataMatrix. In addition, note that the companion smallDataMatrix must (and does) provide identical interfaces to these methods (obviously with different underlying implementations), such that anySizeDataMatrix can call the methods without regard to the size of the underlying matrix.

_centerColumns

This protected method centers each column of data, and returns the largest absolute value that was used in the centering. The caller of the method must specify whether to center by means or medians.

Usage :

    $self->_centerColumns('mean', $lineEnding, $numColumnsToReport);

_centerRows

This protected method actually centers the row data, by calculating the average (mean or median, depending on what was requested) for each row, and then subtracting that value from each valid datapoint in the row.

Usage :

    $self->_centerRows('median', $lineEnding, $numRowsToReport);

_filterRowsByPercentPresentData

This protected method invalidates rows that do not have greater than the requested percentage of present data.

Usage :

    $self->_filterRowsByPercentPresentData($percent, $lineEnding, $numRowsToReport);
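
As an illustration of the test applied to each row, the sketch below works on in-memory rows for brevity. It assumes that undefined or empty cells count as missing, and follows the wording above in requiring strictly greater than the requested percentage; both helper names are hypothetical.

```perl
use strict;
use warnings;

# Percentage of cells in a row that hold a value.
sub percent_present {
    my @values = @_;
    return 0 unless @values;
    my $present = grep { defined $_ && $_ ne '' } @values;
    return 100 * $present / @values;
}

# Indices of rows that do not have more than $percent present data.
sub rows_to_invalidate {
    my ($rows, $percent) = @_;
    return grep { percent_present(@{ $rows->[$_] }) <= $percent } 0 .. $#$rows;
}

my @rows = ([1, 2, 3, 4], [1, undef, undef, undef], [1, 2, undef, undef]);
my @invalid = rows_to_invalidate(\@rows, 50);
print "@invalid\n";
```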

_filterColumnsByPercentPresentData

This protected method invalidates columns that do not have greater than the requested percentage of present data.

Usage :

    $self->_filterColumnsByPercentPresentData($percent, $lineEnding, $numColumnsToReport);

_filterRowsOnColumnPercentile

NB: THIS METHOD HAS NOT YET BEEN IMPLEMENTED FOR BIG DATAMATRICES

This protected method filters out rows based on their column percentile, when all data are known to be in memory, and optionally allows for the percentiles of each datapoint to be displayed in the output file.

Usage:

    $self->_filterRowsOnColumnPercentile($lineEnding, $numColumnsToReport, $percentile, $numColumns, $showPercentile);    

_filterRowsOnColumnDeviation

This protected method will filter out rows whose values do not deviate from the column mean by a specified number of standard deviations, in at least numColumns columns.

Usage:

    $self->_filterRowsOnColumnDeviation($lineEnding, $numRowsToReport, $deviations, $numColumns);

_filterRowsOnValues

This protected method filters out rows whose values do not pass a specified criterion, in at least numColumns columns.

Usage :

    $self->_filterRowsOnValues($value, $method, $lineEnding, $numRowsToReport, $numColumns);

_filterRowsOnVectorLength

This protected method filters out rows based on whether the vector that their values define has a length of greater than the specified length.

Usage :

    $self->_filterRowsOnVectorLength($requiredLength, $lineEnding, $numRowsToReport);
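
Here the length of a row's vector is taken to be the Euclidean norm, the square root of the sum of its squared values. The sketch below assumes that missing (undefined) values are skipped and that rows are kept when their length exceeds $requiredLength; vector_length and rows_passing_length are hypothetical names.

```perl
use strict;
use warnings;

# Euclidean length of the vector defined by a row's present values.
sub vector_length {
    my @present = grep { defined $_ } @_;
    my $sumSq = 0;
    $sumSq += $_**2 for @present;
    return sqrt $sumSq;
}

# Indices of rows whose vector is longer than the required length.
sub rows_passing_length {
    my ($rows, $requiredLength) = @_;
    return grep { vector_length(@{ $rows->[$_] }) > $requiredLength } 0 .. $#$rows;
}

my @rows = ([3, 4], [1, undef, 1], [0, 0]);
my @kept = rows_passing_length(\@rows, 2);
print "@kept\n";
```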

_logTransformData

This method log transforms the contents of the data matrix, using the specified base for the log transformation.

Usage:

    $self->_logTransformData($logBase, $lineEnding, $numRowsToReport);
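
Perl's built-in log function returns the natural logarithm, so a transformation to an arbitrary base divides by the log of that base. In the sketch below, returning undef for missing or non-positive values is an assumption about how such cells might be handled, and log_base is a hypothetical helper.

```perl
use strict;
use warnings;

# Logarithm of $value in the given base; undef where no log exists.
sub log_base {
    my ($value, $base) = @_;
    return undef unless defined $value && $value > 0;
    return log($value) / log($base);
}

my @row = (8, 0, 1, undef);
my @transformed = map { log_base($_, 2) } @row;
print join("\t", map { defined $_ ? $_ : '' } @transformed), "\n";
```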

_scaleColumnData

This protected method scales the data for particular columns as specified by the client, when all data are in memory.

Usage :

    $self->_scaleColumnData($columnsToFactorsHashRef, $lineEnding, $numColumnsToReport);

public methods

dumpData

This method dumps the current contents of the dataMatrix object to a file: either the file whose name was provided as a single argument, or the file whose name was used to construct the object.

Usage:

    $self->dumpData($file);

AUTHOR

Gavin Sherlock

sherlock@genome.stanford.edu