NAME

Microarray::CdtDataset - an abstraction to the files produced from clustering

Abstract

 This package implements an object that serves as an abstraction to a
 cdtDataset.  It differs from the Microarray::DataMatrix::CdtFile
 abstraction in that it deals with the cdtFile in the context of gtr
 and/or atr files.  It also provides methods by which the geneXplorer
 program can interact with a cdtDataset.
    The essential purpose of CdtDataset's initialization functions is to
 deconstruct the .cdt file into the constituent parts of the
 dataset:
    1) the data matrix (.data_matrix)
    2) the bioassay names or slidenames (.expt_info)
    3) the annotations of the spotted features/reporters/sequences
       (.feature_info)
    4) any additional meta information about the set (.meta)
    5) additionally, it computes or creates the following:
        a) a binary file containing a list of feature-feature
           correlations (.binCor) 
        b) a 2-color image representation of the data matrix
           (.data_matrix.png)
        c) an image representation of the expt_info file
           (.expt_info.png)

Known Issues

 There are good reasons to add additional meta data to a dataset,
 including possibly the organism of the set or the location of the
 default display configuration file to display the .feature_info.
 These would probably have to be called in the constructor.

Future Plans

 Currently, only the .cdt file of a clustered dataset
 is utilized.  In the future, the other data files detailing the
 clustering [gene tree(.gtr) and array tree(.atr)] should be
 utilized, and DatasetImageMaker should export suitable image
 representations for these files.  Furthermore, it would be great to
 pull general dataset methods from this class into a future class,
 Microarray::Dataset.  That way, you could make a MageMLDataset class
 as well, and still keep many of the general class attributes/methods
 in the same locations.  Microarray::Dataset would inherit constructor
 methods (i.e. knowledge of the file structure) from either
 CdtDataset or MageMLDataset at initialization (perhaps a run-time ISA
 declaration within the constructor).  Otherwise, I don't see a huge
 advantage to having these specialized (and somewhat misnamed)
 classes, in the sense that Dataset only needs to know how to parse
 the initialization file while converting a new dataset.

Instance Constructor

new

 This is the constructor.  There are two modes in which the
 constructor can be used.  In one mode, it will create various files
 which support the dataset, using the cdt file (and hopefully, in the
 future, the gtr and atr files).  In the second mode, it will assume that
 these files already exist and just return the constructed object.
 Thus when a dataset is first created, there will be the overhead of
 creating the additional files, but subsequent creation of a
 cdtDataset object will not have that overhead.  The constructor
 takes the following arguments:
 name         :  The fully qualified name of the dataset (slash/delimited),
                 which encodes the location and stem of the files,
                 without any extensions, and with no path
                 information. If the 'initialize' argument is set
                 (see below), a directory structure of the same name
                 will also be created to contain the exported data
                 files.
 datapath     :  This required path prefix is where any newly created data
                 files should be placed (or read from).
 imagepath    :  An optional path prefix where any newly created image files
                 should be placed (or read from). Will default to
                 datapath if none is specified.
 contrast     :  If a dataset is being instantiated for the first
                 time, then a contrast is needed for image
                 generation.  If no contrast is provided, then a
                 default value of 4 will be used.  As the data are
                 expected to be in log base 2, this corresponds to a
                 16-fold change as the maximum color in any image.
 colorscheme  :  Can either be 'red/green' (the default if none is
                 specified) or 'yellow/blue'
 initialize   :  The filepath of the originating .cdt file, which
                 indicates that all the required supporting files that
                 a cdtDataset needs should be created.  This defaults
                 to 0 (which assumes that the necessary supporting
                 files already exist).  If it is a filepath, then the
                 dataset is initialized using it.
 Note that if you supply a contrast, you must also set initialize (to
 1, or to the path of a .cdt file), as a contrast is useless in the
 absence of initialization.  Both the 'name' and 'datapath' arguments
 are absolutely required.
 Usage, e.g. if you have a .cdt file:
    my $ds = Microarray::CdtDataset->new(name       => 'dataset/name',        # name of the dataset
                                         datapath   => $dir,                  # prefix path where dataset files will be written
                                         contrast   => 2,                     # image contrast
                                         initialize => '/path/to/file.cdt');  # originating .cdt file
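
 The relationship between the contrast and the maximum fold change
 described above can be sketched as follows.  The helper names
 fold_change_for_contrast and intensity_for_value are hypothetical, and
 the linear scaling to a 0..255 intensity is an assumption made for
 illustration; it is not necessarily how the module colors its images.

```perl
use strict;
use warnings;

# Hypothetical helper: the contrast is a log2 cutoff, so the fold change
# at which colors saturate is 2 ** contrast.
sub fold_change_for_contrast {
    my ($contrast) = @_;
    return 2 ** $contrast;
}

# Hypothetical helper: clamp the magnitude of a log2 ratio to the
# contrast, then scale it linearly to a 0..255 color intensity.
# (The linear scaling is an assumption, not taken from the module.)
sub intensity_for_value {
    my ($log2_ratio, $contrast) = @_;
    my $magnitude = abs $log2_ratio;
    $magnitude = $contrast if $magnitude > $contrast;
    return int(255 * $magnitude / $contrast + 0.5);
}

print fold_change_for_contrast(4), "\n";   # 16
print intensity_for_value(2, 4), "\n";     # 128
print intensity_for_value(-6, 4), "\n";    # 255 (clamped at the contrast)
```

 With the default contrast of 4, any value beyond a 16-fold change in
 either direction is drawn at full intensity.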

Instance Methods

name

 This method returns the fully qualified name of the dataset

contrast

 This method returns the contrast

colorScheme

 This method returns the colorScheme

fileBaseName

 This method returns the base name string of the files comprising
 the dataset, sans suffixes

height

 This method returns the number of data rows in the cdtFile

width

 This method returns the number of data columns in the cdtFile

image

 Returns the data matrix as a GD::Image, drawn with 1x1 pixel per
 value, at the contrast last used/initialized with $ds->new().
 Usage: $ds->image();

experiment

getFeatureKeys

 Returns the keys (attributes) for the features (gene expression row
 vectors).
 Usage: $ds->getFeatureKeys()

feature

 Required by the search function of Explorer.

getFeature

 Returns an array of data matrix row numbers where <query> matched in
 column <column_name>.  When using 'ALL' as <column_name>, all
 columns will be searched

correlations

 Returns the precalculated correlation values for row <index>.  Up to
 50 correlation values > 0.5 are stored.  As an example of client
 usage, see Explorer's (gx) retrieval of those profiles correlated to
 the query (user-clicked profile within zoom view).
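
 The storage rule above (at most 50 values, each greater than 0.5) can
 be sketched as a simple filter.  The function name top_correlations
 and the best-first ordering are assumptions for illustration; the
 documentation does not say how the stored values are ordered.

```perl
use strict;
use warnings;

# Hypothetical helper: given a hash ref of row index => correlation,
# keep the rows whose correlation exceeds $cutoff, best first, and
# return at most $max of them as [row, correlation] pairs.
sub top_correlations {
    my ($corr_for_row, $cutoff, $max) = @_;
    my @kept = sort { $corr_for_row->{$b} <=> $corr_for_row->{$a} || $a <=> $b }
               grep { $corr_for_row->{$_} > $cutoff }
               keys %$corr_for_row;
    splice @kept, $max if @kept > $max;   # drop everything past the limit
    return map { [ $_, $corr_for_row->{$_} ] } @kept;
}

# Made-up correlations of one query row against four other rows.
my %corr = (3 => 0.91, 7 => 0.42, 12 => 0.77, 20 => 0.50);
for my $pair (top_correlations(\%corr, 0.5, 50)) {
    printf "row %d: %.2f\n", @$pair;
}
```

 Here rows 3 and 12 are kept; rows 7 and 20 are dropped because their
 values are not strictly greater than 0.5.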

Protected Methods

_cdtFileName

 This method returns the name of the cdtFile

_cdtBase

 This method returns the base name string of the files comprising
 the dataset, sans suffixes

_cdtPath

 This method returns the path to the cdt file being converted
 into a dataset

datapath

 This method returns the path to which data files are either written
 or read from

imagepath

 This method returns the path to which image files are either written
 or read from

_load_meta

 This method loads in previously cached meta data

_load_image

 This protected method just opens the previously stored matrix
 image (from dataset initialization), creates a GD::Image object
 from it, and returns it.  Possible bug: it relies on the GD::Image
 version (>1.19) to pick $kImgType, when perhaps it should rely on
 the filename suffix (.gif, .png) instead.  This may prevent the
 portability of intact datasets from one filesystem to another, but
 in the end, you're always going to be limited by the version of GD...

_search_feature

 Usage: $hit = $self->_search_feature( 100, "kinase", ['ACC','NAME','SYMBOL'])
 This function returns true if the queried feature contains the passed
 string value(s).  The parameters to this function are:
 - required: the index number of the feature
 - required: a search term
 - optional: an array reference containing the names of the fields to
   search; if not passed, all fields will be searched.
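
 A minimal sketch of the matching logic described above, written as a
 standalone function rather than a method.  The name feature_matches,
 the hash-per-feature layout, and the case-insensitive substring match
 are all assumptions; the sample annotations are made up.

```perl
use strict;
use warnings;

# Hypothetical stand-in for one row of .feature_info (field => value).
# Returns 1 if any of the chosen fields contains the search term, else 0.
# When no field list is passed, every field is searched.
sub feature_matches {
    my ($feature, $term, $fields) = @_;
    my @names = $fields ? @$fields : sort keys %$feature;
    for my $name (@names) {
        next unless defined $feature->{$name};
        return 1 if index(lc($feature->{$name}), lc($term)) >= 0;
    }
    return 0;
}

# Made-up features, indexed by row number.
my @features = (
    { ACC => 'ACC0001', NAME => 'protein kinase', SYMBOL => 'GENE1' },
    { ACC => 'ACC0002', NAME => 'unknown',        SYMBOL => 'GENE2' },
);

# Collect the row numbers that match, searching only the NAME field.
my @hits = grep { feature_matches($features[$_], 'kinase', ['NAME']) } 0 .. $#features;
print "matching rows: @hits\n";   # matching rows: 0
```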

_get_correlations

 Required for Explorer to retrieve those profiles highly correlated
 to the query (user-clicked profile within zoom view).

Private Methods

__init

 This method takes care of all of the initialization of the
 attributes of the cdtDataset

__checkAndSetConstructorArguments

 This private method checks that the constructor arguments pass all
 sanity checks, and that files that should exist do exist.

__checkAndSetInitializationState

 This method checks and sets whether the object needs full
 initialization.  There are meant to be 2 initialization requests.
 The first (initialize=><path>) would request that the dataset be
 created de novo from an initial file, and the second
 (initialize=>1) would just remake the images with a different
 contrast and different colors.  The second initialization has not
 been adequately tested.

__checkAndSetDataPath

 This private method checks that a path is supplied and that it
 corresponds to an existent directory, then stores it in the object.

__checkAndSetImagePath

 This private method checks that a path is supplied and that it
 corresponds to an existent directory, then stores it in the object.

__checkAndSetDatasetName

 This method checks that a dataset name was given to the constructor.
 In addition, because CdtDataset creates and stores all its images and
 data in a directory hierarchy, the initially specified data and image
 paths are augmented with the dataset name directories (which are
 created upon initialization).
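
 The path augmentation described above might look roughly like the
 following sketch, which uses the core File::Spec and File::Path
 modules.  The helper name dataset_dir is hypothetical, and the exact
 directory layout that the module creates is an assumption.

```perl
use strict;
use warnings;
use File::Spec;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical helper: append the slash-delimited dataset name to a
# path prefix and create the resulting directory if it is missing.
sub dataset_dir {
    my ($prefix, $name) = @_;
    my $dir = File::Spec->catdir($prefix, split(m{/}, $name));
    make_path($dir) unless -d $dir;
    return $dir;
}

# A temporary directory stands in for the 'datapath' prefix.
my $datapath = tempdir(CLEANUP => 1);
my $dir      = dataset_dir($datapath, 'dataset/name');
my $status   = -d $dir ? 'created' : 'missing';
print "$status $dir\n";
```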

__checkAndSetContrast

 This method determines if the contrast is valid, and then stores the
 value in the object

__checkAndSetColorScheme

 This method determines if the colorscheme is valid, and then stores
 the value in the object

__checkRequiredFilesExist

 This method checks that all the required files for the dataset exist.
 If they do not, it will cause a fatal error.

__setCdtInfo

 This subroutine takes the initialize argument and stores the path and
 the stem of the .cdt filename

__setFileBaseName

 This method allows the filename stem (no suffix) of the datafiles
 used to initialize the dataset to be set

__setDataPath

 This method allows the path to where the data files for the dataset
 exist to be set

__setImagePath

 This method allows the path to where the image files for the dataset
 exist to be set

__setDatasetName

 This method allows the name of the dataset to be set.

__setCdtFileName

 This method sets the name of the cdtFile

__setContrast

 This method allows the contrast to be set.

__setColorScheme

 This method allows the colorscheme to be set.

__setShouldInitialize

 This method allows a flag to be set as to whether full
 initialization needs to take place

__setHeight

 This private method allows the 'height' of the dataset to be set.
 This in fact corresponds to the number of rows in the cdt file.

__setWidth

 This private method allows the 'width' of the dataset to be set.
 This in fact corresponds to the number of data columns in the cdt file.

__ensureDirectoriesExist

 This subroutine ensures that the full outpath is created if
 necessary, by extending a previously validated filepath.  It is
 intended for use only when initializing a dataset, where the dataset
 directories might need to be created and appended to the data and
 image out paths

__cdtFileObject

 This private method returns a cdtFile object.  If one does not exist
 within the object, one will be created.  If one does exist, that
 will simply be returned.  This will likely fail for sets that are
 already converted, because the .cdt file is not copied into the
 dataset location.  This is a design issue that needs to be
 discussed, in addition to the fact that it is a private method, when
 it seems like other software might actually *want* to retrieve the
 DataMatrix object

__shouldInitialize

 This private method returns whether the object needs initialization

__initializeDataset

 This method creates a new dataset from a CDT (clustered data) file.
 The CDT file format was defined by Michael Eisen for his Windows
 applications TreeView and Cluster.  It has certain drawbacks; for
 example, no more than two columns per gene can be used to store
 additional information.  This can be partly resolved by putting more
 data into one record field, which is a kludgy fix.

__lock

 This method locks the dataset

__unlock

 This method unlocks the dataset

__dissectCDT

 This method determines the contents of the cdtfile, and stores some
 of the cdtMeta data for quick retrieval.  Note that the previous
 version did its own parsing of the cdtFile.  This is now delegated
 to the cdtFile object.

__saveCdtExptNames

 This method (we may eliminate it later) saves the names of the data
 columns from the cdtFile (these are usually the experiment names) to
 a file.  This is later used by GeneXplorer, but also provides a
 quick way of looking up the data, without having to read the cdtFile
 in.

__prepareCorrelations

 This method prepares a correlations file.

__createIndexedPclFile

 This method creates a pcl file from the cdt file that was used to
 instantiate the object.  This is coded here, rather than using the
 cdtFile method to convert to a pcl, because the pcl file must have
 an index for its names, rather than the names themselves.

__compressCorrelations

 This method takes a correlations file as output by Gavin Sherlock's
 correlations program.  These represent the correlation values of a
 certain gene (array element) intensity vector vs. all other vectors
 in a data matrix.
 The output generated is a binary representation of the list of
 correlation values for each row in the data matrix (= expression
 vectors).
 The file is built like this:
 name        content           bytes
 header
 index_size  length of index   2
 index       offset for rows   index_size * 2
 body
 data 1..n   correlation data  4 * look up in index
 -> index    correlated vector 2 \
 -> corr     correlation       2 / 2 words (16 int)
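
 The layout above can be illustrated with pack and unpack.  This is
 only a sketch under several assumptions that the table does not
 settle: 16-bit big-endian unsigned integers (template "n"), offsets
 counted in 4-byte pairs from the start of the body, and correlations
 stored as the rounded value of corr * 10000.  The function names
 pack_correlations and unpack_row are hypothetical.

```perl
use strict;
use warnings;

# Assumed scaling used to fit a correlation into a 16-bit integer.
my $SCALE = 10_000;

# Pack a list of rows; each row is an array ref of
# [partner_row, correlation] pairs.
sub pack_correlations {
    my (@rows) = @_;
    my @offsets;
    my $body         = '';
    my $pairs_so_far = 0;
    for my $row (@rows) {
        push @offsets, $pairs_so_far;          # where this row's pairs start
        for my $pair (@$row) {
            my ($partner, $corr) = @$pair;
            $body .= pack('n n', $partner, int($corr * $SCALE + 0.5));
            $pairs_so_far++;
        }
    }
    # header: index_size, then one 2-byte offset per row; then the body
    return pack('n', scalar @offsets) . pack('n*', @offsets) . $body;
}

# Return the [partner_row, correlation] pairs stored for one row.
sub unpack_row {
    my ($blob, $row) = @_;
    my $index_size = unpack('n', $blob);
    my @offsets    = unpack("x2 n$index_size", $blob);
    my $body_start = 2 + 2 * $index_size;
    my $total      = (length($blob) - $body_start) / 4;
    my $first      = $offsets[$row];
    my $end        = $row + 1 < $index_size ? $offsets[$row + 1] : $total;
    my @pairs;
    for my $i ($first .. $end - 1) {
        my ($partner, $scaled) = unpack('n n', substr($blob, $body_start + 4 * $i, 4));
        push @pairs, [ $partner, $scaled / $SCALE ];
    }
    return @pairs;
}

my $blob = pack_correlations(
    [ [ 1, 0.91 ], [ 2, 0.77 ] ],   # row 0
    [],                             # row 1: nothing above the cutoff
    [ [ 0, 0.77 ] ],                # row 2
);
printf "%d bytes\n", length $blob;  # 20 bytes: 2 + 3*2 + 3*4
printf "row 0 -> row %d (%.2f)\n", @$_ for unpack_row($blob, 0);
```

 Keeping a fixed-size index in the header means that the pairs for any
 single row can be located without scanning the whole body.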

__prepareMetaFile

 This method writes out a file of meta information that pertains to
 the dataset, in the form of name=value pairs.
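
 A file of name=value pairs can be written and read back along the
 following lines.  The helper names write_meta and read_meta and the
 keys used in the example (height, width, contrast) are hypothetical;
 the actual contents of the .meta file may differ.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical writer: one name=value pair per line, sorted by name.
sub write_meta {
    my ($file, %meta) = @_;
    open my $fh, '>', $file or die "Cannot write $file: $!";
    print {$fh} "$_=$meta{$_}\n" for sort keys %meta;
    close $fh or die "Cannot close $file: $!";
    return;
}

# Hypothetical reader: returns the pairs as a hash.
sub read_meta {
    my ($file) = @_;
    my %meta;
    open my $fh, '<', $file or die "Cannot read $file: $!";
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        # Split on the first "=" only, so values may contain "=".
        my ($name, $value) = split /=/, $line, 2;
        $meta{$name} = $value;
    }
    close $fh;
    return %meta;
}

# Round trip through a temporary file.
my (undef, $file) = tempfile(UNLINK => 1);
write_meta($file, height => 120, width => 8, contrast => 4);
my %meta = read_meta($file);
print "$meta{height} x $meta{width}\n";   # 120 x 8
```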

__loadExptInfo

 This method loads the expt_info data

__load_table

 Loads an ASCII table.  It is expected that the first row contains the
 column headers.  It is also expected that the first column contains
 numeric ids starting at '0'.  Returns a reference to the table
 structure.
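
 A loader that follows the expectations above could be sketched as
 shown here.  The tab delimiter, the shape of the returned structure,
 and the function name load_table are assumptions made for the
 example; the sample rows are made up.

```perl
use strict;
use warnings;

# Hypothetical loader: assumes tab-delimited text and returns
# { headers => [...], rows => [ { header => value, ... }, ... ] },
# with rows indexed by the numeric id found in the first column.
sub load_table {
    my ($fh) = @_;
    my $header_line = <$fh>;
    die "Table is empty\n" unless defined $header_line;
    chomp $header_line;
    my @headers = split /\t/, $header_line, -1;
    my @rows;
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        my @fields = split /\t/, $line, -1;
        my $id = $fields[0];
        die "Non-numeric id '$id' at line $.\n" unless $id =~ /^\d+$/;
        my %row;
        @row{@headers} = @fields;      # pair each header with its field
        $rows[$id] = \%row;
    }
    return { headers => \@headers, rows => \@rows };
}

# Read from an in-memory string so that the example is self-contained.
my $text = "ID\tNAME\tSYMBOL\n0\tprotein kinase\tGENE1\n1\tunknown\tGENE2\n";
open my $fh, '<', \$text or die "Cannot open in-memory table: $!";
my $table = load_table($fh);
close $fh;
my $count = scalar @{ $table->{rows} };
print "$count rows; row 1 is $table->{rows}[1]{SYMBOL}\n";   # 2 rows; row 1 is GENE2
```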

Authors

John C. Matese jcmatese@genome.stanford.edu