Microarray::GEO::SOFT - Reading microarray data in SOFT format from GEO database.
    use Microarray::GEO::SOFT;
    use strict;

    # initialize
    my $soft = Microarray::GEO::SOFT->new;

    # download
    $soft->download("GDS3718");
    $soft->download("GSE10626");
    $soft->download("GPL1261");

    # or else you can read local data
    $soft = Microarray::GEO::SOFT->new(file => "GDS3718.soft");
    $soft = Microarray::GEO::SOFT->new(file => "GSE10626_family.soft");
    $soft = Microarray::GEO::SOFT->new(file => "GPL1261.annot");

    # parse
    # it returns a Microarray::GEO::SOFT::GDS,
    # Microarray::GEO::SOFT::GSE or Microarray::GEO::SOFT::GPL object
    # according to the GEO ID type
    my $data = $soft->parse;

    # some meta info
    $data->meta;
    $data->title;
    $data->platform;

    # for GPL and GDS, you can get the data table
    $data->table;
    $data->colnames;
    $data->rownames;
    $data->matrix;

    # since a GSE can contain more than one GPL,
    # we can get the GPL list in a GSE
    my $gpl_list = $data->list("GPL");

    # merge samples belonging to the same GPL into a data set
    my $gds_list = $data->merge;

    # if the GSE has only one platform,
    # then the merged data set is the first one in gds_list
    # and the platform is the first one in gpl_list
    my $g = $gds_list->[0];
    my $gpl = $gpl_list->[0];

    # since GPL data contains different mappings of genes or probes,
    # we can transform from probe id to gene symbol
    # it returns a Microarray::ExprSet object
    my $e = $g->id_convert($gpl, "Gene Symbol");
    $e = $g->id_convert($gpl, qr/gene[-_\s]?symbol/i);

    # if you parsed a GDS data set,
    # you can first find the platform
    my $platform_id = $data->platform;
    # download it or parse the local file
    $gpl = Microarray::GEO::SOFT->new->download($platform_id);
    # and do the id conversion
    $e = $data->id_convert($gpl, qr/gene[-_\s]?symbol/i);

    # or just transform into Microarray::ExprSet directly from GDS
    $e = $g->soft2exprset;

    # then you can do some simple processing
    # eliminate the blank lines
    $e->remove_empty_features;
    # make all symbols unique
    $e->unify_features;
    # save the expression matrix
    $e->save('some-file');
You can also use the module from the command line:

    getgeo --id=GDS3718
    getgeo --file=GDS3718.soft --verbose
GEO (Gene Expression Omnibus) is one of the largest public databases of gene expression profile data. This module provides methods to download and parse files from the GEO database and transform them into a simple format for common usage.
GEO contains four types of records: GSE, GPL, GSM and GDS.
GPL: Platform of the microarray, like Affymetrix U133A, see Microarray::GEO::SOFT::GPL
GSM: A single microarray, see Microarray::GEO::SOFT::GSM
GSE: A complete microarray experiment, which usually contains multiple samples and may use more than one platform; see Microarray::GEO::SOFT::GSE
GDS: A manually curated data set derived from GSE records, with only one platform; see Microarray::GEO::SOFT::GDS
Data stored in the GEO database comes in several formats. This module provides methods to parse the most commonly used one: SOFT formatted family files. The original data is downloaded from the GEO FTP site.
new("file" => $file, HASH )
Initialize a Microarray::GEO::SOFT object. The 'file' argument is the path to the microarray data in SOFT format, or a file handle that has already been opened. Other arguments are:

    'tmp_dir'             => '.tmp_soft'
    'verbose'             => 1
    'sample_value_column' => 'VALUE'

'tmp_dir' is the name of the temporary directory. 'verbose' determines whether messages are printed during processing. 'sample_value_column' is the name of the table column to use when parsing GSM data.
$soft->download(ACC, %options)
Download a GEO record from the NCBI website. The first argument is the accession number (GSExxx, GPLxxx or GDSxxx). You can set the timeout and proxy via %options. The proxy should be given as http://username:password@server-addr:port/.
%options
GSE data is downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/by_series/GSExxx/GSExxx_family.tar.gz
GDS data is downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/SOFT/GDS/GDSxxx.soft.gz
GPL data is downloaded from ftp://ftp.ncbi.nih.gov/pub/geo/DATA/annotation/platforms/GPLxxx.annot.gz
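The three URL patterns above can be sketched as a small pure-Perl helper. The function name geo_soft_url is hypothetical and is not part of Microarray::GEO::SOFT; it only substitutes the accession into the patterns listed here, and it makes no claim about how the module builds its URLs internally or whether the FTP layout is still current.

```perl
use strict;
use warnings;

# Base location taken from the URL patterns listed in this document.
my $BASE = "ftp://ftp.ncbi.nih.gov/pub/geo/DATA";

# Hypothetical helper (not part of Microarray::GEO::SOFT): map an
# accession number to the download location documented above.
sub geo_soft_url {
    my ($acc) = @_;
    $acc = uc $acc;
    if ($acc =~ /^GSE\d+$/) {
        return "$BASE/SOFT/by_series/$acc/${acc}_family.tar.gz";
    }
    if ($acc =~ /^GDS\d+$/) {
        return "$BASE/SOFT/GDS/$acc.soft.gz";
    }
    if ($acc =~ /^GPL\d+$/) {
        return "$BASE/annotation/platforms/$acc.annot.gz";
    }
    # Only GSE, GDS and GPL locations are listed in this document.
    die "Unsupported accession: $acc\n";
}

print geo_soft_url($_), "\n" for qw(GSE10626 GDS3718 GPL1261);
```

Any other accession type makes the sketch die, since only GSE, GDS and GPL locations are documented here.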
$soft->soft_dir
Temporary directory for storing downloaded GEO data. By default it is ".tmp_soft".
$soft->parse
The parsing method is selected according to the accession number of the GEO record. For example, for a GSExxx record the GSE parser is used and a Microarray::GEO::SOFT::GSE object is returned. The return value is a Microarray::GEO::SOFT::GSE, Microarray::GEO::SOFT::GPL or Microarray::GEO::SOFT::GDS object.
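As an illustration of selecting a parser by accession type, the following minimal sketch maps an accession prefix to the class names given in this document. The lookup table and the function class_for_accession are assumptions made for this example; they are not the module's actual implementation.

```perl
use strict;
use warnings;

# Class names are taken from this document; the lookup itself is a
# hypothetical illustration, not the module's actual code.
my %CLASS_FOR = (
    GSE => "Microarray::GEO::SOFT::GSE",
    GPL => "Microarray::GEO::SOFT::GPL",
    GDS => "Microarray::GEO::SOFT::GDS",
);

sub class_for_accession {
    my ($acc) = @_;
    # Capture the three-letter prefix; die if the accession is not
    # one of the record types that parse is documented to return.
    my ($prefix) = uc($acc) =~ /^(GSE|GPL|GDS)\d+$/
        or die "Cannot determine record type from '$acc'\n";
    return $CLASS_FOR{$prefix};
}

print "$_ => ", class_for_accession($_), "\n" for qw(GDS3718 GSE10626 GPL1261);
```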
$data->meta
Get the meta information. Individual fields can be obtained via the platform, title and accession methods.
platform
title
accession
$data->set_meta(HASH)
Set the meta information. The arguments are 'platform', 'title' and 'accession'.
$data->platform
Get the accession number of the platform. If a record has multiple platforms, the function returns an array reference (only for GSE).
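Because the return value may be either a single accession or an array reference, calling code may want to normalize it before looping. The sketch below is one way to do that; platform_ids is a hypothetical helper and the literal values are stand-ins for what $data->platform might return.

```perl
use strict;
use warnings;

# Hypothetical helper: accept either a single accession string or an
# array reference of accessions, and always return a flat list.
sub platform_ids {
    my ($platform) = @_;
    return () unless defined $platform;
    return ref($platform) eq 'ARRAY' ? @$platform : ($platform);
}

# Stand-in values; in real use these would come from $data->platform.
my @from_gds = platform_ids("GPL1261");
my @from_gse = platform_ids(["GPL96", "GPL97"]);

print scalar(@from_gds), " platform(s) for the GDS-like value\n";
print scalar(@from_gse), " platform(s) for the GSE-like value\n";
```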
$data->title
Title of the record
$data->accession
Accession number for the record
$gds->table
Get the table part of the object. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
$gds->set_table
Set the table part of the object. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
$gds->rownames
Row names of the table part of the object. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
$gds->colnames
Column names of the table part of the object. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
$gds->colnames_explain
A more detailed explanation of the column names. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
$gds->matrix
Expression value matrix or ID mapping matrix. Note that this does not work for Microarray::GEO::SOFT::GSE objects.
getgeo
getgeo is a simple command line tool to download or parse the GEO data. Options are as follows:
--id=[GEOID]
    GEO ID, such as GSE123, GDS123 or GPL123. If this is set, the script downloads the data from the GEO FTP site.

--proxy=[PROXY]
    Proxy used to connect to the GEO FTP site. The format should look like http://username:password@host:port/.

--file=[FILE]
    Filename of a local GEO file. If --id is set, this option is ignored.

--tmp-dir=[DIR]
    Name of the temporary directory used while processing GEO data. By default it is '.tmp_soft' in your working directory.

--verbose
    Print messages while processing.

--sample-value-column=[FIELD]
    Since a GSM record may contain multiple columns, users may specify which column holds the expression values they want. By default it is 'VALUE'. Ignored when analyzing GPL and GDS data.

--output-file=[FILE]
    Filename of the output file. By default it is 'GEOID.table' in your current working directory.

--help
    Print the help message.
Zuguang Gu <jokergoo@gmail.com>
Copyright 2012 by Zuguang Gu
This library is free software; you can redistribute it and/or modify it under the same terms as Perl itself, either Perl version 5.12.1 or, at your option, any later version of Perl 5 you may have available.
Microarray::ExprSet