The Perl Toolchain Summit needs more sponsors. If your company depends on Perl, please support this very important event.

TopoView Glyph

Warning: This software is still in the developmental stage and is distributed "as is", without packaging and in the same exact condition as currently used by FlyBase. You are free to use it, modify and develop further, but proper reference to the original author and FlyBase is required.

"fb_shmiggle.pm" TopoView (AKA shmiggle) glyph was developed for fast 3D-like demonstration of RNA-seq data consisting of multiple individual subsets. Main purposes were to compact presentation as much as possible (in one reasonably sized track) and to allow easy visual detection of coordinated behavior of the expression profiles of different subsets.

It was found that log2 conversion dramatically changes perception of expression profiles and kind of illuminates coordinated behavior of different subsets. Glyph and data indexer/formatter were in fact modified with the assumption that final data produced by indexer/formatter will always be a log2 conversion of the original coverage, therefore represented by short integer with values in range of 0-200 or so.

Comparing performance (retrieval of several Kbp of data profiles for several subsets of some RNA-seq experiment) of wiggle binary method and of several possible alternatives, it was discovered that one of the approaches remarkably outperforms wiggle bin method (although it requires several times more space for formatted data storage). Optimal storage/retrieval method stores all experiment data (all subsets of the experiment) in one text file, where structure of the file in fact is one of the most simple wiggle (coverage files) formats with the addition of some positioning data (two-column format, without runlength specification, without omission of zero values). This is the only format which glyph is able to handle (there are many reasons for that) so any modification of indexer/formatter _must_ produce exact equivalent of that format. In my experience, 90% of the debugging with new incoming data was related to the problems of that exact format conversion. Example of the formatted data:

# subset=BS107_all_unique chromosome=2LHet
-200000 0
0       0
19955   1
19959   0
19967   2
19972   0
19977   2
20027   0
20031   2
20035   0
20043   1
20045   0
20049   1
20055   0
20062   2
20069   0
20073   2
20082   0
20097   3
20115   0
20125   3
20127   0
20134   3
20139   0
20140   3
20144   0
20145   3
20150   0
20157   3
20162   0
20172   3
20183   0

Glyph is supplied with a "index_cov_files.pl" data indexer/formatter which is converting original coverage (wiggle) files into data structure which will be used for fast retrieval. You should run this script in some separate directory, containing original coverage files (gzipped form works too). After it finishes, directory will contain two new files: data.cat and index.bdbhash. Both files required for data retrieval by glyph. Files can be moved freely between different directories or even operational systems (Mac and PC included, I think). Content of the dat file is subject of accurate check - this is if you want to avoid long debugging sessions on the level of running GBrowse. Size of files is quite big, but in my experience it is like twice less than gzipped size of all initial coverage files - which is quite acceptable.

Example of GBrowse conf file insert (shows actual FlyBase config sections for Baylor and modENCODE RNA-seq tracks):

[baylor_wiggle]
feature       = RNAseq_profile:Baylor
glyph         = fb_shmiggle
height        = 124
bgcolor       = sub { my $f= shift;
        $f->{datadir}= '/.data/genomes/dmel/current/rnaseq-gff/baylor/'; # trick it this way..
        my @subsetsorder= qw(
                E2-4hr
                E2-16hr
                E2-16hr100
                E14-16hr
                L
                L3i
                L3i100
                P
                P3d
                MA3d
                FA3d
                A17d
                );
        $f->{subsetsorder}= \@subsetsorder;
        return 'lightgrey';
        }
key           = Baylor group RNA-seq coverage by subsets (devel.stages) [log2 converted]
category      = RNA-seq data
label         = ""
title         = ""
link = sub { my $f= shift;
  my $id= $f->{'id'};
  my $lnk="javascript:void(0);";
  "$lnk\" id=\"$id\" onmouseover=\"showdata_description('Baylor');return false;\" onmouseout=\"delsumm_overlib();";
  }

[celniker_wiggle]
feature       = RNAseq_profile:Celniker 
glyph 				= fb_shmiggle
height      	= 250
bgcolor       = sub { my $f= shift;
	$f->{datadir}= '/.data/genomes/dmel/current/rnaseq-gff/celniker/'; # trick it this way..
	my @subsetsorder= qw(
		BS40_all_unique
		BS43_all_unique
		BS46_all_unique
		BS49_all_unique
		BS54_all_unique
		BS55_all_unique
		BS58_all_unique
		BS62_all_unique
		BS66_all_unique
		BS67_all_unique
		BS71_all_unique
		BS73_all_unique
		BS107_all_unique
		BS111_all_unique 
		BS113_all_unique 
		BS196_all_unique 
		BS200_all_unique 
		BS203_all_unique 
		BS129_all_unique 
		BS133_all_unique 
		BS136_all_unique 
		BS137_all_unique 
		BS140_all_unique 
		BS143_all_unique 
		BS150_all_unique 
		BS156_all_unique 
		BS162_all_unique 
		BS153_all_unique 
		BS159_all_unique 
		BS165_all_unique 
		);
	$f->{subsetsorder}= \@subsetsorder;
	my %subsetsnames= qw(
		BS40_all_unique em0-2hr
		BS43_all_unique em2-4hr
		BS46_all_unique em4-6hr
		BS49_all_unique em6-8hr
		BS54_all_unique em8-10hr
		BS55_all_unique em10-12hr
		BS58_all_unique em12-14hr
		BS62_all_unique em14-16hr
		BS66_all_unique em16-18hr
		BS67_all_unique em18-20hr
		BS71_all_unique em20-22hr
		BS73_all_unique em22-24hr
		BS107_all_unique L1
		BS111_all_unique L2
		BS113_all_unique L3_12hr
		BS196_all_unique L3_PS1-2
		BS200_all_unique L3_PS3-6
		BS203_all_unique L3_PS7-9
		BS129_all_unique WPP
		BS133_all_unique WPP_12hr
		BS136_all_unique WPP_24hr
		BS137_all_unique WPP_2days
		BS140_all_unique WPP_3days
		BS143_all_unique WPP_4days
		BS150_all_unique AdM_Ecl_1days
		BS156_all_unique AdM_Ecl_5days
		BS162_all_unique AdM_Ecl_30days
		BS153_all_unique AdF_Ecl_1days
		BS159_all_unique AdF_Ecl_5days
		BS165_all_unique AdF_Ecl_30days
		);
	$f->{subsetsnames}= \%subsetsnames;
	return 'lightgrey';
	}
key           = modENCODE Transcription Group RNA-seq coverage (unique reads only) by subsets (devel. stages) [log2 converted]
category      = RNA-seq data
label         = "" 
title         = ""
link = sub { my $f= shift;
  my $id= $f->{'id'};
	my $lnk="javascript:void(0);";
	"$lnk\" id=\"$id\" onmouseover=\"showdata_description('Celniker');return false;\" onmouseout=\"delsumm_overlib();";
	}

In configuration, it is very important to set 'datadir' variable (relative to server DOCUMENT_ROOT) so that glyph will know where to take data and index.

Setting 'subsetsorder' allows you to display expression profiles of subsets in some predefined order. If setting omitted, glyph will display sets in alphabetical order of the initial subsets names.

Setting 'subsetsnames' allows to rename subsets (very important as in most cases workflow names of subsets are unsutable for intelligent data display to end users). If setting omitted, initial subsets names will be used for display.

For the glyph to be properly activated, you need to insert in all of your GFF files (ones for which you have RNA-seq data) virtual contig-long features which will activate expression data display. To cover whole range of the contig (chromosome arm), it is better to use coordinates presented in 'sequence-region' definition at the top of GFF file. Example of such feature lines for FlyBase data is shown below:

2LHet   Baylor  RNAseq_profile  1       368874  .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
2L      Baylor  RNAseq_profile  1       23011544        .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
2RHet   Baylor  RNAseq_profile  1       3288763 .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
2R      Baylor  RNAseq_profile  1       21146708        .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
3LHet   Baylor  RNAseq_profile  1       2555493 .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
3L      Baylor  RNAseq_profile  1       24543557        .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
3RHet   Baylor  RNAseq_profile  1       2517509 .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
3R      Baylor  RNAseq_profile  1       27905053        .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
4       Baylor  RNAseq_profile  1       1351857 .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
XHet    Baylor  RNAseq_profile  1       204113  .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
X       Baylor  RNAseq_profile  1       22422827        .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks
YHet    Baylor  RNAseq_profile  1       347040  .       +       .       Comment=This is a reference feature for RNAseq wiggle tracks

Questions about TopoView glyph should be directed to Victor Strelets (strelets@bio.indiana.edu).