eXpress

Streaming quantification for high-throughput sequencing

About

Description

eXpress is a streaming tool for quantifying the abundances of a set of target sequences from sampled subsequences. Example applications include transcript-level RNA-Seq quantification, allele-specific/haplotype expression analysis (from RNA-Seq), transcription factor binding quantification in ChIP-Seq, and analysis of metagenomic data. It is based on an online-EM algorithm [1], so its memory requirements are proportional to the total size of the target sequences and its running time is proportional to the number of sampled fragments. Thus, in applications such as RNA-Seq, eXpress can accurately quantify much larger samples than other currently available tools, greatly reducing computing infrastructure requirements. When coupled with a streaming aligner (such as Bowtie), eXpress can be used to build lightweight high-throughput sequencing processing pipelines: the aligner's output can be piped directly into eXpress, effectively eliminating the need to store read alignments in memory or on disk.
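
To make the memory and time claims concrete, here is a minimal Python sketch of a generic online-EM update for mixture weights in the spirit of [1]. It is not the eXpress implementation: the function name `online_em`, the step-size schedule, and the toy data are assumptions made for illustration, and the model ignores fragment lengths, read errors, and bias.

```python
def online_em(fragments, targets, alpha=0.7):
    """One-pass estimate of relative target abundances (toy model).

    fragments: iterable of lists of target names each fragment aligns to.
    targets:   all target names.
    alpha:     step-size exponent, an illustrative choice in (0.5, 1].
    """
    # State is one number per target, so memory does not grow with the
    # number of fragments processed.
    theta = {t: 1.0 / len(targets) for t in targets}
    for n, hits in enumerate(fragments, start=1):
        # E-step for this fragment only: split it among its candidate
        # targets in proportion to the current abundance estimates.
        total = sum(theta[t] for t in hits)
        resp = {t: theta[t] / total for t in hits}
        # Stochastic-approximation update with a decreasing step size.
        # (For clarity this touches every target per fragment; a real
        # implementation would avoid that.)
        gamma = (n + 1) ** -alpha
        for t in theta:
            theta[t] = (1.0 - gamma) * theta[t] + gamma * resp.get(t, 0.0)
    return theta


# Toy stream: fragments unique to A, ambiguous between A and B, unique to B.
fragments = [["A"], ["A", "B"], ["A"], ["B"]] * 50
estimates = online_em(fragments, ["A", "B", "C"])
```

Each fragment is seen once and then discarded, which is why the input can be any iterator, including one that reads alignments from a pipe.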

In an analysis of the performance of eXpress on RNA-Seq data, we observed that this efficiency does not come at the cost of accuracy: eXpress is more accurate than other available tools, even on smaller datasets that do not require such efficiency [2]. Moreover, like the Cufflinks program [3], eXpress can be used to estimate transcript abundances in multi-isoform genes. eXpress is also able to resolve multi-mappings of reads across gene families, and it does not require a reference genome, so it can be used in conjunction with de novo assemblers such as Trinity, Oases, or Trans-ABySS. The underlying model is based on previously described probabilistic models developed for RNA-Seq [4], but it is applicable to other settings where target sequences are sampled. It includes parameters for fragment length distributions, errors in reads, and sequence-specific fragment bias [5].

eXpress can also be used to resolve ambiguous mappings in other high-throughput sequencing applications. The only required inputs are a set of target sequences and a set of sequenced fragments multiply-aligned to them. While the target sequences will often be gene isoforms, they need not be: haplotypes can serve as the targets for allele-specific expression analysis, binding regions for ChIP-Seq, and genomes for metagenomics experiments. eXpress is useful in any analysis where reads multi-map to sequences that differ in abundance.
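
As an illustration of what "multiply-aligned" input looks like to a streaming consumer, the following Python sketch groups SAM-formatted alignment lines into one record per fragment, listing every target the fragment aligned to. It relies only on the standard SAM column layout (read name in column 1, reference name in column 3, `*` for unaligned reads). The function name `fragments_from_sam` is hypothetical, and the sketch assumes that all alignments of a fragment appear on consecutive lines; it says nothing about how eXpress itself parses its input.

```python
from itertools import groupby


def fragments_from_sam(lines):
    """Yield (fragment_name, sorted candidate targets) from SAM text lines."""
    records = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("@")  # skip header lines
    )
    # groupby only merges adjacent records, so nothing beyond the current
    # fragment's alignments is held in memory.
    for name, group in groupby(records, key=lambda fields: fields[0]):
        targets = sorted({fields[2] for fields in group if fields[2] != "*"})
        if targets:  # drop fragments with no aligned target
            yield name, targets


# Toy input: r1 aligns to two targets, r2 is unaligned, r3 aligns to one.
sam_lines = [
    "@SQ\tSN:t1\tLN:100\n",
    "r1\t0\tt1\t1\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r1\t0\tt2\t7\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
    "r2\t4\t*\t0\t0\t*\t*\t0\t0\tACGT\tIIII\n",
    "r3\t0\tt2\t9\t255\t4M\t*\t0\t0\tACGT\tIIII\n",
]
grouped = list(fragments_from_sam(sam_lines))
```

Passing `sys.stdin` instead of a list would let the same generator sit at the end of an aligner pipe.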

Features

  • Time proportional to number of reads, memory proportional to transcriptome size.
  • Requires at most 3 free processor cores (2 without bias correction). Try it on your laptop!
  • Outputs FPKM, estimated counts, and posterior count distributions for differential analysis.
  • Supports alignments with unlimited multi-mappings and errors.
  • Transcript-level abundance estimation when used for RNA-Seq.
  • Can be used for allele-specific expression estimates in RNA-Seq.
  • Corrects for sequence-specific fragment biases.
  • Learns a first-order Markov model for sequencing errors for use in probabilistic read assignment.
  • Models indels in reads.
  • Learns the fragment length distribution for use in probabilistic assignment of paired-end reads.
  • Supports both directional and non-directional sequencing.
  • Written in C++ and runs on Mac, Linux, and Windows.
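
For readers unfamiliar with the FPKM unit listed above (fragments per kilobase of target per million mapped fragments), the following Python sketch applies the standard definition to a set of estimated counts. The function name `fpkm`, the use of an effective length per target, and the default of summing the counts to get the total are assumptions for illustration; the exact quantities eXpress uses are described in the manuscript [2].

```python
def fpkm(est_counts, eff_lengths, total_fragments=None):
    """Convert estimated fragment counts to FPKM.

    est_counts:      {target: estimated number of fragments}
    eff_lengths:     {target: effective length in bases, must be > 0}
    total_fragments: total mapped fragments; defaults to the sum of counts.
    """
    if total_fragments is None:
        total_fragments = sum(est_counts.values())
    # count / (length / 1e3) / (total / 1e6) == count * 1e9 / (length * total)
    return {
        t: c * 1e9 / (eff_lengths[t] * total_fragments)
        for t, c in est_counts.items()
    }


# Two targets with equal counts: the shorter one gets the higher FPKM.
values = fpkm({"t1": 500.0, "t2": 500.0}, {"t1": 1000.0, "t2": 2000.0})
```

Normalizing by length and by library size is what makes values comparable across targets and across samples of different depth.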

Methods

A complete description of the eXpress method can be found in the manuscript [2].

References

  1. Cappé O and Moulines E (2009). On-line expectation–maximization algorithm for latent data models. Journal of the Royal Statistical Society.
    doi:10.1111/j.1467-9868.2009.00698.x
  2. Roberts A and Pachter L (2012). Streaming fragment assignment for real-time analysis of sequencing experiments. Nature Methods.
    doi:10.1038/nmeth.2251
  3. Trapnell C, Williams BA, Pertea G, Mortazavi AM, Kwan G, van Baren MJ, Salzberg SL, Wold B, Pachter L (2010). Transcript assembly and quantification by RNA-Seq reveals unannotated transcripts and isoform switching during cell differentiation. Nature Biotechnology.
    doi:10.1038/nbt.162
  4. Pachter L (2011). Models for transcript quantification from RNA-Seq. Submitted.
    arXiv:1104.3889v2
  5. Roberts A, Trapnell C, Donaghey J, Rinn JL, Pachter L (2011). Improving RNA-Seq expression estimates by correcting for fragment bias. Genome Biology.
    doi:10.1186/gb-2011-12-3-r22
