IVT-seq reveals extreme bias in RNA sequencing
1 Department of Pharmacology, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
2 Department of Chemical and Biological Engineering, Koc University, Istanbul, Turkey
3 Department of Molecular Biology and Genetics, Koc University, Istanbul, Turkey
4 The Institute for Translational Medicine and Therapeutics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
5 The Hamner Institutes for Health Sciences, Research Triangle Park, NC, USA
6 Department of Biology, University of Pennsylvania, Philadelphia, PA, USA
7 Amazon Web Services, Herndon, VA, USA
8 Department of Biostatistics, Johns Hopkins University Bloomberg School of Public Health, Baltimore, MD, USA
9 Department of Genetics, University of Pennsylvania Perelman School of Medicine, Philadelphia, PA, USA
Genome Biology 2014, 15:R86 doi:10.1186/gb-2014-15-6-r86
Published: 30 June 2014
RNA-seq is a powerful technique for identifying and quantifying transcription and splicing events, both known and novel. However, given its recent development and the proliferation of library construction methods, our understanding of the bias it introduces remains incomplete, even though that understanding is critical to realizing the technique's value.
We present a method, in vitro transcription sequencing (IVT-seq), for identifying and assessing the technical biases in RNA-seq library generation and sequencing at scale. We created a pool of over 1,000 in vitro transcribed RNAs from a full-length human cDNA library and sequenced them with polyA and total RNA-seq, the most common protocols. Because each cDNA is full length, and we show in vitro transcription is incredibly processive, each base in each transcript should be equivalently represented. However, with common RNA-seq applications and platforms, we find 50% of transcripts have more than two-fold and 10% have more than 10-fold differences in within-transcript sequence coverage. We also find greater than 6% of transcripts have regions of dramatically unpredictable sequencing coverage between samples, confounding accurate determination of their expression. We use a combination of experimental and computational approaches to show rRNA depletion is responsible for the most significant variability in coverage, and several sequence determinants also strongly influence representation.
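To make the within-transcript coverage statistic concrete, the following is a minimal sketch of how a fold difference in coverage could be computed per transcript and then summarized across a pool. It assumes the fold difference is the ratio of the highest to the lowest per-base read depth within a transcript, with a pseudocount of 1 to avoid division by zero; the abstract does not specify the exact definition, so both choices are assumptions. The function names and the toy depth values are hypothetical and are not data from the study.

```python
from typing import Dict, List, Sequence


def within_transcript_fold(coverage: Sequence[float], pseudocount: float = 1.0) -> float:
    """Ratio of highest to lowest per-base depth within one transcript.

    Assumption: max/min with a pseudocount; the published analysis may
    use a different definition (for example, percentiles).
    """
    if not coverage:
        raise ValueError("coverage must contain at least one position")
    highest = max(coverage) + pseudocount
    lowest = min(coverage) + pseudocount
    return highest / lowest


def fraction_above(folds: Sequence[float], threshold: float) -> float:
    """Fraction of transcripts whose fold difference exceeds the threshold."""
    if not folds:
        raise ValueError("folds must contain at least one value")
    return sum(1 for f in folds if f > threshold) / len(folds)


# Hypothetical per-base depths for four toy transcripts (not real data).
toy_coverage: Dict[str, List[float]] = {
    "tx_a": [99, 99, 99, 99],    # perfectly even coverage
    "tx_b": [29, 59, 89, 29],    # moderately uneven
    "tx_c": [199, 9, 99, 199],   # one poorly covered region
    "tx_d": [49, 59, 49, 54],    # nearly even
}

folds = {name: within_transcript_fold(cov) for name, cov in toy_coverage.items()}
fold_values = list(folds.values())

print(folds)
print("fraction > 2-fold:", fraction_above(fold_values, 2.0))
print("fraction > 10-fold:", fraction_above(fold_values, 10.0))
```

If every base of an in vitro transcribed RNA were equally represented, each transcript's fold difference would be close to 1, so the fraction of transcripts above a given threshold is a direct readout of coverage bias.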
These results show the utility of IVT-seq for promoting better understanding of bias introduced by RNA-seq. We find rRNA depletion is responsible for substantial, unappreciated biases in coverage introduced during library preparation. These biases suggest exon-level expression analysis may be inadvisable, and we recommend caution when interpreting RNA-seq results.