DNA sequencing technologies deviate from the ideal uniform distribution of reads. These biases impair scientific and medical applications. Accordingly, we have developed computational methods for discovering, describing and measuring bias.
We applied these methods to the Illumina, Ion Torrent, Pacific Biosciences and Complete Genomics sequencing platforms, using data from human and from a set of microbes with diverse base compositions. As in previous work, library construction conditions significantly influence sequencing bias. Pacific Biosciences coverage levels are the least biased, followed by Illumina, although all technologies exhibit error-rate biases in high- and low-GC regions and at long homopolymer runs. The GC-rich regions prone to low coverage include a number of human promoters, so we therefore catalog 1,000 that were exceptionally resistant to sequencing. Our results indicate that combining data from two technologies can reduce coverage bias if the biases in the component technologies are complementary and of similar magnitude. Analysis of Illumina data representing 120-fold coverage of a well-studied human sample reveals that 0.20% of the autosomal genome was covered at less than 10% of the genome-wide average. Excluding locations that were similar to known bias motifs or likely due to sample-reference variations left only 0.045% of the autosomal genome with unexplained poor coverage.
The assays presented in this paper provide a comprehensive view of sequencing bias, which can be used to drive laboratory improvements and to monitor production processes. Development guided by these assays should result in improved genome assemblies and better coverage of biologically important loci.
Ideal whole-genome shotgun DNA sequencing would distribute reads uniformly across the genome and without sequence-dependent variations in quality. All existing sequencing technologies fall short of this ideal and exhibit various types and degrees of bias. Sequencing bias degrades genomic data applications, including genome assembly and variation discovery, which rely on genome-wide coverage. Undercovered regions might lead to a missed SNP in an important region or cause an assembler to produce shorter contigs. For example, Figure 1 plots the coverage of the transcription start site and first exon of human gene NCS1, which encodes a neurotransmitter regulator , in whole-genome shotgun sequencing (data set A2). Despite 198-fold mean coverage of the genome, the first 72 bases of this exon are completely uncovered. This type of bias can reduce the effectiveness of biological and medical research. Recently published work on drug-resistant tuberculosis identified thousands of zero-coverage sites in an entire class of the bacterium's genes, despite sequencing to an average depth of 134× . Alleviating gaps or dips in coverage through additional reads inflates sequencing costs, and may have limited effectiveness. For these reasons, improving our knowledge of sequencing bias is essential to improving the utility of DNA sequencing data.
Figure 1. Diagram illustrating the low coverage of NCS1 exon 1 in 198× Illumina HiSeq shotgun data. The first 72 bases of the first exon of human gene NCS1, including the transcription start site, were uncovered in a 198× whole-genome shotgun data set (#A2). The displayed 2,000 base region is chromosome 9:132,933,910-132,935,910. NCS1 encodes calcium-binding proteins that regulate neurotransmitter release .
Our goal in this work was to develop a rigorous method for discovering and monitoring coverage and error biases, then to apply it to data from a wide range of sequencing platforms (Illumina HiSeq and MiSeq, Ion Torrent PGM, Pacific Biosciences RS, and the Complete Genomics sequencing service). This study complements previous work in the field [3-7].
Bias manifests in multiple ways. Coverage bias is a deviation from the uniform distribution of reads across the genome. Similarly, error bias is a deviation from the expectation of uniform mismatch, insertion, and deletion rates in reads across the genome. This paper focuses primarily on coverage bias because it is the most damaging sequencing failure.
Sequencing technologies are vulnerable to multiple sources of bias. Methods based on bacterial cloning and Sanger-chemistry sequencing  were subject to many coverage-reducing biases, notably at GC extremes, palindromes, inverted repeats, and sequences toxic to the bacterial host [9-17]. Illumina sequencing  has been shown to lose coverage in regions of high or low GC [19-22], a phenomenon also seen in other 'next-generation' technologies [3,6]. PCR amplification during library construction is a known source of undercoverage of GC-extreme regions [20,21] and similar biases may also be introduced during bridge PCR for cluster amplification on the Illumina flowcell . Illumina strand-specific errors can lead to coverage biases by impairing aligner performance . Ion Torrent , like 454 , utilizes a terminator-free chemistry that may limit its ability to accurately sequence long homopolymers [4,27,28], and may also be sensitive to coverage biases introduced by emulsion PCR in library construction. Complete Genomics  also uses amplification along with a complex library construction process. The Pacific Biosciences  process is amplification-free; therefore, one might expect it to exhibit lower levels of coverage bias than the other technologies.
In addition to sources in the wet lab, bias can be introduced by any of the computational steps in the sequencing pipeline. Signal-processing and base calling limitations could result in under-representation or increased error rates in some locations, as can inaccurate alignment. An inaccurate reference or sample-reference differences can cause coverage or accuracy variations that may be misdiagnosed as sequencing bias. Therefore, detecting bias is only the first step and must be followed by more detailed experiments to assign responsibility to the library preparation, sequencing, or computational stages.
We employ two methodologies for measuring bias. Per-base bias measurements, which rely on deep-coverage sequencing, are hypothesis-free and ideal for discovering new types of bias. Motif bias measurements, which require only shallow-coverage sequencing, are ideal for comparisons across experimental conditions and for monitoring ongoing sequencing pipeline performance at known bias-prone sequence contexts and locations. Bias motif monitoring plays a useful role by providing a critical metric in determining and ameliorating the sources of sequencing bias. Together these methodologies can be used to compare platforms, to measure the utility of combining data from multiple platforms, and to determine the extent to which coverage bias is described by the statistics of known motifs.
Results and discussion
We begin by defining the bias statistics. The fundamental statistic of coverage bias is 'relative coverage', which is defined as:
The coverage of a given reference base is computed by counting the number of read bases mapped to it in an alignment (see Materials and methods). The mean coverage is computed by averaging this value across every base in the reference. Then the relative coverage for a particular base is computed as the ratio of these values. A relative coverage of 1 indicates that a particular base is covered at the expected average rate. A relative coverage above 1 indicates higher than expected coverage and below 1 indicates lower than expected coverage.
Some reads cannot be mapped to a single locus, and the probability of ambiguous mapping increases as reads become shorter or less accurate. Ambiguous mapping is also more likely for reads that derive from repetitive or low complexity regions of the genome, including some regions with extreme GC content. To solve this problem, we rely on the aligner employing a policy of random assignment when there are multiple 'best' alignments. This provides the optimal measurement of coverage bias given the data: it is impossible to know whether specific locations are evenly represented, but we can nonetheless expect to accurately assess the coverage of classes of bases as defined by some local sequence context (for example, involving GC content, and so on). All the alignment algorithms used in this work (see Materials and methods) use this random-placement policy.
Bases having low relative coverage are of particular interest, provided that the low coverage is not an accident of sample size. For example, at 20-fold mean coverage, some bases whose 'true' relative coverage is 1 (corresponding to an expectation of 20 overlapping reads), will occasionally have measured relative coverage of 0.5 (corresponding to an observation of 10 overlapping reads), as that measurement is only off by standard deviations (based on a Poisson model). Thus, deep sequencing is required to accurately identify bases having low relative coverage.
Typically, only a small fraction of a genome has 'low' relative coverage. For example, 198-fold mean coverage of the human genome by Illumina HiSeq 2000 version 2 chemistry only left 0.23% of bases undercovered by a factor of 10 or more (data set A2). At first glance, this portion of the genome appears minuscule, but if the data were unbiased, we would expect no bases to have such a low level of coverage (more than 12 standard deviations less than the mean). Additionally, this small undercovered fraction included important loci. For example, this deep-coverage HiSeq data set contained no reads overlapping the transcription start sites of several genes associated with early development, transcriptional regulation, cell-cell adhesion, actin binding, neural development, and intracellular signaling (for an example, see Figure 1). Thus, understanding the specific nature of undercovered sequences is important. We approached this problem in two ways: by evaluating specific biologically important regions of the genome that are significantly undercovered, and by identifying specific sequence motifs that are systematically undercovered. Anecdotal results suggested that many transcription start sites or first exons in the human genome tend to have poor coverage. By a systematic analysis of these regions we defined the 1,000 with the lowest relative coverage based on low coverage by an Illumina data set, which we term the 'bad promoters' list (see Materials and methods). The bad promoters are, like many exons, GC-rich (averaging 79% GC composition).
It is well established that extreme base composition is associated with bias in multiple technologies [3,4,6,13,14,19-22,27]. In this work, we define specific base composition categories that are associated with bias, which we refer to as 'motifs'. Motif bias statistics can be measured accurately with much less data than per-base statistics (see below). They are also valuable because they can suggest underlying causes of bias that can then be investigated in laboratory experiments and can be used to track performance of attempted process improvements.
We developed a list of five bias motifs that encapsulate several common sources of coverage bias:
• GC ≤ 10%, 200-base regions in which the middle 100 bases have ≤10% GC content;
• GC ≥ 75%, 200-base regions in which the middle 100 bases have ≥75% GC content;
• GC ≥ 85%, 200-base regions in which the middle 100 bases have ≥85% GC content;
• (AT)15, 130-base regions in which the middle 30 bases are repeated AT dinucleotides;
• G|C ≥ 80%, 130-base regions in which the middle 30 bases are either 80% Gs or 80% Cs (and, therefore, match long G or C homopolymers).
For human data, we added a sixth motif based on the aforementioned list of undercovered transcription start sites: the 1,000 empirically defined 'bad promoter' 200-base intervals from the human genome (as defined above; coordinates reported in Additional file 1).
Additional file 1. The 'bad promoters' list for Human assembly 19 (GRCh 37), as described in the main text and method section, computed from HiSeq v2 data set A2; intervals are annotated with gene names and the coverage ratios used to select them (see Materials and methods for details).
Format: TXT Size: 44KB Download file
The 'special' motifs (AT)15 and G|C ≥ 80% are included based on anecdotal evidence that contig breaks in assemblies are frequently associated with these motifs. The extents of all the motifs in the reference genomes studied in this paper are presented in Table 1. The decision to attend to regions of 100 to 200 bases was an empirical choice influenced by considerations such as the distribution of fragment sizes in our Illumina libraries. Computing our statistics using larger or smaller regions might make different biases apparent depending on the properties of the assayed data set.
Table 1. Genomes and motifs
These motifs focus on known trouble spots. Because GC base composition is frequently implicated in coverage bias, it is also useful to measure the relative coverage across the entire GC spectrum by grouping all 100-base sliding windows across the genome by their GC content and reporting the average relative coverage for each GC-content percentage (in effect defining a motif for each percentage). The results can be presented as a GC-bias plot, as exemplified in Figure 2. Unbiased sequencing would be unaffected by GC composition, resulting in a flat line along relative coverage = 1.
Figure 2. GC-bias plots for three microbial genomes. Top: plots showing the relative coverage GC-bias for Illumina MiSeq, Ion Torrent PGM, and Pacific Biosciences RS on the P. falciparum (19% GC), E. coli (51%), and R. sphaeroides (69%) genomes (Table 2, data sets 1 to 9). Unbiased coverage would be represented by a horizontal line at a relative coverage = 1 (black dashed line). Relative coverage is only plotted for GC percentages for which there are at least 1,000 100-base windows in the genome. Bottom: the GC composition distribution of each genome.
Because motifs are typically represented by many loci in a genome, the number of reads incident upon a motif is much larger than the number of reads incident upon a single base, and hence the relative coverage of a motif (that is, the mean of the relative coverages of its constituent bases) can be accurately measured even with low sequencing coverage.
Table 2 presents relative coverage of the six motifs across 16 data sets. From data set 14 (120-fold coverage of Illumina from the human genome) we also chose ten 0.5-fold random subsets, in each case computing the relative coverage across the motifs. For each motif, we show (data set 14') the mean of these ten measurements, which for all motifs were within 0.01 of the full sample value, and the observed standard deviation, which for all motifs was approximately 0.02 or less. This shows that for the human genome, relative coverage of the six motifs can be accurately assayed using low coverage.
Table 2. Data sets and their relative coverage on bias motifs
Comparing bias across technologies
Bias in a GC-spanning set of microbes
To assess the bias profile across technologies efficiently, we generated data from three microbial genomes that together span a wide range of GC base composition: Plasmodium falciparum (mean 19% GC), Escherichia coli (51%) and Rhodobacter sphaeroides (69%). All three genomes have finished reference sequences, thus facilitating a definitive analysis (see Materials and methods). Only data from Illumina (MiSeq), Ion Torrent and Pacific Biosciences were examined, not from Complete Genomics, which generates only human sequencing data. For all analyses, we note that although results are categorized by sequencing technology, in fact bias can also be introduced by library construction, and that disentanglement of these variables would require additional experiments. For the following bacterial genome analysis, Illumina libraries were made following a low-input variation of the protocol detailed in Fisher et al. , modified with Kapa Biosystems reagents (see Materials and methods), and both Ion Torrent and Pacific Bioscience libraries were generated using the respective manufacturers' reagents and recommended protocols (see Materials and methods).
We first asked how much of each of the three genomes was undercovered by each of the three technologies (Table 3, 1 to 9, italics), ensuring comparability by downsampling each data set to 100-fold coverage, and testing several levels of relative coverage (0.5, 0.25, 0.1 and no coverage). While modest variation was seen for E. coli on all three platforms, the results for the GC-extreme genomes were striking. For example, the fraction of the GC-poor P. falciparum genome that had relative coverage ≤0.25 (that is, four-fold undercovered or worse) ranged from 0.33% in Pacific Biosciences data (best) to 3.7% in Illumina data to 22% in Ion Torrent data (worst). In the GC-rich R. sphaeroides genome, the four-fold undercoverage fractions were 0.0071% for Pacific Biosciences (best), 0.39% for Illumina, and 36% for Ion Torrent (worst). The better performance of Pacific Biosciences is probably attributable to the lack of any amplification in their process (compare [20,21]).
Table 3. Percentage of undercovered microbial genome given 100× coverage
Next, to better understand what parts of the genome were undercovered, we generated GC-bias plots (Figure 2), showing relative coverage at each GC level (and for context, the fraction of the genome at each level). These plots provide fine detail but also mirror the preceding conclusions, exhibiting the same hierarchy at GC extremes. For example, Ion Torrent coverage dropped severely below 10% and above 75% GC. On the other hand, all three technologies provided nearly even coverage of the moderate-GC range (30 to 70%) in E. coli. At the lowest GC, even Pacific Biosciences showed approximately two-fold coverage reduction, perhaps attributable to dissociation of fragment ends in adapter ligation, a phenomenon that could apply to all three technologies.
Finally, Table 2 (data sets 1 to 9) presents the relative coverage of the previously described motifs, although not all are present in each sample (the G|C ≥ 80% motif is absent in all of the microbes, and the set of bad promoters was only defined for the human genome). We note that the single statistic of relative coverage for the GC ≥ 85% motif provided a suitable assay for bias on R. sphaeroides, with Pacific Biosciences scoring 0.87 (best), Illumina 0.60 and Ion Torrent 0.10 (worst), while GC ≥ 75% did not clearly distinguish between Illumina and Pacific Biosciences data. The GC ≤ 10% motif was similarly useful for P. falciparum, with Pacific Biosciences scoring 0.89 (best), Illumina 0.58, and Ion Torrent 0.39 (worst). For these data, the (AT)15 motif also stood out, with Pacific Biosciences at 0.85, Illumina at 0.43, and Ion Torrent at 0.11. Importantly, just these few statistics provided a meaningful readout on the performance of the different technologies.
Bias on human samples
The human genome is far larger and more complex than the previously analyzed microbes and contains many examples of all of the motifs, as well as the 1,000 bad promoters (Table 1). We generated slightly more than one-fold coverage on the Ion Torrent PGM platform and 120-fold coverage on Illumina HiSeq. We also analyzed a 79-fold coverage data set generated by Complete Genomics. Complete Genomics sequencing, like Illumina and Ion Torrent, uses amplification in its process. We did not analyze the performance of Pacific Biosciences on human samples because, at the time of these experiments, the system's throughput made it impractical to generate sufficient coverage. To maximize comparability and avoid misinterpreting biological variation as sequencing variation, all data sets utilized the well-studied NA12878 sample  and were aligned to the Human Genome Assembly 19 (GRCh37) reference.
Table 2 and Figure 3 show the motif results and bias curves comparing Illumina HiSeq (data set 14), Ion Torrent PGM (data set 15), and Complete Genomics (data set 16) coverage of NA12878. The HiSeq libraries were prepared using the low-input Fisher et al. protocol  modified with Kapa Biosystems reagents (see Materials and methods), the other libraries used the manufacturers' standard protocols (see Materials and methods). We use data set 14 to represent HiSeq performance, rather than the other HiSeq human data sets in Table 2, because it represents our current best Illumina library construction protocol. Of the data sets tested, the bias curves clearly suggest that the Illumina HiSeq data provided the most even coverage of the human genome. Complete Genomics coverage dropped more severely at both GC extremes and only provided 0.092 relative coverage of the bad promoters, compared to 0.36 relative coverage by HiSeq. The Ion Torrent coverage dropped even more quickly than Complete Genomics as GC increased and only provided 0.046 relative coverage of the bad promoters. Ion Torrent also had the worst performance of these three data sets on the (AT)15 and G|C ≥ 80% motifs.
Figure 3. GC-bias plots for the human genome. Left: the GC composition distribution of the human genome (HG19, GRCh37). Center and right: GC-bias plots for several data sets from human NA12878. Unbiased coverage would be represented by a horizontal line at relative coverage = 1. Center: HiSeq v3 with sample-preparation reagents from Kapa Biosystems (Table 2, data set 14), Ion Torrent PGM (data set 15), and Complete Genomics data (data set 16). Right: HiSeq v3 with sample-preparation reagents from Kapa Biosystems (data set 14, as in center panel) and HiSeq v3 with the standard Fisher et al.  reagents (data set 13). Note that Illumina relative coverage exceeded the y-axis above 93% GC content. Relative coverage is only plotted for GC percentages for which there are at least 1,000 100-base windows in the genome.
In Table 2 we can also see how updates to the Illumina HiSeq platform have affected bias. Notably, the HiSeq version 3 data (data sets 13 and 14) show better coverage of high-GC motifs and the bad promoters compared to the HiSeq version 2 data (data sets 10 to 12). We have also compared the standard list of bad promoters, computed from HiSeq version 2 data, to a new list computed from HiSeq version 3 data (see Materials and methods and Additional files 1 and 2 for details). The lists have 47% of their bases in common, which indicates that many bad promoters are still resistant to sequencing despite Illumina's improvements.
Additional file 2. The 'bad promoters' list for Human assembly 19 (GRCh 37), as described in the main text and method section, computed from HiSeq v3 data set A3; intervals are annotated with gene names and the coverage ratios used to select them (see Materials and methods for details).
Format: TXT Size: 43KB Download file
The inter-platform GC-bias comparisons on human and microbial samples presented above are broadly compatible with previously published work [3,5]. However, we clearly observed more bias between 60% and 70% GC on R. sphaeroides in Ion Torrent data than on MiSeq data, while Liu et al.  found the reverse when comparing Ion Torrent to HiSeq. Our R. sphaeroides results are compatible with the results reported by Ion Torrent for the high-GC Rhodopseudomonas palustris genome .
Comparing bias across libraries
Library construction methods affect evenness of coverage [20-22]. Table 2 includes human Illumina data produced using the methods described in Aird et al.  that are illustrative of this, showing a striking improvement at high GC when the PCR enzyme Phusion HF (data set 10) was supplemented by betaine (data set 11) or replaced by AccuPrime Taq HiFi (data set 12). Figure 3 shows a marked flattening of relative coverage between 15% and 70% GC when we replaced some reagents in the low-input Fisher et al. protocol (data set 13)  with reagents from Kapa Biosystems (data set 14) (see Materials and methods), although the large improvement at low-GC was partly offset by a small decline in high-GC coverage (Figure 3, Table 2). Oyola et al.  achieved a similar improvement in low-GC coverage of P. falciparum by utilizing Kapa HiFi enzymes and the PCR additive tetramethylammonium chloride in library construction.
It is also true that there can be variation in bias between 'technical replicates', data sets created from the same sample using the same protocols. For example, the HiSeq 'Kapa' human data set (data set 14) was created from three libraries and sequenced in fourteen lanes on two flowcells, with no deliberate variation in protocol at any point. Yet when bias statistics are computed lane-by-lane, one sees substantial variation in bias between libraries, and between flowcells - although not between lanes from the same library and flowcell (Table 4). Most notable is the between-flowcell variation of the G|C ≥ 80% motif, which is approximately two-fold undercovered in the first flowcell, but very well covered in the second. Possible sources of unexplained variation include variability of library construction instantiations, cluster amplification devices (cBot), flowcells, and HiSeq instruments that were used. Although variations between technical replicates are of interest, they are, for the most part, smaller than those observed between platforms.
Table 4. Per-lane bias statistics for Illumina HiSeq (Kapa) human NA12878
It is now possible to create 'PCR-free' Illumina libraries, in which there is no DNA amplification prior to cluster generation and sequencing. A comparison of libraries prepared with our standard Fisher et al. protocol and a PCR-free protocol (Table S1 in Additional file 3) reveals that the PCR-free libraries lead to less bias across all bias motifs on P. falciparum, E. coli, and R. sphaeroides samples. On human samples, PCR-free library construction produced improved coverage of all motifs except for GC ≥ 75% and G|C ≥ 80%. Additionally, the bad promoters, although improved, were still two-fold undercovered. These results suggest that PCR-free library construction reduces, but does not cure, coverage bias.
Format: DOCX Size: 89KB Download file
Combining the outputs of multiple sequencing technologies might create a composite data set whose overall bias is reduced. Two technologies provide complementary coverage if, on the same sample, they tend to fill in each other's low-coverage regions. Complementary technology mixtures should have bias statistics that are better than either one of the components. Precedent for this approach stretches back to the practice of combining data from dye-terminator and dye-primer chemistries in Sanger sequencing to reduce error biases . Note that there can be other benefits from mixing technologies, by taking advantage of a broader range of complementary properties (and not just bias). For example, for genome assembly there are benefits from combining the long, relatively unbiased but lower accuracy reads from Pacific Biosciences with shorter Illumina reads that provide per-base accuracy [34-36].
To evaluate complementarity, we created mixed-technology microbial data sets for each possible platform pairing (MiSeq and Ion Torrent, Ion Torrent and Pacific Biosciences, Pacific Biosciences and MiSeq) using the previously described data sets (data sets 1 to 9). Each pairing consisted of 100-fold total coverage, composed of 50-fold randomly sampled coverage from each component technology. Then we measured the fraction of each genome that fell beneath several relative coverage thresholds, comparing those results to the undercoverage values from 100-fold 'pure' coverage from the component technologies (Table 3). If the coverage biases were complementary, we would expect that the undercoverage fractions from the mixed data sets would be smaller than those measured in the component pure data sets. This did happen in some cases. For E. coli, using a mixture of Illumina and Ion Torrent data, the two-fold undercovered fraction was 0.075%, compared to 0.54% and 0.27%, respectively, for the two technologies taken separately. Similar improvements occurred for E. coli with other platform combinations. However, for the other organisms, for the technologies tested, combining data did not reduce the overall level of bias. In most cases, one technology had much lower bias than the other and mixing tended to result in an intermediate level of bias. Therefore, in these cases, mixing provided no coverage benefit; lower bias could have been achieved by only using data from the lower bias technology.
While coverage bias is an important sequencing metric, it ignores possible variations in sequence accuracy. For many applications, decreases in accuracy could offset the advantages of better relative coverage in difficult regions. To compare between platforms and assess the influence of sequence context, Figure 4 plots the mismatch, deletion, and insertion rates on P. falciparum, R. sphaeroides, and human for the four surveyed technologies, as a function of GC content, whereas Figure 5 plots the same as a function of homopolymer length. A logarithmic scale is used to facilitate comparison between technologies and between error types because rates vary greatly. Table 5 lists the genome-wide error rates for the four platforms. For human, the reported errors include bona fide differences between the NA12878 sample and the reference sequence, and hence the error rates were somewhat inflated. When Illumina NA12878 data (data set 14) were aligned to an NA12878-specific reference , the mismatch rate declined by 40%, and the indel rate declined by 80% (Table S2 in Additional file 3). Because of their larger magnitude, a similar experiment yielded no substantial change in the Ion Torrent error rates.
Figure 4. Error rates as a function of GC composition. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) as a function of GC composition (x-axis). Data are shown for the Ion Torrent PGM from three organisms (P. falciparum, R. sphaeroides, and human), for the Illumina MiSeq on the two microbes, for the Illumina HiSeq on human, for Pacific Biosciences from the two microbes and from Complete Genomics for human (Table 2, data sets 1 to 3, 7 to 9, and 14 to 16). For human we note that bona fide differences between the sample and the reference sequence were recorded as errors. Error rates are only plotted for GC percentages for which there are at least 1,000 100-base windows in the genome.
Figure 5. Error rates as a function of homopolymer length. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) within homopolymers of various lengths (x-axis). Data are plotted from P. falciparum and human as available (Table 2, data sets 1 to 3 and 14 to 16). For human we note that bona fide differences between the sample and the reference sequence were recorded as errors.
Table 5. Sequencing technology error rates
Briefly, while the details depend on the technology, these plots document changes in error rates at GC extremes and on long homopolymers, for every technology. For example, Illumina, which had very low insertion and deletion error rates, had a substantial rise in insertions and deletion rates at both GC extremes. The Ion Torrent insertion and deletion rates were more consistent, albeit higher than Illumina's, across a range of GC contents, but the mismatch rate was elevated at low and high-GC regions. As another example, we note that for Pacific Biosciences, the deletion rate rose at high GC, while the insertion rate declined. This behavior appears to result from lower signal-to-noise ratios for the dyes attached to G and C bases in C1 chemistry (personal communication, Edwin Hauw, Pacific Biosciences, USA). Complete Genomics showed consistent (relatively high) mismatch and (relatively low) insertion rates across the GC spectrum, but the deletion rate rose substantially at the extremes. Within long homopolymers, the behavior of insertion and deletion errors would depend on whether a technology systematically over- or under-reports homopolymer length. For example, as homopolymer lengths increased, Ion Torrent showed an increased deletion rate, but the insertion rate stayed about the same. In contrast, the insertion and deletion rates of Illumina data increased in longer homopolymers, which is consistent with their behavior in GC-extreme regions. In the Illumina and Ion Torrent human data, these trends were unchanged when the data were realigned to a sample-specific reference  that accounted for known biological variations (Figure S1 in Additional file 4, Figure S2 in Additional file 5). Similarly consistent with GC behavior are the decrease in insertions and increase in deletions observed in Pacific Biosciences data in long homopolymers. In general, the sequence-context dependence of error rates varied considerably from technology to technology.
Additional file 4. Figure S1 - Human error rates as a function of GC composition and reference. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) as a function of GC composition (x-axis). Data are shown for the human NA12878 sample sequenced by Illumina HiSeq (Table 2, data set 14) and Ion Torrent PGM (Table 2, data set 15) aligned both to the standard Human assembly 19 (GRCh37) reference and to the NA12878-specific diploid reference created by the Gerstein lab . Error rates are only plotted for GC percentages for which there are at least 1,000 100-base windows in Human assembly 19.
Format: PDF Size: 144KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional file 5. Figure S2 - Human error rates as a function of homopolymer length and reference. Each graph shows mismatch (light blue), deletion (dark blue), and insertion (maroon) rates (y-axis) within homopolymers of various lengths (x-axis). Data are plotted from human sample NA12878 as sequenced by Illumina HiSeq (Table 2, data set 14) and Ion Torrent PGM (Table 2, data set 15) and aligned both to the standard Human assembly 19 (GRCh37) reference and to the NA12878-specific diploid reference created by the Gerstein lab .
Format: PDF Size: 63KB Download file
This file can be viewed with: Adobe Acrobat Reader
PCR amplification in library construction is a source of error in sequencing data [38-40]. In a matched comparison, we found that our production libraries had lower error rates than a PCR-free protocol on E. coli and human samples, and only a slight increase in error rate on R. sphaeroides and P. falciparum (Table S3 in Additional file 3), possibly due to their extreme base composition.
Discovering uncategorized bias
Finally, with the goal of understanding bias in the human genome that was not explained by our motifs, we generated >100-fold coverage of NA12878 using Illumina HiSeq data, from libraries generated with Kapa Biosystems reagents (Table 2, data set 14). We note that some apparently low or missing coverage will be due to true biological differences, including sequences that are present in the reference but not in NA12878. However, we used other deeply sequenced data sets and an assembly-based analysis to filter out many of these variant loci, as described below.
Initially we identified 5.5 Mb of the human reference sequence (HG19) having 0.1 or less relative coverage. If the data were unbiased, then 0.1 relative coverage would be more than 9 standard deviations from the expected coverage at each base. Therefore, we would expect no bases in the human genome to have such low coverage in the absence of sequencing bias. We then applied two filters to this 'undercovered set' to remove sequence that is unlikely to be present in the NA12878 genome (see Materials and methods). These filters, one based on analysis of the NA12878 assembly and the other based on a comparisons between NA12878 and a diverse population of other samples, excluded 8.7% (23 Mb) of the autosomal reference from further consideration. After this filtering, 3.6 Mb of undercovered reference genome remained.
Finally, because we were interested in discovering new bias contexts, we excluded regions that were similar (but not necessarily identical) to previously known motifs. Similarity was defined by matching at least one of the following motifs:
• GC ≤ 13%, 200-base regions in which the middle 100 bases have ≤13% GC content (a superset of the GC ≤ 10% motif);
• GC ≥ 70%, 200-base regions in which the middle 100 bases have ≥70% GC content (a superset of the GC ≥ 75% and GC ≥ 85% motifs);
• (AT)10, 130-base regions in which the middle 20 bases are repeated AT dinucleotides (a superset of the (AT)15 motif);
• G|C ≥ 75%, 130-base regions in which the middle 30 bases are either 75% Gs or 75% Cs (a superset of the G|C ≥ 80% motif);
• the list of 1,000 'bad promoters'.
Except for the bad promoters, which were unaltered, these generalized motifs were selected to each cover roughly twice as many bases as their equivalents in the original motif list. Together they covered 2.8% (74 Mb) of the autosomal bases in HG19. The generalized motifs included 7.5% (1.7 Mb) of the bases previously excluded as probable biological variations. This enrichment may indicate that the biological variation filters excluded bases whose low coverage had a non-biological origin, or it may indicate a correlation between bias motifs and sites with high mutation rates.
Filtering out the probable biological differences between the sample and the reference and the areas similar to known motifs excluded 78% of the ten-fold undercovered locations in HG19. The remaining 35,389 undercovered intervals represented 0.045% (1.2 Mb) of the human autosomal reference genome with an N50 interval size of 98 bp.
Performance on this fraction is hidden from our monitoring methods by its dissimilarity with the current set of motifs. On the Illumina HiSeq 'Kapa' data set, these bases had mean relative coverage of 0.037. They also suffered from high error rates - a mismatch rate of 0.020 (6.7 times the whole-genome average), a deletion rate of 0.11 (470 times the whole-genome average), and an insertion rate of 0.0021 (12 times the whole-genome average). The high deletion rate suggests that some of the undercoverage may have been due to short biological deletions in NA12878 relative to the reference sequence, but even if all the deletions originated in the sample, these regions would still be more than ten-fold undercovered. Their GC-content and homopolymer distributions did not differ appreciably from the overall genome (Figure 6). Clearly, these regions were either exceptionally resistant to the Illumina HiSeq technology or are places where the reference is inaccurate for NA12878 or for human samples generally. A list of the intervals' coordinates, GC content, and homopolymer N50 statistics are included in Additional file 6.
Figure 6. GC and homopolymer distributions of uncharacterized Illumina undercoverage of human sample NA12878. The graphs show the distribution of GC-content and homopolymer length for bases in the overall human genome and in the genome intervals that are ten-fold undercovered but which were not explained by known sequence biases or differences between the sample and reference sequence. Data are from Table 2, data set 14.
Additional file 6. The intervals of the human reference that had less than 0.1 relative coverage in data set 14 and could not be categorized as biological variations or as similar to known bias motifs. Also included are the GC content fraction and homopolymer N50 for each interval.
Format: CSV Size: 1.2MB Download file