Quality scores are of limited use in predicting the accuracy of unknown sequences. The quality scores reported by the GS20 software correlate with decreased confidence in calling the correct homopolymer length rather than with the accuracy of the called bases. (a, b) The average quality score of a read decreases as the number of errors in the read increases. (c) The average quality score as a function of position in the homopolymer: as the length of the homopolymer increases, the quality scores decrease for both correctly and incorrectly called bases. (d) The average quality scores of perfect reads containing differing numbers of homopolymers: the average quality score decreases as the number of homopolymers increases. Our sequences contain only short homopolymers, primarily 3-mers. As the length and frequency of homopolymers increase, the expected quality scores will decrease. Without a priori knowledge of the number and length of homopolymers in a particular read, it will be difficult to choose an appropriate quality threshold: a low threshold may not cull data adequately, and a high threshold may remove homopolymeric regions.
Huse et al. Genome Biology 2007 8:R143 doi:10.1186/gb-2007-8-7-r143
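
The thresholding problem described in the caption can be illustrated with a minimal Python sketch. It is not the GS20 software's method or the authors' analysis code: the function names (`homopolymer_runs`, `passes_threshold`), the two reads, their quality values, and the threshold of 26 are all hypothetical, chosen only to show how a read with no errors can fall below an average-quality cutoff because it contains homopolymers.

```python
from itertools import groupby
from statistics import mean


def homopolymer_runs(seq, min_len=3):
    """Return (base, length) for each run of identical bases of at least min_len."""
    runs = []
    for base, group in groupby(seq):
        length = sum(1 for _ in group)
        if length >= min_len:
            runs.append((base, length))
    return runs


def passes_threshold(quals, threshold):
    """True if the read's average quality score is at least the threshold."""
    return mean(quals) >= threshold


# Hypothetical reads, both assumed to be error-free. The quality values are
# made up: they drop along each 3-mer to mimic the trend described in panel (c).
reads = {
    "no_homopolymer": ("ACGTACGTAC", [27] * 10),
    "two_3mers": ("ACGGGTACCC", [27, 27, 27, 24, 20, 27, 27, 27, 24, 20]),
}

THRESHOLD = 26  # hypothetical cutoff, not a recommended value

for name, (seq, quals) in reads.items():
    print(name, homopolymer_runs(seq), mean(quals), passes_threshold(quals, THRESHOLD))
```

With these made-up values, the read containing two 3-mers is discarded even though every base is assumed correct, which is the risk of a high threshold noted in the caption. Lowering the cutoff keeps that read but would also retain more reads whose low average reflects real errors.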