Overview of sequencing and annotation for a whole-genome shotgun project, such as the sequencing of a bacterial genome. (a) Genomic DNA is purified, broken into short fragments and cloned into E. coli. The cloned fragments are then sequenced from both ends on an automated sequencing machine. (b) The resulting sequences (shown as they appear on the sequencing machine display) are assembled by a complex software program that identifies overlaps between them and merges them into (c) large, contiguous sequences representing the chromosomes of the original DNA. Gaps are filled until the genome is complete. (d) Annotation begins with the execution of several gene-finding programs, such as Glimmer, which identifies protein-coding genes, tRNAscan-SE, which identifies tRNAs, and other programs for other genome features. (e) These initial predictions are used as the basis for BLAST searches against large protein databases, which identify related proteins based on sequence similarity. Translated (BLASTX) searches are then used to scan the databases for any proteins that match the DNA regions between predicted genes. Customized annotation programs then decide what name and function to assign to each protein, leading to (f) the final annotated genome.
Salzberg Genome Biology 2007 8:102 doi:10.1186/gb-2007-8-1-102
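The overlap-based assembly step from (b) to (c) can be illustrated with a minimal sketch. This is a toy greedy merge and not the algorithm of any particular assembler: the function names `overlap` and `greedy_assemble`, the `min_len` threshold and the example reads are all hypothetical, and the sketch assumes error-free reads from a single strand, ignoring reverse complements, paired-end information, repeats and reads contained entirely within other reads.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no such overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for k in range(max_len, min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def greedy_assemble(reads, min_len: int = 3):
    """Repeatedly merge the pair of sequences with the longest overlap.

    Stops when no remaining pair overlaps by at least `min_len` bases and
    returns the resulting list of contigs.
    """
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i == j:
                    continue
                k = overlap(a, b, min_len)
                if k > best_len:
                    best_len, best_i, best_j = k, i, j
        if best_len == 0:
            return contigs
        # Join the two sequences, keeping the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for idx, c in enumerate(contigs)
                   if idx not in (best_i, best_j)] + [merged]


# Hypothetical reads, invented for illustration only.
reads = ["ATGGCGT", "GCGTACC", "TACCTGA"]
contigs = greedy_assemble(reads)
print(contigs)
```

With these three made-up reads, each overlapping the next by four bases, the sketch merges them into a single contig. Sets of reads with no sufficient overlap are left as separate contigs, which corresponds to the gaps mentioned in the caption.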