Gene-finding strategies. Given a genomic DNA sequence, information on the location of genes and transcripts can be obtained from different sources: conservation with one or more informant genomes (1); intrinsic signals involved in gene specification, such as start and stop codons and splice sites (2); the statistical properties of coding sequences (3); and, most importantly, known transcript sequences (either full-length cDNAs or partial ESTs) and protein sequences (4). Over the past two decades, a plethora of programs and strategies has been developed to combine these sources of information to obtain reliable gene predictions. The 'intrinsic' evidence from sequence signals and statistical bias can be combined, using a variety of frameworks often related to hidden Markov models, to produce gene predictions (6). These programs are often referred to as ab initio or de novo gene finders. They are the programs of choice in the absence of known transcript or protein sequences or phylogenetically related genomes. If related genome sequences are available, the intrinsic information can be combined with patterns of genomic sequence conservation using programs often referred to as comparative (or dual- or multi-genome) gene finders (5). With these programs, maximum resolution is achieved when the compared genomes are at a phylogenetic distance such that there is maximum separation between the conservation in coding and noncoding regions. To increase resolution, programs have been developed that use multiple informant genomes. The most sophisticated use an underlying phylogenetic tree to appropriately weight sequence conservation depending on evolutionary distance. If cDNA and EST sequences are available, these often take priority over other sources of information. 
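As a rough illustration of how an ab initio gene finder (6) can use a hidden Markov model to turn statistical bias into a prediction, the sketch below labels each base of a sequence as coding or noncoding with the Viterbi algorithm. It is a deliberately minimal two-state toy: all probabilities are hypothetical values chosen for illustration, and real gene finders use much richer models that include codon structure, splice sites, and start and stop codons.

```python
import math

# Hypothetical toy parameters, chosen for illustration only (not from any real gene finder).
STATES = ("noncoding", "coding")
START = {"noncoding": 0.9, "coding": 0.1}
TRANS = {
    "noncoding": {"noncoding": 0.9, "coding": 0.1},
    "coding": {"noncoding": 0.1, "coding": 0.9},
}
# Assumption: coding sequence is GC-rich and noncoding sequence is AT-rich.
EMIT = {
    "noncoding": {"A": 0.35, "C": 0.15, "G": 0.15, "T": 0.35},
    "coding": {"A": 0.15, "C": 0.35, "G": 0.35, "T": 0.15},
}


def viterbi(seq):
    """Return the most probable state path for seq under the toy HMM."""
    if not seq:
        return []
    # scores[s] is the best log-probability of any path ending in state s.
    scores = {s: math.log(START[s]) + math.log(EMIT[s][seq[0]]) for s in STATES}
    backpointers = []
    for base in seq[1:]:
        new_scores, pointers = {}, {}
        for s in STATES:
            # Best predecessor state for arriving in s at this position.
            prev = max(STATES, key=lambda p: scores[p] + math.log(TRANS[p][s]))
            new_scores[s] = (
                scores[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][base])
            )
            pointers[s] = prev
        scores = new_scores
        backpointers.append(pointers)
    # Trace back from the best final state to recover the full path.
    state = max(STATES, key=lambda s: scores[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return path


toy_sequence = "ATATATATAT" + "GCGCGCGCGC" + "ATATATATAT"
labels = viterbi(toy_sequence)
```

With these toy parameters, the GC-rich stretch in the middle of `toy_sequence` is labeled as coding and the flanking AT-rich stretches as noncoding, because the gain in emission probability over ten bases outweighs the cost of the two state transitions.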
The initial map of the transcript or protein sequences onto the genome, which can be obtained using a variety of tools, including sequence-similarity searches, is refined using more sophisticated 'splice alignment' algorithms, whose explicit splice-site models allow more precise alignment across gaps corresponding to introns (8). Alternatively, cDNA and protein information can be fed into an ab initio gene-finder algorithm to give information on the exons included in the prediction (7). Often, cDNA and protein evidence is only partial; in such cases, the initial reliable gene and transcript set may be extended with more hypothetical models derived from ab initio or comparative gene finders, or from the genome mapping of cDNA and protein sequences from other species. Pipelines have been developed that automate this multi-step process (9). More recently, programs have been developed that combine the output of many individual gene finders (10). The underlying assumption in these 'combiners' is that consensus across programs increases the likelihood of the predictions. Thus, predictions are weighted according to the particular features of the program producing them. The most general frameworks allow the integration of a great variety of types of predictions: not only gene predictions, but also predictions of individual sites and exons. Despite all the developments in computational gene finding, the most reliable and complete gene annotations are still obtained after the initial alignments of cDNA and proteins onto the genome sequence are inspected manually to establish the exon boundaries of genes and transcripts (11). This is the task carried out by the HAVANA team at the Sanger Institute. The initial manual annotation can be refined even further by subsequent experimental verification of those transcript models lacking sufficiently strong evidence, as in the GENCODE project (12). 
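The idea behind the 'combiners' (10) can be sketched as a weighted vote over genome positions. The example below is only a minimal sketch under stated assumptions, not the method of any particular program: the function name `combine_predictions`, the program names, the weights, and the threshold are all hypothetical, exons are given as half-open `(start, end)` intervals, and each program's exons are assumed not to overlap one another.

```python
def combine_predictions(predictions, weights, seq_length, threshold=0.5):
    """Return consensus exon intervals from several programs' predictions.

    predictions: dict mapping program name -> list of (start, end) half-open
                 exon intervals (assumed non-overlapping within a program).
    weights:     dict mapping program name -> non-negative weight (hypothetical).
    A position is kept when the weighted fraction of programs that call it
    exonic is at least `threshold`.
    """
    total = sum(weights[name] for name in predictions)
    if total <= 0:
        return []
    # Accumulate the weight of every program that calls each position exonic.
    support = [0.0] * seq_length
    for name, exons in predictions.items():
        for start, end in exons:
            for pos in range(max(start, 0), min(end, seq_length)):
                support[pos] += weights[name]
    # Merge consecutive supported positions into intervals.
    consensus = []
    run_start = None
    for pos in range(seq_length + 1):
        supported = pos < seq_length and support[pos] / total >= threshold
        if supported and run_start is None:
            run_start = pos
        elif not supported and run_start is not None:
            consensus.append((run_start, pos))
            run_start = None
    return consensus


# Hypothetical output of three gene finders on a 100-base sequence.
example_predictions = {
    "finder_a": [(10, 20), (40, 50)],
    "finder_b": [(12, 20), (60, 70)],
    "finder_c": [(10, 18), (40, 50)],
}
example_weights = {"finder_a": 3, "finder_b": 2, "finder_c": 1}
consensus_exons = combine_predictions(
    example_predictions, example_weights, seq_length=100, threshold=0.6
)
```

Here the exon at 60-70 is dropped because only `finder_b` supports it, whereas the regions 10-20 and 40-50 are kept because the programs that agree on them carry at least 60% of the total weight. Real combiners typically learn such weights from training data and can also integrate predictions of individual sites and exons, which this sketch does not attempt.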
Examples of gene-prediction programs (with references and URLs) corresponding to each strategy outlined here are provided in Additional data file 1.
Harrow et al. Genome Biology 2009 10:201 doi:10.1186/gb-2009-10-1-201