A Drosophila full-length cDNA resource
1 Berkeley Drosophila Genome Project, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
2 Genome Sciences Department, Lawrence Berkeley National Laboratory, Berkeley, CA 94720, USA
3 Department of Molecular and Cell Biology, University of California, Berkeley, CA 94720, USA
4 Howard Hughes Medical Institute, University of California, Berkeley, CA 94720, USA
5 Current addresses: Incyte Genomics, 3160 Porter Drive, Palo Alto, CA 94304, USA
6 Applied Biosystems, 850 Lincoln Centre Drive, Foster City, CA 94404, USA
7 Department of Bioinformatics and Computational Biology, Iowa State University, Ames, IA 50011, USA
Genome Biology 2002, 3:research0080-0080.8 doi:10.1186/gb-2002-3-12-research0080
This article is part of a series of refereed research articles from Berkeley Drosophila Genome Project, FlyBase and colleagues, describing Release 3 of the Drosophila genome, which are freely available at http://genomebiology.com/drosophila/.
Published: 23 December 2002
A collection of sequenced full-length cDNAs is an important resource both for functional genomics studies and for the determination of the intron-exon structure of genes. Providing this resource to the Drosophila melanogaster research community has been a long-term goal of the Berkeley Drosophila Genome Project. We have previously described the Drosophila Gene Collection (DGC), a set of putative full-length cDNAs that was produced by generating and analyzing over 250,000 expressed sequence tags (ESTs) derived from a variety of tissues and developmental stages.
We have generated high-quality full-insert sequence for 8,921 clones in the DGC. We compared the sequence of these clones to the annotated Release 3 genomic sequence, and identified more than 5,300 cDNAs that contain a complete and accurate protein-coding sequence. This corresponds to at least one splice form for 40% of the predicted D. melanogaster genes. We also identified potential new cases of RNA editing.
We show that comparison of cDNA sequences to a high-quality annotated genomic sequence is an effective approach to identifying and eliminating defective clones from a cDNA collection, thereby ensuring its utility for experimentation. Clones were eliminated either because they carry single nucleotide discrepancies, which most probably result from reverse transcriptase errors, or because they are truncated and contain only part of the protein-coding sequence.
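The two reasons for eliminating a clone can be illustrated with a minimal sketch. It assumes the coding sequence of each clone and the corresponding annotated coding sequence are already available as plain strings; the function name `classify_clone` and the category labels are hypothetical and are not taken from the pipeline described in this article, which works from alignments to the annotated genomic sequence rather than from direct string comparison.

```python
def classify_clone(clone_cds: str, annotated_cds: str) -> str:
    """Classify a clone's coding sequence against the annotated one.

    Hypothetical categories, for illustration only:
      "complete"    - identical to the annotated coding sequence
      "truncated"   - an exact but shorter part of the annotated sequence
      "discrepancy" - same length, one or more nucleotide mismatches
      "other"       - anything else (e.g. a different splice form)
    """
    clone_cds = clone_cds.upper()
    annotated_cds = annotated_cds.upper()

    if clone_cds == annotated_cds:
        return "complete"

    # A truncated clone contains only part of the protein-coding sequence.
    if len(clone_cds) < len(annotated_cds) and clone_cds in annotated_cds:
        return "truncated"

    # Equal length but not identical: at least one position differs.
    if len(clone_cds) == len(annotated_cds):
        return "discrepancy"

    return "other"


annotated = "ATGGCTAAATGA"
labels = [
    classify_clone("ATGGCTAAATGA", annotated),  # identical
    classify_clone("GCTAAATGA", annotated),     # lacks the 5' end
    classify_clone("ATGGCTAAGTGA", annotated),  # one substituted base
]
```

In this simplified form only clones labelled "complete" would be retained; a real comparison would also need to handle alternative splice forms and genuine polymorphisms, which the sketch groups under "other" or "discrepancy".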