Functional annotation of differentially expressed genes is a necessary and critical step in the analysis of microarray data. The distributed nature of biological knowledge frequently requires researchers to navigate through numerous web-accessible databases gathering information one gene at a time. A more judicious approach is to provide query-based access to an integrated database that disseminates biologically rich information across large datasets and displays graphic summaries of functional information.
The Database for Annotation, Visualization, and Integrated Discovery (DAVID; http://www.david.niaid.nih.gov) addresses this need via four web-based analysis modules: 1) Annotation Tool - rapidly appends descriptive data from several public databases to lists of genes; 2) GoCharts - assigns genes to Gene Ontology functional categories based on user-selected classifications and term specificity level; 3) KeggCharts - assigns genes to KEGG metabolic processes and enables users to view genes in the context of biochemical pathway maps; and 4) DomainCharts - groups genes according to PFAM conserved protein domains.
Analysis results and graphical displays remain dynamically linked to primary data and external data repositories, thereby furnishing in-depth as well as broad-based data coverage. The functionality provided by DAVID accelerates the analysis of genome-scale datasets by facilitating the transition from data collection to biological meaning.
The post-genomic era has introduced high-throughput methodologies that generate experimental data at rates that exceed knowledge growth. In particular, high-density biochips including complementary deoxyribonucleic acid (cDNA) microarrays, oligonucleotide microarrays, and rapidly evolving proteomics platforms represent modern tools able to interrogate biology on a genome-wide scale and generate tens of thousands of data points simultaneously. While researchers are beginning to appreciate the statistical rigors required for the analysis of genome-scale datasets, a rate-limiting step in knowledge growth is the transition from statistical significance to biological discovery.
There are currently a number of public efforts focusing on the annotation and curation of gene-specific functional data, including LocusLink, Protein Information Resource (PIR), GeneCards, Proteome, Kyoto Encyclopedia of Genes and Genomes (KEGG), Ensembl, and Swiss-Prot, to name but a few [1-8]. These resources provide exceptional depth and coverage of the functional data available for a given gene, but are not designed to effectively aggregate biological knowledge for hundreds or thousands of genes in parallel. In order to facilitate the functional annotation and analysis of large lists of genes we have developed a Database for Annotation, Visualization, and Integrated Discovery (DAVID), which provides a set of data mining tools that systematically combine functionally descriptive data with intuitive graphical displays. DAVID provides exploratory visualization tools that promote discovery through functional classification, biochemical pathway maps, and conserved protein domain architectures, while simultaneously remaining linked to rich sources of biological annotation. DAVID's functionality is demonstrated using the Affymetrix Genechip data of Cicala et al. However, DAVID expedites the functional annotation and analysis of any list of genes encoded by the human, mouse, rat, or fly genomes.
Materials and Methods
Details of the experimental, RNA preparation, and Genechip hybridization procedures, along with details of the chip-to-chip normalizations and statistical analysis of differential gene expression, are provided in Cicala et al. Briefly, primary human peripheral blood mononuclear cells (PBMCs) and monocyte-derived macrophages were incubated for 16 hours with HIV-1 envelope protein (gp120). High-density oligonucleotide microarrays (Affymetrix HU-95A GeneChip) were used to monitor gp120-induced transcriptional events.
System Architecture and Maintenance
An automated procedure written in Microsoft Visual Basic (VB) 6.0 updates DAVID weekly through the following steps: (i) call a series of Perl and Java applications that download public data through anonymous file transfer protocol (FTP) (Table 1); (ii) unpack and parse desired annotation data; (iii) create tab-delimited data files ready for database import; and (iv) import data into an Oracle 8i relational database management system (RDBMS) using Oracle's SQL*Loader application. An Apache webserver and Java Server Pages (JSP) access the database using JavaBeans and the structured query language (SQL). LocusLink numbers for Affymetrix probe sets are derived from NetAffx or University of Michigan associations.
Table 1. Sources of Annotation Data Integrated into DAVID
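Step (iii) of the update procedure can be illustrated with a minimal sketch. The production pipeline uses Perl and Java applications; the Python below is only an illustration of writing parsed records as tab-delimited rows suitable for a bulk loader. The function name, column names, and record values are hypothetical placeholders and are not taken from DAVID.

```python
import csv
import io

def write_tab_delimited(records, columns, out):
    """Write parsed annotation records as tab-delimited rows for bulk import."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    for record in records:
        # Missing fields become empty strings so every row has the same width.
        writer.writerow([record.get(column, "") for column in columns])

# Hypothetical parsed records; the values are placeholders, not real annotations.
records = [
    {"locus_id": "1001", "symbol": "GENE_A", "name": "example gene A"},
    {"locus_id": "1002", "symbol": "GENE_B"},
]
buffer = io.StringIO()
write_tab_delimited(records, ["locus_id", "symbol", "name"], buffer)
print(buffer.getvalue())
```

Keeping every row at a fixed width, even when a field is missing, is one simple way to make the file predictable for a loader that maps columns by position.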
Format, Submit, and Save Files
Uploading or pasting a list of gene identifiers into DAVID initiates the data mining process. Uploaded files must be tab-delimited text files and can contain one or two columns; gene identifiers must be in the first column, and an optional second column can contain any other type of information (e.g. fold change, p-value, cluster number, etc). Genes separated by any whitespace character can also be copied and pasted into a textbox and uploaded to DAVID. Gene identifiers can be in the form of Affymetrix probe set identifiers, Genbank (and RefSeq) accession numbers, Unigene cluster numbers, or LocusLink identifiers. HTML tables containing analysis results can be saved in Microsoft Excel format by choosing File > Save As > 'filename.xls', where the '.xls' extension allows Microsoft Excel to directly import the tab-delimited data. HTML table results can also be copied and pasted into Microsoft Word and Excel directly.
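The two accepted input forms can be sketched as follows. This is an illustrative reading of the format rules above, not DAVID's actual parser; the function names and the probe set identifiers are hypothetical.

```python
def parse_uploaded(text):
    """Parse a tab-delimited upload: identifier in column 1, optional value in column 2."""
    entries = []
    for line in text.splitlines():
        fields = line.split("\t")
        identifier = fields[0].strip()
        if not identifier:
            continue  # skip blank lines
        value = fields[1].strip() if len(fields) > 1 else ""
        entries.append((identifier, value))
    return entries

def parse_pasted(text):
    """Parse pasted text: identifiers separated by any whitespace, no second column."""
    return [(token, "") for token in text.split()]

# Hypothetical identifiers and fold-change values used only for illustration.
uploaded = "probe_1_at\t2.5\nprobe_2_at\t-1.8\n"
pasted = "probe_1_at probe_2_at\nprobe_3_at"
print(parse_uploaded(uploaded))
print(parse_pasted(pasted))
```

Both functions return the same (identifier, value) shape, so downstream steps need not know how the list was submitted.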
The Annotation Tool is an automated method for the functional annotation of gene lists. Any combination of annotation data can be chosen from ten options by selecting the appropriate checkboxes (Table 2). The annotations are added to the submitted gene list by selecting the upload button, which returns an HTML table containing the user's original list of identifiers appended with the chosen functional annotations. Unannotated genes are included in the output with no appended data for tracking purposes.
Table 2. Options Provided by the Annotation Tool
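The behavior of the Annotation Tool, including the retention of unannotated genes, can be sketched in a few lines. This is a simplified illustration that assumes an in-memory lookup table in place of the relational database; the function name, field names, and values are hypothetical.

```python
def annotate(gene_ids, annotation_table, fields):
    """Append the chosen annotation fields to each submitted identifier.

    Identifiers missing from the table are kept, with empty fields,
    so that unannotated genes remain visible in the output.
    """
    rows = []
    for gene_id in gene_ids:
        record = annotation_table.get(gene_id, {})
        rows.append([gene_id] + [record.get(field, "") for field in fields])
    return rows

# Hypothetical lookup table standing in for the database.
annotation_table = {
    "probe_1_at": {"symbol": "GENE_A", "gene_name": "example gene A"},
    "probe_3_at": {"symbol": "GENE_C"},
}
rows = annotate(
    ["probe_1_at", "probe_2_at", "probe_3_at"],
    annotation_table,
    ["symbol", "gene_name"],
)
for row in rows:
    print("\t".join(row))
```

The output preserves the order and length of the submitted list, which is what allows unannotated entries to be tracked.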
The GoCharts tool graphically displays the distribution of differentially expressed genes among functional categories using the controlled vocabulary of the Gene Ontology Consortium (GO), which provides a structured language that can be applied to the functions of genes and proteins in all organisms even as knowledge continues to accumulate and change. The language is structured as a directed acyclic graph (DAG), wherein term specificity increases and genome coverage decreases as one moves down the hierarchy. In contrast with a true hierarchy, a child term in a DAG may have more than one parent term and may have a different class of relationship with each of its parents. The structure of GO starts with three main categories: Biological Process, Molecular Function, and Cellular Component. Biological Process includes broad biological goals, such as mitosis or purine metabolism, that are accomplished by ordered assemblies of molecular functions. Molecular Function describes the tasks performed by individual gene products; examples are transcription factor and DNA helicase. The Cellular Component classification type involves subcellular structures, locations, and macromolecular complexes; examples include nucleus, telomere, and origin recognition complex. After choosing a classification type, levels that determine list coverage and specificity are chosen by selecting the appropriate radio button. Level 1 provides the highest list coverage with the least amount of term specificity. With each increasing level, coverage decreases while specificity increases, so that level 5 provides the least amount of coverage with the highest term specificity. Classification data is displayed as a bar chart, where the length of the bar represents the number of gene identifiers in each category.
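The relationship between DAG levels, coverage, and specificity can be made concrete with a small sketch. It assumes that a term's level is the fewest edges separating it from the root of its category, which is one possible convention and is not confirmed here as the rule DAVID applies. The toy graph, term names, and gene identifiers are hypothetical.

```python
from collections import deque

def term_levels(parents, root):
    """Level of each term = fewest edges from the root (an assumed convention)."""
    children = {}
    for child, term_parents in parents.items():
        for parent in term_parents:
            children.setdefault(parent, []).append(child)
    levels = {root: 0}
    queue = deque([root])
    while queue:
        term = queue.popleft()
        for child in children.get(term, []):
            if child not in levels:
                levels[child] = levels[term] + 1
                queue.append(child)
    return levels

def ancestors(term, parents):
    """The term itself plus every term reachable by following parent links."""
    seen = {term}
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def genes_per_category(gene_terms, parents, root, level):
    """Count genes under each term found at the requested level."""
    levels = term_levels(parents, root)
    counts = {}
    for gene, terms in gene_terms.items():
        categories = set()
        for term in terms:
            categories |= {t for t in ancestors(term, parents) if levels.get(t) == level}
        for category in categories:
            counts[category] = counts.get(category, 0) + 1
    return counts

# Toy DAG: term C has two parents, which a strict tree would not allow.
parents = {"A": ["root"], "B": ["root"], "C": ["A", "B"], "D": ["C"]}
gene_terms = {"gene_1": ["D"], "gene_2": ["A"], "gene_3": ["C"]}
for level in (1, 2, 3):
    print(level, genes_per_category(gene_terms, parents, "root", level))
```

In this toy example all three genes are classified at level 1, two at level 2, and one at level 3, mirroring the trade-off between coverage and specificity described above.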
The user can set visualization parameters for sorting output data and displaying categories that contain at least a minimum number of genes. Selecting an individual bar opens a new HTML table displaying the gene identifier, LocusLink number, Gene Name, the current classification, and other classifications for each gene in that category. A 'Show All' button opens a new HTML table displaying all classification data and a 'Show Chart Data' button opens an HTML table containing the underlying chart data, thus allowing users to recreate customized chart graphics in a spreadsheet program. A new chart can be displayed for any subset of genes by selecting the classification type and level using the checkboxes and radio buttons available within the user's current page, allowing for drill-down capabilities. A count of the number of genes annotated is included in the output and unannotated genes are binned into the 'unclassified' category, thus providing users with an automated tracking system for genes not annotated.
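The chart parameters described here (a minimum hit threshold, sorting by hit count, and an 'unclassified' bin) can be sketched as follows. This is an illustration only; the function name is hypothetical, the gene identifiers are placeholders, and applying the threshold to the 'unclassified' bin is an assumption.

```python
def build_chart(gene_categories, min_hits=1):
    """Return (category, gene count) rows sorted by descending count.

    Genes with no categories are binned as 'unclassified'. Categories
    with fewer than min_hits genes are dropped from the chart.
    """
    counts = {}
    for gene, categories in gene_categories.items():
        for category in (set(categories) or {"unclassified"}):
            counts[category] = counts.get(category, 0) + 1
    rows = [(category, n) for category, n in counts.items() if n >= min_hits]
    rows.sort(key=lambda row: (-row[1], row[0]))  # ties broken alphabetically
    return rows

# Hypothetical gene-to-category assignments.
gene_categories = {
    "gene_1": ["response to stress", "cell communication"],
    "gene_2": ["response to stress"],
    "gene_3": [],
    "gene_4": ["cell communication"],
    "gene_5": ["death"],
}
for category, n in build_chart(gene_categories, min_hits=1):
    print(f"{category:<20} {'#' * n} {n}")
```

Raising `min_hits` to 2 in this example leaves only the two categories that contain two genes each, which is how a threshold keeps a chart readable for large lists.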
The KeggCharts tool graphically displays the distribution of differentially expressed genes among KEGG biochemical pathways. Each pathway is linked to the KEGG pathway map, wherein differentially expressed genes from the original list are highlighted in red. In this view genes are further linked to additional annotations available through KEGG's DBGET retrieval system. As with GoCharts, the user can set visualization parameters for sorting output data and displaying categories that contain at least a minimum number of genes. Selecting an individual bar opens an HTML table displaying the gene identifier, LocusLink number, Gene Name, the pathway, and other classification data. A 'Show All' button opens a new HTML table displaying all classification data. A 'Show Chart Data' button opens an HTML table containing the underlying chart data. Genes not classified by KeggCharts are handled in the same manner as with GoCharts.
The DomainCharts tool graphically displays the distribution of differentially expressed genes among PFAM protein domains. Each domain designation is linked to the Conserved Domain Database (CDD) of NCBI, where details regarding domain function, structure and sequence are readily available. As with GoCharts and KeggCharts, the user can set visualization parameters for sorting output data and displaying categories that contain at least a minimum number of genes. Selecting an individual bar opens an HTML table displaying the gene identifier, LocusLink number, Gene Name, the domain, and other classification data, and a 'Show All' button opens a new HTML table displaying all classification data. The 'Show Chart Data' button opens an HTML table containing the underlying chart data. Genes not classified by DomainCharts are handled in the same manner as with GoCharts and KeggCharts.
Incubation of primary human PBMCs with HIV-1 gp120 resulted in the differential expression of 402 genes. While 16 genes modulated by HIV-1 gp120 have previously been associated with HIV replication and/or envelope signaling, the remaining genes are of unknown function or have never been associated with HIV-1 or gp120. Converting this list of genes into biological meaning requires the gathering of pertinent information from several data repositories. For many researchers this process consists of iterative browsing through several databases for each gene, manually gathering gene-specific information regarding sequence, function, pathway, and disease association. In contrast, the systematic approach of DAVID adds biologically rich information derived from several public data sources to lists of genes in parallel. Selecting DAVID's Annotation Tool and uploading the list of 402 differentially expressed genes initiates the functional annotation and analysis of the entire dataset. Once submitted, the gene list is stored for the entire analysis session, allowing users to switch between modules without having to resubmit data.
The Annotation Tool provides several annotation options and builds a tabular view of the user's gene list and the available annotations (Table 2). Choosing the annotation fields Gene Symbol, LocusLink, OMIM, Unigene, Reference Sequence, and Gene Name followed by selecting the 'Upload' button produces an HTML table in the web browser containing all genes and their available annotations, where gene identifiers, descriptive and classification data are pulled from the database and appended to the gene list (Figure 1). Gene identifiers such as Gene Symbol and LocusLink are hyper-linked to additional gene-specific data available at their original sources, thus providing in-depth gene-specific details and annotation pedigrees. Classification data and functional summaries can be used to quickly scan for information relevant to the researcher's experimental system. The server time required for execution of this module correlates linearly with gene list size and takes less than 45 seconds for lists up to 1000 genes in length (Figure 2, numbers in parentheses represent r² values). These results demonstrate the power and efficiency of an integrated approach to the functional annotation of large datasets.
Figure 1. Output of the Annotation Tool. Shown are appended annotations for the first several Affymetrix probesets in an HTML table containing all 402 entries. Categorical information about the experimental conditions was submitted along with the Affymetrix probe set identifiers and included in the output in the value column. Identifiers such as Symbol, LocusLink, OMIM, RefSeq, and Unigene accessions are hyper-linked to their original sources for more detailed information. Text included in summary fields is derived from descriptive, functional information provided in NCBI's LocusLink Reports.
Figure 2. Time analysis of the Annotation Tool. Server time required (y axis) to simultaneously append all ten annotation options to gene lists ranging in size from 100 to 1000 (x axis). The average of three trials for gene lists containing Affymetrix, Genbank, LocusLink, and Unigene identifiers is shown, and the numbers in parentheses represent the r² value of the correlation between gene list size and the server time required for annotation.
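For readers who wish to check linearity on their own timing runs, the r² value of a simple linear correlation can be computed as in the sketch below. The input values are synthetic and exactly linear by construction; they are not measurements from DAVID and are included only to show the calculation.

```python
def r_squared(x, y):
    """Squared Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return (sxy * sxy) / (sxx * syy)

# Synthetic, perfectly linear placeholder values (NOT measured server times).
list_sizes = [100, 200, 300, 400]
seconds = [2.0, 3.0, 4.0, 5.0]
print(r_squared(list_sizes, seconds))
```

A value close to 1 indicates that server time grows in proportion to list size; the function is undefined when either sequence is constant.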
Choosing the GoCharts module opens a new window with a variety of options. Users choose between three general types of classification (biological process, molecular function, and cellular component) and five levels of annotation that represent term coverage and specificity (see Materials and Methods). Any combination of classification and coverage level can be specified. Also included are options to annotate gene lists with all GO terms available or only the most specific terms, which are referred to as terminal nodes. The option to choose different levels of term specificity provides needed flexibility and thus allows researchers to dynamically determine which level of coverage and specificity best suits their data and stage of analysis. For instance, early stage analyses may consist of annotating gene lists with very general terms in order to gain a broad understanding of the data. In this case selecting biological process and level 1 classifies genes using general terms such as 'death' and 'cell communication'. Using increased term specificity facilitates the extraction of more detailed functional information. In this case selecting biological process and level 5 classifies genes using terms such as 'apoptotic mitochondrial changes' and 'chemosensory perception'.
However, increased term specificity comes at a cost, in that as term specificity increases list coverage decreases (Figure 3). In our studies we find that level 2 typically maintains good coverage while also providing meaningful term specificity. Since our gene list is still stored in memory, selecting biological process, level 2, and the 'Chart Values' button executes the program and opens a new window containing a histogram depicting the number of genes in each category (Figure 4A). The GoCharts visualization quickly reveals that 35 differentially expressed genes are involved in stress responses. Since HIV-1 has a major impact on the function of cells of the immune system and their ability to carry out stress responses, we selected the histogram bar representing the number of genes involved in stress response, which opens an HTML table containing the Affymetrix identifier, LocusLink number, Gene Name, the current classification, and other classifications for all 35 genes (Figure 4B). Having reduced our gene list to those genes involved in stress responses, we further characterized this subset by repeating the GoCharts procedure available at the top of the stress response HTML table.
Figure 3. Analysis of gene list coverage using GoCharts. A list of 402 Affymetrix probe set identifiers was annotated with the Proteome assigned functional classifications provided by LocusLink. Percent coverage represents the number of genes out of 402 that were annotated at a term specificity level within the Biological Process, Molecular Function, and Cellular Component classification types. Percent coverage decreases as term specificity increases.
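The percent coverage measure defined in the Figure 3 legend amounts to a simple ratio, sketched below. The function name and identifiers are hypothetical, and the number of annotated identifiers in the example is arbitrary rather than a reported result.

```python
def percent_coverage(gene_ids, annotated_ids):
    """Percentage of submitted identifiers that received an annotation."""
    genes = set(gene_ids)
    if not genes:
        return 0.0
    return 100.0 * len(genes & set(annotated_ids)) / len(genes)

# 402 placeholder identifiers; the annotated subset is arbitrary.
gene_ids = [f"id_{i}" for i in range(402)]
annotated_ids = gene_ids[:201]
print(percent_coverage(gene_ids, annotated_ids))
```

Computing this value once per classification type and level yields the kind of coverage curve that Figure 3 summarizes.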
Figure 4. Output of GoCharts. A) A bar chart showing the distribution of differentially expressed genes among Gene Ontology (GO) Biological Processes. Parameters were set to GO level 2, a hit threshold of five, and output was sorted by hit count. Blue bars are linked to additional annotation data shown in Figure 4B. B) Selecting the blue bar in Figure 4A corresponding to 'response to stress' opens an HTML table showing the LocusLink, Gene Name, Current Classification, and Other Classification data for the genes in that category. C) This subset of genes involved in 'stress response' was further characterized by selecting GO Molecular Function, GO level 3, a hit threshold of 2, and sorting by hit count. Selecting the 'Chart Values' button creates a new histogram revealing that 16 of the 35 stress response genes possess cytokine activity.
Choosing molecular function, level 3 and selecting the 'Chart Values' button produces a new histogram that quickly reveals that nearly half (16/35) of the stress response genes possess cytokine activity (Figure 4C). Indeed, cytokines have been shown to play an important role in the HIV-1 lifecycle and the results obtained here suggest that treatment of PBMCs with HIV-1 envelope proteins significantly modulates the transcription of numerous cytokine genes. The efficiency with which GoCharts systematically summarized this large dataset with graphic visualizations, while remaining linked to primary data and external resources, drastically improved the discovery process.
The KeggCharts module functions in much the same way as GoCharts. Selecting the 'KeggCharts' button and choosing a minimum hit threshold opens a histogram displaying the distribution of differentially expressed genes among biochemical pathways (Figure 5A). The chart shows that a KEGG pathway of apoptosis includes 5 genes induced by HIV-1 gp120. As with GoCharts, selecting the histogram bar representing the number of genes in the pathway opens an HTML table containing the underlying genes and additional annotation data. Selecting the pathway name opens the corresponding KEGG biochemical pathway map and highlights in red outline the differentially expressed genes functioning in that pathway (Figure 5B). In this view genes are further linked to additional annotations available through KEGG's DBGET retrieval system. Note that only four genes in the KEGG apoptosis pathway are highlighted in red, while the KeggCharts tool mapped five Affymetrix probesets to the apoptosis pathway. This difference is due to the fact that two of the Affymetrix probesets target the same "TNFα" gene.
Figure 5. Output of KeggCharts. A) Visualization chart showing the distribution of 402 genes among KEGG biochemical pathways. The hit threshold was set to three and the output was sorted by hit count. The large number of unclassified identifiers is due to the fact that KEGG is biochemical pathway centric and thus provides low coverage of gene lists. Similar to the output of GoCharts, blue bars represent the number of genes in each pathway. Selecting a blue bar opens an HTML table showing the LocusLink, Gene Name, Current Classification, and Other Classification data for the genes in that pathway (data not shown). B) The KEGG biochemical pathway that appears following the selection of the pathway name "apoptosis" in Figure 5A depicts 4 differentially expressed genes within the apoptosis pathway by highlighting them in light green and red. The fact that the KEGG pathway highlights only 4 genes while the KeggChart maps 5 Affymetrix probesets to the apoptosis pathway is due to the fact that two probesets target the same "TNFα" gene.
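The discrepancy between probe set counts and gene counts follows directly from the many-to-one mapping of probe sets to genes. The sketch below illustrates this with placeholder identifiers; the function name and the mapping are hypothetical and do not reproduce the actual probe sets in the apoptosis pathway.

```python
def count_probesets_and_genes(probesets, probe_to_gene):
    """Return (number of mapped probe sets, number of distinct genes)."""
    mapped = [p for p in probesets if p in probe_to_gene]
    genes = {probe_to_gene[p] for p in mapped}
    return len(mapped), len(genes)

# Placeholder mapping: two probe sets point at the same gene.
probe_to_gene = {
    "probe_1_at": "gene_1",
    "probe_2_at": "gene_1",
    "probe_3_at": "gene_2",
    "probe_4_at": "gene_3",
    "probe_5_at": "gene_4",
}
print(count_probesets_and_genes(list(probe_to_gene), probe_to_gene))
```

Counting at the probe set level gives five hits while counting distinct genes gives four, which is the same pattern seen between the chart and the pathway map.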
The DomainCharts module functions in much the same way as KeggCharts. Selecting the 'DomainCharts' button and choosing a minimum hit threshold opens a visualization of the distribution of genes among PFAM protein domains (Figure 6A). The DomainCharts histogram identifies 16 genes with kinase domains (pkinase), likely reflecting the effects of HIV-1 gp120 on signal transduction machinery. The chart also identifies 6 genes with interleukin 8 domains (IL8), a domain which represents a highly conserved motif among stress response cytokines. As with KeggCharts, selecting the histogram bar representing the number of genes with that domain opens an HTML table containing the underlying genes and additional annotation data, whereas selecting the domain name opens the Conserved Domain Database (CDD) page corresponding to that PFAM domain (Figure 6B). This page provides detailed sequence, structure, and functional information about the IL8 domain and the proteins that contain it.
Figure 6. Output of DomainCharts. A) Visualization chart showing the distribution of 402 genes among protein domains. The parameters were set to a minimum hit threshold of 4 and output was sorted by hit count. Similar to the output of GoCharts and KeggCharts, blue bars represent the number of genes containing that particular domain. Selecting a blue bar opens an HTML table showing the LocusLink, Gene Name, Current Classification, and Other Classification data for the genes containing that domain (data not shown). B) Selecting the domain name 'IL8' in Figure 6A, which contains 6 differentially expressed genes, brings the user to a new page containing the output from the Conserved Domain Database (CDD) of NCBI, which provides detailed information about the IL8 domain including structural information, multiple sequence alignments, and descriptive information about the domain and the proteins that possess it.
DAVID's Annotation Tool, GoCharts, KeggCharts, and DomainCharts combine to provide high-throughput methods for functional annotation and biological discovery, all of which can be accessed via the internet at http://www.david.niaid.nih.gov. The Annotation Tool efficiently appended annotations to 402 genes in less than twelve seconds and provided functional summaries and links to external data sources, all of which could be downloaded to a user's personal workstation for further analysis. Complementary features including graphic visualizations of functional categories, conserved protein domains, and biochemical pathways were provided by GoCharts, DomainCharts, and KeggCharts, which quickly led to the identification of stress response cytokines and protein kinases as major functional categories modulated by HIV-1 envelope proteins. This analysis supports the findings reported by the original authors and illustrates the utility of DAVID in the rapid annotation and analysis of large datasets commonly generated by high-throughput expression profiling.
The development of any complete, in silico discovery system requires full, query-based access to an integrated, up-to-date view of all relevant information, regardless of its physical location and content structure. Still in its infancy, DAVID represents the foundation of our continued development efforts that aim to integrate information-rich data sources and provide quantitative analysis methods that promote biological discovery and knowledge growth. We have immediate plans to add new data for identifying relationships among receptor/ligand interaction networks. The incorporation of data that links transcription factors to their respective binding sites within promoter regions is of equal priority. Quantitative methods able to identify enriched functional categories in a list of genes are also under development (Hosack DA et al., manuscript in preparation). While we remain committed to maintaining a system able to co-evolve with technological advancement and the novel forms of data that are sure to follow, DAVID's current design elements provide automated solutions that enable researchers to rapidly discover biological themes in large datasets consisting of lists of genes.
We thank Bill Wilton for information technology and network support. The project has been funded with federal funds from the National Institute of Allergy and Infectious Diseases, National Institutes of Health, under Contract No. NO1-C0-56000. The contents of this tool do not necessarily reflect the views or policies of the Department of Health and Human Services, nor does mention of trade names, commercial products or organizations imply endorsement by the United States government.
Wu CH, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu ZZ, Ledley RS, Lewis KC, Mewes HW, Orcutt BC, Suzek BE, Tsugita A, Vinayaka CR, Yeh LS, Zhang J, Barker WC: The Protein Information Resource: an integrated public resource of functional annotation of proteins.
Costanzo MC, Crawford ME, Hirschman JE, Kranz JE, Olsen P, Robertson LS, Skrzypek MS, Braun BR, Hopkins KL, Kondu P, Lengieza C, Lew-Smith JE, Tillberg M, Garrels JI: YPD™, PombePD™, and WormPD™: model organism volumes of the BioKnowledge™ library, an integrated resource for protein information.
Hubbard T, Barker D, Birney E, Cameron G, Chen Y, Clark L, Cox T, Cuff J, Curwen V, Down T, Durbin R, Eyras E, Gilbert J, Hammond M, Huminiecki L, Kasprzyk A, Lehvaslaiho H, Lijnzaad P, Melsopp C, Mongin E, Pettett R, Pocock M, Potter S, Rust A, Schmidt E, Searle S, Slater G, Smith J, Spooner W, Stabenau A, Stalker J, Stupka E, Ureta-Vidal A, Vastrik I, Clamp M: The Ensembl genome database project.
Curr Issues Mol Biol 2001, 3:47-55.
Cicala C, Arthos J, Selig SM, Dennis G Jr, Hosack DA, Van Ryk D, Spangler ML, Steenbeke TD, Khazanie P, Gupta N, Yang J, Daucher M, Lempicki RA, Fauci AS: HIV envelope induces a cascade of cell signals in non-proliferating target cells that favor virus replication.
University of Michigan Annotation data [http://dot.ped.med.umich.edu:2000/ourimage/pub/shared/JMR_pub_affyannot.html]