Abstract
Background
Cluster analysis is often used to infer regulatory modules or biological function by associating unknown genes with other genes that have similar expression patterns and known regulatory elements or functions. However, clustering results may not have any biological relevance.
Results
We applied various clustering algorithms to microarray datasets with different sizes, and we evaluated the clustering results by determining the fraction of gene pairs from the same clusters that share at least one known common transcription factor. We used both yeast transcription factor databases (SCPD, YPD) and chromatin immunoprecipitation (ChIP) data to evaluate our clustering results. We showed that the ability to identify coregulated genes from clustering results is strongly dependent on the number of microarray experiments used in cluster analysis and the accuracy of these associations plateaus at between 50 and 100 experiments on yeast data. Moreover, the modelbased clustering algorithm MCLUST consistently outperforms more traditional methods in accurately assigning coregulated genes to the same clusters on standardized data.
Conclusions
Our results are consistent with respect to independent evaluation criteria that strengthen our confidence in our results. However, when one compares ChIP data to YPD, the falsenegative rate is approximately 80% using the recommended pvalue of 0.001. In addition, we showed that even with large numbers of experiments, the falsepositive rate may exceed the truepositive rate. In particular, even when all experiments are included, the best results produce clusters with only a 28% truepositive rate using known gene transcription factor interactions.
Background
Cluster analysis is a popular exploratory technique to analyze microarray data. It is often used for pattern discovery  to identify groups (or clusters) of genes or experiments with similar expression patterns. Cluster analysis is an unsupervised learning approach in which genes or experiments are assigned to groups (or clusters) based on their expression patterns and no prior knowledge of the data is required. A common application of cluster analysis is to identify potentially meaningful relationships between genes or experiments or both [13].
Transcription of a gene is determined by the interaction of regulatory proteins (that is, transcription factors) with DNA sequences in the gene's promoter region [4]. A common application of cluster analysis is to identify potential transcriptional modules, for example genes that share common promoter sites. An example of this is the largescale analysis of gene expression as a function of cell cycle in yeast [5]. The study focused on genes that behaved similarly to other genes that are known to be regulated during the cell cycle. A total of 800 genes were found to be regulated during the cell cycle, and 700 base pairs (bp) of genomic sequence immediately upstream of the start codon for each of these 800 genes was analyzed to identify potential binding sites for known or novel factors that might control expression during the cell cycle. The majority of the genes were shown to have good matches to known cellcycle transcription factor binding sites.
The approach pioneered by Spellman et al. [5]  for example the metaanalysis of massive amounts of geneexpression data to identify genes that are coexpressed followed by promoter analysis  is now commonplace [610]. Cluster analysis is often used to identify genes whose expression levels are correlated across numerous experiments. However, using cluster analysis to infer regulatory modules or biological function has its limitations. In general, cluster analysis always returns clusters independent of the biological relevance of the clusters. Microarray data can be quite noisy owing to measurement errors and technical variations, and cluster analysis will find patterns in noise as well as in signal. In this paper, we address two main questions. The first is how often do we discover coregulated genes (that is, genes that are regulated by common transcription factors) from coexpressed genes (that is, genes that share similar expression patterns). The second asks how the following factors affect the likelihood of finding coregulated genes: the number of microarray experiments in the microarray datasets; the clustering algorithm used; and the diversity of experiments in a microarray dataset.
The primary thrust of this paper is to provide guidance to researchers who wish to use cluster analysis of gene expression data to identify coregulated genes. In particular, we provide an estimate of the accuracy of this association as a function of the number of experiments used in cluster analysis. This information is critical for researchers in assessing how much effort (if any) should go into promoter analysis of genes that cluster together in a fixed number of experiments.
Our approach
Our goal is to study the likelihood that coexpressed genes are regulated by the same transcription factor(s). We define coexpressed genes as genes that share similar expression patterns as discovered by cluster analysis, and we define coregulated genes as genes that are regulated by at least one common known transcription factor. Our overall approach is illustrated in Figure 1. In brief, we first defined a set of genes that are controlled by known transcription factors. As there is both an abundance of yeast array data and many available resources on yeast transcription factors such as yeast transcription factor databases [11,12] and yeast chromatin immunoprecipitation (ChIP) data [13,14], we ran our experiments on yeast data. Genes that share common transcription factors are taken as our 'gold standards' for evaluating the ability of cluster analysis to infer coregulation. We then identified large publicly available yeast microarray datasets, preprocessed these datasets to remove genes with many missing values and created randomly sampled subsets of the data on which we performed cluster analysis. The randomly sampled subsets with different numbers of microarray experiments allow us to study the effect of the size of microarray datasets on the likelihood of discovering coregulated genes. We used two publicly available yeast microarray datasets consisting of hundreds of microarray experiments: the yeast compendium data [2] and the yeast environmental stress data [15,16]. The yeast compendium dataset [2] consists of 300 knockout microarray experiments, whereas the yeast environmental stress dataset [15,16] consists of 225 concatenated time course microarray experiments. We investigated the effect of different clustering algorithms on identifying coregulated genes by applying different clustering algorithms to subsets of these microarray datasets, including heuristicbased clustering algorithms such as hierarchical completelink and hierarchical averagelink algorithms, and modelbased clustering algorithms such as MCLUST [1719] and the infinite mixture modelbased method (IMM) [2022].
Figure 1. Our overall approach. We applied different clustering algorithms to cluster the genes in yeast microarray datasets with different sizes to identify coexpressed genes. The level of coregulation is evaluated using yeast transcription factor databases (SCPD and YPD) and ChIP data. The clustering results are then evaluated by determining the fraction of gene pairs from the same clusters that share at least one known common transcription factor.
We used two independent sources of data to define coregulated genes: yeast transcription factor databases [11,12] and yeast chromatin immunoprecipitation (ChIP) data [13,14]. Transcription factor databases are based on published results in the literature and are generally based on specific measures of physical interactions between the transcription factor, promoter, and some measure that the transcription factor truly regulates the downstream gene. We used two different transcription factor databases: the Saccharomyces cerevisiae Promoter Database (SCPD) [11], and the Yeast Proteome Database (YPD) [12]. The SCPD lists approximately 230 yeast genes that are regulated by 90 transcription factors, while the YPD lists approximately 580 yeast genes that are regulated by 120 transcription factors, as of November 2001. We extracted two subsets of genes that are listed in the SCPD and YPD databases from each of the yeast compendium [2] and the environmental stress [15,16] microarray datasets. After eliminating genes and experiments with many missing values, the two gene subsets from the compendium data evaluated using SCPD and YPD consist of 215 genes under 273 experiments, and 537 genes under 258 experiments, respectively. The two gene subsets from the environmental stress data evaluated using SCPD and YPD consist of 205 genes under 205 experiments, and 526 genes under 198 experiments, respectively. The ChIP data represents a systematic technique to determine target genes bound to a set of transcription factors in vivo. However, the binding of a transcription factor to the promoter sequence of a gene does not necessarily imply that the transcription factor actually regulates the gene. We evaluated the two gene subsets from each of the yeast compendium and environmental stress datasets using the ChIP data [14] in addition to the corresponding transcription factor database. Two genes are considered coregulated if they are bound to at least one common transcription factor in the ChIP data. The publicly available ChIP data [14] adopts an error model in which a confidence value (pvalue) is assigned to each regulatorDNA interaction, and we used the recommended pvalue threshold of 0.001. In other words, we assume that a gene binds to a given transcription factor if the pvalue is at most 0.001.
To assess the reliability of cluster analysis in the inference of coregulation, we evaluated the clustering results by computing the true positive rate (TP rate), which is defined as the fraction of coclustered gene pairs that share at least one common known transcription factor. A high TP rate indicates a high level of coregulation from a given clustering result. As we do not have complete knowledge of all transcription factors, this TP rate is expected to be underestimated. Moreover, we compared the TP rate from a clustering algorithm to that from random partitions over a range of numbers of clusters because the TP rate may be sensitive to the number of clusters and/or the size distribution of clusters. Our primary evaluation criterion is a zscore, which measures the significance of the TP rate from a clustering result relative to the distribution of TP rates from random partitions with the same number of clusters and same cluster size distributions. Hence, the zscore is a measure of how accurately cluster analysis infers coregulation relative to a random guess. A high zscore implies that the TP rate from the given clustering result is significantly higher than those of random partitions, and hence, indicates a high level of coregulation.
Results
Effect of clustering algorithms
In order to study the effect of different clustering algorithms, we applied different clustering algorithms to subsets of genes listed in SCPD or YPD using all available experiments from the compendium dataset and the environmental stress dataset. For each dataset, we extracted two overlapping gene subsets according to the genes listed in SCPD and YPD respectively. For each of these four gene subsets, we evaluated the proportion of coregulated genes from clustering results using two criteria: transcription factor databases (SCPD or YPD) and ChIP data. Because we do not have perfect knowledge of the optimal number of clusters, we applied each clustering algorithm over a range of numbers of clusters.
Two typical results are shown in Figure 2a,b, which compare the zscores from different clustering algorithms (hierarchical averagelink using correlation and Euclidean distance, hierarchical completelink using correlation and Euclidean distance, MCLUST and IMM) on the yeast compendium dataset with 215 genes and 273 experiments evaluated using SCPD and ChIP data, respectively. The modelbased clustering algorithm MCLUST with the equalvolume spherical model and hierarchical completelink with correlation as the similarity measure produce the highest zscores (hence, proportion of coregulated genes) over the entire range of number of clusters (from 5 to 100), using either SCPD or ChIP data as the evaluation criterion. Figure 2a,b also shows that using correlation coefficient as the pairwise similarity measure produces significantly higher proportions of coregulated genes from clustering results than using Euclidean distance. The results from MCLUST and IMM shown in Figure 2a,b represent zscores from the algorithms applied to the standardized data. Standardization of the data dramatically increases the zscores from modelbased methods (see Figure A.1.b in Additional data file 1). Standardization means that the average expression value of each gene across all experiments is subtracted from the expression value of each gene and then divided by the standard deviation of its expression levels across all experiments. It can be shown that correlation and Euclidean distance are equivalent after standardization.
Figure 2. Effect of different clustering algorithms using all available microarray experiments. We compared the ability of different clustering algorithms to identify coregulated genes using all 273 microarray experiments from the subset of compendium data with 215 genes. The clustering algorithms we compared include hierarchical averagelink using correlation and Euclidean distance as the similarity measure, hierarchical completelink using correlation and Euclidean distance as the similarity measure, and modelbased clustering algorithms MCLUST and IMM on standardized data. A high zscore indicates a high proportion of coregulated genes from clustering results compared to those from random partitions with the same numbers of clusters and cluster size distributions. Since the optimal number of clusters is not known, we compared the performance of clustering algorithms over a range of different numbers of clusters (from 5 to 100). (a) The transcription factor database SCPD is used as the evaluation criterion for coregulated genes. (b) ChIP data is used as the evaluation criterion for coregulated genes. The modelbased clustering algorithm MCLUST produces relatively high zscores using either SCPD or ChIP as our evaluation criterion.
We observed similar results on another subset from the yeast compendium dataset, and the two gene subsets from the environmental stress data (see Figures A.2A.4 in Additional data file 1): MCLUST with the equalvolume spherical model on the standardized data typically produces the highest zscores and using correlation as the similarity measure always produces higher zscores than Euclidean distance. In addition, the two independent evaluation criteria, transcription factor databases (SCPD or YPD) and ChIP data, produce very similar results on the two gene subsets from both the compendium and environmental stress data.
One of the advantages of the modelbased clustering methods over traditional heuristic clustering algorithms is that we can estimate the optimal number of clusters. In our previous work on modelbased clustering methods [21,23], we showed that both MCLUST and IMM produced reasonable estimates of the numbers of clusters on microarray data. Using IMM, we estimated that there are 25 clusters on the compendium data subset with 215 genes and 273 experiments, 42 clusters on the compendium dataset with 537 genes and 258 experiments, 16 clusters on the environmental stress dataset with 205 genes and 205 experiments, and 34 clusters on the environmental stress dataset with 526 genes and 198 experiments. On the other hand, MCLUST does not offer any reasonable estimates of numbers of clusters in this case.
Effect of the number of microarray experiments
In order to study the effect of the number of microarray experiments on the proportion of coregulated genes from cluster analysis on typical microarray datasets, we randomly sampled (with replacement) subsets of E experiments from each of the two gene subsets of the compendium and environmental stress data, where E = 5, 10, 20, 50, 100. We repeated this random sampling procedure 100 times for each E on each dataset. We then applied clustering algorithms to these randomly sampled subsets. Thus, we generated a clustering result for each clustering algorithm on each of these 100 randomly sampled subsets with different sizes (E). The performance of a clustering algorithm on a dataset with E experiments is summarized by the median zscore over the 100 randomly sampled subsets with E experiments.
Figure 3a shows a typical result comparing the median zscores from a given clustering algorithm (hierarchical completelink, in this case) on randomly sampled subsets with different numbers of microarray experiments (E) over a range of numbers of clusters on the compendium data subset with 215 genes evaluated using SCPD. Figure 3a shows that the median zscores (and hence, the proportions of coregulated gene pairs) increase as the number of microarray experiments (E) increases for hierarchical completelink over different numbers of clusters. We observed the same trend using other clustering algorithms. In particular, the median zscores increase drastically from five experiments to 50 experiments, and then the increase in median zscore starts to flatten. We observed the same trend on all our datasets (two different gene subsets from both the compendium and environmental stress data) evaluated using either a transcription factor database (SCPD or YPD) or ChIP data (see Figures B.1B.4 in Additional data file 1 for detailed results).
Figure 3. Effect of the number of microarray experiments on the compendium data subset with 215 genes. We compared the extent of coregulated genes using different numbers of microarray experiments on the subset of compendium data with 215 genes. In order to produce typical datasets with E experiments (where E = 5, 10, 20, 50, 100), we randomly sampled (with replacement) 100 different subsets of E experiments from the compendium data with 215 genes and 273 experiments. The ability to identify coregulated genes from clustering results is summarized by the median zscores over the 100 randomly sampled datasets. A high median zscore indicates a high proportion of coregulated genes from clustering results compared to those from random partitions. (a) We compared the median zscores using different numbers of experiments (E) from hierarchical completelink over a range of different numbers of clusters (from 5 to 100). The transcription factor database SCPD is used as the evaluation criterion for coregulated genes. The median zscores generally increase as E increases over different numbers of clusters. This shows that higher proportions of coregulated genes are identified on microarray datasets with higher numbers of experiments. (b) Using SCPD as our evaluation criterion, we compared the median zscores using different numbers of experiments (E) and different clustering algorithms (hierarchical averagelink and completelink using correlation, modelbased clustering algorithms MCLUST and IMM on standardized data) on the compendium data subset with 215 genes at 25 clusters. We estimated the optimal number of clusters on this dataset to be 25 using IMM, and we observed similar results at different numbers of clusters. (c) Using ChIP data as our evaluation criterion, we compared the median zscores using different numbers of experiments (E) and different clustering algorithms on the compendium data subset with 215 genes at 25 clusters. Using either SCPD or ChIP as our evaluation criterion, the median zscores typically increase as E increases, and MCLUST typically produces relatively high median zscores.
Figure 3b shows a typical result comparing the median zscores from different clustering algorithms (hierarchical averagelink using correlation, hierarchical completelink using correlation, MCLUST and IMM on standardized data) over different sizes of microarray datasets (E = 5, 10, 20, 50, 100, and 273) at 25 clusters on the compendium data subset with 215 genes evaluated using SCPD. Again, we observed that the median zscores increase as the numbers of microarray experiments in the randomly sampled datasets (E) increase. Moreover, the modelbased algorithm MCLUST produces the highest median zscores (and hence, proportion of coregulated gene pairs) for E = 5, 10, 20, 50 and 100. When all the experiments are used (E = 273), hierarchical completelink produces the highest median zscore. We observed the same trend (that is MCLUST generally produces the highest median zscores) at other numbers of clusters.
We also compared the distribution of zscores over the 100 randomly sampled subsets as a function of the size of the randomly sampled subsets (E). Specifically, we created boxplots of the zscores over different sizes of randomly sampled data (E) for a given clustering algorithm and a fixed number of clusters. The medians and percentiles of the zscores generally increase when there are more experiments in the subsets (see Figure B.1.d in Additional data file 1). In other words, a higher proportion of coregulated genes are identified on microarray datasets with a higher number of experiments.
Using ChIP data as the evaluation criterion, Figure 3c shows a typical result comparing the median zscores from different clustering algorithms over different E at 25 clusters on the compendium data subset with 215 genes. Both evaluation criteria (SCPD and ChIP data) produce very similar results: MCLUST generally produces the highest median zscores and the median zscores increase as E increases. The only difference is that MCLUST produces higher zscores than hierarchical completelink using correlation with all 273 experiments when ChIP data is used as our evaluation criterion instead of SCPD.
We observed the same results on all other datasets: the median zscores increase as E increases and MCLUST generally produces the highest median zscores compared to other clustering algorithms. For example, Figure 4 compares the performance of different clustering algorithms (hierarchical averagelink using correlation, hierarchical completelink using correlation, and MCLUST on standardized data) over different sizes of randomly sampled data (E = 5, 10, 20, 50, 100 and 258) on another gene subset from the compendium data with 537 genes evaluated using YPD. Figures 5 and 6 show the results on the two gene subsets from the environmental stress data evaluated using SCPD and YPD respectively. The zscores of IMM are not available on all the plots because it is very computationally expensive to run IMM on 100 randomly sampled subsets for each E on each dataset. We observed the same trends when ChIP data is used as the evaluation criterion (see Figures B.1B.4 in Additional data file 1).
Figure 4. Effect of the number of microarray experiments on the compendium data subset with 537 genes. Using YPD as the evaluation criterion, we compared the median zscores using different numbers of experiments and different clustering algorithms (hierarchical averagelink and completelink using correlation, modelbased clustering algorithms MCLUST and IMM on standardized data) on the compendium data subset with 537 genes and 258 experiments at 40 clusters. The median zscores (and hence, proportions of coregulated genes) increase as the number of experiments increases, and MCLUST produces relatively high median zscores. We observed very similar results using ChIP data as the evaluation criterion.
Figure 5. Effect of the number of microarray experiments on the environmental stress data subset with 205 genes. Using SCPD as the evaluation criterion, we compared the extent of coregulated genes using different numbers of microarray experiments on the subset of environmental stress data with 205 genes and 205 experiments at 20 clusters. The median zscores (and hence, proportions of coregulated genes) increase as the number of experiments increases, and MCLUST typically produces relatively high median zscores. We observed very similar results using ChIP data as the evaluation criterion.
Figure 6. Effect of the number of microarray experiments on the environmental stress data subset with 526 genes. Using YPD as the evaluation criterion, we compared the extent of coregulated genes using different numbers of microarray experiments on the subset of environmental stress data with 526 genes and 198 experiments at 30 clusters. The median zscores increase as the number of experiments increases, and MCLUST produces relatively high median zscores. We observed very similar results using ChIP data as the evaluation criterion.
Diversity of microarray experiments
We investigated the effect of the diversity of experimental conditions on the level of coregulation from cluster analysis. Specifically, we adopted a greedy algorithm to search for a subset of E experiments with high diversity and another subset of E experiments with low diversity from each of the compendium and environmental stress datasets. For these searches, diversity was defined as average pairwise correlation in gene expression between experiments (the Materials and methods section gives details of the greedy algorithm and our definition of diversity). Cluster analysis was then applied to these subsets with high and low diversities.
Contrary to our expectations, we did not observe any consistent patterns between the diversity of experimental conditions and the zscore. For example, relatively similar subsets of knockout experiments from the compendium data tend to produce higher proportions of coregulated genes than diverse subsets of such experiments. On the other hand, relatively diverse subsets of timecourse experiments from the environmental stress data tend to produce higher proportions of coregulated genes than similar subsets of such experiments. It is possible that the diversity of experimental conditions has a different effect on different types of microarray datasets: the compendium dataset consists of knockout experiments, while the environmental stress dataset consists of concatenated timecourse experiments. However, we need more evidence (in particular, more microarray datasets of different natures) to confirm this possibility. Another possible reason for the inconsistent patterns is that our definition of diversity in terms of average correlation between all pairs of experiments may not be the best definition of diversity for this purpose of grouping experiments that are likely to help identify coregulated genes. A third possible reason is that the diversity of experimental conditions has no significant effect on coregulation at all. With our current results, we cannot rule out this third possibility.
Discussion
It is important to note that even when all experiments are included, the best results produce clusters with only a 28% true positive rate (see Figure E.1.a in Additional data file 1). That is, most of the genes in a given cluster do not share a common, known transcription factor. There are several possible reasons for this. First, with the present state of knowledge, it is possible that genes in the same cluster do in fact share a common transcription factor that is not (yet) represented in the databases used as gold standards (YPD, SCPD and ChIP data). We note for example, that when one compares ChIP data to YPD, the falsenegative rate is approximately 80% using the recommended pvalue of 0.001. That is, known gene transcription factor interactions from YPD are identified only about 20% of the time by ChIP (see Table F in Additional data file 1). Hence, it is possible that our evaluation criteria all underestimate the number of coregulated genes in a cluster. Second, gene regulation is more complex than accounted for in our approach; for example, we define sharing a common transcription factor as 'coregulated', and each gene belongs to exactly one cluster in the clustering algorithms we considered. Individual genes are often regulated by multiple transcription factors, some of which may enhance or repress transcription. Hence, genes may be coexpressed as a result of a combination of the effects of multiple transcription factors that need not be shared across all genes. Third, genes may be included in a cluster primarily because of noise (measurement errors or technical variations) in the data rather than true signal. Finally, the range of conditions under which the experimental data was obtained may not produce changes in gene expression that would result in segregation of genes into appropriate clusters.
Even with the above caveats, our methodology simulates a common approach of experimental biologists. That is, clustering of diverse geneexpression datasets under the assumption that coregulated genes will cocluster, followed by attempts to identify the common transcription factors and transcription factor binding sites. While this approach has been very successful when applied to very large datasets (in particular in yeast), it is clear that the accuracy of inference should be highly dependent on the number of experimental conditions included in the analysis. The motivation of our study is to provide guidance as to the likelihood that this approach will produce true and falsepositive results and to study these rates as a function of the number of experiments that are clustered.
Our current study does not provide completely quantitative results; for example, how many experiments are sufficient for coclustered genes to have x% probability of being coregulated? Ideally, we would like to minimize both false positives and false negatives. However, we believe that it is of greater importance to focus on false positives, because false positives potentially lead to a waste of resources and effort to verify nonexistent relationships, while falsenegatives represent missed opportunities. We also do not provide reliable falsenegative rates because of our incomplete knowledge: the transcription factor databases document known gene transcription factor interactions, but they do not give any information on known genes that are not regulated by given transcription factors. It is also not clear how to determine the pvalue threshold for nonbinding between genes and transcription factors on ChIP data. Furthermore, our study does not take the information limit of microarray data into consideration. For example, the environmental stress data consist of 225 concatenated time course microarray experiments, while the number of distinct experimental conditions is significantly less than 225. However, our results show that the ability of clustering algorithms to identify coregulated genes increases dramatically as the number of microarray experiments used in cluster analysis is increased. In general, the ability to identify coregulated genes in yeast datasets starts to plateau when the number of microarray experiments is greater than 50. In addition, our study indicates that the likelihood of correctly identifying coregulated genes by clustering a small number (< 10) geneexpression experiments in yeast is quite small, and that even with large numbers of experiments, the falsepositive rate may exceed the truepositive rate. For example, cluster analysis on five experiments identifies coregulated genes only 1.5 to 6fold more accurately than random assignment of genes to clusters on our yeast datasets (Figures 3b,c, 4 and 5). Moreover, our results were, for the most part, independent of the number of genes in the datasets and the gold standards used. That is, using SCPD, YPD, or ChIP data to identify which genes share a common transcription factor yielded very similar results in most cases. Since we extracted different (but overlapping) subsets of genes using the genes listed in SCPD and YPD for evaluation, and each of these two gene subsets are independently evaluated using ChIP data again, we have very strong confidence that our observations and general results are highly representative despite our incomplete knowledge of yeast transcription factors.
Therefore, caution is indicated before embarking on computational approaches to identify putative transcription factor binding sites in genes that cocluster in small numbers of experiments. In addition, prudence should be exercised before embarking on expensive bench experiments to characterize these putatively identified transcription factor binding sites. Finally, it is worth noting that our current study focused on yeast, which is a simple eukaryote consisting of only 6,200 genes. We would expect that the correspondence between coclustering and coregulation would be lower in more complex organisms. We are interested in extending our investigation to other organisms, such as Escherichia coli and Caenorhabditis elegans, both of which have fully sequenced genomes and for which there are large microarray datasets and available resources on their transcription factors. Another possible extension to our work is the inclusion of tentative regulatory sequences as our fourth evaluation criterion. A third direction of future work would be to derive mathematically the mean and standard deviation of the distribution of the TP rates from random partitions as a function of the number of clusters and cluster sizes, so as to minimize the computational running time of our study.
Conclusions
Our results demonstrate several important overall features. First, the ability to identify coregulated genes from coexpressed genes is strongly dependent on the number of microarray experiments used in cluster analysis, and the accuracy of these associations plateaus at between 50 and 100 experiments. Second, the modelbased clustering algorithm MCLUST consistently outperforms more traditional methods in accurately assigning coregulated genes to the same clusters. Third, using correlation as the similarity measure in heuristicbased clustering algorithms generally produces relatively higher proportions of coregulated genes compared to Euclidean distance. Fourth, our two independent evaluation criteria for coregulation (transcription factor databases and ChIP data) produced similar conclusions.
Materials and methods
Microarray datasets
We used two publicly available yeast microarray datasets consisting of hundreds of microarray experiments: the yeast compendium data [2,24] and the yeast environmental stress data [15,16,25,26]. The yeast compendium dataset [2] consists of 300 twocolor cDNA microarray experiments in which the transcript levels of diverse mutations or chemically treated culture in yeast were compared to that of a wildtype or mocktreated cultures. The yeast environmental stress dataset [15,16] consists of 225 concatenated time course cDNA microarray experiments. These experiments represent the temporal program of gene expression in response to diverse environmental transitions (such as heat shock, hydrogen peroxide or nitrogen depletion) and to DNAdamaging agents (the methylating agent methylmethane sulfonate and ionizing radiation).
Evaluation criteria and microarray data preprocessing
We adopted two types of evaluation criteria for coregulated genes: yeast transcription factor databases [11,12] and yeast ChIP data [13,14,27]. The SCPD [11] lists approximately 230 yeast genes that are regulated by 90 transcription factors, while the YPD [12] lists approximately 580 yeast genes that are regulated by 120 transcription factors as of November 2001. We extracted two subsets of genes from each of the compendium and environmental stress datasets: one subset of genes as listed in SCPD and another subset of genes as listed in YPD. These gene subsets are selected on the basis of the genes listed in SCPD and YPD. We did not preprocess the microarray data with any filtering steps based on differential expression or absolute levels of expression (see section G of Additional data file 1 for a discussion of the effect of filtering). After eliminating genes and experiments with lots of missing values, the full compendium datasets evaluated using SCPD and YPD consist of 215 genes under 273 experiments, and 537 genes under 258 experiments respectively, and the full environmental stress datasets evaluated using the SCPD and the YPD consist of 205 genes under 205 experiments, and 526 genes under 198 experiments respectively. These data subsets contain at most 1% missing expression values. As the current implementations of the modelbased clustering algorithms (MCLUST and IMM) require the input data to have no missing values, we filled in the missing values using KNNimpute [28] which imputes missing values using weighted average expression levels from other genes with similar expression patterns. To our surprise, most of the gene transcription factor interactions listed in YPD are not listed in SCPD and vice versa. Table 1 shows that there are only 156 common genes listed in both YPD and SCPD, and only 119 common gene transcription factor interactions are listed in both YPD and SCPD.
Table 1. Comparing YPD and SCPD
Because the gene transcription factor interactions from transcription factor databases are incomplete, we independently evaluated the extent of coregulation of these four gene subsets (two gene subsets from each of the compendium and environmental stress data) using the yeast ChIP data [14], which systematically identify target genes bound in vivo by a set of 106 known transcription factors. The publicly available ChIP data [14] adopts an error model in which a confidence value (pvalue) is assigned to each regulatorDNA interaction. A pvalue close to 0 implies that we have high confidence that a gene of interest binds to a given transcription factor. Lee et al. [14] recommended a pvalue threshold of 0.001 to minimize false positives and to maximize legitimate regulatorDNA interactions.
There are 791 gene transcription factor interactions from YPD for which both the gene names and the transcription factors are present in the ChIP data [14]. Out of these 791 interactions, only 20% (or 159) were detected by the ChIP data using a pvalue threshold of 0.001. On the other hand, there are 642 gene transcription factor interactions from the ChIP data [14] using a pvalue threshold of 0.001. Out of these 642 interactions, only 34% (or 221) were reported in YPD. Therefore, the gene transcription factor interactions inferred from the ChIP data are quite different from those listed in YPD (see Table F in Additional data file 1).
Measure of statistical significance
We evaluated the level of coregulation of clustering results by considering pairs of genes assigned to the same clusters and counting the fraction of these gene pairs that share at least one common known transcription factor. Specifically, we defined the truepositive rate (TP rate) as
A high truepositive rate indicates a high proportion of coregulated genes from a given clustering result. However, as we do not have complete knowledge of all transcription factors (for example, the genetranscription factor interactions listed in transcription factor databases are likely to be incomplete and the pvalue threshold used in ChIP data may not be optimal), the TP rates we computed are likely to be underestimates.
As the TP rate may change as a function of the number and size distribution of clusters, we compared the TP rate from a clustering result to that from random partitions over a range of numbers of clusters. Specifically, we randomly partitioned the set of genes from a clustering result many times (typically 1,000 times in our experiments) to produce the same number of clusters and cluster size distribution as the given clustering result. We computed the TP rates of these random partitions and compared the distribution of these TP rates from random partitions to the TP rate of the given clustering result. The TP rates from random partitions typically closely follow the normal distribution. Hence, we computed the mean μ and standard deviation σ for the distribution of TP rates from random partitions. Let us denote the TP rate from a clustering result as X. The zscore, z, associated with the TP rate is defined as
A high zscore implies that the TP rate from the given clustering result is significantly higher than those of random partitions, and thus indicates a high level of coregulation. We computed the zscores as a function of clustering algorithm, number of experiments and number of clusters.
Effect of the number of microarray experiments
To study the effect of the number of microarray experiments on the likelihood of discovering coregulated genes from clustering results, we randomly sampled (with replacement) subsets of E experiments from each of the four data subsets (two gene subsets from each of the compendium and environmental stress data), where E = 5, 10, 20, 50, 100. We repeated this random sampling procedure 100 times for each E on each dataset.
Effect of clustering algorithms
There are numerous algorithms and associated programs to perform cluster analysis (for example, hierarchical methods [29], selforganizing maps [30], kmeans [31], modelbased approaches [1719,32,33]) and many of these techniques have been applied to expression data (for example [1,6,2023,33,34]). In our previous work [20,21,23], we defined cluster quality in terms of functional categories on real microarray datasets and the true underlying clusters (or classes) on synthetic datasets, and we showed that modelbased clustering algorithms such as MCLUST [1719] and the infinite mixture modelbased method (IMM) [2022] typically produced higher cluster quality than other heuristicbased clustering methods. We now focus on comparing the likelihood of assigning coregulated genes to the same clusters from different clustering methods. In particular, we studied the performances of both heuristicbased clustering algorithms (hierarchical completelink and hierarchical averagelink) and modelbased clustering algorithms (MCLUST and IMM).
Similarity measures and heuristicbased algorithms
Most heuristicbased clustering algorithms take the pairwise similarities of objects (genes or experiments) as input and create as output an organization of the objects grouped by similarity to each other. There are many similarity measures, among which the two most popular ones for geneexpression data are correlation coefficient and Euclidean distance. Correlation is a similarity measure, that is, a high correlation coefficient implies high similarity, and it captures the directions of change of two expression profiles. Euclidean distance is a dissimilarity measure, that is, a high distance implies low similarity, and it measures both the magnitudes and directions of change between two expression profiles.
Hierarchical algorithms define a dendrogram (tree) relating similar objects in the same subtrees. In agglomerative hierarchical algorithms (such as averagelink and completelink), each object is initially assigned to its own subtree (cluster). In each step, similar subtrees (clusters) are merged to form the dendrogram. We obtain clusters from the dendrogram by stopping the merging process when the desired number of clusters (subtrees) is produced. Different definitions of cluster similarity yield different clustering algorithms. In hierarchical completelink algorithm, cluster similarity is defined to be the minimum similarity between a pair of genes, one from each of the two clusters. In hierarchical averagelink algorithm, the cluster similarity of two clusters is the average pairwise similarity between genes in the two clusters.
MCLUST
The finite Gaussian mixture modelbased approach assumes that each cluster follows the multivariate normal distribution, with model parameters that specify the location and property of each cluster. Different models in MCLUST assume different cluster properties (shape, volume and orientation). The most constrained model is the equalvolume spherical model in which all clusters are assumed to have equalvolume and spherical in shape. The unconstrained model is the most general model, in which the clusters can be elliptical and may have different orientations and volumes. MCLUST implements the expectationmaximization (EM) algorithm for clustering via finite Gaussian mixture models, as well as modelbased hierarchical clustering algorithms, with optional crosscluster constraints [19].
Infinite mixture modelbased approach (IMM)
Medvedovic et al. [22] postulated an infinite Gaussian mixture model for geneexpression data which incorporates an error model for repeated measurements. Each cluster is assumed to follow a multivariate normal distribution, and the measured repeated expression levels are assumed to follow another multivariate normal distribution. They used a Gibbs sampler to estimate the posterior pairwise probabilities of coexpression. These posterior pairwise probabilities are treated as pairwise similarities, which are used as inputs to clustering algorithms like hierarchical completelink algorithm. Our recent work showed that applying hierarchical completelink to these posterior pairwise probabilities using a particular cluster similarity parameter (minimum distance = 0.9999) yields reasonable estimates of the optimal number of clusters [21]. In this work, IMM produced very reasonable estimates of the optimal numbers of clusters for all our datasets.
Diversity of microarray experiments
We defined the diversity of a set of E microarray experiments as the average correlation of all pairs of experiments in this set of E experiments. A high correlation implies low diversity, while a low correlation implies high diversity. To investigate the effect of diversity of experimental conditions on the level of coregulation, we used a greedy algorithm to select a set of E experiments with high diversity (low average correlation) and another set of E experiments with low diversity (high average correlation) from the compendium and environmental stress data subsets, where E = 5, 10, 15, 20, 30, 40, 50, 80, 100.
Let S be the current set of experiments. In the case of searching for a set of E highly diverse experiments, we initialized S with the pair of experiments with minimum pairwise correlation. After the initialization step, the following two steps are repeated until there are E experiments in S. First, search for an experiment e with minimum total correlation to all the current experiments in S; then, add experiment e to the set S. The greedy algorithm to search for a set of E experiments with low diversity is very similar, except that experiments with maximum correlation are added instead.
Additional data files
A pdf file (Additional data file 1) available with the online version of this article gives the datasets used in this work. The files and software are also available from our website [35].
Additional data file 1. The datasets used in this work
Format: PDF Size: 325KB Download file
This file can be viewed with: Adobe Acrobat Reader
Acknowledgements
We thank Professor Elton T Young from Biochemistry at University of Washington, members of the Bumgarner lab and the working group organized by Adrian Raftery at University of Washington for their feedback and comments for this project. K.Y.Y. and R.E.B. are supported by NIHNIDDK grant 5U24DK05881303. R.E.B. is also supported by NIHNIAID grants 5P01 AI05210602, 1R21AI05202801 and 1U54AI05714101, NIHNIEHA grant 1U19ES01138702, NIHNIDA grant 1 P30 DA01562501, NIHNHLBI grants 5R01HL07237002 and 1P50HL07399601. M.M. is supported by NHGRI grant 1R21HG00284901.
References

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Nat Acad Sci USA 1998, 95:1486314868. PubMed Abstract  Publisher Full Text

Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al.: Functional discovery via a compendium of expression profiles.
Cell 2000, 102:109126. PubMed Abstract  Publisher Full Text

Wu LF, Hughes TR, Davierwala AP, Robinson MD, Stoughton R, Altschuler SJ: Largescale prediction of Saccharomyces cerevisiae gene function using overlapping transcriptional clusters.
Nat Genet 2002, 31:255265. PubMed Abstract  Publisher Full Text

Wyrick JJ, Young RA: Deciphering gene expression regulatory networks.
Curr Opin Genet Dev 2002, 12:130136. PubMed Abstract  Publisher Full Text

Spellman PT, Sherlock G, Zhang MQ, Iyer VR, Anders K, Eisen MB, Brown PO, Botstein D, Futcher B: Comprehensive identification of cell cycleregulated genes of the yeast Saccharomyces cerevisiae by microarray hybridization.
Mol Biol Cell 1998, 9:32733297. PubMed Abstract  Publisher Full Text

Tavazoie S, Huges JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.
Nat Genet 1999, 22:281285. PubMed Abstract  Publisher Full Text

Geiss GK, Carter VS, He Y, Kwieciszewski BK, Holzman T, Korth MJ, Lazaro CA, Fausto N, Bumgarner RE, Katze MG: Gene expression profiling of the cellular transcriptional network regulated by alpha/beta interferon and its partial attenuation by the hepatitis C virus nonstructural 5A protein.
J Virol 2003, 77:63676375. PubMed Abstract  Publisher Full Text

Ohler U, Niemann H: Identification and analysis of eukaryotic promoters: recent computational approaches.
Trends Genet 2001, 17:5660. PubMed Abstract  Publisher Full Text

Wolfsberg TG, Gabrielian AE, Campbell MJ, Cho RJ, Spouge JL, Landsman D: Candidate regulatory sequence elements for cell cycledependent transcription in Saccharomyces cerevisiae.
Genome Res 1999, 9:775792. PubMed Abstract  Publisher Full Text

Jelinsky SA, Estep P, Church GM, Samson LD: Regulatory networks revealed by transcriptional profiling of damaged Saccharomyces cerevisiae cells: Rpn4 links base excision repair with proteasomes.
Mol Cell Biol 2000, 20:81578167. PubMed Abstract  Publisher Full Text

Zhu J, Zhang MQ: SCPD: a promoter database of the yeast Saccharomyces cerevisiae.
Bioinformatics 1999, 15:607611. PubMed Abstract  Publisher Full Text

Costanzo MC, Hogan JD, Cusick ME, Davis BP, Fancher AM, Hodges PE, Kondu P, Lengieza C, LewSmith JE, Lingner C, et al.: The yeast proteome database (YPD) and Caenorhabditis elegans proteome database (WormPD): comprehensive resources for the organization and comparison of model organism protein information.
Nucleic Acids Res 2000, 28:7376. PubMed Abstract  Publisher Full Text

Ren B, Robert F, Wyrick JJ, Aparicio O, Jennings EG, Simon I, Zeitlinger J, Schreiber J, Hannett N, Kanin E, et al.: Genomewide location and function of DNA binding proteins.
Science 2000, 290:23062309. PubMed Abstract  Publisher Full Text

Lee TI, Rinaldi NJ, Robert F, Odom DT, BarJoseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al.: Transcriptional regulatory networks in Saccharomyces cerevisiae.
Science 2002, 298:799804. PubMed Abstract  Publisher Full Text

Gasch AP, Spellman PT, Kao CM, CarmelHarel O, Eisen MB, Storz G, Botstein D, Brown PO: Genomic expression programs in the response of yeast cells to environmental changes.
Mol Biol Cell 2000, 11:42414257. PubMed Abstract  Publisher Full Text

Gasch AP, Huang M, Metzner S, Botstein D, Elledge SJ, Brown PO: Genomic expression responses to DNAdamaging agents and the regulatory role of the yeast ATR homolog Mec1p.
Mol Biol Cell 2001, 12:29873003. PubMed Abstract  Publisher Full Text

Fraley C, Raftery AE: MCLUST: Software for modelbased cluster analysis.
J Classification 1999, 16:297306. Publisher Full Text

Fraley C, Raftery AE: How many clusters? Which clustering method?  Answers via modelbased cluster analysis.

Fraley C, Raftery AE: Modelbased clustering, discriminant analysis, and density estimation.
J Am Stat Assoc 2002, 97:611631. Publisher Full Text

Yeung KY, Medvedovic M, Bumgarner RE: Clustering gene expression data with repeated measurements.
Genome Biol 2003, 4:R34. PubMed Abstract  BioMed Central Full Text

Medvedovic M, Yeung KY, Bumgarner R: Bayesian mixture model based clustering of replicated microarray data.
Bioinformatics 2004, 20:12221232. PubMed Abstract  Publisher Full Text

Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles.
Bioinformatics 2002, 18:11941206. PubMed Abstract  Publisher Full Text

Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17:977987. PubMed Abstract  Publisher Full Text

Functional discovery via a compendium of expression profiles [http://www.rii.com/publications/2000/cell_hughes.html] webcite

Genomic response of yeast to diverse stress conditions [http://genomewww.stanford.edu/yeast_stress] webcite

Genomic responses to DNAdamaging agents [http://wwwgenome.stanford.edu/Mec1] webcite

Iyer VR, Horak CE, Scafe CS, Botstein D, Snyder M, Brown PO: Genomic binding sites of the yeast cellcycle transcription factors SBF and MBF.
Nature 2001, 409:533538. PubMed Abstract  Publisher Full Text

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays.
Bioinformatics 2001, 17:520525. PubMed Abstract  Publisher Full Text

Kohonen T: Selforganizing maps. Berlin: SpringerVerlag; 1997.

MacQueen J: Some methods for classification and analysis of multivariate observations. In In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability. Edited by Cam LML, Neyman J. Berkeley: University of California Press; 1965:281297.

McLachlan GJ, Basford KE: Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker; 1988.

McLachlan GJ, Bean RW, Peel D: A mixture modelbased approach to the clustering of microarray expression data.
Bioinformatics 2002, 18:413422. PubMed Abstract  Publisher Full Text

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation.
Proc Nat Acad Sci USA 1999, 96:29072912. PubMed Abstract  Publisher Full Text

UW Department of Microbiology  bioinformatics publications [http://expression.washington.edu/publications/kayee/coregulation] webcite