Abstract
Clustering is a common methodology for the analysis of array data, and many research laboratories are generating array data with repeated measurements. We evaluated several clustering algorithms that incorporate repeated measurements, and show that algorithms that take advantage of repeated measurements yield more accurate and more stable clusters. In particular, we show that the infinite mixture modelbased approach with a builtin error model produces superior results.
Background
The two most frequently performed analyses on geneexpression data are the inference of differentially expressed genes and clustering. Clustering is a useful exploratory technique for geneexpression data as it groups similar objects together and allows the biologist to identify potentially meaningful relationships between the objects (either genes or experiments or both). For example, in the work of Eisen et al. [1] and Hughes et al. [2], cluster analysis was used to identify genes that show similar expression patterns over a wide range of experimental conditions in yeast. Such genes are typically involved in related functions and are frequently coregulated (as demonstrated by other evidence such as shared promoter sequences and experimental verification). Hence, in these examples, the function(s) of gene(s) could be inferred through 'guilt by association' or appearance in the same cluster(s).
Another common use of cluster analysis is to group samples by relatedness in expression patterns. In this case, the expression pattern is effectively a complex phenotype and cluster analysis is used to identify samples with similar and different phenotypes. Often, there is the additional goal of identifying a small subset of genes that are most diagnostic of sample differences. For example, in the work of Golub et al. [3] and van't Veer et al. [4], cluster analysis was used to identify subsets of genes that show different expression patterns between different types of cancers.
There are numerous algorithms and associated programs to perform cluster analysis (for example, hierarchical methods [5], selforganizing maps [6], kmeans [7] and modelbased approaches [810]) and many of these techniques have been applied to expression data (for example [1,1114]). Whereas one might anticipate that some algorithms are inherently better for cluster analysis of 'typical' geneexpression data, nearly every software vendor is compelled to provide access to most published methods. Hence, the biologist wishing to perform cluster analysis is faced with a dizzying array of algorithmic choices and little basis on which to make a choice. In addition, in nearly all published cases, cluster analysis is performed on geneexpression data for which no estimates of error are available  for example, the expression data do not contain repeated measurements for a given data point. Such algorithms do not take full advantage of repeated data when it is available. In this paper we address two questions. First, how well do different clustering algorithms perform on both real and synthetic gene expression data? And second, can we improve cluster quality by using algorithms that take advantage of information from repeated measurements?
Introduction to cluster analysis
A dataset containing objects to be clustered is usually represented in one of two formats: the data matrix and the similarity (or distance) matrix. In a data matrix, rows usually represent objects to be clustered (typically genes), and columns usually represent features or attributes of the objects (typically experiments). An entry in the data matrix usually represents the expression level or expression ratio of a gene under a given experiment. The similarity (or distance) matrix contains the pairwise similarities (or dissimilarities) between each pair of objects (genes or experiments).
There are many similarity measures that can be used to compute the similarity or dissimilarity between a pair of objects, among which the two most popular ones for gene expression data are correlation coefficient and Euclidean distance. Correlation is a similarity measure, that is, a high correlation coefficient implies high similarity, and it captures the directions of change of two expression profiles. Euclidean distance is a dissimilarity measure, that is, a high distance implies low similarity, and it measures both the magnitudes and directions of change between two expression profiles.
Most clustering algorithms take the similarity matrix as input and create as output an organization of the objects grouped by similarity to each other. The most common algorithms are hierarchical in nature. Hierarchical algorithms define a dendrogram (tree) relating similar objects in the same subtrees. In agglomerative hierarchical algorithms (such as average linkage and complete linkage), each object is initially assigned to its own subtree (cluster). In each step, similar subtrees (clusters) are merged to form the dendrogram. Cluster similarity can be computed from the similarity matrix or the data matrix (see Sherlock [15] or Sharan et al. [16] for reviews of popular clustering algorithms for geneexpression data).
Once a clustering algorithm has grouped similar objects (genes and samples) together, the biologist is then faced with the task of interpreting these groupings (or clusters). For example, if a gene of unknown function is clustered together with many genes of similar, known function, one might hypothesize that the unknown gene also has a related function. Or, if biological sample 'A' is grouped with other samples that have similar states or diagnoses, one might infer the state or diagnosis of sample 'A'. However, before one does subsequent laboratory work to confirm a hypothesis or, more important, makes a diagnosis based on the results of cluster analysis, a few questions need to be asked. The first is how reproducible are the clustering results with respect to remeasurement of the data. Then, what is the likelihood that the grouping of the unknown sample or gene of interest with other known samples or genes is false (due to noise in the data, inherent limitations of the data or limitations in the algorithm)? And finally, is there a better algorithm that will reduce errors in clustering results?
Related work
Kerr and Churchill [17] applied an analysis of variance model and bootstrapping to array data to assess stability of clusters (for example, 'if one remeasured the data and did the same analysis again, would the same genes/samples group together?'). In their approach, the original data was resampled using variability estimates and cluster analysis was performed using the resampled data. This posthoc analysis uses variability estimates to provide a good indication of cluster stability. However, this method does not improve the overall clustering results, it only provides an indication of the reproducibility of the clusters with a given dataset and algorithm.
Hughes et al. [2] analyzed their yeast datasets using the commercial software package Resolver (Rosetta Inpharmatics, Kirkland, WA). Resolver was developed with a builtin error model that is derived from repeated data obtained on the array platform of interest. Resolver uses this error model and available repeated data to estimate the error in expression ratios for each gene sampled. In addition, as described below and in [2], Resolver's clustering algorithms use the error estimates to weigh the similarity measures. This results in lower weights for data points with lower confidence in the cluster analysis. The net result of this treatment (as we show below) is an improvement in both cluster accuracy and cluster stability.
Medvedovic et al. [18] have taken a different approach by adopting the Bayesian infinite mixture model (IMM) to incorporate repeated measurements in cluster analysis. They postulated a probability model for geneexpression data that incorporates repeated data, and estimated the posterior pairwise probabilities of coexpression with a Gibbs sampler. They showed that the estimated posterior pairwise distance allowed for easy identification of unrelated objects. These posterior pairwise distances can be clustered using average linkage or complete linkage hierarchical algorithms.
Our contributions
We have implemented several approaches to take advantage of repeated measurements in cluster analysis and performed an empirical study evaluating clustering results using both real and synthetic geneexpression datasets. We tested several different clustering algorithms and similarity measure combinations on the same datasets and evaluated the quality of each approach using the same criteria. We also assessed four different approaches to clustering repeated array data: clustering the averaged expression levels over the repeated measurements; using variability estimates in similarity measures (assigning low weights to noisy data points); clustering the repeated measurements as individual data points and assigning them to the same subtrees in agglomerative hierarchical algorithms; and an IMMbased approach with builtin error models for repeated data. We use two assessment criteria to evaluate clustering results: cluster accuracy (comparing clustering results to known external knowledge of the data); and cluster stability (the consistency of objects clustered together on synthetic remeasured data). In addition, we extended the IMMbased approach and the variabilityweighted approach. We also created synthetic array datasets with error distributions taken from real data. These synthetic data in which the clusters are known are crucial for the development and testing of novel clustering algorithms.
Over a variety of clustering algorithms, we showed that array data with repeated measurements yield more accurate and more stable clusters. When repeated measurements are available, both the variabilityweighted similarity approach and the IMMbased approach improve cluster accuracy and cluster stability to a greater extent than the simple approach of averaging over the repeated measurements. The modelbased approaches (hierarchical modelbased algorithm [8] and the IMM approach [18]) consistently produce more accurate and more stable clusters.
Results
Overview of our empirical study
In our empirical study, we compare the quality of clustering results from a variety of algorithms on array data with repeated measurements. We use two methods to assess cluster quality: cluster accuracy and cluster stability. External validation compares clustering results to known independent external knowledge of which objects (genes, experiments or both) should cluster together [19]. A clustering result that agrees with the external knowledge is assumed to be accurate. However, for most biological data, there is little or no a priori knowledge of this type. We also evaluate the stability of clusters with respect to synthetic remeasured array data. That is, if one remeasures the array data, how often are objects clustered together in the original data assigned to the same clusters in the remeasured data?
In this section, we discuss the clustering algorithms implemented, approaches to clustering repeated measurements, and the real and synthetic datasets used in our empirical study. We will also discuss assessment of cluster quality in greater detail. Finally, we present and discuss results of our study.
Test algorithms and similarity measures
We studied the performance of a wide variety of clustering algorithms, including several agglomerative hierarchical algorithms (average linkage, centroid linkage, complete linkage and single linkage), a divisive hierarchical algorithm called DIANA [20], kmeans [7], a graphtheoretic algorithm called CAST [21], a finite Gaussian mixture modelbased hierarchical clustering algorithm from MCLUST [8], and an IMMbased approach [18]. Agglomerative hierarchical clustering algorithms successively merge similar objects (or subtrees) to form a dendrogram. To evaluate cluster quality, we obtain clusters from the dendrogram by stopping the merging process when the desired number of clusters (subtrees) is produced. The objects in these subtrees form the resulting clusters. Except for the two modelbased approaches, all other clustering algorithms require a pairwise similarity measure. We used both correlation and Euclidean distance in our empirical study.
How to cluster array data with repeated measurements
Average over repeated measurements
The simplest approach is to compute the average expression levels over all repeated measurements for each gene and each experiment, and store these average expression levels in the raw data matrix. The pairwise similarities (correlation or distance) can be computed using these average expression values. This is the approach taken in the vast majority of published reports for which repeated measurements are available.
Variabilityweighted similarity measures
The averaging approach does not take into account the variability in repeated measurements. Hughes et al. [2] proposed an errorweighted clustering approach that uses error estimates to weigh expression values in pairwise similarities such that expression values with high error estimates are downweighted. These errorweighted pairwise similarities are then used as inputs to clustering algorithms. Hughes et al. [2] developed an error model that assigns relatively high error estimates to genes that show greater variation in their repeated expression levels than other genes at similar abundance in their control experiments. In our empirical study, we use variability estimates instead of error estimates in the weighted similarity measures. Intuitively, gene expression levels that show larger variations over the repeated measurements should be assigned lower confidence (weights). We use either the standard deviation (SD) or coefficient of variation (CV) as variability estimates. Let us illustrate this approach with an example: suppose our goal is to compute the variabilityweighted correlation of two genes G1 and G2. For each experiment, we compute the SD or CV over the repeated measurements for these two genes. Experiments with relatively high variability estimates (SD or CV) are downweighted in the variabilityweighted correlation of G1 and G2 (see Materials and methods for mathematical definitions of these weighted similarities).
Hierarchical clustering of repeated measurements
An alternative idea is to cluster the repeated measurements as individual objects in hierarchical clustering algorithms. The idea is to initialize the agglomerative algorithm by assigning repeated measurements of each object to the same subtrees in the dendrogram. In each successive step, two subtrees containing repeated measurements are merged. This approach of forcing repeated measurements into the same subtrees is abbreviated as FITSS (forcing into the same subtrees). In addition to heuristically based hierarchical algorithms (such as average linkage, complete linkage, centroid linkage and single linkage), we also investigate the performance of clustering repeated data with MCLUSTHC, which is a modelbased hierarchical clustering algorithm from MCLUST [8].
IMMbased approach
Medvedovic et al. [18] postulated a probability model (an infinite Gaussian mixture model) for geneexpression data which incorporates repeated data. Each cluster is assumed to follow a multivariate normal distribution, and the measured repeated expression levels follow another multivariate normal distribution. They used a Gibbs sampler to estimate the posterior pairwise probabilities of coexpression. These posterior pairwise probabilities are treated as pairwise similarities, which are used as inputs to clustering algorithms such as average linkage or complete linkage hierarchical algorithms. They showed that these posterior pairwise probabilities led to easy identification of unrelated objects, and hence are superior to other pairwise similarity measures such as Euclidean distance.
The model published in Medvedovic et al. [18] assumes that the variance between repeated measurements of the same genes is homogeneous across all experiments. We call this model the spherical model. We extended the IMM approach to include an elliptical model, in which repeated measurements may have different variance across the experiments. In other words, genes may have different noise levels in the spherical model, while both genes and experiments may have different noise levels in the elliptical model.
Table 1 summarizes the clustering algorithms and similarity measures implemented in our empirical study, and the corresponding methods to cluster repeated data.
Table 1. Summary of various clustering approaches used in our empirical study
Datasets
Assessment of cluster accuracy requires datasets for which there is independent knowledge of which objects should cluster together. For most biological data, there is little or no a priori knowledge of this type. In addition, to develop and test clustering algorithms that incorporate repeated measurements, we require datasets for which repeated measurements or error estimates are available. Unfortunately, very few publicly available datasets meet both criteria. Repeated microarray measurements are, unfortunately, still rare in published data. In addition, one rarely has a priori knowledge of which objects should cluster together. This is especially the case when we are grouping in the gene dimension. To overcome these limitations, we used both real and synthetic datasets in our empirical study. Some of these data will be described in the following sections (and see Materials and methods for details).
Completely synthetic data
Because independent external knowledge is often unavailable on real data, we created synthetic data that have error distributions derived from real array data. We use a twostep process to generate synthetic data. In the first step, data are generated according to artificial patterns such that the true class of each object is known. We created six equalsized classes, of which four are sine waves shifted in phase relative to each other (a periodic pattern) and the remaining two classes are represented by linear functions (nonperiodic). In the second step, error is added to the synthetic patterns using an experimentally derived error distribution. The error for each data point is randomly sampled (with replacement) from the distribution of standard deviations of log ratios over the repeated measurements on the yeast galactose data (described below). The erroradded data are generated from a random normal distribution with mean equal to the value of the synthetic pattern (from the first step), and SD equal to the sampled error. The signaltonoise of the synthetic data is adjusted by linearly scaling the error before adding it to the pattern. We generate multiple synthetic datasets with 400 data points, 20 attributes, 1, 4 or 20 repeated measurements and 2 different levels of signaltonoise (low and high noise levels). In our synthetic data, all genes in each class have identical patterns (before error is added). The cluster structure of real data will, in general, be less distinguishable than that of these synthetic data. Hence, it is of interest to study the performance of various clustering approaches as a function of noise level in the synthetic data. Figure 1a,1b shows the expression profiles of the classes in typical datasets with four repeated measurements at low and high noise levels respectively.
Figure 1. Expression profiles of the classes in typical completely synthetic datasets with four repeated measurements. (a) Low noise level; (b) high noise level. For each class, the log ratios are plotted against the experiment numbers, and each class is shown in a different color. There are four sine (periodic) classes with different phase shifts and two linear (nonperiodic) classes. Only four (out of six) classes are shown in (b) for clarity.
Real data: yeast galactose data
In the yeast galactose data of Ideker et al. [22], four replicate hybridizations were performed for each cDNA array experiment. We used a subset of 205 genes that are reproducibly measured, whose expression patterns reflect four functional categories in the Gene Ontology (GO) listings [23] and that we expect to cluster together. On this data, our goal is to cluster the genes, and the four functional categories are used as our external knowledge. That is, we evaluate algorithm performance by how closely the clusters reproduce these four functional categories.
Synthetic remeasured data
To generate synthetic remeasured array data to evaluate cluster stability, we need an error model that describes repeated measurements. Ideker et al. [24] proposed an error model for repeated cDNA array data in which the measured fluorescent intensity levels in each of the two channels are related to their true intensities by additive, multiplicative and random error parameters. The multiplicative error parameters represent errors that are proportional to the true intensity, while the additive error parameters represent errors that are constant with respect to the true intensity. The measured intensity levels in the two channels are correlated such that genes at higher intensities have higher correlation. Ideker et al. [24] estimated these parameters (additive, multiplicative and correlation parameters) from repeated cDNA array data using maximum likelihood, and showed that this model gives reasonable estimates of the true expression intensities with four repeated measurements. We used this error model to estimate the true intensity for each gene, and the correlation, additive and multiplicative error parameters on the yeast galactose data. We generate synthetic remeasured data by generating the random error components in the model from the specified random distributions.
Assessment of cluster quality
Cluster accuracy
To assess algorithm performance, we need a statistic that indicates the agreement between the external knowledge and the clustering result. A clustering result can be considered as a partition of objects into groups. In all subsequent discussion, the term 'class' is used to refer to the external knowledge, while the term 'cluster' refers to the partitions created by the algorithm. Assuming known categories (classes) of objects are available, we can compare clustering results by assessing the agreement of the clusters with the classes. Unfortunately, the results of a given cluster analysis may merge partitions that the external knowledge indicates should be separate or may create additional partitions that should not exist. Hence, comparison of clusters with classes is not as simple as counting which objects are placed in the 'correct' partitions. In fact, with some datasets and algorithms, there is no obvious relationship between the classes and the clusters.
The adjusted Rand index [25] is a statistic designed to assess the degree of agreement between two partitions. On the basis of an extensive empirical study, Milligan and Cooper [26] recommended the adjusted Rand index as the measure of agreement even when comparing partitions with different numbers of clusters. The Rand index [27] is defined as the fraction of agreement, that is, the number of pairs of objects that are either in the same groups in both partitions or in different groups in both partitions, divided by the total number of pairs of objects. The Rand index lies between 0 and 1. When the two partitions agree perfectly, the Rand index is 1. The adjusted Rand index [25] adjusts the score so that its expected value in the case of random partitions is 0. A high adjusted Rand index indicates a high level of agreement between the classes and clusters.
Cluster stability
A few recent papers suggested that the quality of clusters could be evaluated via cluster stability, that is, how consistently objects are clustered together with respect to synthetic remeasured data. The synthetic remeasured data is created by randomly perturbing the original data using error parameters derived from repeated measurements. For example, Kerr and Churchill [17] and Li and Wong [28] generated randomly perturbed data from cDNA and oligonucleotide arrays respectively to identify objects that are consistently clustered.
In our empirical study, we assess the level of agreement of clusters from the original data with clusters from the synthetic remeasured data by computing the average adjusted Rand index over all the synthetic datasets. We also compute the average adjusted Rand index between all pairs of clustering results from the randomly remeasured data. A high average adjusted Rand index implies that the clusters are stable with respect to data perturbations and remeasurements. The external knowledge is not used in computing cluster stability.
Completely synthetic data at low noise level
Table 2a,2b shows selected results on cluster accuracy and cluster stability on the completely synthetic datasets with four simulated repeated measurements. Table 2a,2b show results from average linkage, complete linkage and centroid linkage hierarchical algorithms, kmeans, MCLUSTHC (a hierarchical modelbased clustering algorithm from MCLUST) and IMM. Both single linkage and DIANA produce very lowquality and unstable clusters and their adjusted Rand indices are not shown. For each clustering approach, we produced six clusters (which is the number of classes). The results from CAST are not shown because the input parameter cannot be tuned to produce exactly six clusters in many cases. The FITSS column refers to the method of forcing repeated measurements into the same subtrees. Because kmeans is not hierarchical, its results are not available (NA) under the FITSS column. Both centroid linkage hierarchical algorithm and kmeans algorithm require the raw data matrix as input, so we cannot apply these two algorithms to cluster the posterior pairwise probabilities from the IMM approach.
Table 2. Cluster accuracy and stability on the completely synthetic data with four repeated measurements at low noise level
In terms of cluster accuracy, the elliptical model of IMM produced the highest level of agreement (adjusted Rand index = 0.957) with the six classes, and the hierarchical modelbased clustering algorithm (MCLUSTHC) also produced clusters with high agreement (adjusted Rand index = 0.930) with the six classes. Within the same clustering algorithm, different similarity measures and different methods to deal with repeated measurements yield different cluster accuracy. For example, average linkage hierarchical algorithm produced more accurate clusters with Euclidean distance (variabilityweighted or average overrepeated measurements) than correlation. The variabilityweighted similarity approach produced more accurate clusters using SDs as the variability estimates than using the CVs. It is also interesting to note that SDweighted correlation produced relatively lowquality clusters, whereas SDweighted distance produced relatively accurate clusters. The FITSS approach of forcing repeated measurements into the same subtrees in hierarchical clustering algorithms does not yield high cluster accuracy.
In terms of cluster stability, most clustering approaches yield stable clusters (with average adjusted Rand indices above 0.900) except the spherical model of the IMM approach. This is because the spherical model assumes homogeneous variability for each gene across the experiments (which is not true on this synthetic data).
Completely synthetic data at high noise level
Tables 3a,3b show the results on cluster accuracy and cluster stability on the completely synthetic data with four repeated measurements at high noise level. Even at a higher noise level, the elliptical model of IMM produced much more accurate clusters (average adjusted Rand index = 0.911 and 0.910 using average linkage or complete linkage) than all other approaches (SDweighted distance and kmeans produced an average adjusted Rand index of 0.801). In general, the relative rankings of various clustering approaches at high noise level are similar to those at low noise level, except that the modelbased hierarchical approach (MCLUSTHC) produced less accurate clusters than the SDweighted distance approach using the heuristically based algorithms.
Table 3. Cluster accuracy and stability on the completely synthetic data with four repeated measurements at high noise level
At high noise level, the approach of averaging over the repeated measurements produced relatively lowquality clusters, especially when Euclidean distance is used (for example, both average linkage and centroid linkage produced an average adjusted Rand index of 0). In addition, the quality of clusters produced using Euclidean distance deteriorates more rapidly than correlation at high noise level. The SDweighted distance approach produced substantial improvement in cluster quality over the approach of averaging over repeated measurements using the same algorithms at high noise level.
In terms of cluster stability (see Table 3b), the following three approaches yield average adjusted Rand index above 0.900: the elliptical model of the IMM approach; the SDweighted distance using average linkage and centroid linkage. It is interesting that the spherical model of the IMM approach produces unstable clusters at both high and low noise levels.
Yeast galactose data
Table 4a,4b show selected results on cluster accuracy and cluster stability on real yeast galactose data. The true mean column in Table 4a refers to clustering the true mean data (estimated with the error model suggested by Ideker et al. [24]) instead of clustering the repeated measurements. For each clustering approach, we produced four clusters (which is the number of functional categories). The highest level of cluster accuracy (adjusted Rand index = 0.968 in Table 4a) was obtained with several algorithms: centroid linkage hierarchical algorithm with Euclidean distance and averaging over the repeated measurements; hierarchical modelbased algorithm (MCLUSTHC); complete linkage hierarchical algorithm with SDweighted distance; and IMM with complete linkage. Clustering with repeated measurements produced more accurate clusters than clustering with the estimated true mean data in most cases.
Table 4. Cluster accuracy and stability on yeast galactose data
Table 4b shows that different clustering approaches lead to different cluster stability with respect to remeasured data. Similar to the results from the completely synthetic data, Euclidean distance tends to produce more stable clusters than correlation (both variabilityweighted and average over repeated measurements). Clustering results using FITSS were less stable than the variabilityweighted approach and the averaging over repeated measurements approach.
SD produced more accurate and more stable clusters than CV in the variabilityweighted similarity approach, especially when Euclidean distance is used. In addition, the modelbased approaches (MCLUSTHC and IMM) produced relatively accurate and stable clusters on this data.
Effect of different numbers of repeated measurements
To study the effect of different numbers of repeated measurements on the performance of various clustering approaches, we generated completely synthetic data with different numbers of simulated repeated measurements for each data point. Specifically, we generated 1, 4, or 20 repeated measurements at both the low and high noise levels. The quality of clustering results on datasets with higher numbers of repeated measurements is usually higher (Table 5). For example, using the same algorithms and same similarity measures cluster accuracy is considerably improved with synthetic datasets of four repeated measurements relative to datasets with no repeated measurement. With 20 repeated measurements, Euclidean distance is less sensitive to noise, and the SDweighted distance approach produces comparable cluster accuracy to IMM. This is probably because the variability estimates computed over 20 repeated measurements are much more robust than those with four repeated measurements. Nevertheless, the elliptical model of IMM consistently produced the most accurate clusters over different numbers of simulated repeated measurements and different noise levels.
Table 5. Cluster accuracy on the completely synthetic datasets with different numbers of repeated measurements
Discussion
We showed that different approaches to clustering array data produce clusters of varying accuracy and stability. We also showed that the incorporation of error estimates estimated from repeated measurements improves cluster quality. We also show that the elliptical model of IMM consistently produced more accurate clustering results than other approaches using both real and synthetic datasets, especially at high noise levels. The variabilityweighted approach tends to produce more accurate and more stable clusters when used with Euclidean distance than the simple approach of averaging over the repeated measurements. In addition, the SDweighted distance usually produces more accurate and more stable clusters than the CVweighted distance. In general, the results are consistent across both real and synthetic datasets.
Limitations
In all the above results, we produced clustering results in which the number of clusters was set equal to the number of classes. In agglomerative hierarchical clustering algorithms (for example, average linkage), we successively merged clusters until the desired number of clusters, K, is reached, and considered the K subtrees as our K clusters, whereas in other algorithms the number of clusters was provided as input. A concern is that using a fixed number of clusters will force different classes into the same cluster owing to one or more outliers occupying a cluster. In such cases, the adjusted Rand index might improve with a larger number of clusters.
However, we chose to use a fixed number of clusters for several reasons. First, with the exception of the modelbased algorithms, all other clustering algorithms (directly or indirectly) require the number of clusters as input. Even with the modelbased algorithms, the number of clusters can only be estimated. In MCLUSTHC, the number of clusters can be estimated using a statistical score (see [29]). In the IMM approach, the number of clusters can be estimated from the posterior distribution of clustering results (see [18]). Second, it is very difficult, if not impossible, to compare cluster quality over a range of different clustering algorithms when the number of clusters is not fixed. Finally, increasing the number of clusters does not always yield better clusters or higher Rand indices (data not shown).
There are also some limitations with the external criteria for the real datasets used in our empirical study. With the yeast galactose data, we used a subset of 205 genes, which contains many genes previously shown to be strongly coregulated and which reflect four functional categories in the GO listings [23]. This subset of genes may be biased in the sense that they are not chosen entirely independently of their expression patterns. In addition, there may be good biological reasons why some genes in the chosen set of 205 should not cluster into groups segregated by the GO classifications.
Distributions of variabilityweighted similarity measures
The essence of the variabilityweighted similarity approach is that the pairwise similarities take into account the variability in repeated measurements. In an attempt to understand the effect of variability between repeated measurements on these similarity measures, we computed the correlation coefficients between all pairs of genes in the yeast galactose data and plotted the distribution of the fraction of gene pairs against correlation coefficient by averaging over repeated measurements and against SDweighted correlation in Figure 2. The distribution of CVweighted correlation is similar to that of SDweighted.
Figure 2. Distribution of the fraction of gene pairs against correlation coefficient. Correlation coefficients are computed from averaging over repeated measurements and using SD over repeated measurements as weights on the yeast galactose data. There are more gene pairs with correlation coefficients around 0 and fewer gene pairs with correlation coefficients near 1 when SDweighted correlation is used.
Figure 2 shows that when SD is used in variabilityweighted correlation, there are more gene pairs with correlation coefficients around 0 and fewer gene pairs with correlation coefficients near 1. Figure 3 shows the distribution of Euclidean distance by averaging over the repeated measurements and the SDweighted distance on the same data. There are more gene pairs with distance close to zero when variability estimates are used to weigh distance. This shows that weighing similarity measures with variability estimates produces more conservative estimates of pairwise similarities.
Figure 3. Distribution of the fraction of gene pairs against Euclidean distance. Euclidean distances are computed from averaging over repeated measurements and using SD over repeated measurements as weights on the yeast galactose data. There are more gene pairs with Euclidean distances near 0 when SDweighted distance is used.
Moreover, we showed that on average, variabilityweighted similarity measures (both correlation and distance) computed from repeated measurements produced pairwise similarities closer to the true similarity than similarity measures computed from data with no repeated measurement. In our simulation experiment, we computed the true pairwise correlation and distance between all pairs of genes on the estimated true mean yeast galactose data (using the error model in Ideker et al. [24]). We also computed the variabilityweighted correlation and distance between all pairs of genes on the synthetic remeasured data generated from the same error parameters and mean intensities as the yeast galactose data. In addition, we computed correlation and distance using only one of the repeated measurements in the remeasured data. Then, we compared the average deviation of the variabilityweighted similarity measures from the truth, and the average deviation of the similarity measures on the data with no repeated measurements to the truth (see Materials and methods for detailed results).
Modified variabilityweighted approach
One of the drawbacks of the current definitions of the variabilityweighted similarity approach is that only noisy experiments are downweighted, whereas noisy genes are not. Suppose we have a dataset in which some genes are noisier than others, but the noise levels across the experiments stay relatively constant. In this scenario, the variabilityweighted approach would not improve cluster quality. Genes expressed at low levels are frequently expressed at low levels across all experiments and usually have higher variability (see Figure 4). Hence, unless we filter out lowintensity genes, the weighting methods developed by Hughes et al. ([2] and see Materials and methods) will not downweight these genes. We attempted to correct for this effect by removing the normalizing factor in the definition of variabilityweighted distance (see Materials and methods for mathematical definitions). This improved the clustering accuracy when Euclidean distance was used. However, we did not see improvement using this method with correlation as the similarity measure.
Figure 4. Distribution of error plotted against intensity. The SDs over the log ratios from repeated measurements are plotted against the average intensities over repeated measurements in a typical experiment on the yeast galactose data.
Conclusions
Our work shows that clustering array data with repeated measurements can significantly improve cluster quality, especially when the appropriate clustering approach is applied. Different clustering algorithms and different methods to take advantage of repeated measurements (not surprisingly) yield different clusters with different quality. In practice, many clustering algorithms are frequently run on the same dataset and the results most consistent with previous beliefs are published. A better approach would be to use a clustering algorithm shown to be the most accurate and stable when applied to data with similar signaltonoise and other characteristics as the data of interest. In this work, we analyzed both real and completely synthetic data with many algorithms to assess cluster accuracy and stability. In general, the modelbased clustering approaches produce higherquality clusters, especially the elliptical model of the IMM. In particular, the higher the noise level, the greater the performance difference between the IMM approach and other methods.
For the heuristically based approaches, average linkage hierarchical clustering algorithm combined with SDweighted Euclidean distance also produces relatively stable and accurate clusters. On the completely synthetic data, we showed that the infinite mixture approach works amazingly well with only four repeated measurements, even at high noise levels. The variabilityweighted approach works almost as well as the IMM with 20 repeated measurements. From our results on the synthetic data, we showed that there is significant improvement in cluster accuracy from one to four repeated measurements using IMM at both low and high noise levels (Table 5). However, there is no substantial improvement in cluster accuracy from 4 to 20 repeated measurements with the IMM approach (Table 5).
There are many possible directions of future work, both methodological and experimental. Because the elliptical model of IMM produces very highquality clusters, it would be interesting to develop a similar error model in the finite modelbased framework on MCLUST and to compare the performance of the finite versus infinite mixture approaches. Another practical methodological development would be to incorporate the estimation of missing data values into the modelbased approaches. It would also be interesting to develop other variabilityweighted similarity measures that would downweight both noisy genes and noisy experiments.
In terms of future experimental work, we would like to evaluate the performance of various clustering algorithms on array data with repeated measurements on more real datasets. One of the difficulties we encountered is that there are very few public datasets that have both repeated measurements and external criteria available. We would greatly appreciate it if readers would provide us with access to such datasets as they become available.
Materials and methods
Datasets
Yeast galactose data
Ideker et al. [22] studied galactose utilization in yeast using cDNA arrays by deleting nine genes on the galactose utilization pathway in the presence or absence of galactose and raffinose. There are a total of 20 experiments (nine singlegene deletions and one wildtype experiment with galactose and raffinose, nine deletions and one wildtype without galactose and raffinose). Four replicate hybridizations were performed for each experiment. We used a subset of 205 genes from this data, whose expression patterns reflect four functional categories in the GO [23].
Synthetic remeasured cDNA data
Let x_{ijr }and y_{ijr }be the fluorescent intensities of the two channels (fluorescent dyes) for gene i, experiment j and repeated measurement r, where i = 1, ..., G, j = 1, .., E, r = 1, .., R. For the yeast galactose data, G is approximately 6,000, E is 20 and R is 4. Ideker et al. [24] proposed an error model for replicated cDNA array data in which the observed fluorescent intensity levels are related to their true expression levels by the following model:
where (μ_{xij}, μ_{yij}) are the true mean intensity levels for gene i under experiment j in the two channels. The multiplicative error parameters in the two channels (ε_{xijr}, ε_{yijr}) are assumed to follow the bivariate normal distribution with mean 0, SDs σ_{εxj}, σ_{εyj }and correlation ρ_{εj}. Similarly, the additive error parameters (δ_{xijr}, δ_{yijr}) are assumed to follow the bivariate normal distribution with mean 0, SDs σ_{δxj}, σ_{δyj }and correlation ρ_{δj}. The geneindependent parameters (σ_{εxj}, σ_{εyj}, ρ_{εj}, σ_{δxj}, σ_{δyj}, ρ_{δj}) and the genedependent parameters (μ_{xij}, μ_{yij}), where i = 1, ..., G and j = 1, ..., E, are estimated by maximum likelihood [24].
Using this error model, we estimate the true expression intensities for each gene and the geneindependent parameters for each of the 20 experiments in the yeast galactose data. From the gene independent parameters (σ_{εxj}, σ_{εyj}, ρ_{εj}, σ_{δxj}, σ_{δyj}, ρ_{δj}), we generate random (ε_{xijr}, ε_{yijr}) and (δ_{xijr}, δ_{yijr}) from the bivariate normal distributions. Hence, we can generate random remeasured data (and log ratios) using the estimated true mean intensities (μ_{xij}, μ_{yij}).
Completely synthetic data
The completely synthetic datasets consist of 400 data points (genes), 20 attributes (experiments) and 6 classes. Let φ(i,j) be the artificial pattern of gene i and experiment j before error is added, and suppose gene i belongs to class k. Four of the six classes follow the periodic sine function (φ(i,j) = sin (2πj/10  ω_{k})), and the remaining two classes follow the nonperiodic linear function (φ(i,j) = j/20 or φ(i,j) = j/20), where i = 1, 2, 3, ..., 400, j = 1, 2, 3, ..., 20, k = 1, 2, 3, 4 and ω_{k }is a random phase shift between 0 and 2π. Let X(i,j,r) be the erroradded value for gene i, experiment j and repeated measurement r. Let the randomly sampled error be σ_{ij }for gene i and experiment j, and X(i,j,r) is generated from a random normal distribution with mean equal to φ(i,j), and SD equal to σ_{ij}.
We define the signaltonoise ratio of a synthetic dataset to be the ratio of the range of signals (in our case, 1(1) = 2) to the average sampled error. For the completely synthetic data shown in Figure 1a,1b the signaltonoise ratios are 14.3 and 2.5 respectively.
Missing data
The yeast galactose dataset [22] contains approximately 8% of missing data values. There are many possible sources of missing data values, for example, low signaltonoise ratios, dust or scratches on slides. As the current versions of MCLUST [30] and the IMM implementation [18] do not handle missing data values, we impute the missing data values. We experimented with two imputation methods, namely modelbased multiple imputation [31] as implemented in Splus, and weighted knearest neighbors (KNNimpute) [32]. We found that data after KNNimpute produce higherquality clusters than data after modelbased multiple imputation. Therefore, we applied KNNimpute to the yeast galactose data before applying the modelbased approaches.
Notations and similarity measures
Suppose there are G genes, E experiments, and R repeated measurements. Denote the measured expression level from repeated measurement r of gene g under experiment e as X_{ger}, where g = 1, ..., G, e = 1, ..., E and r = 1, ..., R. Let D be the raw data matrix such that D(g,e) represents the average expression level over R repeated measurements for gene g under experiment e, that is,
,
where g = 1, ..., G, e = 1, ..., E. The correlation coefficient between a pair of genes i and j (i,j = 1, .., G) is defined as
where
is the average expression level of gene i over all E experiments. The Euclidean distance between a pair of genes i and j (i,j = 1, .., G) is defined as
.
Similarly, we can define correlation and Euclidean distance between a pair of experiments by swapping the positions of the gene and experiment indices.
Variabilityweighted similarity measures
Hughes et al. [2] defined errorweighted similarity measures that weight expression values with error estimates such that expression values with relatively high errors are downweighted. Let σ_{ge }be the error estimate of the expression level of gene g under experiment e, where g = 1, ..., G and e = 1, ..., E. The errorweighted correlation between a pair of genes i and j is defined as
where
is the weighted average expression level of gene i. Similarly, the errorweighted Euclidean distance [2] is defined as
.
In our empirical study, variability estimates are used instead of error estimates. In particular, we use either the SD or CV over the R repeated measurements as σ_{ge}. These variabilityweighted similarity measures serve as inputs to many clustering algorithms.
Modified variabilityweighted distance
The above definitions of variabilityweighted correlation and distance downweight noisy experiments in computing the pairwise similarity, but would not work in the case of noisy genes. Consider two pairs of genes, (X, Y) and (W, Z), such that D(X,e) = D(W,e) and D(Y,e) = D(Z,e) and σ_{Xe }= σ_{Ye }<<σ_{We }= σ_{Ze }for all experiments e. In other words, the expression patterns of gene X and gene W are identical, so are the patterns of gene Y and gene Z. The levels of noise (or variability) are constant across all the experiments for each pair of genes, but genes (W,Z) are much more noisy than genes (X,Y). Using the above definitions of variabilityweighted similarity, _{XY }= _{WZ }and _{XY }= _{WZ}. Intuitively, one would expect the pairwise similarity between genes (W,Z) to be lower than that of genes (X,Y) because genes (W,Z) are more noisy. We experimented with a modified definition of variabilityweighted distance by removing the scaling factor in the denominator:
This modified definition tends to give slightly better clusters (see Additional data files and [33]).
Clustering algorithms
Agglomerative hierarchical algorithms
In agglomerative hierarchical clustering algorithms [5], each object is initially assigned to its own cluster (subtree), and the number of initial clusters is equal to the number of objects. Similar clusters (subtrees) are successively merged to form a dendrogram. In each merging step, the number of clusters (subtrees) is reduced by one. This merging process is repeated until the desired number of clusters, K, is produced. The objects in these K subtrees form the resulting K clusters, and the hierarchical structures of the subtrees are ignored.
Different definitions of cluster similarity yield different clustering algorithms. In a single linkage hierarchical algorithm, the cluster similarity of two clusters is the maximum similarity between a pair of genes, one from each of the two clusters. In a complete linkage hierarchical algorithm, the cluster similarity is defined as the minimum similarity between a pair of genes, one from each of the two clusters. In an average linkage hierarchical algorithm, the cluster similarity of two clusters is the average pairwise similarity between genes in the two clusters. In a centroid linkage hierarchical algorithm, clusters (subtrees) are represented by the mean vectors of the clusters, and cluster similarity is defined as the similarity between the mean vectors.
kmeans
Kmeans [7] is a classic iterative clustering algorithm, in which the number of clusters is an input to the algorithm. Clusters are represented by centroids, which are cluster centers. The goal of kmeans is to minimize the sum of distances from each object to its corresponding centroid. In each iteration, each gene is assigned to its closest centroid. After the gene reassignment, new centroids are computed. The steps of assigning genes to centroids and computing new centroids are repeated until no genes are moved between clusters. In our implementation, we use the clusters from average linkage hierarchical algorithm to compute initial centroids to start kmeans.
MCLUST
The finite Gaussian mixture modelbased approach assumes that each cluster follows the multivariate normal distribution with model parameters that specify the location and shape of each cluster. MCLUST [8] implements the expectationmaximization (EM) algorithm for clustering via finite Gaussian mixture models, as well as modelbased hierarchical clustering algorithms, with optional crosscluster constraints. MCLUST also includes a clustering function (hcVVV) that uses modelbased hierarchical clustering to initialize the EM algorithm. Because the current version of MCLUST does not have any mechanism to incorporate repeated measurements, but does allow initializations at nontrivial partitions, we initialize the hierarchical modelbased algorithm with subtrees containing repeated measurements. We use the most general model (unconstrained) for hierarchical clustering, which allows each cluster to have different volume, orientation and shape. This approach is abbreviated as MCLUSTHC.
IMM
The IMM approach uses a Gibbs sampler to estimate the posterior pairwise probabilities. The Gibbs sampler requires two sets of parameters for input: initialization parameters (random seed and the initial number of mixture components) and convergence parameters (initial annealing coefficient, the rate of 'cooling' and the 'burnin' period). A posterior distribution with multiple peaks could result in Gibbs samplers' inability to escape from a suboptimal peak. The role of the annealing coefficient [34] is to flatten the posterior distribution of clustering results and thus alleviate the difficulty in transitioning between highprobability regions that are separated by regions of low probability, which is a common problem of Gibbs samplers in general [35]. Burnin corresponds to the number of initial iterations that the Gibbs sampler takes to converge to the posterior distribution, and the burnin iterations are not used in calculating posterior pairwise probabilities. We tuned the convergence parameters by running independent samplers with different initialization parameters, and chose the set of convergence parameters that yielded the highest correlation between pairwise probabilities over different runs and over different random perturbations of the data. Using this simple principle, we identified a single combination of the annealing parameters that resulted in excellent convergence in all datasets we analyzed, including some not reported in this paper. This combination consisted of the initial annealing coefficient of 0.01, rate of cooling of 0.999 and the burnin of 10,000 iterations. For investigators analyzing their own data, we suggest that they run at least five independent Gibbs samplers with this combination of parameters from five different initial numbers of clusters and establish that all five converge to the same posterior distribution. This can be done by calculating correlation between posterior pairwise probabilities from different runs. Alternatively, the adjusted Rand index can be used for comparing clustering results generated by different runs of the Gibbs sampler. If the correlations or adjusted Rand indices suggest that all five samplers did not converge to the same solution, investigators should try increasing the annealing coefficient (to say 0.9995) and the burnin number of iterations (to say 20,000), and repeat the process. The Readme.txt file that accompanies the IMM software describes these parameters in detail.
CAST
The cluster affinity search technique (CAST) [21] is an iterative algorithm, in which objects are added to or removed from the current cluster until there are no more similar objects to be added and no more dissimilar objects to be removed. At this point, the current cluster is assumed to be done. A new cluster is started and the iterative process of adding and removing objects is repeated until all objects are assigned to clusters. The inputs to the algorithm include the pairwise similarities and a parameter that indirectly controls the number of clusters.
DIANA
DIANA [20] is a hierarchical divisive clustering algorithm, in which we start with all objects in one cluster. In each step, clusters are successively split to form two clusters until the desired number of clusters is reached. The cluster with maximum diameter (maximum pairwise dissimilarity) is split in each step. Let us call this the current cluster. The most dissimilar element in the current cluster is identified to start a new cluster. An object in the current cluster is moved to the new cluster if the average similarity with the new cluster is higher than that with the current cluster.
Completely synthetic data with different numbers of repeated measurements
Table 5 shows some selected results produced using average linkage hierarchical algorithm on the completely synthetic data over varying numbers of repeated measurements and different noise levels. In general, increasing the number of repeated measurements increases cluster accuracy (average adjusted Rand index with respect to the six classes). The elliptical model of IMM produced superior quality of clusters, especially at high noise levels.
Simulation experiment: variabilityweighted similarity
We computed the true pairwise correlation and Euclidean distance between all pairs of genes on the estimated true mean yeast galactose data. Denote the correlation between estimated true means for gene i and gene j as . We generated synthetic remeasured datasets using the same error parameters and true mean intensities of the yeast galactose data. Let the variabilityweighted correlation for gene i and gene j be , and the correlation computed using only one of the repeated measurements, r (no repeated data) be , where k is one of the randomly generated synthetic remeasured data.
The column   ρ^{true} in Table 6 shows the average of over all pairs of genes i, j, and all randomly remeasured datasets k, while the column ρ^{r } ρ^{true} shows the average of over all pairs of genes i, j, all randomly remeasured datasets k, and all repeated measurements r. The corresponding results using distance are also shown in Table 6, which shows that on average the variabilityweighted similarities are closer to the 'truth' than similarities computed from data with no repeated measurement.
Table 6. Simulation results
Additional data files
Datasets (both real and synthetic) and the software (executables and documentation) used in this work are available as additional files and from our website [33]. They comprise additional results (Additional data file 1), documentation (Additional data file 2), bytecode files for hierarchical agglomerative algorithms (Additional data file 3), bytecode files for kmeans (Additional data file 4), bytecode files for hierarchical agglomerative algorithms using FITSS (Additional data file 5), subset of 205 genes from yeast data (Additional data files 6, 7, 8), and the completely synthetic datasets (Additional data files 9, 10, 11, 12, 13, 14). Our website [33] also has external links to publicly available software, yeast galactose data and lung cancer data.
Additional data file 1. Additional results
Format: PDF Size: 35KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 2. Documentation
Format: PDF Size: 32KB Download file
This file can be viewed with: Adobe Acrobat Reader
Additional data file 3. Bytecode files for hierarchical agglomerative algorithms
Format: GZ Size: 15KB Download file
Additional data file 4. Bytecode files for kmeans
Format: GZ Size: 16KB Download file
Additional data file 5. Bytecode files for hierarchical agglomerative algorithms using FITSS
Format: GZ Size: 16KB Download file
Additional data file 6. A subset of 205 genes from yeast data
Format: TXT Size: 1KB Download file
Additional data file 7. A subset of 205 genes from yeast data
Format: TXT Size: 282KB Download file
Additional data file 8. A subset of 205 genes from yeast data
Format: TXT Size: 108KB Download file
Additional data file 9. A completely synthetic dataset
Format: GZ Size: 1.4MB Download file
Additional data file 10. A completely synthetic dataset
Format: GZ Size: 1.4MB Download file
Additional data file 11. A completely synthetic dataset
Format: GZ Size: 370KB Download file
Additional data file 12. A completely synthetic dataset
Format: GZ Size: 373KB Download file
Additional data file 13. A completely synthetic dataset
Format: GZ Size: 6.8MB Download file
Additional data file 14. A completely synthetic dataset
Format: GZ Size: 6.9MB Download file
Acknowledgements
We thank Vesteinn Thorsson for providing us with the yeast galactose data, and Chris Fraley and Adrian Raftery for their valuable comments. We also acknowledge the following publicly available software packages: MCLUST [30], Vera and SAM [24], KNNimpute [32], dCHIP [36] and the BioConductor project. R.E.B. and K.Y.Y. are supported by NIHNIDDK grant 5U24DK05881302. M.M. is supported by NIEHS grants 2P30 ES0609611 and ES0490812.
References

Eisen MB, Spellman PT, Brown PO, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci USA 1998, 95:1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al.: Functional discovery via a compendium of expression profiles.
Cell 2000, 102:109126. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh ML, Downing JR, Caligiuri MA, et al.: Molecular classification of cancer: class discovery and class prediction by gene expression monitoring.
Science 1999, 286:531537. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

van't Veer LJ, Dai H, van de Vijver MJ, He YD, Hart AA, Mao M, Peterse HL, van der Kooy K, Marton MJ, Witteveen AT, et al.: Gene expression profiling predicts clinical outcome of breast cancer.
Nature 2002, 415:530536. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Hartigan JA: Clustering Algorithms. New York: John Wiley and Sons; 1975.

Kohonen T: Selforganizing Maps. Berlin/Heidelberg: SpringerVerlag; 1997.

MacQueen J: Some methods for classification and analysis of multivariate observations. In In Proc 5th Berkeley Symp Math Stat Probability. Edited by University of California Press. Cam LML, Neyman J; 1965:281297.

Fraley C, Raftery AE: Modelbased clustering, discriminant analysis, and density estimation.
J Am Stat Assoc 2002, 97:611631. Publisher Full Text

McLachlan GJ, Basford KE: Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker; 1988.

McLachlan GJ, Peel D: Finite Mixture Models. New York: Wiley; 2000.

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and application to hematopoietic differentiation.
Proc Natl Acad Sci USA 1999, 96:29072912. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tavazoie S, Huges JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.
Nat Genet 1999, 22:281285. PubMed Abstract  Publisher Full Text

McLachlan GJ, Bean RW, Peel D: A mixture modelbased approach to the clustering of microarray expression data.
Bioinformatics 2002, 18:413422. PubMed Abstract  Publisher Full Text

Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17:977987. PubMed Abstract  Publisher Full Text

Sherlock G: Analysis of largescale gene expression data.
Curr Opin Immunol 2000, 12:201205. PubMed Abstract  Publisher Full Text

Sharan R, Elkon R, Shamir R: Cluster analysis and its applications to gene expression data. In In Ernst Schering Workshop on Bioinformatics and Genome Analysis. Berlin: SpringerVerlag; 2002:83108.

Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments.
Proc Natl Acad Sci USA 2001, 98:89618965. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Medvedovic M, Sivaganesan S: Bayesian infinite mixture model based clustering of gene expression profiles.
Bioinformatics 2002, 18:11941206. PubMed Abstract  Publisher Full Text

Jain AK, Dubes RC: Algorithms for Clustering Data. Englewood Cliffs. NJ: Prentice Hall; 1988.

Kaufman L, Rousseeuw PJ: Finding Groups in Data: An Introduction to Cluster Analysis. New York: John Wiley and Sons; 1990.

BenDor A, Shamir R, Yakhini Z: Clustering gene expression patterns.
J Comput Biol 1999, 6:281297. PubMed Abstract  Publisher Full Text

Ideker T, Thorsson V, Ranish JA, Christmas R, Buhler J, Eng JK, Bumgarner RE, Goodlett DR, Aebersold R, Hood L: Integrated genomic and proteomic analyses of a systemically perturbed metabolic network.
Science 2001, 292:929934. PubMed Abstract  Publisher Full Text

Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al.: Gene Ontology: tool for the unification of biology.
Nat Genet 2000, 25:2529. PubMed Abstract  Publisher Full Text

Ideker T, Thorsson V, Siegel AF, Hood L: Testing for differentiallyexpressed genes by maximumlikelihood analysis of microarray data.
J Comput Biol 2000, 7:805817. PubMed Abstract  Publisher Full Text

Milligan GW, Cooper MC: A study of the comparability of external criteria for hierarchical cluster analysis.

Rand WM: Objective criteria for the evaluation of clustering methods.

Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: model validation, design issues and standard error application.
Genome Biol 2001, 2:research0032.10032.11. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Banfield JD, Raftery AE: Modelbased Gaussian and nonGaussian clustering.

Fraley C, Raftery AE: MCLUST: software for modelbased clustering, discriminant analysis, and density estimation. [ftp://ftp.u.washington.edu/public/mclust/tr415.pdf] webcite
Technical Report No. 415. Seattle, WA: Department of Statistics, University of Washington 2002.

Schafer JL: Analysis of Incomplete Multivariate Data. London: Chapman and Hall; 1997.

Troyanskaya O, Cantor M, Sherlock G, Brown P, Hastie T, Tibshirani R, Botstein D, Altman RB: Missing value estimation methods for DNA microarrays.
Bioinformatics 2001, 17:520525. PubMed Abstract  Publisher Full Text

University of Washington (Microbiology): public microarray data download site [http://www.expression.washington.edu/public] webcite

Medvedovic M, Succop P, Shukla R, Dixon K: Clustering mutational spectra via classification likelihood and Markov chain Monte Carlo algorithm.
J Agric, Biol Environ Stat 2001, 6:1937. Publisher Full Text

Jennison C: Discussion on the meeting on the Gibbs sampler and other Markov chain Monte Carlo methods.

Li C, Wong WH: Modelbased analysis of oligonucleotide arrays: expression index computation and outlier detection.
Proc Natl Acad Sci USA 2001, 98:3136. PubMed Abstract  Publisher Full Text  PubMed Central Full Text