Abstract
Background
Microarray technologies are emerging as a promising tool for genomic studies. The challenge now is how to analyze the resulting large amounts of data. Clustering techniques have been widely applied in analyzing microarray geneexpression data. However, normal mixture modelbased cluster analysis has not been widely used for such data, although it has a solid probabilistic foundation. Here, we introduce and illustrate its use in detecting differentially expressed genes. In particular, we do not cluster geneexpression patterns but a summary statistic, the tstatistic.
Results
The method is applied to a data set containing expression levels of 1,176 genes of rats with and without pneumococcal middleear infection. Three clusters were found, two of which contain more than 95% genes with almost no altered geneexpression levels, whereas the third one has 30 genes with more or less differential geneexpression levels.
Conclusions
Our results indicate that modelbased clustering of tstatistics (and possibly other summary statistics) can be a useful statistical tool to exploit differential gene expression for microarray data.
Background
The pattern of genes expressed in a cell can provide important information about the cell state. DNA microarray technology can measure the expression of thousands of genes in a biological sample. DNA microarrays have been increasingly used in the last few years and have the potential to help advance our biological knowledge at a genomic scale [1,2]. In analyzing DNA microarray geneexpression data, a major role has been played by various clusteranalysis techniques, most notably by hierarchical clustering [3], Kmeans clustering [4] and selforganizing maps [5]. These clustering techniques contribute significantly to our understanding of the underlying biological phenomena. A recent review of various methods is provided by Tibshirani et al. [6]. However, many methods, including the three mentioned above, have some restrictions, one of which is their inability to determine the number of clusters. The difficulty may be related to the fact that in many methods there is no clear definition of what a cluster is in the first place. Furthermore, their clustering results may not be stable [7,8]. An important clustering technique that improves on and/or provides alternative solutions to these issues is modelbased clustering (see, for example, [9]). It has a clear definition that a cluster is a subpopulation with a certain distribution, and several statistical methods can be applied to estimate the number of clusters. Some authors have considered its application to cluster geneexpression patterns [10,11,12].
Here we consider the use of modelbased clustering in the context of detecting differentially expressed genes, which is to identify all the genes with altered expression under two experimental conditions (for example, normal cells versus cancer cells). We note that the goal here is different from that of clustering geneexpression patterns, as done by other researchers in using modelbased clustering. In modeling differential expression levels of genes, it is natural to assume that genes are from two subpopulations, one with constant and another with changed expression levels. Hence, a twocomponent mixture is a reasonable model. This is the approach proposed by Lee et al. [13], where it is assumed that each of the two components has a normal (in the statistical sense) distribution. However, in general, each component does not necessarily have a normal distribution. It is well known that many distributions can be well approximated by a finite mixture of normal distributions. Hence, the normal mixture modelbased clustering can be regarded as a more general and flexible approach along these lines and we pursue this approach here. In particular, we summarize a possible change of expression of a gene using a tstatistic, which automatically accounts for differential variations of expression levels across genes. Then we apply modelbased clustering to these tstatistics to exploit which genes have differential expression levels. The methodology is illustrated with an application to a dataset containing the expression levels of 1,176 genes of normal rats and those with pneumococcal middleear infection.
Results and discussion
Data and preprocessing
Pneumococcal otitis media is one of the most common diseases in children. Almost every child in the United States experiences at least one episode of acute otitis media by the age of 5 years. To understand the pathogenesis of otitis media, it is important to identify genes involved in response to pneumococcal middleear infection and to study their roles in otitis media. A study was recently carried out at the University of Minnesota, applying radioactively labeled cDNA microarrays [14] to the mRNA analysis of 1,176 genes in middleear mucosa of rats with and without subacute pneumococcal middleear infection. It consisted of six experiments: two cDNA microarrays were run with controls while four were run with pneumococcal middleear infection. We first take a natural logarithm transformation for all the observed geneexpression levels so that they are more likely to have a normal distribution, which will reduce the number of clusters found in a modelbased clustering. The histograms of geneexpression levels before and after logtransformation for the first experiment are shown in Figure 1. It can be seen that the logtransformation reduces the skewness of the distribution of geneexpression levels.
Figure 1. Histograms of radioactivity intensity levels for the first experiment, a cDNA microarray analysis of 1,176 genes in middleear mucosa of healthy (control) rats. (a) Before logtransformation; (b) after logtransformation.
After taking logtransformation, for each experiment we then standardize the transformed geneexpression levels by subtracting their median value. The above standardization is based on the assumption that most genes, at least a half, will not be expressed. The median is used because it is more robust against outliers than is the more commonly used mean. We use x_{ij} to denote the resulting expression level of gene i from experiment j. Note that the first two experiments (that is, j = 1 and 2) were conducted using control rats whereas the last four (that is, j = 3, 4, 5, 6) using infected rats. Some scatterplots showing comparisons between experiments are presented in Figure 2. It can be seen that, in general, there is a good agreement as well as some variation between the experiments under the same condition, that is, either within the control group or within the infected group. It appears that expression of some genes are altered with pneumococcal infection.
Figure 2. Comparison of the logtransformed, standardized expression data between experiments. Experiments 1 and 2 were conducted using control rats; experiments 36 used infected rats.
On the basis of the above observation, we calculate the following twosample tstatistic for each gene as its measure of possible differential expression:
where:
for i = 1, ...,1176. The numerator of y_{i} is the difference of average geneexpression levels under the two conditions (infected versus control), whereas the denominator is the sample standard error of the numerator and serves to standardize the observed difference by penalizing those with large (and thus less reliable) variations. Previous studies have found evidence that genes may have differential variability of expression levels [15,16,17]. Note that although the tstatistic is constructed, we shall not conduct ttests because there is no evidence to support the questionable normality assumption required by the ttest. We also do not carry out permutation or other nonparametric tests [18] because of the small sample size (that is, 2 + 4). This is also related with the fact that there exists the problem of multiple comparisons if we test gene by gene [18]. Our goal here is to apply modelbased cluster analysis to the preprocessed relative geneexpression levels y_{i}, i = 1, ..., 1176, and see which genes will have relative levels far away from the majority.
Modelbased clustering
Finite mixtures of distributions provide a flexible as well as rigorous approach to modeling various random phenomena (for example, [19]). For continuous data, such as geneexpression data, the use of normal components in the mixture distribution is natural. With a normal mixture modelbased approach to clustering, it is assumed that the data to be clustered are from several subpopulations (or clusters or components) with distinguished normal distributions. That is, each data point y is taken to be a realization from a normal mixture distribution with the probability density function:
where φ (y;μ_{i},V_{i}) denotes the normal density function with mean μ_{i} and (co)variance matrix V_{i}, and π_{i}'s are mixing proportions. We use Φ_{g} to represent all unknown parameters (π_{i}, μ_{i}, V_{i}): i = 1, ... g in a gcomponent (or gcluster) mixture model.
In modelbased clustering, first, the above mixture model is fitted to the data and obtain the maximum likelihood estimate _{g}. Second, the posterior probabilities of each data point belonging to each of the g normal components can be calculated. Finally, each data point is assigned to the component with the largest posterior probability. We review the major steps in the following.
The mixture model is typically fitted by maximum likelihood using the expectationmaximization (EM) algorithm [20]. Given n observations y_{1}, ..., y_{n}, we want to maximize the loglikelihood
to obtain the maximum likelihood estimate _{g}. The EM algorithm computes _{g} by iterating the following steps.
Suppose that at the kth iteration, the parameter estimates are π_{i}^{(k)}'s, μ_{i}^{(k)}'s and V_{i}^{(k)}'s. Then in the (k + 1)th iteration, the estimates are updated by
for i = 1, ..., g where
is the posterior probability that y_{j} belongs to the ith component of the mixture, using the current parameter estimate for Φ_{g}, for i = 1, ..., g and j = 1, ..., n.
At convergence, we obtain _{g} = as the maximum likelihood estimate. As local maxima can be found by the EM algorithm, it is desirable to run the algorithm multiple times with various starting values and choose the estimate as the one resulting in the largest loglikelihood.
One interesting but difficult problem in cluster analysis is to determine the number of components g. In contrast to many other approaches that fail to accomplish this goal, modelbased clustering provides several useful and objective selection criteria, which have been used in other model selection problems. The best known are the Akaike Information Criterion (AIC) [21] and the Bayesian Information Criterion (BIC) [22]:
where v_{g} is the number of independent parameters in Φ_{g}. In using the AIC or BIC, one first fits series of models with various values of g, then one picks up the g with the smallest AIC or BIC.
In many studies related to model selection, it is found that AIC may select too large a model whereas BIC may select too small a model. This phenomenon appears to hold in selecting g in the mixture analysis [23]. Some other criteria have been studied but there does not seem to be a clear winner [23]. Banfield and Raftery [24] proposed using approximate weight of evidence as an approximate Bayesian model selection criterion. Some empirical studies seem to favor the use of BIC [25]. We feel that a combined use of AIC and BIC is helpful, at least in providing a range of reasonable values of g.
A different approach to selecting g is through hypothesis testing. This could be done through the use of the loglikelihood ratio test (LRT) to test for the null hypothesis H_{0}: g = g_{0} against the alternative H_{1}: g = g_{0} +1 for any given positive integer g_{0}. The LRT statistic is 2 log L(_{0}+1)  2 log L(_{0}), which, however, does not have the usual asymptotically chisquared distribution as a result of violation of required regularity conditions (for example, the maximum likelihood estimate may lie in the boundary of its parameter space). McLachlan [26] proposed using the bootstrap to approximate the distribution of the LRT statistic under the null hypothesis. On the basis of the resulting p value, one can decide whether to reject H_{0}.
Implementation
McLachlan et al. [27] have implemented modelbased clustering in a standalone Fortran program called EMMIX, which is freely available from the web [28]. It supports all the functions we described above, including multiple start of the EM algorithm using random partition or Kmean clustering, calculation of the model selection criteria AIC and BIC, and the use of the bootstrap to test a given number of components g_{0}. We will use EMMIX to analyze the geneexpression data described earlier.
The MCLUST software [29], implementing modelbased clustering, is also freely available [30]. It is designed to interface with the commercial statistical package SPlus. For users familiar with SPlus, it is convenient to take advantage of the power and flexibility of SPlus. However, at the same time, it can have some serious restrictions on the size of the data being analyzed because of the overhead on CPU speed and memory induced by SPlus.
Application
We fitted five mixture models with g ranging from 1 to 5. Table 1 summarizes the model fitting results. Using AIC or BIC, we would select g = 4 or g = 3 respectively. Also, from the loglikelihood values, there is a dramatic change when g is increased from 1 or 2. However, from g = 3 log L increases very slowly. Hence, both g = 3 and g = 4 appear reasonable. To determine which one is better, we applied the bootstrap method (also implemented in EMMIX) to test H_{0}: g = 3 versus H_{1}: g = 4. Using 100 bootstrap resamples, we were unable to reject H_{0} as the resulting p value is 0.18, larger than the usual 0.05 nominal level. In contrast, if we test H_{0}: g = 2 versus H_{1}: g = 3, then we will reject H_{0} with a small p value 0.01. Therefore, we choose to fit a threecomponent normal mixture model.
Table 1. Clustering results with various number of components g
The fitted mixture model is
f(y; ) = 0.042 x N(6.74, 77.07) + 0.510 × N(0.88, 5.56) + 0.448 × N(0.31, 1.15).
More than 95% of data points fall into the two clusters with means close to 0. That means there is either no or little change in geneexpression levels for most genes. On the other hand, 30 genes classified into the first cluster seem to have a change in geneexpression levels. This can be verified from Figure 3, which shows the profiles of geneexpression levels across all six experiments for each cluster.
Figure 3. Geneexpression profiles of the four clusters found using the method described. Each line represents a single gene. Clusters 2 and 3 (containing over 95% of genes) show little change in geneexpression levels; cluster 1 (30 genes) and cluster 4 (6 genes) do show changes in geneexpression levels.
In addition to determining the number of clusters, modelbased clustering has another advantage in providing posterior probabilities of observations belonging to each cluster. The posterior probabilities are calculated using Equations (1) and (2), and are presented in Figure 4. Recall that a gene is classified to a cluster if its posterior probability of being in the cluster is the largest. From Figure 4, it can be seen that if a gene's tstatistic has a large absolute value, then it will be classified into cluster 1. Specifically, if a tstatistic, y_{i}, is smaller than 6.54 or larger than 7.39, then the corresponding gene i is judged to be from cluster 1. Hence, cluster 1 consists of genes with large absolute values of tstatistics, implying that cluster 1 corresponds to genes with large changes of expression levels (after standardization by the variation of expression levels).
Figure 4. Posterior probability of a gene being in each cluster as a function of the tstatistic y, calculated using Equations (1) and (2). A gene is classified to a cluster if its posterior probability of being in the cluster is the largest.
Furthermore, the posterior probability can serve as a quantitative measurement of the strength of each gene being classified into each cluster. For instance, among 30 genes classified into the first cluster, there are respectively 17, 18, 20 and 21 genes with a posterior probability of being in the first cluster greater than 0.99, 0.95, 0.90 and 0.85. Hence, those 17 or 18 genes are likely to have expression levels significantly different from those of other majority genes. The posterior probability might also provide information about possible misclassifications. In addition to those classified into cluster 1, there might be other observations classified into the other two clusters but nevertheless with not too small probabilities of being classified into cluster 1. The lower right panel of Figure 4 shows six such observations, all belonging to cluster 2 but with probabilities of being in cluster 1 ranging from 0.30 to 0.48. These six genes show somewhat differential geneexpression levels, but the evidence is not strong and more experiments may be needed to verify this.
We hope we have shown that modelbased clustering is a powerful method that is useful in analyzing geneexpression data. It is flexible as well as intuitively understandable. However, it does have some limitations. Although it provides posterior probabilities for classification results, in the context of detecting differentially expressed genes its use is more in the line of exploratory data analyses. For instance, in our example, we treat cluster 1 as representing genes with changed expression whereas clusters 2 and 3 consist of genes without expression changes. Although this treatment is reasonable, it is somewhat subjective and is debatable. Some new statistical approaches [31,32,33] are interesting alternatives that provide a more quantitative answer to detecting genes with altered expression, but they require replicates of spots or arrays. Modelbased clustering is less restrictive and can be applied to data without replicates and to cluster (relative) geneexpression levels directly [13].
Materials and methods
Three young pathogenfree SpragueDawley rats were inoculated with pneumococcus in phosphatebuffered saline (PBS) and served as the pneumococcus group. Three other rats inoculated with PBS served as controls. All animals were sacrificed on day 42 after inoculation. The bullae from each of the pneumococcus or PBSinoculated groups were pooled and submitted for mRNA purification. Purified mRNAs, [α ^{32}P]dATP, dNTP mix and reverse transcriptase were incubated at 50°C for 25 min for the synthesis of radioactively labeled cDNA probes. The Atlas cDNA array membranes (Atlas rat 1.2 array, Clontech, CA) were hybridized with the cDNA probes and nonspecific binding washed away. Specific binding of cDNA probes with the membranes was scanned into a computer and the radioactive signal intensities of specific binding were quantitated with the OptiQant software (version 3.0, DeltaPackard, Boston, MA) and presented in digitalized light unit (DLU). The intensity level in DLU is the observed geneexpression level. As described earlier, the logtransformation was conducted on the intensity level in DLU, and the centering and scaling procedures were followed using the logtransformed data. The original data representing the intensity level (in DLU) for each gene from each of the six experiments are available from our website [34].
Acknowledgements
This research was partially supported by NIH grants.
References

Brown P, Botstein D: Exploring the new world of the genome with DNA microarrays.
Nat Genet 1999, 21(Suppl):3337. PubMed Abstract  Publisher Full Text

Nat Genet 1999, 21(Suppl):34. PubMed Abstract  Publisher Full Text

Eisen M, Spellman P, Brown P, Botstein D: Cluster analysis and display of genomewide expression patterns.
Proc Natl Acad Sci USA 1998, 95:1486314868. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tavazoie S, Hughes JD, Campbell MJ, Cho RJ, Church GM: Systematic determination of genetic network architecture.
Nat Genet 1999, 22:281285. PubMed Abstract  Publisher Full Text

Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR: Interpreting patterns of gene expression with selforganizing maps: methods and applications to hematopoietic differentiation.
Proc Natl Acad Sci USA 1999, 96:29072912. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Tibshirani R, Hastie T, Eisen M, Ross G, Botstein D, Brown P: Clustering methods for the analysis of DNA microarray data. [http://wwwstat.stanford.edu/~tibs/research.html] webcite
Technical Report, Department of Statistics, Stanford University, 1999.

Zhang K, Zhao H: Assessing reliability of gene clusters from gene expression data.
Funct Integr Genomics 2000, 1:156173. PubMed Abstract  Publisher Full Text

Kerr MK, Churchill GA: Bootstrapping cluster analysis: assessing the reliability of conclusions from microarray experiments.
Proc Natl Acad Sci USA 2001, 98:89618965. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Mixture Models: Inference and Applications to Clustering. New York: Marcel Dekker,. 1988.

Holmes I, Bruno WJ: Finding regulatory elements using joint likelihoods for sequence and expression profile data.
Proc 8th Int Conf Intelligent Systems for Molecular Biology. Menlo Park, CA: AAAI Press, 2000.

Barash Y, Friedman N: Contextspecific Bayesian clustering for gene expression data.
Proc Fifth Annual Int Conf Computational Biology. New York: Association for Computing Machinery, 2001.

Yeung KY, Fraley C, Murua A, Raftery AE, Ruzzo WL: Modelbased clustering and data transformations for gene expression data.
Bioinformatics 2001, 17:977987. PubMed Abstract  Publisher Full Text

Lee MLT, Kuo FC, Whitmore GA, Sklar J: Importance of replication in microarray gene expression studies: statistical methods and evidence from repetitive cDNA hybridizations.
Proc Natl Acad Sci USA 2000, 97:98349839. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Friemert C, Erfle V, Strauss G: Preparation of radiolabeled cDNA probes with high specific activity for rapid screening of gene expression.

Chen Y, Dougherty ER, Bittner ML: Ratiobased decisions and the quantitative analysis of cDNA microarray images.
J Biomed Optics 1997, 2:364367. Publisher Full Text

Ideker T, Thorsson V, Siehel AF, Hood LE: Testing for differentiallyexpressed genes by maximum likelihood analysis of microarray data.
J Comput Biol 2000, 7:805817. PubMed Abstract  Publisher Full Text

Newton MA, Kendziorski CM, Richmond CS, Blattner FR, Tsui KW: On differential variability of expression ratios: improving statistical inference about gene expression changes from microarray data.
J Comput Biol 2001, 8:3752. PubMed Abstract  Publisher Full Text

Dudoit S, Yang YH, Callow MJ, Speed TP: Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments.
Technical Report, Statistics Department, University of CaliforniaBerkeley, 2000.

Titteringto DM, Smith AFM, Makov UE:
Statistical Analysis of Finite Mixture Distributions. New York:Wiley,. 1985.

Dempster AP, Laird NM, Rubin DB: Maximum likelihood from incomplete data via the EM algorithm.

Akaike H: Information theory and an extension of the maximum likelihood principle.
In In 2nd Int Symp Information Theory. Edited by Petrov BN, Csaki F. Budapest: Akademiai Kiado, Edited by . 1973, 267281.

Biernacki C, Govaert G: Choosing models in modelbased clustering and discriminant analysis.

Banfield JD, Raftery AE: Modelbased Gaussian and nonGaussian clustering.

Fraley C, Raftery AE: How many clusters? Which clustering methods? Answers via modelbased cluster analysis.

McLachlan GL: On bootstrapping the likelihood ratio test statistic for the number of components in a normal mixture.

McLachlan GL, Peel D, Basford KE, Adams P: Fitting of mixtures of normal and tcomponents. [http://www.jstatsoft.org/v04/i02/] webcite

EMMIX [http://www.maths.uq.oz.au/~gjm/emmix/emmix.html] webcite

Fraley C, Raftery AE: MCLUST: Software for modelbased cluster analysis.
J Classification 1999, 16:297306. Publisher Full Text

Modelbased Clustering Software [http://www.stat.washington.edu/fraley/mclust] webcite

Efron B, Tibshirani R, Goss V, Chu G: Microarrays and their use in a comparative experiment. [http://wwwstat.stanford.edu/~tibs/research.html] webcite
Technical Report, Department of Statistics, Stanford University, 2000.

Tusher VG, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165121. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Pan W, Lin J, Le C: A mixture model approach to detecting differentially expressed genes with microarray data. [http://www.biostat.umn.edu/cgibin/rrs?print+2001] webcite
Technical Report 2001011, Division of Biostatistics, University of Minnesota, 2001.

Wei Pan website [http://www.biostat.umn.edu/~weip] webcite