A response to Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset by SE Choe, M Boutros, AM Michelson, GM Church and MS Halfon. Genome Biology 2005, 6:R16.
Correspondence
In a recent Genome Biology article, Choe et al. [1] described a control dataset for Affymetrix GeneChips. By spiking RNA at known quantities, the identities of all null and differentially expressed genes are known exactly, as well as the fold change of differential expression. With the wealth of analysis methods available for microarray data, a control dataset would be very useful. Unfortunately, serious errors are evident in the Choe et al. data, disproving their conclusions and implying that the dataset cannot be used to validly evaluate statistical inference methods. We argue that problems in the dataset are at least partially due to a flaw in the experimental design.
qvalue estimates incorrectly criticized
Figure 8 in Choe et al. [1] suggests that estimated qvalues substantially underestimate the true qvalues. For example, Figure 8b shows that, when a qvalue cutoff of 0.20 is used, the true qvalue is closer to 0.60. This reflects a serious error somewhere in the analysis. The question is whether the error occurs prior to, or as a result of, the qvalue calculations.
Falsediscovery rates were originally proposed by Soric [2] and Benjamini and Hochberg [3]. The qvalue was developed as the FDR analog of the pvalue [47]. There is sound statistical justification behind both FDR and qvalue methods; that is, there is rigorous mathematical proof for their claimed operating characteristics. For example, conditions have been detailed where the qvalue estimates are guaranteed to be (simultaneously) conservative [5,7]. This means that, if a gene is assigned an estimated qvalue of 0.05, then its true, population average, qvalue is no larger than 0.05.
Note that the qvalue methodology employed by Choe et al. [1] was credited to the SAM method proposed by Tusher et al. [8], even though the qvalue methodology was developed separately from the SAM method in references [47]. The SAM software (different from the SAM method [8]) is based on at least four different articles [4,810], each of which contributes unique methodology. Failing to differentiate between the SAM method of Tusher et al. and the SAM software seems to have caused a good deal of confusion in the literature when evaluating and understanding the operating characteristics of the methods in the software [11].
We now show that the fundamental requirements for employing qvalues (and even pvalues) are not satisfied by Choe et al.'s dataset, leading to spurious conclusions. These requirements are stated in each of the original papers detailing the qvalue methodology [47]. In particular, the qvalue methodology draws on the fact that the pvalues for null genes should be uniformly distributed on the interval (0,1) [12]. As stated by Storey and Tibshirani [5]: "If the null pvalues are not uniformly distributed, then one wants to err in the direction of overestimating pvalues (that is, underestimating significance). Correctly calculated pvalues are an important assumption underlying our methodology." This is not a special requirement for qvalues  it is a requirement for any type of standard significance criterion [13].
Choe et al. [1] identified 10 combinations of preinference steps that seemed to work best. Their qvalue calculations were then done on these "10 best" datasets. We reproduced their analyses of each dataset using standard twosample tstatistics (conclusions did not depend on whether parametric, permutation, or bootstrap null distributions were used). Figure 1 of this correspondence compares the observed quantiles of the null genes' pvalues to the corresponding quantiles from the uniform distribution. If the null pvalues were distributed uniformly, then the observed quantiles should fall along the dashed line of equality. It is clear that the null pvalues are not uniformly distributed, tending to be much smaller than they should be. In other words, when observing the pvalue distribution among the null genes (which are known here), they show a substantial amount of significance beyond what would be expected by chance. It is therefore clear that the problems evident in Figure 8 of Choe et al. are not due to the qvalue methodology, but rather to the fact that the calculated pvalues are not valid.
Figure 1. QQ plots of null pvalues corresponding to null genes. A plot of the observed versus expected quantiles of the null genes' pvalues are shown for each of the 10 best datasets. The observed trends indicate that the null genes' pvalues trend are substantially smaller than they should be.
Problems with the experimental design
The question remains as to the cause of the pvalues being incorrect. One source of the problem is the experimental design itself. Consider Figure 1 in Choe et al. [1]. In column three, there are three aliquots of labeled cRNA clones. Each of these aliquots is divided, and one half is then spikedin with known concentrations of RNA among the genes selected to be differentially expressed. Because random variability is introduced in the spikein step (between columns 3 and 4 in Figure 1 of Choe et al. [1]), even null genes will have some differences in RNA amount. That is, both halves of each aliquot undergo some modification (even the control half), leading to random variation being introduced to the RNA amounts of all genes. This leaves three matched pairs of independent samples, where some variation exists within pairs for all genes.
A major flaw occurs when the three samples from each treatment (control or spikein) are combined into two aliquots in column 4. Now, instead of three independent matched pairs, there is only a single matched pair in column 4. Each half of this single matched pair is hybridized to three chips. Therefore, the random variation introduced at the spikein stage will not be detectable among the six resulting chips. If every chip is treated as an independent observation, then the variation introduced in the previous step among null genes will appear to be true signal. Unfortunately, this is exactly what Choe et al. do, leading to the incorrect pvalues that we observed earlier. This problem cannot be fixed by modifying the statistics.
Consider the following scenario, which suffers from the same problem as Choe et al.'s design but is phrased in more familiar terminology. Suppose we are interested in differences in gene expression for two biological groups under two conditions, A and B. To this end, we obtain expression measurements for a single individual (that is, a single biological replicate) under both conditions. On the basis of the sample from this single individual, we then form three replicated chips for each condition.
By chance, there will be small differences in the expression measurements under both conditions A and B. With a proper estimate of the variability, these differences will be identified as being random and the associated tests called null. The problem is that we cannot use our single individual to estimate the true variability of expression measurements under conditions A and B, taking into account all sources of variation. If we treat the six replicated observations as three independent matched pairs, then our estimated variance is due only to random aspects of the hybridization process. This will significantly underestimate the true variances, and, as a result, we will grossly inflate the significance of differential expression  even among null genes.
We performed a simple simulation of the above scenario; details are given in Additional data file 1. We considered both the case where three technical replicates are formed on a single individual and the case where three independently sampled individuals were used. Figure 2 of this correspondence shows a histogram of the null pvalues under the first scenario where only technical replicates are used, and Figure 3 shows the results using biological replicates. It is clear that the null pvalues are not uniformly distributed under Choe et al.'s design (Figure 2), tending to be much smaller than they should be. Figure 4 compares estimated and actual qvalues, analogous to Figure 8b in Choe et al. [1]. When a single individual is used, we see that qvalues substantially underestimate the truth due to the flawed underlying pvalues. Meanwhile, when three independent individuals are used, the estimated qvalues are conservative as the theory says they should be [47].
Figure 2. Histograms of null pvalues from simulation representing the experimental design of Choe et al. [1]. The null pvalues generated from the simulation as described in the text are shown. The dashed line represents the expected height of the bars assuming the null pvalues are uniformly distributed. The null pvalues are not uniformly distributed when only technical replicates are used.
Figure 3. Histograms of null pvalues from simulation based on independent samples. The null pvalues using three independently sampled individuals as described in the text are shown. The dashed line represents the expected height of the bars assuming the null pvalues are uniformly distributed. The null pvalues are uniformly distributed when biological replicates are used.
Figure 4. A plot of the true versus estimated qvalues from simulations described in the text. The solid gray line shows the results averaged over 30 simulations when using a design similar to that of Choe et al. [1]. The solid black line is the analogous comparison when using three independent individuals. The dashed line represents equality; conservatively estimated q values should fall beneath this line. The Choe et al. [1] design produces anticonservative qvalues estimates due to the incorrect underlying pvalues, while the more statistically sound design produces conservative qvalue estimates. The Monte Carlo variation of the qvalue estimates is small enough that these conclusions are not affected.
A largescale "spikein" control dataset would be invaluable for headtohead comparisons of statistical methods for microarrays. The Choe et al. dataset was intended to serve this purpose. Unfortunately, the data set is flawed in that even the null genes appear to be differentially expressed. As a result, the dataset cannot be relied upon for evaluating statistical inference methods. We note further that, when applying statistical methods such as the qvalue estimates, one must take care to ensure that all necessary assumptions are met.
Sung E Choe, Michael Boutros, Alan M Michelson, George M Church and Marc S Halfon respond:
One of the main purposes of our paper [1] was to challenge the community to improve on existing microarray analysis methods and to promote a better understanding of the experimental conditions for which these methods are appropriate. Dabney and Storey make a valuable contribution to this effort with their clarification as to why our qvalue calculations underestimate the falsediscovery rate by noting that the underlying pvalue distributions for the "null genes"  those with no expected fold change between the S spike (S) and control (C) samples  are nonuniform in each of the 10 best datasets considered in our original manuscript [1]. Dabney and Storey also provide simulation results to suggest that the problem is due to our use of technical replicates. However, we demonstrate here that this aspect of our design is sound, and that their model is consistent neither with our actual experimental design nor with the observed distribution of the null pvalues. We also offer a more likely explanation for the problems that they note.
Dabney and Storey claim that "serious errors are evident in the Choe et al. data" due to a "flaw in the experimental design" concerning technical instead of independent replication, and they provide simulation results to support their view. They fail, however, to provide justification for their simulation parameters, which appear to be greatly exaggerated and inconsistent with our actual experimental design. This is especially true with respect to their value for ρ (the correlation between the S and C concentrations). Whereas Dabney and Storey place ρ at 0.85 (see their Additional data file 1), in reality it should be close to 1.0, as our design ensures that cRNAs from the same pool have virtually identical fold changes between the S and C samples. The "null genes", listed in the original paper as pools 1319, were combined into a single pool before labeling and division into the S and C samples (Figure 5). In other words, all of the null genes were partitioned as a group into either the S or C samples. Common intuition as well as standard experimental practice tell us that the variation in fold change for genes from the same pool will be negligible, but we can also estimate it as follows. Consider the set of RNAs at the detection limit of the assay, which Affymetrix puts at 1.5 pM [14]. Assuming that equal partitioning from the null gene pool to the S and C samples follows a binomial distribution with p = 0.5, the standard deviation in RNA amount is only slightly more than 10^{4 }molecules for the approximately 10^{8 }molecules in the pool. Thus there is no appreciable variation in the amount of any given RNA species (nor in the fold changes between the S and C samples for cRNAs within the same pool). As there was a single pipetting event from the null gene pool to each of the S and C samples, there is some uncertainty as to the exact ratio of allocation to S versus C; however, this degree of uncertainty is low (less than 0.3% according to the pipette manufacturer's specifications). Importantly, this is manifest as a scaling error that affects the pool of null genes as a whole  for example, the true ratio might be 1:0.997 for all null genes, rather than 1:1. Any such differences, if actually detectable above the variation introduced by hybridization itself, were resolved by using our knowledge of the null genes to normalize between arrays whenever applicable. The dominant source of variation in our experiment is indeed, by design, that due to "aspects of the hybridization process". When parameters in keeping with a greatly reduced amount of variation are used in Dabney and Storey's model, the pvalue distribution is uniform (data not shown). The observed nonuniformity therefore cannot be attributed to our use of technical replication.
Figure 5. A detailed description of the Choe et al. [1] experiment. Individual PCR products (a) were pooled together (b) and converted to labeled cRNA (c). Note that all mixing and labeling within each pool was performed at this stage, before splitting the pools into C and S samples. Therefore, relative concentrations of individual cRNA species are identical for all cRNAs in a given pool. (d) The labeled pools were then divided into the C and S samples. Poly(C) RNA (20 μg) was added to the C sample at this step to equalize the amount of nucleic acid present in each hybridization. (e) Each sample contained enough labeled cRNA for three hybridizations. Relative concentrations for each pool are shown in (f). Note that the 1× ("null gene") pools 1319 were combined together at step (b), before labeling at step (c), creating a single 1× pool before labeling and splitting. The '1×' concentration of RNA used for this pool was approximately 6× greater than the 1× concentrations of the other pools to reflect the greater number of individual RNAs (that is, so that the 1× concentrations of all RNAs were approximately equal).
A nonuniform null pvalue distribution does not necessarily invalidate our dataset; it may simply indicate that the analysis methods we applied do not adequately model the hybridization process. In this regard, we note that the models proposed by Dabney and Storey are not truly consistent with the observed null gene pvalues in our data. Their model cannot explain unexpected discrepancies present in the distributions of the onesided pvalues, and neither can it explain the fact that the actual distributions of the null gene tstatistics and pvalues appear to depend upon signal intensity. Figure 6a of this correspondence shows quantile plots for the ttest pvalues associated with the 150 preprocessing combinations we used. The lines highlighted in black correspond to the "10 best datasets" and are consistent with the curves presented in Figure 1 of this correspondence. The black curves in Figures 6b and 6c contain sample quantile curves; the solid lines correspond to the twosided pvalues and the dashed and dotted lines correspond to the pvalues from the onesided tests. Note the discrepancy in the onesided pvalues in Figure 6b. This was observed for eight of the 10 datasets and is not accounted for in the Dabney and Storey model, under which the distributions of these pvalues should be similar. This discrepancy appears to follow from the fact that each of the 10 best Choe et al. datasets were obtained using preprocessing steps that included loess correcting the observed intensity values from a set of "null genes" that included both "empty null" (not present in either sample) and "present null" (present with fold change = 1) probe sets. We have calculated a new set of 10 best datasets in which the correction is based only on the present null probesets (Figure 5, red curves). This 'reloessing' of the data eliminates the discrepancy in the one sided pvalues (Figure 6a,b) and provides proof of principle that preprocessing algorithms can have a substantial effect on the pvalue distribution.
Figure 6. Sample quantile plots for the pvalues of the observed test statistics for the "null genes". The xaxes correspond to the expected quantiles for a uniform distribution and the yaxes correspond to the observed (sample) quantiles. (a) Sample quantile plots for the ttest pvalues associated with the 150 preprocessing combinations described by Choe et al. [1]. Black lines correspond to the 10 best datasets and are consistent with the curves presented in Figure 1 of this correspondence. The red lines correspond to reloessed datasets that were obtained using the same combinations of preprocessing steps as the original 10 sets with the exception that the invariant subsets consisted only of the 'present null' (present with fold change = 1) probe sets (versus both the 'present null' and 'empty null' probe sets used in [1]). The distribution of the pvalues thus depends upon the choice of the invariant subset. (b) Sample quantile curves for dataset 10a. Solid lines correspond to the twosided pvalues and the dashed and dotted lines correspond to the pvalues associated with the onesided tests. Dabney and Storey's model does not account for the discrepancy in the onesided pvalues observed for this dataset, which is not manifest in the reloessed data (red lines). Similar results are seen with datasets 10b, c, d and 9a, b, c, d. (c) As in (b) but showing sample quantile curves for dataset 10e; dataset 9e is similar. The pvalue discrepancies are much less pronounced for these two datasets. Used with permission from [15].
Although reloessing the data improved the null distributions, they are still nonuniform. If the model proposed by Dabney and Storey is neither consistent with the actual experimental design (their parameter values are not realistic) nor consistent with the observed data (the real onesided pvalues are dissimilar), what does account for the nonuniform distribution of the pvalues? We find that the distributions of the null gene tstatistics and pvalues depend upon signal intensity. Figure 7 of this correspondence demonstrates this point using smoothed estimates of the pvalue and tscore quartiles as a function of signal intensity in representative datasets. Figure 7c shows that the reloessed dataset 10a provides for ttests that are better centered than the original dataset, an observation consistent with the results depicted in Figure 6b. While the medians for the tstatistics seem to be properly centered after reloessing the data, the quartiles (and hence the amount of variation) remain greatly inflated, however, and change with intensity.
Figure 7. Smoothed estimates of the quartiles as a function of signal intensity for the pvalues and observed ttest statistics for (a, c) dataset 10a and (b, d) dataset 10e. Distributions of the null pvalues and test statistics vary with intensity, although less so for the reloessed datasets (red lines). Even though the medians for the tstatistics seem to be properly centered after reloessing the data, the quartiles (and hence the variation) are greatly inflated and appear to be intensity dependent. The xaxes show the rankit of the log of the product of the expression means. The yaxes show the observed twosided pvalues (a, b) or the observed tstatistics (c,d). Solid and dashed gray lines indicate the theoretical medians and quartiles, respectively. Black curves correspond to the original datasets and red curves correspond to the reloessed datasets. Used with permission from [15].
We speculate that the preprocessing algorithms are unable to properly adjust for systematic differences in overall signal strength that exist between the control and spikein samples. Figure 8 of this correspondence contains smoothed estimates of the average rank of expression values and squared deviations (with respect to the appropriate group mean) of the three control ("C") replicates for the original (Figure 8a,b) and reloessed (Figure 8c,d) datasets. Comparison of Figures 8a and 8c reveals that reloessing appears to adequately recenter the control expression values relative to the spikein (S) expression values (the average rank over six arrays should be 3.5). The ranks of the squared deviations for the control replicates, however, remain below those of the spikein replicates, suggesting that the control expression values are less variable than the spikein values. This difference again appears to be intensity dependent.
Figure 8. Smoothed estimates of the average rank of expression values and squared deviations (with respect to the appropriate group mean) of the three control replicates for the (a, b) original and (c, d) reloessed datasets. The xaxes correspond to the rankit of the log of the product of the expression means. The yaxes correspond to the observed ranks and were calculated across all six samples. If the control (C) and spikein (S) expression values are interchangeable, the average rank of the control values should be 3.5. (c) Reloessing adequately recenters the control expression values relative to the spikein expression values. (d) Despite reloessing, however, the ranks of the squared deviations for the control replicates remain below those of the spikedin replicates, suggesting that the expression values for the control replicates are less variable than those for the spikedin replicates. This difference appears to be intensity dependent. Used with permission from [15].
The preceding analysis suggests that the observed nonuniformity of the pvalues is not easily explained by a simple model. Although relevant mainly to just one aspect of our study (regarding qvalue estimates), this is an important and complex issue in need of further investigation, and we are grateful to Dabney and Storey for bringing it to our attention. Whether these nonuniform pvalues are manifest in other datasets or are merely a byproduct of the unbalanced signal strengths present in our experiment also remains an open question to be addressed by future studies. However, in most microarray experiments it is impossible to check if the null pvalues are uniformly distributed (as the true set of null genes is not known) and it is therefore impossible to determine whether or not the requirements for accurate estimation of qvalues are met. We therefore caution that the preprocessing issues seen here might be relevant in other microarray experiments. We can easily conceive of biological conditions in which imbalances similar to those in our original study could exist  for example, when comparing different tissue types, in certain developmental time courses, or in cases of immune challenge.
Correspondence should be sent to Marc S Halfon: Department of Biochemistry and Center of Excellence in Bioinformatics and the Life Sciences, State University of New York at Buffalo, Buffalo, NY 14214, USA. Email: mshalfon@buffalo.edu
Additional data files
Additional data file 1 available with this paper provides details of the simulations reported here.
Additional data file 1. Details of the simulations reported here
Format: PDF Size: 71KB Download file
This file can be viewed with: Adobe Acrobat Reader
Acknowledgements
We are indebted to Daniel Gaile and Jeffrey Miecznikowski of the University at Buffalo Department of Biostatistics for the analysis presented in Figures 68 and for their permission to use material from [15] in this response.
References

Choe SE, Boutros M, Michelson AM, Church GM, Halfon MS: Preferred analysis methods for Affymetrix GeneChips revealed by a wholly defined control dataset.
Genome Biol 2005, 6:R16. PubMed Abstract  BioMed Central Full Text  PubMed Central Full Text

Soric B: Statistical discoveries and effectsize estimation.

Benjamini Y, Hochberg Y: Controlling the false discovery rate: A practical and powerful approach to multiple testing.

Storey JD: A direct approach to false discovery rates.
J Roy Stat Soc, Ser B 2002, 64:479498. Publisher Full Text

Storey JD, Tibshirani R: Statistical significance for genomewide studies.
Proc Natl Acad Sci USA 2003, 100:94409445. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Storey JD: The positive false discovery rate: A Bayesian interpretation and the q value.
Annls Stat 2003, 31:20132035. Publisher Full Text

Storey JD, Taylor JE, Siegmund D: Strong control, conservative point estimation, and simultaneous conservative consistency of false discovery rates: A unified approach.
J Roy Stat Soc, Ser B 2004, 66:187205. Publisher Full Text

Tusher V, Tibshirani R, Chu G: Significance analysis of microarrays applied to the ionizing radiation response.
Proc Natl Acad Sci USA 2001, 98:51165124. PubMed Abstract  Publisher Full Text  PubMed Central Full Text

Efron B, Tibshirani R, Storey JD, Tusher V: Empirical Bayes analysis of a microarray experiment.
J Am Stat Ass 2001, 96:11511160. Publisher Full Text

Taylor J, Tibshirani R, Efron B: The miss rate for the analysis of gene expression data.
Biostatistics 2005, 6:111117. PubMed Abstract  Publisher Full Text

Dudoit S, Shaffer JP, Boldrick JC: Multiple hypothesis testing in microarray experiments.
Stat Sci 2003, 18:71103. Publisher Full Text

Lehmann EL: Testing Statistical Hypotheses. Berlin: Springer; 1997.

Rice JA: Mathematical Statistics and Data Analysis. 2nd edition. Belmont California: Duxbury Press; 1995.

Affymetrix: GeneChip Expression Analysis: Technical Manual. Santa Clara, CA: Affymetrix; 2004.

Gaile DP, Miecznikowski JC, Choe SE, Halfon MS: Putative null distributions corresponding to tests of differential expression in the Golden Spike dataset are intensity dependent. Technical report 0601. [http:/ / sphhp.buffalo.edu/ biostat/ research/ techreports/ UB_Biostatistics_TR0601.pdf] webcite
Buffalo, NY: Department of Biostatistics, State University of NewYork; 2006.