Figure 1.

Comparison of mRNA expression and protein abundance. (a) A plot comparing our mRNA reference expression set [29] with our newly compiled protein abundance dataset. The mRNA axis is in copies per cell; the protein axis is in thousands of copies per cell. The protein dataset is the result of iteratively fitting two MudPit datasets (MudPit-1 [32] and MudPit-2 [31]) and two two-dimensional electrophoresis datasets (2DE-1 [7] and 2DE-2 [28]). Given the semi-quantitative nature of the MudPit data [31], we transformed the data into a more quantitative set by fitting each set individually onto our reference mRNA expression dataset. In addition, we fit the MudPit-1 dataset onto the more finely grained MudPit-2 dataset. Each of the datasets was then moved back into 'protein space' using an inverse transformation derived from the 2DE-1 set, as this set has the most precise values. These datasets were then combined into the new reference abundance dataset. Where more than one dataset reported a value for a given ORF, we used the value from the first dataset in the following order: 2DE-1, 2DE-2, MudPit-2, MudPit-1. The resulting reference protein abundance dataset (N = 2044) had a correlation of 0.66 with the mRNA reference dataset. (b,c) Additionally, when looking at specific subsets of ORFs (grouped by subcellular localization [52] or by functional group [34,35]), we find both higher and lower correlations among these groups. The lower correlations generally reflect a more heterogeneous category. This analysis indicates that while correlations may be weak in the global data, they tend to be higher within smaller, well-defined subsets of ORFs. Further analysis is available at [33].
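
The ordering rule for overlapping ORFs described in the caption can be illustrated with a minimal sketch. It assumes each dataset has already been fitted and moved back into 'protein space', so that values are on a common scale. The function name `merge_by_priority`, the dictionary layout, and all abundance values are hypothetical and chosen only for illustration; this is not the authors' code.

```python
# Priority order as stated in the caption (highest priority first).
PRIORITY = ["2DE-1", "2DE-2", "MudPit-2", "MudPit-1"]


def merge_by_priority(datasets, priority=PRIORITY):
    """Combine per-dataset {ORF: abundance} maps into one reference map.

    For each ORF, the value comes from the first dataset in `priority`
    that reports it. Returns {ORF: (abundance, source_dataset)}.
    """
    merged = {}
    for name in priority:
        for orf, value in datasets.get(name, {}).items():
            # setdefault leaves an existing (higher-priority) entry untouched.
            merged.setdefault(orf, (value, name))
    return merged


# Hypothetical toy values (thousands of copies per cell), for illustration only.
toy = {
    "2DE-1": {"YAL003W": 120.0},
    "2DE-2": {"YAL003W": 95.0, "YBR118W": 300.0},
    "MudPit-2": {"YBR118W": 280.0, "YCR012W": 150.0},
    "MudPit-1": {"YCR012W": 140.0, "YDR050C": 60.0},
}

reference = merge_by_priority(toy)
for orf, (value, source) in sorted(reference.items()):
    print(orf, value, source)
```

Keeping the source dataset alongside each value makes it easy to check afterwards how many ORFs in the combined set came from each experiment.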

Greenbaum et al. Genome Biology 2003 4:117 doi:10.1186/gb-2003-4-9-117