Quantitative relationship between chromatin features and expression. (a) Scatter plot of expression values predicted by the two-step prediction model (random forests classification model and linear regression model) versus the expression of PolyA+ cytosolic RNA from K562 cells measured by CAGE. Each blue dot represents one gene. The red dashed line indicates the linear fit between measured and predicted expression values, which are highly correlated (Pearson's correlation coefficient (PCC) r = 0.9, P-value < 2.2 × 10⁻¹⁶), indicating a quantitative relationship between chromatin features and expression levels. The accuracy of the overall model is indicated by the RMSE (root-mean-square error), which is 1.9. The accuracy of the classification model is indicated by the AUC (area under the ROC curve), which is 0.95. The accuracy of the regression model is r = 0.77 (RMSE = 2.3). (b) The relative importance of chromatin features in the two-step model. The most important features for the classifier (upper panel) include H3K9ac, H3K4me3, and DNase I hypersensitivity, while the most important features for the regressor (bottom panel) include H3K79me2, H3K36me3, and DNase I hypersensitivity. (c) Summary of overall prediction accuracy for 78 expression experiments on whole-cell, cytosolic or nuclear RNA from seven cell lines. The bars are sorted by correlation coefficient in decreasing order within each high-throughput technique (CAGE, RNA-PET and RNA-Seq). Each bar is composed of several colors, corresponding to the relative contribution of each feature in the regression model. The red dashed line represents the median PCC, r = 0.83. Code for cell lines: K, K562; G, GM12878; 1, H1-hESC; H, HepG2; E, HeLa-S3; N, NHEK; U, HUVEC. Code for RNA extraction: +, PolyA+; -, PolyA-. Code for cell compartment: W, whole cell; C, cytosol; N, nucleus.
Dong et al. Genome Biology 2012 13:R53 doi:10.1186/gb-2012-13-9-r53
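
The three accuracy measures quoted in the caption (PCC, RMSE and AUC) can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `pearson_r`, `rmse` and `roc_auc` are hypothetical, and the toy numbers in the usage note are invented purely to show the calculations, not taken from the study.

```python
import math


def pearson_r(x, y):
    """Pearson's correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def rmse(measured, predicted):
    """Root-mean-square error between measured and predicted values."""
    squared = [(m - p) ** 2 for m, p in zip(measured, predicted)]
    return math.sqrt(sum(squared) / len(squared))


def roc_auc(labels, scores):
    """Area under the ROC curve, computed as the fraction of
    (positive, negative) pairs in which the positive example receives
    the higher score; ties count as half a win."""
    pos = [s for lab, s in zip(labels, scores) if lab == 1]
    neg = [s for lab, s in zip(labels, scores) if lab == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

As a usage example on invented data, `roc_auc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])` gives 0.75, because three of the four positive-negative pairs are ranked correctly. The pairwise AUC formulation is quadratic in the number of genes, so it is suited to illustration rather than genome-scale evaluation.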