Modeling gene expression using chromatin features in various cellular contexts

Xianjun Dong1, Melissa C Greven1, Anshul Kundaje2, Sarah Djebali3, James B Brown4, Chao Cheng5, Thomas R Gingeras6, Mark Gerstein5, Roderic Guigó3, Ewan Birney7 and Zhiping Weng1*

Author Affiliations

1 Program in Bioinformatics and Integrative Biology, Department of Biochemistry and Molecular Pharmacology, University of Massachusetts Medical School, Worcester, MA 01605, USA

2 Department of Computer Science, Stanford University, 318 Campus Drive, Stanford, CA 94304, USA

3 Centre for Genomic Regulation (CRG) and UPF, Dr. Aiguader, 88, 08003 Barcelona, Spain

4 Department of Statistics, University of California, Berkeley, 367 Evans Hall, University of California, Berkeley, Berkeley, CA 94720, USA

5 Computational Biology and Bioinformatics Program, Yale University, 266 Whitney Ave, New Haven, CT 06511, USA

6 Cold Spring Harbor Laboratory, Genome Center, Woodbury, New York 11797, USA

7 Vertebrate Genomics Group, European Bioinformatics Institute (EMBL-EBI), Wellcome Trust Genome Campus, Hinxton, Cambridgeshire, CB10 1SA, UK

Genome Biology 2012, 13:R53  doi:10.1186/gb-2012-13-9-r53

Published: 5 September 2012

Additional files

Additional file 1:

Supplementary tables. Table S1: bestbin and pseudocount results for each mark. Table S2: results of all predictions, including the correlation coefficient, P-value for the correlation, the individual correlation, and relative importance of each chromatin feature. Table S3: list of experiments used in the analysis.

Format: XLS. Size: 4.5 MB.


Additional file 2:

Supplementary figures.

Figure S1: model diagnosis. (A) ROC curve for the random forest classifier in predicting the 'on' and 'off' expression status for the CAGE PolyA+ cytosolic RNA from K562 cells. The AUC (area under the curve) is 0.95 and the error rate is 9.56%. (B) Residual plot for the fitted values. The red line is the mean of the residuals, which should be centered around 0 for a model without systematic bias. The sharp border at the bottom of the scatter plot is due to the limited resolution of measured expression (for example, not enough data points between 0 and the first non-zero value). (C) Q-Q plot of standardized residuals, which shows that the standardized residuals are normally distributed. (D) Scatter plot of predicted expression and measured expression using the 'rankit' transformation (which samples from an equivalent normal distribution that respects the rank order of the expression data; see Materials and methods). PCC r = 0.86 for overall prediction (P-value < 2.2 × 10^-16), AUC for classification is 0.94, and PCC r for regression is 0.72.

Figure S2: comparison of the performance of three regression models.

Figure S3: model stability. Each bar is a set of randomly sampled genes (10%, 20%, ... 100% of all genes). The blue line represents the PCC r for each set. The black line with filled circles is the percentage of high-CpG promoter (HCP) genes and the black line with open circles is the percentage of low-CpG promoter (LCP) genes in each set. The model performance is stable regardless of sample size.

Figure S4: comparison of performance between HCP and LCP genes. (A,B) The performance of different chromatin feature categories for predicting HCP genes versus LCP genes (A) and highly expressed versus lowly expressed genes (B). Results are shown for the top X% of genes (X = 10, 20, 30, ... 100) in decreasing order of expression for CAGE PolyA+ cytosolic RNA from K562 cells.

Figure S5: heatmap of correlation between replicates of expression experiments. Of the 98 experiments, 55 have two biological replicates (replicates 1 and 2). The heatmap indicates that two replicates from the same technique, RNA type, cell line, and compartment are generally highly correlated. Code for RNA type: t, total RNA; +, PolyA+; -, PolyA-. Code for cell lines: K, K562; G, GM12878; 1, H1-hESC; H, HepG2; E, HeLa-S3; N, NHEK; U, HUVEC. Code for cell compartment: W, whole cell; C, cytosol; N, nucleus; h, chromatin; u, nucleolus; l, nucleoplasm.

Figure S6: heatmap of correlation between CAGE and RNA-Seq experiments for single-transcript genes. Each row (or column) depicts a PolyA+ RNA expression experiment from one of the cellular compartments (cytosol, nucleus, and whole cell) and one of seven cell lines (H1-hESC, HeLa-S3, GM12878, HepG2, K562, NHEK, and HUVEC) from CAGE or RNA-Seq. CAGE and RNA-Seq expression from the same cell lines are well correlated (black-framed boxes), although the correlation is weaker than that between experiments using the same quantification method (the red blocks along the diagonal). There are a total of 31,484 genes with single transcripts.

Figure S7: model performance using DNase I hypersensitivity only and promoter marks only. Each bar is the correlation coefficient of predicting expression using only either DNase I hypersensitivity or promoter marks (that is, H3K4me2, H3K4me3, H2A.Z, H3K9ac, and H3K27ac). Promoter marks are more predictive than DNase I hypersensitivity (paired Wilcoxon test P-value = 4 × 10^-15).

Figure S8: heatmap of correlations between PolyA+ RNA-Seq and PolyA- RNA-Seq.

Figure S9: stability of the 'bestbin' selection. Each panel is a histogram of the 'bestbin' index for a chromatin mark. Since the 'bestbin' is calculated based on a randomly selected one-third of the total dataset (D1 in Figure 1) for each experiment, the most stable 'bestbin' will be shown as a sharp peak on the histogram.

Figure S10: improvement by pseudocount optimization. The correlation coefficient of histone modification (H3K79me2) density with expression level is calculated at each bin, using either a fixed pseudocount of 0.001 or an optimized pseudocount (see Materials and methods). The pseudocount optimization (black line) consistently performs better than the fixed pseudocount (gray line). The blue line indicates the average H3K79me2 level.

Figure S11: prediction using single-transcript genes. PCCs (r) of all 78 RNA expression experiments using only the single-transcript gene subset. Comparing Figures S11 and 2c, we can see that there is no significant change in model performance or in the most important variables when including genes with multiple transcripts.
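The comparison described for Figure S10 (a fixed pseudocount of 0.001 versus an optimized one) can be illustrated with a minimal sketch. The details of the actual procedure are in Materials and methods and are not reproduced here, so the following rests on assumptions: the signal is log2-transformed after adding the pseudocount, the optimum is the candidate that maximizes the Pearson correlation with log expression in a single bin, and the candidate grid and toy data are invented for illustration. The function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def log_signal(signal, pseudocount):
    """Log2-transform a signal after adding a pseudocount (assumed transform)."""
    return [math.log2(s + pseudocount) for s in signal]


def best_pseudocount(signal, log_expression, candidates):
    """Return (pseudocount, r) with the highest correlation over the candidates."""
    scored = [(pearson(log_signal(signal, c), log_expression), c)
              for c in candidates]
    r, c = max(scored)
    return c, r


# Toy values for one bin: invented for illustration, not taken from the paper.
signal = [0.0, 0.0, 0.2, 0.5, 1.0, 2.0, 4.0, 8.0]
log_expression = [-3.0, -2.5, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
candidates = [0.001, 0.01, 0.1, 1.0]  # hypothetical grid, includes the fixed value

fixed_r = pearson(log_signal(signal, 0.001), log_expression)
opt_c, opt_r = best_pseudocount(signal, log_expression, candidates)
print(f"fixed 0.001: r = {fixed_r:.3f}; optimized {opt_c}: r = {opt_r:.3f}")
```

Because the fixed value 0.001 is included in the candidate grid, the optimized correlation in this sketch can never be lower than the fixed one. In practice the pseudocount would be chosen on a held-out subset of genes rather than on the genes used for evaluation.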

Format: PDF. Size: 3.5 MB.
