Modeling pipeline. Genes longer than 4,100 bp were extended and divided into 81 bins. The chromatin feature density in each bin is logarithm-transformed and then used to determine the best bin (the bin that has the strongest correlation with the expression values). To avoid log2(0), a pseudocount is added to each bin, which is then optimized using one-third of genes in each dataset (D1) and then applied to the other two-thirds of genes in the datasets (D2) for the rest of the analysis. D2 was divided into training set (TR) and testing set (TS) in a ten-fold cross-validation manner. A two-step model was built using the training set. First, a classification model C(X) was learned to discriminate the 'on' and 'off' genes, followed by a regression model R(X) for predicting the expression levels of the 'on' genes. Finally, the correlation between the predicted expression values for testing set, C(TS_X)*R(TS_X), and the measured expression values of testing set (TS_Y) was used to measure the overall performance of the model. TSS, transcription start site; TTS, transcription termination site; RMSE, root-mean-square error.
Dong et al. Genome Biology 2012 13:R53 doi:10.1186/gb-2012-13-9-r53