Data integration outperforms individual data sources at quantifying functional links between human genes. The x-axis represents linkage sensitivity, defined as the fraction of the gold-standard positive (GSP) gene pairs that are linked at different linkage weight cutoffs (see Materials and methods). The y-axis represents linkage precision, defined as the fraction of the linked gold-standard gene pairs that belong to the GSP set (see Materials and methods). GSPs are defined as gene pairs sharing the same biological process term in Gene Ontology (GO). Gold-standard negatives (GSNs) are defined as gene pairs in which both genes are annotated with GO biological process terms but share no term. To generate the random control curve, we randomize the class labels in the gold-standard datasets and then perform the same evaluation. In Figure S1 of Additional data file 5, we provide the same plot with the x-axis in log scale to show details for individual data sources. CC, cellular component; Co-exp, co-expressed; DDI, domain-domain interaction; DS, protein domain sharing; GN, gene neighbor; Masspect, mass spectrometry; MF, molecular function; PG, phylogenetic profiles; PPI, protein-protein interaction; TexM, text mining; Y2H, yeast two-hybrid experiments. The descriptions of the 16 individual data sources are listed in Table 1.
Linghu et al. Genome Biology 2009 10:R91 doi:10.1186/gb-2009-10-9-r91
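The sensitivity and precision definitions in the caption above, together with the label-shuffling random control, can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the toy weights and labels, and the rule that a pair counts as linked when its weight is greater than or equal to the cutoff are all assumptions made for illustration.

```python
import random
from typing import List, Sequence, Tuple


def precision_sensitivity_curve(
    weights: Sequence[float],
    is_gsp: Sequence[bool],
    cutoffs: Sequence[float],
) -> List[Tuple[float, float, float]]:
    """Return (cutoff, sensitivity, precision) for each linkage weight cutoff.

    weights[i] is the linkage weight of gold-standard pair i;
    is_gsp[i] is True for a GSP pair and False for a GSN pair.
    Assumption: a pair is "linked" when weight >= cutoff.
    """
    total_gsp = sum(is_gsp)
    curve = []
    for cutoff in cutoffs:
        # Labels of the gold-standard pairs that are linked at this cutoff.
        linked = [label for w, label in zip(weights, is_gsp) if w >= cutoff]
        linked_gsp = sum(linked)
        # Sensitivity: fraction of all GSP pairs that are linked.
        sensitivity = linked_gsp / total_gsp if total_gsp else 0.0
        # Precision: fraction of linked gold-standard pairs that are GSPs.
        precision = linked_gsp / len(linked) if linked else float("nan")
        curve.append((cutoff, sensitivity, precision))
    return curve


def random_control_curve(
    weights: Sequence[float],
    is_gsp: Sequence[bool],
    cutoffs: Sequence[float],
    seed: int = 0,
) -> List[Tuple[float, float, float]]:
    """Shuffle the class labels, then run the same evaluation."""
    shuffled = list(is_gsp)
    random.Random(seed).shuffle(shuffled)
    return precision_sensitivity_curve(weights, shuffled, cutoffs)


# Hypothetical toy data: six gold-standard pairs (three GSP, three GSN).
weights = [0.9, 0.8, 0.7, 0.4, 0.3, 0.1]
is_gsp = [True, True, False, True, False, False]
cutoffs = [0.75, 0.35, 0.0]

for cutoff, sens, prec in precision_sensitivity_curve(weights, is_gsp, cutoffs):
    print(f"cutoff={cutoff:.2f} sensitivity={sens:.2f} precision={prec:.2f}")
```

Lowering the cutoff links more pairs, so sensitivity rises while precision typically falls. With shuffled labels, precision is expected to hover around the overall GSP fraction of the gold standard, which is what makes it a useful baseline.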