It has recently been shown that the detection of gene fusion events across genomes can be used for predicting functional associations of proteins, including physical interaction or complex formation. To obtain such predictions we have made an exhaustive search for gene fusion events within 24 available completely sequenced genomes.
Each genome was used as a query against the remaining 23 complete genomes to detect gene fusion events. Using an improved, fully automatic protocol, a total of 7,224 single-domain proteins that are components of gene fusions in other genomes were detected, many of which were identified for the first time. The total number of predicted pairwise functional associations is 39,730 for all genomes. Component pairs were identified by virtue of their similarity to 2,365 multidomain composite proteins. We also show for the first time that gene fusion is a complex evolutionary process with a number of contributory factors, including paralogy, genome size and phylogenetic distance. On average, 9% of genes in a given genome appear to code for single-domain, component proteins predicted to be functionally associated. These proteins are detected by an additional 4% of genes that code for fused, composite proteins.
These results provide an exhaustive set of functionally associated genes and also delineate the power of fusion analysis for the prediction of protein interactions.
Recent progress in genome analysis has shown that it is possible to predict protein interactions or, more generally, functional associations of proteins using genome sequences alone [1,2,3]. These powerful methods rely on the observation that pairs of genes encoding proteins of known function (usually interacting or forming a complex) tend to be found in other species as a fused gene encoding a single multifunctional protein . This type of event is known as gene fusion and is a well-known process in molecular evolution . Many of these gene fusion events appear to be selectively advantageous by decreasing the regulational load in the cell for a particular process [1,3,5]. Therefore, the detection of gene fusions in one genome (defined as 'composite' proteins) allows the prediction of functional associations between homologous genes that remain separate in another genome (defined as 'component' proteins).
Although gene fusion events appear to be relatively rare, the accurate detection of a gene fusion event in one genome allows interactions to be predicted between many proteins in other genomes. It is this kind of one-to-many relationship that makes this method unique for discovering possible interactions or functional associations between proteins, even for those of unknown function. Unlike previous methods that rely on gene proximity to predict functional coupling , this robust method can also detect distal genes within a genome that may be involved in the same process. Furthermore, we have previously demonstrated  the high precision of our algorithm, which with an additional constraint of minimum alignment overlap has now increased to over 86% (see Materials and methods). This family of sequence-based methods is analogous with and complementary to the experimental approaches for the detection of protein interaction .
In order to predict functional associations of proteins through the dynamics of gene fusion events, we have applied our algorithm to 24 entire genome sequences that were available from a variety of species (Table 1). We define the genome where we seek component proteins as the 'query' genome and all genomes from which we obtain composite proteins as 'reference' genomes. A 'fusion event' is therefore defined as any pair of component proteins that are detected as a fused, composite protein in a reference genome. For simplicity, we do not attempt to attach directionality to fusion events. In other words, some of these fusion cases (for example, fused in bacteria but split in metazoa) may represent gene 'fission' events.
Table 1. Genomes used in the present analysis
Our algorithm was applied individually for each of the 24 genomes, against the remaining 23 genomes which are used as references (see also Materials and methods). Paralogy in the query genome makes it difficult to determine precisely the actual number of possible associations. As we have previously pointed out, paralogy in the query genome increases uncertainty, while paralogy in the reference genome increases the fidelity of the predictions . It is for this reason that detected component and composite proteins from all genomes are subsequently clustered according to sequence similarity . Each cluster should therefore indicate a distinct family of component or composite proteins. The analysis of the distribution of these gene fusion classes among genomes allows us to investigate the dynamics and distribution of this evolutionary process and to assess the extent of the predictive power of the approach.
The detection of gene fusion events yielded 132,812 component and 66,406 composite proteins in an all-against-all genome comparison, but these values represent multiple occurrences of the same proteins across species. Of these, there are 7,224 component and 2,365 composite unique proteins across the 24 species (a 18- and 28-fold reduction respectively). The multiple detection of these cases within or across genomes signifies that the majority of components and composites are observed more than once and therefore represent genuine cases (as opposed to sequencing artifacts, which are usually isolated cases).
The high precision of the method allows the prediction of 39,730 unique pairwise functional associations of the components with reference to the composite protein set. Eighty-six percent of the 66,406 predicted associations obtained from the total number of composite proteins yield a Z-score value of less than 3 (Figure 1), previously shown to result in virtually no false-positive cases . This increased precision is due to the introduction of an additional constraint that does not permit any overlap between the component proteins (see Materials and methods). All the above results are available on the worldwide web (see Materials and methods). Some of these associations are known, but we estimate that more than half of them are newly detected cases, testable by using techniques from functional genomics .
Figure 1. Z-scores for component proteins. The graph illustrates the Z-score (blue bars) distribution and its cumulative sum (step function, with red rectangles) between components, for all detected fusion events (66,406 in total). The Z-score is a statistical measure of similarity for each pair of components. Components that have a Z-score similarity of less than 10, and both exhibit similarity to the same composite protein are detected as fusion events. In general, fusion events where the Z-score between components is less than 3 (marked by a vertical line) result in fewer false-positive fusion detections.
Currently, the only species for which predictions can be extensively validated is the yeast Saccharomyces cerevisiae, given the ongoing work on transcript profiling  and two-hybrid technology . For yeast, there are 440 distinct component cases (predicted by all other genomes as reference, excluding some highly paralogous Drosophila melanogaster homologs) involved in 706 predicted interactions, most of which are detected by their homology to composite proteins from Caenorhabditis elegans and D. melanogaster. Two examples of predicted protein pairs that are known to interact are CPA1 (YOR303w) with CPA2 (YJR109c)  and MET3 (YJR010w) with MET14 (YKL001c) , both derived from C. elegans homologs.
We have attempted to test the validity of our predictions by comparing the set of components to a list of potentially interacting gene products, using results from a large-scale two-hybrid experiment . However, there is only one case shared between the 1,004 proteins involved in 957 putative interactions detected by the two-hybrid system and the complete set of 706 pairs in this analysis: YIL033C (SRA1) and YKL166C (TPK3) matching the C. elegans protein C09G4.2 and D. melanogaster protein CT10911. This very low count of common pairs may be expected by the sampling biases of the two rather independent methodologies, given that each approach can only detect a very small subset of the total number of actual interacting pairs in yeast. Interestingly, based on a simple conditional probability calculation, an estimate for the total number of detectable interactions in the yeast cell may be of the order of 675,000.
Another validation procedure for the S. cerevisiae predictions was obtained by comparing all 706 component pairs against their expression profiles from publicly available gene expression data. We have found that at least 20% of our predictions exhibit very strong correlations across gene expression experiments. For each of the pairs, a profile from 87 experiments involving cell cycle , sporulation  and diauxic shift  was used to determine whether expression data corroborated our predictions for the association of the component proteins (see Materials and methods). The detected pairs of components from fusion analysis clearly exhibit similar patterns of expression for the above mentioned experiments (Figure 2, inset). Despite the noise levels for the gene expression data as a result of the limited number of experimental conditions available, with some random pairs exhibiting significant correlations, there are twice as many predicted associations than random, above the threshold of average correlation value of 0.5. With a higher threshold of 0.55, precision is increased, with four times as many predicted associations over the random background. Above this value, 92 predicted functional associations (20% of 536 available pairs, see Materials and methods) exhibit high correlation across all experiments (Figure 2). Below that threshold, it is very difficult to estimate the precision rate of our predictions, because of the high amount of noise and the rather limited number of publicly available gene expression data sets. This comparison between fusion detection and transcript profiling contrasts with previous approaches , where expression data was used as a filtering step for the detection of functional associations, and not as a validation criterion.
Figure 2. Correlation of gene expression between component pairs. The graph illustrates the distributions of average correlation values of gene expression between component pairs (blue bars) and randomly selected pairs (gray bars), above a threshold value of 0.5. Inset: Distributions of average correlation values for both predicted and random associations (vertical line indicates the cut-off value of 0.5).
We have analyzed the S. cerevisiae predictions and detected many interesting cases, which appear to be hitherto undetected functional associations between yeast proteins. Two of these are discussed in some detail here. First, MXR1 (peptide methionine sulfoxide reductase, involved in anti-oxidative processes)  and YCL033C (function unknown) are predicted to be functionally associated by virtue of gene fusion in three other species - Helicobacter pylori (both strains), Haemophilus influenzae and Treponema pallidum. This observation is supported by experimental results . MXR1 is 39% identical to the amino terminus of the H. pylori composite proteins and YCL033C is 38% identical to the carboxyl terminus of these proteins. It appears that YCL033C is a selenoprotein, also homologous to the human SelX protein, which may be involved in scavenging reactive oxygen species . These two proteins may be associated to protect the yeast cell from oxidative damage.
Second, another interesting observation involves yeast proteins MSS4 (phosphatidylinositol 4-phosphate kinase), which is involved in a signaling pathway responsible for the cell-cycle-dependent organization of actin cytoskeleton , and CCT3 (cytoplasmic chaperonin subunit gamma) which is involved in microtubule and actin assembly . A central domain of CCT3 is 25% identical to a large domain of C. elegans protein VF11C1L.1 and the carboxy-terminal domain of MSS4 is 29% identical to its carboxyl terminus. Thus, these two proteins are predicted to cooperate in cell-cycle-dependent cytoskeleton organization and assembly.
The distribution of components and composites differs dramatically between species. There are 7,224 component cases, with an average of 350 cases per genome, exhibiting significant variation (Figure 3a, blue bars). The query genome sequences detected 2,365 composite cases, with an average of 115 cases per genome (Figure 3b, blue bars). Interestingly, we have observed some relatively small genomes containing composite proteins, which may yield predictions for components of higher organisms. For instance, there are 71 proteins (forming 30 families) in the C. elegans genome that match a fused protein gene in Mycobacterium tuberculosis. Two such examples are the component pair T06C10.1/C49H3.7 matching composite Rv0957 and the component pair W04C9.1/Y65B4B_12.b matching both composites Rv1272c and Rv1273c. Another clear prediction is the S. cereuisiae component pair YER052c/YJRl39c (encoding HOM3/HOM6 respectively) matching composite MetL (3847.PRO) from Escherichia coli and other species.
Figure 3. Numbers of component and composite proteins. Absolute number of (a) component and (b) composite proteins as individual cases (blue bars) and protein families (green bars), by species. Species name abbreviations as in Table 1. Data for C. elegans and D. melanogaster are clipped (1,973 and 1,981 components, 567 and 559 composites, respectively).
This has been a key observation that dictated the all-against-all genome comparison in this analysis. In other words, when species A is used as a query against species B, the resulting set of component and composite proteins is different from that with the reverse comparison, when species B is used as a query against species A. The three principal factors in gene fusion during evolution appear to be paralogy, genome size and phylogenetic distance. For instance, larger genomes have more composite, possibly paralogous, proteins. At the same time, closely related species evidently show similar patterns of gene fusion. The results below address each of these factors in turn and examine their relative contribution to the gene fusion process and their effects on the prediction of functional association of proteins.
For every genome, both sets of component and composite proteins were subsequently clustered , to detect the degree of paralogy for these proteins (Figure 3, green bars; see also sequence clustering in Materials and methods). There are 2,534 component families, with an average of 105 families per genome (Figure 3a, green bars) and 1,323 composite families, with an average of 55 families per genome (Figure 3b, green bars). Comparing these numbers with the number of unique cases, it is evident that there is a paralogy level of two- to threefold per genome for the composite and component proteins, respectively. As mentioned above, this effect contributes to the confidence of the predictions, depending on whether paralogy is observed in the query or the reference genome.
Another characteristic of this process is the redundancy of both sets of component and composite cases: the number of instances of these may be high but they are widely present across species, falling into well-defined protein families. When all components and composites are clustered as a single set (as opposed to within species, above), sequence clustering results in 1,287 single component families and 621 single composite families (as represented in the current analysis for the 24 species). Comparing these numbers with the number of families per species, it is apparent that there is a further twofold reduction for both sets. This result indicates that gene fusion is widespread in evolution but forms a finite set. Different species may contain a common core of composite families, but also provide new families that are used to predict functional association. For instance, D. melanogaster provides far more composite families (more than 200) compared to C. elegans (fewer than 100) (Figure 3b, green bars). Genomes with unique composite families, such as D. melanogaster, contribute strongly to the majority of predicted interactions. It may also be that only certain classes of proteins are involved in gene fusion and that there is an upper limit for the predictive power of this approach obtainable from (currently available) 621 families.
Evidently, the number of component and composite proteins detected in each species is also dependent on genome size (Figure 4). When the above numbers for unique cases and families of components (Figure 4a) and composites (Figure 4b) are normalized by the number of open reading frames (ORFs) for the species examined, the patterns of distribution are significantly altered. For instance, Aquifex aeolicus and Thermotoga maritima appear to have a large number of components involved in gene fusion (more than 12% of their genes are involved in this process) (Figure 4a), whereas the absolute numbers are low (Figure 3a). This is also the case for composites, where, for example, S. cerevisiae yields as many cases as D. melanogaster in relative terms (4% of the genome) (Figure 4b), while the absolute counts are dramatically different (Figure 3b).
Figure 4. Numbers of component and composite proteins relative to genome size. Relative numbers of (a) component and (b) composites per species, as individual cases (blue bars) and protein families (green bars), normalized by total genome size (number of ORFs). Species name abbreviations as in Table 1. Average values per genome are 9% for components and 4% for composites.
Finally, when the factors of paralogy and genome size are removed by sequence clustering and normalization, respectively, the effect of phylogenetic distance between species can be detected. A distance measure based on shared composite families has been devised (see Materials and methods) and was used to identify relationships between the 24 species examined. The fact that the tree based on this distance measure (Figure 5) does not significantly contradict other trees based on sequence alignments is a strong indication that our hypotheses about the factors involved in gene fusion are valid. This result also indicates that certain types of fusion events appear to be confined to specific phylogenetic groups, such as the Archaea, various bacterial clades and the Eukarya (Figure 5).
Figure 5. Neighbor-joining dendrogram representing the phylogenetic proximity of each of the 24 species in terms of detected gene fusion events. The distance measure is derived from the count of composite families (see Materials and methods). Scale bar is set to indicate a distance of 10 (ranging from 0 to 100). Species name abbreviations as in Table 1. Only bootstrap values less than 100 are shown.
The exhaustive detection of gene fusion events in entire genome sequences allows the prediction of functionally associated components based merely on genome structure. The all-against-all species comparison is a necessary step because we have repeatedly observed fused, composite proteins in taxonomically lower organisms. The landscape of gene fusions appears to be a complex one, affected by paralogy, genome size and phylogenetic distance.
Although gene fusion is widely present across various phylogenetic groups, it is a process that may involve only certain types of proteins. Yet, this approach for the prediction of functional associations of proteins results in robust predictions for physical interactions, pathway involvement, complex formation and other types of functional associations of protein molecules.
With the present analysis, we delineate the available universe of fusion events and detect a set of 621 composite protein families from which predictions may be obtained. This approach results in 39,730 pairs of functionally associated proteins across 24 species, with high precision and coverage. This novel set of predictions is made available to the scientific community for the first time, and we believe that many of these cases can be subsequently verified by experimental methods.
Materials and methods
All 24 genomes were filtered using the CAST compositional bias filtering algorithm , then compared against themselves and each of the other 23 genomes using the BLASTp  sequence similarity searching algorithm with a cut-off E-value of 1 × 10-10. The DifFuse algorithm  was then applied automatically to each genome in turn as a query against the other 23 (reference) genomes. Using other protein databases as reference yields fewer composite cases (for example, the well-known case of the TopA/TopB pair appears multiple times in this analysis), showcasing the extreme bias of annotated databases, such as SwissProt (data not shown). Performing the same computation using the non-redundant sequence database (nrdb) is prohibitively expensive in terms of computation time for an analysis of this size. The detected gene fusion results for each of the 552 comparisons were further automatically filtered for significant overlap of the BLAST alignments of the component proteins. In this case, component proteins that overlap by more than 10% of their total length when aligned together with the composite protein. This step avoids the detection of 'promiscuous domains'  and gene prediction errors, which result in false-positive fusion detection cases. The detected component and composite proteins are far fewer in number than for the two previous reports of E. coli  and S. cerevisiae , due to the much stricter criteria employed in the present analysis and the multi-step protocol we have developed. This analysis was fully automatic and carried out in parallel over a period of four weeks on 11 SUN UltraSPARC CPUs running Solaris 7.
Expression profile analysis
Gene expression ratios for all experiments were transformed into log-odds values so that induction and repression measurements are directly comparable (positive and negative values, respectively). The log-odd values were then normalized across all timepoints for each experiment, using Z-score values for each timepoint. The Z-score values for all time points of each experiment thus allow cross-comparison of gene expression across separate experiments .
Our predicted functional associations for S. cerevisiae with available expression data represent 536 component pairs in total. For each pair of proteins, a Pearson correlation coefficient was calculated between two corresponding experiments and averaged over all experiments. To estimate noise in these data, a control set of 536 randomly selected S. cerevisiae proteins was taken and treated as above (Figure 2).
The distribution of averaged Pearson correlation coefficients for the predicted functional associations was compared against the distribution of coefficients for the control set using a t-test for mean values (where the null hypothesis is that the two means are equal). The test results in a t-value of 3.6 (critical t-value is 1.64), which is highly significant (P-value is 0.000173), indicating that there is a higher average correlation of expression profiles for the predicted functional associations against the background.
All proteins involved in gene fusion events as either component or composite proteins were identified automatically from the results of the fusion analysis. From these data we can obtain raw counts of the number of gene fusion events detected and the number of proteins involved in these events as either composite or component proteins. These figures are skewed however, due to the presence of homology in both the query and reference sets. Proteins involved in gene fusion events as either component or composite genes are then assembled into two lists. These lists are then used to generate two sequence databases, the first one containing all component sequences from the 24 genomes and the second containing all composite sequences.
These sequence databases of component and composite proteins are then compared against themselves using the BLASTp (version 2.0) sequence similarity searching algorithm  (cut-off E-value 1 × 10-10), then clustered according to their similarity using the RAGE algorithm . The RAGE algorithm lists all composite and component proteins in clusters according to similarity and domain structure. Homologous proteins with similar domain structure were clustered together. Each cluster in this case indicates a distinct class of fusion event and cluster members indicate which proteins involved in this type of event from different genomes. These clusters are used to calculate the number of unique fusions detected within and across genomes. This is done by examining how many distinct types of fusion are present in any given genome.
All composite proteins were clustered into 621 families and a distance measure δ was derived according to the sharing of clusters between the 24 species examined. This pairwise distance measure is calculated as δ = (1 - SA,B/TA,B) × 100, where SA,B is the number of shared composite clusters and TA,B is the average of the composite cluster counts from the two species. This measure is reminiscent of a recent genome-wide "ortholog" analysis . This measure was used to calculate a nearest-neighbor dendrogram for the 24 species. Bootstrap values were generated using a 'delete-half jack-knife procedure.
All results of the present analysis are available from the Computational Genomics Group website .
We thank John Aach (Harvard Medical School), Despina Alexandraki (University of Crete and IMBB, Heraklion) and members of the Computational Genomics Group at the EBI for discussions. This work was fully supported by the European Molecular Biology Laboratory (EMBL). C.O. acknowledges further support from the European Commission DGXII (Science, Research and Development), the Medical Research Council (UK) and IBM Research. Patent application filed on behalf of EMBL.
Ito T, Tashiro K, Muta S, Ozawa R, Chiba T, Nishizawa M, Yamamoto K, Kuhara S, Sakaki Y: Toward a protein-protein interaction map of the budding yeast: a comprehensive system to examine two-hybrid interactions in all possible combinations between the yeast proteins.
Uetz P, Giot L, Cagney G, Mansfield TA, Judson RS, Knight JR, Lockshon D, Narayan V, Srinivasan M, Pochart P, et al.: A comprehensive analysis of protein-protein interactions in Saccharomyces cerevisiae.
Promponas VJ, Enright AJ, Tsoka S, Kreil DP, Leroy C, Hamodrakas S, Sander C, Ouzounis CA: CAST: an iterative algorithm for the complexity analysis of sequence tracts. Complexity analysis of sequence tracts.