A response to Functional associations of proteins in entire genomes by means of exhaustive detection of gene fusions by AJ Enright, CA Ouzounis. Genome Biology 2001, 2:research0034.1-0034.7
The field of predicting protein-protein interactions has been active for over two decades, but in silico methods have become prominent in the last three years with the expansion of analyses at a genomic scale [1,2]. The rationale of one of the computational methods is as simple as it is elegant: two polypeptides A and B in one organism are likely to interact if their homologs are expressed as a single polypeptide AB in another. The latter polypeptide (AB) is called a Rosetta Stone protein, as it contains information about both A and B. Marcotte et al. have proposed that fusion to form a single polypeptide reduces the entropy of dissociation of A and B. The result is a huge increase in the local concentration of A with respect to B. A recent paper in Genome Biology describes the work of Enright and Ouzounis, who carried out a vast analysis of genes of the Rosetta Stone type over 24 genomes, including eukaryotic genomes. They uncovered many new 'composite' or 'Rosetta' proteins, many of them contributed by eukaryotes. Here, I provide simple arguments to suggest that including eukaryotic sequences in the analyses may increase the robustness of predictions made using the Rosetta Stone approach.
In prokaryotes, transcription and translation are coupled, and functionally related genes are clustered. Dandekar et al.  compared nine prokaryotic genomes and noticed a poor conservation of architecture of operons (clusters of co-transcribed functionally related genes) - but what is an operon in one organism may, in another, be a regulon (several co-regulated operons or sub-operons). They noticed that a number of gene pairs were highly conserved, taking this as evidence for direct physical interactions between the corresponding gene products, rather than a reflection of co-regulation resulting from functional coupling (see also ). Given that a biochemical function in many cases depends on the action of a multimeric complex, a correlation between co-regulated and interacting proteins is to be expected (corresponding to a proportion of the positive hits in the approach of Dandekar et al.).
The rationale of the Rosetta-type search seems to be far more robust, but the existence of a Rosetta protein is not always proof of a protein-protein interaction. The operon can be considered as a selfish cassette of DNA that can confer a selective advantage under certain conditions. The operon is therefore a gene cluster that has been assembled by deleting 'uninteresting' intervening sequences and can be spread by horizontal gene transfer to many recipient genomes. From a minimalist perspective, it is conceivable that deletion of an intervening sequence between two adjacent genes may lead to an in-frame fusion of the open reading frames (ORFs). If folding of the fused proteins is not altered, this is a way to co-regulate gene expression as efficiently as an operon does for separate ORFs. Thus, fusion events may reflect an alternative strategy of co-regulation and not direct physical interactions. This might explain, at least in part, one surprising result of Enright and Ouzounis: when they tried to validate their predictions using the results of a yeast two-hybrid experiment on a genomic scale, only one case was validated. This may also reflect, as the authors note, the extremely high number of false-positive hits of the two-hybrid method. Consider also that Mycoplasma genitalium (with a 580 kb genome containing 479 ORFs), which has a genome smaller than that of Mycoplasma pneumoniae (816 kb, 677 ORFs), nevertheless contains 15 Rosetta proteins whose M. pneumoniae homologs are encoded by split genes. The reverse comparison shows that M. pneumoniae has only four Rosetta proteins when the reference genome is that of M. genitalium. Although this does not preclude the possibility of physical interactions between the putative partners, it can be used as a circumstantial argument to suggest that reductive evolution may push towards gene fusion for the sake of economy.
Although in prokaryotes fusion events may in some instances reflect co-regulatory strategies, the introduction of eukaryotic genomes into searches for Rosetta proteins may help improve the robustness of predictions, especially when the hits involve organisms from different kingdoms of life (eukaryotes versus bacteria and archaea). Instead of using 'energetic' arguments (changes in entropy, ΔS, or in the Gibbs free energy, ΔG), we can use a simple mass-action approach to justify this. Consider, for example, a dimerization reaction characterized by the equilibrium constant K, such that
$$K = \frac{AB_{eq}}{A_{eq}\,B_{eq}}$$
where Aeq, Beq and ABeq are the concentrations at equilibrium. If Ao and Bo are the initial concentrations of A and B, the constant can be expressed as
$$K = \frac{AB_{eq}}{(A_o - AB_{eq})\,(B_o - AB_{eq})}$$
This formula is an ordinary quadratic equation in ABeq, and we can find the value of ABeq analytically. Imagine now that we have a cell that swells (expanding from the volume of a prokaryotic cell to that of a eukaryote) without changing the input amount of each subunit. A modest increase in volume, say 1,000-fold, translates into a dramatic decrease in the concentration of AB. For instance, for $K = 10^{8}\ \mathrm{M}^{-1}$ and Ao = Bo = 1 μM, a 1,000-fold dilution leads to over 10,000 times less AB at equilibrium (see Figure 1a). Even worse, if the starting concentration of A and B is 1 nM, the same dilution leads to over 800,000 times less AB. Increasing K may help to overcome this problem. With $K = 10^{12}\ \mathrm{M}^{-1}$ and Ao = Bo = 1 μM, we have only about 1,000 times less AB (a change proportional to the dilution factor). However, if Ao = Bo = 1 nM, a reduction of only about 1,000-fold is obtained for $K > 10^{15}\ \mathrm{M}^{-1}$.
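The figures quoted above can be checked with a short numerical sketch. It assumes the simple 1:1 dimerization A + B ⇌ AB described in the text and takes the smaller root of the quadratic, which is the only one satisfying 0 ≤ ABeq ≤ min(Ao, Bo). The function names `ab_equilibrium` and `fold_reduction` are illustrative and do not come from the original article.

```python
import math


def ab_equilibrium(K, a0, b0):
    """Equilibrium concentration of AB (in M) for A + B <-> AB.

    Solves K = x / ((a0 - x) * (b0 - x)), i.e.
    K*x**2 - (K*(a0 + b0) + 1)*x + K*a0*b0 = 0,
    and returns the smaller root, which lies between 0 and min(a0, b0).
    """
    b = K * (a0 + b0) + 1.0
    # Discriminant b**2 - 4*K**2*a0*b0, expanded so that no subtraction occurs.
    disc = (K * (a0 - b0)) ** 2 + 2.0 * K * (a0 + b0) + 1.0
    # Equivalent to (b - sqrt(disc)) / (2*K), but written to avoid
    # subtractive cancellation when K*a0 is very small (weak complexes).
    return 2.0 * K * a0 * b0 / (b + math.sqrt(disc))


def fold_reduction(K, c0, dilution=1000.0):
    """Ratio of ABeq before and after diluting Ao = Bo = c0 by `dilution`."""
    before = ab_equilibrium(K, c0, c0)
    after = ab_equilibrium(K, c0 / dilution, c0 / dilution)
    return before / after


if __name__ == "__main__":
    cases = [(1e8, 1e-6), (1e8, 1e-9), (1e12, 1e-6), (1e15, 1e-9)]
    for K, c0 in cases:
        ratio = fold_reduction(K, c0)
        print(f"K = {K:.0e} M^-1, Ao = Bo = {c0:.0e} M: about {ratio:,.0f}-fold less AB")
```

Under these assumptions the script should reproduce the orders of magnitude given in the text: somewhat more than 10,000-fold for K = 10^8 M^-1 at 1 μM, more than 800,000-fold at 1 nM, and close to the 1,000-fold dilution factor itself once K is large enough.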
Figure 1. (a) The relationship between the fold reduction of the equilibrium concentrations of AB (in the dimerization reaction discussed in the text) and the association constant K following a 1,000-fold dilution of Ao and Bo. Two different initial concentrations of A and B have been considered (1 μM and 1 nM). Calculations were made using the analytical solution of the equation: $\text{fold reduction} = AB_{eq}(A_o, B_o) \,/\, AB_{eq}(A_o/1000,\ B_o/1000)$. Note that both axes are logarithmic. It is easy to see that the differences in the yield of AB after dilution depend on the initial input concentrations and are enormous for low K values (weak complexes). These differences, however, decrease as K increases (tight complexes). Notice that for lower initial concentrations of A and B the curve moves towards the right, and higher K values will be required to obtain a similar fold-reduction relative to the situation where the initial monomer concentrations are higher. Fusion of A and B is intuitively equivalent to increasing K to infinity or to enormously increasing the amounts of A able to 'see' B. There is a link between K and classical thermodynamic parameters (such as those evoked in ) but these parameters are less immediate than the mass-action arguments and are of minor importance for understanding here. (b) The equilibrium concentration of A (or B) and AB as a function of K ($\mathrm{M}^{-1}$) for Ao = Bo = 1 nM. It is clear that for low K the association reaction is very inefficient. As K increases, the concentration of free monomers (physicochemically required to fulfill the constraints of the mass action law but biologically useless) decreases. Fusion of A and B is the best option available to avoid monomer wasting and the diffusional problems mentioned in the text.
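The trend described for panel (b) can be tabulated with a minimal sketch, again assuming the 1:1 dimerization model of the text with Ao = Bo = 1 nM. The helper names `ab_equilibrium` and `monomer_partition` and the particular range of K values are hypothetical choices made for illustration; they are not taken from the original figure.

```python
import math


def ab_equilibrium(K, a0, b0):
    """Physically meaningful root of K = x / ((a0 - x) * (b0 - x)), in M."""
    b = K * (a0 + b0) + 1.0
    disc = (K * (a0 - b0)) ** 2 + 2.0 * K * (a0 + b0) + 1.0
    # Cancellation-free form of the smaller quadratic root.
    return 2.0 * K * a0 * b0 / (b + math.sqrt(disc))


def monomer_partition(K, c0=1e-9):
    """Return (free monomer, AB) concentrations in M for Ao = Bo = c0."""
    ab = ab_equilibrium(K, c0, c0)
    return c0 - ab, ab


if __name__ == "__main__":
    c0 = 1e-9  # 1 nM, as in panel (b)
    print("K (M^-1)   free A (M)   AB (M)      % of A left free")
    for exponent in range(6, 17, 2):  # K = 1e6, 1e8, ..., 1e16 (illustrative range)
        K = 10.0 ** exponent
        free, ab = monomer_partition(K, c0)
        print(f"1e{exponent:<8d} {free:.3e}    {ab:.3e}   {100.0 * free / c0:6.2f}")
```

With these inputs, almost all of A remains free at the weak end of the range, and only at very large K does the free fraction become small, which is the 'monomer wasting' that the caption refers to.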
It is clear that increasing the affinity (K) between the partners enhances dimer formation even at low input doses of monomers (Figure 1b). The increase in K required can be enormous, however, and even in the case of an irreversible (ultra-tight) dimerization, it will always be limited by the translational diffusion of the partners. If the cell requires a certain amount of a product at a certain moment, this may be unattainable, from a kinetic point of view, with low monomer concentrations. This is especially critical in eukaryotes, where co-regulated genes are not physically linked and where transcription and translation take place in different compartments. If similar levels of AB activity were needed by both types of cells (that is, a small or prokaryotic cell and a large or eukaryotic cell), three main non-exclusive alternatives are possible: first, a proportional increase in the input molar concentrations in the large cell; second, the introduction of compartments in the large cell; and third, fusion of the interacting partners. The first strategy is not parsimonious because, in our hypothetical example, a 1,000-fold increase in the initial concentrations of all interacting partners must be guaranteed (involving a 1,000-fold increase of 'biologically useless' free monomers, that is, Ao-ABeq, Bo-ABeq). This might also imply drastically slowing down the turnover of the proteins so as to allow their accumulation. The second strategy is more advantageous. Note, however, that in most cases independent polypeptides must diffuse across the cytosol (site of synthesis) to reach their compartments. Since the compartments and/or organelles are big targets and able to sequester the proteins, they may relieve the diffusion problem, whose consequence is only kinetic. 
If the cell needs high concentrations of the AB complex in a short period of time, however, increasing the input concentration of monomers will be required (in spite of the 'relief' provided by the organelles); if time is not a problem, by contrast, the cell can 'wait' until the organellar concentrations of monomers and/or complexes are the correct ones. The strategy of gene/protein fusion is the most parsimonious and kinetically advantageous. It dramatically helps to diminish the amount of transcribed and translated products required to attain the desired levels of functional activity. This is clearly beneficial in terms of time and energy consumption, and can also be applied to proteins sorted to specialized compartments (no diffusion of monomers is required to enable them to meet inside the compartments or within membranes).
In the case of enzymes, we can also evaluate the advantages of gene or protein fusion from the perspective of the chemical reactions they catalyze. In prokaryotes, after translation, whether the gene products interact physically or not, they are all produced in relative proximity. In eukaryotes, protein fusion to yield polyproteins able to catalyze successive steps of a metabolic pathway provides a great advantage compared to producing independent polypeptides. Note that the partition of the cellular volume into organelles also enhances metabolic efficiency and that, perhaps, the existence of organelles is linked to improving metabolic processes rather than to relieving the protein diffusion problem evoked above. This is reminiscent of the notion of metabolic channeling, used to describe the restricted flow of substrates and products in multienzyme systems (substrates and/or products are passed from one active center to another). It has been argued that free diffusion is sufficiently rapid to obviate the need for channeling . Again, this is easily applicable to prokaryotes. In eukaryotes, however, large cytosolic volumes may result in a greater need for channeling.
It is safe to assume that most Rosetta proteins conserved across eukaryotes and prokaryotes are responsible for the 'core' metabolism (intermediary and basic information transfer, as defined in ). Consistent with this, the comparison of Drosophila with other organisms shows that, in almost all cases where functional annotation is known, the Rosetta components are involved in the core metabolism. Exploitation of these results might aid understanding of how simple organisms work. Even this would be a big achievement, which would in turn help us to understand more complex eukaryotic systems. From the perspective outlined here, eukaryotic Rosetta proteins are likely to reflect protein-protein interactions better (producing fewer false positives) than those found outside Eukarya (many of which are also relevant). On the other hand, Rosetta proteins specific to eukaryotes might reflect the modular nature and the combinatorial design of many eukaryotic components (complex transcription factors, signal transduction molecules and molecular adaptors). It is conceivable that nature could have evolved a huge combinatorial panoply of enzymes with a limited set of interacting generic domains (cofactor-binding and catalytic domains), but in fact selection has favored a 'copy-and-paste' strategy, allowing the multiplication of domains that appear today as fusion products. The results of this copy-and-paste process can be grouped into two main classes: a primordial scenario leading from several domains constituting several peptides to several domains combined into one polypeptide, and a sophisticated scenario leading from several genes encoding several polypeptides to one gene for one polyprotein. The evolutionary path followed from one state to the other is, however, largely unknown.
I thank Sandrine Caburet for helpful discussions.