Why do some groups of physically linked genes stay linked over long evolutionary periods? Although several factors are associated with the formation of gene clusters in eukaryotic genomes, the particular contribution of each feature to clustering maintenance remains unclear.
We quantify the strength of the proposed factors in a yeast lineage. First we identify the magnitude of each variable to determine linkage conservation by using several comparator species at different distances to Saccharomyces cerevisiae. For adjacent gene pairs, in line with null simulations, intergenic distance acts as the strongest covariate. Which of the other covariates appear important depends on the comparator, although high co-expression is related to synteny conservation commonly, especially in the more distant comparisons, these being expected to reveal strong but relatively rare selection. We also analyze those pairs that are immediate neighbors through all the lineages considered. Current intergene distance is again the best predictor, followed by the local density of essential genes and co-regulation, with co-expression and recombination rate being the weakest predictors. The genome duplication seen in yeast leaves some mark on linkage conservation, as adjacent pairs resolved as single copy in all post-whole genome duplication species are more often found as adjacent in pre-duplication species.
Current intergene distance is consistently the strongest predictor of synteny conservation as expected under a simple null model. Other variables are of lesser importance and their relevance depends both on the species comparison in question and the fate of the duplicates following genome duplication.
The precise location of genes in eukaryotic genomes was assumed to be largely random not so long ago . This was motivated by the understanding that, unlike in bacteria, there need not be chromosomal domains associated with high rates of gene transcription. Common reports of chromosome inversions with little effect on phenotype confirmed the picture of random placement of genes and a lack of selective constraint on gene order .
However, recent studies in diverse eukaryotes challenge this initial intuition . Indeed, in all well studied eukaryotic genomes, genes of similar expression tend to cluster more commonly than expected by chance . For example, in humans both broadly [4,5] and highly [6,7] expressed genes cluster, while in yeast highly co-expressed genes are neighboring more commonly than expected . The same tendency for genes that are physically close to be co-expressed might additionally explain why genes whose proteins are close in either the metabolic [9,10] or protein-protein interaction network [11,12] are in close chromosomal proximity more commonly than expected. More subtle organizations have also been claimed, such as periodicity in gene location , but this appears to be caused by data biases . Not all patterns are necessarily associated with co-expression of some variety. Most notably, in yeast, essential genes cluster into domains of low recombination . The clustering of essential genes may be more to do with ensuring precise control over expression (that is, minimal noise), rather than co-expression per se .
While all these previous analyses helped to clarify some of the factors associated with gene order, they also opened new questions. Most particularly, how important, in absolute and relative terms, are all of these features? If we take a pair of genes adjacent to each other in Saccharomyces cerevisiae, we can then ask whether the same two genes are also adjacent or not in a different species. How relevant are the above parameters in explaining which genes are adjacent in both species? In yeast, intergene distance and co-expression have been shown to be two independent determinants of gene order conservation using Candida albicans as comparator species . Intergene distance is expected under the simplest neutral null model of gene order evolution. This is because we suppose that a re-arrangement that disrupts a gene will not be tolerated, hence those genes currently with a large intergene distance between them are more likely to be affected by viable gene re-ordering events, all else being equal (note that under this simplest null model it follows that overlapping genes are impossible to break up). Likewise, a pair of genes that currently have a large intergene distance between them are more likely to have had in the past a large intergene distance, even if they were not immediately next to the genes that are currently their neighbors.
Other evidence suggests that this null model alone is not adequate. Notably, essential genes tend to stay together more commonly than expected by chance, although their mean intergenic distance is unexceptional [15,18]. Whether this is owing to selection per se or simply a reduced probability of chromosomal re-arrangements in domains of low recombination  remains unclear. We can then ask a series of questions. First, if we treat each parameter in isolation, we can ask whether that parameter explains a significant proportion of conservation of gene order. Second, in a fuller model we can ask how relatively important and independent each of the parameters might be. Third, are the results of the above analyses sensitive to which comparator species we employ to compare with S. cerevisiae? Fourth, can we predict the characteristics of those gene pairs that through all lineages of yeast have remained physically together? Finally, what will be the effect of the differential gene silencing associated with the whole genome duplication in the yeast lineage?
To address these questions, we computed a group of potential determinants in S. cerevisiae and quantified how they determined linkage conservation in a full yeast lineage.
Results and discussion
A neutral model of gene order evolution
While in principle a relationship between intergene distance and conservation rates of genes that are immediate neighbors seems reasonable, a problem in demonstrating this derives from the fact that intergene distance data that we can directly obtain from genome sequencing describe the situation after the process of evolution from an ancestor. If we assume that DNA is neither lost nor gained, then a gene pair with a small intergene distance in S. cerevisiae may have a small distance either because the pair have always resided together and the intergene distance has not changed or because the pair came together following an inversion and this inversion just happened to bring with it a small intergene spacer. Moreover, two genes may be together in both S. cerevisiae and a relatively distant comparator, for example, C. albicans, not because they have always been immediate neighbors but because repeated events broke them up but also re-positioned them, bringing them back together. To further investigate the extent to which intergene distance might differ between genes that are immediate neighbors in any two species and those that are immediate neighbors only in one of the two species, we performed a set of neutral simulations.
In these simulations we consider a chromosome with 400 genes. The intergene distance between any gene pair is randomly selected from intergene distances currently observed in S. cerevisiae (after removal of overlapping transcripts). We then randomly select a position on the chromosome and accept this position if it is an intergene spacer. We then pick a point that is approximately 5 kb upstream or downstream of the selected chromosome location and accept it as the end point of the inversion if in intergene spacer. This distance approximately matches the mean size of the small inversions seen in yeasts . We then invert the sequence, thereby altering intergene distance between, at the most, two pairs of genes. We then carry on evolving the new chromosome over numerous rounds of inversions. We repeat the simulation for 1,000 inversions 100 times.
The first question to ask is what might be the relationship between the number of genes that are still immediate neighbors in the derived chromosome that were ancestrally also immediate neighbors. To examine this we compare the evolved chromosome with the ancestral one and partition gene pairs into those that are in retained synteny (that is, still immediate neighbors) and those that are not. Note that if A and B reside next to each other in the ancestor, then AB or BA ordering is considered to be preserved synteny, regardless of the DNA strand on which the two genes reside. Results are shown in Figure 1. As can be seen, the data describe an exponential decay function of rates of synteny conservation with increasing numbers of inversions. Note too that the asymptote of this function is not zero conserved synteny. This is owing to the fact that by chance in any random chromosome a certain number of gene pairs will be the same as in any other random chromosome.
Figure 1. Rate of synteny conservation in a null model of gene order evolution. The relationship between the proportion of gene pairs retained as neighbors and the number of inversions between two taxa.
The second question to ask is what, at any given time point following divergence from an ancestor, is the difference in intergene distance between those genes currently in preserved synteny and those that are no longer nearest neighbors. To examine this we compare the current intergene distances between the two groups. For each simulation we consider the mean intergene distance in the two groups and then consider the mean of these means over all simulations. As can be seen (Figure 2), at all divergence times (measured as number of inversions) the group remaining in synteny has a smaller mean intergene distance in the descendent chromosome. At least two reasons underpin this. First, as previously noted, randomly selected positions are most likely in long intergene spacers. A second, less appreciated fact is that when genes separated by a long spacer are involved in inversions, they tend to bring with them abundant intergene spacer sequence. Hence, not only do those genes that are retained in synteny comprise a special subgroup associated with low intergene distance, but those genes not retained in synteny tend not to transfer to the small intergene distance class.
Figure 2. Differences in intergene distance. Intergene distances of gene pairs currently in synteny and in the ancestor (blue) and those that were not ancestral neighbors but currently are neighbors (red) as a function of the number of inversions (error bars are standard error of the mean).
A third question to ask is how the simulations relate to real data. To do this we consider the following for both the real and simulated data. We take all currently observed neighboring gene pairs (that is, those in S. cerevisiae or those at the relevant point in the simulations) and rank order them according to their intergene distance. We then consider the top 50 (smallest intergene distance) and ask what proportion are retained in synteny. We then consider ranks 2 to 51 and repeat the calculation and so forth. For each species comparison (S. cerevisiae versus other), we consider in the simulations the distributions after the same number of inversions as corresponds to the number of gene pairs overall that are not in preserved synteny in the real data. As can be seen (Figure 3 and Additional data file 1), the correspondence between the simulant data and the observed data is striking. In all cases, for both the real data and the simulation, the genes with the shortest intergene distances show much higher levels of synteny conservation than those with longer intergene distances. This qualitative fit suggests that a simple null model that genes divided by large intergene distances are more likely to be re-ordered or, more precisely, to have been re-ordered, provides, to a first approximation, a good fit. This is yet more remarkable given that we suppose that gene orientation has no impact on the effect of a re-arrangement. This is an unrealistic assumption given that, of all three possible orientations of gene pairs, two convergent genes (→ ←) are unique in having no promoter sequence in the intergene spacer. This might in part explain why in many cases a further regularity appears, namely that gene pairs currently closely linked (that is, with little intergene distance) are not conserved quite as much as in the simulations while, conversely, those with relatively large intergene spacers in S. cerevisiae tend to be conserved somewhat more than seen in most simulants, while the overall rate of synteny conservation is the same. Prior evidence, however, also supports the view that the simple null model, even allowing for gene orientation, cannot explain everything. To assess the importance of the other suggested correlates, we consider a set of statistical approaches in which we look for deviations from the null model given by current intergene distance.
Figure 3. Proportion of gene pairs conserved in a comparator versus intergene distance in S. cerevisiae. Profiles of the rate of gene pairs conserved versus their current spacer in S. cerevisiae (red) or in simulants (blue) when comparing S. cerevisiae with two comparator species for (a) C. glabrata and (b) A. gossypii. For the simulations the number of inversions to run was determined by comparing observed synteny conservation rates against inversion number as shown in Figure 1. For our five focal species we also restricted analysis to cases where both of the orthologues of the S. cerevisiae gene pair are on the same chromosome in the comparator species, as this fits better the simulant model and permits higher orthology certainty. Each data point in the real and simulant data represents the proportion of gene pairs from 50 showing conserved synteny, after the data was rank ordered by intergene distance. After considering the first 50 we then considered ranks 2-51, 3-52, and so on. In addition, we also considered other comparators, and a much more distant comparator, C. albicans (Additional data file 1).
Predictors of gene order conservation
We consider specifically seven factors either previously associated with the formation of clusters or that could predict linkage conservation in S. cerevisiae. These predictors are the following: met, metabolic relationship [9,10,20]; cex, gene co-expression ; igd, physical proximity (that is, intergenic distance ); let, density of lethals (that is, local essential gene density) ; rec, recombination rate ; cre, gene co-regulation (number of common regulatory motifs between two genes) ; and pro, distance in the protein-protein interaction network [11,12].
In asking whether the above parameters predict gene order conservation, we could be asking two different questions. First, we could ask whether gene pairs of a particular class, given they are of the same class, are preserved as immediate neighbors more than those not in the same class. Second, we can ask whether, in determining which genes are preserved in linkage, the fact of belonging to the same class can explain much of the variation. The difference in analysis can be easily illustrated. Consider that there were just two genes in the genome that belonged to a given class (perhaps there are just two genes involved in a given cell process, process X). Consider also that these two genes were always preserved in linkage for some reason. At first sight process X looks like a strong predictor, as, given that two genes both belong to class X and are neighbors in one species, we can be sure they are neighbors in another. If we approach the analysis using the first method we would conclude that belonging to class X was important. However, as a variable to explain patterns of conservation or not of gene pairs in general, it explains very little of the conservation of gene order (just one pair) and most conservation of synteny has nothing to do with belonging to class X or not. The second mode of analysis would report that belonging to class X is not an important variable. A priori then, we expect the answers to depend on precisely what questions we ask.
We concentrate predominantly on the second mode of analysis. We take two broad approaches. First, for given pairwise comparisons (S. cerevisiae versus other species) we ask about statistical models that act to explain the variation between gene pairs as to whether they are syntenic (immediate neighbors) in both species or not. Second, we ask about the properties of gene pairs that are syntenic in all of the species concerned. While the first question allows us to ask whether predictors of conservation of gene order are dependent on the taxa compared and their phylogenetic distance, the second mode permits us to distil the properties that enable gene order conservation in the long term.
To begin, we start by asking whether there is a difference in the predictor variables, each in isolation, between those gene pairs that remain as immediate neighbors and those that do not remain adjacent. To this end, we first computed these values for adjacent gene pairs in the S. cerevisiae genome with homologues in a given comparator yeast species. We considered five hemiascomycetes species included in the Yeast Gene Order Browser  in which the number of gene rearrangements is not too high. The identification of homologues in a given comparator should take into account the whole genome duplication event that occurred in a shared ancestor of some of the species considered. One has to specifically distinguish between orthologues and paralogues to properly argue about the conservation of a pair as adjacent. To avoid any ambiguous assignment, we used the set of ancestral loci introduced in . Table 1 shows the mean values for these properties for adjacent genes in S. cerevisiae, within the previous set, which are found adjacent (-co) or nonadjacent (-nc) in the corresponding comparator. As expected, a significant covariate for gene order conservation in all cases is the intergenic distance, with the physical distance between adjacent pairs in S. cerevisiae clearly smaller for those pairs found adjacent in each species comparison. Of the remaining parameters, not all appear as important predictors. While genes adjacent in several species have a stronger co-expression signal and lower recombination rates, by contrast, distance in the metabolic and protein-protein interaction networks did not appear to be a relevant determinant of order conservation. In part this may be a methodological artifact as we assume that all genes not present in the network have an average distance, and most genes are not in the network (a more restricted study, examining only those genes featured in the networks, corroborated their relevance; see Materials and methods). We keep nevertheless both predictors in our study as controls. In summary, this analysis confirms the relevance of several aspects of gene expression control and genetic linkage as predictors of synteny conservation in yeast. This method cannot, however, really quantify the relative importance of each of them, so it is this we describe next.
Table 1. Determinants of gene order conservation
Quantifying predictor relevance in single species
We use multivariate analysis to disentangle the contribution of each of the previous factors to gene order conservation. The general idea is to describe the relationship between a dependent variable, the response, and a group of independent variables, the predictors, by means of a multiple regression model. In our case the response variable takes two discrete values; an adjacent gene pair in S. cerevisiae could be found as adjacent or nonadjacent in a given comparator species, so we apply logistic regression . We consider several complementary strategies to estimate the relevance of each linkage predictor. In addition, since the correlation between some of the determinants could also be relevant, for example, essential clusters having low recombination rates , a phenomenon termed multicollinearity in regression modeling, we compute the correlation matrix of the parameter estimates in the logistic equation to quantify this effect (Materials and methods). The final outcome of all these combined studies is the simplest logistic model capable of predicting the observed conservation patterns.
We first apply these methods to the closest comparator species, Candida glabrata, a post whole genome duplication (WGD) species (Table 2). In the univariate regression studies, the residual deviance of a logistic model with a single covariate is shown. We find a deviance value smaller than that of the null case (dev.null = 853.06) for some of the variables, notably co-regulation (dev.cre = 845.2). This indicates the possible relevance of this factor as a predictor. The strength of each factor is further supported by the order of appearance of the corresponding variable in a forward stepwise regression model. This method includes as part of the descriptive model only those terms that increase the goodness of fit (according to the Akaike's criterion). In the multiple regression analyses, we present estimates of the regression coefficients related to each predictor with their corresponding standard errors (z-values). The last two subcolumns in this study give the deviance table, differences between models as variables are added to the model in turn, and the probabilities associated with an approximated χ2 test (deviance differences have approximately a χ2 null distribution with degrees of freedom equal to the difference between the numbers of parameters in the two models ). After this combined study, we introduce a reduced model in which we retain only co-regulation and intergenic distance as significant determinants. Indeed, if we compare the full model, including all variables, with the reduced one, we can hardly notice the difference in coefficient estimates or deviance (data not shown). This is striking as it suggests that, for this particular analysis, most of the proposed co-variates are too weak to register as explanatory variables.
Table 2. S. cerevisiae versus C. glabrata logistic regression analyses
The relationship between the probability that an adjacent pair in S. cerevisiae was adjacent also in C. glabrata, Pr, and their intergenic distance and co-regulation is:
where the logit transformation is given by logit , and igd, cre denotes intergenic distance (in units of base-pairs) and co-regulation score (with a maximum value of 1), respectively. The model indicates that the probability to be adjacent in both species decreases with intergenic distance and co-regulation, the latter being the weaker of the two determinants. The relevance of each variable is easily determined by comparing the coefficients in Table 2, where variables were scaled in standard deviations (standardized data). A higher absolute value of an estimate in these units, and its order of appearance in the stepwise regression, reflects this relevance. Moreover, the previous model gives us the effect of change in one determinant when controlling the other. Thus, the effect of an increase in intergene distance (in units of base-pairs) for a fixed co-regulation score is exp(-0.001) = 0.999, while the maximal effect of increase in co-regulation, controlling for spacer, is exp(-0.526) = 0.591 with cre = 1. Are these effects independent? Analyzing the correlation between estimates, we find some dependence between both predictors (correlation of cre-igd coefficients: -0.19). We could in turn add an additional term in the model to account for this interaction. However, the decrease in deviance achieved by this more complex model is small, so we can still consider the two-component model as a valid description. Overall, this corroborates that an increase in intergene distance diminishes the probability that genes are adjacent in both species. This also suggests that non-adjacently conserved pairs exhibit stronger co-regulation, which is, at first sight, a counter-intuitive result. Analyzing this behavior in detail (data not shown), we find that high co-regulation scores are associated with a low density of regulatory motifs, that is, regulation by a small number of common transcriptional factors. It is probably this low density of regulatory sites that ensures that any re-arrangement is less likely to be opposed by purifying selection.
Predictors of linkage conservation in a yeast lineage
How would the previous model change with the comparator species? To analyze this question, we consider five comparator species: C. glabrata (discussed above), Saccharomyces castelli, Kluyveromyces waltii, Kluyveromyces lactis, and Ashbya gossypii. We apply the same methodology as in the previous section (Additional data file 1) to obtain the corresponding reduced logistic models. These models include only those terms which significantly contributed to explain the conservation pattern.
Table 3 shows the models associated with the different comparators. We see that recombination rate and, especially, co-expression emerge as new determinants for pre-WGD species. We examined how the probability to remain adjacent changes with the comparator for the situation of an adjacent pair with null spacer and averaged features, that is, zero mean co-expression, no co-regulation and averaged recombination rate (rec = 1). These probabilities are (see Table 3):
Table 3. Logistic models of gene order conservation for different comparator species
Thus, an 'averaged' adjacent pair in S. cerevisiae with null intergenic distance is less likely to be found as adjacent in a given comparator as phylogenetic distance increases. While this is, naturally, trivial, it goes some way to validating the method. More interesting is to see how this behavior changes when these pair types have non-zero intergenic distances? For a characteristic spacer of 500 bp, the probabilities to remain adjacent are (0.93, 0.85, 0.75, 0.53, 0.56), which correspond to a percentage of decrease with respect to previous values of ~(2.1%, 6.6%, 9.6%, 18.5%, 17.6%). Gene pairs with a large intergene distance should then be disproportionately more likely to be conserved as adjacent the closer the comparator species is to S. cerevisiae. Put differently, as the time to common ancestor increases, as intergene distance increases, so the probability that the genes are not in synteny in the comparator goes up at an accelerating rate.
Co-expression and intergenic distance act as the two main determinants of order conservation in pre-WGD species (clustering of essential genes near the adjacent pairs appears as a third determinant in K. waltii). These variables appear to be independent since the correlation of their corresponding estimates is low in all proposed models (<0.08 in all three pre-WGD comparators). To analyze the role of co-expression in more detail, we discretize the co-expression values so that a unit increase in the model corresponds to an increase of 0.1 in the correlation (estimates did not change very much with respect to those in Table 2). What is the effect of an increase in co-expression? This is given by the exponential of the corresponding coefficient in the logistic model. For K. waltii, this reads as exp(0.78) = 2.18, which indicates that, controlling for intergene distance and recombination rate, each increase in the correlation of 0.1 increases the odds that a pair remains as adjacent by 2.18 (slightly higher values applied to K. lactis and A. gossypii).
Properties of gene pairs preserved throughout yeast evolution
Rather than asking whether given variables explain conservation of order in a given pairwise comparison, for which, as we have seen, the answer heavily depends on the species chosen, we can also ask which parameters are relevant to gene order conservation by considering only those gene pairs that are immediate neighbors through all the species we have examined (509 pairs). Naturally, the answers will again be somewhat subject to precisely which species we consider (indeed, consider a comparator as distant as Schizosaccharomyces pombe and no gene pairs are conserved among all species). Nonetheless, this method allows us to distil the factors that are consistently important across a long evolutionary time period. The mean values of the determinants associated with this set were compared to those obtained by randomly selecting a group of genes of the same size (this was obtained by sampling from the full set of S. cerevisiae adjacent genes with homologues, adjacent or not, in at least one comparator species, 10,000 times). Adjacent pairs that have been retained as such in the whole lineage exhibited higher co-expression (0.0843 versus a random value of 0.0767, P < 0.05), smaller intergene distance (344.52 bp versus a random spacer of 422.44, P < 0.0001), higher density of essential genes nearby (0.25 versus a random density of 0.21, P = 0.0003), smaller recombination rate (1.029 versus a random rate of 1.043, P < 0.05) and lower co-regulation (0.093 versus a random score of 0.125, P < 0.01). These results thus support the view that the importance of predictor variables is in the order: intergene distance > local density of essential genes > co-regulation > co-expression and recombination rate. The lineage effect can be alternatively quantified by comparing the nominal values of the determinants of those genes remaining non-adjacent in the least distant species to S. cerevisiae, C. glabrata, with those of the genes retained as adjacent in the most distant species, A. gossypii. This corroborates the relevance of co-expression, density of flanking essential genes and intergene distance in the conservation of pairs in the lineage (Figure 4).
Figure 4. Determinants of close non-adjacently conserved pairs versus distant adjacently conserved pairs. The difference between the ratio of determinant values of non-adjacently conserved genes in a close species to S. cerevisiae (C. glabrata) and those adjacently conserved in a distant species (A. gossypii) is plotted in red for each predictor (line between points to help visualization). This ratio is defined as the quotient between the corresponding values of the close (distant) pairs and those of the adjacently conserved pairs in the close species, that is, C. glabrata. We also plotted the null behavior obtained by random sampling of the combined group, close and distant, preserving group size, 10,000 times (mean, continuous blue line, ±2 standard deviations, dashed blue lines). Behavior was qualitatively robust for the cex, igd, and let predictors when using S. castelli and K. lactis as close/distant comparator (Additional data file 1).
Given that clusters of essential genes also have low recombination rates, we can, in addition, ask whether the retention of synteny of those gene pairs in the middle of essential gene clusters is due to the low recombination rate. In wheat, for example, it is observed that chromosomal domains associated with low recombination rates also have low re-arrangement rates , potentially consistent with a model in which recombination is associated with the generation of re-arrangements. By contrast, recent simulations suggest that selection to preserve essential genes in chromosomal domains of low gene expression noise (open chromatin) will result in low rates of disruption of gene order in essential gene clusters, independent of any effect of recombination . To ask whether essential gene clusters are conserved owing to their low recombination rates, we considered two subclasses: those gene pairs with very few essential genes in their vicinity (the 'low' group: N = 385) and those with many (the 'high' group: N = 34). Next, within each group we ask about the recombination rate of those pairs conserved in synteny across all lineages and those not so. If recombination is an independent predictor, then in both the high and low groups the recombination rate of those in conserved synteny should be lower. For the 83 pairs in the low group conserved as a pair, the recombination rate is slightly lower than that of the low group as a whole (1.04 versus 1.06), but not significantly so (P = 0.5). In the high class, 10 of the 34 are retained in synteny. These have, if anything, a slightly higher recombination rate than the average for the high group (0.98 versus 0.95), but again, not significantly so (P = 0.47). So, in sum, while clusters of essential genes have low recombination rates, the recombination rate does not in and of itself explain the conservation of synteny. This is not to say that what is reported in wheat is wrong nor, indeed, more generally that recombination does not induce re-arrangements. The most important problem in this analysis is that the recombination rate measurements come from current laboratory yeasts while the synteny conservation data spans events over hundreds of millions of years. Problematically, recombination rates are thought to evolve quite fast. To better resolve this issue it might be better to compare telomeric (high recombination rate) and centromeric (low recombination rates) domains, rather than asking about conservation of gene pairs in isolation.
Predictors of linkage conservation and reciprocal gene loss
How would the conservation of synteny of a given pair of neighboring genes be influenced by the processes associated with the WGD event? We focus our attention on two possible effects. First, linkage conservation might be influenced by the fate of the pre-WGD adjacent pairs after the WGD event. We could compare two opposite situations. Either both adjacent pairs have lost the same (orthologue) copy of the corresponding gene, or both remained duplicated in all three post-WGD species. These are actually the most common fates of ancestral loci in yeasts . According to this, one could imagine, for instance, that since a duplicate gene might contribute to perform part of the function originally associated with a single gene (sub-functionalization model), adjacent genes with duplicates could experience less pressure to remain linked, as part of the function is implemented by the duplicate. We would predict then that adjacent pairs resolved as single copy in all post-WGD species would more often be found as adjacent in pre-WGD species. This is indeed what we obtain. Single copy adjacent genes were more likely conserved as adjacent in all pre-WGD species: K. waltii (χ2 = 5.83, P < 0.02, d.f. = 1); K. lactis (χ2 = 5.77, P < 0.02, d.f. = 1); and A. gossypii (χ2 = 5.41, P = 0.02, d.f. = 1).
Alternatively, as deletion of one duplicate is the most common process after the WGD, linkage conservation could be influenced by how this deletion is resolved in the different post-WGD lineages. Divergent classes are those in which some of the genes lost are paralogues in the three post-WGD species, while convergent classes imply that all lost genes are orthologues. This latter class implies a less random choice of gene loss. We find that adjacent pairs both belonging to the convergent class are more conserved than expected in four out of five species: C. glabrata, χ2 = 4.18, P = 0.04; K. waltii, χ2 = 6.64, P = 0.01; K. lactis, χ2 = 9.56, P < 0.01; A. gossypii, χ2 = 6.81, P < 0.01; d.f. = 1 in all cases.
In asking about what factors determine gene order conservation, despite the dependence of the answer on the question, one regularity appears. This is the finding that gene pairs currently with a short intergene spacer are less likely to have been re-arranged. This fits with data from microsporidians in which gene overlap is common and gene order rearrangements are rare . The null model, assuming nothing more than an intolerance to inversions that cut within genes, provides a strikingly good fit for such a simple model. The model was made deliberately simple by not assuming that gene orientation would make a difference and takes no account of the density of functional sites between genes. As noted above, these are unrealistic assumptions. This indeed may explain in part the conservation of gene pairs that are co-expressed, as inversions could, for example, break bidirectional promoters between genes in divergent orientation.
Beyond the role of the intergene spacer, further answers are dependent on just how one asks the question. We can, for example, ask whether gene pairs in a given class tend to be more conserved than gene pairs not in the specified class. For example, gene pairs that specify proteins close in either the metabolic or protein interaction network do tend to be more commonly conserved as neighbors than gene pairs that also specify proteins that feature in the relevant network but are not close in the network. By contrast, if we ask whether network proximity is generally an important predictor of synteny conservation, the answer is no, largely because most proteins do not explicitly feature within the network. Second, when asking about predictors of linkage conservation, the answer depends on which species one is comparing. Close comparators highlight co-regulation, whilst more distant comparators suggest co-expression and maybe the recombination rate (as measured in S. cerevisiae) as important predictors. Analysis of the properties of the gene pairs preserved as a pair in all species points to the density of flanking essential genes as an important predictor, suggesting that essential gene clusters tend to be frozen, as previously noted [15,18].
That the results are dependent on the species under comparison perhaps reflects a difference in the strength of selection to preserve a class of gene pairs and the commonality of such pairs. Consider, for example, the possibility that the top 2% of co-expressed gene pairs are under very strong selection to remain linked. Would this be transparent in comparisons between closely related species? The answer is probably not. In our close comparators, approximately 90% of gene pairs remain as immediate neighbors. If just the 2% most highly co-expressed genes resist re-arrangement, there may not even have been a single re-arrangement that might have occurred between linked highly co-expressed genes that was rejected by selection. Hence there would be no signal of co-expression as an important factor in linkage conservation. As the distance between comparators increases, however, the resilience of the 2% will start to appear as an ever stronger signal, assuming the co-expression to be both ancestral and under selection (in different ecologies different co-expression profiles might be under selection). In sum, strong but relatively rare selection will be discernable only in distant comparators. Put differently, the more distant comparisons and the analysis of those pairs always conserved hones in on the special subclass of genes for which selection acts to preserve the gene order.
Perhaps then relatively little is to be learnt from relatively close comparators as so few re-arrangements will have been sampled. In this context, however, there exists one apparent oddity. In the close species comparisons intergene distance and co-regulation appear as important predictors. However, against expectations, gene pairs with a high level of co-regulation, that is, that share much of the same transcription factor-based regulation, are more, not less, likely to be broken up. When analyzed in detail, however, we find that this strong signal is associated with a low density of regulatory motifs: very high co-regulation scores are disproportionately associated with gene pairs with only one (the same) transcriptional motif, hence a low motif density. It is this low motif density that most likely contributes to the lack of conservation of the gene pairs in the short term.
Even if we assume that longer distance phylogenetic comparisons are best, the yeast analysis suggests that phylogenetic distance alone is not the sole arbiter. Rather than the comparator distance, the duplication event experienced in the lineage seems also to be influencing the fate of adjacent pairs. The potential relaxation of the functional constraints associated with the pair members, because of either being duplicated or being divergently conserved, is reflected in a smaller tendency to remain as immediate neighbors.
The results presented here no doubt do not reflect the full complexity of gene order evolution. For example, while we expect that the absolute rate of gene order evolution should scale monotonically with the amount of intergene spacer, this model fails to make any sense of the much higher re-arrangement rates seen in rodents than in primates , although the low rate seen in chicken is consistent, the chicken genome being relatively compact. We can also ask whether the other forces we have identified might have any general applicability? Prior reports have found that clusters of housekeeping genes in mammalian genomes tend to have preserved synteny  and that essential gene clusters in mice are also conserved . In these instances it will be informative to ask about the relationship between the two parameters (there is a broad overlap between essential genes and housekeeping genes in mammals) and how intergene distance and recombination rate might interrelate. More generally, when more whole genome dispensability data are available it will be interesting to see if the preservation of essential clusters is a common phenomenon and, in turn, ask about the underlying rationale.
Materials and methods
Comparator species and genome data
We used data from the Yeast Gene Order Browser . This collection includes seven hemiascomycetes species, four of them having diverged before the whole genome duplication event occurred in the lineage. We considered only six for our study, three pre-WGD, that is, A. gossypii, K. lactis and K. waltii, and three post-WGD, that is, C. glabrata, S. castelli and S. cerevisiae. To compute unambiguosly whether a given syntenic pair in S. cerevisiae is conserved as syntenic in a comparator species we analyzed only pairs from a subset of genes termed ancestral loci. Each member of this set correspond to a locus in a pre-WGD species, or the corresponding duplicated pair of loci in the post-WGD species and has been defined by homology and genome context information .
We examined the metabolic relationship of a gene pair by means of a metabolic network of S. cerevisiae recently reconstructed using genomic, biochemical and physiological information . More specifically, we considered this network as a graph whose nodes and edges are the metabolic genes and the metabolic reactions, respectively , and quantified the metabolic relationship of a pair by its shortest distance in the network (graph has 851 nodes, and 294 of them are ancestral loci). We computed how many syntenic pairs belonging to this network (up to three intervening genes between them in S. cerevisiae) are conserved as syntenic (non-syntenic) in C. glabrata. We found 29 genes conserved in C. glabrata, 4 of which are non-syntenic. The mean graph shortest distance of those conserved (not conserved) as syntenic is = 3.58 ( = 5). This hints at metabolic network distance as a plausible predictor of linkage conservation, that is, the closer in the graph the more likely to be conserved as a linked pair corroborating previous studies [9,20]. For the extended study with all syntenic genes included, we assigned a null distance value to those adjacent pairs without metabolic information. We set this characteristic value to the network mean value ( = 3.83).
Co-expression and intergenic distance
To quantify gene co-expression, 40 different sets of genome-wide transcription time series from ExpressDB were used as compiled in . In our analyses co-expression for a given gene pair denotes then the mean of the 40 correlation coefficients of mRNA expression, corresponding to 40 different experiments, for a given gene pair. All sequence information was obtained from the Saccharomyces Genome Database .
Density of lethals
We used a list of essential genes included in the Saccharomyces Genome Database , which contains information on a large-scale knockout study . We introduced a score quantifying the number of essential genes around a syntenic pair. For each pair, the density of lethals reads as the mean number of essential genes located at -3,-2,-1,pair1,pair2,1,2,3 gene coordinates, that is, up to 4 genes in either the 5' to 3' or the 3' to 5' direction around each member of the pair.
We considered the recombination data set obtained in , an estimate of recombination rate, by using double strand break analysis. To each gene we assigned a recombination rate. For a syntenic pair we took the mean of each gene rate.
We used a dataset of regulatory motifs determined in . A P value threshold of 0.001 was considered to select the transcriptional factor binding sites. The co-regulation of a pair is denoted by the number of common regulatory motifs between two genes :
Here |...| denotes the size of the set, ∩ the intersection and mi the number of regulatory motifs for gene i in the pair.
We used the high confidence data set of multi-validated protein interactions in S. cerevisiae  and quantified the protein-protein relationship of a pair by its shortest distance in the network as before. Only 168 adjacent genes are included in this graph (with 1088 nodes). We found 155 genes conserved in S. castelli, 18 of which remained adjacent. The mean graph shortest distance of those conserved (not conserved) as adjacent is = 5.12 ( = 5.28). This hints at protein-protein network distance as a plausible predictor of linkage conservation, that is, the closer in the graph the more likely to be conserved as a linked pair. For the extended study with all adjacent genes included, we assigned a characteristic distance value to those adjacent pairs without protein-protein interaction information. We set this characteristic value to the network mean value ( = 5.35).
Logistic regression is a class of multivariate analyses usually applied to describe the dependencies of binary responses with respect to a set of variables . While multiple regression usually quantifies the influence of several factors on continuous dependent variables, logistic regression extends these techniques to the study of qualitative features. In our case, we interpreted an adjacent pair in S. cerevisiae retained or not as syntenic in a given comparator as such a binary, or discrete, response to model. To estimate the relevance of each linkage predictor in a robust way, given that genomic data are known to be noisy and analysis of them might lead to very significant but totally misleading results , we considered several complementary strategies: simple (logistic) regression of each of the variables; forward stepwise regression according to the Akaike criterion (a measure of goodness of fit) , in this case, predictors are included in the model only if they increase the goodness of fit; multiple regression - here all proposed variables are considered; and principal component multiple regression (Additional data file 1). We scaled each of the covariates to zero mean and unit variance (standardized data) before carrying out the studies unless indicated. Correlation between some of the determinants could be relevant, for example, essential clusters having low recombination rates. To quantify the correlation between independent variables, one can compute the correlation matrix of the parameter estimates in the logistic equation. Large values of these correlations indicate that multicollinearity may be complicating the modeling process. Alternatively, if correlations between the estimates are fairly small, it is expected that removing one variable from the model does not change the coefficients and P values for other variables much.
cex, gene co-expression; cre, gene co-regulation; igd, intergenic distance; let, density of lethals; met, metabolic relationship; pro, distance in the protein-protein interaction network; rec, recombination rate; WGD, whole genome duplication.
JFP and LDH conceived and designed the study, analyzed the data, and wrote the paper. LDH implemented the null models, JFP implemented the statistical models.
Additional data files
The following additional data are available with the online version of this paper. Additional data file 1 is an extended discussion of the regression and null models corresponding to several comparator species.
Additional data file 1. Includes the extended discussion, a complementary figure related to the determinants of close non-adjacently conserved pairs versus distant adjacently conserved pairs, and a table summarizing the results of the principal component regression analysis.
Format: PDF Size: 5.7MB Download file
This file can be viewed with: Adobe Acrobat Reader
We thank Oscar Rueda, Andrés Cañada and Kevin Byrne for technical assistance and valuable discussions. This research has been partially supported by the Spanish Ministerio de Educación y Ciencia Grant FIS2006-10368 (JFP) and the UK Biotechnology and Biological Sciences Research Council (LDH).
Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, van Sluis P, Hermus MC, van Asperen R, Boon K, Voûte PA, et al.: The human transcriptome map: clustering of highly expressed genes in chromosomal domains.
Versteeg R, van Schaik BD, van Batenburg MF, Roos M, Monajemi R, Caron H, Bussemaker HJ, van Kampen AH: The human transcriptome map reveals extremes in gene density, intron length, GC content, and repeat pattern for domains of highly and weakly expressed genes.
McLysaght A, Seoighe C, Wolfe K: High frequency of inversions during eukaryote gene order evolution. In Comparative Genomics. Edited by Sankoff D, Nadeau JH. Dordrecht: Kluwer Academic Publishers; 2000:47-58.
Mol Syst Biol 2005, 1:2005.0016.
2005.0016PubMed Abstract | Publisher Full Text | PubMed Central Full Text
Akhunov ED, Akhunova AR, Linkiewicz AM, Dubcovsky J, Hummel D, Lazo G, Chao S, Anderson OD, David J, Qi L, et al.: Synteny perturbations between wheat homoeologous chromosomes caused by locus duplications and deletions correlate with recombination rates.
Ball CA, Jin H, Sherlock G, Weng S, Matese JC, Andrada R, Binkley G, Dolinski K, Dwight SS, Harris MA, et al.: Saccharomyces Genome Database provides tools to survey gene expression and functional analysis data.
Winzeler EA, Shoemaker DD, Astromoff A, Liang H, Anderson K, Andre B, Bangham R, Benito R, Boeke JD, Bussey H, et al.: Functional characterization of the S. cerevisiae genome by gene deletion and parallel analysis.