Clusters of co-expressed genes are known in prokaryotes (operons) and were recently described in several eukaryotic organisms, including human. According to some studies, these clusters consist of housekeeping genes, whereas other studies suggest that the clustered genes exhibit similar tissue specificity. Here we further explore the relationship between co-expression and chromosomal co-localization in the human genome by analyzing the expression status of the genes along the best-annotated chromosomes: 20, 21 and 22.
Gene expression levels were estimated from publicly available ESTs, and differential gene expression was assessed using a previously described and validated statistical test. Gene sequences for chromosomes 20, 21 and 22 were taken from the Ensembl annotation.
We identified clusters of genes specifically expressed in similar tissues along chromosomes 20, 21 and 22. These co-expression clusters occurred more frequently than expected by chance and may thus be biologically significant.
The co-expression of co-localized genes might be due to higher-order chromatin structures influencing the availability of genes for transcription in a given tissue or cell type.
Since the publication of two "complete" first drafts of the human genome [1, 2], a huge and continuing effort has been made to annotate the human genome. Whereas some regions remain poorly annotated, the exact positions of most protein-coding genes are now defined. This allows a systematic analysis of the influence of gene position on properties such as expression level and tissue distribution. The positional clustering of co-expressed genes is common in prokaryotes (operons) and was recently described in Saccharomyces cerevisiae, in Caenorhabditis elegans [4, 5] and in Drosophila melanogaster [6, 7]. It is often supposed that genes are randomly distributed throughout the human genome, except for tandem duplicates. However, clusters of highly expressed genes were recently revealed in the human genome [8, 9]. To date, no clear functional relationships between genes in these clusters have been identified, and their biological meaning, if any, is yet to be determined.
Two studies were carried out on the expression level of sets of co-localized human genes. Caron et al. analyzed the gene expression profiles for all chromosomal regions in various tissue types (Human Transcriptome Map). The genes studied corresponded to about 24,000 UniGene clusters, and expression levels were estimated from 12 SAGE libraries made in different conditions. This study revealed about 50 large regions, called RIDGEs (Regions of IncreaseD Gene Expression), showing a clustering of highly expressed genes. A similar study by Lercher et al. (based on 11,000 UniGene clusters and 14 SAGE libraries) suggested that such RIDGEs might mostly consist of housekeeping genes; no clusters of genes with similar tissue expression profiles were identified.
In order to specifically analyze tissue-specific expression, other studies were based on sets of genes expressed in a given tissue. Gabrielsson et al. performed a microarray analysis of genes expressed in adipose tissue. Mapping these genes back onto the human genome revealed clusters of adipose tissue-specific genes on chromosomes 11, 19 and 22. Using ESTs, Dempsey et al. focused on genes from chromosomes 21 and 22 expressed in the cardiovascular system (CVS). They showed some chromosomal clustering of these genes. Bortoluzzi et al. performed a similar study on genes expressed in skeletal muscle. They identified positional clusters of skeletal muscle genes on chromosomes 17, 19 and X. Finally, an EST analysis of the murine placenta by Ko et al. identified clusters of placenta-specific genes on chromosomes 2, 7, 9 and 17. Overall, these studies suggest that clusters of tissue-specific genes do exist, and might be more frequent than initially thought.
Previous studies were based on the whole set of genes expressed in a particular tissue, irrespective of the behavior of these genes in other tissues. In order to evaluate the clustering of genes specifically expressed in any tissue (not specified in advance), we performed a comprehensive analysis of the expression profiles of all genes identified along human chromosomes 20, 21 and 22. These chromosomes were chosen as the most complete and best-annotated human chromosomes available. For each gene, we first estimated the expression level in various tissues from the public EST database and then computed the probability of differential expression in each tissue. We then compared these probabilities with those calculated for the neighboring genes and looked for successions of genes specifically over-expressed (SOGs) in a given tissue. This procedure revealed more such clusters than expected by chance.
Relationship between Gene Expression and Tissue type
The following analyses were based on the number of specifically expressed genes (SEGs) in each tissue category on chromosomes 20, 21 and 22 (using a probability of differential expression p > 0.90). Tissue categories were pooled in three groups according to their origin: a diseased group (DIZ), a healthy fetal and infant group (INF), and a healthy adult group (ADLT).
Chromosome analysis. On each chromosome, 80% of the genes were found to be differentially expressed in at least one tissue category. The same proportion was found by Su et al. in an analysis of the human transcriptome map. The remaining 20% represents ubiquitously expressed genes (i.e. housekeeping genes) or weakly expressed genes. For weakly expressed genes, which are represented by low EST numbers, neither the expression level nor the differential expression status can be reliably estimated.
Genes with erratic expression levels. The number of tissue types associated with significant differential expression (p > 0.90) was estimated for each gene. We noticed that some genes were statistically identified as "differentially expressed" in more than 50% of the tissue types (Table 2). Our statistical test compares the number of cognate ESTs found for each library type to the number found for all other library types aggregated as one virtual "average" tissue. With this procedure, genes whose expression levels fluctuate far above or below the average (over all the other tissues) may appear significantly over- or under-expressed in numerous libraries. The genes exhibiting this erratic behavior were all highly expressed and corresponded to large numbers of ESTs; they included ribosomal proteins, which are known to be found in all tissues. This strongly suggests that the erratic EST counts (from almost none to much higher than average) have an artifactual origin, e.g. an undocumented "normalization" procedure. Indeed, it is (and was) customary for a number of EST sequencing laboratories not to record (or even not to pick the clones corresponding to) the many instances of the most abundant transcripts (such as ribosomal proteins, elongation factor EF-Tu, and the like). This ad hoc, but inconsistent, subtraction of the most abundant ESTs (even though the libraries are not normalized) is the most probable cause for the corresponding genes to appear either over- or under-expressed in many tissues. We thus removed these genes from our subsequent analyses.
Table 1. Number of ESTs and keywords characterizing the tissue types
Table 2. Genes from chromosomes 20, 21 and 22 expressed in most of the tissues
We searched for correlation islands, defined as clusters of at least three successive SOGs in a common tissue (see Materials and methods). Nine, 5 and 17 clusters of SOGs were found for chromosomes 20, 21 and 22, respectively. To assess the statistical significance of these results, we computed the probability of finding such a number of clusters under a random permutation of the gene order along the chromosomes. This probability was found to be very low (Table 3). We can thus confidently conclude that there are more clusters than expected by chance, and further explore their potential biological meaning.
Table 3. Number of clusters for chromosomes 20, 21 and 22
The functional annotation of these gene clusters is shown in Tables 4a, 4b and 4c. No functional correlation was identified within the clusters, but such a correlation would be hard to establish given the lack of a defined function for many of the genes.
Table 4a. Clusters of successive genes from chromosome 20
Table 4b. Clusters of successive genes from chromosome 21
Table 4c. Clusters of successive genes from chromosome 22
Two clusters (III and IV on chromosome 22) were each composed of three genes, two of which were annotated as having exactly the same function. These genes derive from computational prediction and may correspond to a single gene erroneously interpreted as two different genes. As this particularity concerns only two clusters, we did not consider them further.
The analysis of all genes along chromosomes 20, 21 and 22 identified clusters of co-expressed genes (e.g. the known immunoglobulin cluster) and genes expressed in every tissue (e.g. some ribosomal proteins). The visualization of SEGs in various conditions allowed expression variations to be detected in diseased vs. healthy or infant vs. adult tissues. For instance, we noticed a small cluster of immunoglobulin genes apparently specific to diseased ovary tissues. These immunoglobulins may be involved in an immune response specific to this pathology.
ESTs were grouped according to the tissue type: organ, developmental and pathological states. While comparing gene expression across adult healthy tissues is biologically meaningful, comparing gene expression across pathological states (DIZ group) is more problematic, as it involves treating different pathological conditions as one. For instance, different cancer types, each with its specific expression patterns, may arise in the same organ. In principle, only diseased tissues corresponding to exactly the same disorder should be pooled. When dealing with diseased tissues, our protocol was thus expected to provide a distorted view of their gene expression patterns.
As in all statistical studies, sample size is important. As fewer fetal/infant libraries were available, fewer fetal or infant tissue-specific gene clusters were detected.
In a study of Drosophila gene clusters, Spellman et al. found that no functional relationship could be detected between the genes within a cluster. Our study likewise failed to reveal any relationship between the genes forming co-expressed/co-localized clusters. However, the large proportion of genes with no defined function does not allow any final conclusion to be drawn.
Chromatin is usually described as being divided into "open" domains, where genes have the potential to be expressed, and "closed" domains, where gene expression is shut down. The existence of co-expressed/co-localized gene clusters is consistent with a model in which large chromatin regions change their activity (openness) status in a tissue-specific manner, allowing neighboring genes to be transcribed or shut down in a coordinated way. Such a model, supported by our study, has been around for quite some time, although experimental evidence has been obtained for only a few tissues and cell types [16, 17].
Materials and methods
EST and Libraries
Human ESTs were obtained from dbEST (release Oct.2001). Pooled, subtracted or normalized libraries were removed from the study. The remaining 1270 libraries were classified in three groups: 489 libraries from diseased tissues, whatever their developmental stage (DIZ), 194 libraries from healthy fetal or infant tissues (INF) and 587 libraries from healthy adult tissues (ADLT). The classification was made with the data extracted from the 'keywords' and 'developmental stage' fields of the library description. The same analysis was performed on each of the three groups. ESTs were then masked for vector, common repeats and low-complexity sequences using RepeatMasker (http://repeatmasker.genome.washington.edu/) and Repbase. After these steps, 2,251,840 ESTs remained: 1,147,369 in the diseased libraries group, 478,320 in the infant libraries group and 626,151 in the adult libraries group. In each group, libraries were categorized into 40 organs as described in Table 1. Libraries were individually characterized by keywords extracted from the 'lib', 'keywords', 'tissue description', 'tissue type', 'cell type' and 'organ' fields of the library description. Tissue categories were individually characterized by representative keywords, such as the name of the category or its synonyms. Each library was assigned to a tissue category if at least one of its keywords characterized that category. A library could only belong to a single category. Finally, the classification was visually verified. Categories with fewer than 1,000 ESTs were removed. The numbers of ESTs for every tissue category in the groups DIZ, INF and ADLT are shown in Table 1. The list of the libraries composing each tissue category is given as supplementary data.
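The keyword-based assignment of libraries to tissue categories can be illustrated with a minimal Python sketch. The function names, the data layout and the handling of ambiguous libraries are assumptions made for illustration only: the text states that a library belongs to a single category and that the classification was visually verified, but does not say how libraries matching several categories were resolved. Here such libraries are simply returned as unassigned.

```python
def classify_library(library_keywords, category_keywords):
    """Return the single tissue category matching a library, or None.

    library_keywords: set of lowercase keywords taken from the library
        description fields (hypothetical representation).
    category_keywords: dict mapping a category name to the set of
        lowercase keywords (name, synonyms) that characterize it.
    """
    matches = [
        category
        for category, keywords in category_keywords.items()
        if library_keywords & keywords  # at least one shared keyword
    ]
    # Assumption: libraries matching zero or several categories are left
    # unassigned (None) so that they can be inspected by hand.
    return matches[0] if len(matches) == 1 else None


def keep_large_categories(est_counts, min_ests=1000):
    """Drop tissue categories with fewer than min_ests ESTs."""
    return {cat: n for cat, n in est_counts.items() if n >= min_ests}
```

For example, with the hypothetical categories `{"liver": {"liver", "hepatic"}, "brain": {"brain", "cortex"}}`, a library described by `{"adult", "hepatic"}` is assigned to `"liver"`, whereas one described by `{"liver", "brain"}` is left unassigned.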
Genes on Chromosomes 20, 21 and 22
Gene sequences were downloaded from Ensembl (release Nov.2001). As we analyzed gene expression along the chromosomes, the various transcripts of a single gene were not considered separately. We used both known and novel genes predicted by Ensembl. We found 694, 243 and 595 gene sequences for chromosomes 20, 21 and 22, respectively. As the Ensembl sequence identifiers might change from one release to the next, a correspondence between the Ensembl sequences and the NCBI sequences is given in Tables 4a, 4b and 4c.
Gene Expression Profiles
Every gene was compared to the total EST set of the corresponding group at high stringency (identity > 95% and match length > 66% of the query sequence) with BLAST 2.2.1. The expression profile was derived from the number of cognate ESTs in each tissue category relative to the total number of ESTs in that tissue category of the group. All expression profiles were stored in a matrix with rows corresponding to genes and columns corresponding to tissue categories. The Mij element thus corresponds to the relative frequency of cognate ESTs for gene i in tissue category j.
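As an illustration of how the matrix M could be assembled, the following Python sketch applies the two stringency thresholds to a list of hits and divides the resulting counts by the EST total of each tissue category. The tuple layout of a hit, the function names and the assumption that each tuple represents one distinct EST are hypothetical; the actual parsing of the BLAST output is not described in the text.

```python
def passes_stringency(identity_pct, match_len, query_len):
    """Thresholds from the text: identity > 95%, match > 66% of the query."""
    return identity_pct > 95 and match_len > 0.66 * query_len


def expression_matrix(hits, est_totals, genes, tissues):
    """Build M, where M[i][j] is the relative frequency of cognate ESTs
    for gene i in tissue category j.

    hits: iterable of (gene, tissue, identity_pct, match_len, query_len),
        one tuple per matching EST (hypothetical layout).
    est_totals: dict mapping tissue category -> total number of ESTs.
    """
    row = {g: i for i, g in enumerate(genes)}
    col = {t: j for j, t in enumerate(tissues)}
    counts = [[0] * len(tissues) for _ in genes]
    for gene, tissue, identity_pct, match_len, query_len in hits:
        if passes_stringency(identity_pct, match_len, query_len):
            counts[row[gene]][col[tissue]] += 1
    # Normalize each column by the EST total of its tissue category.
    return [
        [counts[i][j] / est_totals[t] for j, t in enumerate(tissues)]
        for i in range(len(genes))
    ]
```

Dividing by the category total rather than by the gene total keeps the columns comparable between tissue categories that were sequenced to very different depths.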
Differential Gene Expression
To assess its differential expression in a tissue category and for a given group (DIZ, INF or ADLT), every gene was compared to the total EST set of this group at high stringency (previously described matrix). The hit list of cognate matches was then separated into two groups: ESTs from the corresponding tissue category vs. ESTs from any other tissue category. The statistical significance of the difference in frequencies between these two groups was computed according to a previously published formula. The groups of diseased, infant or adult tissues were treated independently.
Correlation islands were defined as clusters of at least three successive SOGs (p > 0.90) in the same tissue category. To assess the significance of these clusters, we estimated the probability of finding such a number of clusters under a randomization of the gene positions along the chromosomes (5,000 randomizations). The probabilities are presented in Table 3.
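The one-tissue-versus-all-others comparison can be sketched as follows in Python. The published formula used in this study is not reproduced in the text, so this sketch substitutes a plain one-sided binomial test as a stand-in: given the total number of cognate ESTs for a gene, it asks how surprising the count observed in the tissue of interest is if ESTs were drawn in proportion to library sizes. The function names and the returned "confidence" score are assumptions for illustration and should not be read as the authors' statistic.

```python
from math import comb


def upper_tail(x, n, p0):
    """P(X >= x) for X ~ Binomial(n, p0). Suitable for modest n only."""
    return sum(
        comb(n, k) * p0 ** k * (1 - p0) ** (n - k) for k in range(x, n + 1)
    )


def over_expression_confidence(x_tissue, x_other, n_tissue, n_other):
    """Stand-in score for over-expression in one tissue category.

    x_tissue, x_other: cognate EST counts in the tissue category and in
        all other categories pooled as one virtual "average" tissue.
    n_tissue, n_other: total EST counts of the same two pools.
    Returns 1 - P(X >= x_tissue) under proportional sampling.
    """
    n = x_tissue + x_other
    if n == 0:
        return 0.0  # no ESTs: expression cannot be assessed
    p0 = n_tissue / (n_tissue + n_other)
    return 1.0 - upper_tail(x_tissue, n, p0)
```

Under this stand-in, a gene with 10 of its 20 cognate ESTs in a category that holds 1% of all ESTs scores well above 0.90, whereas a gene with 1 of its 100 cognate ESTs in that category does not.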
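The randomization procedure can be sketched in Python as follows. The data layout (one set of tissue categories per gene, listing the categories in which that gene is specifically over-expressed) and the function names are assumptions; a cluster is counted here as a maximal run of at least three successive genes flagged in the same tissue category, and the empirical probability is the fraction of shuffled gene orders that yield at least as many clusters as observed.

```python
import random


def count_clusters(flags, min_run=3):
    """Count maximal runs of >= min_run successive genes sharing a tissue.

    flags: list (in chromosomal order) of sets; flags[i] holds the tissue
        categories in which gene i is specifically over-expressed.
    """
    total = 0
    for tissue in set().union(*flags):
        run = 0
        for gene_tissues in flags:
            if tissue in gene_tissues:
                run += 1
            else:
                if run >= min_run:
                    total += 1
                run = 0
        if run >= min_run:  # run reaching the end of the chromosome
            total += 1
    return total


def permutation_probability(flags, n_perm=5000, min_run=3, seed=0):
    """Return (observed cluster count, empirical probability under shuffling)."""
    rng = random.Random(seed)  # fixed seed for reproducibility (assumption)
    observed = count_clusters(flags, min_run)
    shuffled = list(flags)
    at_least = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # randomize gene order along the chromosome
        if count_clusters(shuffled, min_run) >= observed:
            at_least += 1
    return observed, at_least / n_perm
```

Shuffling whole genes, rather than individual tissue flags, preserves each gene's tissue profile and randomizes only its position, which matches the null hypothesis of the text.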
Additional data files
KM was supported by a grant from the Region Provence-Alpes Cote d'Azur and AVENTIS Pharma. We thank Deborah Byrne for reading the manuscript.
Caron H, van Schaik B, van der Mee M, Baas F, Riggins G, van Sluis P, Hermus MC, van Asperen R, Boon K, Voute PA, et al.: The human transcriptome map: clustering of highly expressed genes in chromosomal domains.
Bortoluzzi S, Rampoldi L, Simionati B, Zimbello R, Barbon A, d'Alessi F, Tiso N, Pallavicini A, Toppo S, Cannata N, et al.: A comprehensive, high-resolution genomic transcript map of human skeletal muscle.
Ko MS, Threat TA, Wang X, Horton JH, Cui Y, Pryor E, Paris J, Wells-Smith J, Kitchen JR, Rowe LB, et al.: Genome-wide mapping of unselected transcripts from extraembryonic tissue of 7.5-day mouse embryos reveals enrichment in the t-complex and under-representation on the X chromosome.
Akashi K, He X, Chen J, Iwasaki H, Niu C, Steenhard B, Zhang J, Haug J, Li L: Transcriptional accessibility for multi-tissue and multi-hematopoietic lineage genes is hierarchically controlled during early hematopoiesis.
Nat Genet 1993, 4:332-333.