Gene expression data of the P. falciparum 48 h erythrocytic cycle. Each point in the figures corresponds to an expression profile plotted according to the first and second principal components. All points keep the same coordinates across the six figures (that is, only the coloring changes). In the top row, colors indicate the cluster membership of each profile after k-means clustering of the original data (that is, prior to dimensionality reduction) using a Pearson correlation distance for k = 3, 5 and 7. We observe that the clusters are nearly equally sized and that their boundaries are rather arbitrary: they do not follow low-density regions and they change radically with k. Figures in the bottom row show regions of the expression space enriched for three regulatory motifs identified in previous studies [6,7,18]. Enrichment is defined on the original data (that is, prior to dimensionality reduction) by measuring, among the 200 nearest neighbors of each gene (according to their expression profiles), the proportion of genes that contain the motif in their upstream sequence (1 kb). Colored points correspond to profiles where this proportion is at least three standard deviations above the value expected under the hypergeometric distribution (see motif density in Material and methods). Uncolored points are shaded for clarity. We observe that: (i) each motif corresponds to a contiguous region of the expression space, (ii) these regions do not correspond to those defined by any of the above clusterings and (iii) regions defined by different motifs can strongly overlap, highlighting the weaknesses of clustering-based approaches for motif discovery.
Lajoie et al. Genome Biology 2012 13:R109 doi:10.1186/gb-2012-13-11-r109
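The enrichment criterion described in the caption can be sketched in a few lines of Python. This is a minimal illustration rather than the authors' implementation. The function name `motif_enrichment_z`, its parameters and the toy data in the usage note are hypothetical. Three assumptions are made: the distance between profiles is one minus the Pearson correlation, a gene is not counted among its own neighbors, and the hypergeometric population is the full gene set. Profiles are assumed to be non-constant, so that the correlation is defined.

```python
import numpy as np


def motif_enrichment_z(profiles, has_motif, n_neighbors=200):
    """Z-score of the motif count among each gene's nearest neighbors.

    profiles  : (N, T) array, one expression profile per gene (assumed non-constant)
    has_motif : (N,) boolean array, True if the motif occurs upstream of the gene
    Returns an (N,) array of z-scores under a hypergeometric null.
    """
    X = np.asarray(profiles, dtype=float)
    m = np.asarray(has_motif, dtype=bool)
    N = X.shape[0]
    if not 0 < n_neighbors < N:
        raise ValueError("n_neighbors must be between 1 and N - 1")

    # Pearson correlation distance (assumed here to be 1 - r).
    Xc = X - X.mean(axis=1, keepdims=True)
    Xc /= np.linalg.norm(Xc, axis=1, keepdims=True)
    dist = 1.0 - Xc @ Xc.T
    np.fill_diagonal(dist, np.inf)  # assumption: a gene is not its own neighbor

    # Indices of the n_neighbors closest profiles for each gene (unordered).
    nn = np.argpartition(dist, n_neighbors - 1, axis=1)[:, :n_neighbors]
    counts = m[nn].sum(axis=1)

    # Hypergeometric mean and variance: n draws without replacement
    # from N genes, of which a fraction p carry the motif.
    p = m.sum() / N
    mean = n_neighbors * p
    var = n_neighbors * p * (1.0 - p) * (N - n_neighbors) / (N - 1)
    if var == 0.0:
        return np.zeros(N)  # motif present in all genes or in none
    return (counts - mean) / np.sqrt(var)
```

A profile would then be colored when its score reaches the threshold, for example `enriched = motif_enrichment_z(profiles, has_motif) >= 3`. The proportion used in the caption and the count used here differ only by the constant factor `n_neighbors`, so the z-score is the same.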