Efficient text search enables live exploration of genome-scale datasets. For three simple queries performed on a small set of genomic regions, this figure illustrates how EpiExplorer analyses are translated into text search queries, how these queries are run against a text index built from genomic data, how the responses are translated back into genome analysis results, and how the results are visualized in the user's web browser. (a) EpiExplorer's software architecture consists of three tiers: a web-based user interface, a middleware that translates between genomic analyses and text search queries, and a backend that efficiently retrieves matching regions for each query. (b) When a user uploads a genomic region set (here: chromosome, start and end position for ten regions named R1 to R10), the middleware annotates this region set with genome and epigenome data, encodes the results in a semi-structured text format, and launches a CompleteSearch server instance to host the corresponding search index. (c) To identify which regions overlap with a CpG island, a simple query overlap:CGI is sent to the backend, and the backend returns an XML file with the matching regions. (d) To identify regions that overlap with CpG islands as well as with H3K4me3 peaks, an AND search is performed (query: overlap:CGI overlap:H3K4me3), and the backend returns only regions that are annotated with both keywords. (e) To efficiently generate percent overlap diagrams, a prefix query overlap:* is sent to the backend, which identifies all possible completions of the prefix and returns the total number of regions matching each query completion.
Halachev et al. Genome Biology 2012, 13:R96. doi:10.1186/gb-2012-13-10-r96
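
The three query types in panels (c) to (e) can be illustrated with a small in-memory inverted index. This is only a sketch of the general idea, not EpiExplorer's or CompleteSearch's actual implementation or API: the region annotations in `REGION_DOCS`, the extra `overlap:promoter` keyword, and the function names `build_index`, `search`, and `complete` are all assumptions made for illustration. Only the keyword format (`overlap:CGI`, `overlap:H3K4me3`) and the prefix query `overlap:*` come from the caption.

```python
from collections import defaultdict

# Hypothetical annotations for five regions. The per-region annotations
# shown in the original figure are not reproduced here.
REGION_DOCS = {
    "R1": "overlap:CGI overlap:H3K4me3",
    "R2": "overlap:CGI",
    "R3": "overlap:H3K4me3",
    "R4": "overlap:CGI overlap:H3K4me3 overlap:promoter",
    "R5": "",
}


def build_index(docs):
    """Map each keyword to the set of region IDs whose text contains it."""
    index = defaultdict(set)
    for region_id, text in docs.items():
        for word in text.split():
            index[word].add(region_id)
    return index


def search(index, query):
    """AND search: return the regions annotated with every query word."""
    words = query.split()
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result &= index.get(word, set())
    return result


def complete(index, query):
    """Prefix query: count matching regions per completion of the prefix."""
    prefix = query.rstrip("*")
    return {
        word: len(region_ids)
        for word, region_ids in index.items()
        if word.startswith(prefix)
    }


index = build_index(REGION_DOCS)

# (c) single-keyword query
cgi_regions = search(index, "overlap:CGI")

# (d) AND query: both keywords must be present
both = search(index, "overlap:CGI overlap:H3K4me3")

# (e) prefix query: one count per completion of "overlap:"
counts = complete(index, "overlap:*")
```

Dividing each value in `counts` by the total number of regions gives the percentages needed for a percent overlap diagram. A production search backend would answer the prefix query from a sorted or otherwise prefix-aware index rather than scanning every keyword as this sketch does.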