Figure 1.

PAR-CLIP data analysis pipeline. The PARma workflow starts with the raw data from PAR-CLIP experiments (replicates or different conditions), that is, several fastq files containing sequencing reads. First, we utilize Bowtie [46] to align these reads to multiple reference sequences such as the human genome and transcriptome or viral genomes, which results in several sam files, one for each combination of fastq file and reference sequence. Second, for each read from each experiment, we identify all optimal alignments in terms of mismatches, considering T to C conversions as matches, and map transcriptomic reads that span splice junctions to the genome. Third, possible target sites of miRNAs are identified by clustering reads from all datasets simultaneously. The clusters, including additional annotations such as the number of conversions and cleavages per position, are written to separate files for each experiment. The cluster detection module implements a splitting procedure to identify target sites with overlapping reads and is able to handle target sites that span splice junctions. Fourth, for each dataset, the core PARma component iteratively estimates a generative model for the data and k-mer activity probabilities using kmerExplain (see also Figure 3). Fifth, the models and the activity probabilities are used to score clusters and to assign the most probable miRNA to each cluster. Target sites with various annotations such as gene IDs are written to tabular files that can be further analyzed and visualized.
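The second step, selecting optimal alignments while treating T to C conversions as matches, can be illustrated with a minimal Python sketch. This is not the PARma implementation. The function names, the candidate format (a location label paired with a reference segment), and the restriction to ungapped, same-strand comparisons are all assumptions made for illustration.

```python
from typing import List, Tuple


def mismatches_with_tc(reference: str, read: str) -> int:
    """Count mismatches between a reference segment and a read of equal length.

    A reference T read as C is counted as a match (PAR-CLIP conversion).
    Assumption: ungapped comparison on the same strand only.
    """
    if len(reference) != len(read):
        raise ValueError("reference segment and read must have equal length")
    count = 0
    for ref_base, read_base in zip(reference.upper(), read.upper()):
        if ref_base == read_base:
            continue  # exact match
        if ref_base == "T" and read_base == "C":
            continue  # T to C conversion, treated as a match
        count += 1
    return count


def optimal_alignments(read: str, candidates: List[Tuple[str, str]]) -> List[str]:
    """Return the locations of all candidates with the minimal mismatch count.

    `candidates` is a hypothetical list of (location, reference_segment) pairs,
    e.g. gathered from the alignments of one read to several references.
    """
    if not candidates:
        return []
    scored = [(mismatches_with_tc(segment, read), location)
              for location, segment in candidates]
    best = min(score for score, _ in scored)
    return [location for score, location in scored if score == best]


if __name__ == "__main__":
    hits = [("chr1:100", "ATCG"), ("chr2:50", "ACCG"), ("chr3:7", "AGGG")]
    print(optimal_alignments("ACCG", hits))
```

In this toy example, the first candidate differs from the read only by a T to C conversion, so it ties with the exact match and both locations are kept, while the third candidate has two genuine mismatches and is discarded.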

Erhard et al. Genome Biology 2013 14:R79   doi:10.1186/gb-2013-14-7-r79