Skip to main content

CapR: revealing structural specificities of RNA-binding protein target recognition using CLIP-seq data

Abstract

RNA-binding proteins (RBPs) bind to their target RNA molecules by recognizing specific RNA sequences and structural contexts. The development of CLIP-seq and related protocols has made it possible to exhaustively identify RNA fragments that bind to RBPs. However, no efficient bioinformatics method exists to reveal the structural specificities of RBP–RNA interactions using these data. We present CapR, an efficient algorithm that calculates the probability that each RNA base position is located within each secondary structural context. Using CapR, we demonstrate that several RBPs bind to their target RNA molecules under specific structural contexts. CapR is available at https://sites.google.com/site/fukunagatsu/software/capr.

Background

RNA-binding proteins (RBPs) play integral roles in various post-transcriptional regulatory processes, including the splicing, processing, localization, degradation and translation of RNA molecules [1]. RBPs typically contain a limited set of RNA-binding domains, such as the RNA recognition motif and K homology domain, and they must bind to specific RNA molecules to function. The human genome contains more than 400 annotated RBPs [2]. Although most of these RBPs are still poorly characterized, it is known that the dysfunction of certain RBPs causes severe diseases, such as neurodegenerative disorders, heart failure and cancers [3, 4]. RBP–RNA interactions and their specificities are important for understanding the complex gene regulatory networks and the mechanisms of human diseases.

Recent advances in ‘ribonomic’ technologies, such as cross-linking immunoprecipitation high-throughput sequencing (CLIP-seq, also referred to as HITS-CLIP) [5], individual-nucleotide resolution CLIP (iCLIP) [6], and photoactivatable-ribonucleoside-enhanced CLIP (PAR-CLIP) [7], have enabled the study of RBP–RNA interactions, both on a genomic scale and at high resolution. The use of microarrays in the classical RNA-binding protein immunoprecipitation microarray (RIP-Chip) method [8] prevented the precise identification of binding sites. In contrast, CLIP-seq methods bond an RBP and RNAs covalently by ultraviolet cross-linking, collect them by immunoprecipitation and directly sequence the RBP-bound sites of the RNAs. Using these technologies, researchers can identify sequential RNA motifs that are over-represented around the binding sites of each RBP using bioinformatics methods similar to those used for analyzing transcription-factor binding DNA motifs [9]. Such sequential motifs are often very short (up to ten bases), and there are many unbound sites that have the same motif. Thus, sequential motifs alone cannot explain the specificity of RBP–RNA interactions.

RBPs bind to their target RNA molecules by recognizing specific RNA sequences and their structures. Several studies have addressed this issue by calculating the accessibility of RNA regions around the RBP-binding sites [10]. Here, the accessibility of an RNA region is defined by the probability that the region exhibits a single-stranded conformation. Theoretically, the accessibility can be efficiently and exactly calculated using an energy model of RNA secondary structures [11, 12]. Double-helical RNAs usually form the A-form helical structure, whose major grooves are too narrow to be accessed by RBPs [13], and Li et al. showed that the accessibilities tend to be high around the RBP-bound motif sites by analyzing RIP-Chip data [10]. However, it is not sufficient to consider accessibility alone in analyzing the structure-specific target recognition by RBPs. For example, Vts1p, which is a yeast RBP regulating mRNA stability, binds to its target CNGG sequential motif when it is located within hairpin loops but not when it is located in single-stranded regions or other structures [14, 15]. The human FET family of proteins, whose mutations are associated with amyotrophic lateral sclerosis, bind to its target sequential UAN n Y motif within hairpin loops [16]. Computational methods for calculating the secondary structural contexts of RNA molecules, such as bulge loops, hairpin loops and stems, are required to uncover the characteristics of the RNA structures that are recognized by the RBPs in vivo.

In the present study, we developed an efficient algorithm that calculates the probabilities that each RNA base position is located within each secondary structural context. Six contexts of RNA secondary structures were taken into account, according to the well-established Turner energy model of RNAs [17]. These structures included stems (S), hairpin loops (H), bulge loops (B), internal loops (I), multibranch loops (M) and exterior loops (E) (see Figure 1). We defined a structural profile of an RNA base as a set of six probabilities that the base belongs to each context. At present, Sfold [18] is the only software that can calculate a structural profile. Sfold cannot be readily applied to tens of thousands RNA fragments because it uses a statistical sampling method that requires huge sample sizes and computational costs, particularly when analyzing long RNAs or mRNAs. We implemented our efficient algorithm as software named ‘CapR’, which can compute the structural profiles for tens of thousands of long RNAs within a reasonable time by enumerating all the possible secondary structures of the RNAs.

Figure 1
figure 1

Visual representation of the six structural contexts. The six structural contexts are represented by six colors: stems (red), exterior loops (light green), hairpin loops (purple), bulge loops (pink), internal loops (blue) and multibranch loops (green). The unstructured context is the union of the exterior and multibranch loops. These colors are used throughout the paper.

Results

Methods overview

We have developed a new algorithm that calculates the structural profiles of any RNA sequence based on the Turner energy model with time complexity O(N W2) [17]. Here, N is the input sequence length and W is the maximal span, which is a given parameter of the maximal length between the bases that form base pairs. The parameter W was introduced because considering very long interactions does not improve the accuracy of the secondary structure predictions but does increase the computational costs [19].

Let x be an RNA sequence of length N and σ be a possible secondary structure on x without pseudoknots. We refer to a base in x as stem if it forms a base pair with another base, and represent it using the character S. Single-stranded bases are categorized into five structural contexts, namely, bulge loop (represented by B), exterior loop (E), hairpin loop (H), internal loop (I) and multibranch loop (M), which are defined as follows. In a secondary structure representation, RNA bases are vertices of polygons whose edges are the RNA backbone or hydrogen bonds, which are shown as solid or dotted lines, respectively, in Figure 1. The exterior loop context is given to single-stranded bases if they do not form polygons. The hairpin loop context is given to single-stranded bases if they form a polygon that has a single hydrogen bond. The bulge and internal loop contexts are given to single-stranded bases if they form a polygon that has two hydrogen bonds, which are connected by a single backbone edge for bulge loops and which are not connected by a single backbone edge for internal loops. Finally, the multibranch loop context is given to single-stranded bases if they form a polygon that has more than two hydrogen bonds. Note that for a given secondary structure σ, any base of x is unambiguously classified as one of the six structural contexts. Additionally, we define unstructured (U) to represent collectively the exterior and multibranch loop contexts.

We assume that the probability distribution of the secondary structures follows the Boltzmann distribution with respect to the Turner energy model [17]. The probability p(i,δ) that a base at position i has the structural context δ∈{B,E,H,I,M,S} is given by

p ( i , δ ) = 1 Z ( x ) ∑ σ ∈ Ω ( i , δ ) exp − ΔG ( σ , x ) / RT Z ( x ) = ∑ σ ∈ Ω 0 exp − ΔG ( σ , x ) / RT

where Δ G(σ,x) is the difference of the Gibbs energies of the given structure σ and the structure σ0 that contains no base pairs, R is the gas constant and T is the temperature (we used T=310.15 K in this study). Ω0 is the set of all the possible secondary structures of x, and Ω(i,δ) is the set of all the possible secondary structures in which the base at position i is in the structural context δ. Then, the structural profile of i is defined as the probabilities of the structural contexts {p(i,δ)|δ∈{B,E,H,I,M,S}}. Note that the structural profile satisfies the probability condition ∑ δ p(i,δ)=1.

Our algorithm efficiently calculates structural profiles by referring to the Rfold model, which is a variant of the stochastic context-free grammar (SCFG) that calculates all the RNA secondary structures without redundancy [20]. In formal language theory, the RNA secondary structures without pseudoknots are modeled by SCFG [21]. While the state transition rules of the Rfold model contain seven non-terminal symbols, our algorithm associated them with the six structural contexts. The details of the algorithm, which is a variant of the inside-outside algorithm of SCFG, are given in the Materials and methods section.

Influence of the maximal span and the GC content on the structural profile calculations

Before we investigated the structure-specific target recognition by RBPs, we evaluated the performance of CapR. Because we introduced the maximal span W, we needed to investigate an appropriate range for this parameter. Because GC content is known to affect the RNA secondary structures, its effect was also analyzed.

To investigate the dependence on the maximal span W, we applied CapR to 1,000 random RNA sequences of 2,000 nucleotides with a fixed GC content (GC = 0.5). Figure 2A shows how the proportions of the calculated structural profiles depend on W. As expected, if W is small, the predictions are dominated by exterior loops because few bases form base pairs under this condition. Whereas the probabilities for bulge loops, hairpin loops, internal loops and stems are relatively stable for W≥100, the exterior loop probabilities monotonically decrease and the multibranch loop probabilities monotonically increase with increasing W. This is because at large W new base pairs form in exterior loops and exterior loops turn into multibranch loops. On the other hand, the probabilities of the unstructured context, which collectively represents the exterior and multibranch loop contexts, are insensitive to W (Additional file 1: Figure S1). Therefore, the unstructured context can be adopted instead of the exterior and multibranch loop contexts to avoid the influence of the parameter W, if a discrimination of the two contexts is not critical.

Figure 2
figure 2

Dependence of the structural profiles on the maximal span W and GC content. (A) Dependence of the structural profiles on the maximal span W. The x-axis represents the maximal span W. The y-axis represents the averaged p(i,δ) over all the nucleotides. (B) Dependence of the structural profiles on the GC content. The x-axis represents the GC content. The y-axis represents the averaged p δ (i) over all the nucleotides. The unstructured context is represented by light blue. B, bulge loop; E, exterior loop; H, hairpin loop; I, internal loop; M, multibranch loop; S, stem; U, unstructured.

Although Kiryu et al. revealed the dependence of the accessibilities on the GC content [12], the dependence of structural profiles on the GC content has not been investigated. We investigated the dependence on the GC content by applying CapR to 1,000 random RNA sequences of 2,000 nucleotides with a fixed maximal span (W = 100). Figure 2B shows how the proportions of the computed structural profiles depend on the GC content. The stem probability is high and the unstructured probability is low with a high GC content, probably because the energy of the G-C pairs is larger than that of the A-U pairs and palindromic sequences are more likely to occur in the high-GC background. This result suggests that users should carefully interpret the results when analyzing RNAs with biased GC content.

Performance of CapR

We evaluated the speed of CapR by comparing its computational run-time with that of Sfold. The input sequences were generated randomly with equal probabilities of A, C, G and U. For Sfold, the number of sampled structures was set to its default value (1,000). The computation was performed on an AMD Opteron 6276 2.3 GHz with 1 GB memory. Figure 3A shows the computational run-times, which depended on the maximal span W and sequence lengths. In all cases, CapR was much faster than Sfold. Sfold could not run for N≥4,000 while CapR did for N=10,000. These results show that CapR can compute structural profiles for long RNAs within a reasonable time.

Figure 3
figure 3

Performance of CapR. (A) Computational run-times for different values of maximal span W and sequence length N. The x-axis represents the sequence length N. The y-axis represents the computational run-time. (B) The receiver operating characteristic curve for each loop context. The x-axis represents 1-specificity and the y-axis represents the sensitivity. The specificity and sensitivity are defined as true positive/(true positive + false negative) and true negative/(true negative + false positive), respectively. (C) The structural profiles of tRNAs. The x-axis represents the nucleotide positions from 5′ to 3′. The y-axis represents averaged probabilities that each base belongs to each structural context across all tRNA genes in the Rfam dataset [22]. The black boxes represent the nucleotides annotated as stem in Rfam. (D) tRNA cloverleaf structure annotated in Rfam. B, bulge loop; E, exterior loop; H, hairpin loop; I, internal loop; M, multibranch loop; S, stem.

Next, we evaluated the accuracy of the structural profiles computed by CapR using 8,775 RNA genes that have experimentally validated secondary structure annotations in the Rfam database [22]. We set W=800 to allow for stem-forming of the base pairs with the longest distance observed in the Rfam dataset. To estimate the accuracy of the structural profiles, we calculated the area under the receiver operating characteristic curve (AUROC) for each structural context. Briefly, the AUROC is high if the probability p(i,δ) for the structural context δ annotated in Rfam is high.

Table 1 and Figure 3B show the AUROC values and the receiver operating characteristic curves, respectively. The AUROC value for each structural context was larger than 0.75, indicating that the computed structural profiles were very consistent with the Rfam annotation. For example, the structural profile of transfer RNAs (tRNAs), whose secondary structures are well characterized, is shown in Figure 3C. Each line represents averaged probabilities that each base belongs to each structural context across all tRNA genes in the Rfam dataset. Probabilities of the stem, hairpin loop, multibranch loop and exterior loop contexts were high at the corresponding parts of the tRNA cloverleaf structure (Figure 3D). Calculated structural profiles are interpreted by considering that stem probabilities tend to be overestimated by the Turner energy model. In the tRNA example, the calculated stem probabilities were slightly higher than the multibranch loop probabilities at positions 25, 43 and 44, which are annotated as multibranch loops in Rfam.

Table 1 AUC score of each structural context

Finally, the same analysis was conducted using Sfold, and the accuracies of the structural profiles predicted by CapR and Sfold were compared. The accuracies of CapR were comparable to those of Sfold (Table 1).

Datasets and methods used in the CLIP-seq data analysis

Because it was shown that CapR is accurate in calculating structural profiles of RNA molecules, we applied it to several CLIP-seq datasets to reveal the structural specificities of RBP–RNA interactions. For the subsequent analyses, we downloaded CLIP-seq data of RBP-bound RNAs from the doRina database [23], and selected ten RBPs: GLD-1 (nematode), QKI (human), Pum2 (human), SRSF1 (human), Nova (mouse), Lin28A (mouse), FXR1 (human), FXR2 (human), FMR1_7 (human) and FMR1_1 (human) [7, 24–28] (refer to Materials and methods for the criteria for the data selection). FMR1_7 and FMR1_1 are two splicing isoforms of FMR1. RBPs with two known sequential motifs (FXR1, FXR2, FMR1_7 and FMR1_1) were analyzed separately for each of the motifs. Hereafter, these cases are represented by the protein names with their sequential motifs: FXR1(ACUK), FXR1(WGGA), FXR2(ACUK), FXR2(WGGA), FMR1_7(ACUK), FMR1_7(WGGA), FMR1_1(ACUK) and FMR1_1(WGGA).

We created one positive dataset and two negative datasets for each of these 14 cases. The positive dataset was a collection of transcribed sequences of ±2,000 nucleotides around each RBP-bound site. The RBP-bound sites were defined as sites of sequential motifs within the CLIP-seq peak regions. The two negative datasets are referred to as the unbound and shuffled datasets. The unbound dataset was a collection of transcribed sequences of ±2,000 nucleotides around a sequential motif site that was in the same transcriptional unit and within ±1,000 nucleotides of any RBP-bound site, but was not an RBP-bound site. In short, this dataset represents the sequential motif sites that are transcribed but unbound by the RBP. The shuffled dataset was generated by randomly shuffling each of the upstream and downstream sequences of each RBP-bound site by preserving nucleotide di-nucleotide frequencies for every sequence in the positive dataset. Thus it represents the sequential motif sites flanked by sequences with preserved sequence compositions. The details of the datasets are described in the Materials and methods section.

We calculated the structural profiles of the positive, unbound and shuffled datasets for each of the RBPs (W=200). Then, to evaluate the structural contexts that are significant in the positive dataset statistically, we defined a P score as follows. First, we calculated a P value using the one-sided Wilcoxon–Mann–Whitney test for each side for each position. Second, we selected the smaller P value of the two hypotheses and transformed it into − log10P, which we designated the P score. Third, if a P score was calculated under the hypothesis that each context probability of the positive dataset was smaller than that of the negative dataset, we changed the sign of the P score. For example, a large positive P score indicates that the probability of that structural context is significantly larger in the positive dataset. Finally, the two P scores calculated for the two negative datasets were compared for each position, and the smaller P score was taken (if one P score was positive and the other was negative, we used 0 instead of the two P scores). Note that the Bonferroni correction was used for multiple testing. To avoid the effects of the artificial value selection for the parameter W, we used the unstructured context instead of the exterior and multibranch loop contexts in the following analysis. We confirmed that the choice of W actually did not affect the results (Additional file 1: Figure S2).

Specific RNA structural contexts recognized by RNA-binding proteins

We investigated the preferred RNA structural contexts for each RBP and revealed that most RBPs prefer a specific structural context (Figure 4 and Additional file 1: Figure S3). Our method was robust regarding the selection of the negative datasets, because selecting the larger P scores did not affect the results overall (Additional file 1: Figures S4 and S5). Among the 14 cases analyzed, six cases showed a preference for the unstructured context (GLD-1, QKI, SRSF1, Nova, FXR1(ACUK) and FXR2(ACUK)). Except for Nova, the RBP-bound sites tended to form the unstructured context, but did not show preferences for the bulge, internal or hairpin loop contexts (Figure 4A and Additional file 1: Figure S3). It should be noted that these results could not be obtained by analyzing the accessibility alone, which does not discriminate between these non-stem contexts.

Figure 4
figure 4

The distribution of the P scores for each RNA-binding protein. The x-axis represents the nucleotide positions and the y-axis represents the P score of ±20 bases around the sequential motif site. The position 0 denotes the start position of the sequential motif. Positive P scores for each structural context indicate that the positions tend to prefer the structural context. The black box represents the sequential motif site. The dotted lines show the corrected significance levels of the Bonferroni correction (α=0.05). The panels represent the distribution of P scores for (A) QKI, (B) Pum2, (C) Lin28A, (D) FXR2(WGGA), (E) FMR1_7(ACUK), (F) FXR2(ACUK), (G) Nova and (H) SRSF1. B, bulge loop; H, hairpin loop; I, internal loop; S, stem; U, unstructured.

Pum2 showed a preference for the hairpin loop context (Figure 4B). To our knowledge, this is the first report of the structural preference for the hairpin loop context by Pum2, which is known to be involved in germ cell development [29]. Lin28A showed preferences for the hairpin and internal loop contexts (Figure 4C). Lin28A is known to inhibit the maturation of let-7 miRNAs and the translation of mRNAs that are destined for the endoplasmic reticulum [27]. The specificity of Lin28A to the hairpin loop context is consistent with the previous study [27]. In addition, our result is the first to suggest that Lin28A prefers the internal loop context in mRNA binding, and Lin28A has been reported to bind to the internal loop of let-7 miRNAs [27].

FXR1(WGGA), FXR2(WGGA) and FMR1_7(WGGA) showed preferences for the stem context (Figure 4D and Additional file 1: Figure S3), although RBPs were considered to be unlikely to be bound to the stem regions of RNAs as already mentioned. These three RBPs (and FMR1_1) are members of the FMRP family and are known to be responsible for the fragile X syndrome. Darnell et al. showed that FMRP-bound WGGA sites tend to form a G-quadruplex, which is composed of guanine-rich sequences forming a four-stranded RNA structure [30]. We suppose that the preference for the stem contexts could reflect the tendency that these family members recognize the G-quadruplex; however, this should be investigated further as currently our energy model and grammar cannot deal with G-quadruplexes.

FMR1_7(ACUK) showed preferences for the internal and bulge loop contexts (Figure 4E). To our knowledge, this is the first report of the structural specificities of FMR1. In contrast, FXR2(ACUK), where FXR2 is a homolog of FMR1, preferred neither the internal nor bulge loop context (Figure 4F). FMR1_7 has an exon insertion in its K homology domain that recognizes the ACUK sequential motifs [28]. This insertion appears to underlie the differences in the structural specificity between FMR1_7(ACUK) and FXR2(ACUK).

Positional preferences in the RNA structure recognition by RNA-binding proteins

The present understanding of the structural specificities of RBP–RNA interactions overlooks structures of the flanking sequences of RBP-bound sites. Therefore, we investigated the secondary structures not only of the RBP-bound sites but also of their flanking sequences. In fact, the positions with the highest P scores were not within the RBP-bound sites in some RBPs. QKI (Figure 4A), Nova (Figure 4G) and SRSF1 (Figure 4H) preferred the unstructured context. High P scores were observed within the RBP-bound sites for SF2ASF, whereas they were observed in the flanking and upstream sequences for QKI and Nova, respectively. These results suggest that RBPs also recognize specific structures existing outside of sequential motif sites, and CapR can uncover these positional preferences from ribonomic datasets.

Figure 5A,B shows the nucleotide compositions around the RBP-bound sites of QKI and Nova. The flanking sequences of QKI-bound sites were guanine poor, whereas those of Nova-bound sites were uracil rich. Because sequences with low GC content tend to form an unstructured context, the aforementioned positional preferences could be generated by the biased nucleotide compositions. To address this possibility, we investigated the relations between the nucleotide compositions and structural specificities in the flanking sequences. We generated partially shuffled datasets by randomly shuffling sequences outside of the ±5 or 10 nucleotides of the RBP-bound sites with preserving di-nucleotide frequencies, and compared their structural profiles with those of the positive datasets using the Wilcoxon–Mann–Whitney test. Then, the P scores for the shuffled and partially shuffled datasets were compared (Figure 6A,B). For QKI, whereas the shuffled dataset had positional preferences in the flanking sequences, the partially shuffled datasets had no significant preferences. This means that the structural specificities of QKI could be generated by the biased nucleotide compositions in the flanking sequences. For Nova, the partially shuffled datasets still had significant P scores upstream of the RBP-bound sites. Therefore, the nucleotide compositions in the flanking sequences alone cannot generate the positional specificities of Nova, that is, sequences in distant regions could also contribute to the position-specific RNA binding of Nova. The nucleotide compositions around the RBP-bound sites and the analyses of the partially shuffled datasets of other RBPs are described in Additional file 1: Figures S6 and S7, respectively.

Figure 5
figure 5

The nucleotide compositions around the RBP-bound sites. The nucleotide compositions of ±20 bases around the RBP-bound sites for (A) QKI and (B) Nova. The x-axis represents the nucleotide position and the y-axis is the probability of each nucleotide. The black box represents the sequential motif site.

Figure 6
figure 6

Comparison of P scores of the positive datasets with P scores of the shuffled and partially shuffled datasets. In the legend of this figure, ‘0’, ‘5’ and ‘10’ represents the shuffled, the partially shuffled (±5) and the partially shuffled (±10) datasets, respectively. The x-axis represents the nucleotide position and the y-axis is the P score of (A) QKI and (B) Nova. The black boxes are the RBP-bound sites, and the horizontal dotted lines the corrected significance levels of the Bonferroni correction. The vertical dotted lines indicate the ±5 or 10 nucleotides of RBP-bound sites. RBP, RNA-binding protein.

Discussion

In this study, we developed an efficient algorithm that calculates the structural profiles of RNAs, and implemented it as CapR. It is the fastest software that can be applied to tens of thousands of long RNAs.

Using CapR, we investigated structural specificities of RBP target recognition using several CLIP-seq datasets. Our analysis revealed that most RBPs prefer specific structural contexts and some RBPs show positional preferences in their structural recognition. These findings could provide insights into the mechanisms of diseases involving RBPs. FMR1_7, where FMR1 is a causative gene of the fragile X syndrome, was revealed to bind specifically to internal and bulge loops. The observed structural specificity raises the possibility that disruption of the internal or bulge loop structures within the target sites of FMR1_7 may cause this disease. On the other hand, the structural specificities of Nova were revealed to be affected by the sequences of distant regions. This means that a mutation of a nucleotide distant from the RBP-bound sites can cause changes to the secondary structures around the RBP-bound sites. Because some disease-associated single nucleotide polymorphisms in non-coding regions are reported to affect RNA secondary structures [31, 32], CapR could also contribute to exploring disease mechanisms behind such polymorphisms.

It has been shown that the secondary structures around the target sites of small interfering RNAs (siRNAs) and miRNAs influence their activities [33, 34]. Kiryu et al. showed that the activity of an siRNA depends on the accessibility of the 3 ′ end of the siRNA target site, and Marin et al. showed that the 3 ′ end of an miRNA target site is more accessible than the other positions [12, 35]. As supported by the X-ray crystal structure of the guide-strand-containing Argonaute [36], these positional tendencies in the accessibility can reflect the kinetic aspects of the siRNA and miRNA binding mechanisms. We hypothesize that the positional preferences of RBPs discovered in this study also reflect the kinetic aspects of the RBP–RNA interactions. For example, Nova had a positional preference for upstream of the sequential motif site in the unstructured context recognition. In fact, the co-crystal structure of human Nova with the target RNA (PDBID: 1EC6) [37] showed that the area upstream of the sequential motif site interacts with the C-terminal amino acids of Nova [38] (see Figure 7; note that the CLIP-seq data were for a highly similar ortholog, mouse Nova). In addition, the deletion of these C-terminal amino acids inhibits the RNA binding function of Nova [39]. Therefore, the positional preference does likely reflect the kinetic aspects of the RNA binding function of Nova. We argue that this example demonstrates the potential power of ribonomic analysis.

Figure 7
figure 7

Co-crystal structure of Nova and the target RNA. This figure was generated using Pymol. The ten amino acids of the C-terminal tail are shown in red. RNA is represented by green sticks. The positions and the nucleotides are shown in yellow. Position 1 is the start position of the sequential motif.

Three future perspectives are envisioned based on the present study. The first perspective is to estimate the sequential and structural specificities simultaneously. Throughout this study, we focused on the RBPs with known and well-defined sequential motifs. Nonetheless, for several RBPs, no such sequential motifs have been identified (for example, FET binds to a highly flexible UAN n Y motif within the hairpin context [16]). To examine the binding specificities of these RBPs, CapR needs to be extended. The second perspective is prediction of RBP-bound sites. Li et al. showed that prediction of RBP-bound RNAs in vivo was improved by a motif-finding algorithm that considers accessibility [10]. Thus, consideration of structural profiles may also improve the prediction of RBP-bound sites in vivo, although we did not directly show this in the present study. Further investigation is necessary for evaluating whether discrimination of RBP-binding sites from a background sequence would be improved using the structural specificities of RBP target recognition. Other factors or subcellular localizations also need to be considered. The third perspective is application of CapR to functional RNAs. For example, the kissing hairpin, which is a hairpin–hairpin interaction that stabilizes RNA structures [40], may be predicted accurately using CapR because CapR enables the calculation of the hairpin loop probabilities. Another target would be small nucleolar RNAs (snoRNAs), where the detection algorithms still have room for improvement [41]. Because snoRNAs are characterized by specific internal loops, they may also be predicted accurately by taking advantage of the accurate calculation of internal loop probabilities by CapR.

Conclusions

We developed a highly efficient algorithm that calculates the probabilities that each RNA base position is located within each secondary structural context for tens of thousands of RNA fragments. The algorithm was implemented as software named CapR and was applied to the CLIP-seq data of various RBPs. Our algorithm demonstrated that several RBPs bind to their target RNA molecules under specific structural contexts. For example, FMR1, which is an RBP responsible for the fragile X syndrome, was found to bind specifically to the internal and bulge loops of RNA. Another example is Nova, a neuron-specific RBP related to a paraneoplastic neurologic disorder, which showed positional preference in the structural contexts of binding targets.

Secondary structures are known to be essential for the molecular functions of RNA. As large-scale, high-throughput approaches are becoming more popular in studying RNAs and RBPs, our algorithm will contribute to the systematic understanding of RNA functions and structure-specific RBP–RNA interactions.

Materials and methods

Rfold model

The state transition rules of the Rfold model are given by

Outer → ε | Outer · a | Outer · Stem Stem → b < · Stem · b > | b < · StemEnd · b > StemEnd → s n | s m · Stem · s n ( m + n > 0 ) | Multi Multi → a · Multi | MultiBif MultiBif → Multi1 · Multi2 Multi1 → MultiBif | Multi2 Multi2 → Multi2 · a | Stem

where ε represents the null terminal symbol, a is an unpaired nucleotide character, s k is an unpaired base string of length k and (b<, b>) is a base pair. There are seven non-terminal symbols: Outer, Stem, StemEnd, Multi, MultiBif, Multi1 and Multi2. Outer emits exterior bases. Stem emits all the base pairs. StemEnd represents the end of each stem from which a hairpin loop (StemEnd→s n ), and internal and bulge loop (StemEnd→s m ·Stem·s n (m+n>0)), or a multibranch loop (StemEnd→Multi) is emitted. Multi represents a complete multibranch loop. Multi1, Multi2 and MultiBif represent parts of a multibranch loop structure that contains one or more, exactly one, and two or more base pairs in the loop, respectively. Based on this grammar, the structural profiles are calculated by using a variant of the inside-outside algorithm for SCFG. First, we give an illustrative example to show how to calculate the internal loop probabilities from the inside and outside variables α s (i,j) and β s (i,j) (i,j=0,…,N, s∈{Outer,Stem,StemEnd,Multi,MultiBif,Multi1,Multi2 }). In the subsequent section, we completely describe how to calculate structural profiles.

Algorithm for calculating internal loop probabilities

When a base at position i has an internal loop context, the base i is caught in two base pairs, (j, k) and (p, q) where j≤p≤q≤k (Figure 8). Then, the outside structure of base pair (j, k) and the inside structure of base pair (p, q) may take arbitrary structures. The sums of Boltzmann weights of all patterns of the outside structure of base pair (j, k) and the inside structure of base pair (p, q) are represented by outside variable β StemEnd (j,k−1) and inside variable α Stem (p−1,q), respectively. Therefore, Boltzmann weights that the base i is caught in two base pairs (j, k) and (p, q) are obtained by the multiplication of β StemEnd (j,k−1), the score for transition StemEnd (j,k−1)→S t e m(p−1,q), and α S t e m(p−1,q). Here, we sum these Boltzmann weights for all combinations of base pairs (j, k) and (p, q). Finally, we obtain p(i,I) by dividing the sum by the partition function.

Figure 8
figure 8

Schematic illustration of calculation of internal loop probability. This figure shows the transition patterns that emit an internal loop. This figure was generated by modifying the output of VARNA [42].

The calculation formulas are given by:

w ( i , I ) = w InternalLeft ( i , I ) + w InternalRight ( i , I ) w InternalLeft ( i , I ) = ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ p = i + 1 min ( j + C + 1 , k − 1 ) ∑ q = max ( p + 4 , k − C − p + j − 1 ) k β StemEnd ( j , k − 1 ) · α Stem ( p − 1 , q ) · t ( StemEnd → ( Interior ) → Stem ) w InternalRight ( i , I ) = ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ p = j + 1 min ( j + C + 1 , i − 1 ) ∑ q = max ( p + 4 , k − C − p + j − 1 ) i β StemEnd ( j , k − 1 ) · α Stem ( p − 1 , q ) · t ( StemEnd → ( Interior ) → Stem ) p ( i , I ) = w ( i , I ) / Z ( x )

where t(s→s′) is the score for transition s→s′ and C is the maximal length of the internal and bulge loops. Many software programs, including RNAfold [43], adopt this parameter. In this study, following the default setting of RNAfold, we set C=30.

Algorithms for calculating the structural profile

The inside algorithm and the outside algorithm

To calculate the inside and outside variables, we developed a variant of the inside-outside algorithm corresponding to the Rfold model. The inside algorithm is described as follows:

α Stem ( i , j ) = ∑ α Stem ( i + 1 , j − 1 ) · t ( Stem → Stem ) α Stem ( i + 1 , j − 1 ) · t ( Stem → StemEnd ) α Multibif ( i , j ) = ∑ α Multi1 ( i , k ) · α Multi2 ( k , j ) · t ( MultiBif → Multi1 · Multi2 ) for i < k < j α Multi2 ( i , j ) = ∑ α Stem ( i , j ) · t ( Multi2 → Stem ) α Multi2 ( i , j − 1 ) · t ( Multi2 → Multi2 ) α Multi1 ( i , j ) = ∑ α Multi2 ( i , j ) · t ( Multi1 → Multi2 ) α MultiBif ( i , j ) · t ( Multi1 → MultiBif ) α Multi ( i , j ) = ∑ α Multi ( i + 1 , j ) · t ( Multi → Multi ) α MultiBif ( i , j ) · t ( Multi → MultiBif ) α StemEnd ( i , j ) = ∑ t ( StemEnd → ( Hairpin ) ) α Stem ( i ′ , j ′ ) · t ( StemEnd → ( Interior ) → Stem ) for i ≤ i ′ ≤ j ′ ≤ j , 0 < ( j − j ′ ) + ( i ′ − i ) ≤ C α Multi ( i , j ) · t ( StemEnd → Multi ) α Outer ( i ) = ∑ 1 if j = 0 α Outer ( i − 1 ) · t ( Outer → Outer ) α Outer ( k ) · α Stem ( k , i ) · t ( Outer → Outer · Stem ) for ( i − W ) < k < i

The outside algorithm is described as follows:

β Outer ( i ) = ∑ 1 if i = N β Outer ( i + 1 ) · t ( Outer → Outer ) α Stem ( i , k ) · β Outer ( k ) · t ( Outer → Outer · Stem ) for i < k < i + W β StemEnd ( i , j ) = β Stem ( i − 1 , j + 1 ) · t ( Stem → StemEnd ) β Multi ( i , j ) = ∑ β StemEnd ( i , j ) · t ( StemEnd → Multi ) β Multi ( i − 1 , j ) · t ( Multi → Multi ) β Multi1 ( i , j ) = ∑ β MultiBif ( i , k ) · α Multi2 ( j , k ) · t ( MultiBif → Multi1 · Multi2 ) for j < k < ( i + W ) β Multi2 ( i , j ) = ∑ β Multi2 ( i , j + 1 ) · t ( Multi2 → Multi2 ) β Multi1 ( i , j ) · t ( Multi1 → Multi2 ) β MultiBif ( k , j ) · α Multi1 ( k , i ) · t ( MultiBif → Multi1 · Multi2 ) for ( j − W ) < k < i β MultiBif ( i , j ) = ∑ β Multi1 ( i , j ) · t ( Multi1 → MultiBif ) β Multi ( i , j ) · t ( Multi → MultiBif ) β Stem ( i , j ) = ∑ α Outer ( i ) · β Outer ( j ) · t ( Outer → Outer · Stem ) β StemEnd ( i ′ , j ′ ) · t ( StemEnd → ( Interior ) → Stem ) for i ′ ≤ i < j ≤ j ′ , 0 < ( i − i ′ ) + ( j − j ′ ) ≤ C β Multi2 ( i , j ) · t ( Multi2 → Stem ) β Stem ( i − 1 , j + 1 ) · t ( Stem → Stem )

The original computational complexity of both algorithms is O(N W3); because we adopted the parameter C, it becomes O(N W2) as described below.

Calculation of the structural profile

We calculate the structural profiles from the inside and outside variables computed by the inside-outside algorithm. The calculation formula is described as follows:

Z = α O ( N ) p ( i , B ) = 1 Z ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ p = i + 1 min ( j + C + 1 , k − 1 ) β SE ( j , k − 1 ) · α S ( p − 1 , k − 1 ) · t ( SE → ( Interior ) → S ) + ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ q = max ( j + 4 , k − C − 1 ) i β SE ( j , k − 1 ) · α S ( j , q ) · t ( SE → ( Interior ) → S ) p ( i , E ) = 1 Z α O ( i − 1 ) · β O ( i ) · t ( O → O ) p ( i , H ) = 1 Z ∑ j = max ( 1 , i − W ) i − 1 ∑ k = i + 1 k = min ( n , i + W ) β SE ( j , k − 1 ) · t ( SE → ( Hairpin ) ) p ( i , I ) = 1 Z ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ p = i + 1 min ( j + C + 1 , k − 1 ) ∑ q = max ( p + 4 , k − C − p + j − 1 ) k β SE ( j , k − 1 ) · α S ( p − 1 , q ) · t ( SE → ( Interior ) → S ) + ∑ j = max ( 1 , i − W ) i ∑ k = i + 1 min ( n , j + W ) ∑ p = j + 1 min ( j + C + 1 , i − 1 ) ∑ q = max ( p + 4 , k − C − p + j − 1 ) i β SE ( j , k − 1 ) · α S ( p − 1 , q ) · t ( SE → ( Interior ) → S ) p ( i , M ) = 1 Z ∑ k = i min ( i + W , n ) β M ( i − 1 , k ) · α M ( i , k ) · t ( M → M ) ∑ k = max ( 0 , i − W ) i β M2 ( i , k ) · α M2 ( k , i − 1 ) · t ( M2 → M2 ) p ( i , S ) = 1 Z ∑ j = max ( 0 , i − W ) min ( n , i + W ) β S ( i − 1 , j ) · α SE ( i , j − 1 ) · t ( S → SE ) β S ( i − 1 , j ) · α S ( i , j − 1 ) · t ( S → S )

Here, O is the outer state, S is the stem state, SE is the stem-end state, M is the multi state and M2 is the multi2 state in the Rfold model.

Implementation

We implemented the algorithms in C++ as a program named CapR. CapR exhaustively computes the structural profile {p(i,δ)} for a given RNA sequence with O(N W2) time and O(N W) memory. We used a portion of the source code from the Vienna RNA package [43]. We include the source code as Additional file 2. Our source code is also available from [44].

Data preparation and analysis

To evaluate the accuracy of the structural profiles calculated by CapR, we used 188 structural RNA families in the Rfam 10.0 seed dataset [22]. They are provided as 188 structural alignments with experimentally validated pseudoknot-free structures. By excluding alignment columns with a gap proportion of ≥0.5, we obtained 8,775 sequences and 1,039,537 nucleotides.

In the present study, we focused on RBP target recognition. In this application, it should be ineffective to consider transcribed sequences that are too long because regions that are too distant are unlikely to affect the secondary structures around the RBP-bound sites, although our algorithm itself can be applied to long RNAs. Therefore, we investigated how much distance we should take into account. We prepared 100 random RNA sequences 10,100 nucleotides long and truncated them so that the lengths of the flanking sequences of the central 100 bases became l=250,500,…, 2,500. Then, we calculated the structural profiles of the central 100 bases for each l, and calculated the Pearson correlation coefficient between the structural profiles of the original sequence and those of the truncated sequences. Additional file 1: Figure S8 shows that the Pearson correlation coefficients were more than 0.99 for l≥2,000. Therefore, we considered 2,000 nucleotides upstream and downstream of the RBP-bound sites in this study.

To investigate the structural characteristics of RNAs around the RBP-binding sites, we downloaded CLIP-seq datasets from the doRina database [23] (human [45], mouse [46] and nematode [47]). We excluded from the analysis CLIP-seq datasets that met one of the following three criteria: (1) well-defined sequential motifs not presented in the original paper of the dataset, (2) datasets for mutant RBPs and (3) the average number of RBP-bound sites (that is the sequential motif-matched sites within the CLIP-seq peak regions defined in doRina) is less than two. The third criterion was adopted because many RBP-bound sites include false positives. As a result, we selected ten RBPs: GLD-1 (nematode), QKI (human), Pum2 (human), SRSF1 (human), Nova (mouse), Lin28A (mouse), FXR1 (human), FXR2 (human), FMR1_7 (human) and FMR1_1 (human) [7, 24–28]. When the peak regions spanned just one or two bases, we sought sequential motif-matched sites within ±10 nucleotides around the peak regions. If no motif-matched sites were found, such peak regions were excluded from the analysis. Then, we extracted ±2,000 nucleotide sequences around the RBP-bound sites to create the positive datasets. If there existed multiple RBP-bound sites in the same peak region, we averaged the structural profiles around those sites and used them as a single observation. For each gene in RefSeq [48], the transcribed sequence was defined by the genomic region between the most upstream 5′ position and the most downstream 3′ position of its mRNA isoforms. To generate the shuffled and partially shuffled datasets, we used the uShuffle software to preserve the di-nucleotide frequencies of the original sequences [49]. The data sizes and other basic statistics of the CLIP-seq datasets are summarized in Additional file 1: Tables S1 and S2. In the present study, because the distributions of the structural profiles did not follow a normal distribution, we used the non-parametric Wilcoxon–Mann–Whitney test.

We also examined how the choice of the maximal span W influences the results. We compared the highest P scores of the exterior and multibranch loops with different W because these two loops are sensitive to W. We calculated the ratios of the W sensitivity (δ) of the highest P scores among all positions for each loop δ calculated at W=400 and 30:

W sensitivity ( δ ) = Highest P score for δ at W = 400 Highest P score for δ at W = 30

Additional file 1: Figure S9 is a box plot of the W sensitivity of the exterior loop, multibranch loop and unstructured contexts for all the RBP datasets. The highest P scores of the exterior and multibranch loops were sensitive to W, whereas the highest P score of unstructured context was insensitive to W.

Notes added in proof

After the manuscript was accepted, we were informed that the similar algorithm to CapR was internally used in the previous researches [50–52].

Abbreviations

AUROC:

Area under the receiver operating characteristic curve

CLIP:

Cross-linking immunoprecipitation

iCLIP:

Individual-nucleotide resolution CLIP

miRNA:

microRNA

PAR-CLIP:

Photoactivatable-ribonucleoside-enhanced CLIP

RBP:

RNA-binding protein

RIP-Chip:

RNA-binding protein immunoprecipitation microarray

SCFG:

Stochastic context-free grammar

seq:

Sequencing

siRNA:

Small interfering RNA

snoRNA:

Small nucleolar RNA.

References

  1. Keene JD: RNA regulons: coordination of post-transcriptional events. Nat Rev Genet. 2007, 8: 533-543. 10.1038/nrg2111.

    Article  Google Scholar 

  2. Cook KB, Kazan H, Zuberi K, Morris Q, Hughes TR: RBPDB: a database of RNA-binding specificities. Nucleic Acids Res. 2011, 39: D301-D308. 10.1093/nar/gkq1069.

    Article  Google Scholar 

  3. Lukong KE, Chang KW, Khandjian EW, Richard S: RNA-binding proteins in human genetic disease. Trends Genet. 2008, 24: 416-425. 10.1016/j.tig.2008.05.004.

    Article  Google Scholar 

  4. Musunuru K: Cell-specific RNA-binding proteins in human disease. Trends Cardiovasc Med. 2003, 13: 188-195. 10.1016/S1050-1738(03)00075-6.

    Article  Google Scholar 

  5. Licatalosi DD, Mele A, Fak JJ, Ule J, Kayikci M, Chi SW, Clark TA, Schweitzer AC, Blume JE, Wang X, Darnell JC, Darnell RB: HITS-CLIP yields genome-wide insights into brain alternative RNA processing. Nature. 2008, 456: 464-469. 10.1038/nature07488.

    Article  Google Scholar 

  6. Konig J, Zarnack K, Rot G, Curk T, Kayikci M, Zupan B, Turner DJ, Luscombe NM, Ule J: iCLIP reveals the function of hnRNP particles in splicing at individual nucleotide resolution. Nat Struct Mol Biol. 2010, 17: 909-915. 10.1038/nsmb.1838.

    Article  Google Scholar 

  7. Hafner M, Landthaler M, Burger L, Khorshid M, Hausser J, Berninger P, Rothballer A, Ascano M, Jungkamp AC, Munschauer M, Ulrich A, Wardle GS, Dewell S, Zavolan M, Tuschl T: Transcriptome-wide identification of RNA-binding protein and microRNA target sites by PAR-CLIP. Cell. 2010, 141: 129-141. 10.1016/j.cell.2010.03.009.

    Article  Google Scholar 

  8. Keene JD, Komisarow JM, Friedersdorf MB: RIP-Chip: the isolation and identification of mRNAs, microRNAs and protein components of ribonucleoprotein complexes from cell extracts. Nat Protoc. 2006, 1: 302-307. 10.1038/nprot.2006.47.

    Article  Google Scholar 

  9. Bailey TL, Williams N, Misleh C, Li WW: MEME: discovering and analyzing DNA and protein sequence motifs. Nucleic Acids Res. 2006, 34: W369-373. 10.1093/nar/gkl198.

    Article  Google Scholar 

  10. Li X, Quon G, Lipshitz HD, Morris Q: Predictingin vivobinding sites of RNA-binding proteins using mRNA secondary structure. RNA. 2010, 16: 1096-1107. 10.1261/rna.2017210.

    Article  Google Scholar 

  11. Bernhart SH, Muckstein U, Hofacker IL: RNA accessibility in cubic time. Algorithms Mol Biol. 2011, 6: 3-10.1186/1748-7188-6-3.

    Article  Google Scholar 

  12. Kiryu H, Terai G, Imamura O, Yoneyama H, Suzuki K, Asai K: A detailed investigation of accessibilities around target sites of siRNAs and miRNAs. Bioinformatics. 2011, 27: 1788-1797. 10.1093/bioinformatics/btr276.

    Article  Google Scholar 

  13. Draper DE: Themes in RNA-protein recognition. J Mol Biol. 1999, 293: 255-270. 10.1006/jmbi.1999.2991.

    Article  Google Scholar 

  14. Aviv T, Lin Z, Ben-Ari G, Smibert CA, Sicheri F: Sequence-specific recognition of RNA hairpins by the SAM domain of Vts1p. Nat Struct Mol Biol. 2006, 13: 168-176. 10.1038/nsmb1053.

    Article  Google Scholar 

  15. Oberstrass FC, Lee A, Stefl R, Janis M, Chanfreau G, Allain FH: Shape-specific recognition in the structure of the Vts1p SAM domain with RNA. Nat Struct Mol Biol. 2006, 13: 160-167. 10.1038/nsmb1038.

    Article  Google Scholar 

  16. Hoell JI, Larsson E, Runge S, Nusbaum JD, Duggimpudi S, Farazi TA, Hafner M, Borkhardt A, Sander C, Tuschl T: RNA targets of wild-type and mutant FET family proteins. Nat Struct Mol Biol. 2011, 18: 1428-1431. 10.1038/nsmb.2163.

    Article  Google Scholar 

  17. Mathews DH, Sabina J, Zuker M, Turner DH: Expanded sequence dependence of thermodynamic parameters improves prediction of RNA, secondary structure. J Mol Biol. 1999, 288: 911-940. 10.1006/jmbi.1999.2700.

    Article  Google Scholar 

  18. Ding Y, Lawrence CE: A statistical sampling algorithm for RNA, secondary structure prediction. Nucleic Acids Res. 2003, 31: 7280-7301. 10.1093/nar/gkg938.

    Article  Google Scholar 

  19. Doshi KJ, Cannone JJ, Cobaugh CW, Gutell RR: Evaluation of the suitability of free-energy minimization using nearest-neighbor energy parameters for RNA secondary structure prediction. BMC Bioinformatics. 2004, 5: 105-10.1186/1471-2105-5-105.

    Article  Google Scholar 

  20. Kiryu H, Kin T, Asai K: Rfold: an exact algorithm for computing local base pairing probabilities. Bioinformatics. 2008, 24: 367-373. 10.1093/bioinformatics/btm591.

    Article  Google Scholar 

  21. Eddy SR, Durbin R: RNA sequence analysis using covariance models. Nucleic Acids Res. 1994, 22: 2079-2088. 10.1093/nar/22.11.2079.

    Article  Google Scholar 

  22. Gardner PP, Daub J, Tate J, Moore BL, Osuch IH, Griffiths-Jones S, Finn RD, Nawrocki EP, Kolbe DL, Eddy SR, Bateman A: Rfam: Wikipedia, clans and the ‘decimal’ release. Nucleic Acids Res. 2011, 39: D141-D145. 10.1093/nar/gkq1129.

    Article  Google Scholar 

  23. Anders G, Mackowiak SD, Jens M, Maaskola J, Kuntzagk A, Rajewsky N, Landthaler M, Dieterich C: doRiNA: a database of RNA interactions in post-transcriptional regulation. Nucleic Acids Res. 2012, 40: D180-D186. 10.1093/nar/gkr1007.

    Article  Google Scholar 

  24. Jungkamp AC, Stoeckius M, Mecenas D, Grun D, Mastrobuoni G, Kempa S, Rajewsky N: In vivoand transcriptome-wide identification of RNA, binding protein target sites. Mol Cell. 2011, 44: 828-840. 10.1016/j.molcel.2011.11.009.

    Article  Google Scholar 

  25. Sanford JR, Wang X, Mort M, Vanduyn N, Cooper DN, Mooney SD, Edenberg HJ, Liu Y: Splicing factor SFRS1 recognizes a functionally diverse landscape of RNA transcripts. Genome Res. 2009, 19: 381-394.

    Article  Google Scholar 

  26. Zhang C, Darnell RB: Mappingin vivoprotein-RNA interactions at single-nucleotide resolution from HITS-CLIP data. Nat Biotechnol. 2011, 29: 607-614. 10.1038/nbt.1873.

    Article  Google Scholar 

  27. Cho J, Chang H, Kwon SC, Kim B, Kim Y, Choe J, Ha M, Kim YK, Kim VN: LIN28A is a suppressor of ER-associated translation in embryonic stem cells. Cell. 2012, 151: 765-777. 10.1016/j.cell.2012.10.019.

    Article  Google Scholar 

  28. Ascano M, Mukherjee N, Bandaru P, Miller JB, Nusbaum JD, Corcoran DL, Langlois C, Munschauer M, Dewell S, Hafner M, Williams Z, Ohler U, Tuschl T: FMRP targets distinct mRNA sequence elements to regulate protein expression. Nature. 2012, 492: 382-386. 10.1038/nature11737.

    Article  Google Scholar 

  29. Moore FL, Jaruzelska J, Fox MS, Urano J, Firpo MT, Turek PJ, Dorfman DM, Pera RA: Human Pumilio-2 is expressed in embryonic stem cells and germ cells and interacts with DAZ (Deleted in Azoospermia) and DAZ-like proteins. Proc Natl Acad Sci USA. 2003, 100: 538-543. 10.1073/pnas.0234478100.

    Article  Google Scholar 

  30. Darnell JC, Jensen KB, Jin P, Brown V, Warren ST, Darnell RB: Fragile X mental retardation protein targets G quartet mRNAs important for neuronal function. Cell. 2001, 107: 489-499. 10.1016/S0092-8674(01)00566-9.

    Article  Google Scholar 

  31. Halvorsen M, Martin JS, Broadaway S, Laederach A: Disease-associated mutations that alter the RNA structural ensemble. PLoS Genet. 2010, 6: e1001074-10.1371/journal.pgen.1001074.

    Article  Google Scholar 

  32. Salari R, Kimchi-Sarfaty C, Gottesman MM, Przytycka TM: Sensitive measurement of single-nucleotide polymorphism-induced changes of RNA, conformation: application to disease studies. Nucleic Acids Res. 2013, 41: 44-53. 10.1093/nar/gks1009.

    Article  Google Scholar 

  33. Kertesz M, Iovino N, Unnerstall U, Gaul U, Segal E: The role of site accessibility in microRNA target recognition. Nat Genet. 2007, 39: 1278-1284. 10.1038/ng2135.

    Article  Google Scholar 

  34. Shao Y, Chan CY, Maliyekkel A, Lawrence CE, Roninson IB, Ding Y: Effect of target secondary structure on RNAi efficiency. RNA. 2007, 13: 1631-1640. 10.1261/rna.546207.

    Article  Google Scholar 

  35. Marin RM, Voellmy F, von Erlach T, Vanicek J: Analysis of the accessibility of CLIP bound sites reveals that nucleation of the miRNA:mRNA pairing occurs preferentially at the 3’-end of the seed match. RNA. 2012, 18: 1760-1770. 10.1261/rna.033282.112.

    Article  Google Scholar 

  36. Wang Y, Sheng G, Juranek S, Tuschl T, Patel DJ: Structure of the guide-strand-containing argonaute silencing complex. Nature. 2008, 456: 209-213. 10.1038/nature07315.

    Article  Google Scholar 

  37. Rose PW, Bi C, Bluhm WF, Christie CH, Dimitropoulos D, Dutta S, Green RK, Goodsell DS, Prlic A, Quesada M, Quinn GB, Ramos AG, Westbrook JD, Young J, Zardecki C, Berman HM, Bourne PE: The RCSB Protein Data Bank: new resources for research and education. Nucleic Acids Res. 2013, 41: D475-D482. 10.1093/nar/gks1200.

    Article  Google Scholar 

  38. Lewis HA, Musunuru K, Jensen KB, Edo C, Chen H, Darnell RB, Burley SK: Sequence-specific RNA binding by a Nova KH domain: implications for paraneoplastic disease and the fragile X syndrome. Cell. 2000, 100: 323-332. 10.1016/S0092-8674(00)80668-6.

    Article  Google Scholar 

  39. Jensen KB, Musunuru K, Lewis HA, Burley SK, Darnell RB: The tetranucleotide UCAY directs the specific recognition of RNA by the Nova K-homology 3 domain. Proc Nat Acad Sci USA. 2000, 97: 5740-5745. 10.1073/pnas.090553997.

    Article  Google Scholar 

  40. Li PT, Bustamante C, Tinoco I: Unusual mechanical stability of a minimal RNA kissing complex. Proc Natl Acad Sci USA. 2006, 103: 15847-15852. 10.1073/pnas.0607202103.

    Article  Google Scholar 

  41. Hertel J, Hofacker IL, Stadler PF: SnoReport: computational identification of snoRNAs with unknown targets. Bioinformatics. 2008, 24: 158-164. 10.1093/bioinformatics/btm464.

    Article  Google Scholar 

  42. Darty K, Denise A, Ponty Y: VARNA: interactive drawing and editing of the RNA secondary structure. Bioinformatics. 2009, 25: 1974-10.1093/bioinformatics/btp250.

    Article  Google Scholar 

  43. Lorenz R, Bernhart SH, Honer Zu Siederdissen C, Tafer H, Flamm C, Stadler PF, Hofacker IL: ViennaRNA Package 2.0. Algorithms Mol Biol. 2011, 6: 26-10.1186/1748-7188-6-26.

    Article  Google Scholar 

  44. CapR. [https://sites.google.com/site/fukunagatsu/software/capr],

  45. doRiNA databse (human). [http://dorina.mdc-berlin.de/rbp_browser/hg18.html],

  46. doRiNA databse (mouse). [http://dorina.mdc-berlin.de/rbp_browser/mm9.html],

  47. doRiNA databse (nematode). [http://dorina.mdc-berlin.de/rbp_browser/ce6.html],

  48. Pruitt KD, Tatusova T, Brown GR, Maglott DR: NCBI Reference Sequences (RefSeq): current status, new features and genome annotation policy. Nucleic Acids Res. 2012, 40: D130-D135. 10.1093/nar/gkr1079.

    Article  Google Scholar 

  49. Jiang M, Anderson J, Gillespie J, Mayne M: uShuffle: a useful tool for shuffling biological sequences while preserving the k-let counts. BMC Bioinformatics. 2008, 9: 192-10.1186/1471-2105-9-192.

    Article  Google Scholar 

  50. Wilbert ML, Huelga SC, Kapeli K, Stark TJ, Liang TY, Chen SX, Yan BY, Nathanson JL, Hutt KR, Lovci MT, Kazan H, Vu AQ, Massirer KB, Morris Q, Hoon S, Yeo GW: LIN28 binds messenger RNAs at GGAGA motifs and regulates splicing factor abundance. Mol Cell. 2012, 48: 195-206. 10.1016/j.molcel.2012.08.004.

    Article  Google Scholar 

  51. Kazan H, Morris Q: RBPmotif: a web server for the discovery of sequence and structure preferences of RNA-binding proteins. Nucleic Acids Res. 2013, 41: W180-W186. 10.1093/nar/gkt463.

    Article  Google Scholar 

  52. Ray D, Kazan H, Cook KB, Weirauch MT, Najafabadi HS, Li X, Gueroussov S, Albu M, Zheng H, Yang A, Na H, Irimia M, Matzat LH, Dale RK, Smith SA, Yarosh CA, Kelly SM, Nabet B, Mecenas D, Li W, Laishram RS, Qiao M, Lipshitz HD, Piano F, Corbett AH, Carstens RP, Frey BJ, Anderson RA, Lynch KW, Penalva LO, et al: A compendium of RNA-binding motifs for decoding gene regulation. Nature. 2013, 499: 172-177. 10.1038/nature12311.

    Article  Google Scholar 

  53. Supercomputing facilities at the Human Genome Center, University of Tokyo. [http://sc.hgc.jp/shirokane.html],

Download references

Acknowledgements

The authors thank their research group colleagues for assistance in this research. This study was supported by JSPS KAKENHI Grant Numbers 22240031 (to KA and HK), 23710231 (to WI) and 25134701 and 25870190 (to HK). The computations were performed using the supercomputing facilities at the Human Genome Center, University of Tokyo [53].

Author information

Authors and Affiliations

Authors

Corresponding author

Correspondence to Tsukasa Fukunaga.

Additional information

Competing interests

The authors declare that they have no competing interests.

Authors’ contributions

TF, KA and HK designed the project. TF and HK developed the algorithm. TF implemented the software and performed the analyses. HO, GT, KA and WI advised on the project. TF, HO, WI and HK wrote the paper. All the authors read and approved the final manuscript.

Electronic supplementary material

13059_2013_3223_MOESM1_ESM.pdf

Additional file 1: Supplementary materials. This file includes additional figures and tables not shown in the manuscript. (PDF 1 MB)

Additional file 2: The source code of CapR. This file includes the source code of CapR. (ZIP 18 KB)

Authors’ original submitted files for images

Rights and permissions

Reprints and permissions

About this article

Cite this article

Fukunaga, T., Ozaki, H., Terai, G. et al. CapR: revealing structural specificities of RNA-binding protein target recognition using CLIP-seq data. Genome Biol 15, R16 (2014). https://doi.org/10.1186/gb-2014-15-1-r16

Download citation

  • Received:

  • Accepted:

  • Published:

  • DOI: https://doi.org/10.1186/gb-2014-15-1-r16

Keywords