DNA60: A special bioinformatics challenge from Genome Biology: The answers
Listed here are the full solutions for each challenge stage. The solutions as written are an extract from a Comment article (written by the curators and winner) that discusses the DNA60IFX challenge in more detail.
Recommended scripts to solve each stage of the challenge are available from the DNA60IFX GitHub page, which also houses a copy of the data. The winner's Python script for the final stage can be downloaded here.
The first stage was based on the common bioinformatics problem of motif finding, such as for identifying a transcription factor binding motif or other regulatory element upstream of a set of gene sequences. Finding true biological motifs requires complex learning approaches such as Gibbs sampling to account for the variability that may be present. For the contest, we simplified the problem to identification of a 7 base-pair sequence motif without any variability or errors. As a result, the solution could be computed in a few seconds with any of a number of k-mer counting software packages. Nevertheless, the simplicity of the stage aided in explaining the process of how to use the solution to unlock the subsequent stages, and also made the contest accessible to a very large set of participants.
Recommended algorithm: Jellyfish k-mer counter
The second stage centered around the important problem of computational gene finding. Users were presented with an artificial one megabase-pair microbial genome, and tasked to identify the open reading frames (ORFs) and analyze their amino acid sequences. ORFs are regions of a genome stretching from a start codon to a stop codon absent of any in-frame internal stop codons, and represent possible protein-coding genes. While not every ORF in a microbe will be a true gene, the longest ORFs typically are, and thus constitute an effective heuristic for training a gene finder for classifying the other ORFs in an unannotated genome. Once the ORFs were identified, participants were then tasked to translate their codons into their corresponding amino acid sequences, and then report the 25th amino acid from the 15 longest ORFs in sorted order. There are several gene finding and ORF finding programs available that could be used for solving the stage, including EMBOSS and Glimmer, although it seems many participants chose to implement their own given the questions we received, especially to clarify the processing of overlapping ORFs. Care was taken in designing the problem to ensure that the answer was unambiguous by ensuring the top 15 longest ORFs had distinct lengths. Several participants asked if we had a typo in designing the contest, but they should take note that there is no amino acid with the abbreviation 'O'.
Recommended algorithm: EMBOSS getorf program
For stage three, participants were presented with a pair of simulated RNA-seq experiments from a portion of Escherichia coli, and asked to find the most highly differentially expressed gene. While RNA-seq has the potential to discover new genes and new isoforms, in this stage we provided the annotation for the genome, and being a microbe, did not include any alternatively spliced genes. As such, identifying the solution was a relatively simple matter of mapping the reads and comparing the mapped read coverage in the two conditions. Curiously, from the access logs it appears at least one person attempted to solve the stage by systematically trying all 93 annotated genes until the correct one was found.
Recommended algorithms: Bowtie and SAMtools
Stage four simulated a metagenomics experiment, as used to explore the microbial composition on different sites on the human body or in different environments around the world. A reasonable shotgun metagenomics simulation would have required a larger dataset than was desirable for the stage, thus we chose to simulate a microbial community profiling experiment using amplified 16S rRNA sequences. We randomly selected 80 or so members of the Helicobacter genus, together with a matched number from other random genera, from the Greengenes database of 16S sequences. We then generated simulated 250 to 400 base-pair reads from the V1-V3 variable regions, with a progressively decreasing number of reads drawn from each species. Sequencing and other characteristic errors found in real 16S experiments were not simulated after initial evaluations determined they would make the stage difficult to solve unambiguously without a much larger dataset. The resulting dataset was highly enriched for reads from members of Helicobacter, allowing an answer to be determined as verified using the RDP classifier or CAMERA. In generating this stage’s dataset, we found that if we reduced the prevalence of the dominant genus it quickly became difficult for common taxonomic classifiers to yield an unambiguous answer. However, because Helicobacter was so over-represented, the correct answer could easily be guessed just by aligning random reads to an appropriate database.
Recommended algorithm: CAMERA
The final stage was to identify a secret message that we had embedded into a genome, and then email us the correct phrase as fast as possible. This simplified the scoring as we had a time-stamped electronic record of the submissions along with the email addresses of the participants. We embedded the secret message using the encoding scheme proposed by Church et al., in which text or images are represented in a binary alphabet expressed in DNA nucleotides. To further complicate the stage, instead of providing the genome with the secret message embedded within it, we simulated the shotgun sequencing of it and presented just the unassembled reads. We expected the participants to then assemble the reads, BLAST the assembly at NCBI to determine the species, align the assembly to the reference, extract the inserted nucleotides, and then decode the message using the included decoder. Alternatively, one could run the decoder script directly on the unassembled reads. The majority of the reads would decode into unintelligible characters, but those with the insertion would decode into recognizable words that could then be assembled into the entire phrase. This approach would be somewhat more complex to implement since most available genome assemblers are specialized for DNA sequences, but has the advantage of skipping the time-consuming steps of assembling and BLASTing to determine the reference. Indeed, the winning entry used this shortcut to outpace the competition.
Solution: 'We went up, saw the structure, we came back to King's and looked at our Pattersons, and every section of our Pattersons we looked at screamed at you, "Double Helix!" And it was just there! - once you knew what to look for. It was amazing.' (a quote from Genome Biology's DNA Day interview with Ray Gosling)
Recommended algorithms: ALLPATHS-LG, BLAST and MUMmer
Click here to go back to our start page. The challenge should be much easier now that you know the answers!