A bioinformatic pipeline to identify candidate centromere DNAs based on their tandem repeat nature and abundance. (a) Random shotgun sequences from a variety of platforms can be used to identify the most common tandem repeat monomer. Sanger and PacBio reads are usually long enough to contain multiple copies of a tandem repeat. Illumina and 454 reads are generally too short and must be assembled into longer sequences. Tandem repeat monomers are identified with Tandem Repeats Finder (TRF). (b) Identification of known centromere tandem repeats from three species. The human centromere repeat is 171 bp in length. The 728-bp monkeyflower centromere repeat is too long to be found in Sanger reads, but a PRICE assembly of Illumina reads reveals the known repeat. The 1,419-bp cattle centromere repeat and a less abundant 680-bp tandem repeat were identified directly from PacBio reads. Note that the graph for monkeyflower has no background of low-abundance tandem repeats because these were not assembled by PRICE. (c) Three examples of de novo identification of centromere tandem repeats. Sanger WGS reads from the American pika, Hydra, and Colorado blue columbine revealed 253-bp, 183-bp, and 329-bp repeat monomers, respectively. nt, nucleotides.
Melters et al. Genome Biology 2013 14:R10 doi:10.1186/gb-2013-14-1-r10
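The ranking step described in panel (a) can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the tandem repeat hits have already been reduced to hypothetical `(period_bp, copy_number)` tuples, one per tandem array found in a read or contig, and it groups hits by monomer length alone, which is a simplification because distinct repeats of the same length would be merged. The function name `rank_monomers` and the example numbers are illustrative only.

```python
from collections import defaultdict


def rank_monomers(repeats, total_bases):
    """Rank tandem repeat monomer lengths by the fraction of sequence they cover.

    repeats     -- iterable of (period_bp, copy_number) tuples (hypothetical input,
                   e.g. distilled from a tandem repeat finder's output)
    total_bases -- total number of bases in the reads or contigs that were searched
    """
    covered = defaultdict(float)
    for period, copies in repeats:
        # Bases spanned by one tandem array = monomer length x number of copies.
        covered[period] += period * copies
    # Most abundant monomer length first.
    return sorted(
        ((period, bases / total_bases) for period, bases in covered.items()),
        key=lambda item: item[1],
        reverse=True,
    )


# Made-up hits for illustration; not data from the paper.
hits = [(171, 4.0), (171, 3.5), (171, 5.0), (68, 2.1), (340, 2.0)]
ranked = rank_monomers(hits, total_bases=10_000)
top_period, top_fraction = ranked[0]
print(top_period, round(top_fraction, 4))
```

Under these assumptions, the top-ranked monomer length is the candidate centromere repeat; the remaining entries correspond to the background of less abundant tandem repeats.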