Long terminal repeat (LTR) retrotransposons make up a large fraction of the typical mammalian genome. They comprise about 8% of the human genome and approximately 10% of the mouse genome. On account of their abundance, LTR retrotransposons are believed to hold major significance for genome structure and function. Recent advances in genome sequencing of a variety of model organisms have provided an unprecedented opportunity to better evaluate the diversity of LTR retrotransposons resident in eukaryotic genomes.
Using a new data-mining program, LTR_STRUC, in conjunction with conventional techniques, we have mined the GenBank mouse (Mus musculus) database and the more complete Ensembl mouse dataset for LTR retrotransposons. We report here that the M. musculus genome contains at least 21 separate families of LTR retrotransposons; 13 of these families are described here for the first time.
All families of mouse LTR retrotransposons are members of the gypsy-like superfamily of retroviral-like elements. Several different families of unrelated non-autonomous elements were identified, suggesting that the evolution of non-autonomy may be a common event. High sequence similarity between several LTR retrotransposons identified in this study and those found in distantly-related species suggests that horizontal transfer has been a significant factor in the evolution of mouse LTR retrotransposons.
Retrotransposons are mobile genetic elements that make up a large fraction of most eukaryotic genomes. All retrotransposons are distinguished by a life cycle involving an RNA intermediate. The RNA genome of a retroelement is copied by reverse transcriptase into a double-stranded DNA molecule, which is subsequently integrated into the host's genome. Retrotransposons fall into two main categories: those with long terminal repeats (LTRs), such as retroviruses and LTR retrotransposons, and those that lack such repeats, for example, long interspersed nuclear elements (LINEs).
Retrotransposons are particularly abundant in plants, where they are often a principal component of nuclear DNA. In corn, 50-80%, and in wheat fully 90%, of the genome is made up of retrotransposons [1,2]. This percentage is generally lower in animals than in plants but it can still be significant. For example, about 8% of the human genome is now known to be composed of LTR retrotransposons. In the mouse genome this figure has been estimated at 10%.
This article presents the results of a recent survey (December 2002) of the GenBank mouse (M. musculus) database (GBMD) and the 2.9 Gbp Ensembl mouse dataset (EMD) for the presence of LTR retrotransposons. We have employed a new search program, LTR_STRUC (LTR retrotransposon structure program), as the initial data-mining tool in our survey. Identified elements were subjected to sequence analyses to identify open reading frames (ORFs) encoding reverse transcriptase (RT) and other retroviral proteins. LTR_STRUC finds only full-length elements, that is, ones having two LTRs and a pair of target site duplications (TSDs). We therefore augmented our search approach by conducting BLAST searches using reverse transcriptase queries. These queries were of two types: previously known RTs in the public database from mouse and other mammals, and RTs obtained from our initial scan of the EMD with LTR_STRUC. Subsequent RT sequence alignments were carried out, followed by construction of phylogenetic trees.
An LTR retrotransposon 'family' is defined as a group of elements with RTs at least 90% similar at the amino acid level. Experience has shown that when two elements have RTs that are 90% similar, their LTRs are typically about 60% similar. Thus, non-autonomous elements, lacking an RT ORF, are assigned to the same family if their LTRs are at least 60% similar. Many LTR retrotransposons replicate non-autonomously. Four different families of murine LTR retrotransposons have non-autonomous members: MalR elements, ETn elements, VL30 elements and a new type identified in this study, related to IAP elements. These non-autonomous elements are discussed below. Non-autonomous elements can reach a high copy number even though they lack an RT ORF [4,8-11].
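The two-tier family criterion described above can be written as a short decision rule. The following Python sketch is only an illustration of that rule; the function name and the way similarities are passed in are assumptions made for this example, and the percent-similarity values are presumed to come from a separate alignment step that is not shown.

```python
from typing import Optional

RT_FAMILY_THRESHOLD = 90.0   # percent amino acid similarity between RTs
LTR_FAMILY_THRESHOLD = 60.0  # percent similarity between LTRs (used when no RT ORF exists)


def same_family(rt_similarity: Optional[float], ltr_similarity: Optional[float]) -> bool:
    """Apply the family criteria: RT similarity when available, otherwise LTR similarity.

    Hypothetical helper: pass None for a similarity that cannot be measured,
    for example the RT similarity of a non-autonomous element.
    """
    if rt_similarity is not None:
        return rt_similarity >= RT_FAMILY_THRESHOLD
    if ltr_similarity is not None:
        return ltr_similarity >= LTR_FAMILY_THRESHOLD
    raise ValueError("need an RT or an LTR similarity to compare elements")


# Values taken from comparisons reported elsewhere in this article:
print(same_family(96.0, None))   # M. dunni element vs. Mmr1_MmERV RTs
print(same_family(80.0, None))   # Mmr4 vs. Mmr2_MuLV RTs
print(same_family(None, 51.0))   # MalR vs. MuERV-L LTRs
```

With these inputs the first comparison falls within a single family, while the second and third do not, which matches the way the corresponding elements are classified in the text.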
Currently there is no standard mouse retrotransposon nomenclature. In our system of classification for mouse, LTR retrotransposons are specified by the acronym Mmr (M. musculus retrotransposon). Distinct families are indicated by number (for example, Mmr1, Mmr2, Mmr3). We have chosen to adopt the Mmr nomenclature in this study because it is consistent with the systematic logic ('Mm' indicates the genus and species of the host organism; 'r' indicates retrotransposon) used in previous articles [8,12]. In each case where we use the Mmr acronym in this article to refer to a previously named family, we also include any pre-existing name for the family.
Results and discussion
RTs from elements identified in our survey fall into numerous distinct families. All autonomous LTR retrotransposons identified were gypsy-like elements (Classes I, II, and III). Autonomous retroviral-like elements in the mouse genome usually have an overall length of between 6,000 and 9,000 bp. Results of our study indicate that the TSDs of mouse LTR retrotransposons are four to six base pairs long and that within each of the three major classes of these elements a single TSD length is characteristic (see below). With the exception of a few mutated copies, mouse LTR retrotransposons seem to have the same canonical dinucleotides terminating the LTRs as are typically found in other species (TG/CA). The LTRs of murine retroviral-like elements are generally 300-600 bp long, with the exception of mouse mammary tumor virus (MMTV) where the LTRs are some 1,300 bp in length. Our survey shows that at least 21 distinct LTR retrotransposon families exist in the mouse genome, 13 of which have not been described previously.
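The structural features mentioned above (TG/CA termini and a 4-6 bp TSD whose length is class-specific) lend themselves to a simple check. The Python sketch below is a toy illustration and not the algorithm used by LTR_STRUC; the function names and the choice to report the longest matching repeat are assumptions made for this example. The class-specific TSD lengths (4 bp, 6 bp and 5 bp) are those reported in the class sections of this article.

```python
from typing import Optional

# TSD lengths reported for each class in this article.
TSD_LENGTH_TO_CLASS = {4: "Class I", 6: "Class II", 5: "Class III"}


def has_canonical_termini(ltr: str) -> bool:
    """True if the LTR begins with TG and ends with CA."""
    ltr = ltr.upper()
    return len(ltr) >= 4 and ltr.startswith("TG") and ltr.endswith("CA")


def find_tsd(upstream: str, downstream: str) -> Optional[str]:
    """Return the longest 4-6 bp direct repeat immediately flanking an element.

    `upstream` ends at the base just before the 5' LTR and `downstream`
    starts at the base just after the 3' LTR. Returns None if no repeat
    of 4, 5 or 6 bp is found.
    """
    upstream, downstream = upstream.upper(), downstream.upper()
    for k in (6, 5, 4):
        if len(upstream) >= k and len(downstream) >= k and upstream[-k:] == downstream[:k]:
            return upstream[-k:]
    return None


# Invented flanking sequences, for illustration only.
tsd = find_tsd("TTGACCGT", "ACCGTAAG")
if tsd is not None:
    print(tsd, TSD_LENGTH_TO_CLASS[len(tsd)])
print(has_canonical_termini("TGTTAGCA"))
```

A TSD length alone does not establish class membership; in this article class assignment rests on RT phylogeny, and the lookup table is only a convenient consistency check.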
LTR retrotransposon families of the murine genome
To date, LTR retrotransposon diversity has been rigorously classified into families for only a few organisms (for example, Oryza sativa, Drosophila melanogaster and Caenorhabditis elegans). This article represents a first attempt to establish a similar uniform classification and nomenclature for the domestic mouse. Previous studies have classified murine retrotransposons into broad categories only, which ignore the standard definition of 'family' (see above). For example, the term 'intracisternal type A particle' (IAP) has been used to refer to elements that belong to several distinct LTR-retrotransposon phylogenetic groups. The autonomous elements identified in our survey of the GBMD and EMD fall into 20 families on the basis of degree of RT divergence (greater than 10% denotes family). In addition, we have classified MalR elements, which are non-autonomous, into a twenty-first family that is closely related to MuERV-L elements, because these two types of transposons have similar LTRs. MusD and ETn elements form a second pair of related autonomous and non-autonomous elements; MmERV and VL30 elements constitute a third. These three paired families are discussed in more detail below.
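One way to picture how a 10% divergence cutoff partitions elements into families is a single-linkage grouping over pairwise RT divergences. The Python sketch below is an assumption-laden illustration: the article does not state which grouping procedure was used, and the function name, element names and divergence values are all invented for this example.

```python
from typing import Dict, List, Set, Tuple


def group_into_families(
    names: List[str],
    divergence: Dict[Tuple[str, str], float],
    cutoff: float = 10.0,
) -> List[Set[str]]:
    """Single-linkage grouping: merge elements whose RTs diverge by <= cutoff percent."""
    parent = {name: name for name in names}

    def find(name: str) -> str:
        # Follow parent links to the group representative, compressing the path.
        while parent[name] != name:
            parent[name] = parent[parent[name]]
            name = parent[name]
        return name

    for (a, b), value in divergence.items():
        if value <= cutoff:
            parent[find(a)] = find(b)

    families: Dict[str, Set[str]] = {}
    for name in names:
        families.setdefault(find(name), set()).add(name)
    return list(families.values())


# Hypothetical elements and percent divergences, for illustration only.
names = ["elem_a", "elem_b", "elem_c", "elem_d"]
divergence = {
    ("elem_a", "elem_b"): 4.0,
    ("elem_b", "elem_c"): 8.0,
    ("elem_a", "elem_c"): 12.0,
    ("elem_a", "elem_d"): 30.0,
    ("elem_b", "elem_d"): 28.0,
    ("elem_c", "elem_d"): 25.0,
}
print(group_into_families(names, divergence))
```

Note that single linkage places elem_a and elem_c together through elem_b even though they differ from each other by more than the cutoff; a stricter (complete-linkage) rule would split them, so the choice of linkage is itself a design decision.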
Our analysis supports previous categorization of mouse LTR retrotransposons into three distinct classes (Figure 1): Class I, containing elements related to retroviral leukemia viruses in mouse (MuLV) and other species (for example, gibbon: GALV and cat: FeLV); Class II, which contains the IAP elements, mouse mammary tumor virus (MMTV) and the MusD2/ETn family; and Class III, which comprises the MalR and MuERV-L elements. In using these names for the three main categories of murine LTR retrotransposons we follow the usage of the Mouse Genome Sequencing Consortium, but the reader is cautioned that the same terminology has been used to designate RNA-based transposons (Class I) and DNA-based transposons (Class II). Here, however, all three classes are RNA-based LTR retrotransposons.
Figure 1. Unrooted RT-based neighbor-joining tree for all three classes of murine retrotransposons. RT sequences from host species other than mouse are included for comparison.
Class I (families 1-4)
Members of this class make up 0.68% of the mouse genome (copy number about 34,000). They have 4-bp TSDs and are related to murine leukemia virus (MuLV; AF033811), a C-type retrovirus that occurs only in mice and is a major cause of cancer in that genus. Class I, to which MuLV belongs, contains at least three other families: Mmr1_MmERV, Mmr3_MuRRS, and Mmr4. In this article, MuLV is referred to as family Mmr2_MuLV. Class I endogenous retroviruses are more closely related to elements in other species than to mouse retroelements belonging to Classes II or III. RTs from endogenous retrovirus in pig (PK15; AF038601) and koala (KoRV; AAF15098), as well as from leukemia viruses in gibbon (GALV; AAA466810) and cat (FeLV; L06140), group with this class; their RTs are all about 80% similar at the amino acid level to those of murine Class I elements. One member of Class I is found in two different mouse species, M. musculus and M. dunni, and has previously been referred to as either MmERV (in M. musculus) or MDEV (in M. dunni); here it is referred to as Mmr1_MmERV. The identity of this family in these two species is demonstrated by the presence of an element (AAC31805) in the M. dunni (Indian pigmy mouse) genome, which is 96% similar (at the amino acid level) to members of Mmr1_MmERV resident in M. musculus (Figure 2). This finding is consistent either with a recent common origin of these two mouse species or with a horizontal transfer of this retrovirus. This virus may be infectious since an envelope protein sequence is present in the GenBank database (AAC31806) for the M. dunni retrovirus and has also been detected in copies of this family during our own survey of M. musculus. Mmr4 is a previously unrecognized Class I family, with members about 80% similar to those of Mmr2_MuLV. Family Mmr3_MuRRS includes the so-called murine retroviral related sequences (MuRRS).
A known human endogenous retrovirus type C oncoviral sequence (AAA73090) is approximately 56% similar at the amino acid level to members of Class I. BLAST searches with RT queries from Class I indicate that at least some elements in the human genome are even more similar (>65%) to Class I elements in mouse (for example, HSAP-2; Figure 2 and Table 1).
Figure 2. RT-based neighbor-joining tree for Class I murine retrotransposons. The distances (uncorrected 'p') appear next to each of the branches. RT sequences from host species other than mouse are included for comparison. The outgroup is the Class II element GH-G18 (from golden hamster, Mesocricetus auratus; see Table 3 and Figure 3).
Table 1. Non-murine RTs obtained from translating BLAST
Class II (families 5-19)
Class II retroviral-like elements make up 3.14% of the mouse genome (copy number approximately 127,000). This class contains 15 of the 21 murine LTR families. Its members have 6 bp TSDs and are related to MMTV (NC_001503), an oncogenic B-type retrovirus that causes breast cancer in mice. Our survey has revealed only three full-length copies of a member of this family (Mmr11_MMTV) in the mouse genome. MMTV contains an ORF coding for envelope protein (BAA03768). Mmr11_MMTV RTs are also 75% similar to those of a separate endogenous mouse family, Mmr16. For the most part, Mmr16 seems to be represented in the mouse genome by fragmentary elements, but the full-length element Mmr16-1 described in Table 2 has a full complement of retroviral genes, including an envelope ORF, as is the case with MMTV.
Table 2. Exemplars of mouse LTR retrotransposon families characterized in this study
Another family in Class II, Mmr19_MusD, has been previously described under the name MusD. Mager and Freeman, who discovered this family, showed that the non-autonomous mouse ETn retroelements (early transposons) are deletion derivatives of Mmr19_MusD. They are so closely related to MusD elements that we have assigned them to the same family. Most ETn copies are around 5,500 bp long, while MusD copies are usually around 7,400 bp in length. ETn elements (Y17107; AB033509), first reported by Brulet et al., are a moderately repetitive family of murine retrotransposons that lack most of the usual retroviral ORFs. Our survey with LTR_STRUC suggests that full-length copies of ETn elements are about half as common again as full-length MusD elements. Family Mmr12 is about 80% similar to Mmr19_MusD. Both of these families are 70% similar to Mason-Pfizer Monkey Virus (MPMV; NC_001550). The RTs of MusD elements have an unusual active site sequence: FTDDVLM ('T' is not canonical for an active site). Class II contains an additional clade (see Figure 3), comprising at least eight additional families (Mmr6, Mmr7, Mmr9, Mmr10_IAP, Mmr14, Mmr15, Mmr17, and Mmr18) with no two families differing from any other by more than 70%. The major constituents of this clade are the IAP retrotransposons, the second most abundant family in the mouse genome, here referred to as family Mmr10_IAP. They lack complete env genes and thus are considered non-infective. Murine elements identified in GenBank as IAP (for example, GNPSIP and GNMSIA) are restricted to family Mmr10_IAP. Nevertheless, members of any of the eight families listed above have been described as IAP by various authors. In addition, a family of retroelements in golden hamster (GH-G18; Figure 3) has been described as 'IAP' but does not actually belong to the Mmr10_IAP family (its RT ORFs differ from those of Mmr10_IAP by about 18% at the amino acid level).
Thus, in mice, the term IAP might best be restricted to Mmr10_IAP. Numerous IAP elements share a common, 1,800-bp deletion that includes the upstream end of the RT. Yet these elements were, and perhaps still are, capable of transposing as evidenced by the fact that copies with the same deletion were found on many different chromosomes. Even shorter, internally-deleted elements, with two LTRs and ostensibly capable of transposition, can be assigned to Mmr10_IAP on the basis of LTR similarity (down to about 2,700 bp in overall length).
Figure 3. RT-based neighbor-joining tree for Class II murine retrotransposons. The distances (uncorrected 'p') appear next to each of the branches. RT sequences from host species other than mouse are included for comparison. The outgroup is the Class I element MDEV (from house/rice field mouse, M. dunni; see Table 3 and Figure 2).
Class III (families 20 and 21)
Members of this class make up 5.40% of the mouse genome (copy number about 442,500). They have 5 bp TSDs. Class III has two constituents: murine ERV-L elements (MuERV-L), which have an estimated copy number of 37,000; and the non-autonomous MalRs (mammalian apparent LTR retrotransposons), which are the most common retroviral element in the mouse genome, making up 4.8% of it. MuERV-L elements are closely related to human endogenous retrovirus L (HERV-L). In BLAST searches we have identified a human element (HSAP-1; Table 1 and Figure 4) that is 85% similar at the amino acid level to MuERV-L RTs. Because alignments show that the LTRs of murine MalRs and MuERV-L elements are 51% similar, we conclude that the two share a recent common ancestor. However, as they are not quite sufficiently similar to be members of the same family, we have assigned these families the names Mmr20_MuERV-L and Mmr21_MaLR.
Figure 4. RT-based neighbor-joining tree for Class III murine retrotransposons. Distances (uncorrected 'p') appear next to each of the branches. RT sequences from host species other than mouse are included for comparison. The outgroup is the Class II element GH-G18 (from golden hamster, Mesocricetus auratus; see Table 3 and Figure 3).
Like MalRs in other species, murine MalRs are all internally deleted. The internal region contains only non-coding repetitive DNA. Nevertheless they have typical LTRs, a primer binding site and a polypurine tract. Members of Mmr21_MaLR are of two types: MT MalRs, the most common type of LTR retrotransposon in the mouse genome (mean length approximately 1,980 bp); and ORR1 MalRs (mean length approximately 2,460 bp). Our survey suggests that in the mouse genome, MT MalRs are about ten times as common as their longer relatives, the ORR1 MalRs. Non-truncated copies of Mmr20_MuERV-L elements have an overall length of about 6,400 bp.
Length variation in murine LTR retrotransposons
Although all copies of family Mmr10_IAP found by LTR_STRUC have two LTRs and recognizable TSDs (as required by the search algorithm employed by the program), the individual members of this abundant family vary widely in overall length (2,700-7,200 bp) due to the presence of internal deletions of varying length. On the other hand, the two abundant types of non-autonomous Class III elements (MT and ORR1 MalRs) exhibit a markedly different pattern of variation from that of Mmr10_IAP elements. Lengths of ORR1 MalRs peak sharply at 2,300 bp and those of MT MalRs at 1,980 bp, with very few elements in either case differing from these peak frequencies by more than 100 bp (<1%). Moreover, Mmr10_IAP elements, from the shortest to the longest, are preponderantly represented by copies with a high level of LTR-LTR identity (>99%), a finding consistent with recent transposition. The ability of internally truncated Mmr10_IAPs to complete their replication cycle is consistent with the fact that a number of Mmr10_IAP copies bearing the same 1,800-bp deletion (affecting the polyprotein ORF) were found in our survey on a variety of different mouse chromosomes. A similar dispersed distribution of lengths was observed in two other families, Mmr19_MusD and Mmr1_MmERV. Comparison of a VL30 element (AF486451) with our data revealed a high degree of LTR-LTR similarity (>90%) to elements in family Mmr1_MmERV; VL30 elements are therefore assigned to that family (VL30s are non-autonomous and cannot be compared with other elements on the basis of RT similarity).
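The contrast drawn above, between a sharply peaked length distribution and a dispersed one, can be summarized by locating the most populated length bin and measuring what fraction of elements lie within 100 bp of it. The Python sketch below is one possible way to do this; the function name, the 20-bp bin width and the element lengths are invented for illustration and are not data from the survey.

```python
from collections import Counter
from typing import Sequence, Tuple


def peak_and_spread(
    lengths: Sequence[int], bin_width: int = 20, window: int = 100
) -> Tuple[int, float]:
    """Return (peak length, fraction of elements within `window` bp of the peak).

    The peak is taken as the midpoint of the most populated bin; ties are
    resolved in favor of the shorter bin.
    """
    if not lengths:
        raise ValueError("no element lengths supplied")
    bins = Counter(length // bin_width for length in lengths)
    top_bin = max(bins, key=lambda b: (bins[b], -b))
    peak = top_bin * bin_width + bin_width // 2
    near = sum(1 for length in lengths if abs(length - peak) <= window)
    return peak, near / len(lengths)


# Invented lengths (bp): a tightly clustered set and a dispersed set.
tight = [1970, 1980, 1985, 1990, 1975, 2400]
dispersed = [2700, 3500, 4300, 5400, 6100, 7200]
print(peak_and_spread(tight))
print(peak_and_spread(dispersed))
```

A fraction close to 1 corresponds to the sharply peaked pattern described for the MalRs, while a small fraction corresponds to the dispersed pattern described for Mmr10_IAP.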
Certain families of mouse LTR retrotransposons are more closely related to elements present in other species than to other classes of mouse elements. For example, murine Class I elements are more similar to viruses in gibbon, pig, cat, and koala, than to murine retrotransposons of Classes II or III (Figure 2). Among Class II murine endogenous retroviruses (Figure 3), family Mmr10_IAP is more closely related to the golden hamster element GH-G18 than it is to any other family of murine retroviral elements. Similarly, the amino acid sequences (RT ORFs) of members of Mmr20_MuERV-L (mouse Class III elements, Figure 4) differ from a human element (for example, HSAP-1, Table 3) by only 15%, but differ from those of any non-Class III element by more than 60%. Such findings suggest that horizontal transfer may have been a source of new mouse LTR retrotransposon families over evolutionary time.
Table 3. Known RTs used for comparison in phylogenies
All autonomous retrotransposons identified in our study were retroviral-like elements (of Classes I, II, and III). At least 21 distinct families of murine LTR retrotransposons exist. Families Mmr4, Mmr5, Mmr6, Mmr7, Mmr8, Mmr9, Mmr12, Mmr13, Mmr14, Mmr15, Mmr16, Mmr17, and Mmr18 have not been previously recognized, 13 families in all. These new families are all Class II elements (with the exception of Mmr4, which belongs to Class I) and are thus akin to immune deficiency viruses such as simian retrovirus SRV-1, to mouse mammary tumor virus (MMTV), and to IAP elements.
Our purpose in using LTR_STRUC to begin our survey of the mouse genome was to obtain a broadly representative sample of murine retrotransposons. Since the algorithm it employs is not dependent upon sequence homology, as in standard search methods such as BLAST, the initial results of our survey presumably were not biased toward a particular set of queries. Also, since the current version of LTR_STRUC now categorizes the elements it locates and assigns a new name to any element that differs sufficiently from any found earlier in the search, the chances of overlooking low-copy families have been reduced. The thoroughness of our BLAST search can only have been augmented by using LTR_STRUC because, in the BLAST phase of our survey, the queries used were a combination of those element types already recognized, prior to our investigation, with those found by LTR_STRUC. We believe this approach is the reason we were able to identify the 13 previously unreported families listed above.
Materials and methods
Using a new data-mining program, LTR_STRUC, we have mined the Ensembl mouse (M. musculus) dataset for LTR retrotransposons. We have used elements found in this initial search, as well as murine LTR retrotransposons identified by previous workers, to conduct BLAST searches of the GenBank mouse database.
Automated characterization of LTR retrotransposons
The methods used in our survey of the mouse genome are essentially the same as those used in our earlier study of the rice genome and are described elsewhere. Briefly, we began our survey by using a new computer program, LTR_STRUC, which identifies new LTR retrotransposons based on the presence of characteristic retroelement features. Additional elements were identified by BLAST searches using the RTs, both of elements located by LTR_STRUC and of ones recognized in earlier studies.
Initial scans with LTR_STRUC were conducted on a dataset consisting of the 2.9 Gbp of M. musculus sequence data available in the Ensembl database at the time of the initial scan (December 2002). The dataset (EMD) was obtained from the Ensembl website. In an effort to identify additional elements not picked up in the initial survey with LTR_STRUC, we have used representative sequences from each retrotransposon family identified in this study as queries to conduct BLAST searches against the GenBank mouse database (GBMD). Thus, the results reported here constitute a reasonably unbiased survey of LTR-retrotransposon diversity in mouse. RT sequences were identified according to previously described criteria [16,17].
Multiple sequence alignments and phylogenetic analyses
The RT domains of the various Mmr elements were aligned, as described elsewhere, with previously reported RT sequences (Table 3). In the case of elements lacking an RT sequence because of fragmentation or internal truncation, the LTR sequences were used to assign them to the proper family.
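The figure legends report uncorrected 'p' distances, that is, the proportion of aligned sites at which two sequences differ. As a minimal Python sketch of that quantity, assuming two rows taken from the same multiple alignment and assuming that columns with a gap in either row are skipped (the article does not state how gaps were treated), one might write the following. The function name and example sequences are invented for illustration.

```python
def p_distance(seq_a: str, seq_b: str, gap: str = "-") -> float:
    """Proportion of differing sites between two aligned sequences.

    Columns with a gap in either sequence are skipped (pairwise deletion),
    which is an assumption of this sketch.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must come from the same alignment")
    compared = differences = 0
    for a, b in zip(seq_a.upper(), seq_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a != b:
            differences += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return differences / compared


# Invented aligned fragments, for illustration only.
print(p_distance("ACDEFG", "ACDQFG"))
print(p_distance("AC-EFG", "ACDEYG"))
```

The complement, 1 - p, is percent identity rather than percent similarity, so it need not match the similarity figures quoted in the text, which may count conservative amino acid substitutions as similar.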
SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, Melake-Berhan A, Springer PS, Edwards KJ, Lee M, Avramova Z, Bennetzen JL: Nested retrotransposons in the intergenic regions of the maize genome.
Philos Trans R Soc Lond B Biol Sci 1986, 312:227-242.
Proc Natl Acad Sci USA 1983, 80:5641-5645.
Adv Cancer Res 1988, 51:183-276.
EMBO J 1990, 9:3353-3362.