<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2001-2-9-research0037</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Research</dochead>
      <bibl>
         <title>
            <p>On the species of origin: diagnosing the source of symbiotic transcripts</p>
         </title>
         <aug>
            <au id="A1" ca="yes">
               <snm>Hraber</snm>
               <mi>T</mi>
               <fnm>Peter</fnm>
               <insr iid="I1"/>
               <email>pth@santafe.edu</email>
            </au>
            <au id="A2">
               <snm>Weller</snm>
               <mi>W</mi>
               <fnm>Jennifer</fnm>
               <insr iid="I2"/>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>Santa Fe Institute, 1399 Hyde Park Road, Santa Fe, NM 87501, USA</p>
            </ins>
            <ins id="I2">
               <p>Virginia Bioinformatics Institute, 1750 Kraft Drive, Suite 400, Blacksburg, VA 24061, USA</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2001</pubdate>
         <volume>2</volume>
         <issue>9</issue>
         <fpage>research0037.1</fpage>
         <lpage>research0037.14</lpage>
         <url>http://genomebiology.com/2001/2/9/research/0037</url>
         <xrefbib>
            <pubidlist>
               <pubid idtype="doi">10.1186/gb-2001-2-9-research0037</pubid>
               <pubid idtype="pmpid">11574056</pubid>
            </pubidlist>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>11</day>
               <month>6</month>
               <year>2001</year>
            </date>
         </rec>
         <revrec>
            <date>
               <day>11</day>
               <month>7</month>
               <year>2001</year>
            </date>
         </revrec>
         <acc>
            <date>
               <day>25</day>
               <month>7</month>
               <year>2001</year>
            </date>
         </acc>
         <pub>
            <date>
               <day>23</day>
               <month>8</month>
               <year>2001</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2001</year>
         <collab>Hraber and Weller, licensee BioMed Central Ltd</collab>
      </cpyrt>
      <shortabs>
         <p>Symbiotic interactions range from pathogenic to mutualistic, and many of the underlying molecular mechanisms remain to be discovered. Given a sequence expressed in an interaction between two symbionts, the challenge is to determine from which organism the transcript originated. Previous investigations into GC content and comparative similarity searching provide solutions, but a comparative lexical analysis, which uses a likelihood-ratio test of hexamer counts, is more powerful. Microbial transcripts comprised 75% of a <it>Phytophthora sojae</it>-infected soybean library, contrasted with 15% or less in root tissue libraries of <it>Medicago truncatula</it> from axenic, <it>P. medicaginis</it>-infected, mycorrhizal, and rhizobacterial treatments. Mycorrhizal libraries contained about 23% microbial transcripts; an axenic plant library contained a similar proportion of putative microbial transcripts. Many of the transcripts isolated from mixed cultures were of unknown function, suggesting specificity to symbiotic metabolism.</p>
      </shortabs>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <sec>
               <st>
                  <p>Background</p>
               </st>
               <p>Most organisms have developed ways to recognize and interact with other species. Symbiotic interactions range from pathogenic to mutualistic. Some molecular mechanisms of interspecific interaction are well understood, but many remain to be discovered. Expressed sequence tags (ESTs) from cultures of interacting symbionts can help identify transcripts that regulate symbiosis, but present a unique challenge for functional analysis. Given a sequence expressed in an interaction between two symbionts, the challenge is to determine from which organism the transcript originated. For high-throughput sequencing from interaction cultures, a reliable computational approach is needed. Previous investigations into GC nucleotide content and comparative similarity searching provide provisional solutions, but a comparative lexical analysis, which uses a likelihood-ratio test of hexamer counts, is more powerful.</p>
            </sec>
            <sec>
               <st>
                  <p>Results</p>
               </st>
               <p>Validation with genes whose origin and function are known yielded 94% accuracy. Microbial (non-plant) transcripts comprised 75% of a <it>Phytophthora sojae</it>-infected soybean (<it>Glycine max</it> cv Harasoy) library, contrasted with 15% or less in root tissue libraries of <it>Medicago truncatula</it> from axenic, <it>Phytophthora medicaginis</it>-infected, mycorrhizal, and rhizobacterial treatments. Mycorrhizal libraries contained about 23% microbial transcripts; an axenic plant library contained a similar proportion of putative microbial transcripts.</p>
            </sec>
            <sec>
               <st>
                  <p>Conclusions</p>
               </st>
               <p>Comparative lexical analysis offers numerous advantages over alternative approaches. Many of the transcripts isolated from mixed cultures were of unknown function, suggesting specificity to symbiotic metabolism and making them likely candidates for further functional investigation. Future investigations will determine whether the abundance of non-plant transcripts in a pure plant library indicates procedural artifacts, horizontally transferred genes, or other phenomena.</p>
            </sec>
         </sec>
      </abs>
   </fm>
   <meta>
      <classifications>
         <classification type="BMC" subtype="man_spc_id" id="30010014">Microbiology and parasitology</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010010">Genome studies</classification>
         <classification type="BMC" subtype="man_spc_id" id="30010019">Plant biology</classification>
      </classifications>
   </meta>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>Access to automated DNA sequencing technology has made possible the rapid generation and analysis of gene transcripts expressed in organisms via expressed sequence tags (ESTs) [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>,<abbr bid="B3">3</abbr>,<abbr bid="B4">4</abbr>,<abbr bid="B5">5</abbr>]. This information has helped to identify those genes expressed in particular stages of development and in specialized tissues or organs [<abbr bid="B6">6</abbr>,<abbr bid="B7">7</abbr>,<abbr bid="B8">8</abbr>,<abbr bid="B9">9</abbr>,<abbr bid="B10">10</abbr>]. Novel gene products and target leads for therapeutic intervention can also be gleaned rapidly from ESTs [<abbr bid="B1">1</abbr>,<abbr bid="B2">2</abbr>,<abbr bid="B11">11</abbr>]. A more detailed understanding of the molecular interactions between symbionts, whether pathogenic or mutualistic [<abbr bid="B12">12</abbr>], is also possible with this approach [<abbr bid="B13">13</abbr>,<abbr bid="B14">14</abbr>,<abbr bid="B15">15</abbr>,<abbr bid="B16">16</abbr>,<abbr bid="B17">17</abbr>].</p>
         <p>For a sequence isolated from interacting symbionts, determining its cellular role (or roles) is complicated by not knowing which species expressed the sequence [<abbr bid="B18">18</abbr>]. We refer to this challenge as 'the problem': given a sequence <it>x</it> expressed in an interaction between species <it>A</it> and <it>B</it>, did <it>x</it> originate from <it>A</it> or <it>B</it>? Various solutions are readily conceived, each with merits and faults. Here, we show that a comparative lexical analysis of word counts (specifically, hexamer frequencies), previously used to detect library contamination in sequencing projects [<abbr bid="B19">19</abbr>], provides a powerful computational basis to infer a transcript's species of origin.</p>
         <p>Experimentally, one can attempt to solve the problem by hybridizing a clone (as probe) to genomic DNA (target) from both species and determining to which target the probe hybridizes. This approach can produce very reliable results. However, if a sequence is highly conserved in the two taxa, hybridization stringency conditions can influence the outcome considerably. For high-throughput EST sequence analysis, source verification by hybridization is impractical in terms of time and reagents. As an alternative to <it>in vitro</it> hybridization, several computational solutions are possible.</p>
         <p>Were the genome sequences of both species completely determined, one could simply use sequence similarity searching [<abbr bid="B20">20</abbr>,<abbr bid="B21">21</abbr>,<abbr bid="B22">22</abbr>]. However, most plant hosts and their microbial symbionts have little or no genomic sequence data available, which makes this approach very unreliable. Strong similarity to a sequence from one organism does not preclude the possibility that a similar sequence is present in the other species. Conclusions based upon such partial knowledge have been informative, but are potentially misleading [<abbr bid="B18">18</abbr>,<abbr bid="B23">23</abbr>].</p>
         <p>Codon usage varies across taxa [<abbr bid="B24">24</abbr>,<abbr bid="B25">25</abbr>,<abbr bid="B26">26</abbr>]. Exploiting this fact may seem a viable solution to the problem, as it has proven suitable for predicting the presence of introns among exons in genomic DNA. However, it is not practical here, because it requires knowing the reading frame in which a messenger RNA is translated into an amino-acid sequence. EST data are of notoriously unreliable quality, sometimes having a large proportion of ambiguous bases, and sometimes having single base-pair insertions or deletions, which disrupt a reading frame. Word counting is less prone to these sources of error, and uses information intrinsic to biases in codon usage by counting codon pairs as hexamers in a sliding window, whereas codons are read in non-overlapping, tiled windows.</p>
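         <p>The contrast between sliding-window hexamers and tiled codons can be illustrated with a minimal sketch. The function names, the toy sequence, and the choice to skip words containing ambiguous bases are assumptions made for illustration, not the authors' implementation.</p>
         <p>
```python
from collections import Counter

def sliding_words(seq, k=6):
    """Count every k-mer at every offset (overlapping windows).

    Words containing anything other than A, C, G, T are skipped; this
    handling of ambiguous bases is an assumption of the sketch.
    """
    seq = seq.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word).issubset("ACGT"):
            counts[word] += 1
    return counts

def tiled_codons(seq, frame=0):
    """Count non-overlapping triplets read from one assumed reading frame."""
    seq = seq.upper()
    counts = Counter()
    for i in range(frame, len(seq) - 2, 3):
        counts[seq[i:i + 3]] += 1
    return counts

# Toy transcript (hypothetical): ATG, four GCT codons, then TAA.
original = "ATGGCTGCTGCTGCTTAA"
# Simulate a single-base deletion, as can occur in EST reads.
shifted = original[:4] + original[5:]

codons_before = tiled_codons(original)
codons_after = tiled_codons(shifted)
hexamers_before = sliding_words(original)
hexamers_after = sliding_words(shifted)

# Hexamers that survive the deletion with a nonzero count in both reads.
shared = set(hexamers_before).intersection(hexamers_after)
```
         </p>
         <p>In this toy case the deletion removes every in-frame GCT codon from the tiled count, while most overlapping hexamers downstream of the deletion are still observed, which is the robustness the paragraph above describes.</p>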
         <p>An intuitive approach to the problem that examines sequence composition is to compare the guanine and cytosine (GC) base content of a sequence with other sequences from the species being studied. When two species' genomes have different GC content, this method can be very useful. In a recent investigation, for instance, sequences from the stramenopile plant pathogen <it>Phytophthora sojae</it> and its soybean (<it>Glycine max</it>) host showed a 20% difference in mean GC content [<abbr bid="B18">18</abbr>]. The origin of a number of sequences could readily be identified this way, but a large proportion could not, because of considerable overlap in the distributions' tails. Counting frequencies of GC is simple word counting, where the word size <it>k</it> is 1/2: only two semi-words, G/C and A/T, are counted.</p>
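         <p>As a minimal sketch of the GC measure discussed above, the fraction of G and C among unambiguous bases can be computed as follows. The function name and the decision to ignore ambiguous bases such as N are assumptions for illustration; no classification threshold is implied, since the source reports that the two species' distributions overlap.</p>
         <p>
```python
def gc_fraction(seq):
    """Return the fraction of G or C among the unambiguous bases of seq.

    Ambiguous symbols (for example N) are ignored; a sequence with no
    unambiguous bases yields 0.0. Both choices are assumptions.
    """
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0

# Hypothetical reads, not taken from any library in the study.
balanced = gc_fraction("GGCCAATT")
with_ambiguity = gc_fraction("GCNNAT")
```
         </p>
         <p>Only two categories are tallied here (G/C versus A/T), which is why the measure carries less information than counts over longer words.</p>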
         <p>An alternative approach to determining the origin of a sequence is suggested by previous work on analysis of word counts, or <it>k</it>-tuple frequencies, which was intended as a means of evaluating a library for contamination when sequencing from a single model organism [<abbr bid="B19">19</abbr>]. The word-counting method provides distinct advantages over other computational methods. Unlike sequence-similarity searching, it does not require that the full protein-coding content of both genomes be known for reasonable inferences to be made. Further, word counting is sensitive to biases in codon usage and GC content commonly observed when comparing taxa, but does not require knowledge of the reading frame for amino-acid translation. That is, the underlying differences between the two organisms that result in base composition or codon usage biases can also be detected by counting words. Unlike GC analysis, lexical analysis establishes a clear threshold above or below which we can infer the species of origin, and a confidence level for an inference can readily be assigned. Dunning's likelihood-ratio test of word dissimilarities [<abbr bid="B27">27</abbr>] also has the appealing property of being non-parametric, having no assumption of normality for the underlying frequency distribution, which makes it statistically powerful [<abbr bid="B28">28</abbr>]. Dunning [<abbr bid="B27">27</abbr>] demonstrated that unreliable results can be obtained from parametric tests, such as &#967;<sup>2</sup>, particularly in such cases as lexical analysis.</p>
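         <p>For orientation, a likelihood-ratio statistic of the kind attributed to Dunning [27] can be sketched for a single word observed in two samples, using the binomial form of the log-likelihood ratio. The function names and the example counts are hypothetical, and how per-word statistics are combined into the dissimilarity reported in the Results is not specified in this section, so no aggregation is shown.</p>
         <p>
```python
import math

def _log_likelihood(k, n, p):
    """Binomial log-likelihood of k hits in n trials at rate p.

    The constant binomial coefficient is omitted because it cancels in
    the ratio; terms with a zero count are dropped (0 * log 0 = 0).
    """
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total

def log_likelihood_ratio(k1, n1, k2, n2):
    """Return -2 log(lambda) for one word seen k1 of n1 and k2 of n2 times.

    The null hypothesis is that both samples share a single rate; larger
    values indicate that the word's frequency differs between samples.
    """
    p1 = k1 / n1
    p2 = k2 / n2
    pooled = (k1 + k2) / (n1 + n2)
    return 2.0 * (
        _log_likelihood(k1, n1, p1)
        + _log_likelihood(k2, n2, p2)
        - _log_likelihood(k1, n1, pooled)
        - _log_likelihood(k2, n2, pooled)
    )

# Hypothetical counts of one hexamer in two samples of 100 words each.
same_rate = log_likelihood_ratio(10, 100, 10, 100)
different_rate = log_likelihood_ratio(30, 100, 10, 100)
```
         </p>
         <p>Because the statistic is built from the counts themselves rather than from a normal approximation, it remains defined for rare words, including a word that is absent from one of the two samples.</p>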
         <p>In the experiments detailed below, we first validate the word-counting method on sequences whose origin and function are known, then compare its ability to diagnose the origin of sequences with that of distributions of GC content. We examine sequences from pathogenic interactions between species from the genus <it>Phytophthora</it> and the plant hosts <it>G. max</it> and <it>Medicago truncatula</it>, then apply the word-counting approach to sequences from two microbial mutualists in association with <it>M. truncatula</it>: the arbuscular mycorrhizal zygomycete <it>Glomus versiforme</it> and the nitrogen-fixing bacterium <it>Sinorhizobium meliloti</it>.</p>
      </sec>
      <sec>
         <st>
            <p>Results</p>
         </st>
         <p>Validation sequence accession numbers, gene names, and comparison results appear in Table <tblr tid="T1">1</tblr>. Incorrect inferences are underlined. The word-counting method was generally reliable when tested against sequences of known origin, being wrong in 3 cases out of 50: a phosphate transporter from <it>G. versiforme</it> and two <it>in planta</it>-induced genes from <it>Phytophthora infestans</it> were misidentified as plant sequences. This indicates a failure rate of 6%, with all errors being false negatives under the null hypothesis that a transcript originates from the plant host. Performance of the method was not influenced by whether the isolated source of a sequence was an mRNA or DNA molecule, as indicated by the column labeled 'mRNA (?)'.</p>
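         <p>One way to read the dissimilarity columns of Table 1 is sketched below, under the assumption that a sequence is assigned to whichever reference set gives the smallest dissimilarity. That decision rule and the function name are assumptions for illustration; the values are the four Glomus versiforme rows of Table 1, and the error rate restates the 3-in-50 figure given above.</p>
         <p>
```python
# (accession, D(A) plants, D(B1) oomycetes, D(B3) bacteria), from Table 1.
glomus_rows = [
    ("AJ009628", 2535.2, 2468.6, 2718.4),
    ("AJ009629", 2203.2, 2050.0, 2286.0),
    ("AJ009630", 7205.9, 5235.8, 5985.8),
    ("U38650", 3937.9, 5702.3, 6514.3),
]

def infer_origin(d_plant, *d_microbial):
    """Call 'plant' only when the plant dissimilarity is the smallest.

    Assumed rule: the reference set with the lowest dissimilarity wins.
    """
    return "plant" if min(d_microbial) > d_plant else "microbial"

calls = {acc: infer_origin(a, b1, b3) for acc, a, b1, b3 in glomus_rows}

# Overall validation figures reported in the text: 3 errors among 50.
errors, total = 3, 50
accuracy = (total - errors) / total
```
         </p>
         <p>Under this reading, the three chitin synthase sequences are called microbial and the phosphate transporter U38650 is called plant, which matches the misidentification of that fungal gene described above.</p>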
         <tbl id="T1">
            <title>
               <p>Table 1</p>
            </title>
            <caption>
               <p>Dissimilarity (<it>D</it>) comparison results from 50 validation sequences</p>
            </caption>
            <tblbdy cols="7">
               <r>
                  <c ca="left">
                     <p>Accession</p>
                  </c>
                  <c ca="center">
                     <p>Gene name</p>
                  </c>
                  <c ca="center">
                     <p>mRNA (?)</p>
                  </c>
                  <c ca="center">
                     <p>Length (nucleotides)</p>
                  </c>
                  <c ca="center">
                     <p><it>D</it>(<it>A</it>) plants</p>
                  </c>
                  <c ca="center">
                     <p><it>D</it>(<it>B</it><sub>1</sub>) oomycetes</p>
                  </c>
                  <c ca="center">
                     <p><it>D</it>(<it>B</it><sub>3</sub>) bacteria</p>
                  </c>
               </r>
               <r>
                  <c cspan="7">
                     <hr/>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Glomus versiforme</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ009628</p>
                  </c>
                  <c ca="center">
                     <p>chitin synthase <it>Gvchs1</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>638</p>
                  </c>
                  <c ca="center">
                     <p>2,535.2</p>
                  </c>
                  <c ca="center">
                     <p>2,468.6</p>
                  </c>
                  <c ca="center">
                     <p>2,718.4</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ009629</p>
                  </c>
                  <c ca="center">
                     <p>chitin synthase <it>Gvchs2</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>481</p>
                  </c>
                  <c ca="center">
                     <p>2,203.2</p>
                  </c>
                  <c ca="center">
                     <p>2,050.0</p>
                  </c>
                  <c ca="center">
                     <p>2,286.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ009630</p>
                  </c>
                  <c ca="center">
                     <p>chitin synthase <it>Gvchs3</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>4,116</p>
                  </c>
                  <c ca="center">
                     <p>7,205.9</p>
                  </c>
                  <c ca="center">
                     <p>5,235.8</p>
                  </c>
                  <c ca="center">
                     <p>5,985.8</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U38650</p>
                  </c>
                  <c ca="center">
                     <p>phosphate transporter</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,833</p>
                  </c>
                  <c ca="center">
                     <p>3,937.9</p>
                  </c>
                  <c ca="center">
                     <p>5,702.3</p>
                  </c>
                  <c ca="center">
                     <p>6,514.3</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Glycine max</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>J01297</p>
                  </c>
                  <c ca="center">
                     <p>actin <it>SAc3</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,620</p>
                  </c>
                  <c ca="center">
                     <p>3,322.0</p>
                  </c>
                  <c ca="center">
                     <p>4,554.6</p>
                  </c>
                  <c ca="center">
                     <p>5,329.7</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>K00821</p>
                  </c>
                  <c ca="center">
                     <p>lectin <it>Le1</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>2,152</p>
                  </c>
                  <c ca="center">
                     <p>4,124.6</p>
                  </c>
                  <c ca="center">
                     <p>6,558.3</p>
                  </c>
                  <c ca="center">
                     <p>7,928.3</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M64267</p>
                  </c>
                  <c ca="center">
                     <p>iron superoxide dismutase</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,056</p>
                  </c>
                  <c ca="center">
                     <p>2,773.6</p>
                  </c>
                  <c ca="center">
                     <p>3,761.2</p>
                  </c>
                  <c ca="center">
                     <p>4,269.2</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Medicago truncatula</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF000354</p>
                  </c>
                  <c ca="center">
                     <p>phosphate transporter <it>MtPT1</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,920</p>
                  </c>
                  <c ca="center">
                     <p>3,800.3</p>
                  </c>
                  <c ca="center">
                     <p>5,630.7</p>
                  </c>
                  <c ca="center">
                     <p>6,654.2</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF000355</p>
                  </c>
                  <c ca="center">
                     <p>phosphate transporter <it>MtPT2</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,867</p>
                  </c>
                  <c ca="center">
                     <p>3,673.9</p>
                  </c>
                  <c ca="center">
                     <p>5,390.1</p>
                  </c>
                  <c ca="center">
                     <p>6,424.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF055921</p>
                  </c>
                  <c ca="center">
                     <p><it>Mt4</it> genomic sequence</p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>954</p>
                  </c>
                  <c ca="center">
                     <p>2,631.9</p>
                  </c>
                  <c ca="center">
                     <p>4,004.4</p>
                  </c>
                  <c ca="center">
                     <p>4,539.1</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF106929</p>
                  </c>
                  <c ca="center">
                     <p>cell wall protein <it>AM1</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>885</p>
                  </c>
                  <c ca="center">
                     <p>3,433.6</p>
                  </c>
                  <c ca="center">
                     <p>4,200.0</p>
                  </c>
                  <c ca="center">
                     <p>4,774.3</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF106930</p>
                  </c>
                  <c ca="center">
                     <p>translation initiation protein <it>AM3-1</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>3,154</p>
                  </c>
                  <c ca="center">
                     <p>4,557.6</p>
                  </c>
                  <c ca="center">
                     <p>5,982.7</p>
                  </c>
                  <c ca="center">
                     <p>7,212.8</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF106931</p>
                  </c>
                  <c ca="center">
                     <p>translation initiation protein <it>AM3-2</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,384</p>
                  </c>
                  <c ca="center">
                     <p>3,371.1</p>
                  </c>
                  <c ca="center">
                     <p>4,130.0</p>
                  </c>
                  <c ca="center">
                     <p>4,644.4</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ132891</p>
                  </c>
                  <c ca="center">
                     <p><it>ha1</it> gene, exons 1&#8211;22</p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>3,620</p>
                  </c>
                  <c ca="center">
                     <p>4,383.2</p>
                  </c>
                  <c ca="center">
                     <p>8,683.6</p>
                  </c>
                  <c ca="center">
                     <p>10,730.7</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ388847</p>
                  </c>
                  <c ca="center">
                     <p><it>MtNo213</it> superoxide dismutase</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>530</p>
                  </c>
                  <c ca="center">
                     <p>2,110.2</p>
                  </c>
                  <c ca="center">
                     <p>2,219.8</p>
                  </c>
                  <c ca="center">
                     <p>2,367.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AJ388865</p>
                  </c>
                  <c ca="center">
                     <p><it>MtNo233</it> triosephosphate isomerase</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>563</p>
                  </c>
                  <c ca="center">
                     <p>2,171.6</p>
                  </c>
                  <c ca="center">
                     <p>2,405.6</p>
                  </c>
                  <c ca="center">
                     <p>2,618.6</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U16727</p>
                  </c>
                  <c ca="center">
                     <p>peroxidase precursor <it>rip1</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>2,603</p>
                  </c>
                  <c ca="center">
                     <p>4,246.1</p>
                  </c>
                  <c ca="center">
                     <p>8,210.0</p>
                  </c>
                  <c ca="center">
                     <p>9,901.9</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U38651</p>
                  </c>
                  <c ca="center">
                     <p>sugar transporter</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,728</p>
                  </c>
                  <c ca="center">
                     <p>3,619.6</p>
                  </c>
                  <c ca="center">
                     <p>5,128.4</p>
                  </c>
                  <c ca="center">
                     <p>5,976.5</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X57732</p>
                  </c>
                  <c ca="center">
                     <p>leghemoglobin <it>Mtlb1</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,073</p>
                  </c>
                  <c ca="center">
                     <p>3,021.3</p>
                  </c>
                  <c ca="center">
                     <p>5,029.1</p>
                  </c>
                  <c ca="center">
                     <p>5,845.9</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X57733</p>
                  </c>
                  <c ca="center">
                     <p>leghemoglobin <it>Mtlb2</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>592</p>
                  </c>
                  <c ca="center">
                     <p>2,045.9</p>
                  </c>
                  <c ca="center">
                     <p>3,156.0</p>
                  </c>
                  <c ca="center">
                     <p>3,568.2</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X60386</p>
                  </c>
                  <c ca="center">
                     <p>lectin <it>lec1</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,363</p>
                  </c>
                  <c ca="center">
                     <p>3,228.8</p>
                  </c>
                  <c ca="center">
                     <p>4,935.4</p>
                  </c>
                  <c ca="center">
                     <p>5,605.6</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X60387</p>
                  </c>
                  <c ca="center">
                     <p>lectin <it>lec2</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,192</p>
                  </c>
                  <c ca="center">
                     <p>3,142.8</p>
                  </c>
                  <c ca="center">
                     <p>4,472.6</p>
                  </c>
                  <c ca="center">
                     <p>4,985.9</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X82216</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>lec3</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,155</p>
                  </c>
                  <c ca="center">
                     <p>2,928.4</p>
                  </c>
                  <c ca="center">
                     <p>4,283.3</p>
                  </c>
                  <c ca="center">
                     <p>4,930.8</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X68032</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ENOD12</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>772</p>
                  </c>
                  <c ca="center">
                     <p>2,780.4</p>
                  </c>
                  <c ca="center">
                     <p>3,679.7</p>
                  </c>
                  <c ca="center">
                     <p>4,096.5</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X99466</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ENOD16</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,142</p>
                  </c>
                  <c ca="center">
                     <p>3,124.6</p>
                  </c>
                  <c ca="center">
                     <p>4,535.5</p>
                  </c>
                  <c ca="center">
                     <p>5,156.2</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X99467</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ENOD20</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,405</p>
                  </c>
                  <c ca="center">
                     <p>4,003.6</p>
                  </c>
                  <c ca="center">
                     <p>5,294.7</p>
                  </c>
                  <c ca="center">
                     <p>5,966.7</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>Y10267</p>
                  </c>
                  <c ca="center">
                     <p>glutamine synthetase</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,413</p>
                  </c>
                  <c ca="center">
                     <p>3,116.1</p>
                  </c>
                  <c ca="center">
                     <p>4,506.5</p>
                  </c>
                  <c ca="center">
                     <p>5,292.1</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>Y10373</p>
                  </c>
                  <c ca="center">
                     <p>chitinase</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>1,305</p>
                  </c>
                  <c ca="center">
                     <p>3,369.5</p>
                  </c>
                  <c ca="center">
                     <p>4,090.4</p>
                  </c>
                  <c ca="center">
                     <p>4,703.4</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Phytophthora infestans</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF004951</p>
                  </c>
                  <c ca="center">
                     <p>surface glycoprotein elicitor <it>inf2A</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>648</p>
                  </c>
                  <c ca="center">
                     <p>3,428.4</p>
                  </c>
                  <c ca="center">
                     <p>2,421.9</p>
                  </c>
                  <c ca="center">
                     <p>2,589.1</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF004952</p>
                  </c>
                  <c ca="center">
                     <p>surface glycoprotein elicitor <it>inf2B</it></p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>701</p>
                  </c>
                  <c ca="center">
                     <p>3,611.7</p>
                  </c>
                  <c ca="center">
                     <p>2,514.5</p>
                  </c>
                  <c ca="center">
                     <p>2,698.6</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>L23938</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ipiO2</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,556</p>
                  </c>
                  <c ca="center">
                     <p>4,125.2</p>
                  </c>
                  <c ca="center">
                     <p>4,339.5</p>
                  </c>
                  <c ca="center">
                     <p>4,855.5</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>L23939</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ipiO1</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,826</p>
                  </c>
                  <c ca="center">
                     <p>4,360.5</p>
                  </c>
                  <c ca="center">
                     <p>4,580.9</p>
                  </c>
                  <c ca="center">
                     <p>5,259.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>L24206</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>ipiB1</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,726</p>
                  </c>
                  <c ca="center">
                     <p>6,086.7</p>
                  </c>
                  <c ca="center">
                     <p>4,584.3</p>
                  </c>
                  <c ca="center">
                     <p>5,159.3</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M59715</p>
                  </c>
                  <c ca="center">
                     <p>actin <it>actA</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,736</p>
                  </c>
                  <c ca="center">
                     <p>5,137.1</p>
                  </c>
                  <c ca="center">
                     <p>3,637.2</p>
                  </c>
                  <c ca="center">
                     <p>4,420.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M59716</p>
                  </c>
                  <c ca="center">
                     <p>actin <it>actB</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,405</p>
                  </c>
                  <c ca="center">
                     <p>4,425.3</p>
                  </c>
                  <c ca="center">
                     <p>3,569.5</p>
                  </c>
                  <c ca="center">
                     <p>4,141.6</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M83535</p>
                  </c>
                  <c ca="center">
                     <p>calmodulin <it>calA</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,358</p>
                  </c>
                  <c ca="center">
                     <p>4,063.0</p>
                  </c>
                  <c ca="center">
                     <p>3,724.9</p>
                  </c>
                  <c ca="center">
                     <p>4,138.1</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X64537</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>tigA</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>2,448</p>
                  </c>
                  <c ca="center">
                     <p>6,221.0</p>
                  </c>
                  <c ca="center">
                     <p>4,193.8</p>
                  </c>
                  <c ca="center">
                     <p>5,181.9</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>P. capsici</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U42304</p>
                  </c>
                  <c ca="center">
                     <p>chitin synthase <it>chs</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>449</p>
                  </c>
                  <c ca="center">
                     <p>2,238.8</p>
                  </c>
                  <c ca="center">
                     <p>1,882.7</p>
                  </c>
                  <c ca="center">
                     <p>1,997.5</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>P. parasitica</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X97205</p>
                  </c>
                  <c ca="center">
                     <p>cellulose-binding-elicitor lectin</p>
                  </c>
                  <c ca="center">
                     <p>y</p>
                  </c>
                  <c ca="center">
                     <p>918</p>
                  </c>
                  <c ca="center">
                     <p>3,819.1</p>
                  </c>
                  <c ca="center">
                     <p>2,876.0</p>
                  </c>
                  <c ca="center">
                     <p>3,208.4</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Sinorhizobium meliloti</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF040724</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>nodD</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,776</p>
                  </c>
                  <c ca="center">
                     <p>5,317.9</p>
                  </c>
                  <c ca="center">
                     <p>4,197.8</p>
                  </c>
                  <c ca="center">
                     <p>4,179.4</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>AF110770</p>
                  </c>
                  <c ca="center">
                     <p>superoxide dismutase <it>sodA</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,196</p>
                  </c>
                  <c ca="center">
                     <p>4,898.2</p>
                  </c>
                  <c ca="center">
                     <p>3,343.2</p>
                  </c>
                  <c ca="center">
                     <p>2,916.0</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M61753</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>exoD</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>858</p>
                  </c>
                  <c ca="center">
                     <p>4,071.8</p>
                  </c>
                  <c ca="center">
                     <p>2,847.2</p>
                  </c>
                  <c ca="center">
                     <p>2,372.9</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M68858</p>
                  </c>
                  <c ca="center">
                     <p>nodulation proteins <it>nodP</it> and <it>nodQ</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>3,476</p>
                  </c>
                  <c ca="center">
                     <p>9,992.3</p>
                  </c>
                  <c ca="center">
                     <p>5,288.0</p>
                  </c>
                  <c ca="center">
                     <p>3,954.7</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>M96261</p>
                  </c>
                  <c ca="center">
                     <p>phosphate regulators <it>phoU</it> and <it>phoB</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,178</p>
                  </c>
                  <c ca="center">
                     <p>5,332.7</p>
                  </c>
                  <c ca="center">
                     <p>3,359.3</p>
                  </c>
                  <c ca="center">
                     <p>2,866.4</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U90221</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>syrA</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>1,102</p>
                  </c>
                  <c ca="center">
                     <p>4,176.2</p>
                  </c>
                  <c ca="center">
                     <p>3,375.8</p>
                  </c>
                  <c ca="center">
                     <p>3,220.8</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X01649</p>
                  </c>
                  <c ca="center">
                     <p><it>nodA</it>, <it>nodB</it>, and <it>nodC</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>3,373</p>
                  </c>
                  <c ca="center">
                     <p>7,684.1</p>
                  </c>
                  <c ca="center">
                     <p>4,819.5</p>
                  </c>
                  <c ca="center">
                     <p>4,646.1</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X03065</p>
                  </c>
                  <c ca="center">
                     <p>regulatory nitrogen fixation <it>fixD</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>2,111</p>
                  </c>
                  <c ca="center">
                     <p>5,723.0</p>
                  </c>
                  <c ca="center">
                     <p>4,249.8</p>
                  </c>
                  <c ca="center">
                     <p>4,228.3</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>X17523</p>
                  </c>
                  <c ca="center">
                     <p>glutamine synthetase II</p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>990</p>
                  </c>
                  <c ca="center">
                     <p>4,303.5</p>
                  </c>
                  <c ca="center">
                     <p>2,959.9</p>
                  </c>
                  <c ca="center">
                     <p>2,720.4</p>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>Y08500</p>
                  </c>
                  <c ca="center">
                     <p>
                        <it>putA</it>
                     </p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>3,804</p>
                  </c>
                  <c ca="center">
                     <p>13,623.3</p>
                  </c>
                  <c ca="center">
                     <p>6,376.5</p>
                  </c>
                  <c ca="center">
                     <p>4,212.0</p>
                  </c>
               </r>
               <r>
                  <c ca="left">
                     <p>
                        <it>Agrobacterium tumefaciens</it>
                     </p>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
                  <c>
                     <p/>
                  </c>
               </r>
               <r>
                  <c indent="1" ca="left">
                     <p>U91632</p>
                  </c>
                  <c ca="center">
                     <p>sugar transporter <it>gguA</it> and membrane-spanning permeases <it>gguB</it> and <it>gguC</it></p>
                  </c>
                  <c ca="center">
                     <p>n</p>
                  </c>
                  <c ca="center">
                     <p>4,185</p>
                  </c>
                  <c ca="center">
                     <p>11,132.6</p>
                  </c>
                  <c ca="center">
                     <p>5,959.4</p>
                  </c>
                  <c ca="center">
                     <p>4,551.4</p>
                  </c>
               </r>
            </tblbdy>
            <tblfn>
               <p>Incorrect inferences are underlined.</p>
            </tblfn>
         </tbl>
         <p>Distributions of GC content are approximately normal in two of the three cases studied, those of axenic <it>P. sojae</it> cultures (Figure <figr fid="F1">1</figr>). For sequences from infected plant cultures, a bimodal distribution is apparent. Roughly 25% of the 927 infected <it>G. max</it> sequences contain less than 50% GC; most of these are likely to be plant transcripts [<abbr bid="B18">18</abbr>]. This proportion is considerably greater than for axenic <it>P. sojae</it> cultures, in which fewer than 5% of mycelial and zoospore sequences contain less than 50% GC.</p>
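         <p>The GC tally described above can be illustrated with a minimal Python sketch that computes the GC fraction of each sequence and the share of a library falling below a 50% GC threshold. The function names (gc_content, fraction_below) and the toy reads are hypothetical and are not taken from the analysis reported here; ignoring ambiguous bases is likewise an assumption.</p>
         <p><![CDATA[
```python
from typing import Iterable, List


def gc_content(seq: str) -> float:
    """Return the fraction of unambiguous bases (A, C, G, T) that are G or C."""
    s = seq.upper()
    gc = s.count("G") + s.count("C")
    at = s.count("A") + s.count("T")
    total = gc + at
    if total == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / total


def fraction_below(seqs: Iterable[str], threshold: float = 0.5) -> float:
    """Return the share of sequences whose GC fraction is strictly below threshold."""
    values: List[float] = [gc_content(s) for s in seqs]
    if not values:
        raise ValueError("no sequences given")
    return sum(v < threshold for v in values) / len(values)


# Hypothetical toy reads, not data from the libraries analysed here.
toy_reads = ["ATATGCAT", "GGCCGCGA", "GCGCGGCC", "ATTTTAGC"]
print([gc_content(r) for r in toy_reads])
print(fraction_below(toy_reads))
```
]]></p>
         <p>Applied to a whole library, the value returned by fraction_below corresponds to the kind of proportion quoted above for sequences with less than 50% GC.</p>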
         <fig id="F1">
            <title>
               <p>Figure 1</p>
            </title>
            <caption>
               <p>Distribution of GC content in pure and mixed-culture libraries</p>
            </caption>
            <text>
               <p>Distribution of GC content in pure and mixed-culture libraries. <b>(a)</b> Probability densities for histogram bin sizes of 0.02 (2%) in base content. <b>(b)</b> Cumulative probability distribution functions (<it>cdf</it>s).</p>
            </text>
            <graphic file="gb-2001-2-9-research0037-1"/>
         </fig>
         <p>Several properties of cumulative distribution functions warrant comment, to help explain similar plots from word dissimilarity comparisons (Figures <figr fid="F1">1b</figr>,<figr fid="F2">2a</figr>). The median of a distribution occurs where the function reaches a cumulative probability of 0.5. Medians from all three <it>P. sojae</it> libraries are similar, varying by less than 4% GC (Figure <figr fid="F1">1b</figr>). Other properties of the distributions are also readily apparent; for example, the variance is inversely related to the slope of the function at the median. A useful property of cumulative distribution functions is that the value on the <it>y</it> axis at any point gives the cumulative probability, that is, the integrated area under the corresponding density curve up to that point. We use this property to establish experiment-wide false-positive and false-negative rates (Figure <figr fid="F2">2a</figr>). For the comparison of plant with stramenopile training sequences, &#945; = 0.088 and &#946; = 0.032.</p>
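         <p>The way error rates are read from the two cumulative distribution functions can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the procedure used to produce the values reported here: it uses an empirical cdf, the function names (ecdf, error_rates) are hypothetical, and the score lists are invented. It follows the convention that &#945; = 1 - cdf of set A at zero and &#946; = cdf of set B at zero.</p>
         <p><![CDATA[
```python
from bisect import bisect_right
from statistics import median
from typing import Sequence, Tuple


def ecdf(sample: Sequence[float], x: float) -> float:
    """Empirical cdf: fraction of observations less than or equal to x."""
    if not sample:
        raise ValueError("empty sample")
    ordered = sorted(sample)
    return bisect_right(ordered, x) / len(ordered)


def error_rates(t_a: Sequence[float], t_b: Sequence[float]) -> Tuple[float, float]:
    """Return (alpha, beta) evaluated at the neutral value t = 0.

    alpha = 1 - cdf_A(0): share of set A scores falling above zero.
    beta  = cdf_B(0):     share of set B scores falling at or below zero.
    """
    return 1.0 - ecdf(t_a, 0.0), ecdf(t_b, 0.0)


# Invented scores for illustration only.
scores_a = [-3.0, -2.0, -1.5, -0.5, 0.4]
scores_b = [-0.2, 0.8, 1.1, 2.0, 3.5]

alpha, beta = error_rates(scores_a, scores_b)
print(alpha, beta)
# The median is the point where the cdf first reaches 0.5.
print(median(scores_a), median(scores_b))
```
]]></p>
         <p>With real calibration scores in place of the invented lists, the same two evaluations at zero give the false-positive and false-negative rates for a pair of training sets.</p>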
         <fig id="F2">
            <title>
               <p>Figure 2</p>
            </title>
            <caption>
               <p>Distribution of hexamer dissimilarity test results from pure and mixed-culture libraries</p>
            </caption>
            <text>
               <p>Distribution of hexamer dissimilarity test results from pure and mixed-culture libraries. <b>(a)</b> Calculation of statistical parameters from <it>cdf</it>s <it>A</it> and <it>B</it>. Overlap in the upper tail of <it>cdf</it><sub>A</sub> with <it>cdf</it><sub>B</sub> and the lower tail of <it>cdf</it><sub>B</sub> with <it>cdf</it><sub>A</sub> are likely regions for error. We find the false-positive rate &#945; where 1 - <it>cdf</it><sub>A</sub> intersects 0 [<it>cdf</it><sub>A</sub>(0) = 1 - &#945;], and the false-negative rate &#946; where <it>cdf</it><sub>B</sub> crosses 0. Also shown are the medians (&#956;) for each distribution, where <it>cdf</it>(&#956;) = 0.5. <b>(b)</b> Calibration curves for plant (<it>A</it><sub>1</sub>, <it>Glycine</it> and <it>Medicago</it> spp., solid black line) and stramenopile plus <it>P. infestans</it> EST (<it>B</it><sub>1</sub>, dashed black line) training sequences. Superimposed distributions of test results show dissimilarity differences for infected <it>G. max</it> (green) and axenic <it>P. sojae</it> mycelial and zoospore sequences (blue and cyan, respectively).</p>
            </text>
            <graphic file="gb-2001-2-9-research0037-2"/>
         </fig>
         <p>Calibration curves from hexamer dissimilarity tests, shown in Figure <figr fid="F2">2b</figr> as solid black lines for plant and dashed black lines for stramenopile training sequences, are approximately normal. The medians differ considerably, with only about 10% overlap in the two distributions' tails about the neutral <it>t</it> value of zero. Superimposed are comparison curves from <it>P. sojae</it> test sets (Figure <figr fid="F2">2b</figr>), which parallel the GC content curves in Figure <figr fid="F1">1b</figr> but show slightly less variance. Axenic sequences are clearly more like stramenopiles (<it>B</it><sub>1</sub>) than plants (<it>A</it><sub>1</sub>) in hexamer composition, with all but a small percentage having positive <it>t</it> values. Plant-like sequences make up about 23% of the mixed library, close to the proportion detected by GC content. As expected, the two methods agree, having positively correlated values for GC and <it>t</it> (<it>r</it><sup>2</sup> = 0.852, <it>P</it> &lt; 10<sup>-16</sup>, <it>v</it> = 2,641).</p>
         <p>Looking in more detail at the paired dissimilarity values (Figure <figr fid="F3">3</figr>), we can see which individual sequences are more or less like plant and pathogen. The magnitudes of dissimilarity are also apparent, with longer sequences having larger dissimilarity values. BLASTX similarity searches against the protein sequences in nr, a non-redundant library of proteins [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>,<abbr bid="B31">31</abbr>], revealed that none of the 12 plant-like mycelial transcripts significantly resembles known proteins (<it>E</it> &gt; 10<sup>-4</sup>). Among the ten most plant-like transcripts from the infected <it>G. max</it> library, three had no significant matches, four matched putative <it>Arabidopsis thaliana</it> proteins, and three matched known <it>G. max</it> proteins: cytochrome P450 (accession AF022460, <it>E</it> &lt; 10<sup>-34</sup>), methylglyoxalase (accession P46417, <it>E</it> &lt; 10<sup>-34</sup>), and a ripening-related protein (accession AF127110, <it>E</it> &lt; 10<sup>-71</sup>). Thus, the majority of the most plant-like transcripts in the infected soybean library strongly resemble characterized plant sequences. Analysis results from all <it>P. sojae</it> and mixed-culture transcripts are available as additional data files, grouped by the library from which transcripts were sequenced.</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>Paired dissimilarity test results from pure and mixed-culture libraries</p>
            </caption>
            <text>
               <p>Paired dissimilarity test results from pure and mixed-culture libraries. Each point corresponds to an expressed tag from either <b>(a)</b> infected <it>G. max</it> or <b>(b)</b> axenic <it>P. sojae</it> mycelial or <b>(c)</b> zoospore sequences, compared with plant (<it>A</it><sub>1</sub>) and stramenopile plus <it>P. infestans</it> EST training sequences (<it>B</it><sub>1</sub>). The identity function indicates equal dissimilarity to both training sets, <it>t</it> = <it>D</it>(<it>A</it>) - <it>D</it>(<it>B</it>) = 0. Points above the identity function are more plant-like than points below.</p>
            </text>
            <graphic file="gb-2001-2-9-research0037-3"/>
         </fig>
         <p>Figure <figr fid="F4">4</figr> shows that calibration curves from comparing plant and microbial symbiont training sets have good separation and minimal overlap (about 10%) in two of three cases, but not for training set <it>B</it><sub>2</sub>, composed of zygomycetes and chytridiomycetes, which overlaps considerably with plants (Figure <figr fid="F4">4b</figr>). The associated error rates are &#945; = 0.126 and &#946; = 0.207. When comparing between plants and bacteria, the error rates are &#945; = 0.052 and &#946; = 0.084, much lower than when comparing plants (<it>A</it><sub>2</sub>, <it>Medicago</it>) with fungi (<it>B</it><sub>2</sub>, zygomycetes and chytridiomycetes). Error rates for comparing stramenopiles and <it>P. infestans</it> ESTs with plants are as in Figure <figr fid="F2">2</figr> (&#945; = 0.088, &#946; = 0.032).</p>
         <fig id="F4">
            <title>
               <p>Figure 4</p>
            </title>
            <caption>
               <p>Dissimilarity distributions from <it>Medicago truncatula</it> libraries</p>
            </caption>
            <text>
               <p>Dissimilarity distributions from <it>Medicago truncatula</it> libraries. Calibration curves compare plant training sets (<it>A</it><sub>1</sub> and <it>A</it><sub>2</sub>, solid black lines) with one of three microbial symbiont training sets (broken black lines): <b>(a)</b> Stramenopile and <it>P. infestans</it> EST sequences (<it>B</it><sub>1</sub>); <b>(b)</b> pooled zygomycete and chytridiomycete coding sequences (<it>B</it><sub>2</sub>); and <b>(c)</b> sequences from the genera <it>Rhizobium</it>, <it>Sinorhizobium</it> and <it>Bradyrhizobium</it> (<it>B</it><sub>3</sub>). Cumulative distributions of test results from <it>M. truncatula</it> axenic and microbial symbiont mixed cultures appear in each panel (colored lines).</p>
            </text>
            <graphic file="gb-2001-2-9-research0037-4"/>
         </fig>
         <p>Also shown in Figure <figr fid="F4">4</figr> are cumulative distributions from comparisons with <it>M. truncatula</it> and microbial symbionts. All resemble calibration curves from plant sequences, having similar medians and slightly less variance than the plant calibration curves. Comparison curves show that the great majority of test sequences are more plant-like than otherwise, with 20% or less resembling microbial symbionts more closely than plants. A greater proportion of microbial sequences is present in the <it>M. truncatula</it>-<it>G. versiforme</it> interaction library (20%, Figure <figr fid="F4">4b</figr>) than in the <it>P. medicaginis</it>-infected <it>M. truncatula</it> library (5%, Figure <figr fid="F4">4a</figr>). However, Long's root-hair enriched library (MtRHE) [<abbr bid="B6">6</abbr>] had a greater proportion of putative microbial sequences present (7% and 25%) than any of the libraries isolated from symbiont-associated cultures. The axenic and nodulating root libraries had the smallest portion of putative microbial transcripts (&lt; 2%, Figure <figr fid="F4">4c</figr>), with the axenic library closely resembling nodulating root libraries. The method of preparing a library can affect the proportion of plant and non-plant sequences, as discussed below.</p>
         <p>Paired dissimilarity values in Figure <figr fid="F5">5</figr> show in greater detail which sequences are more or less like plant and symbiont. Sequences from an interaction library and pure plant root cultures appear together for comparison. Considerable variation in the degree of dissimilarity to both training sets is clear, largely due to variation in the length of sequences within test sets. Consistent with the cumulative distributions of <it>D(A)</it> &#8211; <it>D(B)</it> in Figure <figr fid="F4">4</figr>, most sequences lie above the identity function, and resemble the plant host more closely than the microbial symbiont. Mycorrhizal test sequences are more difficult to differentiate than sequences from the rhizobacterial or pathogenic associations, as seen by the diminished variation about the identity function in mycorrhizal comparisons (Figure <figr fid="F5">5b</figr>), contrasted with comparisons from pathogen-infected and nodulating root libraries (Figures <figr fid="F5">5a</figr> and <figr fid="F5">c</figr>, respectively). Analysis results from all <it>M. truncatula</it> and mixed-culture transcripts are available as additional data files, grouped by the library from which transcripts were sequenced, and sorted from the least plant-like transcripts to the most plant-like.</p>
         <fig id="F5">
            <title>
               <p>Figure 5</p>
            </title>
            <caption>
               <p>Paired comparison results from pure and mixed-culture <it>M. truncatula</it> libraries</p>
            </caption>
            <text>
               <p>Paired comparison results from pure and mixed-culture <it>M. truncatula</it> libraries. Each point indicates the dissimilarity of a test sequence compared with a plant training set (<it>A</it><sub>1</sub> or <it>A</it><sub>2</sub>) and one of three microbial symbiont training sets: <b>(a)</b> Stramenopile and <it>P. infestans</it> EST sequences (<it>B</it><sub>1</sub>); <b>(b)</b> pooled zygomycete and chytridiomycete coding sequences (<it>B</it><sub>2</sub>); and <b>(c)</b> sequences from the genera <it>Rhizobium</it>, <it>Sinorhizobium</it> and <it>Bradyrhizobium</it> (<it>B</it><sub>3</sub>). Sequences from <it>M. truncatula</it> axenic (green) and microbial symbiont mixed culture libraries are represented in each panel. The identity function (<it>y</it> = <it>x</it>) is also shown.</p>
            </text>
            <graphic file="gb-2001-2-9-research0037-5"/>
         </fig>
      </sec>
      <sec>
         <st>
            <p>Discussion</p>
         </st>
         <p>Clearly, the word-counting approach provides a reliable solution to the problem of source identification with known confidence, and has several significant advantages. The reliability of the method is best justified in terms of the favorable validation test results, and is further corroborated by agreement with an analysis of GC content. In test cases where the correct answer is known <it>a priori</it>, results were correct within error rates expected from overlap in training sets. (Recall that &#945; = 0.088 for comparisons between plants and stramenopiles, and &#945; = 0.052 for comparisons between plants and bacteria.) Unlike GC content, word counting resolves the problem clearly with a threshold value of <it>t</it> = 0, and with statistical rigor, because false-positive and false-negative rates for a set of comparisons are readily computed from cumulative distributions of dissimilarity between two training sets. Optimal statistical power (minimal false-negative rate) is ensured when using a likelihood-ratio test statistic, as demonstrated by the Neyman-Pearson lemma [<abbr bid="B28">28</abbr>]. Further, word counting need not be trained only for the species being compared. Rather, it is sufficient that the training sequences come from taxa related to, but not necessarily congeners of, the species from which sequences are being compared. Sequences from several species of the genus <it>Phytophthora</it> were correctly distinguished from plant and bacterial sequences, and three genes from <it>Agrobacterium tumefaciens</it> were correctly identified as representing a bacterial sequence.</p>
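         <p>The way error rates follow from calibration distributions at the threshold <it>t</it> = 0 can be illustrated with a minimal sketch. It assumes the sign convention <it>t</it> = <it>D</it>(<it>A</it>) - <it>D</it>(<it>B</it>), so that negative values indicate a sequence closer to training set <it>A</it>. The function name and arguments are hypothetical, and which of the two returned fractions corresponds to &#945; and which to &#946; is not fixed by this sketch.</p>

```python
from typing import Sequence, Tuple


def misclassification_rates(
    t_known_a: Sequence[float],
    t_known_b: Sequence[float],
    threshold: float = 0.0,
) -> Tuple[float, float]:
    """Return (fraction of known-A called B, fraction of known-B called A).

    Both inputs are test-statistic values from calibration replicates in
    which the true source of every sequence is known.
    """
    if not t_known_a or not t_known_b:
        raise ValueError("both calibration samples must be non-empty")
    # Assumed convention: t = D(A) - D(B), so values below the threshold
    # are assigned to A and all other values are assigned to B.
    a_called_b = sum(1 for t in t_known_a if t >= threshold) / len(t_known_a)
    b_called_a = sum(1 for t in t_known_b if threshold > t) / len(t_known_b)
    return a_called_b, b_called_a
```

         <p>For example, calibration values of -2, -1, -0.5 and 0.3 for known <it>A</it> sequences give a misassignment fraction of 0.25 at the default threshold.</p>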
         <p>However, several caveats warrant prudence. Transcribed sequences that do not encode proteins, but rather catalytic single-stranded RNAs such as transfer and ribosomal RNAs [<abbr bid="B32">32</abbr>], should be treated independently because they are more highly conserved across taxa than messenger RNAs. Also, filtering or trimming of low-complexity repeat regions, such as poly(A) or poly(T) tracts, is helpful because comparison results can be influenced by the abundance of a single hexamer. Early in our investigations, using one set of training sequences obtained from directionally cloned <it>P. infestans</it> cDNAs produced results that were difficult to interpret. It eventually became clear that, as the <it>P. infestans</it> sequences were all single-pass reads from the 5' end of a clone generated with the T3 primer, few sequences complementary to the 3' end of the mRNA sequence were present in the training set. This meant that the hexamer AAAAAA was common, but the hexamer TTTTTT scarce. Large amounts of the poly(T) hexamer would be expected when sequencing reverse complements of mRNAs obtained from 3' sequences generated with the T7 primer. Both poly(A) and poly(T) regions were present among plant training sequences. As a result, any sequence that contained a poly(T) tract tended to resemble the plant sequences. Further, because the error rates for an inference depend on the degree to which calibration curves overlap, the best results are obtained where overlap is minimal. Despite these caveats, word counting presents a viable solution to the problem.</p>
         <p>The <it>P. sojae</it>-infected <it>G. max</it> library provides a clear example of contrast in both hexamer composition and GC content, resulting in readily diagnosed origins. Not every case is this simple. For clear separation between the two species to appear, the two must differ in composition and a detectable proportion of transcripts from each species must be present in the library. To be detectable, the proportion of transcripts present from a particular species must be greater than the error rate obtained from calibration curves.</p>
         <p>Though these criteria are true for the infected <it>G. max</it> library (<it>t</it> &lt; 0 for &lt;25% of 927 transcripts), they do not appear to be true for the <it>M. truncatula</it> libraries we analyzed (<it>t</it> &lt; 0 for 80&#8211;99% of 890&#8211;3,017 transcripts). In the <it>P. medicaginis</it> interaction library, we might expect the same bimodal distribution as seen with <it>P. sojae</it>. However, the two libraries were prepared in different ways. The <it>P. sojae</it>-infected library was prepared two days after infection, using a susceptible plant host strain, so as to maximize the number of pathogen transcripts present in the host tissue [<abbr bid="B18">18</abbr>]. Further, <it>G. max</it> hypocotyl tissues were infected directly with a zoospore suspension. In contrast, the <it>P. medicaginis</it>-infected library was prepared ten days after infection and individual plants varied in their degree of susceptibility (C. Vance, unpublished data). Plants were also inoculated in a different manner: ground mycelia were dissolved in sterile water and incubated, and the resulting inoculum was pipetted onto the soil surface, rather than the plant. These differences in how tissues were cultured prior to library preparation could have produced the disparate abundance of plant transcripts, though both libraries were prepared from plant tissues infected with <it>Phytophthora</it>.</p>
         <p>For mycorrhizal root libraries, we might explain the relative lack of symbiont sequences as resulting simply from a relative lack of transcripts in the host tissue. Most of the biomass in mycorrhizal roots is plant biomass [<abbr bid="B33">33</abbr>]. We might therefore expect that most of the transcripts therein originate from the plant host. Confounding this result, the error rates in this comparison are the greatest among all the comparisons we performed, most likely because the evolutionary distance between fungi (zygomycetes and chytridiomycetes) and plants is the least among comparisons [<abbr bid="B34">34</abbr>]. Also, zygomycete protein-coding sequences are rare in GenBank, which resulted in a small training set for these fungi, and may have amplified any biases. The high false-negative rate probably led to a failure to detect some symbiont transcripts.</p>
         <p>In nodulating root libraries, we do not expect to observe an abundance of bacterial transcripts, because bacteria generally do not form polyadenylated mRNAs [<abbr bid="B35">35</abbr>]. As the protocols used to extract and purify mRNAs from tissue lysate for the libraries cited in this study all relied on the presence of polyadenylation sites, we generally do not expect to find bacterial transcripts.</p>
         <p>The abundance of putative microbial symbiont transcripts among sequences from a pure plant root library is difficult to interpret. The predicted portion of microbial transcripts was greater in the axenic root-hair enriched library than in mixed cultures. Error rates were greatest for comparisons between training sets from plant and pooled zygomycete and chytridiomycete sequences. Other than providing an 87% confidence level, the 13% false-positive rate does not completely explain why about 15% of root-hair enriched transcripts resemble fungal hexamer composition more closely than plants, and warrants further study.</p>
         <p>Care had been taken to avoid contaminating plant tissue cultures by culturing seedlings in covered plates. Because of concern that ethylene accumulation in covered plates could improperly stimulate nodulation-related gene expression, seedlings were treated with Ag<sub>2</sub>SO<sub>4</sub>, an inhibitor of the plants' response to ethylene [<abbr bid="B6">6</abbr>]. Inhibition of the ethylene response could have resulted in synthesis of transcripts that are uncharacteristic of plant roots. Analysis of another axenic root-hair enriched library, particularly one provided with a carbon source to identify potential contaminants, and not treated with an inhibitor of ethylene response, would be an informative test.</p>
         <p>These observations warrant further experimental scrutiny. The transcripts identified as most and least like plant or symbiont might also be studied in more detail as candidate participants in symbiosis. Symbiotic interactions, whether pathogenic or mutualistic, present novel challenges to both plant hosts and the biologists who study them. Computational approaches, in concert with experimental verification, can help resolve these challenges.</p>
      </sec>
      <sec>
         <st>
            <p>Methods and materials</p>
         </st>
         <sec>
            <st>
               <p>Training sequences</p>
            </st>
            <sec>
               <st>
                  <p>Calibration</p>
               </st>
               <p>To characterize hexamer frequencies in plant hosts and their microbial symbionts, we collected sets of training sequences from public databases and edited them for quality. Training sets were chosen to be representative of, but obtained independently from, taxa participating in symbiotic associations for which a diagnosis of origin would be made. Because the species being compared are represented unevenly in public sequence databases, taxa were chosen so that roughly the same number of genes were analyzed in each training set, rather than simply to maximize the numbers of species or sequences present.</p>
               <p>Training sets represent protein-coding sequences from three taxonomic groupings: plants (<it>A</it><sub>1</sub>, <it>Medicago</it> and <it>Glycine</it> spp.), either fungi (<it>B</it><sub>2</sub>, zygomycetes and chytridiomycetes) or stramenopiles (<it>B</it><sub>1</sub>), including ESTs from <it>P. infestans</it> [<abbr bid="B16">16</abbr>], and bacteria (<it>B</it><sub>3</sub>, <it>Rhizobium</it>, <it>Sinorhizobium</it> and <it>Bradyrhizobium</it>). We performed pairwise comparisons with two different, taxon-specific training sets (<it>A</it> and <it>B</it>) to infer the origin of a transcript.</p>
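               <p>A pairwise comparison of this kind can be sketched as follows. The exact form of the dissimilarity <it>D</it> is not restated in this section, so the sketch assumes a negative log-likelihood of a sequence's hexamers under smoothed training-set frequencies, with <it>t</it> = <it>D</it>(<it>A</it>) - <it>D</it>(<it>B</it>). The class <tt>HexamerModel</tt>, the functions <tt>train</tt> and <tt>t_value</tt>, and the pseudocount are hypothetical choices for illustration, not the published implementation.</p>

```python
import math
from collections import Counter
from dataclasses import dataclass
from typing import Dict, Iterable, List

K = 6  # word length: hexamers, as in the text


def hexamers(seq: str) -> List[str]:
    """All overlapping words of length K in seq (empty if seq is shorter)."""
    seq = seq.upper()
    return [seq[i:i + K] for i in range(len(seq) - K + 1)]


@dataclass
class HexamerModel:
    log_freq: Dict[str, float]  # log relative frequency of observed hexamers
    log_unseen: float           # log frequency given to unobserved hexamers

    def dissimilarity(self, seq: str) -> float:
        """Assumed D: negative log-likelihood of seq's hexamers."""
        return -sum(self.log_freq.get(w, self.log_unseen) for w in hexamers(seq))


def train(seqs: Iterable[str], pseudocount: float = 1.0) -> HexamerModel:
    """Count hexamers in a training set, with additive smoothing (assumed)."""
    counts = Counter(w for s in seqs for w in hexamers(s))
    total = sum(counts.values()) + pseudocount * 4 ** K
    log_freq = {w: math.log((c + pseudocount) / total) for w, c in counts.items()}
    return HexamerModel(log_freq, math.log(pseudocount / total))


def t_value(seq: str, model_a: HexamerModel, model_b: HexamerModel) -> float:
    """t = D(A) - D(B); negative values mean seq resembles A more closely."""
    return model_a.dissimilarity(seq) - model_b.dissimilarity(seq)
```

               <p>Under this assumed form, <it>D</it> grows with the number of hexamers in a sequence, so longer sequences yield larger dissimilarity values to both training sets.</p>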
               <p>Training sets were obtained by querying the GenBank database using the Entrez retrieval tool [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>,<abbr bid="B31">31</abbr>]. A preliminary query by taxon name obtained all available nucleotide sequences from that taxon, then the Limits option excluded ESTs, STSs (sequence-tagged sites), GSSs (genome survey sequences), working draft sequences, and patented sequences from the query set. Organellar (mitochondrial and chloroplast) DNA was also excluded via the Limits option. A query term to require that a sequence contain a protein-coding region (CDS) was also added, which excluded ribosomal and transfer RNA sequences. The results consisted of all sequences that contain a nuclear protein-coding sequence available for that taxon at the time of the query. This was done on two separate occasions: in April and October 2000. (Changing slightly the composition of training sets between those dates did not notably affect the experimental outcome.)</p>
               <p>Following a previously established protocol [<abbr bid="B19">19</abbr>], we used a resampling procedure to evaluate the degree of overlap between distributions of hexamer composition obtained from comparing two training sets. In this protocol, we resampled each training set 40 times by random partitioning into training (for hexamer counts) and test calculation pools. To control for any bias introduced by length variation, a program randomly clipped 300-nucleotide fragments for word counting. As a result, one random 300-nucleotide fragment from each training sequence was present in the training set during a single resampling replicate; independent replicates contained different, randomly chosen training sequences and 300-nucleotide fragments. Values of the test statistic from 40 resampled replicates were pooled for calibration purposes.</p>
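               <p>The resampling step can be sketched as below, assuming each input sequence is already at least 300 nucleotides long. The fragment length and the 40 replicates come from the text; the function names and the proportion assigned to the training pool (<tt>train_fraction</tt>) are hypothetical, because the split used in the protocol is not stated here.</p>

```python
import random
from typing import List, Sequence, Tuple

FRAGMENT_LENGTH = 300  # nucleotides, from the text
N_REPLICATES = 40      # resampling replicates, from the text


def clip_fragment(seq: str, rng: random.Random, length: int = FRAGMENT_LENGTH) -> str:
    """Return one randomly positioned fragment of the given length."""
    if length > len(seq):
        raise ValueError("sequence is shorter than the fragment length")
    start = rng.randint(0, len(seq) - length)  # both endpoints inclusive
    return seq[start:start + length]


def resample_once(
    sequences: Sequence[str],
    rng: random.Random,
    train_fraction: float = 0.5,  # assumed value, not taken from the protocol
) -> Tuple[List[str], List[str]]:
    """Randomly partition sequences into training and test pools of fragments."""
    shuffled = list(sequences)
    rng.shuffle(shuffled)
    fragments = [clip_fragment(s, rng) for s in shuffled]
    cut = int(len(fragments) * train_fraction)
    return fragments[:cut], fragments[cut:]


def resample_replicates(
    sequences: Sequence[str],
    seed: int = 0,
    n_replicates: int = N_REPLICATES,
) -> List[Tuple[List[str], List[str]]]:
    """Independent replicates, each with a new partition and new fragments."""
    rng = random.Random(seed)
    return [resample_once(sequences, rng) for _ in range(n_replicates)]
```

               <p>Test-statistic values computed on the test pool of every replicate would then be pooled to form one calibration distribution per training set.</p>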
               <p>As with the original protocol [<abbr bid="B19">19</abbr>], we pooled the resulting test statistic distributions, normalized them as cumulative distributions, and then evaluated them for overlap. We call the resulting comparisons 'calibration curves', as they are not used directly to make inferences, but rather indirectly, to evaluate the degree of separation in hexamer counts from different taxa. Overlap of calibration curves should be minimal to yield the most statistically powerful results possible.</p>
               <p>Due to considerable overlap of calibration curves between taxonomically general, inclusive training sets (that is, all eudicots, all fungi and miscellaneous eukaryotes, and all eubacteria, data not shown), we opted to work with specific training sets that included only the most species-specific sequences available, while maintaining approximately equal sample sizes across taxa.</p>
               <p>The most challenging case was that of the arbuscular mycorrhizal fungi, for which very few protein-coding sequences are available. To increase the amount of data in this training set (<it>B</it><sub>2</sub>) without biasing sample sizes, we pooled sequences from all species in the zygomycetes with all available chytridiomycete coding sequences, and compared this training set with a set from a single plant genus, <it>Medicago</it> (<it>A</it><sub>2</sub>). We chose this option, rather than including an arbitrary subset of sequences from the ascomycetes and basidiomycetes, because zygomycetes and chytridiomycetes have diverged from their common ancestor less recently than the ascomycetes and basidiomycetes, based on 18S ribosomal RNA sequence data [<abbr bid="B34">34</abbr>]. That is, the ascomycetes and basidiomycetes are more highly derived from the common fungal ancestor than zygomycetes and chytridiomycetes, which resemble more closely the ancestral state in modern lineages.</p>
            </sec>
            <sec>
               <st>
                  <p>Data quality</p>
               </st>
               <p>Starting with a full set of sequences, we filtered for high-quality sequences by trimming regions having extensive ambiguous bases (N-rich) and poly(A) or poly(T) regions. The test statistic can be sensitive to the abundance of a single word [<abbr bid="B19">19</abbr>]. Thus, we trimmed poly(A) and poly(T) tracts to minimize the cases in which a test sequence resembles one training set more closely than the other, simply by virtue of having an abundance of the hexamer AAAAAA or TTTTTT. Similarly, test results obtained from short or N-rich sequences can be difficult to interpret [<abbr bid="B19">19</abbr>]. We allowed no more than one N per hexamer and trimmed poly(A) or poly(T) tracts longer than 13 nucleotides. To account for possible sequence chimeras, those sequences found to contain an internal poly(A) or poly(T) segment longer than 13 nucleotides were partitioned into two fragments, and the longer of the two fragments was used in analysis, provided its length was at least 300 nucleotides.</p>
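               <p>The poly(A) and poly(T) rule can be sketched with a regular expression, as a rough illustration only. The sketch splits a sequence at every homopolymer run of A or T longer than 13 nucleotides and keeps the longest remaining piece if it reaches 300 nucleotides. This is a simplification: a sequence with several tracts yields more than two pieces, and handling of N-rich regions is not covered. The names <tt>POLY_TRACT</tt> and <tt>trim_sequence</tt> are hypothetical.</p>

```python
import re
from typing import Optional

MIN_LENGTH = 300  # minimum usable length in nucleotides, from the text

# Runs of A or T longer than 13 nucleotides, that is, 14 or more.
POLY_TRACT = re.compile(r"A{14,}|T{14,}")


def trim_sequence(seq: str) -> Optional[str]:
    """Remove long poly(A)/poly(T) tracts and keep the longest piece.

    Returns None when no piece reaches MIN_LENGTH. Terminal tracts are
    handled by the same split, since they leave an empty piece at one end.
    """
    pieces = POLY_TRACT.split(seq.upper())
    longest = max(pieces, key=len)
    return longest if len(longest) >= MIN_LENGTH else None
```

               <p>A run of exactly 13 nucleotides is left in place, because only tracts longer than 13 nucleotides are trimmed.</p>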
               <p>After trimming, we screened all remaining sequences of 300 nt or longer for similarity to <it>Escherichia coli</it> using BLASTN [<abbr bid="B20">20</abbr>,<abbr bid="B21">21</abbr>]. All BLAST searches used default parameters and low-complexity filtering with the programs DUST or SEG. The decision to exclude non-coding RNA sequences from training sets was informed by the appearance of bimodal distributions of hexamer frequencies and a large degree of overlap between calibration curves (data not shown), likely a result of divergent evolutionary rates between protein-coding and non-coding sequences [<abbr bid="B36">36</abbr>,<abbr bid="B37">37</abbr>]. Chloroplast and mitochondrial sequences were eliminated to avoid complications due to variation in codon usage between nuclear and organellar genomes.</p>
               <p>Table <tblr tid="T2">2</tblr> summarizes counts of sequences and nucleotides in training sets before and after trimming and screening. All training sets obtained using the procedure described above are available as additional files.</p>
               <tbl id="T2">
                  <title>
                     <p>Table 2</p>
                  </title>
                  <caption>
                     <p>Training sets</p>
                  </caption>
                  <tblbdy cols="7">
                     <r>
                        <c ca="left">
                           <p>Taxon</p>
                        </c>
                        <c cspan="2" ca="center">
                           <p>Raw</p>
                        </c>
                        <c cspan="2" ca="center">
                           <p>Trimmed</p>
                        </c>
                        <c cspan="2" ca="center">
                           <p>Screened</p>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c cspan="2">
                           <hr/>
                        </c>
                        <c cspan="2">
                           <hr/>
                        </c>
                        <c cspan="2">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>n</p>
                        </c>
                        <c ca="center">
                           <p>nt</p>
                        </c>
                        <c ca="center">
                           <p>n</p>
                        </c>
                        <c ca="center">
                           <p>nt</p>
                        </c>
                        <c ca="center">
                           <p>n</p>
                        </c>
                        <c ca="center">
                           <p>nt</p>
                        </c>
                     </r>
                     <r>
                        <c cspan="7">
                           <hr/>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>Glycine</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>892</p>
                        </c>
                        <c ca="center">
                           <p>1,265,829</p>
                        </c>
                        <c ca="center">
                           <p>834</p>
                        </c>
                        <c ca="center">
                           <p>1,219,114</p>
                        </c>
                        <c ca="center">
                           <p>826</p>
                        </c>
                        <c ca="center">
                           <p>1,184,951</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p><it>Medicago</it> (<it>A</it><sub>2</sub>)</p>
                        </c>
                        <c ca="center">
                           <p>401</p>
                        </c>
                        <c ca="center">
                           <p>561,104</p>
                        </c>
                        <c ca="center">
                           <p>382</p>
                        </c>
                        <c ca="center">
                           <p>519,739</p>
                        </c>
                        <c ca="center">
                           <p>380</p>
                        </c>
                        <c ca="center">
                           <p>513,868</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Total, plants (<it>A</it><sub>1</sub>)</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>1,206</p>
                        </c>
                        <c ca="center">
                           <p>1,698,819</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Stramenopiles</p>
                        </c>
                        <c ca="center">
                           <p>199</p>
                        </c>
                        <c ca="center">
                           <p>299,113</p>
                        </c>
                        <c ca="center">
                           <p>184</p>
                        </c>
                        <c ca="center">
                           <p>287,600</p>
                        </c>
                        <c ca="center">
                           <p>181</p>
                        </c>
                        <c ca="center">
                           <p>279,900</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>P. infestans</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>2,131</p>
                        </c>
                        <c ca="center">
                           <p>1,219,463</p>
                        </c>
                        <c ca="center">
                           <p>2,102</p>
                        </c>
                        <c ca="center">
                           <p>1,209,113</p>
                        </c>
                        <c ca="center">
                           <p>2,082</p>
                        </c>
                        <c ca="center">
                           <p>1,199,372</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Total, stramenopiles (<it>B</it><sub>1</sub>)</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>2,263</p>
                        </c>
                        <c ca="center">
                           <p>1,479,272</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Zygomycetes</p>
                        </c>
                        <c ca="center">
                           <p>232</p>
                        </c>
                        <c ca="center">
                           <p>343,817</p>
                        </c>
                        <c ca="center">
                           <p>212</p>
                        </c>
                        <c ca="center">
                           <p>329,222</p>
                        </c>
                        <c ca="center">
                           <p>211</p>
                        </c>
                        <c ca="center">
                           <p>327,229</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Chytridiomycetes</p>
                        </c>
                        <c ca="center">
                           <p>82</p>
                        </c>
                        <c ca="center">
                           <p>123,698</p>
                        </c>
                        <c ca="center">
                           <p>78</p>
                        </c>
                        <c ca="center">
                           <p>119,754</p>
                        </c>
                        <c ca="center">
                           <p>78</p>
                        </c>
                        <c ca="center">
                           <p>119,754</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Total, fungi (<it>B</it><sub>2</sub>)</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>289</p>
                        </c>
                        <c ca="center">
                           <p>446,983</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>Rhizobium</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>478</p>
                        </c>
                        <c ca="center">
                           <p>1,430,132</p>
                        </c>
                        <c ca="center">
                           <p>444</p>
                        </c>
                        <c ca="center">
                           <p>1,404,883</p>
                        </c>
                        <c ca="center">
                           <p>444</p>
                        </c>
                        <c ca="center">
                           <p>1,404,883</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>Sinorhizobium</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>320</p>
                        </c>
                        <c ca="center">
                           <p>900,294</p>
                        </c>
                        <c ca="center">
                           <p>312</p>
                        </c>
                        <c ca="center">
                           <p>898,687</p>
                        </c>
                        <c ca="center">
                           <p>312</p>
                        </c>
                        <c ca="center">
                           <p>898,687</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>
                              <it>Bradyrhizobium</it>
                           </p>
                        </c>
                        <c ca="center">
                           <p>153</p>
                        </c>
                        <c ca="center">
                           <p>471,309</p>
                        </c>
                        <c ca="center">
                           <p>146</p>
                        </c>
                        <c ca="center">
                           <p>465,307</p>
                        </c>
                        <c ca="center">
                           <p>146</p>
                        </c>
                        <c ca="center">
                           <p>465,307</p>
                        </c>
                     </r>
                     <r>
                        <c ca="left">
                           <p>Total, rhizobacteria (<it>B</it><sub>3</sub>)</p>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c>
                           <p/>
                        </c>
                        <c ca="center">
                           <p>902</p>
                        </c>
                        <c ca="center">
                           <p>2,768,877</p>
                        </c>
                     </r>
                  </tblbdy>
                  <tblfn>
                     <p>Number of sequences (n) and nucleotides (nt), as raw, trimmed (removed N-rich regions, poly(A) and poly(T) sites), and screened sequences (removed ribosomal, chloroplast, and mitochondrial DNA and remaining sequences shorter than 300 nucleotides).</p>
                  </tblfn>
               </tbl>
            </sec>
            <sec>
               <st>
                  <p>Validation</p>
               </st>
               <p>To test the validity of word counting as a solution to the problem, we identified a set of 50 gene sequences from plants (<it>M. truncatula</it> and <it>G. max</it>), oomycetes (<it>Phytophthora</it>), zygomycetes (<it>Glomus versiforme</it>), and bacteria (<it>Sinorhizobium meliloti</it> and <it>Agrobacterium tumefaciens</it>), for which the function and origin have been characterized experimentally. We chose genes known to play a role in plant-microbe interactions, as well as genes that are found across taxa. We withheld these sequences, and partial transcripts of the same genes, from training sets prior to comparative lexical analysis, and calculated hexamer dissimilarities for each of the three training sets as described below.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>Test sequences</p>
            </st>
            <p>To diagnose the species of origin for sequences expressed in symbiotic cultures, we collected sequences generated by distinct EST sequencing projects from the GenBank database [<abbr bid="B29">29</abbr>,<abbr bid="B30">30</abbr>,<abbr bid="B31">31</abbr>]. Sequences from pathogenic interactions originated from cultures of a species from the genus <it>Phytophthora</it> with its plant host, such as <it>P. sojae</it> and soybean (<it>G. max</it>) isolated from inoculated hypocotyls two days after infection [<abbr bid="B18">18</abbr>] and <it>P. medicaginis</it> and <it>M. truncatula</it> isolated from infected roots 10 days after infection (C. Vance, unpublished data). Sequences expressed during mutualistic interactions were obtained from cultures with <it>M. truncatula</it> and mycorrhizal (<it>Glomus versiforme</it>; M.J. Harrison, unpublished data) or rhizobacterial (<it>S. meliloti</it>; K. VandenBosch, unpublished data) endosymbionts several days after inoculation. Sequences expressed in pure, axenic cultures from <it>P. sojae</it> mycelia and zoospores [<abbr bid="B18">18</abbr>] and from sterile, uninoculated <it>M. truncatula</it> roots [<abbr bid="B6">6</abbr>] provided a basis for comparison in which no foreign transcripts were expected.</p>
            <p>To maximize the reliability of diagnostic comparisons, we screened test sequences for quality in the same way as the training sequences, and for low similarity to <it>E. coli</it>, chloroplast and mitochondrial genes, and non-coding RNA transcripts (ribosomal and transfer RNAs). Independent BLASTN comparisons identified sequences having very high similarity (<it>E</it> &lt; 10<sup>-100</sup>) to vector sequences or moderately high similarity (<it>E</it> &lt; 10<sup>-20</sup>) to non-nuclear or non-coding sequences obtained from GenBank. Sequences so identified were withheld from analysis. A summary of test sequences appears in Table <tblr tid="T3">3</tblr>. All test sequences obtained using the procedure described above are available as additional files.</p>
            <tbl id="T3">
               <title>
                  <p>Table 3</p>
               </title>
               <caption>
                  <p>Test sets</p>
               </caption>
               <tblbdy cols="9">
                  <r>
                     <c ca="left">
                        <p>Species</p>
                     </c>
                     <c ca="center">
                        <p>Tissue</p>
                     </c>
                     <c ca="center">
                        <p>Library (ID)</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Raw</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Trimmed</p>
                     </c>
                     <c cspan="2" ca="center">
                        <p>Screened</p>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                     <c cspan="2">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c>
                        <p/>
                     </c>
                     <c ca="center">
                        <p>n</p>
                     </c>
                     <c ca="center">
                        <p>nt</p>
                     </c>
                     <c ca="center">
                        <p>n</p>
                     </c>
                     <c ca="center">
                        <p>nt</p>
                     </c>
                     <c ca="center">
                        <p>n</p>
                     </c>
                     <c ca="center">
                        <p>nt</p>
                     </c>
                  </r>
                  <r>
                     <c cspan="9">
                        <hr/>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>P. sojae</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Mycelia</p>
                     </c>
                     <c ca="center">
                        <p>MY</p>
                     </c>
                     <c ca="center">
                        <p>969</p>
                     </c>
                     <c ca="center">
                        <p>527,295</p>
                     </c>
                     <c ca="center">
                        <p>902</p>
                     </c>
                     <c ca="center">
                        <p>510,010</p>
                     </c>
                     <c ca="center">
                        <p>895</p>
                     </c>
                     <c ca="center">
                        <p>506,086</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>P. sojae</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Zoospores</p>
                     </c>
                     <c ca="center">
                        <p>ZO</p>
                     </c>
                     <c ca="center">
                        <p>1,013</p>
                     </c>
                     <c ca="center">
                        <p>583,520</p>
                     </c>
                     <c ca="center">
                        <p>960</p>
                     </c>
                     <c ca="center">
                        <p>569,576</p>
                     </c>
                     <c ca="center">
                        <p>957</p>
                     </c>
                     <c ca="center">
                        <p>567,976</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>G. max</it></p>
                     </c>
                     <c ca="center">
                        <p>2 dpi</p>
                     </c>
                     <c ca="center">
                        <p>HA</p>
                     </c>
                     <c ca="center">
                        <p>994</p>
                     </c>
                     <c ca="center">
                        <p>577,626</p>
                     </c>
                     <c ca="center">
                        <p>938</p>
                     </c>
                     <c ca="center">
                        <p>563,226</p>
                     </c>
                     <c ca="center">
                        <p>927</p>
                     </c>
                     <c ca="center">
                        <p>556,305</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>M. truncatula</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Root hairs</p>
                     </c>
                     <c ca="center">
                        <p>MtRHE</p>
                     </c>
                     <c ca="center">
                        <p>899</p>
                     </c>
                     <c ca="center">
                        <p>539,719</p>
                     </c>
                     <c ca="center">
                        <p>893</p>
                     </c>
                     <c ca="center">
                        <p>536,787</p>
                     </c>
                     <c ca="center">
                        <p>890</p>
                     </c>
                     <c ca="center">
                        <p>534,037</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>G. versiforme</it></p>
                     </c>
                     <c ca="center">
                        <p>10&#8211;38 dpi</p>
                     </c>
                     <c ca="center">
                        <p>MHAM</p>
                     </c>
                     <c ca="center">
                        <p>3,259</p>
                     </c>
                     <c ca="center">
                        <p>1,785,721</p>
                     </c>
                     <c ca="center">
                        <p>3,030</p>
                     </c>
                     <c ca="center">
                        <p>1,735,390</p>
                     </c>
                     <c ca="center">
                        <p>3,017</p>
                     </c>
                     <c ca="center">
                        <p>1,725,491</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>P. medicaginis</it></p>
                     </c>
                     <c ca="center">
                        <p>10 dpi</p>
                     </c>
                     <c ca="center">
                        <p>DSIR</p>
                     </c>
                     <c ca="center">
                        <p>2,462</p>
                     </c>
                     <c ca="center">
                        <p>1,324,815</p>
                     </c>
                     <c ca="center">
                        <p>2,289</p>
                     </c>
                     <c ca="center">
                        <p>1,287,568</p>
                     </c>
                     <c ca="center">
                        <p>2,284</p>
                     </c>
                     <c ca="center">
                        <p>1,282,518</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>
                           <it>M. truncatula</it>
                        </p>
                     </c>
                     <c ca="center">
                        <p>Roots</p>
                     </c>
                     <c ca="center">
                        <p>KV0</p>
                     </c>
                     <c ca="center">
                        <p>2,718</p>
                     </c>
                     <c ca="center">
                        <p>1,387,832</p>
                     </c>
                     <c ca="center">
                        <p>2,550</p>
                     </c>
                     <c ca="center">
                        <p>1,351,137</p>
                     </c>
                     <c ca="center">
                        <p>2,492</p>
                     </c>
                     <c ca="center">
                        <p>1,318,131</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>S. meliloti</it></p>
                     </c>
                     <c ca="center">
                        <p>1 dpi</p>
                     </c>
                     <c ca="center">
                        <p>KV1</p>
                     </c>
                     <c ca="center">
                        <p>1,125</p>
                     </c>
                     <c ca="center">
                        <p>562,452</p>
                     </c>
                     <c ca="center">
                        <p>1,012</p>
                     </c>
                     <c ca="center">
                        <p>537,644</p>
                     </c>
                     <c ca="center">
                        <p>1,003</p>
                     </c>
                     <c ca="center">
                        <p>531,813</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>S. meliloti</it></p>
                     </c>
                     <c ca="center">
                        <p>2 dpi</p>
                     </c>
                     <c ca="center">
                        <p>KV2</p>
                     </c>
                     <c ca="center">
                        <p>1,960</p>
                     </c>
                     <c ca="center">
                        <p>976,344</p>
                     </c>
                     <c ca="center">
                        <p>1,732</p>
                     </c>
                     <c ca="center">
                        <p>926,953</p>
                     </c>
                     <c ca="center">
                        <p>1,726</p>
                     </c>
                     <c ca="center">
                        <p>922,433</p>
                     </c>
                  </r>
                  <r>
                     <c ca="left">
                        <p>+ <it>S. meliloti</it></p>
                     </c>
                     <c ca="center">
                        <p>3 dpi</p>
                     </c>
                     <c ca="center">
                        <p>KV3</p>
                     </c>
                     <c ca="center">
                        <p>2,375</p>
                     </c>
                     <c ca="center">
                        <p>1,316,430</p>
                     </c>
                     <c ca="center">
                        <p>2,217</p>
                     </c>
                     <c ca="center">
                        <p>1,279,691</p>
                     </c>
                     <c ca="center">
                        <p>2,173</p>
                     </c>
                     <c ca="center">
                        <p>1,251,795</p>
                     </c>
                  </r>
               </tblbdy>
               <tblfn>
                  <p>Number of EST sequences (n) and nucleotides (nt) as raw, trimmed (limited lengths of N-rich regions, poly(A) and poly(T) sites), and screened (removed ribosomal, chloroplast, and mitochondrial DNA, and remaining sequences shorter than 300 nt) sequences. Transcripts were isolated from the cDNA library indicated by the ID column. dpi, days post-inoculation, indicating mixed plant-microbe cultures.</p>
               </tblfn>
            </tbl>
         </sec>
         <sec>
            <st>
               <p>Base content</p>
            </st>
            <p>We wrote a PERL program (countGC.pl) that calculates the GC base content of a sequence as the proportion of guanine and cytosine residues among all unambiguous (non-N) nucleotides in that sequence. The hist function in R, version 1.1.1 [<abbr bid="B38">38</abbr>], aggregated continuous percentages into discrete histogram bins 2% GC wide, with inclusive lower bin boundaries and exclusive upper boundaries; the lm function tested for linear correlation of the dissimilarity test statistic <it>t</it> with GC.</p>
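            <p>The GC calculation and binning described above can be sketched in a few lines. This is an illustrative Python sketch, not the countGC.pl script itself: the function names are hypothetical, and it assumes that only A, C, G and T count as unambiguous bases, so any other symbol is excluded from the denominator along with N. Bin indices are computed from integer counts to avoid floating-point rounding at bin boundaries.</p>

```python
def gc_counts(sequence):
    """Return (GC count, count of unambiguous bases) for one sequence.

    Assumption: only A, C, G and T are treated as unambiguous; N and any
    other symbol are ignored.
    """
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    return gc, gc + at


def gc_fraction(sequence):
    """GC proportion among unambiguous bases, or None if there are none."""
    gc, total = gc_counts(sequence)
    return gc / total if total else None


def gc_bin(sequence, width_percent=2):
    """Index k of the bin [k*width, (k+1)*width) percent GC.

    The lower boundary is inclusive and the upper boundary exclusive.
    Integer arithmetic keeps boundary values such as exactly 50% in the
    intended bin.
    """
    gc, total = gc_counts(sequence)
    if total == 0:
        return None
    return (100 * gc) // (width_percent * total)


if __name__ == "__main__":
    example = "GGCCAATTNN"  # hypothetical sequence; the two N are ignored
    print(gc_fraction(example), gc_bin(example))
```

            <p>With the default 2% width, bin 25 covers GC values from 50% up to, but not including, 52%.</p>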
         </sec>
         <sec>
            <st>
               <p>Comparative lexical analysis</p>
            </st>
            <p>White <it>et al</it>. [<abbr bid="B19">19</abbr>] used a likelihood-ratio test to determine whether word frequencies from a particular sequence more closely resemble the frequency distribution of control data sets from the taxon being sequenced or a distantly related outgroup. They computed a test statistic <it>t(A,B,x)</it> for each sequence <it>x</it> as the difference between two log-likelihood-ratio dissimilarity measures, <it>D</it>(<it>A</it>,<it>x</it>) = -2log&#955;(<it>A</it>,<it>x</it>), for two data sets, a control set <it>A</it> and an outgroup <it>B</it>, such that <it>t(A,B,x) = D(A,x) - D(B,x)</it>. A negative value for <it>t</it> indicates that the sequence more closely resembles words from <it>A</it>; conversely, a positive value indicates a likely contaminant related to <it>B</it>. (Dissimilarity is conceptually related to distance. However, dissimilarity does not measure distance because it does not possess the mathematical properties of a distance metric [<abbr bid="B39">39</abbr>].) Unlike the calculation of calibration curves, in which 300-nucleotide subsequences are randomly resampled, hexamer dissimilarity is measured over the whole length of a test sequence when inferring a transcript's origin. Originally, the investigators used the null hypothesis that no difference exists for dissimilarity measures between the two data sets, or that <it>t(A,B,x)</it> = 0 [<abbr bid="B19">19</abbr>]. White <it>et al</it>. [<abbr bid="B19">19</abbr>] tested two alternative hypotheses: that <it>t</it> &lt; 0 (the sequence is more like <it>A</it>) or that <it>t</it> &gt; 0 (more like <it>B</it>).</p>
            <p>Lexical analysis using pentamers or heptamers yields similar error rates and very highly correlated values for the test result (not shown). Because White <it>et al</it>. [<abbr bid="B19">19</abbr>] reported that the best results were obtained using hexamers, and because a word size of six nucleotides corresponds to the size of a dicodon, we chose to analyze hexamer frequencies. Using longer words requires more training data, because the number of possible words increases exponentially with increasing word size. Use of shorter words may be adequate for some applications and will be investigated in future work.</p>
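            <p>To make the form of the test statistic concrete, the following Python sketch counts overlapping hexamers and computes <it>t(A,B,x) = D(A,x) - D(B,x)</it>. It is an illustration under stated assumptions, not the hybridize program: the function names are hypothetical, and <it>D</it> is taken to be the generic likelihood-ratio (G) statistic for homogeneity of two multinomial count vectors, which may differ in detail from the formulation of White <it>et al</it>. [<abbr bid="B19">19</abbr>].</p>

```python
import math
from collections import Counter


def hexamer_counts(sequence, k=6):
    """Count overlapping words of length k, skipping words with ambiguous bases."""
    seq = sequence.upper()
    counts = Counter()
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if set(word).issubset("ACGT"):
            counts[word] += 1
    return counts


def dissimilarity(ref, query):
    """Generic -2 log(lambda) for the hypothesis that two word-count
    vectors were drawn from the same multinomial distribution.

    Assumption: this is the textbook G statistic, used here only as a
    stand-in for the dissimilarity measure D described in the text.
    """
    n_ref = sum(ref.values())
    n_query = sum(query.values())
    total = n_ref + n_query
    d = 0.0
    for word in set(ref) | set(query):
        pooled = (ref[word] + query[word]) / total  # pooled word frequency
        for count, n in ((ref[word], n_ref), (query[word], n_query)):
            if count > 0:
                d += 2.0 * count * math.log(count / (n * pooled))
    return d


def t_statistic(a_counts, b_counts, sequence):
    """t(A,B,x) = D(A,x) - D(B,x); negative values mean x resembles A."""
    x = hexamer_counts(sequence)
    return dissimilarity(a_counts, x) - dissimilarity(b_counts, x)


if __name__ == "__main__":
    # Toy training sets (hypothetical sequences, far smaller than real ones).
    a = hexamer_counts("ATATTAATTTAAATAT" * 20)
    b = hexamer_counts("GCGGCCGCGCCGGGCC" * 20)
    print(t_statistic(a, b, "ATATTAATTTAAATAT" * 3))
```

            <p>In this toy case the query shares its hexamers with set <it>A</it> and none with set <it>B</it>, so the printed value of <it>t</it> is negative.</p>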
            <p>Though we used White's word-counting methods, we did make slight modifications. We simplified one program (called hybridize) to compute individual dissimilarity values, rather than paired differences; a patch that details how to modify the C program is available (see hyb2dis.txt in additional data files). More importantly, we amended the null hypothesis and interpreted calibration curves to test for statistically significant dissimilarity differences. Though the likelihood-ratio test statistic indicates the magnitude of similarity to <it>A</it> or <it>B</it>, we do not know what values for <it>t</it> are significant with known confidence. When testing hypotheses, one can make two types of error: type I, or false positives, and type II, false negatives [<abbr bid="B28">28</abbr>]. The false-positive rate is denoted &#945; and false-negative rate &#946;. We determine &#945; and &#946; from overlap in the calibration curves. Inferring error rates from calibration curves is justified because we know the correct answer and determine the error rate via resampling, as with bootstrap methods to infer error rates or confidence intervals [<abbr bid="B40">40</abbr>].</p>
            <p>We are interested in knowing from which of two organisms a sequence originated, and are reasonably confident that it came from either one or the other. Thus, we assume it came from one and test whether we have evidence to refute this assumption. The null hypothesis here is that sequence <it>x</it> is from <it>A</it>. Alternatively, it might be from <it>B</it>. Evaluating the calibration curve overlap at <it>t</it> = 0 quantifies the associated error rates. The cumulative distribution function (<it>cdf</it>) of taxon <it>B</it> specifies &#946; as <it>cdf</it><sub>B</sub>(0), the value at which the curve for <it>B</it> crosses <it>t</it> = 0; the <it>cdf</it> from <it>A</it> specifies &#945; as 1-<it>cdf</it><sub>A</sub>(0). We can thus resolve the problem with known confidence: under the null hypothesis, <it>P</it>(<it>t</it> &gt; 0) = &#945;. All other computations were performed as described previously [<abbr bid="B19">19</abbr>]. Software used for lexical analysis was obtained via anonymous ftp from the TIGR software FTP site [<abbr bid="B41">41</abbr>].</p>
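            <p>The error rates read from the calibration curves can be expressed as empirical proportions. The Python sketch below assumes that values of <it>t</it> have already been computed for resampled subsequences of known origin. The function names and example values are hypothetical and are not results from this study.</p>

```python
def empirical_cdf(values, threshold=0.0):
    """Fraction of calibration values at or below the threshold."""
    values = list(values)
    if not values:
        raise ValueError("need at least one calibration value")
    at_or_below = sum(1 for v in values if threshold >= v)
    return at_or_below / len(values)


def error_rates(t_from_a, t_from_b, threshold=0.0):
    """Return (alpha, beta) for the null hypothesis 'x is from A'.

    alpha = 1 - cdf_A(threshold): sequences truly from A that score above
            the threshold and are wrongly rejected (false positives).
    beta  = cdf_B(threshold): sequences truly from B that score at or
            below the threshold and are wrongly retained (false negatives).
    """
    alpha = 1.0 - empirical_cdf(t_from_a, threshold)
    beta = empirical_cdf(t_from_b, threshold)
    return alpha, beta


if __name__ == "__main__":
    # Hypothetical calibration values for illustration only.
    t_a = [-5.0, -3.0, -1.0, 2.0]
    t_b = [4.0, 6.0, -2.0, 8.0, 10.0]
    print(error_rates(t_a, t_b))
```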
         </sec>
      </sec>
      <sec>
         <st>
            <p>Additional data files</p>
         </st>
         <p>The following files are available for download:</p>
         <p><supplr sid="S1">countGC.pl</supplr>: PERL script used to compute GC content of sequences analyzed.</p>
         <p><supplr sid="S2">hyb2dis.txt</supplr>: patch file that converts White's hybridize program to compute individual dissimilarity values.</p>
         <p><supplr sid="S3">Training sets</supplr> (GlycineMedicago.txt, Rhizobia.txt, Stramenopiles.txt, ZygoChytrid.txt): FASTA-formatted text files that contain the sequences used for calibration and comparison.</p>
         <p><supplr sid="S4">Test sets</supplr> (PsojaeHA.txt, PsojaeMY.txt, PsojaeZO.txt, MtRHE.txt, DSIR.txt, MHAM.txt, KV0.txt, KV2.txt, KV3.txt): FASTA-formatted text files containing transcripts analyzed, edited for quality.</p>
         <p><supplr sid="S5">Test results</supplr> (PsojaeHA.dat, PsojaeMY.dat, PsojaeZO.dat, MtRHE-A1B1.dat, MtRHE-A2B2.dat, DSIR.dat, MHAM.dat, KV0.dat, KV2.dat, KV3.dat): text files that contain transcript analysis results, sorted from least to most plant-like.</p>
         <suppl id="S1">
            <title>
               <p>countGC.pl</p>
            </title>
            <caption>
               <p/>
            </caption>
            <text>
               <p>PERL script used to compute GC content of sequences analyzed.</p>
            </text>
            <file name="gb-2001-2-9-research0037-S1.pl">
               <p>countGC.pl</p>
            </file>
         </suppl>
         <suppl id="S2">
            <title>
               <p>hyb2dis.txt</p>
            </title>
            <caption>
               <p/>
            </caption>
            <text>
               <p>Patch file that converts White's hybridize program to compute individual dissimilarity values.</p>
            </text>
            <file name="gb-2001-2-9-research0037-S2.txt">
               <p>hyb2dis.txt</p>
            </file>
         </suppl>
         <suppl id="S3">
            <title>
               <p>Training sets</p>
            </title>
            <caption>
               <p/>
            </caption>
            <text>
               <p>FASTA-formatted text files that contain the sequences used for calibration and comparison.</p>
            </text>
            <file name="gb-2001-2-9-research0037-S3.txt">
               <p>GlycineMedicago.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S4.txt">
               <p>Rhizobia.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S5.txt">
               <p>Stramenopiles.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S6.txt">
               <p>ZygoChytrid.txt</p>
            </file>
         </suppl>
         <suppl id="S4">
            <title>
               <p>Test sets</p>
            </title>
            <caption>
               <p/>
            </caption>
            <text>
               <p>FASTA-formatted text files containing transcripts analyzed, edited for quality.</p>
            </text>
            <file name="gb-2001-2-9-research0037-S7.txt">
               <p>PsojaeHA.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S8.txt">
               <p>PsojaeMY.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S9.txt">
               <p>PsojaeZO.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S10.txt">
               <p>MtRHE.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S11.txt">
               <p>DSIR.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S12.txt">
               <p>KV0.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S13.txt">
               <p>KV2.txt</p>
            </file>
            <file name="gb-2001-2-9-research0037-S14.txt">
               <p>KV3.txt</p>
            </file>
         </suppl>
         <suppl id="S5">
            <title>
               <p>Test results</p>
            </title>
            <caption>
               <p/>
            </caption>
            <text>
               <p>Text files that contain transcript analysis results, sorted from least to most plant-like.</p>
            </text>
            <file name="gb-2001-2-9-research0037-S15.dat">
               <p>PsojaeHA.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S16.dat">
               <p>PsojaeMY.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S17.dat">
               <p>PsojaeZO.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S18.dat">
               <p>MtRHE-A1B1.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S19.dat">
               <p>MtRHE-A2B2.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S20.dat">
               <p>DSIR.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S21.dat">
               <p>MHAM.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S22.dat">
               <p>KV0.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S23.dat">
               <p>KV2.dat</p>
            </file>
            <file name="gb-2001-2-9-research0037-S24.dat">
               <p>KV3.dat</p>
            </file>
         </suppl>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgements</p>
            </st>
            <p>We thank Callum Bell, Mark Gijzen, Maria Harrison, Tom Kepler, Deb Samac, and Bruno Sobral for valued discussions and feedback. Comments from B. M. Tyler and an anonymous reviewer on an earlier version of this work greatly enhanced its presentation. PTH thanks the Santa Fe Institute for support and inspiration.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <aug>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Fields</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
            </aug>
            <source>Automated DNA sequencing and analysis. London: Academic Press,</source>
            <pubdate>1994</pubdate>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Complementary DNA sequencing: expressed sequence tags and human genome project.</p>
            </title>
            <aug>
               <au>
                  <snm>Adams</snm>
                  <fnm>MD</fnm>
               </au>
               <au>
                  <snm>Kelley</snm>
                  <fnm>JM</fnm>
               </au>
               <au>
                  <snm>Gocayne</snm>
                  <fnm>JD</fnm>
               </au>
               <au>
                  <snm>Dubnick</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Polymeropoulos</snm>
                  <fnm>MH</fnm>
               </au>
               <au>
                  <snm>Xiao</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Merril</snm>
                  <fnm>CR</fnm>
               </au>
               <au>
                  <snm>Wu</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Olde</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Moreno</snm>
                  <fnm>RF</fnm>
               </au>
               <etal/>
            </aug>
            <source>Science</source>
            <pubdate>1991</pubdate>
            <volume>252</volume>
            <fpage>1651</fpage>
            <lpage>1656</lpage>
            <xrefbib>
               <pubid idtype="pmpid">2047873</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B3">
            <title>
               <p><it>Medicago truncatula</it> - a model in the making!</p>
            </title>
            <aug>
               <au>
                  <snm>Cook</snm>
                  <fnm>DR</fnm>
               </au>
            </aug>
            <source>Curr Opin Plant Biol</source>
            <pubdate>1999</pubdate>
            <volume>2</volume>
            <fpage>301</fpage>
            <lpage>304</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1369-5266(99)80053-3</pubid>
                  <pubid idtype="pmpid" link="fulltext">10459004</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Large-scale sequencing of plant genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Rounsley</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Lin</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Ketchum</snm>
                  <fnm>KA</fnm>
               </au>
            </aug>
            <source>Curr Opin Plant Biol</source>
            <pubdate>1998</pubdate>
            <volume>1</volume>
            <fpage>136</fpage>
            <lpage>141</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S1369-5266(98)80015-0</pubid>
                  <pubid idtype="pmpid" link="fulltext">10066574</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Plant functional genomics.</p>
            </title>
            <aug>
               <au>
                  <snm>Somerville</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Somerville</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1999</pubdate>
            <volume>285</volume>
            <fpage>380</fpage>
            <lpage>383</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.285.5426.380</pubid>
                  <pubid idtype="pmpid" link="fulltext">10411495</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Expressed sequence tags from a root-hair-enriched <it>Medicago truncatula</it> cDNA library.</p>
            </title>
            <aug>
               <au>
                  <snm>Covitz</snm>
                  <fnm>PA</fnm>
               </au>
               <au>
                  <snm>Smith</snm>
                  <fnm>LS</fnm>
               </au>
               <au>
                  <snm>Long</snm>
                  <fnm>SR</fnm>
               </au>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>1998</pubdate>
            <volume>117</volume>
            <fpage>1325</fpage>
            <lpage>1332</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">34896</pubid>
                  <pubid idtype="pmpid" link="fulltext">9701588</pubid>
                  <pubid idtype="doi">10.1104/pp.117.4.1325</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Generation and analysis of 280,000 human expressed sequence tags.</p>
            </title>
            <aug>
               <au>
                  <snm>Hillier</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Lennon</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Becker</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Bonaldo</snm>
                  <fnm>MF</fnm>
               </au>
               <au>
                  <snm>Chiapelli</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Chissoe</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Dietrich</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>DuBuque</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Favello</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <etal/>
            </aug>
            <source>Genome Res</source>
            <pubdate>1996</pubdate>
            <volume>6</volume>
            <fpage>807</fpage>
            <lpage>828</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8889549</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNA from <it>Arabidopsis thaliana</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>H&#246;fte</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Desprez</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Amselem</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Chiapello</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>Rouze</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Caboche</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Moisan</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jourjon</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Charpenteau</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Berthomieu</snm>
                  <fnm>P</fnm>
               </au>
               <etal/>
            </aug>
            <source>Plant J</source>
            <pubdate>1993</pubdate>
            <volume>4</volume>
            <fpage>1051</fpage>
            <lpage>1061</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1046/j.1365-313X.1993.04061051.x</pubid>
                  <pubid idtype="pmpid" link="fulltext">8281187</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Expressed sequences from conidial, mycelial, and sexual stages of <it>Neurospora crassa</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Nelson</snm>
                  <fnm>MA</fnm>
               </au>
               <au>
                  <snm>Kang</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Braun</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Crawford</snm>
                  <fnm>ME</fnm>
               </au>
               <au>
                  <snm>Dolan</snm>
                  <fnm>PL</fnm>
               </au>
               <au>
                  <snm>Leonard</snm>
                  <fnm>PM</fnm>
               </au>
               <au>
                  <snm>Mitchell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Armijo</snm>
                  <fnm>AM</fnm>
               </au>
               <au>
                  <snm>Bean</snm>
                  <fnm>LL</fnm>
               </au>
               <au>
                  <snm>Blueyes</snm>
                  <fnm>E</fnm>
               </au>
               <etal/>
            </aug>
            <source>Fungal Genet Biol</source>
            <pubdate>1997</pubdate>
            <volume>21</volume>
            <fpage>348</fpage>
            <lpage>363</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/fgbi.1997.0986</pubid>
                  <pubid idtype="pmpid" link="fulltext">9290248</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Genes galore: a summary of the methods for accessing the results of large scale partial sequencing of anonymous <it>Arabidopsis thaliana</it> cDNA clones.</p>
            </title>
            <aug>
               <au>
                  <snm>Newman</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>de Bruijn</snm>
                  <fnm>F</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Keegstra</snm>
                  <fnm>K</fnm>
               </au>
               <au>
                  <snm>Kende</snm>
                  <fnm>H</fnm>
               </au>
               <au>
                  <snm>McIntosh</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Ohlrogge</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Raikhel</snm>
                  <fnm>N</fnm>
               </au>
               <au>
                  <snm>Somerville</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Thomashow</snm>
                  <fnm>M</fnm>
               </au>
               <etal/>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>1994</pubdate>
            <volume>106</volume>
            <fpage>1241</fpage>
            <lpage>1255</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">159661</pubid>
                  <pubid idtype="pmpid">7846151</pubid>
                  <pubid idtype="doi">10.1104/pp.106.4.1241</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Towards an <it>in silico</it> analysis of transcription patterns.</p>
            </title>
            <aug>
               <au>
                  <snm>Bortoluzzi</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Danieli</snm>
                  <fnm>G</fnm>
               </au>
            </aug>
            <source>Trends Genet</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>118</fpage>
            <lpage>119</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0168-9525(98)01682-5</pubid>
                  <pubid idtype="pmpid" link="fulltext">10203810</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Biotrophic interfaces and nutrient transport in plant/fungal symbioses.</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>J Exp Bot</source>
            <pubdate>1999</pubdate>
            <volume>50</volume>
            <fpage>1013</fpage>
            <lpage>1022</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1093/jexbot/50.suppl_1.1013</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Isolation of potato genes that are induced during an early stage of the hypersensitive response to <it>Phytophthora infestans</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Birch</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Avrova</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Duncan</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Lyon</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Toth</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Mol Plant-Microbe Interact</source>
            <pubdate>1999</pubdate>
            <volume>12</volume>
            <fpage>356</fpage>
            <lpage>361</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Analysis of <it>Medicago truncatula</it> nodule expressed sequence tags.</p>
            </title>
            <aug>
               <au>
                  <snm>Gy&#246;rgyey</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Vaubert</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Jim&#233;nez-Zurdo</snm>
                  <fnm>JI</fnm>
               </au>
               <au>
                  <snm>Charon</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Troussard</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Kondorosi</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Kondorosi</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Mol Plant-Microbe Interact</source>
            <pubdate>2000</pubdate>
            <volume>13</volume>
            <fpage>62</fpage>
            <lpage>71</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10656586</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Molecular and cellular aspects of the arbuscular mycorrhizal symbiosis.</p>
            </title>
            <aug>
               <au>
                  <snm>Harrison</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Annu Rev Plant Physiol Plant Mol Biol</source>
            <pubdate>1999</pubdate>
            <volume>50</volume>
            <fpage>361</fpage>
            <lpage>389</lpage>
            <xrefbib>
               <pubid idtype="doi">10.1146/annurev.arplant.50.1.361</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Initial assessment of gene diversity for the oomycete pathogen <it>Phytophthora infestans</it> based on expressed sequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Kamoun</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Hraber</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Sobral</snm>
                  <fnm>BWS</fnm>
               </au>
               <au>
                  <snm>Nuss</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Govers</snm>
                  <fnm>F</fnm>
               </au>
            </aug>
            <source>Fungal Genet Biol</source>
            <pubdate>1999</pubdate>
            <volume>28</volume>
            <fpage>94</fpage>
            <lpage>106</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/fgbi.1999.1166</pubid>
                  <pubid idtype="pmpid" link="fulltext">10587472</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>Novel genes induced during an arbuscular mycorrhizal (AM) symbiosis formed between <it>Medicago truncatula</it> and <it>Glomus versiforme</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>van Buuren</snm>
                  <fnm>ML</fnm>
               </au>
               <au>
                  <snm>Maldonado-Mendoza</snm>
                  <fnm>IE</fnm>
               </au>
               <au>
                  <snm>Trieu</snm>
                  <fnm>AT</fnm>
               </au>
               <au>
                  <snm>Blaylock</snm>
                  <fnm>LA</fnm>
               </au>
               <au>
                  <snm>Harrison</snm>
                  <fnm>MJ</fnm>
               </au>
            </aug>
            <source>Mol Plant-Microbe Interact</source>
            <pubdate>1999</pubdate>
            <volume>12</volume>
            <fpage>171</fpage>
            <lpage>181</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10065555</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Comparative analysis of expressed sequences in <it>Phytophthora sojae</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Qutob</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Hraber</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Sobral</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Gijzen</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Plant Physiol</source>
            <pubdate>2000</pubdate>
            <volume>123</volume>
            <fpage>243</fpage>
            <lpage>254</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">58998</pubid>
                  <pubid idtype="pmpid" link="fulltext">10806241</pubid>
                  <pubid idtype="doi">10.1104/pp.123.1.243</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B19">
            <title>
               <p>A quality control algorithm for DNA sequencing projects.</p>
            </title>
            <aug>
               <au>
                  <snm>White</snm>
                  <fnm>O</fnm>
               </au>
               <au>
                  <snm>Dunning</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Sutton</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Adams</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Venter</snm>
                  <fnm>JC</fnm>
               </au>
               <au>
                  <snm>Fields</snm>
                  <fnm>C</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1993</pubdate>
            <volume>21</volume>
            <fpage>3829</fpage>
            <lpage>3838</lpage>
            <xrefbib>
               <pubid idtype="pmpid">8367301</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Basic local alignment search tool.</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Gish</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Myers</snm>
                  <fnm>EW</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1990</pubdate>
            <volume>215</volume>
            <fpage>403</fpage>
            <lpage>410</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/jmbi.1990.9999</pubid>
                  <pubid idtype="pmpid" link="fulltext">2231712</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Gapped BLAST and PSI-BLAST: a new generation of protein database search programs.</p>
            </title>
            <aug>
               <au>
                  <snm>Altschul</snm>
                  <fnm>SF</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schaffer</snm>
                  <fnm>AA</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Zhang</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>W</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1997</pubdate>
            <volume>25</volume>
            <fpage>3389</fpage>
            <lpage>3402</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">146917</pubid>
                  <pubid idtype="pmpid" link="fulltext">9254694</pubid>
                  <pubid idtype="doi">10.1093/nar/25.17.3389</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Identification of common molecular subsequences.</p>
            </title>
            <aug>
               <au>
                  <snm>Smith</snm>
                  <fnm>TF</fnm>
               </au>
               <au>
                  <snm>Waterman</snm>
                  <fnm>MS</fnm>
               </au>
            </aug>
            <source>J Mol Biol</source>
            <pubdate>1981</pubdate>
            <volume>147</volume>
            <fpage>195</fpage>
            <lpage>197</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7265238</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B23">
            <title>
               <p>Large scale comparison of fungal sequence information: mechanisms of innovation in <it>Neurospora crassa</it> and gene loss in <it>Saccharomyces cerevisiae</it>.</p>
            </title>
            <aug>
               <au>
                  <snm>Braun</snm>
                  <fnm>E</fnm>
               </au>
               <au>
                  <snm>Halpern</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Nelson</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Natvig</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>2000</pubdate>
            <volume>10</volume>
            <fpage>416</fpage>
            <lpage>430</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1101/gr.10.4.416</pubid>
                  <pubid idtype="pmpid" link="fulltext">10779483</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B24">
            <title>
               <p>A simple model based on mutation and selection explains trends in codon and amino-acid usage and GC composition within and across genomes.</p>
            </title>
            <aug>
               <au>
                  <snm>Knight</snm>
                  <fnm>RD</fnm>
               </au>
               <au>
                  <snm>Freeland</snm>
                  <fnm>SJ</fnm>
               </au>
               <au>
                  <snm>Landweber</snm>
                  <fnm>LF</fnm>
               </au>
            </aug>
            <source>Genome Biol</source>
            <pubdate>2001</pubdate>
            <volume>2</volume>
            <fpage>1</fpage>
            <lpage>0010</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">31479</pubid>
                  <pubid idtype="pmpid" link="fulltext">11305938</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B25">
            <title>
               <p>Codon frequencies in 119 individual genes confirm consistent choices of degenerate bases according to genome type.</p>
            </title>
            <aug>
               <au>
                  <snm>Grantham</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gautier</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Gouy</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>1980</pubdate>
            <volume>8</volume>
            <fpage>1893</fpage>
            <lpage>1912</lpage>
            <xrefbib>
               <pubid idtype="pmpid">6159596</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B26">
            <aug>
               <au>
                  <snm>Li</snm>
                  <fnm>WH</fnm>
               </au>
               <au>
                  <snm>Graur</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Fundamentals of Molecular Evolution. Sunderland, MA: Sinauer Associates,</source>
            <pubdate>1991</pubdate>
         </bibl>
         <bibl id="B27">
            <title>
               <p>Accurate methods for the statistics of surprise and coincidence.</p>
            </title>
            <aug>
               <au>
                  <snm>Dunning</snm>
                  <fnm>T</fnm>
               </au>
            </aug>
            <source>Comput Linguist</source>
            <pubdate>1993</pubdate>
            <volume>19</volume>
            <fpage>61</fpage>
            <lpage>74</lpage>
         </bibl>
         <bibl id="B28">
            <aug>
               <au>
                  <snm>Freund</snm>
                  <fnm>JE</fnm>
               </au>
               <au>
                  <snm>Walpole</snm>
                  <fnm>RE</fnm>
               </au>
            </aug>
            <source>Mathematical Statistics. Englewood Cliffs, NJ: Prentice-Hall,</source>
            <pubdate>1980</pubdate>
         </bibl>
         <bibl id="B29">
            <title>
               <p>GenBank.</p>
            </title>
            <aug>
               <au>
                  <snm>Benson</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Karsch-Mizrachi</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Lipman</snm>
                  <fnm>DJ</fnm>
               </au>
               <au>
                  <snm>Ostell</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>BA</fnm>
               </au>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>15</fpage>
            <lpage>18</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102453</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592170</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.15</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B30">
            <title>
               <p>Database resources of the National Center for Biotechnology Information.</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Chappey</snm>
                  <fnm>C</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Leipe</snm>
                  <fnm>DD</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Rapp</snm>
                  <fnm>BA</fnm>
               </au>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2000</pubdate>
            <volume>28</volume>
            <fpage>10</fpage>
            <lpage>14</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">102437</pubid>
                  <pubid idtype="pmpid" link="fulltext">10592169</pubid>
                  <pubid idtype="doi">10.1093/nar/28.1.10</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B31">
            <title>
               <p>Database resources of the National Center for Biotechnology Information.</p>
            </title>
            <aug>
               <au>
                  <snm>Wheeler</snm>
                  <fnm>DL</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>DM</fnm>
               </au>
               <au>
                  <snm>Lash</snm>
                  <fnm>AE</fnm>
               </au>
               <au>
                  <snm>Leipe</snm>
                  <fnm>DD</fnm>
               </au>
               <au>
                  <snm>Madden</snm>
                  <fnm>TL</fnm>
               </au>
               <au>
                  <snm>Pontius</snm>
                  <fnm>JU</fnm>
               </au>
               <au>
                  <snm>Schuler</snm>
                  <fnm>GD</fnm>
               </au>
               <au>
                  <snm>Schriml</snm>
                  <fnm>LM</fnm>
               </au>
               <au>
                  <snm>Tatusova</snm>
                  <fnm>TA</fnm>
               </au>
               <au>
                  <snm>Wagner</snm>
                  <fnm>L</fnm>
               </au>
               <etal/>
            </aug>
            <source>Nucleic Acids Res</source>
            <pubdate>2001</pubdate>
            <volume>29</volume>
            <fpage>11</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">29800</pubid>
                  <pubid idtype="pmpid" link="fulltext">11125038</pubid>
                  <pubid idtype="doi">10.1093/nar/29.1.11</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B32">
            <title>
               <p>Noncoding RNA genes.</p>
            </title>
            <aug>
               <au>
                  <snm>Eddy</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Curr Opin Genet Dev</source>
            <pubdate>1999</pubdate>
            <volume>9</volume>
            <fpage>695</fpage>
            <lpage>699</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1016/S0959-437X(99)00022-2</pubid>
                  <pubid idtype="pmpid" link="fulltext">10607607</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B33">
            <title>
               <p>The calculation of intraradical fungal biomass from percent colonization in vesicular-arbuscular mycorrhizae.</p>
            </title>
            <aug>
               <au>
                  <snm>Toth</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Miller</snm>
                  <fnm>RM</fnm>
               </au>
               <au>
                  <snm>Jarstfer</snm>
                  <fnm>AG</fnm>
               </au>
               <au>
                  <snm>Alexander</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Bennett</snm>
                  <fnm>EL</fnm>
               </au>
            </aug>
            <source>Mycologia</source>
            <pubdate>1991</pubdate>
            <volume>83</volume>
            <fpage>553</fpage>
            <lpage>558</lpage>
         </bibl>
         <bibl id="B34">
            <title>
               <p>Evolutionary relationships among the eukaryotic crown taxa taking into account site-to-site rate variation in 18S rRNA.</p>
            </title>
            <aug>
               <au>
                  <snm>van de Peer</snm>
                  <fnm>Y</fnm>
               </au>
               <au>
                  <snm>de Wachter</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Mol Evol</source>
            <pubdate>1997</pubdate>
            <volume>45</volume>
            <fpage>619</fpage>
            <lpage>630</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9419239</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B35">
            <aug>
               <au>
                  <snm>Lewin</snm>
                  <fnm>B</fnm>
               </au>
            </aug>
            <source>Genes V. Oxford, UK: Oxford University Press,</source>
            <pubdate>1995</pubdate>
         </bibl>
         <bibl id="B36">
            <aug>
               <au>
                  <snm>Futuyma</snm>
                  <fnm>DJ</fnm>
               </au>
            </aug>
            <source>Evolutionary Biology. Third edition. Sunderland, MA: Sinauer Associates,</source>
            <pubdate>1998</pubdate>
         </bibl>
         <bibl id="B37">
            <aug>
               <au>
                  <snm>Harvey</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Pagel</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>The Comparative Method in Evolutionary Biology. Oxford, UK: Oxford University Press,</source>
            <pubdate>1991</pubdate>
         </bibl>
         <bibl id="B38">
            <title>
               <p>R: a language for data analysis and graphics.</p>
            </title>
            <aug>
               <au>
                  <snm>Ihaka</snm>
                  <fnm>R</fnm>
               </au>
               <au>
                  <snm>Gentleman</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Comput Graph Stat</source>
            <pubdate>1996</pubdate>
            <volume>5</volume>
            <fpage>299</fpage>
            <lpage>314</lpage>
         </bibl>
         <bibl id="B39">
            <aug>
               <au>
                  <snm>Weir</snm>
                  <fnm>BS</fnm>
               </au>
            </aug>
            <source>Genetic Data Analysis. Second edition. Sunderland, MA: Sinauer Associates,</source>
            <pubdate>1996</pubdate>
         </bibl>
         <bibl id="B40">
            <aug>
               <au>
                  <snm>Efron</snm>
                  <fnm>B</fnm>
               </au>
               <au>
                  <snm>Tibshirani</snm>
                  <fnm>RJ</fnm>
               </au>
            </aug>
            <source>An Introduction to the Bootstrap. New York, NY: Chapman and Hall,</source>
            <pubdate>1993</pubdate>
         </bibl>
         <bibl id="B41">
            <title>
               <p>TIGR Software</p>
            </title>
            <url>ftp://ftp.tigr.org/pub/software/qc</url>
         </bibl>
      </refgrp>
   </bm>
</art>

