<?xml version='1.0'?>
<!DOCTYPE art SYSTEM 'http://www.biomedcentral.com/xml/article.dtd'>
<art>
   <ui>gb-2003-4-6-p5</ui>
   <ji>GBJ</ji>
   <fm>
      <dochead>Deposited research article</dochead>
      <bibl>
         <title>
            <p>Reverse engineering of gene regulatory networks: a finite state linear model</p>
         </title>
         <aug>
            <au id="A1">
               <snm>Brazma</snm>
               <fnm>Alvis</fnm>
               <insr iid="I1"/>
            </au>
            <au id="A2" ca="yes">
               <snm>Schlitt</snm>
               <fnm>Thomas</fnm>
               <insr iid="I1"/>
               <email>schlitt@ebi.ac.uk</email>
            </au>
         </aug>
         <insg>
            <ins id="I1">
               <p>EMBL European Bioinformatics Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SD, United Kingdom</p>
            </ins>
         </insg>
         <source>Genome Biology</source>
         <issn>1465-6906</issn>
         <pubdate>2003</pubdate>
         <volume>4</volume>
         <issue>6</issue>
         <fpage>P5</fpage>
         <url>http://genomebiology.com/2003/4/6/P5</url>
         <note>This was the first version of this article to be made available publicly. </note>
         <xrefbib>
            <pubid idtype="doi">10.1186/gb-2003-4-6-p5</pubid>
         </xrefbib>
      </bibl>
      <history>
         <rec>
            <date>
               <day>14</day>
               <month>4</month>
               <year>2003</year>
            </date>
         </rec>
         <pub>
            <date>
               <day>29</day>
               <month>4</month>
               <year>2003</year>
            </date>
         </pub>
      </history>
      <cpyrt>
         <year>2003</year>
         <collab>BioMed Central Ltd</collab>
      </cpyrt>
      <kwdg>
         <kwd>gene regulation</kwd>
         <kwd>regulatory networks</kwd>
         <kwd>regulatory circuits</kwd>
         <kwd>dynamic systems</kwd>
         <kwd>finite state automata</kwd>
         <kwd>reverse engineering</kwd>
      </kwdg>
      <abs>
         <sec>
            <st>
               <p>Abstract</p>
            </st>
            <p>We propose a new model for describing gene regulatory networks that can capture discrete (Boolean) and continuous (differential) aspects of gene regulation. After giving some illustrations of the model, we study the problem of the reverse engineering of such networks, i.e., how to construct a network from gene expression data. We prove that for our model there exists an algorithm finding a network compatible with the given data. We demonstrate the model by simulating lambda-phage. We also describe some generalizations of the model, discuss their relevance to the real-world gene networks and formulate a number of open problems.</p>
         </sec>
      </abs>
   </fm>
   <bdy>
      <sec>
         <st>
            <p>Background</p>
         </st>
         <p>There are many mechanisms by which genes are regulated. An important role in gene regulation is apparently played by specific proteins, called transcription factors, which influence the transcription of particular genes by binding to specific parts of the DNA in the genome. In this way a product of one gene can influence the expression of another gene, and we can consider a network of gene regulation. Such regulatory networks or circuits are well studied in lambda-phage and some other viruses <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>. If the network involves only a few genes, its functioning can be understood relatively directly. But what does it mean to understand a gene regulatory network of hundreds or thousands of genes? Just describing such a network may be highly nontrivial. We think that to be able to understand complex gene regulatory networks, first a formal language for describing such networks has to be developed. The language can be graph based and preferably should allow the simulation of the behaviour of the network. By simulating a network we can make predictions and compare them to experimental data. If the predictions are consistent with the data, then we can say that the model is correct (within the given accuracy limits). Such an approach is usual in physics: models (theories) are built to explain existing data, then predictions are made, which again are compared to new data. If the correspondence is good, it is claimed that the phenomenon has been understood. Preferably, the model should not be a black box, but should be interpretable, and ideally its elements should have an interpretation in the real world consistent with the existing knowledge. At the same time, each model involves a simplification of the real world, which is a part of the strength of the modelling approaches.</p>
         <p>Various models for gene regulatory networks have been proposed and studied (see for instance <abbrgrp><abbr bid="B2">2</abbr><abbr bid="B3">3</abbr></abbrgrp>). In general these models fall into two categories: boolean network based models, for instance <abbrgrp><abbr bid="B4">4</abbr><abbr bid="B5">5</abbr><abbr bid="B6">6</abbr></abbrgrp>, and dynamic systems described by differential or difference equations, for instance <abbrgrp><abbr bid="B7">7</abbr><abbr bid="B8">8</abbr></abbrgrp>. Each of these categories has its advantages and drawbacks. The Boolean model is based on the assumption that the important aspects of gene regulation can be described by binary on/off switches, functioning in discrete time steps: the state of the network at time point n is determined by its state at time point n-1. Even if we generalize these models to more than two discrete states, they cannot describe continuous changes that happen in the cell environment. These can be described by differential equation based models, which on the other hand cannot easily describe the discrete aspects of gene regulation such as binding of a transcription factor to the DNA, which is essentially an on/off event. Also, in a differential equation model it is difficult (though not impossible) to describe non-additive logics in gene regulation (for instance, competitive events), as well as time delays.</p>
         <p>Models trying to combine the discrete and continuous components have been proposed, for instance in <abbrgrp><abbr bid="B9">9</abbr><abbr bid="B10">10</abbr><abbr bid="B11">11</abbr></abbrgrp>. Thomas and Thieffry <abbrgrp><abbr bid="B12">12</abbr><abbr bid="B13">13</abbr></abbrgrp> describe a combined model for qualitative description of gene regulatory networks. They introduce a notion of gene state and image, the latter effectively representing the substance produced by the respective gene. There is a time delay between the change of the gene state and the change of the image state. By introducing different levels of gene activity and thresholds for switching the gene states, they go beyond binary models. They study the qualitative behaviours of various feedback loops in their model, and show that they fall into two classes: positive loops leading to multiple stable states and negative ones leading to periodicity.</p>
         <p>The finite state linear model proposed in this paper combines the discrete and continuous aspects of gene regulation in a simple and structured way. It has a boolean network type discrete control component, and an environment of substances changing their concentrations continuously. Time is continuous, and the state of the network directly determines only the concentration change rates, while the state is affected by the concentrations themselves.</p>
         <p>A framework (a formal language) for describing gene regulatory networks enables us to study the problem of building particular models from gene expression data, a task often referred to as the reverse engineering of gene networks (e.g., <abbrgrp><abbr bid="B3">3</abbr><abbr bid="B5">5</abbr></abbrgrp>). Until recently, few quantitative data were available for building models for gene regulation. Most of the earlier gene network models, including <abbrgrp><abbr bid="B13">13</abbr></abbrgrp>, are based on observations from gene mutation data leading to phenomenological changes and not on direct observations of gene activities. This has changed with the advent of DNA microarray technology, which generates huge amounts of data characterizing gene activities under various conditions <abbrgrp><abbr bid="B14">14</abbr><abbr bid="B15">15</abbr><abbr bid="B16">16</abbr></abbrgrp>; these data are now being collected in various databases <abbrgrp><abbr bid="B17">17</abbr></abbrgrp>. There can be various precise formulations for the reverse engineering problem, and there is a certain analogy between the problem of reverse engineering of gene networks and the problem of identifying finite state automata from input/output data <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>.</p>
         <p>In this paper we consider two different formulations of the reverse engineering problem. The weaker one is finding a gene network consistent with the given data. We prove that this problem is algorithmically solvable for our model. The second one involves assuming that the data have been produced by some unknown gene network, which we want to reconstruct by making experiments. This problem is still open. In the next section we describe the model, after which we study the reverse engineering problem. Then we give an informal extension of the model, and use it to describe the lambda-phage regulatory circuit. Finally we discuss some open problems.</p>
      </sec>
      <sec>
         <st>
            <p>Results and Discussion</p>
         </st>
         <sec>
            <st>
               <p>The definition of the model</p>
            </st>
            <p>The assumptions on which our model is based are: (1) the gene activity is determined by the state of transcription factor binding sites in its promoter region; (2) each binding site can be in one of a finite number of states, characterized by having or not having bound a particular transcription factor; (3) depending on the states of the binding sites in the promoter, the gene can either be silent, or have a particular activity level; (4) if a gene is active, the concentration of the substance it produces is growing with a rate dependent on the activity level of the gene, otherwise it is decreasing (or staying 0); (5) the state of a binding site depends on the concentration of the respective transcription factor(s). To make these assumptions precise and to formalize them we have developed the model described below. We begin by describing a simpler version of the model, which we call the <it>binary model</it>, where each binding site and each gene have only two states: on or off. We formulate the reverse engineering problem for the binary model, before introducing the general case, though the formulation remains the same in the general case.</p>
         </sec>
         <sec>
            <st>
               <p>The binary model</p>
            </st>
            <p>Informally we assume that we have an environment of n substances 1, ..., n having concentrations c<sub>1</sub>(t), ..., c<sub>n</sub>(t), respectively, which may change in time t. We also assume that the environment contains what we call <it>substance binding sites</it>, each of which can attach (bind) a specific substance. In the binary case the binding site can bind only one substance. We define a <it>binary binding site </it>b as a triple</p>
            <p>b = (i, a, d),</p>
            <p>where i is the number of the substance (which can bind to b), and a and d are positive real constants 0 &lt; d &lt; a, called <it>association </it>and <it>dissociation constants</it>, respectively. Each binding site can be in one of two states: <it>attached state </it>or <it>detached state</it>. If binding site b = (i, a, d) is in detached state, and the concentration of substance i reaches the association constant a, i.e., c<sub>i</sub>(t) &#8805; a, then b switches to attached state. If b is in attached state and the concentration c<sub>i</sub>(t) falls to the dissociation constant d, i.e., c<sub>i</sub>(t) &#8804; d, then b switches to detached state. We denote the attached state by 1 and detached state by 0. Thus, the binding site can be described as a two state automaton in Figure <figr fid="F1">1</figr>, left. Next we define a <it>binary gene</it>. Each binary gene produces one substance. A binary gene can have two states, <it>on </it>or <it>off</it>, depending on the state of the binding sites regulating this gene. If a gene G is <it>on</it>, then the respective substance is being produced and its concentration linearly increases. If G is <it>off</it>, the substance is being degraded by the environment, and its concentration linearly decreases (until it reaches 0, or the gene switches on). Formally a <it>binary gene </it>is a triple</p>
            <fig id="F1">
               <title>
                  <p>Figure 1</p>
               </title>
               <caption>
                  <p>Finite state automata describing a binary binding site and a multi-level binding site</p>
               </caption>
               <text>
                  <p>Finite state automata describing a binary binding site (left) and a multi-level binding site (right).</p>
               </text>
               <graphic file="gb-2003-4-6-p5-1"/>
            </fig>
            <p>G = (B, F, r),</p>
            <p>where B = (b<sub>1</sub>, ..., b<sub>k</sub>), and b<sub>1</sub>, ..., b<sub>k </sub>are a subset of the binding sites, F is a boolean function called <it>control function</it>, and r = (i, r<sub>0</sub>, r<sub>1</sub>), where i is an integer denoting the number of the substance produced by the gene, r<sub>0 </sub>&lt; 0 is a real constant called <it>degradation rate</it>, and r<sub>1 </sub>> 0 is a real constant called <it>production rate</it>. We call r a <it>substance generator</it>. Graphically, a gene is represented as in Figure <figr fid="F2">2</figr>, left. We can think of the binding sites and the control function as the promoter of the gene, and of the substance generator as the coding part plus the transcription machinery.</p>
            <fig id="F2">
               <title>
                  <p>Figure 2</p>
               </title>
               <caption>
                  <p>A graphical representation of a gene and an example gene network</p>
               </caption>
               <text>
                  <p>Left: a graphical representation of a gene. The triangles on the left represent the binding sites b<sub>1</sub>, b<sub>2</sub>, b<sub>3</sub>. The rectangle in the middle represents the control function (in the particular example F(x<sub>1</sub>,x<sub>2</sub>,x<sub>3</sub>) = x<sub>1 </sub>&amp; x<sub>2 </sub>&amp; &#172; x<sub>3</sub>, meaning that the gene is <it>on </it>if and only if the first two binding sites are in <it>attached </it>state, while the third is in the <it>detached </it>state), and the diamond on the right represents the substance generator. Right: an example gene network. In this network &#915; = {G<sub>1</sub>, G<sub>2</sub>}, G<sub>1 </sub>= ((b<sub>1</sub>,b<sub>2</sub>), F<sub>1</sub>, r<sub>1</sub>), G<sub>2 </sub>= ((b<sub>3</sub>), F<sub>2</sub>, r<sub>2</sub>), b<sub>1 </sub>= (1, a<sub>1</sub>, d<sub>1</sub>), b<sub>2 </sub>= (2, a<sub>2</sub>, d<sub>2</sub>), b<sub>3 </sub>= (1, a<sub>3</sub>, d<sub>3</sub>), r<sub>1 </sub>= (1, r<sub>0,1</sub>, r<sub>1,1</sub>), and r<sub>2 </sub>= (2, r<sub>0,2</sub>, r<sub>1,2</sub>). The solid lines can be regarded as connecting the substance produced by the gene to the respective binding sites, while the dotted lines channel the information about the states of the binding sites and genes. Another interpretation of the lines is that the solid lines transmit real numbers, while the dotted ones transmit boolean values.</p>
               </text>
               <graphic file="gb-2003-4-6-p5-2"/>
            </fig>
            <p>The semantics of a gene G = (B, F, r) can be described as follows. Let q<sub>1</sub>, ..., q<sub>k </sub>be the states of binding sites b<sub>1</sub>, ..., b<sub>k </sub>where B = (b<sub>1</sub>, ..., b<sub>k</sub>): i.e., q<sub>j </sub>= 1 if b<sub>j </sub>is in attached state, and q<sub>j </sub>= 0, otherwise, at some given time point t'. If F(q<sub>1</sub>, ..., q<sub>k</sub>) = 1, i.e., the gene is <it>on</it>, then the concentration c<sub>i</sub>(t) of substance i (where r = (i, r<sub>0</sub>, r<sub>1</sub>)) increases in time with rate r<sub>1</sub>, i.e., c<sub>i</sub>(t) = c<sub>i</sub>(t') + (t - t')r<sub>1</sub>. If F(q<sub>1</sub>, ..., q<sub>k</sub>) = 0, i.e., the gene is <it>off</it>, then the concentration c<sub>i</sub>(t) decreases with rate r<sub>0 </sub>while it is positive, or remains equal to 0.</p>
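            <p>The definitions of the binary binding site and the binary gene can be written down as a minimal sketch in Python. The class names BindingSite and Gene, the helper advance, and the use of 0-based substance indices are assumptions made for this illustration only and are not part of the model definition.</p>

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class BindingSite:
    substance: int   # index i of the substance that can bind (0-based here)
    a: float         # association constant
    d: float         # dissociation constant, with a > d > 0
    attached: bool = False

    def update(self, conc: Sequence[float]) -> None:
        """Switch state following the two-state automaton of the text."""
        c = conc[self.substance]
        if not self.attached and c >= self.a:
            self.attached = True
        elif self.attached and self.d >= c:
            self.attached = False


@dataclass
class Gene:
    sites: Sequence[BindingSite]
    control: Callable[..., bool]   # boolean control function F
    substance: int                 # index i of the produced substance
    r0: float                      # degradation rate, negative
    r1: float                      # production rate, positive

    def rate(self) -> float:
        """Return the current concentration change rate of the product."""
        on = self.control(*(s.attached for s in self.sites))
        return self.r1 if on else self.r0


def advance(gene: Gene, c: float, dt: float) -> float:
    """Linear change over dt while the gene state is fixed; clamp at 0."""
    return max(0.0, c + dt * gene.rate())
```

            <p>Because a site attaches at a and detaches only at d, a concentration between d and a leaves the site in whichever state it already had; this hysteresis is what the two constants encode.</p>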
         </sec>
      </sec>
      <sec>
         <st>
            <p>A binary gene network</p>
         </st>
         <p>We define a <it>gene network </it>as a set of genes</p>
         <p>&#915; = {G<sub>1</sub>, ..., G<sub>n</sub>}.</p>
         <p>We can use a graphical representation of gene networks to show which gene products can attach to which binding sites. An example of such representation is given in Figure <figr fid="F2">2</figr>, right.</p>
         <p>In general, several genes may share the same binding site (in the graphical representation the dotted line coming out of a binding site can fork to several control functions). To describe the functioning of a gene network let us consider an example in Figure <figr fid="F3">3</figr> (a more formal definition is given in Section 4.1).</p>
         <fig id="F3">
            <title>
               <p>Figure 3</p>
            </title>
            <caption>
               <p>The functioning of a simple network of two binary genes with a negative feedback loop</p>
            </caption>
            <text>
               <p>The functioning of a simple network of two binary genes with a negative feedback loop.</p>
            </text>
            <graphic file="gb-2003-4-6-p5-3"/>
         </fig>
         <p>Let &#915;<sub>1 </sub>= {G<sub>1</sub>, G<sub>2</sub>}, G<sub>1 </sub>= ((b<sub>1</sub>), F<sub>1</sub>, r<sub>1</sub>), G<sub>2 </sub>= ((b<sub>2</sub>), F<sub>2</sub>, r<sub>2</sub>), and let us assume that the function F<sub>1 </sub>is the negation (i.e., F<sub>1</sub>(0) = 1 and F<sub>1</sub>(1) = 0), while F<sub>2 </sub>is the identity (i.e., F<sub>2</sub>(0) = 0 and F<sub>2</sub>(1) = 1). Gene G<sub>1 </sub>produces substance 1, gene G<sub>2 </sub>substance 2, and let b<sub>1 </sub>= (2,a<sub>1</sub>,d<sub>1</sub>), b<sub>2 </sub>= (1,a<sub>2</sub>,d<sub>2</sub>), r<sub>1 </sub>= (1,r<sub>1,0</sub>,r<sub>1,1</sub>), and r<sub>2 </sub>= (2,r<sub>2,0</sub>, r<sub>2,1</sub>).</p>
         <p>Further, we assume that at time point t<sub>0 </sub>= 0 the substance 1 has some positive initial concentration c<sub>1</sub>(t<sub>0</sub>) > 0, while c<sub>2</sub>(t<sub>0</sub>) = 0, as shown in the graph in the lower part of Figure <figr fid="F3">3</figr>. We also assume that the states of both binding sites are initially equal to 0, i.e., q<sub>1 </sub>= 0, q<sub>2 </sub>= 0. Starting from this state at t<sub>0</sub>, the network &#915;<sub>1 </sub>functions as follows. Since F<sub>1</sub>(0) = 1, the substance 1 is produced with rate r<sub>1,1 </sub>> 0, and the concentration c<sub>1</sub>(t) is growing. On the other hand F<sub>2</sub>(0) = 0, therefore the concentration c<sub>2</sub>(t) remains 0. This linear change continues until time t = t<sub>1</sub>, when c<sub>1</sub>(t) = a<sub>2</sub>, i.e., until the concentration of the substance 1 reaches the association constant for binding site b<sub>2</sub>. At that point b<sub>2 </sub>switches to attached state 1, and since F<sub>2</sub>(1) = 1, gene G<sub>2 </sub>switches to <it>on </it>state and starts producing substance 2 with rate r<sub>2,1</sub>. Thus, starting from t = t<sub>1</sub>, the concentrations of both substances are growing. This continues until c<sub>2 </sub>reaches a<sub>1</sub>, at which point b<sub>1 </sub>switches to attached state, switching gene G<sub>1 </sub><it>off</it>. The concentration c<sub>1</sub>(t) starts falling, and when it reaches d<sub>2</sub>, binding site b<sub>2 </sub>detaches, gene G<sub>2 </sub>switches <it>off </it>and c<sub>2</sub>(t) starts falling too. This continues as shown in Figure <figr fid="F3">3</figr>. The table at the bottom of Figure <figr fid="F3">3</figr> shows the states of the binding sites.</p>
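         <p>The walk-through above can be reproduced with a small event-driven simulation. The following Python sketch is only an illustration under stated assumptions: the function names simulate and time_to, the tuple layouts (association, dissociation) and (degradation, production), and the numeric values in the usage note are all chosen here and do not come from the paper or from Figure 3.</p>

```python
import math


def time_to(target, c, v):
    """Time until c + v*dt reaches target; infinity if it never does."""
    if v == 0:
        return math.inf
    dt = (target - c) / v
    return dt if dt > 0 else math.inf


def simulate(c1, c2, b1, b2, r1, r2, t_end):
    """Trace the two-gene negative feedback loop up to time t_end.

    b1, b2 -- (association, dissociation) constants of the binding sites
    r1, r2 -- (degradation, production) rates of the substance generators
    Returns a list of (t, q1, q2, c1, c2) recorded at every event.
    """
    t, q1, q2 = 0.0, 0, 0  # both binding sites start detached
    trace = [(t, q1, q2, c1, c2)]
    while True:
        # F1 is the negation of q1, F2 is the identity on q2.
        v1 = r1[1] if q1 == 0 else r1[0]
        v2 = r2[1] if q2 == 1 else r2[0]
        # A substance that is being degraded stays at 0 once it gets there.
        if c1 == 0 and 0 > v1:
            v1 = 0.0
        if c2 == 0 and 0 > v2:
            v2 = 0.0
        # b1 senses substance 2 and b2 senses substance 1; a detached site
        # (q = 0) waits for a, an attached site (q = 1) waits for d.
        dt_b1 = time_to(b1[q1], c2, v2)
        dt_b2 = time_to(b2[q2], c1, v1)
        dt_z1 = time_to(0.0, c1, v1)
        dt_z2 = time_to(0.0, c2, v2)
        dt = min(dt_b1, dt_b2, dt_z1, dt_z2)
        if dt == math.inf or t + dt > t_end:
            return trace
        t += dt
        c1 += v1 * dt
        c2 += v2 * dt
        if dt == dt_z1:
            c1 = 0.0
        if dt == dt_z2:
            c2 = 0.0
        if dt == dt_b1:  # snap to the threshold, then toggle the site
            c2, q1 = b1[q1], 1 - q1
        if dt == dt_b2:
            c1, q2 = b2[q2], 1 - q2
        trace.append((t, q1, q2, c1, c2))
```

         <p>With the illustrative values simulate(1.0, 0.0, (2.0, 1.0), (2.0, 1.0), (-1.0, 1.0), (-1.0, 1.0), 11.0), events occur at t = 1, 3, 6, 7, 10 and 11, and at t = 11 the network is back in its starting state, so the behaviour is periodic, as expected for a negative feedback loop.</p>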
         <p>The assumption that the substance concentrations change linearly for the given state is not essential for the model. We think that linearity may be a reasonable approximation in the cases where the gene expression rates are far from saturation levels. This assumption can be relaxed by changing the linear functions to functions that behave approximately linearly while the values are relatively small, decreasing the growth rate for larger values and asymptotically approaching some given maximum. An example of such a function is the solution of the logistic differential equation dc/dt = rc(1-c/k), where c is the concentration, and r and k are constants.</p>
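         <p>For reference, the logistic equation has a standard closed-form solution. The symbol c<sub>0</sub> for the initial concentration c(0) is introduced here for the illustration and does not appear in the text.</p>

```latex
c(t) = \frac{k}{1 + \frac{k - c_0}{c_0}\, e^{-rt}}, \qquad c(0) = c_0, \quad c_0 \in (0, k)
```

         <p>For r > 0 the concentration approaches the maximum k as t grows, which is the saturation behaviour referred to above.</p>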
         <p>Another case in which linearity may be insufficient arises when the degradation rate of a certain substance depends on the concentration of another substance (for instance, if one substance is degrading the other). Our model can be generalized to capture this situation in a straightforward manner, if there are no loops in the dependency graph describing which substances degrade which.</p>
         <p>Although the linearity is not an essential feature of the model, in the next sections dealing with the reverse engineering, we will stick to this assumption, as we think that the properties of a simpler model should be explored first.</p>
         <sec>
            <st>
               <p>Reverse engineering of gene networks</p>
            </st>
            <p>Let b<sub>1</sub>, ..., b<sub>m </sub>be all the binding sites in the environment, and let Q(t') = (q<sub>1</sub>(t'), ..., q<sub>m</sub>(t')) be their states at time point t'. We call Q(t') the <it>binding site state vector </it>of the network at time point t'.</p>
            <p>Let C(t') = (c<sub>1</sub>(t'), ..., c<sub>n</sub>(t')) be the concentrations of all environment substances at time point t'. We call C(t') the <it>environment concentration vector</it>. We say that the binding site state Q(t') and concentration state C(t') are <it>compatible</it>, if for every binding site b<sub>j </sub>= (i, a<sub>j</sub>, d<sub>j</sub>), if q<sub>j </sub>= 0 then c<sub>i </sub>&lt; a<sub>j</sub>, and if q<sub>j </sub>= 1 then c<sub>i </sub>> d<sub>j</sub>. We define the network state vector as a pair</p>
            <p>&#931;(t') = (Q(t'), C(t'))</p>
            <p>and we say that it is <it>compatible </it>if Q(t') is compatible with C(t'). We often omit t'.</p>
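            <p>The compatibility condition for the binary case can be checked mechanically. The sketch below is an illustration only; the function name compatible, the tuple layout (i, a, d) and the 0-based substance indices are assumptions made here.</p>

```python
from typing import Sequence, Tuple

# (substance index i, association constant a, dissociation constant d)
Site = Tuple[int, float, float]


def compatible(sites: Sequence[Site], q: Sequence[int], c: Sequence[float]) -> bool:
    """True if every binary site state agrees with the concentrations."""
    for (i, a, d), state in zip(sites, q):
        # A detached site requires the concentration to be strictly below a.
        if state == 0 and c[i] >= a:
            return False
        # An attached site requires the concentration to be strictly above d.
        if state == 1 and d >= c[i]:
            return False
    return True
```

            <p>For a concentration strictly between d and a both site states are compatible, so a concentration vector alone does not in general determine the binding site state vector.</p>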
            <p>Note that the concentration vector C(t') = (c<sub>1</sub>(t'), ..., c<sub>n</sub>(t')) at a given time point t' can be regarded as a concentration measurement. Let us define a <it>measurement series </it>as a pair of (m+1)-tuples</p>
            <p>M = ((t<sub>0</sub>, t<sub>1</sub>, ..., t<sub>m</sub>), (C(t<sub>0</sub>), C(t<sub>1</sub>), ..., C(t<sub>m</sub>))).</p>
            <p>The <it>reverse engineering problem </it>for gene networks can be formulated as follows:</p>
            <p>given a measurement series M = ((t<sub>0</sub>, t<sub>1</sub>, ..., t<sub>m</sub>), (C(t<sub>0</sub>), C(t<sub>1</sub>), ..., C(t<sub>m</sub>))), find a gene network &#915; that can produce concentrations C(t<sub>0</sub>), C(t<sub>1</sub>), ..., C(t<sub>m</sub>) at time points t<sub>0</sub>, t<sub>1</sub>, ..., t<sub>m</sub>. In this case we say that <it>network </it>&#915; is compatible with measurements M.</p>
            <sec>
               <st>
                  <p>Theorem</p>
               </st>
               <p><it>The <b>problem of reverse engineering </b>is algorithmically solvable for the linear finite state gene network models, i.e., there exists an algorithm that, given a series of measurements </it>M, <it>outputs a gene regulatory network &#915; compatible with </it>M.</p>
               <p>To prove the theorem, we need to introduce a few auxiliary notions. Given a network &#915; and a compatible starting state &#931;(t<sub>0</sub>), network &#915; defines the <it>concentration change graph </it>&#916;, which is the set of all points C(t) = (c<sub>1</sub>(t), ...,c<sub>n</sub>(t)), for the time interval t &#8712; [t<sub>0</sub>, &#8734;). An example of an initial part of such a graph is given in the lower part of Figure <figr fid="F3">3</figr> and in Figure <figr fid="F4">4</figr>. Note that each concentration changes as a piecewise linear function.</p>
               <fig id="F4">
                  <title>
                     <p>Figure 4</p>
                  </title>
                  <caption>
                     <p>The environment change graph</p>
                  </caption>
                  <text>
                     <p>The environment change graph</p>
                  </text>
                  <graphic file="gb-2003-4-6-p5-4"/>
               </fig>
               <p>Let &#915; = {G<sub>1</sub>, ..., G<sub>n</sub>} be a network, where G<sub>i </sub>= (B<sub>i</sub>, F<sub>i</sub>, r<sub>i</sub>). Let us consider the sets of all the binding sites in the environment and all the substance generators in the network. Each binding site and each substance generator depends on two real-valued constants (association and dissociation constants for binding sites, and production and degradation constants for substance generators). Let us denote the set of all binding site constants in the network by &#946;, and the set of all substance generator constants by &#947;. Let &#945; = &#946; &#8746; &#947;; we call &#945; <it>the set of the network constants</it>.</p>
               <p>Let us consider an initial part &#916;(t<sub>0</sub>,t') of a concentration change graph &#916; for a network &#915; in time interval [t<sub>0</sub>,t']. The slopes of the linear parts in the graph are determined by a subset of &#947;, while the transition points are determined by a subset of &#946;. We denote these subsets by &#947;' and &#946;'. We call &#945;' = &#946;' &#8746; &#947;' the <it>set of reachable constants </it>for the network &#915; in [t<sub>0</sub>,t'] for the given starting state.</p>
               <p>Finally, for a given network &#915;, we define the <it>network structure </it>as the object obtained from &#915; by ignoring all the network constants (formally, we can substitute all the constants in &#915;, for instance, by 0). In the graphical representation the network remains the same, but the constants disappear. The control functions are a part of the structure.</p>
               <p>Now, to prove the theorem, first, note that given an initial part of a concentration change graph &#916;(t<sub>0</sub>,t'), we can find all reachable constants &#946;' and &#947;'. We also know the number of the genes in the network, which equals n. From the graph we also know the maximal number of binding sites that can switch at least once during [t<sub>0</sub>, t']. As there are only a finite number of network structures for the limited number of genes and binding sites, we can enumerate them. For each structure, we can try all possible combinations of assignments of the constants from &#946;' to the binding sites, and &#947;' to the substance generators, and for each combination we can check the compatibility of the obtained network with the measurements. In this way, given &#916;(t<sub>0</sub>,t'), we can construct a gene network that is compatible with it by an enumeration algorithm.</p>
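               <p>One ingredient of this finiteness argument is that a gene with k binary binding sites admits only finitely many control functions. As a small illustration (not the enumeration algorithm of the proof itself), the following Python sketch lists every boolean function of k arguments as a truth table; the function name boolean_functions is chosen here.</p>

```python
from itertools import product


def boolean_functions(k):
    """Yield every boolean function of k arguments as a truth-table dict."""
    inputs = list(product((0, 1), repeat=k))
    # One output bit per input row; every assignment of bits is a function.
    for outputs in product((0, 1), repeat=len(inputs)):
        yield dict(zip(inputs, outputs))
```

               <p>There are 2 to the power 2<sup>k</sup> such functions (4 for k = 1, 16 for k = 2), which already hints at why exhaustive enumeration grows quickly with the size of the network.</p>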
               <p>To complete the proof of the theorem, it remains to note that &#916;(t<sub>0</sub>, t<sub>m</sub>) can be obtained from a series of measurements, for instance, by joining the points of the respective substance concentration by fragments of straight lines (i.e., c<sub>j</sub>(t<sub>i</sub>) is joined with c<sub>j</sub>(t<sub>i+1</sub>) for all j &#8712; {1,...,n} and i &#8712; {0,...,m-1}). Given &#916;(t<sub>0</sub>,t<sub>m</sub>), we can construct the network by exhaustive search as described above.</p>
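               <p>Joining consecutive measurements by straight fragments amounts to computing one slope per substance and per interval. The following Python sketch shows this step under the assumption that a measurement series is given as a list of time points and a list of concentration vectors; the function name segment_slopes is chosen here for illustration.</p>

```python
def segment_slopes(times, series):
    """Slopes of the straight fragments joining consecutive measurements.

    times  -- [t_0, ..., t_m]
    series -- [C(t_0), ..., C(t_m)], each C a list of n concentrations
    Returns one list of m slopes per substance.
    """
    n = len(series[0])
    slopes = []
    for j in range(n):
        row = []
        for i in range(len(times) - 1):
            dt = times[i + 1] - times[i]
            row.append((series[i + 1][j] - series[i][j]) / dt)
        slopes.append(row)
    return slopes
```

               <p>For example, the measurements (1, 0), (2, 0), (4, 2) taken at times 0, 1, 3 give slopes 1, 1 for the first substance and 0, 1 for the second. In this crude construction every interval contributes its own fragment, so the slopes are only candidates for the rate constants.</p>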
               <p>Unfortunately such an enumeration algorithm needs exponential time and cannot be used in practice. We do not know if a polynomial-time reverse engineering algorithm exists for our model class. Note that even for finite state automata, the problem of finding a minimal automaton compatible with the input/output data is NP-complete <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>.</p>
               <p>The theorem does not guarantee the reconstruction of the original network that has produced the concentration vectors. The method that we used in constructing the concentration change graph was very crude and can be easily improved to produce a more realistic graph (i.e., a graph that is more likely to be produced by the original network), by minimizing the number of fragments of straight lines for building the graph. Here, the notion of "more likely" is undefined. The problem of reconstructing the original network is formulated in the "open questions" section, but next, we generalize our model to non-binary networks, and define the functioning of gene networks mathematically more precisely.</p>
            </sec>
         </sec>
         <sec>
            <st>
               <p>The multiple level generalization</p>
            </st>
            <p>For binary genes the control function is boolean, and consequently a gene has only two states: <it>on </it>or <it>off</it>. Also, the binding sites have only two states. In the general case we assume that a binding site can bind more than one substance, and consequently has more than two states. We assume that the binding is exclusive, i.e., binding of one substance makes binding of any other substance impossible. In this way a binding site can either be in the detached state (denoted by 0), or in any of the attached states 1, 2, ..., p, characterized by the substance that is bound. For a given binding site b that can bind p substances, each substance has separate association and dissociation constants a<sub>h </sub>and d<sub>h</sub>, where h &#8712; {1, ..., p}. In this way a generalized binding site can be described by a finite state automaton of the type given in Figure <figr fid="F1">1</figr>, right.</p>
            <p>We also assume that a gene can have several expression levels {0, ..., k} (the 0 level usually meaning that the gene is not expressed). For this we assume that the control function F may have more than two values, i.e., instead of being boolean, the function F maps an n-tuple of finite values to a finite value from 0 to k (i.e., F: ({0, ..., m<sub>1</sub>}, ..., {0, ..., m<sub>n</sub>}) &#8594; {0, ..., k}). Accordingly, the gene can have k+1 states, and there are k+1 different concentration change rates r<sub>0</sub>,...,r<sub>k</sub>, i.e., the substance generator has the form r = (i, r<sub>0</sub>,...,r<sub>k</sub>). The concentration change rate of substance i is defined by the value of F(q<sub>1</sub>, ..., q<sub>n</sub>), where q<sub>1</sub>, ..., q<sub>n </sub>are the states of the binding sites of the gene. Concretely, if F(q<sub>1</sub>, ..., q<sub>n</sub>) = j, then the rate equals r<sub>j</sub>.</p>
            <p>Finally, we can also assume that genes can produce more than one substance, therefore in the general case a gene is defined as a triple G = (B, F, R), where R = {r<sub>1</sub>, ..., r<sub>p</sub>} and r<sub>i </sub>are the substance generators. We assume that all the substances are different (two genes cannot produce the same substance). In the graphical representation this implies that the dotted line coming out of a control function can fork to more than one substance generator (for instance, see Figure <figr fid="F6">6</figr>). In general, all the lines can fork, but they are not allowed to merge (they combine either through a control function or by entering the same binding site). A dotted line leaving a binding site can enter one or more control functions, a dotted line leaving a control function can enter one or more substance generators, and a solid line leaving a generator can enter one or more binding sites. The control functions can be regarded as defining the logics of the network, while substance generators and binding sites are mediators transforming discrete values into concentration change rates, and concentrations back into discrete values, respectively. Together with the binding sites, the control function defines the <it>promoter </it>(B, F) of gene G = (B,F,R).</p>
            <fig id="F6">
               <title>
                  <p>Figure 6</p>
               </title>
               <caption>
                  <p>Informal description of lambda-phage using the elements defined by our model</p>
               </caption>
               <text>
                  <p>Informal description of lambda-phage using the elements defined by our model (for further description see text)</p>
               </text>
               <graphic file="gb-2003-4-6-p5-6"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>Functioning of a gene network and simulations</p>
            </st>
            <p>The notion of the binding site state vector can be generalized to multilevel networks in a straightforward way (by changing a binary vector to a vector of integers representing the states of the respective binding sites at the given moment). The notion of the compatibility of the binding site state and concentration vectors can also be easily generalized to the multilevel situation. Further, we can assume that all the control functions F<sub><it>i </it></sub>in the gene network have the binding site state vector Q = (q<sub>1</sub>(t), ..., q<sub>n</sub>(t)) as the argument (each function F<sub>i </sub>can be changed to an n-argument function by adding dummy arguments for those binding sites which do not actually affect the gene). Let</p>
            <p>&#931;<sup>(i) </sup>= (C(t<sub>i</sub>), Q<sup>(i)</sup>)</p>
            <p>be a compatible environment state, for i &#8805; 0. We define the <it>linear concentration change corresponding to state </it>&#931;<sup>(i) </sup>as follows. For a substance j and gene G = (B,F,R), where R = {r<sub>1</sub>,...,r<sub>h</sub>,...,r<sub>m</sub>} and r<sub>h </sub>= (j, r<sub>h,0</sub>, ..., r<sub>h,k</sub>), for t &#8805; t<sub>i </sub>we set</p>
            <p>c<sub>j</sub>(t) = c<sub>j</sub>(t<sub>i</sub>) + (t - t<sub>i</sub>) r<sub>h,l</sub>,</p>
            <p>where l = F(Q<sup>(i)</sup>) is the current output level of the control function. Let t<sub>i+1 </sub>be the smallest t > t<sub>i</sub>, such that (C(t), Q<sup>(i)</sup>) is not a compatible state. Let b<sub>j1</sub>,...,b<sub>jp </sub>be the binding sites whose states are not compatible with C(t<sub>i+1</sub>). Let Q<sup>(i+1) </sup>be obtained from Q<sup>(i) </sup>by changing the states q<sub>j1</sub>, ..., q<sub>jp </sub>to compatible ones. (In principle, there may be more than one way in which this can be achieved; we can assume that we always change to the compatible state with the smallest number. This situation will not occur in the probabilistic generalization discussed in the next section.)</p>
            <p>Let &#931;<sup>(i+1) </sup>= (C(t<sub>i+1</sub>),Q<sup>(i+1)</sup>). Then, given the initial compatible environment state &#931;<sup>(0) </sup>= (C(t<sub>0</sub>),Q<sup>(0)</sup>), the environment changes in the described manner for i = 0,1,.... The environment behaviour can be visualized as in the example in Figure <figr fid="F4">4</figr>.</p>
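            <p>The iteration described above can be sketched in a few lines of Python. This is only an illustrative sketch, not the "Genenet" simulator: it assumes binary binding sites, each with an association threshold and a dissociation threshold (a site becomes bound when the concentration rises to the association value and unbound when it falls to the dissociation value), and it does not prevent concentrations from becoming negative. The names Site, time_to_switch and step are hypothetical.</p>
            <p><![CDATA[
```python
from dataclasses import dataclass

@dataclass
class Site:
    substance: int      # index of the substance that binds to this site
    assoc: float        # assumed: site becomes bound when concentration rises to this value
    dissoc: float       # assumed: site becomes unbound when concentration falls to this value
    bound: bool = False

def time_to_switch(site, conc, rate):
    """Time until the site state stops being compatible, or None if it never does."""
    c, r = conc[site.substance], rate[site.substance]
    if not site.bound and r > 0:
        return max(0.0, (site.assoc - c) / r)
    if site.bound and r < 0:
        return max(0.0, (site.dissoc - c) / r)
    return None

def step(sites, conc, genes):
    """One iteration from state i to state i+1.

    genes is a list of (control_function, substance_index, rates) triples.
    Returns (dt, new_conc) and flips the affected sites in place,
    or returns None if no site will ever change state again.
    """
    q = [s.bound for s in sites]
    rate = [0.0] * len(conc)
    for control, substance, rates in genes:
        rate[substance] = rates[control(q)]      # rate selected by the level F(Q)
    waits = [(time_to_switch(s, conc, rate), s) for s in sites]
    waits = [(dt, s) for dt, s in waits if dt is not None]
    if not waits:
        return None
    dt = min(w for w, _ in waits)
    new_conc = [c + dt * r for c, r in zip(conc, rate)]   # linear change
    for w, s in waits:
        if w == dt:
            s.bound = not s.bound                # move to the compatible state
    return dt, new_conc

# One gene repressing itself: level 0 (site bound) degrades, level 1 produces.
sites = [Site(substance=0, assoc=2.0, dissoc=1.0)]
genes = [(lambda q: 0 if q[0] else 1, 0, (-1.0, 1.0))]
conc, t = [0.0], 0.0
for _ in range(4):
    dt, conc = step(sites, conc, genes)
    t += dt
    print(f"t={t:.1f} c={conc[0]:.1f} bound={sites[0].bound}")
```
]]></p>
            <p>In this toy example the concentration rises linearly until the site binds, then falls until the site is released, so the single gene oscillates between the two thresholds.</p>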
            <p>We say that promoter (B,F) of gene G = (B,F,R) is <it>active </it>at a given time point t, if at this time-point the concentration of the substance produced by the gene G is increasing.</p>
            <p>Even with only a few genes the calculation of the network behaviour becomes quite laborious. Therefore we implemented a simulator ("Genenet") for these networks in Java. Figure <figr fid="F5">5</figr> (left) shows the behaviour of a gene network consisting of only two genes, as depicted on the right of Figure <figr fid="F5">5</figr>. Both genes have a negative feedback loop to themselves. The first gene has an additional negative feedback onto the second gene, while the second gene has an additional positive feedback onto the first one (Figure <figr fid="F5">5</figr>, right). This example demonstrates that a very simple network of just two genes may show non-trivial behaviour.</p>
            <fig id="F5">
               <title>
                  <p>Figure 5</p>
               </title>
               <caption>
                  <p>Output of the simulation program "Genenet" and its corresponding network</p>
               </caption>
               <text>
                  <p>Left: Output of the simulation program "Genenet" (using Gnuplot for visualisation). Right: corresponding network. Abbreviations: a1 stands for association constant 1 and belongs to the binding site with the a1, d1 label; d1 is the corresponding dissociation constant; a2, d2, a3, d3, a4, d4 correspondingly.</p>
               </text>
               <graphic file="gb-2003-4-6-p5-5"/>
            </fig>
         </sec>
         <sec>
            <st>
               <p>A model of lambda-phage</p>
            </st>
            <p>The model defined above was designed to describe processes involved in transcriptional regulation. Many additional cellular processes can be involved in gene regulatory networks. This makes some extensions necessary. With minor changes the model can be extended to allow the description of cellular processes like protein degradation. Some informal extensions are made to improve the readability for humans. The shaded boxes indicate how many different output states a control-function can have. The default value is 0,1 indicating the two possible states of the substance generator ON and OFF. But more states are possible, e.g. OFF, weak activity ON1, strong activity ON2. We demonstrate the usage of our model by describing a simplified model of lambda-phage.</p>
         </sec>
         <sec>
            <st>
               <p>lambda-phage</p>
            </st>
            <p>A lambda-phage has two modes of operation: lysis and lysogeny (for instance see <abbrgrp><abbr bid="B1">1</abbr></abbrgrp>). During the infection of the bacterial cell by the phage a complex decision is made for either lysis or lysogeny. In the lysogenic mode the phage DNA is integrated into the bacterial genome, and the gene for lambda-repressor <it>cI </it>is the only expressed phage gene. External influences can trigger the switch from lysogenic to lytic behaviour. In the lytic mode the phage DNA is replicated, excised, new phage particles are produced and in the end the bacterium is broken open (lysed) to release the new phages. The lysis-lysogeny decision network is well studied and known to involve several cascades of events. In Figure <figr fid="F6">6</figr> we present a simplified genetic network of the lambda-phage. To make the graph more readable, we do not draw the lines between substance generators (depicted by diamonds) and the related binding sites (depicted by triangles) but instead label them by the respective substances. We also allow more freedom to introduce connections between control-functions.</p>
            <p>The operating mode of a lambda-phage is essentially determined by two proteins, <it>CI </it>and <it>Cro</it>. If <it>CI </it>is in abundance, the phage is in lysogenic mode; if <it>Cro </it>is in abundance, the phage is in lytic mode. Both genes are regulated by the same DNA region (but transcribed in opposite directions), which has three binding sites: O<sub>R1</sub>, O<sub>R2 </sub>and O<sub>R3</sub>. Each binding site can bind either <it>Cro </it>or <it>CI </it>competitively, but with different affinities. In this way each binding site can be in one of three states: unbound, <it>Cro</it>-bound, or <it>CI</it>-bound. Depending on these states the control functions P<sub>R </sub>and P<sub>M </sub>have different activity levels. The circuit functions like a trigger and has two stable states: either <it>cro </it>is transcribed and <it>cI </it>is down-regulated, or vice versa. The regulatory cascades of the lambda-phage are quite complex; for reference please see <abbrgrp><abbr bid="B1">1</abbr><abbr bid="B21">21</abbr></abbrgrp>. We will now go through a simplified description (Figure <figr fid="F6">6</figr>).</p>
            <p>On infection of the <it>E. coli </it>cell by the lambda-phage, only two promoters, PL and PR, of the lambda genome are active. From promoter PL the expression of <it>N </it>and <it>CIII </it>is initiated. A leaky transcription terminator is located between the two coding regions. Therefore <it>CIII </it>is produced at a lower rate than <it>N</it>. A second terminator is located between the coding regions for <it>CIII </it>and <it>Xis</it>. This terminator stops transcription completely. If the concentration of <it>N </it>is high enough, the RNA polymerase is able to ignore the terminators and the genes are expressed at the same rate. As will be important later, transcription from PL can be repressed by <it>CI </it>binding to its <it>CI </it>binding site.</p>
            <p>The basal activity of promoter PR leads to the expression of <it>cro </it>and, at a lower level, of <it>O, P, cII</it>, because a terminator site is located here as well. <it>Q </it>is not expressed, because of a second terminator located upstream of it.</p>
            <p>For the lysis-lysogeny decision <it>CII </it>is the crucial protein. It is protected by <it>CIII </it>from degradation by cellular enzymes. Thus, the concentration of <it>CII </it>depends on its rate of production, the activity of cellular proteinases and the concentration of <it>CIII</it>.</p>
            <p>The promoters PE, PI, PM are active only if enough <it>CII </it>is present to bind to them. Promoter PI initiates the expression of <it>int</it>. The <it>Int </it>protein is important for the integration of the phage DNA into the host genome. Promoter PE with <it>CII </it>leads to the production of <it>CI</it>, also called lambda-repressor. Therefore the promoter is called <ul>P</ul>romoter for repressor <ul>E</ul>arly (PE). <it>CI </it>binds to the operator sites O<sub>R1 </sub>and O<sub>R2 </sub>in promoter PR and to PL, thus blocking transcription from PR, PL, PE. But it activates its own synthesis via promoter PM (<ul>P</ul>romoter for repressor <ul>M</ul>aintenance). Thus the single gene for <it>cI </it>can be transcribed either from PM or from PE. These promoters are in fact serially organized at the DNA level. The promoters PM and PR share the operator sites O<sub>R1</sub>, O<sub>R2</sub>, O<sub>R3</sub>. These sites are bound in this order as the concentration of <it>CI </it>increases. Binding to O<sub>R1 </sub>and O<sub>R2 </sub>leads to inactivation of PR and activation of PM. However, binding to O<sub>R3 </sub>at even higher <it>CI </it>concentrations leads to inactivation of both PR and PM, so that <it>CI </it>down-regulates its own expression.</p>
            <p>At this point, the lambda DNA is integrated into the bacterial genome and <it>cI </it>is the only expressed lambda-phage gene. An auto-regulation circuit for controlling the concentration of <it>CI </it>at a high level is established. This is called the lysogenic state. Bacterial cells in this state show immunity to super-infection with lambda-phages, because they contain enough lambda-repressor to immediately repress the expression of the newly incoming lambda-phage genes. The <it>CI </it>protein, however, is prone to degradation by some bacterial enzymes, which are expressed by the bacterial cell as a stress response, e.g. upon UV irradiation. When the <it>CI </it>concentration decreases rapidly because of the degradation by cellular enzymes, PR is no longer repressed. This leads to production of <it>Cro</it>, the counter-player of <it>CI </it>in the lambda system. The degradation of <it>CI </it>triggered by stress response proteins is depicted in our model by a circular control-function with an input for the stress response signal, which could actually be a binding site for a stress response protein.</p>
            <p>The regulatory protein <it>Cro </it>activates its own promoter by competing with <it>CI </it>for binding to O<sub>R1</sub>, O<sub>R2</sub>, O<sub>R3</sub>. It binds to these sites with inverse preference compared to <it>CI</it>. Being a self-activating system, this leads to a rapid increase of <it>Cro </it>protein in the cell. <it>Cro </it>also allows activation of PL, leading to increasing amounts of <it>N</it>. <it>N </it>is an anti-terminator which binds to the terminators mentioned before. With <it>N </it>the expression of <it>cIII, xis </it>and <it>int </it>increases rapidly. <it>Xis </it>and <it>Int </it>are needed for the excision of the lambda-phage DNA from the bacterial genome. From PR not only <it>cro </it>is expressed, but also <it>O, P, cII</it>. <it>O </it>and <it>P </it>are needed for DNA replication of the lambda-phage. With <it>N </it>these genes are expressed at a significantly higher rate than without. <it>N </it>also allows the expression of <it>Q</it>. <it>Q </it>is an anti-terminator for structural genes coded downstream of promoter PR'. This means that once <it>CI </it>is degraded to sufficiently low concentrations, <it>Cro </it>is rapidly produced and then activates the genes necessary for excision from the host DNA, DNA replication and production of new phage particles, leading to host cell lysis and the release of new infectious phage particles ("<it>Cro </it>is opening Pandora's box").</p>
         </sec>
         <sec>
            <st>
               <p>A lambda-phage simulation</p>
            </st>
            <p>In our model the promoter PL is represented by the control-function P<sub>L</sub>; its output is 1 if the CI binding site is unbound or bound by Cro and 0 if the binding site is bound by CI (the control-function would look like "if (Cro-bound OR unbound) return 1 (=ON), if CI-bound return 0 (=OFF)"). The first terminator is modelled by introducing a control-function P<sub>L1 </sub>which has two inputs, one from a binding site for N and the other one from control-function P<sub>L</sub>. The three different possibilities for the production rate of CIII are degradation (state 0), production at a lower rate (state 1, if N is not bound to P<sub>L1</sub>, 80% of full rate) and production at a high rate (state 2, if the binding site for N at P<sub>L1 </sub>is occupied, full rate). Control-function P<sub>L2 </sub>is used to model the second terminator site. Without N this terminator leads to a complete stop of transcription. The input of P<sub>L2 </sub>is the output of P<sub>L1 </sub>and a binding site for N. The output equals the input from P<sub>L1 </sub>if N is bound, or is 0 if N is not bound. The control-function P<sub>int </sub>is used to model the transcriptional control of Int. The substance Int is generated either if P<sub>L2 </sub>is active or if the CII binding site of P<sub>L2 </sub>is occupied.</p>
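            <p>The three control-functions just described can be written down directly as small functions. The following Python sketch is illustrative only; the function and type names are hypothetical, and it assumes that P<sub>L1 </sub>returns state 0 whenever P<sub>L </sub>is OFF, which the description above implies but does not state explicitly.</p>
            <p><![CDATA[
```python
from enum import Enum

class SiteState(Enum):
    UNBOUND = 0
    CRO_BOUND = 1
    CI_BOUND = 2

def p_l(ci_site):
    """P_L: ON (1) unless the CI binding site is bound by CI."""
    return 0 if ci_site is SiteState.CI_BOUND else 1

def p_l1(p_l_out, n_bound):
    """P_L1 (first, leaky terminator): CIII production state 0, 1 or 2.

    Assumption: state 0 whenever P_L is OFF.
    """
    if p_l_out == 0:
        return 0                      # state 0: degradation
    return 2 if n_bound else 1        # state 2: full rate, state 1: 80% of full rate

def p_l2(p_l1_out, n_bound):
    """P_L2 (second terminator): passes the P_L1 output through only if N is bound."""
    return p_l1_out if n_bound else 0

# Early infection: nothing bound at the CI site, no N yet.
early = p_l1(p_l(SiteState.UNBOUND), n_bound=False)
print(early, p_l2(early, n_bound=False))
# Later: N has accumulated and is bound at both terminators.
late = p_l1(p_l(SiteState.CRO_BOUND), n_bound=True)
print(late, p_l2(late, n_bound=True))
```
]]></p>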
            <p>The implementation of the lambda-switch in the model is achieved in a similar way. The binding sites O<sub>R1</sub>, O<sub>R2 </sub>and O<sub>R3 </sub>can be bound by substance Cro or substance CI and are shared by the control-functions P<sub>R </sub>and P<sub>M</sub>. The association and dissociation constants of these substances for these binding sites differ, allowing preferential binding in opposite order.</p>
            <p>Using the simulator it is possible to run a simulation of the lambda-phage. Even a quite arbitrary parameter set leads to the expected behaviour. In the beginning all substances are produced to a greater or lesser extent. After some time the changes in substance production become smaller and some kind of steady state is reached (we will refer to this informally as "behaviour"). Over a wide range of parameter sets we have so far found only two principally different "behaviours". One possible outcome is a steady state where only CI is produced. We will refer to this as the lysogenic state (Figure <figr fid="F7">7</figr>, top). The other is a steady state where CI and CII are not produced but the other substances are (Figure <figr fid="F7">7</figr>, bottom). We will refer to this as the lytic state. The lytic behaviour shows down-regulation of substance CI and up-regulation of the other substances under control of substance Cro. Some of these are regulated by a negative feedback loop and are limited to a certain concentration. Some of the others grow without bound. The lysogenic behaviour is exemplified by down-regulation of all substances besides CI, which shows cyclic up- and down-regulation because of the feedback loop controlling its production/degradation. It is interesting to see that at first the substances are up-regulated until the "decision making" has taken place. Depending on the concentration of substance Cro and substance CI either lytic or lysogenic "behaviour" is selected. By changing the starting values for the rate of production of substance CII we can trigger the model into lytic or lysogenic behaviour. This reflects a property of the "real" lambda-phage, the dependence on the number of phage particles infecting one cell. If this number is high (about 10 phage particles per cell) the preference is for lysogeny, otherwise for lysis. In our model, having several substance generators producing the same substance at a low rate is equivalent to having one substance generator producing the substance at the correspondingly higher rate.</p>
            <fig id="F7">
               <title>
                  <p>Figure 7</p>
               </title>
               <caption>
                  <p>Simulation of lambda-phage model leading to lysogenic behaviour or lytic behaviour</p>
               </caption>
               <text>
                  <p>Simulation of lambda-phage model leading to lysogenic behaviour (top) or lytic behaviour (bottom)</p>
               </text>
               <graphic file="gb-2003-4-6-p5-7"/>
            </fig>
            <p>The simulator makes it easy to test the effects of mutations; thus it is possible to experiment with the model and compare the simulations with the real mutants.</p>
            <p>The potential of the lambda model has to be examined further, for example for what range of parameter sets we get similar behaviours and how many different kinds of behaviour we can find. But even arbitrary numbers already give promising results. What seems to be a shortcoming of the lambda-phage model is the unbounded growth of some substances (e.g. Int, Q). But this might as well be a property of the lambda-phage itself, because it appears in the lytic "behaviour", which finally leads to the lysis of the host cell. There is no strict need for feedback control of, e.g., the proteins responsible for the lysis of the cell, as the major function of these proteins is to kill the cell. The next challenge would be to find parameters which are derived from experimentally measured reaction constants. But the purpose of this model and simulation is to illustrate how the model works in principle rather than to come up with a new lambda-phage study.</p>
            <p>It is obvious that additions to the model are necessary to get closer to reality.</p>
            <p>In Figure <figr fid="F6">6</figr> we have already informally introduced a new kind of control-function, depicted by a circle to stress that it does not represent an action taking place at a promoter site. These control-functions can have the current concentrations of different substances (depicted by smaller circles labelled with the corresponding substance names) as inputs and a substance generator of a different substance as an output. Thus we can model the influence of cellular components on the concentration of a substance, for instance that of a certain proteinase on the concentration of its substrate. This is depicted in our model by the circular control-function with input sites for CIII, CII and other cellular influences. It is important to add that this feature has not yet been added to the simulator and is not included in the simulation shown in Figure <figr fid="F7">7</figr>.</p>
            <p>In the deterministic model, the state of the network is fully determined by its initial state and initial concentrations. To model the behaviour of the decision-making realistically <abbrgrp><abbr bid="B21">21</abbr></abbrgrp>, we need to introduce a stochastic element in the model.</p>
            <p>Instead of setting precise thresholds for switching from the detached to the attached state and vice versa, we treat these switches as probabilistic events: the higher the concentration, the higher the probability of switching to the attached state and the lower the probability of switching to the detached state, and vice versa. In this way a binary binding site can be defined as a triple B = (i,A,D), where as before i is the number of the substance that can bind to B, but A and D are two probability distributions, defining the probabilities of B switching from the detached to the attached state and vice versa, respectively, depending on the concentration c<sub>i</sub>.</p>
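            <p>A minimal Python sketch of such a probabilistic binding site is given below. It is not part of the model definition: the discrete update step, the saturating form c/(c + K) chosen for the attachment probability A, the choice D = 1 - A, and the names attach_probability, update_site and k_half are all assumptions made purely for illustration.</p>
            <p><![CDATA[
```python
import random

def attach_probability(conc, k_half):
    """Assumed form of A: increases with concentration, equals 0.5 at conc == k_half."""
    return conc / (conc + k_half)

def update_site(bound, conc, k_half, rng):
    """One probabilistic update of a binary binding site; returns the new state."""
    p_attach = attach_probability(conc, k_half)
    if bound:
        p_detach = 1.0 - p_attach        # assumed form of D
        return not (rng.random() < p_detach)
    return rng.random() < p_attach

# Fraction of updates in which a detached site attaches at a high concentration.
rng = random.Random(0)
trials = 10000
hits = sum(update_site(False, 9.0, 1.0, rng) for _ in range(trials))
print(hits / trials)   # close to the attachment probability 0.9
```
]]></p>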
         </sec>
         <sec>
            <st>
               <p>Open questions</p>
            </st>
            <p>We would like to extend our model with some informal elements to allow description of regulatory processes that may not be fully understood yet or may be too complicated for formal incorporation into the model. The extended model can be regarded as a semi-formal language for depicting gene-regulatory networks. The goals of such a semi-formal language are twofold: finding a semi-formal description of a network is the first step towards building a completely formal model which can be used for simulation (i.e., to "describe" the network to a computer), and at the same time it helps to depict the regulatory network in a systematic way (to describe regulatory networks to other humans). Note that such a semi-formal approach is often used in business modelling, where a formal graph-based description, which allows simulations of the given business process, is supplemented with informal comments that can be interpreted only by humans.</p>
            <p>As already noted, the formulation of the reverse engineering problem given in Section 3 is not entirely satisfactory, as it does not necessarily lead to the reconstruction of the "correct" network. A more satisfactory formulation involves assuming that the data have been produced by some unknown regulatory network (a black box), and the task is to find that or an equivalent network. For this, first, we need to define the equivalence of gene networks.</p>
            <p>Let &#915;<sub>1 </sub>and &#915;<sub>2 </sub>be two gene networks and let &#931;<sub>1</sub>(t<sub>0</sub>) and &#931;<sub>2</sub>(t<sub>0</sub>) be their compatible starting states at time point t<sub>0</sub>. Let &#931;<sub>i</sub>(t) = (C<sub>i</sub>(t),Q<sub>i</sub>(t)), for i = 1,2. We say that &#915;<sub>1 </sub>and &#915;<sub>2 </sub>are <it>equivalent for the starting states </it>&#931;<sub>1</sub>(t<sub>0</sub>) <it>and </it>&#931;<sub>2</sub>(t<sub>0</sub>), if C<sub>1</sub>(t<sub>0</sub>) = C<sub>2</sub>(t<sub>0</sub>) implies C<sub>1</sub>(t) = C<sub>2</sub>(t) for all t > t<sub>0</sub>. We say that &#915;<sub>1 </sub>and &#915;<sub>2 </sub>are <it>equivalent</it>, if they are equivalent for all compatible starting states &#931;<sub>1</sub>(t<sub>0</sub>) and &#931;<sub>2</sub>(t<sub>0</sub>) for which C<sub>1</sub>(t<sub>0</sub>) = C<sub>2</sub>(t<sub>0</sub>). We can also define an approximate equivalence, or more precisely, <it>d-equivalence </it>for a constant d &#8805; 0. For this the requirement that C<sub>1</sub>(t) = C<sub>2</sub>(t) is relaxed to |C<sub>1</sub>(t) - C<sub>2</sub>(t)| &#8804; d.</p>
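            <p>In practice the d-equivalence condition can only be checked at finitely many measured time points. The Python sketch below performs such a check on two sampled concentration trajectories. It is an illustration under two assumptions that are not fixed by the definition above: the difference |C<sub>1</sub>(t) - C<sub>2</sub>(t)| is taken component-wise (maximum norm), and both trajectories are sampled at the same time points. Passing this check is necessary but not sufficient for d-equivalence; the function name is hypothetical.</p>
            <p><![CDATA[
```python
def d_equivalent_on_samples(traj1, traj2, d=0.0):
    """Check |C1(t) - C2(t)| <= d component-wise at every sampled time point.

    traj1 and traj2 are lists of concentration vectors, one per time point,
    sampled at the same times for both networks.
    """
    if len(traj1) != len(traj2):
        raise ValueError("trajectories must be sampled at the same time points")
    return all(
        abs(a - b) <= d
        for c1, c2 in zip(traj1, traj2)
        for a, b in zip(c1, c2)
    )

# Two substances measured at three time points for each network.
c_first = [[0.0, 1.0], [1.0, 1.5], [2.0, 2.0]]
c_second = [[0.0, 1.0], [1.1, 1.5], [2.0, 1.8]]
print(d_equivalent_on_samples(c_first, c_second, d=0.25))  # True
print(d_equivalent_on_samples(c_first, c_second, d=0.0))   # False
```
]]></p>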
            <p>We define the <it>reverse engineering problem in the strict sense </it>in the following way. Let &#915; be an unknown gene network and let &#931;(t<sub>0</sub>) = (C(t<sub>0</sub>),Q(t<sub>0</sub>)) be its compatible starting state. We are allowed to measure the concentration state vector C(t) at any given time-point t &#8805; t<sub>0</sub>. The task is to find time points t<sub>1</sub>, t<sub>2</sub>, ..., t<sub>n</sub>, such that a network &#915;' equivalent to &#915; for the given starting state can be constructed from the measurements C(t<sub>1</sub>), C(t<sub>2</sub>), ..., C(t<sub>n</sub>).</p>
            <p>A generalized version of the problem is to find &#915;' equivalent to &#915; if we are allowed to choose arbitrary compatible starting states, and make series of concentration measurements for each of these states. Finally, a more practical problem is to find a network d-equivalent to &#915;, from approximate measurements.</p>
            <p>At the moment we do not know if these problems are algorithmically solvable or not, even by an enumeration algorithm. They have a certain analogy with the problem of restoring a finite state automaton from experiments <abbrgrp><abbr bid="B18">18</abbr></abbrgrp>. This is algorithmically solvable, but is NP-hard <abbrgrp><abbr bid="B19">19</abbr><abbr bid="B20">20</abbr></abbrgrp>. Despite the analogy, the situation with finite state linear networks differs from that of finite state automata in many respects.</p>
            <p>Our theorem on the reverse engineering of gene networks gives us grounds for optimism that the reverse engineering problem for gene networks can be solved; still, it is likely that heuristic methods will be needed for doing this in practice. To reconstruct gene networks, all available background knowledge, such as knowing which binding sites belong to which gene promoters, will have to be used. Therefore systematic studies of regulatory signals in genomes, such as <abbrgrp><abbr bid="B22">22</abbr></abbrgrp>, will complement the approach followed here.</p>
         </sec>
      </sec>
   </bdy>
   <bm>
      <ack>
         <sec>
            <st>
               <p>Acknowledgments</p>
            </st>
            <p>The authors benefited from discussing gene regulation with Frank Holstege and Jaques van Helden, and discussing the model with Mathieu Louis and Jaak Vilo. Martin Vingron pointed me to logistic differential equations. A. B.'s former colleagues from the Institute of Mathematics and Computer Science at the University of Latvia gave valuable insights into the problems of restoring general objects from particular examples. A conversation with Chris Sander was highly motivating. The general idea was sparked by the talk of Richard Karp at the ISMB99 conference.</p>
         </sec>
      </ack>
      <refgrp>
         <bibl id="B1">
            <aug>
               <au>
                  <snm>Ptashne</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>A genetic switch; phage lambda and higher organisms</source>
            <publisher>Oxford: Blackwell Science</publisher>
            <pubdate>1992</pubdate>
         </bibl>
         <bibl id="B2">
            <title>
               <p>Formalization of regulatory networks: a logical method and its automation.</p>
            </title>
            <aug>
               <au>
                  <snm>Thieffry</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Colet</snm>
                  <fnm>M</fnm>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Math Model Sci Comput</source>
            <pubdate>1993</pubdate>
            <volume>55</volume>
            <fpage>144</fpage>
            <lpage>151</lpage>
         </bibl>
         <bibl id="B3">
            <title>
               <p>Genetic network analysis in light of massively parallel biological data acquisition.</p>
            </title>
            <aug>
               <au>
                  <snm>Szallasi</snm>
                  <fnm>Z</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1999</pubdate>
            <fpage>5</fpage>
            <lpage>16</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10380181</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B4">
            <title>
               <p>Identification of genetic networks from a small number of gene expression patterns under the Boolean network model.</p>
            </title>
            <aug>
               <au>
                  <snm>Akutsu</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Miyano</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kuhara</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1999</pubdate>
            <fpage>17</fpage>
            <lpage>28</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10380182</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B5">
            <title>
               <p>Reveal, a general reverse engineering algorithm for inference of genetic network architectures.</p>
            </title>
            <aug>
               <au>
                  <snm>Liang</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Fuhrman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Somogyi</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1998</pubdate>
            <fpage>18</fpage>
            <lpage>29</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9697168</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B6">
            <title>
               <p>Modeling the normal and neoplastic cell cycle with "realistic Boolean genetic networks": their application for understanding carcinogenesis and assessing therapeutic strategies.</p>
            </title>
            <aug>
               <au>
                  <snm>Szallasi</snm>
                  <fnm>Z</fnm>
               </au>
               <au>
                  <snm>Liang</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1998</pubdate>
            <fpage>66</fpage>
            <lpage>76</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9697172</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B7">
            <title>
               <p>Modeling gene expression with differential equations.</p>
            </title>
            <aug>
               <au>
                  <snm>Chen</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>He</snm>
                  <fnm>HL</fnm>
               </au>
               <au>
                  <snm>Church</snm>
                  <fnm>GM</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1999</pubdate>
            <fpage>29</fpage>
            <lpage>40</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10380183</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B8">
            <title>
               <p>Linear modeling of mRNA expression levels during CNS development and injury.</p>
            </title>
            <aug>
               <au>
                  <snm>D'Haeseleer</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Wen</snm>
                  <fnm>X</fnm>
               </au>
               <au>
                  <snm>Fuhrman</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Somogyi</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1999</pubdate>
            <fpage>41</fpage>
            <lpage>52</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10380184</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B9">
            <title>
               <p>Modeling transcriptional control in gene networks - methods, recent results, and future directions.</p>
            </title>
            <aug>
               <au>
                  <snm>Smolen</snm>
                  <fnm>P</fnm>
               </au>
               <au>
                  <snm>Baxter</snm>
                  <fnm>DA</fnm>
               </au>
               <au>
                  <snm>Byrne</snm>
                  <fnm>JH</fnm>
               </au>
            </aug>
            <source>Bull Math Biol</source>
            <pubdate>2000</pubdate>
            <volume>62</volume>
            <fpage>247</fpage>
            <lpage>292</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1006/bulm.1999.0155</pubid>
                  <pubid idtype="pmpid">10824430</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B10">
            <title>
               <p>Genetic control of flower morphogenesis in Arabidopsis thaliana: a logical analysis.</p>
            </title>
            <aug>
               <au>
                  <snm>Mendoza</snm>
                  <fnm>L</fnm>
               </au>
               <au>
                  <snm>Thieffry</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Alvarez-Buylla</snm>
                  <fnm>ER</fnm>
               </au>
            </aug>
            <source>Bioinformatics</source>
            <pubdate>1999</pubdate>
            <volume>15</volume>
            <fpage>593</fpage>
            <lpage>606</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1093/bioinformatics/15.7.593</pubid>
                  <pubid idtype="pmpid" link="fulltext">10487867</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B11">
            <title>
               <p>Algorithms for inferring qualitative models of biological networks.</p>
            </title>
            <aug>
               <au>
                  <snm>Akutsu</snm>
                  <fnm>T</fnm>
               </au>
               <au>
                  <snm>Miyano</snm>
                  <fnm>S</fnm>
               </au>
               <au>
                  <snm>Kuhara</snm>
                  <fnm>S</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>2000</pubdate>
            <fpage>293</fpage>
            <lpage>304</lpage>
            <xrefbib>
               <pubid idtype="pmpid">10902178</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B12">
            <title>
               <p>Qualitative analysis of gene networks.</p>
            </title>
            <aug>
               <au>
                  <snm>Thieffry</snm>
                  <fnm>D</fnm>
               </au>
               <au>
                  <snm>Thomas</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>Pac Symp Biocomput</source>
            <pubdate>1998</pubdate>
            <fpage>77</fpage>
            <lpage>88</lpage>
            <xrefbib>
               <pubid idtype="pmpid">9697173</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B13">
            <title>
               <p>Regulatory networks seen as asynchronous automata: a logical description.</p>
            </title>
            <aug>
               <au>
                  <snm>Thomas</snm>
                  <fnm>R</fnm>
               </au>
            </aug>
            <source>J Theor Biol</source>
            <pubdate>1991</pubdate>
            <volume>153</volume>
            <fpage>1</fpage>
            <lpage>23</lpage>
         </bibl>
         <bibl id="B14">
            <title>
               <p>Dissecting the regulatory circuitry of a eukaryotic genome.</p>
            </title>
            <aug>
               <au>
                  <snm>Holstege</snm>
                  <fnm>FC</fnm>
               </au>
               <au>
                  <snm>Jennings</snm>
                  <fnm>EG</fnm>
               </au>
               <au>
                  <snm>Wyrick</snm>
                  <fnm>JJ</fnm>
               </au>
               <au>
                  <snm>Lee</snm>
                  <fnm>TI</fnm>
               </au>
               <au>
                  <snm>Hengartner</snm>
                  <fnm>CJ</fnm>
               </au>
               <au>
                  <snm>Green</snm>
                  <fnm>MR</fnm>
               </au>
               <au>
                  <snm>Golub</snm>
                  <fnm>TR</fnm>
               </au>
               <au>
                  <snm>Lander</snm>
                  <fnm>ES</fnm>
               </au>
               <au>
                  <snm>Young</snm>
                  <fnm>RA</fnm>
               </au>
            </aug>
            <source>Cell</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>717</fpage>
            <lpage>728</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9845373</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B15">
            <title>
               <p>Exploring the metabolic and genetic control of gene expression on a genomic scale.</p>
            </title>
            <aug>
               <au>
                  <snm>DeRisi</snm>
                  <fnm>JL</fnm>
               </au>
               <au>
                  <snm>Iyer</snm>
                  <fnm>VR</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1997</pubdate>
            <volume>278</volume>
            <fpage>680</fpage>
            <lpage>686</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1126/science.278.5338.680</pubid>
                  <pubid idtype="pmpid" link="fulltext">9381177</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B16">
            <title>
               <p>Cluster analysis and display of genome-wide expression patterns.</p>
            </title>
            <aug>
               <au>
                  <snm>Eisen</snm>
                  <fnm>MB</fnm>
               </au>
               <au>
                  <snm>Spellman</snm>
                  <fnm>PT</fnm>
               </au>
               <au>
                  <snm>Brown</snm>
                  <fnm>PO</fnm>
               </au>
               <au>
                  <snm>Botstein</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Proc Natl Acad Sci U S A</source>
            <pubdate>1998</pubdate>
            <volume>95</volume>
            <fpage>14863</fpage>
            <lpage>14868</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="pmcid">24541</pubid>
                  <pubid idtype="pmpid" link="fulltext">9843981</pubid>
                  <pubid idtype="doi">10.1073/pnas.95.25.14863</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B17">
            <title>
               <p>One-stop shop for microarray data.</p>
            </title>
            <aug>
               <au>
                  <snm>Brazma</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Robinson</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Cameron</snm>
                  <fnm>G</fnm>
               </au>
               <au>
                  <snm>Ashburner</snm>
                  <fnm>M</fnm>
               </au>
            </aug>
            <source>Nature</source>
            <pubdate>2000</pubdate>
            <volume>403</volume>
            <fpage>699</fpage>
            <lpage>700</lpage>
            <xrefbib>
               <pubidlist>
                  <pubid idtype="doi">10.1038/35001676</pubid>
                  <pubid idtype="pmpid" link="fulltext">10693778</pubid>
               </pubidlist>
            </xrefbib>
         </bibl>
         <bibl id="B18">
            <title>
               <p>Gedanken-experiments on sequential machines.</p>
            </title>
            <aug>
               <au>
                  <snm>Moore</snm>
                  <fnm>EF</fnm>
               </au>
            </aug>
            <source>Automata Studies</source>
            <publisher>Princeton University Press</publisher>
            <editor>Shannon CE, McCarthy J</editor>
            <pubdate>1956</pubdate>
            <fpage>129</fpage>
            <lpage>153</lpage>
         </bibl>
         <bibl id="B19">
            <title>
               <p>On the complexity of minimum inference of regular sets.</p>
            </title>
            <aug>
               <au>
                  <snm>Angluin</snm>
                  <fnm>D</fnm>
               </au>
            </aug>
            <source>Inform Control</source>
            <pubdate>1978</pubdate>
            <volume>39</volume>
            <fpage>337</fpage>
            <lpage>350</lpage>
         </bibl>
         <bibl id="B20">
            <title>
               <p>Complexity of automaton identification from given data.</p>
            </title>
            <aug>
               <au>
                  <snm>Gold</snm>
                  <fnm>EM</fnm>
               </au>
            </aug>
            <source>Inform Control</source>
            <pubdate>1978</pubdate>
            <volume>37</volume>
            <fpage>302</fpage>
            <lpage>320</lpage>
         </bibl>
         <bibl id="B21">
            <title>
               <p>Circuit simulation of genetic networks.</p>
            </title>
            <aug>
               <au>
                  <snm>McAdams</snm>
                  <fnm>HH</fnm>
               </au>
               <au>
                  <snm>Shapiro</snm>
                  <fnm>L</fnm>
               </au>
            </aug>
            <source>Science</source>
            <pubdate>1995</pubdate>
            <volume>269</volume>
            <fpage>650</fpage>
            <lpage>656</lpage>
            <xrefbib>
               <pubid idtype="pmpid">7624793</pubid>
            </xrefbib>
         </bibl>
         <bibl id="B22">
            <title>
               <p>Predicting gene regulatory elements in silico on a genomic scale.</p>
            </title>
            <aug>
               <au>
                  <snm>Brazma</snm>
                  <fnm>A</fnm>
               </au>
               <au>
                  <snm>Jonassen</snm>
                  <fnm>I</fnm>
               </au>
               <au>
                  <snm>Vilo</snm>
                  <fnm>J</fnm>
               </au>
               <au>
                  <snm>Ukkonen</snm>
                  <fnm>E</fnm>
               </au>
            </aug>
            <source>Genome Res</source>
            <pubdate>1998</pubdate>
            <volume>8</volume>
            <fpage>1202</fpage>
            <lpage>1215</lpage>
            <xrefbib>
               <pubid idtype="pmpid" link="fulltext">9847082</pubid>
            </xrefbib>
         </bibl>
      </refgrp>
   </bm>
</art>

