The Optimal Control Process of Therapeutic Intervention

IEEE TRANSACTIONS ON SIGNAL PROCESSING, VOL. 60, NO. 9, SEPTEMBER 2012, pp. 4930-4944

Optimal Intervention Strategies for Therapeutic Methods With Fixed-Length Duration of Drug Effectiveness

Mohammadmahdi R. Yousefi, Student Member, IEEE, Aniruddha Datta, Fellow, IEEE, and Edward R. Dougherty, Fellow, IEEE

Abstract—Intervention in gene regulatory networks in the context of Markov decision processes has usually involved finding an optimal one-transition policy, where a decision is made at every transition whether or not to apply treatment. In an effort to model dosing constraints, a cyclic approach to intervention has previously been proposed in which there is a sequence of treatment windows and treatment is allowed only at the beginning of each window. This protocol ignores two practical aspects of therapy. First, a treatment typically has some duration of action: a drug will be effective for some period, after which there can be a recovery phase. This, too, might involve a cyclic protocol; however, in practice, a physician might monitor a patient at every stage and decide whether to apply treatment; if treatment is applied, then the patient will be under the influence of the drug for some duration, followed by a recovery period. This results in an acyclic protocol. In this paper we take a unified approach to both cyclic and acyclic control with duration of effectiveness by placing the problem in the general framework of multiperiod decision epochs with infinite-horizon discounting cost. The time interval between successive decision epochs can have multiple time units, where, given the current state and the action taken, there is a joint probability distribution defined for the next state and the time when the next decision epoch will be called. Optimal control policies are derived, synthetic networks are used to investigate the properties of both cyclic and acyclic interventions with fixed duration of effectiveness, and the methodology is applied to a mutated mammalian cell-cycle network.

Index Terms—Acyclic intervention, cyclic intervention, drug scheduling, gene regulatory network, genomic signal processing, optimal control.

I. INTRODUCTION

MULTIPLE effects must be taken into account in cancer treatment, specifically in chemotherapy: potential toxicity, duration of effect, and side effects of the drug.

Manuscript received November 13, 2011; revised March 23, 2012; accepted May 16, 2012. Date of publication June 01, 2012; date of current version August 07, 2012. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Z. Jane Wang. This work was supported by the Qatar National Research Fund (a member of the Qatar Foundation) by Grant NPRP. The statements made herein are solely the responsibility of the authors. M. R. Yousefi and A. Datta are with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX USA (e-mail: m.rezaei@tamu.edu; datta@ece.tamu.edu). E. R. Dougherty is with the Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX USA. He is also with the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ USA, and the Department of Bioinformatics and Computational Biology of the University of Texas M. D. Anderson Cancer Center, Houston, TX USA (e-mail: edward@ece.tamu.edu). This paper has supplementary downloadable multimedia material, provided by the authors. The supplementary material is a PDF file and is 19.5 kB in size.
The aim is to find the right dose schedules maximizing the benefit-to-toxicity ratio [1]. To this end, chemotherapy is generally given in cycles. Each therapeutic window begins by delivering the drug, which will be effective on the target cell(s) for some period of time [2], followed by a recovery phase. The chance of eradicating the tumor can be maximized by delivering the most effective drug dose level over as short a time as possible; however, the toxicity level of the drug could be intolerable to the patient. Hence, it is necessary to optimize a delivery schedule that results in the maximum integrated drug effect consistent with reasonable quality of life [1]. In this paper we model therapeutic intervention in gene regulatory networks (GRNs) with the goal of finding optimal intervention strategies when drug effectiveness has a fixed length of duration.

To date, probabilistic Boolean networks (PBNs) have served as the main vehicle for studying control-theoretic intervention in GRNs [3]. PBNs can be studied in the context of Markov chains in which each state represents a gene activity profile (GAP) and state transitions are made based on a transition probability matrix (TPM). Since the goal is to intervene in a PBN to minimize (or maximize) certain objective functions, we can use methods proposed in the Markov decision process (MDP) literature to find an optimal intervention strategy. Most MDP studies use dynamic programming as a standard iterative optimization algorithm to solve the optimal control problem for a stochastic system with associated costs [4]. A predefined cost at each stage is applied when the control action has to be taken and/or the next state is in the set of undesirable states. The objective is to devise a control strategy whose actions minimize the expected total discounted cost in the long run.

Usually, control strategies in the context of MDPs are implemented at every stage. One intervention strategy is to decide at every transition whether or not to apply treatment [5], thereby resulting in an optimal one-transition intervention policy. In [5], such an optimal infinite-horizon control policy has been designed for any PBN. The stationary policy obtained is independent of time and dependent on the current state; the paper concentrates on discounted-cost problems with bounded cost per stage and on average-cost-per-stage problems. The treatment strategy is modeled by either intervention in the form of toggling the current state of a specified gene, which is called the control gene, or no-intervention (letting the network evolve on its own). That is, if the control gene is OFF (down-regulated), the intervention action will set it to ON (up-regulated) and vice versa, or we do not intervene at all. The set of control actions consists of two elements.

Fig. 1. (a) Intervention is only permitted at the beginning of each cycle. Thereafter, the network evolves freely with no intervention (recovery phase). (b) Decisions are made as to whether the control gene should be up-regulated, down-regulated, or left alone. The drug, if taken, will be effective on the target cell(s) for a fixed length of time, followed by a recovery phase of fixed length. (c) If the controller decides not to intervene, the next decision is made one period later, but if the controller decides to intervene, the intervention will be effective for some number of transitions, followed by a recovery period.

However, one can generally assume that the intervention action forces the control gene to be either ON or OFF rather than toggling its current state. In this case, the set of possible control actions includes three elements: no-intervention, ON, and OFF.

Forsaking a one-transition intervention policy in order to model dosing constraints in chemotherapy, [6] finds an optimal cyclic intervention strategy by defining a treatment window to be every $W$ transitions of the system. Intervention is only permitted at the beginning of each treatment window. Thereafter, the network evolves for the remaining time steps of the window without any intervention. An optimal cyclic intervention policy, the optimal $W$-transition policy, is found by solving the stochastic control problem for a Markov chain with augmented state space via dynamic programming algorithms [6]. Fig. 1(a) shows a schematic of this model.

The $W$-transition policy ignores the drug's duration of action. The drug will be effective on the target cell(s) for some period, after which there can be a recovery phase. Although in general the lengths $D$ and $R$ of the drug effectiveness period and the recovery period, as well as the dose schedule, can vary, here we assume that these lengths are fixed. Fig. 1(b) shows a schematic of this model. It is possible to find an optimal intervention strategy for a cyclic therapeutic procedure with fixed-length duration of drug effectiveness by taking an approach similar to that in [6]. This involves finding the TPMs of the controlled network, then constructing the augmented Markov chain and showing that the augmented state space can be collapsed to a smaller one, which results in a compressed space of size equal to the original state space. Then, using the value iteration algorithm [4], by accumulating the expected cost of the system in each $W$-transition period and adding the discounted expected total cost from the previous decision stage, one can find the proper policy that minimizes the total discounted cost for this infinite-horizon optimization problem.

Using cyclic models (with or without drug duration of action) necessitates waiting until the next therapeutic cycle ($W$ transitions later) if no control action is taken. In a real therapeutic scenario, the physician can monitor the status of the patient at every stage and decide whether to prescribe treatment or wait until the next appointment, when a new decision will be made. If treatment is prescribed, then the patient will be under the influence of the drug for some period of time, although not being monitored; or, equivalently, the corresponding GRN of cancerous cells will be under an invariant control action for a certain number of transitions, followed by a recovery period. The next decision will be made after this period has elapsed. Therefore, the therapeutic strategy will not necessarily have a cyclic structure. Fig. 1(c) shows an example of this acyclic model.
In this paper we take a unified approach to both cyclic and acyclic control by placing the problem in the general framework of multiperiod decision epochs in MDPs with infinite-horizon discounting cost [7]. In multiperiod decision epochs, the time interval between successive decision epochs can have multiple time units, where, given the current state and the action taken, there is a joint probability distribution defined for the next state and the time when the next decision epoch will be called.

II. PROBABILISTIC BOOLEAN NETWORKS

A PBN consists of $n$ nodes (genes) and a set of vector-valued predictor functions. In the framework of gene regulation, each element represents the expression level of the $i$th gene. In a binary PBN, each node takes a value in $\{0, 1\}$: if the value is 0, then the $i$th gene is OFF (down-regulated); if it is 1, then the $i$th gene is ON (up-regulated). Each vector-valued function determines a constituent Boolean network of the PBN, and its $i$th component is the predictor of gene $i$ whenever that Boolean network is selected. The PBN is called instantaneously random if the regulatory functions are permitted to change at every updating epoch. The PBN is said to be context-sensitive if the regulatory functions update their values only at time points selected by a binary switching random process. This process models the effect of latent variables outside the model, whose behaviors influence regulation within the system. Context-sensitive PBNs are more suitable for the analysis of gene regulation and can better represent the stability of biological systems by capturing the period of sojourning in constituent networks [8]. At each updating epoch, a binary random variable determines whether the constituent network is switched (with the switching probability) or not. If the context remains unchanged, then the values of all the genes are updated synchronously according to the current constituent Boolean network. On the other hand, if a switch occurs, then a predictor function is randomly selected according to a given selection probability distribution, and the values of the genes are updated. In this framework, instantaneously random PBNs constitute a subclass of context-sensitive PBNs in which the switching probability equals 1 (the wiring diagram of the system is subject to change at every instant). We assume PBNs with perturbation, meaning that at each time instant there is a small perturbation probability, independent of the switching probability distributions, that any gene will change its value.

It is worth mentioning that a Boolean network with perturbation is also a special case of a PBN with perturbation, namely one with a single constituent network. The GAP is defined to be an $n$-digit binary vector giving the binary expression levels of the genes at time (epoch) $k$. A natural bijection between the GAP and its decimal representation can be established, and we refer to a GAP by its decimal equivalent.

In the presence of external control, a PBN admits a control input taking values in a space of control actions. For now, assume that control is permitted at any epoch. The control input specifies the intervention on a single control gene. For example, suppose there are three possible actions. If the control at the updating epoch is the up-regulating action, then the control gene will be up-regulated; if it is the down-regulating action, then the control gene will be down-regulated; and under the null action the control gene is not forcibly altered (that is, it might stay the same or it might change with the unforced evolution of the network). We assume that the control gene is given and fixed for the rest of this paper. A gene that is reflective of metastasis or any undesirable condition of the network is called a target gene. It can be used to partition the set of all states into subsets of desirable and undesirable states.

The dynamic behavior of a controllable PBN can be modeled by a discrete-time controlled Markov chain [9]. To this end, suppose the state of the PBN at time $k$ is an element of a state space that contains information about both the current constituent BN and the GAP of the underlying network. Suppose that the set of constituent networks is ordered, meaning that one can likewise draw a bijection between a constituent network and a decimal index. Originating from a given state, and given the control applied in that state, the successor state is selected randomly according to the corresponding transition probability. These transition probabilities can be calculated as explained in [8]. One might alternatively write the system evolution as a stationary discrete-time equation in which the next state is a function of the current state, the control, and a disturbance term belonging to a countable space. As above, we constrain the control to take values in a given nonempty subset of the action space, which may depend on the current state. The disturbance is assumed to be generated from a given distribution; it represents uncertainties in the form of network switching probabilities and random gene perturbations. One can therefore pass between the transition-probability description and the system-equation description whenever convenient [4]. Gene perturbation guarantees that the Markov chain associated with any stationary control policy is ergodic and has a unique invariant distribution equal to its limiting distribution.

It is common to assume that the active constituent network of the context-sensitive PBN is not observable at each instant [5], [10]. This facilitates designing the control policy, without explicit knowledge of the context, by reducing the dimensionality of the state space; the total number of states after reduction is $2^n$. In this scenario, the TPM is constructed with the context removed from the state space and is computed as a weighted sum of the GAP behaviors over all the possible constituent networks [8]. At every step, the reduced system exhibits an expected behavior obtained by averaging over all possible contexts. As such, the GAP determines the status of the approximate system and the collapsed transition probability matrix specifies its evolution. It has been shown in [8] that averaging over the various contexts reduces the TPM of a context-sensitive PBN to that of the instantaneously random PBN with identical parameters. The effects of using exact and approximate TPMs on the intervention performance have been extensively studied in [8].
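As a concrete illustration of these update rules, the following is a minimal sketch (in Python/NumPy; function and variable names are illustrative, not taken from the paper) of one updating epoch of an instantaneously random PBN with perturbation. It assumes the convention that a random gene perturbation, when it occurs, overrides the regular synchronous update.

```python
import numpy as np

def pbn_step(x, networks, net_probs, p_perturb, rng):
    """One updating epoch of an instantaneously random PBN with perturbation.

    x         : current GAP, binary vector of length n
    networks  : constituent networks; each is a list of n Boolean predictor
                functions mapping the full GAP to {0, 1}
    net_probs : selection distribution over the constituent networks
    p_perturb : per-gene random perturbation probability
    """
    n = len(x)
    # Each gene flips independently with probability p_perturb; if any gene
    # flips, the perturbation overrides the regular synchronous update.
    flips = rng.random(n) < p_perturb
    if flips.any():
        return np.where(flips, 1 - x, x)
    # Otherwise select a constituent Boolean network and update all genes
    # synchronously according to its predictors.
    k = rng.choice(len(networks), p=net_probs)
    return np.array([f(x) for f in networks[k]], dtype=int)

# Toy two-gene example with two equally likely constituent networks.
rng = np.random.default_rng(0)
nets = [
    [lambda x: x[1], lambda x: 1 - x[0]],            # BN 1
    [lambda x: x[0] & x[1], lambda x: x[0] | x[1]],  # BN 2
]
x = np.array([1, 0])
for _ in range(5):
    x = pbn_step(x, nets, [0.5, 0.5], 0.01, rng)
```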
III. CONTROLLED NETWORKS

An external control input targets the control gene (or its products) that affects the GRN of interest. In our model, if no drug is delivered at a decision epoch, then the status of the control gene up to the next decision epoch evolves unrestrainedly. Accordingly, intervention in the form of delivering the prescribed drug at a decision epoch is assumed to be associated with making the status of the control gene ON or OFF. If the intervention policy forces the control gene to be ON (or OFF) for some period of time without any change, this in turn makes the states of the state space for which the control gene is OFF (or ON) inaccessible, since the system cannot transition to this specific set of states. Therefore, in the presence of external control, we will have a new PBN with a shrunken state space. Our objective in this section is to find the new TPMs of the PBN under intervention via some matrix manipulations explained in the following subsections. Although for clarity of presentation we consider only one control gene, the method of obtaining the controlled transition matrices can be easily generalized to any number of control genes.

A. Controlled Phase

Assume an $n$-gene uncontrolled PBN. The goal is to calculate the TPM of this PBN under control. We consider a single control gene and, without loss of generality, assume it corresponds to the most significant bit in the binary representation of the states. By forcing it to be either 0 or 1, we will have an $(n-1)$-gene PBN with $2^{n-1}$ states. Corresponding to each control action, the state space can be partitioned into two sets of accessible and inaccessible states; the members and cardinality of the accessible set depend on the choice of the control action. If the control action forces the control gene to a given value, then the set of states in which the control gene takes that value is accessible and the set of states in which it takes the opposite value is inaccessible. For the no-intervention action, all states are accessible. The TPM of the original PBN can be partitioned into four submatrices based on the current and next states: (1) accessible state to accessible state; (2) accessible state to inaccessible state; (3) inaccessible state to accessible state; and (4) inaccessible state to inaccessible state.

Given that the control action is an intervention, the current state cannot be among the inaccessible states, since the value of the control gene is fixed. Thus, the last two submatrices can be discarded from the TPM, leaving us with the first two submatrices. From an accessible state, before the control gene is constrained by the control action, the network could transition either to accessible or to inaccessible states with some probabilities according to the TPM. However, after the restriction, all the probability mass of transitions to states of both types will be absorbed by the set of accessible states corresponding to the given fixed value of the control gene. Hence, to calculate the reduced controlled TPM, we need only add the first two submatrices entry by entry. For the no-intervention case, the TPM remains unchanged. We denote the TPM of the uncontrolled network, which has full size, by $P$, and the TPM of the network under control action $u$ by $P_u$, which has a smaller size whenever any intervention happens; by definition, $P_0 = P$. To keep the sizes of the TPMs consistent, we can assume that the probability of going from any accessible state to any inaccessible state, and from any inaccessible state to any other state, is zero. Therefore, we will have some rows and columns in the TPM where all the elements are zero.

B. Transient Phases

Next, we need to discuss an important issue in this class of intervention problems related to two types of transient phases. Because of the transient effect of control on the network, we also need to find the TPMs when the system transitions from an uncontrolled phase to a controlled phase and vice versa.

1) Uncontrolled to Controlled Phase: When the system is in the uncontrolled phase, for example during the recovery period, it follows the original transition matrix $P$. At the end of this phase, before applying any control action and before entering the controlled phase, once the current state is observed, the control strategy decides whether to intervene or not. With intervention, the system will have a different TPM for only one stage until it enters the controlled phase. After this transition, there will be a reduction in the size of the network, leading to a smaller TPM if an intervention is taken; otherwise, it remains the same. To find the TPM by which the system makes the one-stage transition, we need to first find out from which states to which states the network can transition. When uncontrolled, the network can transition from any of the original states. With intervention, the value of the control gene becomes fixed; hence, the network can only transition to states in the accessible set, following the transition rules of the reduced network. For example, if the control action sets the control gene to 1, then we will have a TPM in which transitions are possible from states in both sets to only those states in which the control gene equals 1. Because the external control forces the value of the control gene, the probability of going to a state that disagrees with the forced value is transferred to the paired accessible state that agrees with it. The corresponding probabilities can be found in the same way as discussed in Section III-A, and the probability of going from any state to any inaccessible state is zero. We denote the TPM of the network from the uncontrolled phase to the controlled phase under action $u$ by $P_{0u}$; when the control action is null, $P_{00} = P$.
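The following sketch illustrates the matrix manipulations described above, under the assumed convention that the control gene is the most significant bit of the state index; the function names and the zero-padding convention are illustrative, not the paper's code.

```python
import numpy as np

def controlled_tpm(P, forced_value):
    """Zero-padded TPM of the network when the control gene (assumed to be
    the most significant bit of the state index) is held at forced_value."""
    N = P.shape[0]
    half = N // 2
    # With the MSB convention, states 0..half-1 have the control gene at 0
    # and states half..N-1 have it at 1.
    acc = np.arange(half) + (half if forced_value == 1 else 0)   # accessible
    inacc = np.setdiff1d(np.arange(N), acc)                      # inaccessible
    Pu = np.zeros_like(P)
    # Mass that would have flowed from an accessible state to an inaccessible
    # one is absorbed by the paired accessible state: the two column blocks
    # of P are added entry by entry (the pairing is by the lower n-1 bits).
    Pu[np.ix_(acc, acc)] = P[np.ix_(acc, acc)] + P[np.ix_(acc, inacc)]
    return Pu

def transient_tpm(P, forced_value):
    """One-stage uncontrolled-to-controlled TPM: rows may be any state,
    columns are restricted to the accessible set. forced_value=None denotes
    the null action, for which the original TPM is kept."""
    if forced_value is None:
        return P.copy()
    N = P.shape[0]
    half = N // 2
    acc = np.arange(half) + (half if forced_value == 1 else 0)
    inacc = np.setdiff1d(np.arange(N), acc)
    Pt = np.zeros_like(P)
    Pt[:, acc] = P[:, acc] + P[:, inacc]
    return Pt
```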
2) Controlled to Uncontrolled Phase: For the transition from the controlled phase to the uncontrolled phase, assume that the network is at a stage at which the control action is about to become ineffective. Then the current state can only take on values from the accessible states, depending on the control action taken a few stages earlier, but the next state can be any state in the state space, since the control action is no longer effective. Hence, the network will follow the original TPM, meaning that the corresponding transition matrix is equal to $P$. When there are several control genes, a similar approach, with some adjustments, can easily be adapted to find the corresponding TPMs. We are now ready to formulate the problem of optimal intervention for both cyclic and acyclic therapeutic procedures with fixed-length duration of drug effectiveness.

IV. INFINITE HORIZON MDPS WITH DISCOUNTING COST

We begin with an overview of MDP problems in which decision epochs happen at every stage, whenever a transition occurs, and the cost function is discounted over time for an infinite horizon. By suitably utilizing the available results for these kinds of problems, we will subsequently formulate and solve discounting problems with multiperiod decision epochs.

A. Discounting Problems With Single-Period Decision Epochs

In a classic MDP, we assume that decisions are taken at the beginning of each successive time period of fixed length. For the system described above and given an initial state $z_0$, the goal is to find a policy $\pi = \{\mu_0, \mu_1, \ldots\}$ that minimizes the expected total discounted cost

$J_\pi(z_0) = \lim_{N\to\infty} E\{\textstyle\sum_{k=0}^{N-1} \alpha^{k}\, g(z_k, \mu_k(z_k))\}$   (1)

where

$g(z, u) = E_w\{\tilde g(z, u, w)\}$   (2)

is the cost for using control $u$ at state $z$, with $w$ the disturbance. The expected value in (2) is evaluated with respect to the distribution of the disturbance. The expectation in (1) is taken with respect to a probability distribution constructed based on the system equation, the disturbance distributions, the initial state $z_0$, and the policy $\pi$; this probability distribution is defined on a countable product set. The discounting factor $\alpha \in (0, 1)$ guarantees convergence of the expected total cost [4]. We define a stationary and bounded function $g$ as the cost associated with each transition in the system. We denote the set of all admissible control policies by $\Pi$ and desire a control policy in $\Pi$ that gives the optimal cost function $J^*(z) = \min_{\pi\in\Pi} J_\pi(z)$, defined for all $z$. An admissible stationary control policy $\mu$ is optimal if $J_\mu(z) = J^*(z)$ for all $z$.

On the basis of the principles mentioned above, it is possible, under certain conditions, to formulate a functional equation and a mapping $T$, obtained by applying the dynamic programming mapping to any function $J$, for all $z$, as

$(TJ)(z) = \min_{u} \big[ g(z, u) + \alpha \textstyle\sum_{z'} p_{zz'}(u)\, J(z') \big]$

where $p_{zz'}(u)$ is the transition probability of the controlled Markov chain corresponding to a controllable PBN. By looking at the infinite-horizon problem as outlined above, we can determine the optimal stationary policy with the help of three important propositions (convergence, optimality, and uniqueness of the solution), resulting in an iterative algorithm called value iteration. Proofs of these statements can be found in [4].

B. Discounting Problems With Multiperiod Decision Epochs

To model the multiperiod decision epochs problem, assume that the time intervals between successive decision epochs depend on the decisions made. Define a fixed maximum length $\tau_{\max}$ of the time allowed between two consecutive decision epochs. The controlled Markov chain is described by transition probabilities $p(z', t \mid z, u)$: the probability that, given action $u$ in state $z$, the next decision will be called for $t$ units of time later and the new state will be $z'$. Formulating the expected total discounted cost requires redefining the equation representing the dynamics of this particular system. We write the new system equation so that states are indexed by decision epochs, where $t_k$ indicates the time at which the $k$th decision epoch happens; without loss of generality, let $t_0 = 0$. For all $k$, the disturbance term is contained in a countable space and is assumed to be generated from a given distribution. The system equation states that, given the current state at time $t_k$ and the control action taken in that state, the next state and the time when the next decision will be made form a random vector that follows a joint probability distribution.

Given an initial state $z_0$, the goal is to find a policy that minimizes the expected total discounted cost

$J_\pi(z_0) = \lim_{N\to\infty} E\{\textstyle\sum_{k=0}^{N-1} \alpha^{t_k}\, K(z_k, \mu_k(z_k))\}$   (3)

where

$K(z, u) = E\{\textstyle\sum_{s=1}^{t} \alpha^{s-1}\, g(z_s, u) \mid z_0 = z\}$   (4)

is the cost for using control $u$ at state $z$, with $t$ the interval to the next decision epoch and $z_1, z_2, \ldots$ the intermediate states. Note that, since $K$ is defined for the entire time interval up to the next decision epoch, it involves discounting as well, and the specification of its value involves an expectation. The expected value in (4) is evaluated with respect to the distribution of the intermediate transitions. The expectation in (3) is taken with respect to a probability distribution constructed based on the system equation, the disturbance distributions, the initial state $z_0$, and the policy $\pi$; this probability distribution is defined on a countable set. The discounting factor $\alpha$ guarantees the convergence of (3) in the long run.

For cyclic or acyclic intervention in PBNs, we define $K(z, u)$ to be the immediate cost function over the multiperiod interval when a transition from state $z$ under control $u$ to the state at the next epoch is made in subsequent periods. To minimize the expected total discounted cost, [7] suggests solving the following functional equation, with the conditions characterizing it, for all $z$:

$J(z) = \min_{u} \big[ K(z, u) + \textstyle\sum_{t=1}^{\tau_{\max}} \alpha^{t} \sum_{z'} p(z', t \mid z, u)\, J(z') \big]$   (5)

where $\tau_{\max}$ is finite, $K(z, u)$ is given and bounded for any $z$ and $u$, and $J$ is also bounded for all $z$. Here, $J(z)$ is the minimal expected discounted cost beginning in state $z$, and $K(z, u)$ is the immediate expected cost up to the next decision epoch for state $z$ and control action $u$. If the next decision epoch is $t$ periods later and occurs upon arrival at state $z'$, the minimal expected discounted cost (relative to that point in time) is $J(z')$, which, when discounted to the present time by a factor $\alpha^t$ and averaged over the joint distribution of $(z', t)$, gives the required result by minimizing over $u$ [7].
As in Section IV-A, we define the mapping $T$ for any bounded function $J$ and all $z$ by

$(TJ)(z) = \min_{u} \big[ K(z, u) + \textstyle\sum_{t=1}^{\tau_{\max}} \alpha^{t} \sum_{z'} p(z', t \mid z, u)\, J(z') \big].$   (6)

The following propositions, proven in the Appendix, provide the solution to the problem of finding the optimal control policy for the multiperiod decision epochs problem.

Proposition 1 (Convergence of the Algorithm): For any bounded function $J$ and all $z$, the optimal cost function satisfies $J^*(z) = \lim_{n\to\infty}(T^n J)(z)$.

Proposition 2 (Bellman's Optimality Equation): For all $z$, the optimal cost function satisfies $J^*(z) = (T J^*)(z)$. Equivalently, $J^* = T J^*$, and $J^*$ is the unique solution of this equation in the class of bounded functions.

Proposition 3 (Necessary and Sufficient Condition for Optimality): A stationary policy $\mu$ is optimal if and only if $\mu(z)$ attains the minimum in Bellman's optimality equation for each $z$.

We can implement the value iteration algorithm, where the optimal cost function can be found by sequentially computing $T^n J$ for any bounded initial cost function $J$. An optimal control policy is found when the iteration converges to the optimal value of the cost function, $J^*$.
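Under the simplifying assumptions used later in the paper (a deterministic epoch interval given the action, and an action set independent of the current state), this value iteration can be sketched as follows; the array shapes and names are illustrative assumptions.

```python
import numpy as np

def multiperiod_value_iteration(K, P_eff, tau, alpha, tol=1e-8, max_iter=10_000):
    """Value iteration for a discounted MDP with multiperiod decision epochs.

    K     : (A, N) expected discounted cost accumulated over the window that
            follows taking action a in state z at a decision epoch.
    P_eff : (A, N, N) effective epoch-to-epoch TPM for each action, i.e. the
            product of the per-stage TPMs over the corresponding window.
    tau   : (A,) deterministic number of time units until the next decision
            epoch for each action (1 for no intervention, D + R otherwise).
    alpha : per-stage discount factor in (0, 1).
    """
    tau = np.asarray(tau)
    disc = alpha ** tau                       # epoch-to-epoch discounting
    J = np.zeros(K.shape[1])
    for _ in range(max_iter):
        # Q[a, z] = K(z, a) + alpha^tau(a) * sum_z' P_eff[a, z, z'] J(z')
        Q = K + disc[:, None] * np.einsum('azy,y->az', P_eff, J)
        J_new = Q.min(axis=0)
        if np.max(np.abs(J_new - J)) < tol:
            J = J_new
            break
        J = J_new
    policy = Q.argmin(axis=0)                 # stationary policy, one action per state
    return J, policy
```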

In both cyclic and acyclic approaches to cancer therapy, practical considerations dictate that the main assumption regarding the time interval between successive decision epochs be changed: once an intervention has taken place, the next decision can only be made $W$ time units later. Within each window of length $W$, the network undergoes a state transition at each time step. The system first evolves under the influence of the intervention for a window of length $D$, followed by a recovery window of length $R$. Hence, $K(z, u)$ should be calculated for the entire window of length $W = D + R$. This requires considering and discounting the cost function at each stage and calculating the corresponding different state transition probability matrices, depending on the choice of control at the beginning of the window. Note that in (5) the set of possible control actions at each decision epoch may depend on the current state $z$; however, we can make a simplifying (and practical) assumption that the set of control actions is independent of the current state.

V. THERAPEUTIC METHODS WITH FIXED-LENGTH DURATION OF DRUG EFFECTIVENESS

By allowing the time intervals between successive decision epochs to depend on the decisions, we can formulate both the cyclic and acyclic therapeutic methods under different conditions as discounting problems with multiperiod decision epochs. For the cyclic case, the time intervals are fixed and identical for all decision epochs; for the acyclic case, the time interval depends on the choice of the control action and is fixed and identical for the same control action. Before determining an optimal policy for these problems, we need to discuss an important issue regarding the calculation of the transition probabilities $p(z', t \mid z, u)$. One can calculate them as in (6), which involves all intermediate transitions. We have assumed that, given the control action, the interval $t$ is deterministic: in the cyclic case, $t = W$; in the acyclic case, if $u = 0$ then $t = 1$, and if $u \neq 0$ then $t = W$. In each window of length $W$ we might have multiple transitions, each associated with some probability distribution, depending on the control action and the network transition phase. Since all the required TPMs can be found following the procedure explained in Section III, these probabilities can be readily calculated. Also, the immediate cost function when starting from state $z$ at the beginning of the window and going to state $z'$ at the end of the window is by itself an expectation, accumulated over the window with some discounting factor; calculating it that way requires normalizing all the intermediate transition probabilities, since the state transition from $z$ to $z'$ can happen through different intermediate paths. However, we will suggest a simplified approach for directly calculating $K(z, u)$.

First consider an intervention action $u \neq 0$. Without loss of generality, suppose we are at time 0 for both the cyclic and acyclic problems. As explained in Section III, the network can have three phase transitions: uncontrolled to controlled, controlled to controlled, and uncontrolled to uncontrolled, represented by the TPMs $P_{0u}$, $P_u$, and $P$, respectively. The number of transitions in the first phase is one, in the second phase is $D - 1$, and in the third phase is $R$. For all $t$, let $p^{(t)}_{zz'}(u)$ denote the probability of going from state $z$ to state $z'$ in $t$ steps, i.e., the $(z, z')$ entry of the product of the per-stage TPMs up to the $t$-step future state. Then the window transition probabilities are given by

$\tilde p_{zz'}(u) = \big[P_{0u}\, P_u^{\,D-1}\, P^{\,R}\big]_{zz'}.$   (7)

To calculate the corresponding probabilities for the null action in the acyclic problem, we need only set the window length to a single transition in (7), which reduces it to the original TPM $P$. We also define the immediate cost function $g(z', u)$, which is associated with only one transition of the network. We assume that $g$ is stationary and bounded for all states and actions. We postulate a structure for this function where the cost of being in any undesirable state is the same, as is the cost of being in any desirable state. In addition, we assume that the control cost is identical for all the states if there is any intervention in the network.
A possible cost satisfying these properties is

$g(z', u) = c\,\mathbf{1}\{u \neq 0\} + s\,\mathbf{1}\{z' \in \mathcal{U}\}$   (8)

where $\mathbf{1}\{\cdot\}$ equals 1 if its argument holds and 0 otherwise, $c$ is the cost of control, and $s$ is the cost of being in an undesirable state. The sets $\mathcal{D}$ and $\mathcal{U}$ in (8) correspond to the desirable and undesirable states in all constituent networks, respectively. This definition of the cost implies independence of the starting state, i.e., the cost depends only on the destination state and the action. Now, we are ready to formulate the functional equations for both cyclic and acyclic therapeutic methods when the drug effectiveness has a fixed-length duration.

A. Cyclic Problem

For the cyclic case, the decision epochs occur every $W$ periods (transitions), meaning that $t = W$ in (5). Also, we should take into consideration the drug effectiveness phase of length $D$ when any intervention happens in the network. The functional equation for all $z$ becomes

$J(z) = \min_{u} \big[ K(z, u) + \alpha^{W} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J(z') \big]$   (9)

where $\tilde p_{zz'}(u)$ is calculated from (7) and $K(z, u)$ is the expected immediate accumulated cost, discounted over the window of length $W$, when the starting state is $z$ and the control action is $u$. Note that, since we are only allowed to make decisions at the end of each period, time has increments of only length $W$. Hence, (9) can be rewritten as the recursion

$J_{k+1}(z) = \min_{u} \big[ K(z, u) + \alpha^{W} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J_k(z') \big]$   (10)

which has a similar structure to the recursive relationship obtained in [6]. This equation shows how the cost at the beginning of one window affects the cost at the beginning of the next window, making the recursion independent of the absolute time index. As mentioned earlier, $K(z, u)$ can be directly calculated by

$K(z, u) = \textstyle\sum_{z'} [P_{0u}]_{zz'}\, g(z', u) \;+\; \sum_{t=2}^{D} \alpha^{t-1} \sum_{z'} \big[P_{0u} P_u^{\,t-1}\big]_{zz'}\, g(z', u) \;+\; \sum_{t=D+1}^{W} \alpha^{t-1} \sum_{z'} \big[P_{0u} P_u^{\,D-1} P^{\,t-D}\big]_{zz'}\, g(z', 0).$   (11)

Equation (11) has three components. The first gives the expected immediate cost when the network transitions from state $z$ in an uncontrolled phase to a state in a controlled phase, when the decision epoch happens at time 0. The second is the sum of discounted expected immediate costs for the drug effectiveness phase, when the network is restricted by the choice of control at time 0; all possible transition paths from state $z$ at time 0 to any state at time $t$ are considered. The third component is the sum of discounted expected immediate costs for the recovery phase, when the network makes transitions according to its original TPM; in this phase no control cost is incurred. Similar to the second component, we enumerate all possible transition paths from state $z$ at time 0 to any state at the corresponding time. Using (7) and (8), (10) and (11) can be written out explicitly for all $z$; we refer to the resulting explicit recursion as (12) and to the corresponding cost-accumulation rule, by which $K(z, u)$ is computed, as (13). For any bounded function $J$ and all $z$, we define the mapping

$(TJ)(z) = \min_{u} \big[ K(z, u) + \alpha^{W} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J(z') \big]$   (14)

where $K(z, u)$ is calculated from (13). Using Propositions 1, 2, and 3, the optimal cost function $J^*$ can be iteratively determined by running the recursion $J_{n+1} = T J_n$ for any bounded initial cost function until a stopping criterion is met. When the iteration converges to $J^*$, the stationary optimal policy for all $z$ is given by

$\mu^*(z) = \arg\min_{u} \big[ K(z, u) + \alpha^{W} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J^*(z') \big].$   (15)

B. Acyclic Problem

The acyclic problem is different from the cyclic case in the sense that the length of the time interval between successive decision epochs differs between the null action and an intervention: if $u = 0$, then the length of the time interval is 1, but if $u \neq 0$, then it is $W = D + R$. To indicate this dependency, we define the interval $\tau(u)$ by $\tau(u) = 1$ if $u = 0$ and $\tau(u) = W$ if $u \neq 0$, and we define the window transition probabilities $\tilde p_{zz'}(u)$ and the window cost $K(z, u)$ accordingly, with the window length set to $\tau(u)$; these are the acyclic counterparts (17) and (18) of (7) and (11). Therefore, a functional equation similar to (9), but with an action-dependent interval, is written for all $z$ as

$J(z) = \min_{u} \big[ K(z, u) + \alpha^{\tau(u)} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J(z') \big]$   (16)

where $\tilde p_{zz'}(u)$ and $K(z, u)$ are calculated by (17) and (18). For any bounded function $J$ and all $z$, we define the mapping

$(TJ)(z) = \min_{u} \big[ K(z, u) + \alpha^{\tau(u)} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J(z') \big]$   (19)

where $K(z, u)$ is calculated from (18). Propositions 1, 2, and 3 provide a method to iteratively calculate the optimal cost function by running the recursion $J_{n+1} = T J_n$ for any bounded initial cost function until improvement stops. When the iteration converges to $J^*$, the stationary optimal policy for all $z$ is

$\mu^*(z) = \arg\min_{u} \big[ K(z, u) + \alpha^{\tau(u)} \textstyle\sum_{z'} \tilde p_{zz'}(u)\, J^*(z') \big].$   (20)
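A direct way to obtain the window quantities of (7) and (11) is to accumulate the per-stage matrix products from the left, one stage at a time. The sketch below assumes the discounting convention of (11) as reconstructed above (the cost of the $t$-th transition is weighted by $\alpha^{t-1}$); the names are illustrative.

```python
import numpy as np

def window_tpm_and_cost(P, Pu, P0u, g_ctrl, g_free, D, R, alpha):
    """Effective window TPM, cf. (7), and accumulated discounted window
    cost, cf. (11), for an intervention with effectiveness length D and
    recovery length R (window W = D + R).

    P      : (N, N) original TPM, followed during the recovery phase
    Pu     : (N, N) controlled TPM, followed while the drug is active
    P0u    : (N, N) one-stage uncontrolled-to-controlled transient TPM
    g_ctrl : (N,) cost of each destination state plus the control cost
    g_free : (N,) cost of each destination state with no control cost
    """
    M = P0u.copy()                 # distribution after the first transition
    K = M @ g_ctrl                 # t = 1 term of (11), weight alpha**0
    for t in range(2, D + 1):      # drug effectiveness phase, t = 2..D
        M = M @ Pu
        K += alpha ** (t - 1) * (M @ g_ctrl)
    for t in range(D + 1, D + R + 1):   # recovery phase, t = D+1..W
        M = M @ P
        K += alpha ** (t - 1) * (M @ g_free)
    return M, K                    # M is the W-step TPM of (7)
```

The pair returned for each action can then be stacked and passed to the multiperiod value iteration sketched in Section IV-B.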

C. Computational Complexity of the Procedure

The time efficiency of the proposed methods can be analyzed in terms of the computational complexity of deriving an optimal policy. For a fixed discount factor, it is shown in [11] that, using the ordinary (one-transition) value iteration algorithm, the optimal policy can be reached in a number of iterations polynomial in the size (bit length) of the inputs. Depending on the rate of convergence, one might take different approaches to reduce the computational complexity involved in every step of the value iteration algorithm. Since the acyclic problem has approximately similar (maybe lower) complexity than the cyclic case, let us focus only on the latter and do the calculations for any given PBN with $2^n$ states, fixed $W$, and a given set of control actions.

First consider $K$. It can be calculated beforehand and substituted directly into (14) at every iteration. We rewrite (11) in matrix form, so that the window cost is accumulated through matrix products; doing the matrix multiplication from the right (as matrix-vector products) gives a one-time computational cost. In the first approach, one can also calculate the window TPM of (7) beforehand and substitute the result directly into the value iteration algorithm at every iteration; ordinary matrix multiplication yields a one-time computational cost for calculating it. Having calculated these quantities, value iteration has the same time complexity as a regular dynamic programming algorithm at every iteration. In the second approach, one can write (14) in a matrix form similar to (11) and do the calculations from the right side, which changes the per-step time complexity of the value iteration algorithm. Depending on the convergence rate of value iteration, one elects to proceed using the approach for which the algorithm has a lower time complexity. In this paper, we choose the first approach, where all components are calculated outside the value iteration algorithm.

VI. SHIFT IN THE LONG-RUN PROBABILITIES OF UNDESIRABLE STATES

The effectiveness of any intervention strategy can be shown by comparing the long-run likelihood of visiting undesirable states for the network under control with that of the uncontrolled network. We anticipate that the control strategy that minimizes the expected total discounted cost also reduces the long-run probability of being in undesirable states by modifying the long-run behavior of the network. We measure the amount of shift in the aggregated probability of undesirable states before and after intervention by

$\Delta = \textstyle\sum_{z \in \mathcal{U}} \pi_z - \sum_{z \in \mathcal{U}} \hat\pi_z$   (21)

where $\pi_z$ and $\hat\pi_z$ are the occupation probabilities of being in state $z$ in the long run without and with the intervention policy, respectively. To calculate $\Delta$ we require both distributions. The former can be computed by finding the invariant distribution of the uncontrolled network governed by the original transition matrix. However, calculating the latter is not straightforward and needs more care, since the transition probabilities change within the treatment window depending on the implemented optimal control policy, for both the cyclic and acyclic cases. To this end, we construct a Markov chain with an augmented state space based on the original Markov chain, for which the transition probabilities are found based on the optimal control policy and the TPMs computed from Section III. The maximum number of states in the augmented state space is the number of original states multiplied by the window length, but, as will be discussed, the actual size of the augmented state space will be less because unnecessary states are removed.

For the cyclic case, there are three classes of states in the augmented state space (not to be confused with the communicating classes of Markov chains). The first class represents the original state space at the decision epochs, where each state is associated with its corresponding optimal control action found from (15). The second class consists of all possible states that the network can be in at any time within the drug effectiveness phase, no matter what control action is taken at the decision epoch; here we disregard any control actions that are not actually implemented in the optimal control policy. Once the control action ceases to be effective, the network can evolve freely with no restrictions.
Hence, we define the third class of states as those visited during the recovery phase. What remains to be explained are the transition probabilities between the states. The states at the decision epochs can only transition to the first stage of the drug effectiveness phase, according to the one-stage transient TPM associated with the action prescribed by the policy. The states within the drug effectiveness phase can only transition among themselves according to the controlled TPM until the effectiveness period ends, at which point they transition to the recovery-phase states. The states within the recovery phase can only transition among themselves according to the original TPM, and the states at the end of the recovery phase transition back to the decision-epoch states, again according to the original TPM. All the other transition probabilities are zero.

The acyclic case is different since, for the null control action, the decision epoch lasts only one time unit instead of $W$ time units. Therefore, we are able to remove even more states in the new Markov chain with the augmented state space. Similar to the cyclic case, we define three classes of states in the augmented state space.

The first class represents the original state space at each decision epoch, where each state is associated with an optimal control action given by (20). The second class is indexed by the set of unique intervention actions actually used by the optimal control policy; that is, we ignore any control actions that are not used in the optimal control policy, as well as the null control action. The difference between the definitions of the second class in the cyclic and acyclic cases is that we do not reserve any additional set of states for the null action in the acyclic case, since taking the no-intervention decision makes the network transition back to the set corresponding to the decision epochs. The third class consists of the recovery-phase states. The transition rules mirror the cyclic case: decision-epoch states under an intervention transition to the drug effectiveness states via the one-stage transient TPM, while under the null action they transition back to decision-epoch states according to the original TPM; effectiveness-phase states transition among themselves according to the controlled TPM and then into the recovery phase; recovery-phase states evolve according to the original TPM and, at the end of the recovery period, transition back to the decision-epoch states. Similar to the cyclic case, all the remaining transition probabilities are zero.

The augmented Markov chain, for both the cyclic and acyclic cases, is designed so that every state communicates with every other state. Irreducibility of the new Markov chain follows since the original Markov chain is irreducible. Hence, it possesses unique occupation probabilities for each state. We define the set of undesirable states in the new Markov chain as those augmented states whose GAP component is undesirable. Using the invariant distribution $\tilde\pi$ of the new Markov chain and the definition of its undesirable set $\tilde{\mathcal{U}}$, we rewrite (21) as

$\Delta = \textstyle\sum_{z \in \mathcal{U}} \pi_z - \sum_{\tilde z \in \tilde{\mathcal{U}}} \tilde\pi_{\tilde z}.$   (22)

We will use this equation to calculate the amount of shift in the long-run probability distribution of undesirable states. Note that the calculation method should be slightly modified when we have special cases, such as a one-step effectiveness period or no recovery period, and a few others.
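The shift computation of (21) and (22) only requires the two invariant distributions. A minimal sketch, assuming irreducible chains and boolean masks marking the undesirable (augmented) states:

```python
import numpy as np

def stationary_distribution(P):
    """Invariant distribution of an irreducible row-stochastic matrix,
    taken as the left eigenvector associated with eigenvalue 1."""
    vals, vecs = np.linalg.eig(P.T)
    pi = np.real(vecs[:, np.argmax(np.real(vals))])
    pi = np.abs(pi)
    return pi / pi.sum()

def undesirable_shift(P_orig, P_aug, undesirable, undesirable_aug):
    """Shift (22) in the aggregated long-run probability of undesirable
    states: mass before intervention minus mass after intervention.

    P_orig          : TPM of the uncontrolled network
    P_aug           : TPM of the augmented chain under the optimal policy
    undesirable     : boolean mask over the original states
    undesirable_aug : boolean mask over the augmented states (true where
                      the GAP component of the augmented state is undesirable)
    """
    pi_before = stationary_distribution(P_orig)
    pi_after = stationary_distribution(P_aug)
    return pi_before[undesirable].sum() - pi_after[undesirable_aug].sum()
```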

VII. RESULTS AND DISCUSSION

A. Synthetic Networks

For the simulation, we ignore explicit knowledge of the context and consider only instantaneously random PBNs consisting of seven genes (the collapsed state space has $2^7 = 128$ states) with various properties. This will be illustrative enough to show the general trend of the results. Each PBN consists of four constituent BNs with a maximum connectivity of two or three, where the connectivity corresponds to the maximum number of predictors for each Boolean function. The bias of each PBN, defined to be the probability that each constituent Boolean function takes on the value 1, is taken randomly from a beta distribution with a fixed mean of 0.1 or 0.2 and a fixed small variance. Changing the bias and connectivity affects the dynamical properties of randomly generated BNs [12]; hence, we can study their effects on the performance of the different control policies. The gene perturbation probability is set to a small fixed value. For any given PBN, the TPM of the corresponding Markov chain is computed. We assume that the target and control genes are chosen to be the least and most significant bits in the binary representation of states, respectively, where down-regulation of the target gene is desirable. To investigate the effect of the cost of control on the control policies, we fix the cost $s$ of undesirable states and vary the cost $c$ of control in (8). In practice, the actual values would have to be assigned by a physician according to his or her understanding of the disease. In short, we will have four different classes of random PBNs with three different cost functions, yielding 12 sets of simulations in total.

To show the performance of the methods proposed in this paper on each set of simulations, we find the optimal policy using the value iteration algorithm for synthetic networks based on five intervention strategies: (S1) acyclic $W$-transition with a drug effectiveness period; (S2) acyclic without a drug effectiveness period; (S3) cyclic $W$-transition with a drug effectiveness period; (S4) cyclic without a drug effectiveness period; and (S5) one-transition. Optimal policies derived from strategies S1, S2, and S5 are applied to the PBN, where the system itself has an acyclic $W$-transition structure with drug effectiveness. A similar approach is taken for strategies S3, S4, and S5, but with a system that has a cyclic $W$-transition structure with drug effectiveness. It should be noted that only strategies S1 and S3 are optimal, respectively, for the acyclic and cyclic systems with fixed-length duration of drug effectiveness. The full set of simulation results is presented in the supplementary materials.

For each PBN in a set of simulations, we produce synthetic time-course data for 1000 time-steps with the different therapeutic strategies implemented. The total discounted cost is estimated by accumulating the discounted cost of the destination state given the policy at the origin state. We repeat this a large number of times with random initial states and compute the average of the total discounted cost. The procedure is repeated for 512 random PBNs from each set of simulations. The amount of shift in the aggregated probability of undesirable states for each PBN is calculated from (22). This yields 512 values for the average total discounted cost and for the shift. We graph the average total discounted cost and the average shift in the aggregated probability of undesirable states across these 512 networks for all six therapeutic methods, in addition to the no-control case.
We also show the histograms of the difference between the results of these methods, across all the generated networks, for both the average total discounted costs and the shifts. This indicates how often and to what degree the optimal strategies work.

In the first set of experiments, the constituent BNs have a maximum connectivity of 3 and the bias value of the randomly generated PBNs is set to 0.2; the costs of control (intervention) and of undesirable states are fixed. The average total discounted cost and average shift for the six different therapeutic methods, as well as for the case where no control is applied to the network, are shown in Fig. 2(a) and (b), respectively. In these figures we fix the length of the treatment window, $W$, to 5 and vary $D$ from 1 to 5. Note that we do not include the amount of the shift for the no-control scenario, since it is zero.

Fig. 2. Performance comparison of different therapeutic methods across 512 random PBNs. The bias mean is 0.2 and the maximum connectivity is 3. (a) Average total discounted cost. (b) Average shift. (c) Histogram of the difference between the average total discounted cost for acyclic methods. (d) Histogram of the difference between the average total discounted cost for cyclic methods.

The histograms of the difference between the average total discounted cost and shift of the different therapeutic methods are also shown in Fig. 2(c) for the acyclic methods and Fig. 2(d) for the cyclic methods, where $W$ and $D$ are set to 5 and 4, respectively. On average, the optimal strategies have a better outcome, in terms of lower discounted cost and higher shift, compared to the suboptimal ones.
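The Monte Carlo estimate described above can be sketched as follows for the acyclic structure; the dictionaries of controlled and transient TPMs, the per-state cost vector, and the control cost are illustrative assumptions, and each row of every TPM is assumed to sum to one.

```python
import numpy as np

def estimate_cost(P, Pu_dict, P0u_dict, policy, g, c_ctrl, D, R, alpha,
                  T=1000, n_runs=1000, seed=0):
    """Monte Carlo estimate of the average total discounted cost under an
    acyclic epoch policy: a decision is made at every step unless a drug is
    active, in which case the next decision comes D + R steps later.

    P        : (N, N) original TPM (rows sum to one)
    Pu_dict  : {action: controlled TPM} for the non-null actions
    P0u_dict : {action: one-stage transient TPM} for the non-null actions
    policy   : (N,) action per state, with 0 denoting no intervention
    g        : (N,) cost of being in each state
    c_ctrl   : additional cost per stage while a control action is in force
    """
    rng = np.random.default_rng(seed)
    N = P.shape[0]
    total = 0.0
    for _ in range(n_runs):
        z = rng.integers(N)            # random initial state
        cost, t = 0.0, 0
        while t < T:
            u = policy[z]
            if u == 0:                 # free evolution; next decision at t+1
                z = rng.choice(N, p=P[z])
                cost += alpha ** t * g[z]
                t += 1
            else:                      # drug active for D steps, then R recovery
                steps = [P0u_dict[u]] + [Pu_dict[u]] * (D - 1) + [P] * R
                for M in steps:
                    z = rng.choice(N, p=M[z])
                    cost += alpha ** t * (g[z] + (c_ctrl if M is not P else 0.0))
                    t += 1
        total += cost
    return total / n_runs
```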

We observe that the performance of $W$-transition strategies with and without considering the drug effectiveness period, in both the acyclic (strategies S1 and S2) and cyclic (strategies S3 and S4) cases, is very close. However, it can be seen from the histograms that these two strategies occasionally produce different results. Because the optimal policies minimize the expected total discounted cost, it is possible to observe cases in which strategies S2 and S4 perform more efficiently than strategies S1 and S3, respectively, although this rarely happens. It is also evident that the acyclic cases consistently show better results than their cyclic counterparts. This behavior might be because the acyclic method introduces more flexibility into the optimization problem.

Intuitively, we might think that by increasing $D$, or by applying more control, the cost and the amount of shift can both be improved. In the long run, less treatment is applied for larger $R$ and smaller $D$; consequently, more cost and less shift result [6]. Fig. 3(a) and (b) demonstrate that the performance of the optimal therapeutic method corresponding to acyclic $W$-transition with a drug effectiveness period degrades as $R$ is increased and $D$ is decreased. We observe similar trends for other methods when treatments are further apart in time with a shorter period for the drug duration of action.

Fig. 3. Performance of the acyclic $W$-transition policy with a drug effectiveness period across 512 random PBNs. The bias mean is 0.2 and the maximum connectivity is 3. (a) Average total discounted cost. (b) Average shift.

Tables I and II provide some statistics on the running times (in milliseconds) on an Intel Core i7-920 CPU with a 2.66 GHz clock and 12 GB of RAM. The simulation parameters and intervention strategies are similar to the first set of experiments. For a fixed number of genes, we generate 100 different instantaneously random PBNs, calculate the CPU times, and report the average. Two numbers are provided in each cell of the table for a specific intervention strategy and a given window length: the first number is the time taken to calculate the window costs and TPMs, and the second one indicates the time taken for the value iteration algorithm to converge.

In the second set of experiments, the maximum connectivity of the constituent BNs is 3, the mean bias is 0.1, and the cost of control is 0.1. Similar to the previous experiment, we observe that the optimal strategies perform better on average, inducing lower discounted cost and higher shift compared to the suboptimal ones. Results of this experiment are shown in Fig. 4(a) and (b) for fixed $W$ and variable $D$. The histograms of the difference between the costs and shifts of the different therapeutic methods are presented in Fig. 4(c) and (d). The full set of results for this experiment can be found in the supplementary materials.

Fig. 4. Performance comparison of different therapeutic methods across 512 random PBNs. The bias mean is 0.1 and the maximum connectivity is 3. (a) Average total discounted cost. (b) Average shift. (c) Histogram of the difference between the average total discounted cost for acyclic methods. (d) Histogram of the difference between the average total discounted cost for cyclic methods.

B. Mammalian Cell-Cycle Network

We construct a probabilistic version of the Boolean networks for mammalian cell-cycle regulation derived and analyzed by Faure et al. [13]. The Boolean functions represent the main features of the wild-type biological system, as well as the consequences of several types of mutations. Three key genes in this model are Cyclin D (CycD), retinoblastoma (Rb), and p27.

TABLE I. COMPUTATION TIMES (IN MILLISECONDS) OF THE DYNAMIC PROGRAMMING ALGORITHM FOR PBNS WITH SIX GENES.

TABLE II. COMPUTATION TIMES (IN MILLISECONDS) OF THE DYNAMIC PROGRAMMING ALGORITHM FOR PBNS WITH SEVEN GENES.

TABLE III. BOOLEAN NETWORK OF A MUTATED MAMMALIAN CELL CYCLE.

Mammalian cell division is coordinated with the overall growth of the organism through extracellular signals that control the activation of CycD in the cell. We follow a mutation proposed in [14], where p27 is assumed to be mutated and can never be activated (always OFF). This mutation introduces a situation where both CycD and Rb might be inactive, in which case the cell cycles in the absence of any growth factor [14]. This suggests considering the logical states in which both Rb and CycD are down-regulated as undesirable states when p27 is mutated. Table III presents the Boolean functions of the mutated cell-cycle network.

We use the Boolean functions of the mutated cell-cycle network to construct an instantaneously random PBN model. Depending on the state of the extracellular signal, which determines CycD being ON or OFF, we obtain two constituent Boolean networks, and we assume that the two Boolean networks are equally likely. The corresponding instantaneously random PBN consists of nine genes ($2^9 = 512$ states) in the following order from the most significant bit to the least significant bit in the binary representation: CycD, Rb, E2F, CycE, CycA, Cdc20, Cdh1, UbcH10, and CycB. This specific ordering of genes is only for the sake of presentation. We also assume that the gene perturbation probability is 0.01, independent of the network switching probability distribution. Since cell growth in the absence of growth factors is undesirable, we define the undesirable states as those for which both CycD and Rb are down-regulated. We choose E2F as the control gene [15]. The cost function is defined by (8).

We find the optimal policy for the mutated cell-cycle network based on the same five intervention strategies applied to the synthetic networks. We apply policies derived from strategies S1, S2, and S5 to the PBN assuming the system has an acyclic $W$-transition structure with drug effectiveness. We apply policies derived from strategies S3, S4, and S5 to the PBN with a system that has a cyclic $W$-transition structure with drug effectiveness. Given an initial condition, the total discounted cost is estimated by accumulating the discounted cost over 1000 time-steps from the time-course data generated while the different therapeutic strategies are implemented. We repeat this procedure for random initial states and compute the average of the induced total discounted cost. The amount of shift in the aggregated probability of undesirable states for the PBN is calculated from (22) for the different strategies. Note that since we have only one network, the amount of shift is not averaged. We graph the results of the six therapeutic methods besides the no-control case. The average total discounted cost and the shift in the aggregated probability of undesirable states are shown in Fig. 5(a) and (b), respectively. Similar results for a different choice of window parameters are depicted in Fig. 5(c) and (d).

Fig. 5. Performance comparison of different therapeutic methods based on the average total discounted cost and the shift in the aggregated probability of undesirable states of the mutated cell-cycle network. (a) Average total discounted cost. (b) Shift. (c) Average total discounted cost. (d) Shift.

The acyclic methods show almost the same performance, although they perform consistently better than the cyclic methods. The difference between the optimal and suboptimal cyclic strategies is noticeable. To investigate the effect of different $D$ and $R$ at the same time, we also present the simulation results of one therapeutic method, acyclic $W$-transition with a drug effectiveness period, in Fig. 6(a) and (b).
Similar to the synthetic network simulations, more cost and less shift are induced in the long run for larger $R$ and smaller $D$. The full set of simulation results on the mutated cell-cycle network for all six therapeutic methods can be found in the supplementary materials.

Fig. 6. Performance of a therapeutic method (acyclic $W$-transition with a drug effectiveness period) based on the discounted cost and average shift in the aggregated probability of undesirable states of the mutated cell-cycle network. (a) Average total discounted cost. (b) Shift.

VIII. CONCLUSION

In this paper we have derived optimal control policies for cyclic and acyclic interventions with fixed duration of treatment effectiveness in the context of gene regulatory networks.

This has been accomplished by taking a unified approach in the framework of multiperiod decision epochs with infinite-horizon discounting cost. This paper continues the effort to develop approaches to external intervention in gene regulatory networks that take into account practical issues arising from therapeutic constraints, biological complexity, and computational limitations. Besides the cyclic window constraint of [6], methods have been proposed to constrain the intervention so as to limit the expected number of interventions [15], [16]; biological complexity has motivated robust intervention policies to overcome model uncertainty [17], [18] and intervention in asynchronous gene regulatory networks [14]; and various efforts have been made to mitigate the computational burden [19]-[23].

APPENDIX

In this appendix we present the proofs of the statements made in Section IV-B. The system equation is defined as in Section IV-B, and the immediate cost function over the multiperiod interval, incurred when a transition from one state under a given control to another state is made in subsequent periods, is denoted by $K$ and assumed bounded. Rewrite the mapping $T$ for any bounded function $J$ and all $z$ as the minimization, over the admissible controls, of the immediate cost plus the discounted expected cost at the next decision epoch, where the expectation is evaluated with respect to the joint distribution of the next state and the interval length.

Proposition 1 (Convergence of the Algorithm): For any bounded function $J$ and all $z$, the optimal cost function satisfies $J^*(z) = \lim_{n\to\infty}(T^n J)(z)$.

Proof: Without loss of generality, assume $J \equiv 0$. For every positive integer $N$, initial state $z_0$, and policy $\pi$, the expected total discounted cost can be broken down into the portions incurred over the first $N$ decision epochs and over the remaining decision epochs as

$J_\pi(z_0) = E\big\{\textstyle\sum_{k=0}^{N-1} \alpha^{t_k} K(z_k, \mu_k(z_k))\big\} + E\big\{\textstyle\sum_{k=N}^{\infty} \alpha^{t_k} K(z_k, \mu_k(z_k))\big\}$   (23)

where $K$ is defined in (4). Since $K$ is bounded and at least one time unit elapses between consecutive decision epochs, the magnitude of the tail term is bounded by a constant multiple of $\alpha^{N}$; hence we can write two-sided bounds (24) on $J_\pi(z_0)$ in terms of the first $N$ epochs.

[Fig. 6: Performance of one therapeutic method (the acyclic scheme with a fixed drug-effectiveness period) based on the average total discounted cost and the shift in the aggregated probability of undesirable states of the mutated cell-cycle network. (a) Average total discounted cost. (b) Shift.]

Lemma 1: For any bounded functions J and J' such that J(x) \leq J'(x) for all x, we have (TJ)(x) \leq (TJ')(x) for all x.

Proof: (T^k J)(x) is the optimal cost function of the problem with k periods to go and terminal cost function J. The k-periods-to-go cost function increases uniformly as the terminal cost function increases uniformly [4].

Lemma 2: Suppose \lambda^{\tau} \leq \lambda for all \tau \geq 1. For any bounded function J and every scalar r \geq 0, we have, for all x and every positive integer k,

(T^k (J + r))(x) \leq (T^k J)(x) + \lambda^k r,

where J + r denotes the function J shifted by the constant r; the analogous lower bound holds for J - r.

Proof: Use induction [4].

Proposition 2 (Bellman's Optimality Equation): For all x, the optimal cost function satisfies

J^*(x) = \min_{u \in U(x)} E[ \tilde{g}(x, u, y, \tau) + \lambda^{\tau} J^*(y) ].

Equivalently, J^* = T J^*. Moreover, J^* is the unique solution of this equation in the class of bounded functions.

Proof: Without loss of generality assume that J is the zero function, i.e., J(x) = 0 for all x. Then from (25) we have, for every K,

(T^K J)(x) - \frac{\lambda^K}{1-\lambda} C \leq J^*(x) \leq (T^K J)(x) + \frac{\lambda^K}{1-\lambda} C.

Now, apply the mapping T to the above relation once. Using Lemmas 1 and 2, we can write

(T^{K+1} J)(x) - \frac{\lambda^{K+1}}{1-\lambda} C \leq (T J^*)(x) \leq (T^{K+1} J)(x) + \frac{\lambda^{K+1}}{1-\lambda} C,

where in the second inequality we used the fact that \lambda^{\tau} \leq \lambda for all \tau \geq 1. Since (T^{K+1} J)(x) converges to J^*(x) based on Proposition 1 as K \to \infty, we obtain J^* = T J^*. If J' is bounded and satisfies J' = T J', then J'(x) = \lim_{K \to \infty} (T^K J')(x), and Proposition 1 gives J' = J^*, which proves the uniqueness of the solution [4].

Corollary 1: For every stationary policy \mu and all x, the associated cost function satisfies J_\mu(x) = \lim_{k \to \infty} (T_\mu^k J)(x), where the mapping T_\mu is defined for any bounded function J and any control function \mu as

(T_\mu J)(x) = E[ \tilde{g}(x, \mu(x), y, \tau) + \lambda^{\tau} J(y) ], for all x.

Proof: Simply assume that the control constraint set contains only one element for each state, corresponding to \mu, and apply Proposition 1 [4].
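Corollary 1 likewise yields a computational routine: iterating T_\mu evaluates the cost of a fixed stationary policy. Below is a minimal sketch under the same assumed P/G/lam layout as the value-iteration sketch above; mu, mapping each state to a control, is also a stand-in.

import numpy as np

def evaluate_stationary_policy(mu, P, G, lam, n_states, tol=1e-8):
    # Corollary 1: iterating T_mu from any bounded J converges to J_mu.
    J = np.zeros(n_states)
    while True:
        J_new = np.empty(n_states)
        for x in range(n_states):
            u = mu[x]
            J_new[x] = sum(prob * (G[u][x][(y, tau)] + lam ** tau * J[y])
                           for (y, tau, prob) in P[u][x])
        if np.max(np.abs(J_new - J)) < tol:
            return J_new
        J = J_new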

Corollary 2: For every stationary policy \mu, the associated cost function satisfies

J_\mu(x) = (T_\mu J_\mu)(x), for all x.

Equivalently, J_\mu = T_\mu J_\mu, and J_\mu is the unique solution of this equation within the class of bounded functions.

Proof: As in the proof of Corollary 1, assume that the control constraint set contains only one element for each state, apply the result of Proposition 1, and use the reasoning behind the proof of Proposition 2 [4].

Proposition 3 (Necessary and Sufficient Condition for Optimality): A stationary policy \mu is optimal if and only if \mu(x) attains the minimum in Bellman's optimality equation for each x, i.e., T J^* = T_\mu J^*.

Proof: If T J^* = T_\mu J^*, then since J^* = T J^* we have J^* = T_\mu J^*, and by the uniqueness in Corollary 2 we obtain J_\mu = J^*; hence \mu is optimal. Conversely, if the stationary policy \mu is optimal, we have J_\mu = J^*, and therefore from Corollary 2, J^* = T_\mu J^*. This, together with Bellman's optimality equation, yields T J^* = T_\mu J^* [4].
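Once J^* has been computed, Proposition 3 says an optimal stationary policy can be read off greedily from Bellman's equation. A short sketch, again under the assumed P/G/lam layout used in the earlier sketches:

def greedy_policy(J_star, P, G, lam, n_states):
    # Proposition 3: a stationary policy is optimal iff it attains the
    # minimum in Bellman's optimality equation at every state.
    mu = {}
    for x in range(n_states):
        mu[x] = min(P, key=lambda u: sum(
            prob * (G[u][x][(y, tau)] + lam ** tau * J_star[y])
            for (y, tau, prob) in P[u][x]))
    return mu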
REFERENCES

[1] R. Simon and L. Norton, "The Norton-Simon hypothesis: Designing more effective and less toxic chemotherapeutic regimens," Nature Clin. Practice Oncol., vol. 3, no. 8, 2006.
[2] L. S. Goodman, L. L. Brunton, and B. Chabner, Goodman & Gilman's The Pharmacological Basis of Therapeutics. New York: McGraw-Hill Medical, 2010.
[3] I. Shmulevich, E. R. Dougherty, S. Kim, and W. Zhang, "Probabilistic Boolean networks: A rule-based uncertainty model for gene regulatory networks," Bioinformatics, vol. 18, no. 2, 2002.
[4] D. P. Bertsekas, Dynamic Programming and Optimal Control. Belmont, MA: Athena Scientific, 2007.
[5] R. Pal, A. Datta, and E. R. Dougherty, "Optimal infinite-horizon control for probabilistic Boolean networks," IEEE Trans. Signal Process., vol. 54, no. 6, Jun. 2006.
[6] G. Vahedi, B. Faryabi, J.-F. Chamberland, A. Datta, and E. R. Dougherty, "Optimal intervention strategies for cyclic therapeutic methods," IEEE Trans. Biomed. Eng., vol. 56, no. 2, Feb. 2009.
[7] D. J. White, Finite Dynamic Programming: An Approach to Finite Markov Decision Processes. New York: Wiley-Interscience.
[8] B. Faryabi, G. Vahedi, J.-F. Chamberland, A. Datta, and E. R. Dougherty, "Intervention in context-sensitive probabilistic Boolean networks revisited," EURASIP J. Bioinformat. Syst. Biol., vol. 2009, pp. 5:1-5:13, Jan. 2009.
[9] M. Brun, E. R. Dougherty, and I. Shmulevich, "Steady-state probabilities for attractors in probabilistic Boolean networks," Signal Process., vol. 85, Oct. 2005.
[10] R. Pal, A. Datta, M. L. Bittner, and E. R. Dougherty, "Intervention in context-sensitive probabilistic Boolean networks," Bioinformatics, vol. 21, no. 7, 2005.
[11] M. L. Littman, T. L. Dean, and L. P. Kaelbling, "On the complexity of solving Markov decision problems," in Proc. 11th Conf. Uncertainty in Artif. Intell. (UAI '95), 1995.
[12] I. Shmulevich and E. R. Dougherty, Genomic Signal Processing. Princeton, NJ: Princeton Univ. Press, 2007.
[13] A. Faure, A. Naldi, C. Chaouiya, and D. Thieffry, "Dynamical analysis of a generic Boolean model for the control of the mammalian cell cycle," Bioinformatics, vol. 22, no. 14, pp. e124-e131, 2006.
[14] B. Faryabi, J.-F. Chamberland, G. Vahedi, A. Datta, and E. R. Dougherty, "Optimal intervention in asynchronous genetic regulatory networks," IEEE J. Sel. Topics Signal Process., vol. 2, no. 3, Jun. 2008.
[15] B. Faryabi, G. Vahedi, J.-F. Chamberland, A. Datta, and E. R. Dougherty, "Optimal constrained stationary intervention in gene regulatory networks," EURASIP J. Bioinformat. Syst. Biol., vol. 2008, 10 pages, 2008.
[16] W.-K. Ching, S.-Q. Zhang, Y. Jiao, T. Akutsu, N.-K. Tsing, and A. S. Wong, "Optimal control policy for probabilistic Boolean networks with hard constraints," IET Syst. Biol., vol. 3, no. 2, Mar. 2009.
[17] R. Pal, A. Datta, and E. R. Dougherty, "Robust intervention in probabilistic Boolean networks," IEEE Trans. Signal Process., vol. 56, no. 3, Mar. 2008.
[18] R. Pal, A. Datta, and E. R. Dougherty, "Bayesian robustness in the control of gene regulatory networks," IEEE Trans. Signal Process., vol. 57, no. 9, Sep. 2009.
[19] M. K. Ng, S.-Q. Zhang, W.-K. Ching, and T. Akutsu, "A control model for Markovian genetic regulatory networks," Trans. Computat. Syst. Biol. V, vol. 4070, 2006.
[20] T. Akutsu, M. Hayashida, W.-K. Ching, and M. K. Ng, "Control of Boolean networks: Hardness results and algorithms for tree structured networks," J. Theoret. Biol., vol. 244, no. 4, 2007.
[21] B. Faryabi, A. Datta, and E. R. Dougherty, "On approximate stochastic control in genetic regulatory networks," IET Syst. Biol., vol. 1, no. 6, Nov. 2007.
[22] G. Vahedi, B. Faryabi, J.-F. Chamberland, A. Datta, and E. R. Dougherty, "Intervention in gene regulatory networks via a stationary mean-first-passage-time control policy," IEEE Trans. Biomed. Eng., vol. 55, no. 10, Oct. 2008.
[23] I. Ivanov, P. Simeonov, N. Ghaffari, X. Qian, and E. R. Dougherty, "Selection policy-induced reduction mappings for Boolean networks," IEEE Trans. Signal Process., vol. 58, no. 9, Sep. 2010.

Mohammadmahdi R. Yousefi (S'12) received the B.S. and M.S. degrees in electrical engineering from Sahand University of Technology, Tabriz, Iran, and the University of Tehran, Tehran, Iran, respectively. From 2005 to 2008, he was a member of the Control and Intelligent Processing Center of Excellence (CIPCE), Tehran. Since 2008, he has been working toward the Ph.D. degree in electrical engineering at Texas A&M University, College Station. His current research interests include optimal control of stochastic systems, intervention in genetic regulatory networks, small-sample classification, and error estimation.

Aniruddha Datta (F'09) received the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles. He is a Professor and holder of the J. W. Runyon, Jr. '35 Professorship II in the Department of Electrical and Computer Engineering, Texas A&M University, College Station. His areas of interest include adaptive control, robust control, PID control, and genomic signal processing. He has authored or coauthored five books and over 100 journal and conference papers on these topics.

Edward R. Dougherty (F'11) received the Ph.D. degree in mathematics from Rutgers University, New Brunswick, NJ, and has been awarded the Doctor Honoris Causa by the Tampere University of Technology, Finland. He is a Professor in the Department of Electrical and Computer Engineering, Texas A&M University, College Station, where he holds the Robert M. Kennedy '26 Chair in Electrical Engineering and is Director of the Genomic Signal Processing Laboratory. He is also co-director of the Computational Biology Division of the Translational Genomics Research Institute, Phoenix, AZ. Dr. Dougherty is a Fellow of SPIE and has received the SPIE President's Award.
