A Computer Scientist's Guide to the Regulatory Genome


Fundamenta Informaticae 103 (2010) DOI /FI IOS Press

Bartek Wilczyński
Institute of Informatics, Warsaw University, Banacha 2, Warsaw, Poland
and European Molecular Biology Laboratory, Meyerhofstrasse 1, Heidelberg, Germany
bartek@mimuw.edu.pl

Torgeir R. Hvidsten
Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden
and Computational Life Science Cluster (CLiC), Umeå University, Umeå, Sweden
torgeir.hvidsten@plantphys.umu.se

Abstract. Recent years have seen a wealth of computational methods applied to problems stemming from molecular biology. In particular, with the completion of many new full genome sequences, great advances have been made in studying the role of non-protein-coding parts of the genome, reshaping our understanding of the role of DNA sequences. Recent breakthroughs in experimental technologies that allow us to inspect the innards of cells on a genomic scale have provided us with unprecedented amounts of data, posing new computational challenges for scientists working to uncover the secrets of life. Due to the binary-like nature of the DNA code and the switch-like behavior of many regulatory mechanisms, many of the questions currently in focus in biology are surprisingly closely related to problems that have been of long-term interest to computer scientists. In this review, we present a glimpse into the current state of research in computational methods applied to modeling the regulatory genome. Our aim is to cover current approaches to selected problems from molecular biology that we consider most interesting from the perspective of computer scientists, and to highlight new challenges that will most likely draw the attention of computational biologists in the coming years.
Keywords: computational biology, gene regulation, DNA motifs, regulatory elements

Address for correspondence: Umeå Plant Science Centre, Department of Plant Physiology, Umeå University, Umeå, Sweden

B. Wilczyński and T.R. Hvidsten / Guide to the Regulatory Genome

1. Introduction

Current research in molecular biology continues to provide inspiration for quantitative scientists. The volumes of data flowing from the work of an ever-growing army of experimental biologists can be overwhelming for even the fastest computers available, making it necessary to use state-of-the-art computational methods and develop new algorithms for data analysis. While statistical data analysis is the most important approach in most cases, there is one field of molecular biology that seems to be particularly interesting for computer scientists. We are referring to the field of regulatory genomics, which studies the architecture and function of elements of non-protein-coding DNA sequences involved in the regulation of gene expression.

Since this review is addressed to computer scientists, we shall first recall some relevant basic facts from molecular biology. We will focus our attention on the genome, which is the total content of the DNA sequences in a given species. Genes are the parts of the genomic sequence that encode proteins (i.e. protein-coding DNA), the building blocks of a living cell. While DNA molecules are sequences written in a four-letter alphabet (four nucleotides), proteins are sequences written in a 20-letter alphabet (20 amino acids). Simplistically speaking, fragments of DNA (i.e. genes) are directly translated into corresponding protein sequences by interpreting each nucleotide triplet as one specific amino acid. While the structure of DNA is relatively homogeneous (i.e. the famous double-stranded helix), the structure of a protein is a direct result of the size, shape and chemical properties of its amino acids and varies enormously between different proteins. It is the structure of proteins that in turn determines their functions, much like the shapes of different tools in a mechanical workshop determine their possible uses.
And it is this direct path from DNA sequence to protein sequence to protein structure, and ultimately to protein function, that implements the classic effects of genetic events such as mutations and cross-overs on the fitness of organisms, and thus drives the evolution of life. However, while genes and proteins are very important for the possible functions of a cell, they are surprisingly well conserved through the evolution of different species. For example, the protein catalog of humans is up to 99% identical to that of chimpanzees [22]. Nonetheless, we can clearly see the difference between individuals of the two species. These differences originate to a large degree from the non-protein-coding parts of the genome, which are also affected by mutations and cross-overs, and which contain regulatory sequences determining the timing and scale of gene transcription (DNA → RNA) and translation (RNA → protein). The process by which a gene is used to produce a protein is referred to as gene expression.

The function of the non-coding sequences is mediated by specific interactions between particular classes of proteins and DNA motifs (i.e. relatively short words written in the four-letter alphabet of DNA, e.g. TGAT). Some proteins, most notably transcription factors, possess the ability to bind DNA motifs (often referred to as binding sites) and through this binding affect the process of transcription in a localized fashion. Such binding events usually depend on specific sequence motifs being present in the DNA sequence near the gene subject to regulation (the so-called promoter region). In a simplistic, switch-like interpretation of the regulatory genome, transcription factors bind DNA motifs in the non-coding parts of the genome to turn protein-coding genes on or off, depending on whether the corresponding proteins are needed in the cell or not.
If we accept this simplified description of the aforementioned biological processes, we can draw some analogies between the components of regulatory systems and computer systems. In particular, we can think of the protein world as hardware. It contains fixed components based on a very slowly evolving DNA blueprint. It also includes peripheral devices for communication: signal receptors for input and secretory pathways for output. In the same metaphorical way, non-coding sequences can be considered similar to software. They are modular and fast-evolving, and they carry information on how and when the protein hardware should be utilized and how it should respond to external stimuli.

The aim of this review is to spark interest in regulatory genomics among computer scientists. We provide the readers with a few, hopefully interesting, applications of concepts that are well known in computer science and that are useful for solving real problems of modern molecular biology. Towards the end, we provide an overview of currently open problems and directions that we believe will be of interest to computer scientists attracted to the study of biological phenomena connected to the process of gene regulation.

2. Fishing for Informative Bits: Sequence Motifs

Transcription factors (TFs) are regulatory proteins that play a key role in regulating transcription: they bind DNA at specific sites and direct the transcription initiation machinery to the required locations. TFs thus determine which genes are going to be transcribed at which time [36]. In this way, TFs can be considered the machines for reading and executing the regulatory code of the genome. In order to fully understand their function, one needs to be able to accurately describe their DNA-reading abilities: namely, one needs to know which sequences can be recognized by a given TF and whether there is any difference in the affinity of the TF for different sequences. In bacteria, this can be done quite accurately, because there are usually only a few TFs in a given bacterium and the sequences they recognize are long (> 16 symbols) and strongly conserved both between species and between different binding sites in the same species (not more than 1 error per 8 symbols).
In eukaryotes, however, the number of different TFs can be very high (there are an estimated > 2048 TFs in the human genome) and the sequence motifs they recognize can be very short and degenerate (e.g. the motif for the TF called activator protein (AP) is so degenerate that its recognition site occurs by chance about once every 256 nucleotides in any genome). Even though the motifs for different TFs can exhibit large variation both in length and in error tolerance, we will show in this section that they share common properties, allowing us to sift these bits of information from a sea of non-informative genomic sequences.

Since the experimental techniques for discovering sites in the genome that can be recognized by TFs give only approximate positions, we need computational methods to further narrow down the search space in order to find the true recognition sites. To properly define this computational problem, we need to specify the space of acceptable descriptions of the sequences recognized by TFs for binding. The range of possible representations is quite broad. Even if we only consider representations that have been shown to successfully capture the biological properties of TF-DNA binding, the possibilities span simple words and regular expressions [16], probability matrices [17] and Bayesian models [33]. The most popular choice, however, is to describe the binding specificity as a so-called Position-Specific Scoring Matrix (PSSM) consisting of as many columns as the length of the motif, where each column contains a probability distribution over symbols (4 nucleotides in the case of DNA motifs) for the corresponding position in the binding site. The positions are treated as independent of each other, which is a serious simplification of the biological reality. Nonetheless, PSSM models have proven to be very useful in practical applications because of their simplicity and the relative ease of estimating them from limited amounts of experimental data.
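To make the PSSM description concrete, the following minimal Python sketch estimates a column-wise probability matrix from a handful of aligned binding sites and scans a longer sequence for the best-scoring window. The example sites, the pseudocount of 1, the uniform background of 0.25 per nucleotide and all function names are assumptions made for illustration; they are not taken from any of the cited tools.

```python
import math

ALPHABET = "ACGT"


def build_pssm(sites, pseudocount=1.0):
    """Estimate one probability distribution per motif position."""
    length = len(sites[0])
    if any(len(site) != length for site in sites):
        raise ValueError("all sites must have the same length")
    pssm = []
    for pos in range(length):
        # Start every count at the pseudocount so no probability is zero.
        counts = {base: pseudocount for base in ALPHABET}
        for site in sites:
            counts[site[pos]] += 1
        total = sum(counts.values())
        pssm.append({base: counts[base] / total for base in ALPHABET})
    return pssm


def log_odds_score(pssm, window, background=0.25):
    """Sum of per-position log2 ratios; positions are treated as independent."""
    return sum(math.log2(col[base] / background)
               for col, base in zip(pssm, window))


def best_hit(pssm, sequence):
    """Return (score, offset) of the highest-scoring window in the sequence."""
    width = len(pssm)
    return max((log_odds_score(pssm, sequence[i:i + width]), i)
               for i in range(len(sequence) - width + 1))


# Hypothetical aligned binding sites for a single TF.
sites = ["TGATA", "TGATT", "TGACA", "TGATA"]
pssm = build_pssm(sites)
score, offset = best_hit(pssm, "CCGTGATACC")
```

Because the score is a plain sum over columns, it inherits the independence assumption discussed above: a model that captured dependencies between positions would need a different scoring function.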
In order to fully specify our computational task, we need to specify the cost function that will select the optimal PSSM model with respect to the available experimental data. As indicated at the beginning of this section, such data is usually presented in the form of a number of sub-sequences from the genome that are known to be bound by a TF of interest, but that are much longer than the expected motif. Hence we should expect the true motif to be present in most of these sub-sequences. While such constraints eliminate some of the obviously wrong PSSMs, there is still a need to find the real motif among very many potential motifs. Surprisingly, the best cost function for PSSMs does not come from the data itself; instead, it is based on the observation that DNA motifs are in fact messages encoding information about gene regulation. If we think about them as binary codes, we can recall the work of Shannon [31] on information theory. Namely, efficient codes should have high information content; this means that if TFs are to decode information quickly and reliably, the DNA-binding motifs should contain substantial information. Schneider [28] was the first to notice the importance of the information content (IC) of sequence motifs, which led to the development of what are now standard ways of evaluating and presenting motifs.

The problem of finding TF binding motifs has in recent years seen an explosive growth of different approaches [35], which all make use of the IC measure in some way, but differ greatly with respect to their optimization strategies and the way they treat the original experimental data. Such a proliferation of methods can create certain difficulties in the interpretation of results. In particular, if we compare the results of different motif finding procedures, it is not always clear which of the results represent different variations of the same motif, and which represent qualitatively different motifs (e.g. the binding sequence of another TF, biologically related to the one we are interested in). This task can be solved by clustering the results of different motif finding methods [41, 24].
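As a hedged illustration of the IC measure, the sketch below computes the information content of a motif as the sum, over columns, of the relative entropy between the column distribution and a background distribution. It assumes a uniform background (0.25 per nucleotide) and omits the small-sample correction that is commonly applied in practice; the function names and the toy columns are hypothetical.

```python
import math

BASES = "ACGT"


def column_information(column, background=None):
    """Relative entropy (in bits) of one motif column against the background."""
    if background is None:
        # Assumption: all four nucleotides are equally likely in the genome.
        background = {base: 0.25 for base in BASES}
    return sum(column[base] * math.log2(column[base] / background[base])
               for base in BASES if column[base] > 0)


def information_content(pssm):
    """Total information content: columns are assumed to be independent."""
    return sum(column_information(column) for column in pssm)


# Three toy columns: fully conserved, two equally likely bases, uninformative.
conserved = {"A": 1.0, "C": 0.0, "G": 0.0, "T": 0.0}
two_way = {"A": 0.5, "C": 0.0, "G": 0.5, "T": 0.0}
uniform = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}

total_bits = information_content([conserved, two_way, uniform])
```

Under the uniform background, a fully conserved position carries 2 bits and an uninformative one carries 0 bits. A motif of four fully conserved positions therefore carries 8 bits and is expected by chance once every 2^8 = 256 positions on one strand of a random sequence, which matches the order of magnitude quoted earlier for highly degenerate eukaryotic motifs.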
Another, similar problem occurs if we want to compare a newly discovered motif with a database of known TF binding motifs, such as JASPAR [7]. In such situations, again, while there exist multiple different measures for comparing PSSMs, the common ground of these methods seems to be the use of the IC measure to limit the similarity computation to the most informative columns, as pioneered by the CompareACE method [18].

3. Building Reliable Modules: Regulatory Elements

Information theory can help us to find small regulatory motifs. However, in all organisms but the simplest bacteria, the regulation of transcription of any single gene is determined by larger sequence elements containing meaningful combinations of binding motifs. These combinations drive the assembly of condition-specific TF combinations, which in turn determine the condition-specific switching of gene transcription. These sequences containing several motifs, so-called cis-regulatory modules or CRMs, comprise the core of the regulatory system of information processing in higher organisms. Keeping to our software analogies, we can think of these modules as software modules: their job is to combine the atomic bits of information into larger sequences able to reliably perform a given function. Interestingly, for various reasons, both evolutionary and physical, such regulatory sequences show a highly modular architecture [39] and tend to be re-used by different genes in the same species as well as by homologous genes across related species. This presents us with a great opportunity for finding such modules by selecting for motif combinations that tend to be re-used in different contexts.

The first approaches to finding CRMs date back more than 10 years [40]. The approach was based on finding unusual concentrations of motifs corresponding to TFs involved in a particular developmental process.
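The idea of looking for unusual concentrations of motif occurrences can be sketched as a sliding-window count. The code below is a toy illustration, not the statistical model of [40]: it assumes that motif hits have already been located (for example by PSSM scanning), and the window width, the hit threshold, the example positions and the function name are arbitrary choices made here.

```python
def dense_windows(hit_positions, window=200, min_hits=3):
    """Report (first_hit, last_hit, count) for groups of hits within `window` bp.

    Two hits belong to the same window when their positions differ by less
    than `window`. Groups starting at different hits are reported separately.
    """
    hits = sorted(hit_positions)
    clusters = []
    left = 0
    for right in range(len(hits)):
        # Shrink the window from the left until it spans fewer than `window` bp.
        while hits[right] - hits[left] >= window:
            left += 1
        count = right - left + 1
        if count >= min_hits:
            span = (hits[left], hits[right], count)
            # Extend the previous cluster if it started from the same hit.
            if clusters and clusters[-1][0] == span[0]:
                clusters[-1] = span
            else:
                clusters.append(span)
    return clusters


# Hypothetical motif hit positions (in bp) along one genomic region.
hits = [100, 150, 180, 900, 2000, 2050, 2120, 2190]
clusters = dense_windows(hits)
```

A real method would additionally ask whether the observed density is surprising given the length of the region and the expected frequency of each motif; the fixed threshold used here sidesteps that question.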
These results were later verified and generalized to other species [6], while at the same time it was observed that restricting the analysis of regulatory motifs to those that are conserved across species significantly increases the chance of finding a functional binding site [23]. It did not take long for researchers to combine the two approaches and search for clusters of binding sites within highly conserved regions [32]. However, it took much longer to describe the first truly integrated model using both conservation and motif sequence alignment that was applicable to multiple species on a genome scale [14]. The Enhancer Element Locator (EEL) method used a very elegant binding site alignment method; however, it depended on the assumption of exact conservation of the binding site order along the sequence, which was later shown to be a serious simplification of the biology [15]. This assumption was dropped by later approaches (such as [42]), which are able to detect conservation of CRMs with rearranged binding sites.

4. Evaluating Functions: Linking Modules with Gene Expression

Knowing that gene regulation is organized in cis-regulatory modules (CRMs), and that these modules tend to be reused by different genes in the genome, can we say something about the function that these modules implement? Thanks to high-throughput technologies that can measure the expression profiles of all genes in a genome over time or in different conditions, we can use machine learning and data mining methods to model the regulatory logic hard-wired in the DNA. By observing the dynamic execution of the system in terms of gene expression (the output), we can learn CRMs (the inputs) that agree with the assumption that the underlying function should produce similar output given similar input. Thus, we can discover the CRMs, and in principle also reverse-engineer the underlying function, by assuming that genes exhibiting similar expression profiles also contain common CRMs in their promoter regions.
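The assumption that similar regulatory input produces similar expression output can be probed with a very simple calculation: compare the average pairwise correlation of expression profiles among genes that share a set of motifs. The sketch below does this on a small, entirely made-up data set; the gene names, motif names, expression values and function names are hypothetical, and Pearson correlation is only one of several possible similarity measures.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation of two equally long expression profiles."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = math.sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = math.sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


def mean_pairwise_correlation(genes, expression):
    """Average correlation over all pairs of the given genes."""
    pairs = list(combinations(genes, 2))
    return sum(pearson(expression[a], expression[b]) for a, b in pairs) / len(pairs)


# Hypothetical expression profiles (four conditions) and promoter motif content.
expression = {
    "g1": [1.0, 2.0, 3.0, 4.0],
    "g2": [2.0, 4.0, 6.0, 8.1],
    "g3": [4.0, 3.0, 2.0, 1.0],
    "g4": [1.0, 3.0, 2.0, 4.0],
}
motifs = {"g1": {"M1", "M2"}, "g2": {"M1", "M2"}, "g3": {"M2"}, "g4": {"M1"}}


def genes_with(required):
    """Genes whose promoters contain every motif in `required`."""
    return [gene for gene, present in motifs.items() if required <= present]


both = mean_pairwise_correlation(genes_with({"M1", "M2"}), expression)
single = mean_pairwise_correlation(genes_with({"M2"}), expression)
```

In this toy data set the genes sharing both motifs are far more coherent than the genes sharing only M2, which is the kind of signal that the methods discussed in this section try to detect on a genomic scale and with proper statistical controls.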
Here we will consider two quite different examples separately: microbes, which are highly specialized but robust single-cellular organisms, and animals, which are enormously complex systems of specialized cells.

4.1. Highly specialized circuits: microbial regulatory systems

Microbial gene regulation often implies a relatively simple system where typically one gene corresponds to one protein, promoters are short and well defined, and regulatory motifs are organized into one CRM per gene. Furthermore, these systems consist of a single cell type. Thus, perturbing the system either indirectly by altering its environment (temperature stress, starvation, etc.) or directly by knocking out genes results in high-quality data that makes it possible to reverse-engineer the function producing the observed response.

Yeast has been the main microbial model organism for computational modeling of gene regulation. In a seminal paper, Pilpel et al. [26] showed in 2001 that genes sharing the same pair of transcription factor binding sites in their promoters exhibit significantly higher expression similarity than genes sharing only a single binding site. The conclusion was that in order to regulate a large number of processes, and respond to a large number of stress factors, with a relatively small number of transcription factors (about 200), yeast makes extensive use of combinatorial regulation, where more than one transcription factor is required to produce a response. This relatively simple computational approach was followed by a large number of more advanced, machine learning-based approaches [30, 29, 4]. The aim of these studies was to identify non-overlapping sets of genes (gene modules) with common regulatory mechanisms. Segal et al. used Bayesian models and the EM algorithm to iteratively refine initial clusters by re-assigning genes whose promoters did not match the current motif profile of the other genes in the cluster.
Beer and Tavazoie took a slightly different approach, where a Bayesian network model was used to predict the expression profile of genes (defined by a set of fixed clusters) from their promoter content. In both these studies, complex models with many parameters were used to describe the system. Later, Yuan et al. [45] showed that a simpler model based on the naive Bayes classifier obtained similar results. We proposed an alternative approach where rule induction was used to associate sets of binding sites with possibly overlapping clusters of genes characterized by similar expression profiles [19, 44, 1].

As we have seen, several heuristics have been used to model the underlying function that takes regulatory sequence to dynamic expression. But how to determine the most biologically relevant approach is still open for discussion. In yeast, predicted regulatory mechanisms have been evaluated against high-throughput interaction measurements between transcription factors and promoters (sometimes difficult to interpret due to their context dependence), low-throughput experimentally confirmed interactions (typically giving anecdotal evidence for one or a few of the predictions) and gene function information. The rationale for the latter is that genes regulated together often participate in the same pathway or biological process, and should therefore be associated with similar functional information in relevant databases [2]. This is often measured by computing the probability that the correspondence between predictions and prior functional knowledge could have occurred by chance (i.e. the p-value).

4.2. System integration: how to make an animal

The regulation of gene expression in multicellular organisms, and animals in particular, is a much more complex process.
While at the adult stage it is frequently assumed that whole tissues behave like populations of homogeneous cells, the developmental processes that give rise to complex body plans, made up of billions of cells all sharing the same genome and originating from a single fertilized cell, pose completely new challenges when attempting to understand the regulatory mechanisms. From the very early years, scientists with a background in mathematics were interested in modeling the processes of pattern formation in biology. For example, it is not widely known among computer scientists that the most cited work of Alan Turing is actually his groundbreaking work on modeling self-organizing pattern formation inspired by developmental biology [37]. We now know that the substances he called morphogens, which were able to diffuse in space and generate different patterns in developing organisms, are in fact TFs, and that their function is exerted through regulatory modules.

However, in order to fully understand the action of TFs in the developmental context of pattern formation, we need to incorporate an additional step of signal integration. Since the spatio-temporal patterns of gene expression depend on the action of multiple TFs acting through multiple CRMs, we need to learn how to assemble complex gene regulatory functions from simpler rules governing the activity of single CRMs [43]. Currently, it is usually assumed that different CRMs can activate genes independently [38]; however, there is some experimental evidence of long-range repression mechanisms [8] in developmental contexts, which makes the problem of integrating inputs from multiple CRMs more complicated.

Once we make the step from single CRMs to gene regulatory functions, we can describe how different TFs affect pattern formation in morphogenesis. Since cells during development make discrete choices concerning their fate (e.g. a cell can be either a muscle cell or a bone cell), Boolean networks are typically chosen as the formalism to describe gene regulatory networks. In this field, the pioneering work was done by Stuart Kauffman [20], who showed how we can gain new insights into biological phenomena such as homeostasis from computer simulations of randomized Boolean networks. Importantly, these models can be used both to discuss general properties of biological systems, such as evolvability and robustness [10], and to provide biologists with a formalism to describe particular biological systems, such as the segmentation network [27], and to predict their behavior under different perturbations.

5. A Look Ahead: Debugging and Code Generation

In the present review, we have climbed through multiple levels of abstraction, from tiny regulatory motifs carrying atoms of regulatory information to Boolean gene regulatory networks describing phenomena concerning multicellular organisms. Yet, we are still very far from a sufficient understanding of all important aspects of regulatory processes. In particular, two areas of biological research emerge as rich sources of new problems for computational modeling approaches: personal genomics and synthetic biology.

Personal genomics aims at understanding how observable characteristics (phenotypes) are linked to the underlying variability in the genomes of individuals (genotypes). Multiple ongoing scientific endeavors, such as the Personal Genome Project [9] and the 1000 Genomes Project [11], explore the differences between the genetic code of different individuals. While these studies focus on the medical aspects of personal genomics, their results will without doubt influence our understanding of regulatory mechanisms. As we get to know more and more individual genome sequences, we are no longer looking at a single regulatory genome. Keeping to our software analogy, we are in fact confronted not with one regulatory program of a given species but with many imperfect copies of the same program coming from different individuals. We already know about many mutations of the code that lead to buggy programs, i.e. genetic diseases. For example, the Human Gene Mutation Database [34] lists more than 1500 non-coding mutations associated with different diseases.
Even though studying these mutations and their role in diseases is of great value for medical applications, the remaining challenge for basic science is to understand the majority of mutations that currently remain unassigned to any known disease. The question is whether these mutations are truly innocuous or whether their relevance is masked by our incomplete understanding of the regulatory processes. In any case, just as understanding the semantics of a programming language is indispensable for finding bugs in programs written in this language, we should expect that understanding gene regulation will increase our knowledge of the basis of genetic diseases.

Synthetic biology looks at regulation from a completely different angle. Its main goal is to create new life forms, but in a much more creative way than current biotechnology, which focuses on modifying existing organisms by either deleting genes or transplanting them between species. Synthetic biology aims at creating new genes, new regulatory mechanisms and ultimately new organisms [5]. Even though we are far from creating truly new life forms, the first strides have been made by successfully creating simple regulatory circuits in bacteria [12] and eukaryotes [21]. Combining the ability to make simple working circuits in living cells with the massive synthesis of different sequences opens many possibilities for testing hypotheses regarding regulatory mechanisms. For example, a novel approach by Patwardhan [25] finds new regulatory motifs by testing a large number of random sequences. However, without a better understanding of regulatory systems it is unlikely, if at all possible, that these approaches can be scaled up in order to make new kinds of living cells. Nonetheless, a technical milestone to this end was recently reached when Venter et al. successfully transferred a synthetic genome (although virtually identical to that of a natural bacterium) into a new bacterium that, based on its new genome, started replicating and making proteins [13].

The system-level approach to modeling biological systems is often referred to as systems biology. The aim is to model the entire biological system by considering its entities (genes, RNAs, proteins and metabolites) not in isolation, but in the context of each other. At the heart of systems biology modeling is gene regulation, since it is here that dynamic responses are initiated. Unraveling the hard-coded regulatory logic in the regulatory genome, and identifying the transcription factors that cooperatively bind (synergistically or competitively) the discovered cis-regulatory modules, is a hard computational problem. First and foremost, this requires large amounts of high-quality data. Although molecular biology today is considered a data-rich science, the number of measurement points (time points, conditions) is still small compared to the number of variables (e.g. genes). For example, extensive research on network inference from expression data [3] indicates that this is an enormous challenge and that a large number of hard links (e.g. experimentally observed binding of transcription factors to promoters or promoter motifs) are needed as constraints in order to lift the quality of these models to an acceptable level.

References

[1] Andersson, C. R., Hvidsten, T. R., Isaksson, A., Gustafsson, M. G., Komorowski, J.: Revealing cell cycle control by combining model-based detection of periodic expression with cis-regulatory descriptors, BMC Systems Biology, 1, 2007, 45.
[2] Ashburner, M., Ball, C. A., Blake, J. A., Botstein, D., Butler, H., Cherry, J. M., Davis, A. P., Dolinski, K., Dwight, S. S., Eppig, J. T., Harris, M. A., Hill, D. P., Issel-Tarver, L., Kasarskis, A., Lewis, S., Matese, J. C., Richardson, J. E., Ringwald, M., Rubin, G. M., Sherlock, G.: Gene ontology: tool for the unification of biology. The Gene Ontology Consortium, Nature Genetics, 25(1), 2000.
[3] Bansal, M., Belcastro, V., Ambesi-Impiombato, A., di Bernardo, D.: How to infer gene networks from expression profiles, Molecular Systems Biology, 3, 2007, 78.
[4] Beer, M. A., Tavazoie, S.: Predicting gene expression from sequence, Cell, 117(2), 2004.
[5] Benner, S., Sismour, A.: Synthetic biology, Nature Reviews Genetics, 6(7), 2005.
[6] Berman, B. P., Nibu, Y., Pfeiffer, B. D., Tomancak, P., Celniker, S. E., Levine, M., Rubin, G. M., Eisen, M. B.: Exploiting transcription factor binding site clustering to identify cis-regulatory modules involved in pattern formation in the Drosophila genome, Proceedings of the National Academy of Sciences of the United States of America, 99(2), 2002.
[7] Bryne, J., Valen, E., Tang, M., Marstrand, T., Winther, O., da Piedade, I., Krogh, A., Lenhard, B., Sandelin, A.: JASPAR, the open access database of transcription factor-binding profiles: new content and tools in the 2008 update, Nucleic Acids Research, 36, 2008, D102-D106.
[8] Cai, H., Arnosti, D., Levine, M.: Long-range repression in the Drosophila embryo, Proceedings of the National Academy of Sciences of the United States of America, 93, 1996.
[9] Church, G., et al.: Personal Genome Project.
[10] Ciliberti, S., Martin, O., Wagner, A.: Robustness can evolve gradually in complex regulatory gene networks with varying topology, PLoS Computational Biology, 3(2), 2007, e15.
[11] Durbin, R., Altshuler, D., McVean, G., Abecasis, G., Brooks, L.: 1000 genomes project.
[12] Gardner, T., Cantor, C., Collins, J.: Construction of a genetic toggle switch in Escherichia coli, Nature, 403, 2000.

[13] Gibson, D. G., Glass, J. I., Lartigue, C., Noskov, V. N., Chuang, R. Y., Algire, M. A., Benders, G. A., Montague, M. G., Ma, L., Moodie, M. M., Merryman, C., Vashee, S., Krishnakumar, R., Assad-Garcia, N., Andrews-Pfannkoch, C., Denisova, E. A., Young, L., Qi, Z. Q., Segall-Shapiro, T. H., Calvey, C. H., Parmar, P. P., Hutchison, C. A., III, Smith, H. O., Venter, J. C.: Creation of a bacterial cell controlled by a chemically synthesized genome, Science, 329(5987), 2010.
[14] Hallikas, O., Palin, K., Sinjushina, N., Rautiainen, R., Partanen, J., Ukkonen, E., Taipale, J.: Genome-wide prediction of mammalian enhancers based on analysis of transcription-factor binding affinity, Cell, 124(1), 2006.
[15] Hare, E., Peterson, B., Iyer, V., Meier, R., Eisen, M.: Sepsid even-skipped enhancers are functionally conserved in Drosophila despite lack of sequence conservation, PLoS Genetics, 4(6), 2008.
[16] van Helden, J.: Regulatory Sequence Analysis Tools, Nucleic Acids Research, 31(13), 2003.
[17] Hertz, G. Z., Stormo, G. D.: Identifying DNA and protein patterns with statistically significant alignments of multiple sequences, Bioinformatics, 15(7-8), 1999.
[18] Hughes, J., Estep, P., Tavazoie, S., Church, G.: Computational identification of cis-regulatory elements associated with groups of functionally related genes in Saccharomyces cerevisiae, Journal of Molecular Biology, 296(5), 2000.
[19] Hvidsten, T. R., Wilczynski, B., Kryshtafovych, A., Tiuryn, J., Komorowski, J., Fidelis, K.: Discovering regulatory binding-site modules using rule-based learning, Genome Research, 15(6), 2005.
[20] Kauffman, S.: Homeostasis and differentiation in random genetic control networks, Nature, 224(5215), 1969.
[21] Kim, J., White, K., Winfree, E.: Construction of an in vitro bistable circuit from synthetic transcriptional switches, Molecular Systems Biology, 2, 2006, 68.
[22] King, M., Wilson, A.: Evolution at two levels in humans and chimpanzees, Science, 188(4184), 1975.
[23] Loots, G., Ovcharenko, I., Pachter, L., Dubchak, I., Rubin, E.: rVista for comparative sequence-based discovery of functional transcription factor binding sites, Genome Research, 12(5), 2002.
[24] Mahony, S., Benos, P.: STAMP: a web tool for exploring DNA-binding motif similarities, Nucleic Acids Research, 35, 2007, W253-W258.
[25] Patwardhan, R., Lee, C., Litvin, O., Young, D., Pe'er, D., Shendure, J.: High-resolution analysis of DNA regulatory elements by synthetic saturation mutagenesis, Nature Biotechnology, 27(12), 2009.
[26] Pilpel, Y., Sudarsanam, P., Church, G. M.: Identifying regulatory networks by combinatorial analysis of promoter elements, Nature Genetics, 29(2), 2001.
[27] Sánchez, L., Chaouiya, C., Thieffry, D.: Segmenting the fly embryo: logical analysis of the role of the segment polarity cross-regulatory module, International Journal of Developmental Biology, 52(8), 2008.
[28] Schneider, T., Stephens, R.: Sequence logos: a new way to display consensus sequences, Nucleic Acids Research, 18(20), 1990.
[29] Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman, N.: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data, Nature Genetics, 34(2), 2003.
[30] Segal, E., Yelensky, R., Koller, D.: Genome-wide discovery of transcriptional modules from DNA sequence and gene expression, Bioinformatics, 19(Suppl 1), 2003, I273-I282.

[31] Shannon, C., Petigara, N., Seshasai, S.: A Mathematical Theory of Communication, Bell System Technical Journal, 27, 1948,
[32] Sharan, R., Ben-Hur, A., Loots, G., Ovcharenko, I.: CREME: Cis-Regulatory Module Explorer for the human genome, Nucleic Acids Research, 32, 2004, W253–W256.
[33] Sharon, E., Lubliner, S., Segal, E.: A Feature-Based Approach to Modeling Protein-DNA Interactions, PLoS Computational Biology, 4(8), 2008, e
[34] Stenson, P., Ball, E., Mort, M., Phillips, A., Shiel, J., Thomas, N., Abeysinghe, S., Krawczak, M., Cooper, D.: Human gene mutation database (HGMD®): 2003 update, Human Mutation, 21(6), 2003,
[35] Tompa, M., Li, N., Bailey, T. L., Church, G. M., De Moor, B., Eskin, E., Favorov, A. V., Frith, M. C., Fu, Y., Kent, W. J., Makeev, V. J., Mironov, A. A., Noble, W. S., Pavesi, G., Pesole, G., Regnier, M., Simonis, N., Sinha, S., Thijs, G., van Helden, J., Vandenbogaert, M., Weng, Z., Workman, C., Ye, C., Zhu, Z.: Assessing computational tools for the discovery of transcription factor binding sites, Nature Biotechnology, 23(1), 2005,
[36] Tsonis, P.: Anatomy of gene regulation: A three-dimensional structural analysis, Garland Publishing,
[37] Turing, A. M.: The chemical basis of morphogenesis, Philosophical Transactions of the Royal Society of London, 237(641), 1952,
[38] Visel, A., Akiyama, J., Shoukry, M., Afzal, V., Rubin, E., Pennacchio, L.: Functional autonomy of distant-acting human enhancers, Genomics, 93(6), 2009,
[39] Wasserman, W., Sandelin, A.: Applied bioinformatics for the identification of regulatory elements, Nature Reviews Genetics, 5(4), 2004,
[40] Wasserman, W. W., Fickett, J. W.: Identification of regulatory regions which confer muscle-specific gene expression, Journal of Molecular Biology, 278(1), 1998,
[41] Wilczynski, B., Darzynkiewicz, M., Tiuryn, J.: MEMOFinder: combining de novo motif prediction methods with a database of known motifs, Nature Precedings, 2008, Available from
[42] Wilczynski, B., Dojer, N., Patelak, M., Tiuryn, J.: Finding evolutionarily conserved cis-regulatory modules with a universal set of motifs, BMC Bioinformatics, 10, 2009, 82.
[43] Wilczynski, B., Furlong, E.: Challenges for modeling global gene regulatory networks during development: Insights from Drosophila, Developmental Biology, 340(2), 2010,
[44] Wilczynski, B., Hvidsten, T. R., Kryshtafovych, A., Tiuryn, J., Komorowski, J., Fidelis, K.: Using local gene expression similarities to discover regulatory binding site modules, BMC Bioinformatics, 7, 2006, 505.
[45] Yuan, Y., Guo, L., Shen, L., Liu, J. S.: Predicting gene expression from sequence: a reexamination, PLoS Computational Biology, 3(11), 2007, e243.