

Table of Contents

1 Introduction
COMBIOMED: A Cooperative Thematic Research Network on COMputational BIOMEDicine in Spain
Fernando Martín-Sánchez, Victoria López-Alonso, Isabel Hermosilla-Gimeno, Guillermo López-Campos and the COMBIOMED Network

2 Clinical research informatics and medical image
Patch-Based Image Processing: A New Framework for Medical Imaging
Jose V. Manjón and Montserrat Robles

Automated System for Microscopic Image Acquisition and Analysis
Juan Vidal, Oscar Déniz, Jesús Salido, Noelia Vállez, M. Milagros Fernández, Carlos Aguilar and Gloria Bueno

From Bayesian to module networks: experiences modeling gene relationships with expression data
Borja Calvo, Rubén Armañanzas, Iñaki Inza and José A. Lozano

Genetic Algorithms and Iterative Rule Learning for association rule mining
Vanessa Aguiar-Pulido, José A. Seoane, Cristian R. Munteanu, Julián Dorado and Alejandro Pazos

3 Biomedical computing in drug research and development
Automatic substantiation of drug safety signals
Bauer-Mehren A, Carrascosa MC, Mestres J, Boyer S, Sanz F, Furlong LI

Allosteric modulation of 5-HT2B receptors by celecoxib and valdecoxib. Putative involvement in cardiotoxicity
Ainhoa Nieto, José Brea, María Isabel Cadavid, Jordi Mestres, Inés Sánchez-Sellero, Rosalía Gallego, Máximo Fraga, María Isabel Loza

G protein-coupled receptors: targets to efficiently design drugs
Leonardo Pardo, Mercedes Campillo, Gianluigi Caltabiano, Arnau Cordomí, Laura Lopez, Norma Diaz-Vergara, Ivan R. Torrecillas, Jessica Sallander, Angel Gonzalez, Julian Zachman and Santiago Rios

4 Bioinformatics in molecular medicine
A Generic Computational Pipeline Architecture for the Analysis of RNA-seq Short Reads
P. Ferreira, D. González, P. Ribeca, M. Sammeth and R. Guigó

5 Translational Bioinformatics: toward application and the point of care
Towards Openness in Biomedical Informatics
Victor Maojo, Ana Jimenez-Castellanos, Diana de la Iglesia

Translational Bioinformatics: infectious diseases as a case study
Guillermo López-Campos, Isabel Hermosilla, Mª Angeles Villarrubia, Jose Antonio Seoane, Mª Carmen Ramirez-Paniagua, Fernando Martín-Sanchez and Victoria López-Alonso

New Computer Tools in Nutritional Genomics: e-24h recall with OntoRecipe
Oscar Coltell, Antonio Fabregat, Eduardo Añíbarro, María Arregui, Olga Portolés, Elisabet Barrera and Dolores Corella

INBIOMEDvision: Bridging gaps between Bioinformatics and Medical Informatics
The INBIOMEDvision Consortium

6 Tutorial: Clinical information processing: clinical records, data, images and text
Introduction to the analysis of biomedical data with machine learning techniques
José Antonio Seoane, Carlos Fernández-Lozano and Julián Dorado

Biomedical image processing methods for diagnostic support
Gloria Bueno, Oscar Déniz

7 Tutorial: Microarray data analysis and its use at the POC
Databases and standards in microarrays
Guillermo Hugo López Campos

Feature selection techniques in DNA microarray domains: basic theory and the most common methods for selecting differentially expressed genes
Iñaki Inza

R + Bioconductor as a platform for gene expression microarray analysis and the construction of predictive models
Juan M Garcia-Gomez

8 Tutorial: Analysis of inter-individual genetic variation (ultrasequencing and genomic data analysis)
Introduction to next-generation sequencing (NGS) techniques
David González

Detection of genomic variants in DNA-seq studies
Gonzalo Gómez

IntOgen and Gitools: browsing, visualization and integrative analysis of genomic data
Sophia Derdak

Web services and supercomputing infrastructure
Josep Gelpi

9 Tutorial: Association studies, pharmacogenetics
DisGeNET: visualize, integrate, search and analyze gene-disease networks
Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI


COMBIOMED: A Cooperative Thematic Research Network on COMputational BIOMEDicine in Spain

Fernando Martin-Sanchez, Victoria Lopez-Alonso, Isabel Hermosilla-Gimeno, Guillermo Lopez-Campos, and the COMBIOMED Network
Medical Bioinformatics Department, Institute of Health "Carlos III", Madrid, Spain

Abstract. The Cooperative Thematic Research Network on Computational Biomedicine (COMBIOMED) was approved in the last call for Thematic Networks in Health Research within the Spanish National Plan for Scientific Research, Development and Technological Innovation, and it is funded for the period. The COMBIOMED Network currently addresses various aspects ranging from basic to applied research for the development of methods and tools to solve problems in biomedical science in the context of personalized medicine. This paper describes and analyses the organizational aspects and scientific areas on which the network has focused (gene-disease association, pharma-informatics and decision support systems at the point of care). At the same time, COMBIOMED aims to play a central role in the education of researchers and in the training of health professionals in techniques for the processing of biomedical information; to this end the COMBIOMED Network has developed an educational program.

1. Introduction

The COMBIOMED Network continues the work initiated by INBIOMED, the Cooperative Thematic Research Network on Biomedical Informatics, which developed a platform for the storage, integration and analysis of clinical, genetic and epidemiological data and images focused on the investigation of complex diseases [1]. Computational biomedicine represents the interface between the biomedical and computer sciences. It provides an inclusive environment for a better understanding of the biological processes that take place at each level of organization of living organisms and the intricate network of interactions between

them. One of the objectives of COMBIOMED is to establish contacts and to collaborate with the most relevant international initiatives in the field, such as the National Centers for Biomedical Computing [2] or the Biomedical Informatics Cores of the Clinical and Translational Science Awards [3] funded by the National Institutes of Health (NIH).

2. COMBIOMED: description and organization

Several of the 12 groups participating in COMBIOMED have previously participated in and led some of the European Networks of Excellence (NoE) on biomedical informatics, bioinformatics and systems biology, such as INFOBIOMED [4] and BIOSAPIENS [5]. Previous experience in Spanish initiatives (INBIOMED [6], INB [7]) was crucial to the creation and development of those European initiatives. The design of the network, shown in Figure 1, consists of the following levels:
- Coordination and management.
- Computational aspects, which serve as instrumental support to the network, including software and middleware, hardware, GRID, algorithms and programming.
- Coordination of the work carried out by the research groups. This level addresses aspects such as data and text mining, clinical decision making, electronic health records, image processing, disease simulation, and biomedical ontologies that help manage and integrate chemical, genetic, environmental, clinical and imaging data.
- Horizontal activities affecting all groups and lines of work. Particular attention is paid to the connection of the network with the scientific community and society (integrated knowledge management, education and training, communication, dissemination, quality and safety) (Figure 2).

3. COMBIOMED: scientific areas

COMBIOMED focuses on three scientific research areas: gene-disease association (Disease-omics), pharma-informatics and decision support systems at the point of care (Info-POC).

3.1 Gene-disease association (Disease-omics)

The study of the molecular causes of disease and of individual genetic variations allows a deeper advance into personalized medicine [8] by developing safer and more efficient preventive, diagnostic and therapeutic solutions. The scientific community needs more advanced computational resources (functional analysis of genes and proteins in the context of genetic variability, alternative splicing) [9], access to specific comparative genomic information (genomic data visualization) and prediction of the effects of individual mutations (SNPs) on pathways and macromolecular complexes, with the consequent implications for the associated diseases. COMBIOMED works on these computational challenges in genotype-phenotype association and genomic epidemiology studies, to advance the understanding and modeling of the influence of environmental and genetic factors on the development of diseases. The network is using modules already developed by the National Institute of Bioinformatics (INB) to connect new methods that will be made available as Web services. This will help develop specific solutions for the analysis of genomic and clinical data. The network is also developing systems that facilitate access to textual information about gene-disease relationships, using automated information extraction methods and natural language processing with specific applications to problems of biomedical importance [10].

3.2 Pharma-informatics

The discovery and development of drugs is an area of great significance for human health and, at the same time, of great socio-economic importance, since it gives its raison d'être to an industry whose business is highly knowledge-intensive. Biomedical research in general, and pharmaceutical R&D in particular, generate enormous amounts of data that require sophisticated computational tools for their management and analysis in order to extract the knowledge they contain. This is one of the main reasons for the emergence of a new field of scientific activity that includes disciplines such as Computational Biology and Biomedical Informatics. Pharmaceutical research labs were pioneers in identifying the need for, and usefulness of, computational approaches for the management and exploitation of the data generated in pre-clinical and clinical research. They are aware that certain

computational methods and their associated software can perform simulations and predictions that save time and investment in the development of drugs [11]. Computational approaches in systems biology are facilitating the management, visualization and development of predictive and descriptive mathematical models of interaction networks between biomolecular entities. This information is generated in the experimental laboratory, largely based on the use of microarray technologies [12]. Virtual screening and computer simulation techniques are very useful for the selection and testing of compounds to be considered in the initial stages of the design of a new drug. Moreover, the pharmacological and toxicological knowledge accumulated on the different groups of compounds allows for the development of quantitative models that can be used to perform in-silico prediction studies of the pharmacological and toxicological behavior of compounds not yet synthesized or tested. Information technology also plays an important role in areas such as the management and exploitation of data from clinical trials. In addition, advanced physiological simulation techniques may allow the study of the behavior of the organs of different individuals when exposed to drugs with different properties. In coordination with the INB and the Spanish Technological Platform of Innovative Medicines [13], the COMBIOMED Network is developing technological solutions to facilitate the advancement of biomedical knowledge management geared towards pharmaceutical R&D in all its stages.

3.3 Decision support systems at the point of care (INFO-POC)

In recent decades medical practice has sought a greater integration of scientific knowledge into its routine. The tremendous growth of scientific knowledge and technological innovation requires the development of solutions that allow the use of a large amount of information in the clinical decision-making process.
Within this context, Computational Biomedicine promotes the combination of disciplines such as Medical Informatics (MI), Bioinformatics (BI) and computing in the development of new methods and standards for clinical and biomolecular data integration and analysis [14]. At the same time, it facilitates a new approach whose overall objective is to create a new integrated research framework for the development of diagnostic methods, within the context of genomic medicine, at the so-called "point of care". The COMBIOMED network proposes the common research line of INFO-POC to carry out computational developments to represent and analyze clinical and biomedical knowledge at the point of patient care (POC). The collaboration between the diverse groups of the COMBIOMED network makes possible a continuous exchange of information and tools.

The network will support decision-making processes in a context of miniaturization of diagnostic systems and accessibility of information about the molecular causes of diseases. This context is in line with recent trends in the NBIC Convergent Technologies (Nano, Bio, Info and Cogno), with the objective of contributing to the development of a line of intelligent, miniaturized systems to be used at the point of care. The availability and applicability of new technologies at the point of care could be a key incentive for translational research, and may also imply a reduction in the time devoted to decision making. DNA microarray technology and the bioinformatics tools that allow microarray data storage, management and analysis have enabled the development of diagnostic tests for complex diseases [15]. In addition to the biomolecular results obtained through these miniaturized point-of-care test systems, there is a requirement to place molecular data (i.e. mutations in a gene, sequences of DNA, proteins) in context, through the recovery of relevant information from reference databases (in silico) and its interpretation by implementing systems that support the diagnosis process (in info). The enormous complexity of cellular processes (metabolism, signal transduction, gene expression, and so on) needs the

development of new computational models and simulations to understand their overall behavior. The recent boost of systems biology and computational cell biology reflects this fact. The design of new computer-based methods in the Semantic Web for data recovery can contribute to the representation and computational analysis of biological knowledge at the POC. The knowledge generated will be integrated into computerized protocols for the diagnosis, treatment and management of patients (Figure 3). The combination of bioinformatics and biomedical computing tools will facilitate the development of diagnostic models, supported by new standards. These tools need to be linked by using standard medical terminologies and coding with clear semantics to facilitate effective implementation within clinical information systems.

3.4 Updating the work of the scientific areas

Recent developments in sequencing technologies and other advances in the nanotechnology sector have resulted in a huge increase in the volume of data and have also extended the research areas into smaller dimensions, ranging now from populations, through individuals, tissues and organs, down to the cellular and molecular levels, and further to the atomic level. This has brought about the need to extend or further refine the focus of the network in one or more of the following research directions, in step with advances in the different fields of interest:
- Disease-omics: explore the use of DNA ultrasequencing data and their potential clinical application.
- Pharma-informatics: explore the secondary use of clinical records for research purposes.
- Explore nanotechnology-related technologies and the associated informatics challenges, regarding their potential as a new domain for Bio/Medical Informatics.
- Further expand and develop a collaborative platform, compliant with Web 2.0 technologies, for horizontal biomedical knowledge management.

4. COMBIOMED horizontal aspects

Education has been a major concern of this network. COMBIOMED partners have initiated a series of training and education efforts, which resulted, for example, in the setup of graduate degrees in BMI and in focused training courses for various communities (medical, technological and research oriented). The development of a program for the subject of Biomedical Informatics, with the recommendation to introduce it into current degrees, is geared to promote

and increase the awareness of medical professionals of these technologies, and will lead to professionals better versed in *informatics and *omics. Dissemination of the network and its objectives, as well as of the work carried out by its different members, has also been one of the main goals. COMBIOMED partners have published and presented their work and the network in many international and national scientific journals and conferences. COMBIOMED has organized and promoted many scientific events, including, among others, BIOINFORSALUD 2009 and the 2nd, 3rd and 4th International Symposiums on Biomedical Informatics in Europe, held in 2009, 2010 and 2011 respectively.

Conclusions

The creation of the COMBIOMED Network represents a national and international reference in biomedical computing, which aims to provide solutions to the computational challenges posed by basic and translational research and by clinical practice in the context of the new personalized medicine. The most relevant research groups in Spain are cooperating to develop methods, systems, applications and pilot projects, and to yield educational recommendations to promote biomedical computing research in the coming years. More specifically, computational developments within the COMBIOMED Network allow advances in the representation and analysis of clinical and biomolecular knowledge, and the joint research will enable the new generation of miniaturized decision-support systems with obvious clinical applications in health at the point of care. The COMBIOMED Network is also involved in the promotion of educational and training programs on Biomedical Informatics for medical and biomedical professionals, as well as in the wide dissemination of its work among the international and national scientific communities.

Acknowledgments

The Cooperative Research Network COMBIOMED is funded by the Institute of Health Carlos III, Madrid, Spain. The leaders of the network research groups are: F. Martin, V. López-Alonso, A. Valencia, R. Guigó, F. Sanz, M. Orozco, A. Pazos, G. Bueno, L. Pardo, P. Larrañaga, J.A. Lozano, O. Coltell, V. Maojo, M. Robles and M.I. Loza.

References

[1] López V, et al. INBIOMED: a platform for the integration and sharing of genetic, clinical and epidemiological data oriented to biomedical research. 4th IEEE Intern. Symp. on BioInformatics and BioEngineering, 2004.
[2] NCRR webpage, available at: ards/ Accessed on 02/26/2009.
[3] NIH BISTI webpage. Accessed on 02/26/2009.
[4] The INFOBIOMED Network of Excellence webpage. Accessed on 02/26/2009.
[5] BIOSAPIENS Network of Excellence webpage. Accessed on 02/26/2009.
[6] Cooperative Thematic Research Network of Biomedical Informatics INBIOMED webpage. Accessed on 02/26/2009.
[7] Bioinformatics National Institute (INB) webpage. Accessed on 02/26/2009.
[8] Sadee W, Dai Z. Pharmacogenetics/genomics and personalized medicine. Hum Mol Genet. 2005;14.
[9] Lopez-Bigas N, Audit B, Ouzounis C, Parra G, Guigó R. Are splicing mutations the most frequent cause of hereditary disease? FEBS Lett. 2005.
[10] Krallinger M, Valencia A. Text-mining and information-retrieval services for molecular biology. Genome Biol. 2005;6(7):224.
[11] Jorgensen WL. The many roles of computation in drug discovery. Science. 2004;303.
[12] Butcher EC, Berg EL, Kunkel EJ. Systems biology in drug discovery. Nat. Biotechnol. 2004;22.
[13] The Spanish Technological Platform for Innovative Medicines webpage. Accessed on 02/26/2009.

[14] Alonso-Calvo R, Maojo V, Billhardt H, Martin-Sanchez F, García-Remesal M, Pérez-Rey D. An Agent- and Ontology-based System for Integrating Public Gene, Protein and Disease Databases. J. Biomed. Inform. 2007;40(1).
[15] Vissers LE, Veltman JA, van Kessel AG, Brunner HG. Identification of disease genes by whole genome CGH arrays. Hum Mol Genet.

Patch-Based Image Processing: A New Framework for Medical Imaging

Jose V. Manjón and Montserrat Robles
IBIME Group, ITACA, Universidad Politecnica de Valencia, Camino de Vera s/n, Valencia, Spain

Abstract. This paper gives a brief description of patch-based medical image processing techniques, showing their relation to the self-similarity property of natural objects and, therefore, of the images derived from them. Denoising, superresolution and segmentation of Magnetic Resonance images are presented to illustrate the capabilities of this new framework in medical imaging.

Keywords: Patch, self-similarity, pattern redundancy, fractal, non-local means.

1 Introduction

The image processing field has a relatively short but busy history, starting from the moment the first computers appeared. That history has been characterized by a natural evolution of the associated techniques, from the most basic operations to the most refined ones. Nowadays we are witnessing the next step in this evolution: patch-based processing, which naturally refines pixel-based techniques. This step seems natural if we analyze how nature has solved the problem of vision: the neurons that interpret visual signals work together (as a patch) to obtain a meaningful representation of the visual stimulus. Patch-based techniques can be traced back to the last years of the 20th century [1], but it was not until the publication of the seminal paper presenting the Non-Local Means filter by Buades [2] that patch-based techniques became widespread. In his work, Buades presented the seed of many image processing techniques such as denoising, deconvolution, superresolution, segmentation and demosaicking, among others. In this paper we present several techniques inspired by Buades's work in the field of medical imaging: denoising, superresolution and segmentation.

Advances in Biomedical Informatics: COMBIOMED

2 Patch-based image processing

Patch-based techniques are rooted in the natural self-similarity of patterns in images, which comes from their origin in the natural world. In other words, every pattern in an image belongs to a given object that has a high probability of being self-similar to other parts of the same object (edge parts or homogeneous parts; see Figure 1). This pattern redundancy can be seen as a fractal structure within the same dimension.

Fig. 1. Image of a bromeliaceous plant. On the right we can see some of its characteristic patterns. The whole plant can be approximately represented by displacements and rotations of its characteristic generative patterns.

If this pattern redundancy holds for natural objects like a plant, it also holds for human organs like the brain. This is why the self-similarity property of natural objects has been used in the medical image processing area. In the following, we show three examples of this self-similarity principle applied to three different medical image processing tasks: denoising, superresolution and segmentation.

2.1 Denoising

Image denoising consists of separating the noise n introduced by the measurement process from the original noise-free image x, given the acquired image y:

y = x + n    (1)

Such an operation can be performed in a number of different ways, but one of the most common is signal averaging, which consists in averaging similar pixels of the image. Depending on how these pixels are selected, different alternatives arise. For example, Gaussian smoothing is based on the fact that nearby pixels have a high probability of being good matches for the averaging. Although this is true for homogeneous regions, it does not hold for edge regions, resulting in blurred images. To solve this problem, Buades et al. [2] proposed their Non-Local Means (NLM) filter. This filter takes advantage of the high level of pattern redundancy in images, achieving high-quality denoising by averaging similar realizations of the noisy signal. The main assumption of the NLM filter is that similar patches have similar central pixels; besides, patch comparison provides a statistically robust measure of similarity when noise is present. Thus, the filtered pixel x̂_p is calculated as follows:

x̂_p = Σ_{q ∈ Ω} w(N_p, N_q) y_q    (2)

where Ω represents the search volume and the weights w(N_p, N_q) represent the similarity between the two patches N_p and N_q centered around pixels p and q. The similarity w is calculated as follows:

w(N_p, N_q) = (1 / Z(p)) e^{−d(N_p, N_q) / h²}    (3)

where Z(p) is the normalizing constant, h is an exponential decay control parameter related to the noise standard deviation, and d is a Gaussian-weighted Euclidean distance over all the pixels of each neighborhood:

d(N_p, N_q) = ‖G_ρ (Y(N_p) − Y(N_q))‖²_{R_sim}    (4)

where G_ρ is a normalized Gaussian weighting function with zero mean and standard deviation ρ (usually set to 1), which penalizes pixels far from the center of the neighborhood window by giving more weight to pixels near the center. Figure 2 shows an example of how similar patches are selected in a search window (blue area) in an MR image.
This filter has been extended to deal with Rician noise on MR images [3] and to deal with spatially varying noise levels and multicomponent image sequences [4,5].
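To make the averaging concrete, here is a minimal 1-D sketch of the NLM idea (a hypothetical illustration, not the authors' implementation: the function name is invented, the Gaussian kernel G_ρ is replaced by a uniform one, and borders are handled by replication):

```python
import math

def nlm_denoise_1d(y, patch_radius=1, search_radius=5, h=0.5):
    """Toy Non-Local Means on a 1-D signal.

    Each sample y[p] is replaced by a weighted average of samples y[q]
    in a search window; the weight decays exponentially with the squared
    distance between the patches centered at p and q (cf. Eqs. 2-3,
    with G_rho taken as uniform for simplicity).
    """
    n = len(y)

    def patch(c):
        # Patch around index c, replicating the borders.
        return [y[min(max(c + o, 0), n - 1)]
                for o in range(-patch_radius, patch_radius + 1)]

    out = []
    for p in range(n):
        np_, weights, values = patch(p), [], []
        for q in range(max(0, p - search_radius),
                       min(n, p + search_radius + 1)):
            d = sum((a - b) ** 2 for a, b in zip(np_, patch(q)))
            weights.append(math.exp(-d / (h * h)))  # unnormalized w(Np, Nq)
            values.append(y[q])
        z = sum(weights)  # normalizing constant Z(p)
        out.append(sum(w * v for w, v in zip(weights, values)) / z)
    return out
```

On a homogeneous region all patches match and the filter reduces to a plain average, while across an edge dissimilar patches receive negligible weight, which is what lets NLM preserve edges better than Gaussian smoothing.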

Fig. 2. Example of patch selection using NLM filtering. Similar patches (red squares) are averaged together within the search area (blue square).

2.2 Superresolution

In medical imaging, the image voxel size is limited by a number of factors such as the imaging hardware, the Signal-to-Noise Ratio (SNR), time limitations or the patient's comfort. In many cases, the acquired voxel size has to be decreased to meet a specific resolution requirement. In such situations, interpolation techniques have traditionally been used. However, such techniques invent new points by assuming that the existing ones (in the Low Resolution (LR) image) have the same values in the High Resolution (HR) image, which is only valid in homogeneous regions. As a result, interpolated images are typically blurred versions of their corresponding HR reference images. We have proposed a new method [6] that benefits from the high level of pattern redundancy to efficiently reconstruct an HR image from an LR image. In short, an LR MR image y can be related to the underlying HR image x through this expression:

y_p = (1/N) Σ_{i=1}^{N} x_i + n    (5)

where y_p is the observed LR voxel at location p, the x_i are the N HR voxels contained within this LR voxel, and n is some additive noise from the measurement process. This model assumes that LR voxels can be well modeled as the average of the corresponding HR voxel values.

Thus, the aim of any superresolution/interpolation method is to find the x_i values from the y_p values, which is a very ill-posed problem, since infinitely many sets of x_i values meet this condition. We have restricted the space of solutions by applying two constraints:

1. The reconstructed image has to be regular (self-similar). This is accomplished by applying a 3D variant of the Non-Local Means filter, which enforces structure preservation rather than imposing a smoothness constraint.
2. If the presence of noise in the LR image is minimized by applying an appropriate filter, it can be imposed as a new constraint that the downsampled version of the reconstructed image x̂ must be exactly the same as the original LR image y at every location p.

A full description of this iterative method can be found in [6]. Figure 3 shows the scheme of our superresolution approach, and Figure 4 shows an example of its application.

Fig. 3. Scheme of the proposed method.

Fig. 4. Example result of the superresolution method compared to linear interpolation.
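The second constraint can be sketched in a toy 1-D version (a hypothetical illustration under stated assumptions: the function name is invented, and a simple neighbor average stands in for the 3D Non-Local Means regularization of [6]):

```python
def upsample_consistent(lr, factor=2, n_iter=10):
    """1-D sketch of iterative reconstruction with the subsampling-
    consistency constraint of Eq. (5).

    After every regularization pass, the HR estimate is corrected so
    that each block of `factor` HR samples averages exactly to its
    corresponding LR sample.
    """
    hr = [v for v in lr for _ in range(factor)]  # nearest-neighbor init
    n = len(hr)
    for _ in range(n_iter):
        # Regularization step (placeholder smoother instead of 3D NLM).
        hr = [(hr[max(i - 1, 0)] + hr[i] + hr[min(i + 1, n - 1)]) / 3.0
              for i in range(n)]
        # Consistency step: remove each block's mean error w.r.t. LR data.
        for j, y in enumerate(lr):
            err = sum(hr[j * factor:(j + 1) * factor]) / factor - y
            for k in range(factor):
                hr[j * factor + k] -= err
    return hr
```

The correction step guarantees that, whatever the regularizer does, downsampling the result always reproduces the LR input exactly, which is the essence of constraint 2.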

2.3 Segmentation

The self-similarity property can be applied not only in the intra-image domain but also in the inter-image domain (i.e., different images can share similar patterns). This inter-image similarity can be exploited to perform segmentation tasks. As in template-warping methods, our proposed patch-based method [7] uses expert manual segmentations as priors in order to achieve the segmentation of anatomical structures. However, our method has two main differences with respect to template-warping methods: the scale of the considered objects and the label fusion scheme. First, while template-warping methods work at the anatomical structure level, our method deals with a finer scale by using patches. When the patch under study and a patch in a training subject are similar, their central voxels are considered to belong to the same structure, and this training patch is used to estimate the final label. In this way, several samples from each training subject can be used during label fusion, enabling a drastic increase in the number of sample patches involved in the label estimation. Second, template-warping methods usually use a majority voting scheme to fuse the labels, considering all labeled samples equally relevant. In our method, the intensity-based distances between the patch under study and the patches in the training subjects are used to perform a weighted label fusion. Basically, this approach consists in comparing the patches of a given search area with similar patches from a library of templates previously segmented by an expert, and assigning labels according to patch similarity. The process starts with a linear registration to the standard MNI space, followed by a region-of-interest selection and a search area definition based on the sum of label areas of the template library (see Figure 5).

Fig. 5. Initial steps of the hippocampus segmentation.

Once the search area is selected, for each voxel in this area an exhaustive search of similar patches within the template library is performed, and the final label of the current voxel is assigned as a weighted average of the template votes (Figure 6 shows an outline of the process).

Fig. 6. Outline of the label fusion scheme.

This process provides excellent results despite its simplicity. Furthermore, it can be compared to radiologist training, where students learn to identify patterns by studying multiple examples of the same anatomical part or type of pathology. Figure 7 presents example results as a function of the number of templates used to perform the classification, where κ represents Dice's coefficient of similarity.
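The weighted label fusion can be sketched as follows (a hypothetical minimal version: the function names are invented, `library` holds expert-labeled training patches paired with the label of their central voxel, the weight is an exponential intensity similarity, and `dice` computes the κ coefficient mentioned above):

```python
import math

def fuse_labels(target_patch, library, h=1.0):
    """Weighted label fusion sketch: every expert-labeled training patch
    votes for the label of its central voxel, weighted by its intensity
    similarity to the patch under study; the best-supported label wins."""
    votes = {}
    for patch, label in library:
        d = sum((a - b) ** 2 for a, b in zip(target_patch, patch))
        votes[label] = votes.get(label, 0.0) + math.exp(-d / (h * h))
    return max(votes, key=votes.get)  # label with the largest total weight

def dice(a, b):
    """Dice's similarity coefficient between two binary label masks."""
    inter = sum(1 for x, y in zip(a, b) if x == 1 and y == 1)
    return 2.0 * inter / (sum(a) + sum(b))
```

A patch whose intensities closely match a hippocampus training patch thus pulls the central voxel's label towards "hippocampus", and the agreement of the final segmentation with a manual one can be summarized with `dice`.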

Fig. 7. Example results of the segmentation method with different numbers of training subjects. As can be noticed, the higher the number of training subjects, the better the result; κ represents Dice's coefficient of similarity.

3 Conclusion

In this paper the self-similarity property of images has been described, and several examples of its application to solve medical imaging problems have been presented. This property is exploited by using patch-based techniques, which are able to capture local patterns for application to many different problems. It is not surprising that these techniques have obtained such good results if we further analyze their relation to the human visual system, where neighboring neurons work together to recognize patterns [8]. There is much more to be explored, since this is just the beginning of the patch era. We further suspect that multiscale and coupled analysis of image patches could give us a new framework for automatic image processing and analysis. This and other questions have to be explored in the future.

Acknowledgments. This work has also been supported by the Spanish Health Institute Carlos III through the RETICS Combiomed, RD07/0067/2001.

References

1. Efros A. and Leung T. Texture synthesis by non-parametric sampling. In Proc. Int. Conf. Computer Vision, volume 2, 1999.

2. A. Buades, B. Coll, J.M. Morel. A review of image denoising algorithms, with a new one. Multiscale Modeling and Simulation (SIAM interdisciplinary journal), 4(2).
3. José V. Manjón, José Carbonell-Caballero, Juan J. Lull, Gracián García-Martí, Luís Martí-Bonmatí, Montserrat Robles. MRI denoising using Non-Local Means. Medical Image Analysis, 12(4):514-23.
4. José V. Manjón, Neil A. Thacker, Juan J. Lull, Gracián García-Martí, Luís Martí-Bonmatí, Montserrat Robles. Multicomponent MR Image Denoising. International Journal of Biomedical Imaging.
5. José V. Manjón, Pierrick Coupé, Luis Martí-Bonmatí, Montserrat Robles, Louis Collins. Adaptive Non-Local Means Denoising of MR Images with Spatially Varying Noise Levels. Journal of Magnetic Resonance Imaging, 31.
6. José V. Manjón, Pierrick Coupé, Antonio Buades, Vladimir Fonov, D. Louis Collins, Montserrat Robles. Non-Local MRI Upsampling. Medical Image Analysis, 14(6).
7. Pierrick Coupé, José V. Manjón, Vladimir Fonov, Jens Pruessner, Montserrat Robles, D. Louis Collins. Patch-based Segmentation using Expert Priors: Application to Hippocampus and Ventricle Segmentation. NeuroImage, 54(2).
8. Thacker, N.A., Manjón, J.V., and Bromiley, P.A. A Statistical Interpretation of Non-local Means. IET Computer Vision, 4(3), 2010.

Automated System for Microscopic Image Acquisition and Analysis

Juan Vidal, Oscar Déniz, Jesús Salido, Noelia Vállez, M. Milagros Fernández, Carlos Aguilar, and Gloria Bueno. VISILAB, University of Castilla-La Mancha, Spain

Abstract. This document presents a complete automated system for microscopic imaging. The system is novel because two different digital cameras are attached to it, providing great flexibility to pathologists, since they can use it to digitize most of the slides commonly used in virtual microscopy. We have developed custom software that enables pathologists not only to control all the functionality provided by the microscope, but also to correct the acquired images in order to minimize the most common imaging errors, such as camera noise, non-uniform illumination, or camera misalignment. The hardware setup will be presented first. Then we will introduce our system, detailing the software that has been developed to control the automated system. Finally, we will show the techniques used to enhance the acquired images and the results that we have obtained using these techniques.

1 Introduction

A novel technology platform, called virtual microscopy, has enabled storage and fast dissemination of image data. Virtual microscopy encompasses the high-resolution scanning of tissue slides and cell preparations, and derived technologies including automatic digitalization and computational processing of whole microscopic slides, cytological preparations, tissue micro-arrays, and web-based accessibility and analyses [7], [9], [5]. However, the tools for processing and analyzing digital microscopic images are still poorly developed, and virtual microscopy tools have not been adopted as broadly as they should be, due to the significant challenges involving microscopic image data. These challenges include acquisition, efficient storage, visualization, registration, segmentation, classification, semantic annotation and data mining.
Furthermore, it is necessary to investigate high-performance computational infrastructures, as well as tools for grid and parallel computing, to efficiently process these high-resolution images together with the clinical data associated with them [4, 1, 3, 6]. The main contribution of this paper is to present the automated system built by our research group. We have focused on image acquisition and enhancement, since we noticed that the final results are very sensitive to any imperfection in this part of the process. Section 2 describes the general aspects of the hardware and

software system. Section 3 describes the digitization process and the enhancement techniques used, along with the results obtained. In Section 4 the main conclusions are drawn.

2 Hardware and Software Setup

The main task of a digital microscopy system is the digitization of tissue samples. In the field of pathologic anatomy, these samples usually correspond to thin tissue cuts that have been placed onto a glass slide. To protect the tissue from any degradation, a very thin paraffin sheet is placed over the tissue, ensuring that the tissue is stuck to the glass slide. When the slide is prepared, it is placed onto a motorized stage, so that it can be moved along any of the spatial axes. Once the slide is on the stage, it is moved along the vertical axis (to focus the tissue sample) or the horizontal plane (to place the stage on the region of interest). Finally, the region of interest is digitized by acquiring several images (tiles) and mosaicing them into one final image. It is usual that either the tiles or the final image are processed in some way to improve the results of the acquisition. When the slide has been digitized, it is analyzed by the pathologist, who may use an automated system to process the image and help with this work by giving measurements, auto-detecting potential regions of interest, etc. Digital microscopy systems have two different parts that fit the two main tasks in pathology imaging: acquisition and analysis. The former task involves the automation of slide digitalization, moving the stage along the spatial axes, changing objectives, and capturing tile images. The latter task is more related to software and involves more complex tasks, such as analyzing the images to evaluate the focus, correcting image defects, mosaicing the previously acquired tiles into a single image, or searching for regions of interest (ROIs).
Although the acquisition task is mainly related to hardware, the software that controls it is the key to building a useful and flexible system. There are several commercial systems available on the market. Most of them are bundled systems that feature fixed hardware and a software package. Sometimes different modules are available, so that customers can purchase only the ones that fit their interests. We have built a system based on a Leica DM6000B microscope with a Marzhauser motorized stage, to which we have attached two cameras (a 24bpp colour 1.3 MP Leica DFC 300 FX, and a 12bpp grayscale 1.3 MP Retiga SRV) and LED illumination. The implemented software is flexible enough to be used with several tissue types, both for light field and fluorescence. This software has also been designed and implemented with changing hardware in mind, so it can easily be adapted to different pieces of hardware too. Figure 1 shows the automated system.

(a) Digital Microscopy Hardware (b) Attached cameras detail. Fig. 1. Digital Microscopy System.

3 Digitization

3.1 Tile Acquisition

As mentioned in the previous section, our system features two different digital cameras, as well as LED illumination. Our aim is to work with either camera in any situation, so that pathologists do not have to care about particular camera details; they will just get 24bpp color images when digitizing a slide. These 24bpp images are obtained directly from the Leica camera, simply using white light as the illumination source. When using the Retiga camera, it is necessary to acquire three different images, using just one light channel (red, green, or blue) at a time. Then these 12bpp monochrome images are shifted to 8bpp, and finally combined into the final 24bpp image. The color camera is obviously faster, but the composed images have better quality, so the decision to use one or the other depends on the pathologist's needs. To ensure proper image quality, the acquired tiles must be focused. When focusing the image, it is necessary to have some criterion to move the stage along the Z axis and to evaluate the focus of the image [10, 8]. We have used the Smart Focus algorithm from the Matrox Imaging Library (MIL). This algorithm evaluates the edginess of the image. It starts by moving the stage to a start position (the first focus position) and establishing lower and upper limits around it. After that, an image is acquired and evaluated. Then the stage moves upwards and another image is taken and evaluated. If the value obtained for the second image is larger than the value for the previous image, then the focus position is the latter. The process continues until the algorithm obtains smaller evaluation values two consecutive times, or until one of the limits is reached. When one of these conditions happens, the stage goes back to the start position and repeats the process moving in the opposite direction.
When the positions have

been evaluated in both directions, the whole process is repeated in a narrower area around the best focus position, moving the stage more slowly. This autofocus algorithm does not ensure optimal results, although it provides fairly good results at a reasonable speed. It is important that the focus position lies between the established movement limits; otherwise it will never be reached. Choosing a start position near the focus position makes the algorithm faster, although it is not essential. Figure 2 illustrates the results of this algorithm. Fig. 2. Results of the autofocus algorithm on a TMA slide, using 5x magnification. After several focus tests, both manual and automatic, we discovered that our stage is slightly misaligned: the objectives' central axis and the stage are not perfectly perpendicular. This imperfection introduces two problems: the tiles are out of focus on one side of the image when the opposite side is focused, and vice versa, and they are also affected by a small geometric distortion. Furthermore, when we digitalized a small portion of a calibration grid slide, we noticed that, apart from the geometric distortion (which is not easily recognizable), there was a much more annoying rotation error. This error is produced because the cameras attached to the microscope are not perfectly fixed. We could have tried to calibrate them and operate carefully, but since any vibration could move the cameras a bit (and the mirror system that conducts the light to one camera or the other is operated manually), we decided that it would be better to have a fast calibration system that could calculate the camera misalignment and correct it when it appears. The correction algorithm that we have developed is really fast and simple to use, although it requires user interaction. To calibrate the rotation, the user has to choose a representative point of the image (x1, y1).
Then, the stage is moved along the X axis, and the same point is clicked by the user (since the stage has moved, that point will now have coordinates (x2, y2)). If the two points are perfectly aligned, then y1 = y2, since the stage was only moved along the X axis. However, if the points are not horizontally aligned, it is easy to compute the rotation angle α using the coordinates of the reference points, as in equation (1).

α = arctan((y2 − y1) / (x2 − x1))    (1)

When we solved this rotation problem, we started to work on correcting the stage inclination problem. Although the focus problem is only significant at high magnification (20x or higher), the geometric distortion presents a problem when the tiles are to be merged, since structures in one of the images are larger than in the other, and also slightly displaced from their correct position. The displacement was neither regular nor proportional to the position of the camera, so we ruled out a translation/rotation problem at this point. Instead, we noticed that the tiles seemed to be almost perfectly merged around the middle of the images, but as we moved towards the corners of the tiles, the quality of the mosaic dropped drastically. We discovered that the structures on one of the tiles tended to be below their correct position in the upper area of the image, but above their correct position in the lower area of the image. That is what drove us to realize that what we had to correct was a perspective problem. To correct these images, we use a warp mapping. Considering that the central point of the slide is the one whose position is well aligned with the camera, we deform the rectangular tile to make it trapezium-shaped. This trapezium will look bigger than the original tile on one side of the image, and smaller on the opposite one, whereas the size remains the same at the middle point. To perform this perspective correction, we use a warp matrix (2), and then we map each pixel in the destination image (xd, yd) to a pixel in the source image (xs, ys) (3). Since the mapping requires sub-pixel precision, we have used bilinear interpolation to compute each pixel's value.
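As a rough illustration, the rotation estimate and the destination-to-source perspective warp described above can be sketched in Python with NumPy. The function names and the bilinear sampling details are our own assumptions, not the authors' implementation (which relies on MIL), and the sketch assumes a single-channel (grayscale) tile.

```python
import numpy as np

def rotation_angle(p1, p2):
    """Camera rotation from two user clicks on the same feature after a
    pure X-axis stage move (arctan of the vertical over horizontal offset)."""
    (x1, y1), (x2, y2) = p1, p2
    return np.arctan2(y2 - y1, x2 - x1)  # robust form of arctan((y2-y1)/(x2-x1))

def warp_perspective(src, H, out_shape):
    """Map every destination pixel (xd, yd) through the 3x3 warp matrix H to a
    source position (xs, ys), sampling src with bilinear interpolation."""
    h_out, w_out = out_shape
    yd, xd = np.mgrid[0:h_out, 0:w_out]
    denom = H[2, 0] * xd + H[2, 1] * yd + H[2, 2]
    xs = (H[0, 0] * xd + H[0, 1] * yd + H[0, 2]) / denom
    ys = (H[1, 0] * xd + H[1, 1] * yd + H[1, 2]) / denom
    # integer corner and fractional part for bilinear interpolation
    x0 = np.clip(np.floor(xs).astype(int), 0, src.shape[1] - 2)
    y0 = np.clip(np.floor(ys).astype(int), 0, src.shape[0] - 2)
    fx = np.clip(xs - x0, 0.0, 1.0)
    fy = np.clip(ys - y0, 0.0, 1.0)
    # blend the four neighbouring source pixels
    return ((1 - fy) * ((1 - fx) * src[y0, x0] + fx * src[y0, x0 + 1])
            + fy * ((1 - fx) * src[y0 + 1, x0] + fx * src[y0 + 1, x0 + 1]))
```

With H equal to the identity matrix, the warp reproduces the source tile; a trapezium-shaped correction is obtained by choosing the c0, c1 entries of H to be slightly non-zero.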
[x y w]ᵀ = [[a0 a1 a2]; [b0 b1 b2]; [c0 c1 c2]] · [xd yd 1]ᵀ    (2)

xs = x/w = (a0·xd + a1·yd + a2) / (c0·xd + c1·yd + c2)    (3)
ys = y/w = (b0·xd + b1·yd + b2) / (c0·xd + c1·yd + c2)

After testing the corrections with several images, we observed that the tiles could be effectively merged without a noticeable edge in the area where the images meet. Figure 3 illustrates the results of the correction process. Apart from all the geometric corrections, it is important to carry out illumination correction. The illumination should be uniform across the acquired images, so that the background has the same colour all over the image. It is very common in microscopes that the illumination is not uniform, depending on the type of light and the conditions of the acquisition. This non-uniform illumination has a great impact on the acquired images, since color information may be altered (areas where the illumination is high will look brighter, whereas areas where the

illumination is poor will look darker). (a) No correction (b) Rotation only (c) Rotation and Perspective. Fig. 3. Impact of geometric corrections in the merge area of two images. In our case, we have LED illumination and a black, opaque case that covers the entire microscope, blocking all external light. We tested other illumination sources, such as transmitted light based on a mercury metal halide bulb with a liquid light guide and with Köhler light management, and the best results were obtained with LED. We have also developed a background subtraction method to make the illumination uniform all over the image. Our illumination correction system consists of a background division. First, we acquire a pattern image in a region of the slide where no tissue is present. Then we divide each acquired image by the pattern image, rescaling the result of the division afterwards. The result of that operation is an image whose color information has not been affected by the non-uniform illumination. To ensure better results, it is important to take clean pattern images. Our approach has been to use accumulative patterns. We take sixteen pattern images at slightly different positions of the slide, summing them all into another image. The result of the sum is another image, with a bit depth four bits greater than the original images. Moreover, by using accumulation patterns we minimize the noise present on the slide, such as dust or scratches, and keep only illumination information. There is one interesting aspect to consider here: since we only move the stage while acquiring the accumulative pattern, the noise that comes from the camera is present in all the sub-patterns, and so it is present in the accumulation pattern too. Rather than being a problem, this is helpful, because the camera noise will also be removed when performing the illumination correction. Figure 4 shows the result of applying this illumination correction.
Finally, it is worth mentioning that it is convenient to take the pattern image using the same slide that is to be digitized when high color fidelity is required, because even the thickness of the protective paraffin sheet may have some minor impact on the light that the camera receives, and thus on the color of the corrected image.
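The background-division scheme just described can be sketched as follows. This is a simplified illustration under stated assumptions (hypothetical function names, 8-bit single-channel images, unit-mean normalisation of the pattern), not the exact VISILAB implementation.

```python
import numpy as np

def build_pattern(blank_tiles):
    """Accumulate several pattern images taken at tissue-free stage positions.
    Summing sixteen 8-bit images yields a pattern four bits deeper, and the
    accumulation suppresses slide noise such as dust or scratches."""
    return np.sum(np.asarray(blank_tiles, dtype=np.float64), axis=0)

def correct_illumination(tile, pattern):
    """Background division: divide the tile by the unit-mean illumination
    field derived from the accumulated pattern, then rescale back to 8 bits."""
    flat = pattern / pattern.mean()                 # unit-mean illumination field
    corrected = tile.astype(np.float64) / np.maximum(flat, 1e-6)
    return np.clip(np.round(corrected), 0, 255).astype(np.uint8)
```

Because the fixed-pattern camera noise appears both in the tile and in the accumulated pattern, the division removes it along with the illumination gradient, as the text notes.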

(a) No correction (b) Illumination correction. Fig. 4. Result of applying illumination correction to autopsy tiles.

3.2 Tile Stitching

Once all tiles have been acquired, they must be stitched and combined into one single image. Depending on the magnification of the objective used in the digitalization, and also on the size of the digitalized region, the whole image may be huge (several gigabytes). We have developed a stitching algorithm that can compose images of any size, regardless of the physical memory of the computer on which it runs. This algorithm uses the overlapping part of each pair of tiles to compute the best matching position, both in rows and columns. Finally, it builds the merged image one line at a time, avoiding memory problems when stitching the tiles.

4 Image Processing

Finally, once the mosaic is finished, the images are ready to be examined by the pathologist. It is very helpful for pathologists to have processing algorithms that assist them in their job. We have developed a few algorithms that work with the tissue types which we currently have in our laboratory. These algorithms are focused on the automatic search for Regions of Interest (ROIs). The approach followed with the tissue types that have already been tested is based on blob analysis. In autopsy images, we look for dark spots corresponding to the biomarker response. These spots have a brown tonality which is significantly different from all other areas, which are pink coloured. There are other dark spots which are purple, but they are usually smaller than the ones of interest, and more rounded. Thus the segmentation is performed using the colour and size information. The accuracy increases with the magnification, varying from 70-80% accurate detections and 15-20% false positives at 2.5x magnification, to 95% accurate detections and less than 3% false positives at 40x magnification. The first step for ROI detection in cytology and TMA tissue slides is similar.
This step consists of detecting the main tissues, which are one or several rounded

portions or tissue cores. In the case of cytology samples, when a dark stain is used, such as Papanicolau, the tissue core can easily be distinguished from the background using colour and shape information. By thresholding the image with colour information, the core is segmented with fairly good accuracy. Furthermore, morphologic operations have also been used in order to improve blob compactness in cases where the tissue is not as coloured as might be expected. When a weak stain is used, such as TSA, the detection of the core is far more difficult, because it usually cannot be distinguished from the background. In fact, most automatic systems do not detect these tissue cores [2]. Using colour information to threshold the image just produces separate small blobs. The expectation of an almost perfect circle shape may be helpful, although we need to know where the boundaries are located. The approach we have used here has been the same as with dark stain, but using more iterations in the morphologic operations. Results are similar to the dark-stained cytology, except that we obtain a digitalized ROI slightly larger than the actual tissue core. However, this is not a problem since, as mentioned before, it is preferable to have a larger digitalized area than to lose any tissue information. The procedure applied to TMA slides is similar to that for dark-stained cytology samples. The stain is usually strong, but the regions where tissue cores are located are smaller. The shape of the tissue cores is also more variable, most of them being rounded, but there are also many stripe-shaped ones due to broken tissue cores. The main characteristic of these slides is that the tissue cores are aligned in a two-dimensional array. We have used the same procedure as described above. A first segmentation is made based on colour, and then morphologic operations are applied. The morphologic operations are constrained to the size and shape of the cores to avoid merging two of them.
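The blob-analysis pipeline described above (colour thresholding, morphologic operations to improve blob compactness, and size/shape filtering) can be sketched as follows. The specific colour thresholds and sizes are illustrative assumptions, not the values used in the paper, and the function name is ours.

```python
import numpy as np
from scipy import ndimage

def detect_rois(rgb, min_size=50, close_iters=2):
    """Colour-threshold a tile, close small gaps, and keep blobs above min_size.

    Returns the labelled image and the list of label ids kept as ROIs.
    The dark/red-dominant threshold below is a stand-in for the brown
    biomarker tonality; real thresholds would be tuned per stain."""
    r = rgb[..., 0].astype(int)
    g = rgb[..., 1].astype(int)
    b = rgb[..., 2].astype(int)
    mask = (b < 120) & (r > g)                      # dark, red-dominant pixels
    mask = ndimage.binary_closing(mask, iterations=close_iters)  # compactness
    labels, n = ndimage.label(mask)                 # connected components (blobs)
    sizes = ndimage.sum(mask, labels, range(1, n + 1))           # blob areas
    keep = [i + 1 for i, s in enumerate(sizes) if s >= min_size]
    return labels, keep
```

Increasing `close_iters` mimics the extra morphologic iterations used for weakly stained (TSA) cores, at the cost of a slightly enlarged ROI, which the text argues is acceptable.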
Figure 5 shows the ROI detection for different tissue types: (a) & (b) show the ROIs in autopsy images at 20x; (c) & (d) show an example of ROI detection in a TMA. (a) Original autopsy (b) ROI detection of (a) (c) Original TMA (d) ROI detection of (c). Fig. 5. Results of ROI detection based on blob analysis applied to microscopic images. Further work is being carried out on incorporating more processing tools for the specific ROIs inside the TMA cores, cytology cores, biopsy and FISH tissue samples. In order to deal with FISH, we have modified the acquisition system to support longer exposure times and to control the light filter cubes of the microscope. In these images, colour is the most important feature to look at, since ROIs are sensitive to different wavelengths. The last tissue type that we are incorporating into our system is prostate biopsy slides. The ROIs are more complex than in previous tissue samples, since they are composed of several parts, usually a glandular lumen, a glandular border, and a group of cells. The shape of these ROIs is also very variable, which makes them very difficult to segment. We have recently been using the Insight ToolKit (ITK) with prostate slides, in contrast to the previously used framework (Matrox Imaging Library). ITK provides several segmentation methods, of which level sets are the most promising to accomplish our aim in biopsy analysis.

5 Conclusion

This paper presents the automated system that we have developed to acquire and analyse microscopic images for anatomical pathology applications. We have also shown the methods that we have used to make the system as fast, robust, and flexible as possible when dealing with these high-resolution images (several gigabytes). The aim is for the system to be flexible enough to be used with several tissue samples, both from light field and fluorescence, and to treat them in a consistent and similar way, taking advantage of common features and using specific tissue features when necessary. Moreover, the system is also capable of simultaneously integrating different software tools for image analysis.

Acknowledgments

The authors want to thank Dpto. Anatomía Patológica, Hospital General de Ciudad Real, Spain. This work was partially supported by project DPI from the Spanish Ministry of Science.

References

1. G. Bueno, O. Déniz, J. Salido, and M. García-Rojo. Image processing methods and architectures in diagnostic pathology. Folia Histochem et Cytobiol, 47(4).
2. G. Bueno, R. González, O. Déniz, J. González, and M. García-Rojo. Colour model analysis for microscopic image processing.
Diagnostic Pathology, 3(Suppl 1).
3. C. Daniel, M. García-Rojo, K. Bourquard, D. Henin, T. Schrader, V. Della Mea, J. Gilbertson, and B.A. Beckwith. Standards to support information systems integration in anatomic pathology. Arch Pathol Lab Med, 133(11).
4. M.J. Donovan, J. Costa, and C. Cordon-Cardo. Systems pathology: a paradigm shift in the practice of diagnostic and predictive pathology. Cancer, 115(13), 2009.

5. M. García-Rojo, G. Bueno García, C. Peces Mateos, J. González García, and M. Carbajo Vicente. Critical comparison of 31 commercially available digital slide systems in pathology. International Journal of Surgical Pathology, 14(4).
6. M. García-Rojo, V. Punys, J. Slodkowska, T. Schrader, C. Daniel, and B. Blobel. Digital pathology in Europe: coordinating patient care and research efforts. Stud Health Technol Inform, 150.
7. G. Kayser, D. Radziszowski, P. Bzdyl, R. Sommer, and K. Kayser. Theory and implementation of an electronic, automated measurement system for images obtained from immunohistochemically stained slides. Anal Quant Cytol Histol, 28(1):27-38.
8. K. Kayser, J. Görtler, and K. Metze. How to measure image quality in tissue-based diagnosis (diagnostic surgical pathology). Diagnostic Pathology, 3(Suppl 1).
9. M. Lundin, J. Szymas, E. Linder, H. Beck, P. de Wilde, H. van Krieken, M. García-Rojo, I. Moreno, A. Ariza, S. Tuzlali, S. Dervisoǧlu, H. Helin, V.P. Lehto, and J. Lundin. A European network for virtual microscopy: design, implementation and evaluation of performance. Virchows Arch, 454(4).
10. H. Xie, W. Rong, and L. Sun. Construction and evaluation of a wavelet-based focus measure for microscopy imaging. Microscopy Research And Technique, 70, 2007.

From Bayesian to module networks: experiences modeling gene relationships with expression data

Borja Calvo 1, Rubén Armañanzas 2, Iñaki Inza 1, and José A. Lozano 1. 1 Intelligent Systems Group, Computer Science Faculty, Paseo Manuel de Lardizabal 1, Donostia-San Sebastián, Gipuzkoa, Spain. WWW home page: 2 Computational Intelligence Group, Facultad de Informática, Campus de Montegancedo, Boadilla del Monte, Madrid, Spain. WWW home page:

Abstract. This work summarizes the 8-year experience of the Intelligent Systems Group in the use of probabilistic graphical models to learn gene relationships (also known as genetic or gene regulatory networks) from expression data. From our initial work using classical Bayesian network learning techniques proposed in the 90s, to our current developments on module networks, our research group has tried different strategies to solve the challenging problem posed by the huge dimensionality and low number of available samples in gene expression experiments. Together with the methodological contributions of a set of data modeling techniques, two real applications, to colorectal cancer and multiple sclerosis, have been conducted. Both have been patented by regional public health-care institutions.

Keywords: gene expression data, gene networks, gene regulatory networks, gene relationships, Bayesian networks, module networks

1 Preliminaries

The huge quantity of data generated by cutting-edge technologies (such as DNA microarrays, qPCR or mass spectrometry) has opened new perspectives and opportunities in biology, medicine and pharmacology research. Shifting from classical hypothesis-driven research projects, which were focused on particular ideas or hypotheses, these technologies have opened a data-driven perspective that guides current research projects towards the discovery of novel biomarkers and previously unknown interactions.
During the last decade, this has dramatically raised the prominence of bioinformatics [3] as a key discipline in molecular biology laboratories and research projects, where the data-mining and machine-learning areas have played a crucial role.

Among the large battery of problems addressed by modern bioinformatics studies, DNA microarrays [4] have played an outstanding role. The availability of genomic expression data is having a profound impact on the understanding of the molecular processes and mechanisms of many diseases. In this way, together with the detection of possible biomarkers for different diseases, the modeling of gene interactions has attracted the attention of the bioinformatics community during the last decade. As a natural approach to learning the so-called genetic networks (or gene regulatory networks) from data, the data-mining and machine-learning communities have proposed the use of probabilistic graphical models (PGMs) [10, 11]. Among PGMs, Bayesian networks (BNs), due to their model transparency and resources for reasoning inside the model, have been a clear candidate to deal with the problem of learning genetic networks. However, the vast amount of data involved in DNA microarray experiments (characterised by a huge dimensionality and a scarce number of available samples) renders the direct use of most classic BN learning algorithms over gene expression data unfeasible [11]. The bioinformatics community quickly realized that the application of classic BN learning algorithms [15], most of them proposed in the 90s when this type of domain was not available, was problematic in this scenario. Some of the reasons are: state-of-the-art algorithms do not scale to such high-dimensional datasets; the limited number of samples and the inherent noise of microarray experiments dramatically hinder the robustness and stability of the learned models; and algorithms tend to include all the genes in their graphical structure, even though many of them are of no interest to the problem. In an attempt to address these issues, the bioinformatics community has proposed a set of methodologies which allow the use of BNs in such domains in order to find biologically accurate networks.
Although the proposed algorithms [2, 19-22] cover a large set of data analysis resources, they share a common set of interesting characteristics: instead of trying to model the possible interactions of the whole set of genes, the proposed techniques try to identify statistically and biologically significant gene substructures (or subnetworks); the complexity of the model is limited in different ways, such as only considering a subset of variables as potential parents, or limiting the number of parents per variable; and, in order to enhance the robustness of the final structures, ensemble and consensus approaches are used to combine different models. Guided by these lines of work and our interest in the bioinformatics discipline, our research group, which has an extensive tradition in the proposal of BN learning algorithms (see the group website), has covered, during the last years, several lines of research on the development of genetic network learning techniques by means of PGMs. After briefly introducing the

basic concepts around BNs, in Section 3 we will present an ensemble methodology for learning BNs from gene expression data. In this section we will also mention two real applications of this method (in colorectal cancer and multiple sclerosis), both patented by regional health-care institutions. Finally, in Section 4 we will discuss our ongoing work on module network learning from data.

2 Bayesian networks

Bayesian networks (BNs) are probabilistic graphical models represented by a graph and a set of parameters. The graph represents the conditional (in)dependencies considered in the model, and it allows us to factorize a joint probability distribution in a feasible way, with the advantage of reducing the number of parameters (associated with conditional probability distributions) to be computed from data [10]. Depending on the number of variables of the domain, the graph of a BN shows a compact and intuitive structure, reflecting the probabilistic relationships between variables and allowing a general view of the problem. In a BN, each variable Xi is associated with a node in the graph and with a conditional probability distribution p(Xi = xi | Pai = pai), where Pai is the set of parents of Xi in the graphical structure. In genetic networks, genes are associated with variables, and they are commonly discretized into three intervals representing underexpression, baseline and overexpression [4]. The joint probability distribution encoded by a BN follows the expression p(x) = p(x1, ..., xn) = ∏_{i=1}^{n} p(xi | pai). In this way, to specify a BN, the conditional probability of each variable Xi taking its k-th value, given that its parent set Pai takes its j-th configuration, needs to be computed from the available data. As an example, Figure 1 contains a BN formed by five dichotomic variables X1,...,X5. In Figure 1.a the corresponding acyclic graph is displayed, while on the right side, in Figure 1.b, we illustrate the list of parameters to be estimated.
Notice that the remaining parameters are directly computed as one minus the corresponding value for each variable. At the bottom, the joint probability factorization is given. In the case of a non-simplified probability distribution p(x1,...,xn), the number of parameters to be computed would reach 31 (that is, 2^5 − 1), while with the factorisation given by the graphical structure we only have to assess 11 values (this reduction becomes more pronounced as the number of variables increases). The reduction in the number of parameters to be computed from data is one of the key characteristics of BNs, which makes them quite attractive for genetic network modeling. In this way, a limited (but more robust) set of parameters simplifies the initial joint probability distribution. Once the BN is built, it constitutes an efficient device for probabilistic inference [15]. It gives us the chance to assess a probability distribution over some genes of interest, given evidence of the value of some other genes in the net.
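The parameter counting above (31 free parameters for the full joint over five binary variables versus 11 for the factorised model) can be checked with a short sketch. The structure follows Figure 1, while the function names and CPT encoding are our own illustrative choices.

```python
from itertools import product

# Parents of each variable in the Fig. 1 structure: X2 and X3 depend on X1,
# X4 on X2 and X3, X5 on X3 (indices are 1-based, as in the text).
parents = {1: [], 2: [1], 3: [1], 4: [2, 3], 5: [3]}

def n_parameters(parent_sets):
    """Free parameters of a BN over binary variables: one per parent configuration."""
    return sum(2 ** len(pa) for pa in parent_sets.values())

def n_parameters_full(n):
    """Free parameters of an unfactorised joint over n binary variables."""
    return 2 ** n - 1

def joint(x, cpt):
    """Factorised joint p(x1..x5) = prod_i p(x_i | pa_i).
    cpt maps (i, x_i, tuple of parent values) -> probability (hypothetical CPTs)."""
    prob = 1.0
    for i, pa in parents.items():
        prob *= cpt[(i, x[i - 1], tuple(x[j - 1] for j in pa))]
    return prob
```

Running `n_parameters(parents)` gives 11 and `n_parameters_full(5)` gives 31, matching the reduction stated in the text; summing `joint` over all 32 configurations of any valid CPT returns 1, confirming the factorisation defines a proper distribution.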

[Figure 1 here: (a) Bayesian network structure; (b) the table of parameters to be estimated.]

Fig. 1. Achieved joint probability factorisation with the attached Bayesian network: p(x_1, x_2, x_3, x_4, x_5) = p(x_1) p(x_2 | x_1) p(x_3 | x_1) p(x_4 | x_2, x_3) p(x_5 | x_3). Note that each variable X_i has two values, denoted by x_i and ¬x_i.

The structure and conditional probabilities necessary for characterising the BN can be provided either externally by experts (time consuming and subject to mistakes) or by automatic learning from a database of cases. Our approaches are limited to the latter case: learning BN structures from available gene expression data by machine learning procedures.

3 Extensions of Bayesian networks to model gene expression data: ensemble networks

Motivated by the exposed problems for learning BNs from gene expression data, and taking the exposed state-of-the-art alternatives into account, we proposed a methodology called Induction of reliable BNs by ensemble learning [1]. Its main objective is to find a parsimonious set of relevant genes and probabilistic gene relationships in the problem under study. In order to achieve this objective, the proposed method has the following characteristics: as is commonly done on domains with a low number of samples, and in order to enhance the robustness of the final models, a stratified bootstrap resampling procedure [5], randomly sampling intermediate datasets with replacement (in our experiments, 1,000), is conducted; in order to detect a reduced set of genes which show a notable degree of correlation with the disease under study [23], a multivariate feature selection process [8] is conducted on each of the intermediate bootstrapped datasets.
By means of the multivariate CFS (Correlation-based Feature Selection) technique [8], the selected genes are differentially expressed between phenotypes, while redundancies among them are avoided; a k-dependence BN classifier [24] is learned for each resampled dataset, reduced to the subset of selected genes. This classifier limits the complexity degree of the probabilistic dependency relationships in its associated BN

structure, allowing a maximum number of k parents per gene (in our domains, k was fixed to 4); based on the k-dependence BN structures learned on each resampled dataset, the frequency of appearance of each arc between two genes is used to assign a confidence level to that arc. Depending on the confidence level fixed by a biologist or physician, the final ensemble-consensus model can vary from a very simple structure including a small set of highly reliable dependencies to a structure with hundreds of interactions with different degrees of robustness. The intermediate models created on each resampled dataset are thus combined into a final, more robust, consensus model.

3.1 Identification of a biomarker panel for colorectal cancer diagnosis

Together with the Gaiker BioTechnology Center and the Basque Institute for Health Research (BIOEF), a multidisciplinary research project on colorectal cancer diagnosis and biomarker identification was carried out in the period [7]. The role of each research group was divided in the following way: tissue samples were obtained at Cruces Hospital (BIOEF, Bilbao, Basque Country, Spain), Gaiker carried out the gene expression and qRT-PCR monitoring processes, and our university research group conducted the bioinformatics analysis. The study was carried out on a total of 31 tumoral samples, corresponding to different stages of the disease, and 33 non-tumoral samples. It was performed by hybridisation of the tumour samples against a reference pool of non-tumoral samples using Agilent Human 1A 60-mer oligo microarrays. After a pre-processing step to ensure data quality, the final dataset comprised a total of 8,104 gene probes. The bioinformatics analysis was carried out by means of the exposed Induction of reliable BNs by ensemble learning process [1], taking two phenotypes into account (healthy control versus tumoral). The results obtained were validated by qRT-PCR.
Apart from the characterization of a reduced diagnostic biomarker gene panel, competitive discriminative accuracy results between the studied phenotypes were obtained. Figure 2 shows the graphical structure of discovered gene relationships in one of the induced BN models. The results of the exposed research project have been patented by the three institutions involved at the European Patent Office under the title Methods and kits for the diagnosis and the staging of colorectal cancer (publication number WO/2010/034794) [6].

3.2 Identification of a biomarker panel for multiple sclerosis patients

Together with the BioDonostia Research Institute, a multidisciplinary research project for the discovery of a panel of biomarkers in multiple sclerosis patients has been carried out in the period [17]. The study has been based on the monitoring of a recently identified gene expression modulator: the microRNA

Fig. 2. Graphical structure of one of the induced final BN models for the colorectal cancer diagnosis domain (healthy control versus tumoral status). Arcs reflect direct probabilistic dependencies between selected genes. The number on each arc reflects the number of times that arc has been modeled among the resampled BN classifiers in the exposed ensemble learning process.

or miRNA. It has been predicted that miRNAs may regulate around 30% of all cellular mRNA, suggesting that these molecules play a critical role in virtually all cellular functions [12]. The goal of the project was to analyze the possible role of miRNAs in the molecular mechanisms implicated in multiple sclerosis: not only to discriminate between healthy controls and patients, but also for the staging of the relapse and remitting phases of the disease. The role of each research group was divided in the following way: the collection of blood samples and the microRNA and qRT-PCR monitoring were the responsibility of the Neurology Department of Hospital Donostia - BioDonostia (Donostia - San Sebastián, Basque Country, Spain), and our university research group conducted the bioinformatics analysis. The study was carried out on a total of 9 patient samples in remission, 4 patients during a relapse before the administration of steroids, and 8 healthy volunteers. MicroRNA monitoring was performed by qPCR using the Taqman Low Density Array (TLDA) Human MicroRNA Panel v1.0 from Applied Biosystems. The expression patterns of 364 miRNAs were monitored for each sample. The bioinformatics analysis was carried out by means of the exposed Induction of reliable BNs by ensemble learning process [1], taking three phenotypes into account (healthy control, and relapse and remitting disease stages). The results obtained were validated by qRT-PCR. Apart from the characterization of a reduced biomarker miRNA panel, competitive discriminative accuracy results between the three studied phenotypes were obtained.
Figure 3 shows the graphical structure of one of the induced BN models. The results of the exposed research project have been patented by the two institutions involved at the European Patent Office under the title Methods for the diagnosis of Multiple Sclerosis based on its microRNA expression profiling (publication number WO/2011/003989) [18].
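The ensemble procedure of Section 3, used in both studies above, can be condensed into a sketch. Here `learn_structure` is a placeholder for the feature selection plus k-dependence BN learning steps, and all names are illustrative, not the authors' implementation:

```python
import random
from collections import Counter

def ensemble_arc_confidence(data, labels, learn_structure,
                            n_boot=1000, seed=0):
    """Bootstrap ensemble sketch: resample with replacement, learn a
    structure on each resample, and score each arc by its frequency of
    appearance across the learned structures.
    learn_structure(data, labels) -> set of (parent, child) arcs."""
    rng = random.Random(seed)
    counts = Counter()
    n = len(data)
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        boot = [data[i] for i in idx]
        boot_y = [labels[i] for i in idx]
        for arc in learn_structure(boot, boot_y):
            counts[arc] += 1
    # confidence of each arc = fraction of resamples containing it
    return {arc: c / n_boot for arc, c in counts.items()}
```

Thresholding the returned confidences at the level chosen by the biologist or physician yields the final consensus structure.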

Fig. 3. Graphical structure of one of the induced BN models for the multiple sclerosis domain (relapse versus remission status). Arcs reflect direct probabilistic dependencies between selected miRNAs. The number on each arc reflects the number of times that arc has been modeled among the resampled BN classifiers in the exposed ensemble learning process.

4 Current work: learning module networks

The interest of our group in probabilistic graphical models, together with their suitability to cope with the problems posed by our collaborators in the computational biology domain, has motivated us to take a step forward in the use of PGMs to model gene regulatory networks. In the introduction we saw that, due to the complexity of the typical datasets in this domain, simplified versions of BNs are needed to efficiently model the gene interactions. Apart from our ensemble approach, we have already exposed other sorts of simplifications in previous sections. Another interesting strategy to reduce the complexity of the models consists in grouping variables together. This is the approach used in [13] and the main idea behind module networks [25]. We have focused our current work on module networks because they propose a formalized framework, they graphically represent both the grouping of variables and the relationships between them, and they have been proved to suit the problem of genetic networks [26]. Despite the suitability of this formalism to model certain types of problems [26, 27, 16], few algorithms have been developed to induce module networks from data. Most of the effort has been made in the gene regulatory network domain [9, 14]. However, to the best of our knowledge, no general purpose algorithms have been proposed other than that in [25].
The basic idea behind module networks is simple yet useful. We assume that two or more variables in the dataset come from the same probability distribution; in other words, they are conditioned by the same set of variables (parents in BNs, regulators in genetic networks) in exactly the same way (they share the set of conditional distribution parameters). These variables are grouped into modules, which replace them in the model, reducing the number of parameters needed. Figure 4 shows an example of a module network.

Fig. 4. Example of a module network with seven variables and four modules. As we can see in the figure, the variables included in a module share the list of parents and the parameters (represented as θ in the figure).

In contrast to BNs, module networks reflect two quite different aspects. On the one hand, as in BNs, the model tells us the strongest conditional relationships between variables (represented by the arcs). On the other hand, module networks also inform us about which variables have a similar behaviour. The latter is a very important point in genetic networks, as we expect to obtain coherent groups (genes involved in the same process, for instance). State-of-the-art algorithms do not take this duality into account: they guide the search by scores that are influenced by both aspects of the model, but they do not allow controlling the influence of each one. Therefore, we are currently working on a module network learning algorithm that guides the search by evaluating these two aspects separately. Quite interestingly, we have seen that the likelihood of the data given a model can be decomposed into the two scoring functions we have proposed. Having the evaluation of the two aspects of a module network separated into two different terms opens the door to a whole new set of algorithms. Of particular interest are optimization algorithms that, instead of optimizing a combined search function, try to optimize the two functions simultaneously. Therefore, we have developed a new multi-objective optimization algorithm to learn module network structures from a dataset. We have evaluated this algorithm on synthetic datasets, where we have obtained promising results. Currently, we are testing the algorithm on a public microarray dataset, and our aim is to use it in different real applications proposed by our collaborators, where microRNA (parents or regulators in the network) and gene expression data from the same patients are available.
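The parameter sharing that defines a module network can be illustrated with a minimal sketch (the class, gene names and parameter values are hypothetical, used only to show the reduction):

```python
# Variables grouped into a module share both their parent set and their
# conditional-parameter set, so the number of parameter sets drops from
# one per variable (as in a plain BN) to one per module.
class Module:
    def __init__(self, variables, parents, theta):
        self.variables = variables  # variables assigned to this module
        self.parents = parents      # shared parent set (regulators)
        self.theta = theta          # shared conditional parameters

def n_parameter_sets(modules):
    # a plain BN needs one parameter set per variable;
    # a module network needs only one per module
    return len(modules)

modules = [
    Module(["g1", "g2", "g3"], [], {"p": 0.2}),
    Module(["g4", "g5"], ["g1"], {"p": 0.7}),
]
print(sum(len(m.variables) for m in modules))  # 5 variables
print(n_parameter_sets(modules))               # but only 2 parameter sets
```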

References

1. Armañanzas, R., Inza, I., Larrañaga, P.: Detecting reliable gene interactions by a hierarchy of Bayesian network classifiers. Computer Methods and Programs in Biomedicine 91(2) (2008)
2. Badea, L.: Inferring large gene networks from microarray data: a constraint-based approach. In: Proceedings of the Workshop on Learning Graphical Models for Computational Genomics at the Eighteenth International Joint Conference on Artificial Intelligence (2003)
3. Baldi, P., Brunak, S.: Bioinformatics: the Machine Learning Approach. MIT Press (2001)
4. Causton, H., Quackenbush, J., Brazma, A.: Microarray Gene Expression Data Analysis. Blackwell Publishing (2003)
5. Efron, B.: Bootstrap methods: another look at the jackknife. Annals of Statistics 7, 1-26 (1979)
6. García, A., Suárez, B., Betanzos, M., Vivanco, G.L., Armañanzas, R., Inza, I., Larrañaga, P.: Methods and kits for the diagnosis and the staging of colorectal cancer (2010). International Patent Application no. PCT/EP2009/, Publication no. WO/2010/
7. Garcia-Bilbao, A., Armañanzas, R., Izpizua, Z., Calvo, B., Alonso-Varona, A., Inza, I., Larrañaga, P., López-Vivanco, G., Suárez-Merino, B., Betanzos, M.: Identification of a biomarker panel for colorectal cancer diagnosis. BMC Cancer (2011), submitted
8. Hall, M.A., Smith, L.A.: Feature subset selection: a correlation based filter approach. In: Proceedings of the Fourth International Conference on Neural Information Processing and Intelligent Information Systems (1997)
9. Joshi, A., Van de Peer, Y., Michoel, T.: Analysis of a Gibbs sampler method for model-based clustering of gene expression data. Bioinformatics 24(2) (2008)
10. Koller, D., Friedman, N.: Probabilistic Graphical Models. MIT Press (2009)
11. Larrañaga, P., Inza, I., Flores, J.: A guide to the literature on inferring genetic networks by probabilistic graphical models. In: Data Analysis and Visualization in Genomics and Proteomics (2006)
12. Lewis, B., Burge, C., Bartel, D.: Conserved seed pairing, often flanked by adenosines, indicates that thousands of human genes are microRNA targets. Cell 120 (2005)
13. Li, G., Leong, T.Y.: A framework to learn Bayesian network from changing, multiple-source biomedical data. In: Proceedings of the 2005 AAAI Spring Symposium on Challenges to Decision Support in a Changing World (2005)
14. Michoel, T., Maere, S., Bonnet, E., Joshi, A., Saeys, Y., Van den Bulcke, T., Van Leemput, K., van Remortel, P., Kuiper, M., Marchal, K., Van de Peer, Y.: Validating module network learning algorithms using simulated data. BMC Bioinformatics 8(S-2) (2007)
15. Neapolitan, R.: Learning Bayesian Networks. Prentice Hall (2003)
16. Novershtern, N., Itzhaki, Z., Manor, O., Friedman, N., Kaminski, N.: A functional and regulatory map of asthma. American Journal of Respiratory Cell and Molecular Biology 38(3) (2008)
17. Otaegui, D., Baranzini, S., Armañanzas, R., Calvo, B., Muñoz-Culla, M., Khankhanian, P., Inza, I., Lozano, J., Asensio, A., Castillo-Triviño, T., Olascoaga, J., de Munain, A.L.: Differential micro RNA expression in PBMC from multiple sclerosis patients. PLoS ONE 4(7), e6309 (2009)

18. Otaegui, D., de Munain, A.L., Olascoaga, J., Calvo, B., Armañanzas, R., Inza, I., Lozano, J.: Methods for the diagnosis of multiple sclerosis based on its microRNA expression profiling (2011). International Patent Application no. PCT/EP2010/, Publication no. WO/2011/
19. Ott, S., Imoto, S., Miyano, S.: Finding optimal models for small gene networks. In: Proceedings of the Pacific Symposium on Biocomputing (2004)
20. Pe'er, D., Regev, A., Elidan, G., Friedman, N.: Inferring subnetworks from perturbed expression profiles. Bioinformatics 17 (2001)
21. Pe'er, D., Tanay, A., Regev, A.: MinReg: a scalable algorithm for learning parsimonious regulatory networks in yeast and mammals. Journal of Machine Learning Research 7 (2006)
22. Peña, J., Björkegren, J., Tegnér, J.: Growing Bayesian network models of gene networks from seed genes. Bioinformatics 21(2), ii224-ii229 (2005)
23. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19) (2007)
24. Sahami, M.: Learning limited dependence Bayesian classifiers. In: Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (1996)
25. Segal, E., Pe'er, D., Regev, A., Koller, D., Friedman, N.: Learning module networks. Journal of Machine Learning Research 6 (2005)
26. Segal, E., Shapira, M., Regev, A., Pe'er, D., Botstein, D., Koller, D., Friedman, N.: Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nature Genetics 34(2) (2003)
27. Wang, J., Xu, C., Shen, D., Luo, G., Geng, X.: Understanding topic influence based on module network. Lecture Notes in Computer Science, vol. 4822 (2007)

Genetic Algorithms and Iterative Rule Learning for association rule mining

Vanessa Aguiar-Pulido, José A. Seoane, Cristian R. Munteanu, Julián Dorado and Alejandro Pazos

Information and Communication Technologies Department, Faculty of Informatics, University of A Coruña, Campus de Elviña s/n, Spain

Abstract. This work presents a method based on genetic algorithms (GAs) which follows the Iterative Rule Learning (IRL) approach for association rule mining. It was applied to real data from schizophrenic patients, as well as to simulated data generated with the HAP-SAMPLE software. A comparison with MDR, a widely used software package based on dimensionality reduction, was also carried out. The proposed method obtained better results when there was more noise in the data and the association rules were harder to find.

Keywords: SNP, schizophrenia, genetic algorithm, iterative rule learning, data mining, bioinformatics

1 Introduction

Schizophrenia, a common disease, can be defined as a heterogeneous syndrome characterised by perturbations in language, perception, thinking, social relationships and will. There is no set of symptoms that characterises the disease uniquely. Even though researchers have looked for a unique cause of schizophrenia for years without success, most of them have concluded that schizophrenia is the consequence of the cumulative effects of several risk factors (genetic and environmental) [1]. Thus, schizophrenia is a complex disease. Several studies of families, twins and foster-children have confirmed and quantified the contribution of genetics to schizophrenia [2]. Subsequently, molecular genetics techniques started to be used to identify the genes that cause the disease [3]. These are not genes of schizophrenia themselves; rather, they may transmit a set of characteristics that increase the risk of developing the disease.

In order to find these genes, association studies are usually carried out. These studies try to find relationships between population genetic variables and the risk of developing a disease. In most cases, association studies involving candidate genes focus on a set of Single Nucleotide Polymorphisms (SNPs), generally based on previously reported small contributions of these markers to the risk of susceptibility to the disease studied. A SNP [4] is a single-nucleotide variation in a genetic sequence that occurs at appreciable frequency, that is, in at least 1% of the population. Therefore, methods designed for association rule mining can be used to extract the underlying relationships in the genetic data as part of the association study.

2 Recent work

Data mining methods and techniques have been successfully applied to areas as diverse as bioinformatics, web analysis, intrusion detection and the stock market. Association rule mining can be considered one of the most important and well-researched data mining techniques. For this purpose, genetic algorithms and the Iterative Rule Learning approach have been widely used. A Genetic Algorithm (GA) [5, 6] is a search method based on Charles Darwin's theory of evolution [7]. As a result, GAs are inspired by biological evolution and its genetic-molecular basis. These algorithms make a population evolve through random actions similar to those in biological evolution (mutations and genetic recombination), as well as through selection according to a criterion called fitness. The fitness is used to decide which individuals are selected: the fitter an individual is, the more likely it is to reproduce. For association rule mining, association rules (which in this case correspond to the GA population) can be coded following the Iterative Rule Learning (IRL) [8] approach.
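The GA machinery just described can be illustrated on the classic "one-max" toy problem (maximise the number of ones in a bit-string); the problem and all parameters here are illustrative, not those of the method presented later:

```python
import random

# Minimal GA sketch: evolve bit-strings via fitness-based selection,
# one-point crossover (recombination) and mutation.
def evolve(fitness, n_bits=8, pop_size=20, generations=50, seed=1):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)]
           for _ in range(pop_size)]
    for _ in range(generations):
        # tournament selection: fitter individuals reproduce more often
        def pick():
            a, b = rng.sample(pop, 2)
            return a if fitness(a) >= fitness(b) else b
        nxt = []
        for _ in range(pop_size):
            p1, p2 = pick(), pick()
            cut = rng.randrange(1, n_bits)     # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:             # mutation: flip one bit
                i = rng.randrange(n_bits)
                child[i] ^= 1
            nxt.append(child)
        pop = nxt
    return max(pop, key=fitness)

best = evolve(sum)  # "one-max": fitness is the number of ones
```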
In this approach, each rule corresponds to one chromosome of the genetic algorithm, that is, to one individual of the population. Thus, the particular solution that results from an iteration is the best individual of the GA, while the global solution is the set of best individuals obtained from several executions of the genetic algorithm. Recent works on GAs and the IRL approach are described below. GENAR (GENetic Association Rules) [9] is a tool designed to discover association rules in databases containing quantitative attributes. The authors use an evolutionary algorithm to find the different intervals, as well as iterative rule learning to avoid always evolving the same rule. Hoffmann [10] presents a boosting algorithm based on the iterative rule learning approach for fuzzy rule base system design. The fuzzy rule base is generated incrementally, using an evolutionary algorithm to optimise one fuzzy classifier rule at a time. This method systematically reduces the weight of the correctly classified examples in order to focus the next iteration of the rule generation method on those training examples that are hard to learn.

Özyer et al. [11] use two of the most popular data mining tasks, classification and association rule mining, for intrusion detection. In order to predict behaviours in networked computers, they propose a method based on iterative rule learning using a fuzzy rule-based genetic classifier. Abdi et al. [12] propose a method based on the iterative rule learning approach to generate the entire rule base of a fuzzy rule base system with the help of genetic algorithms. As a novelty, their algorithm does not need any training set. ARMNGA (Association Rules Mining in Novel Genetic Algorithm) [13] is a spatial mining algorithm which takes advantage of a genetic algorithm specifically designed for discovering association rules. Unlike previous methods, it avoids generating impossible candidates, thus reducing execution time. Yan et al. [14] designed a genetic algorithm-based strategy for identifying association rules without specifying a threshold for minimum support, and its corresponding ARMGA/EARMGA algorithm. This approach has two main benefits: high-performance association rule mining and system automation. Finally, Qodmanan et al. [15] propose a method based on genetic algorithms that extracts the rules with the best correlation between support and confidence. Like the previous approach, it needs no minimum support threshold; in addition, it needs no confidence threshold either.

3 Methods

In this work, an improved version of a method based on genetic algorithms (GAs) is presented [16]. This method follows the Iterative Rule Learning (IRL) approach; thus, each chromosome or individual of the genetic algorithm represents a rule.

3.1 Global structure

The technique presented in this work first divides the original dataset into two: the first part will be used for training (the training dataset from now on) and the second for testing (the test dataset from now on).
This is done because cross-validation was used to obtain the test results. After this, the training dataset is used as input for the iterative algorithm. This algorithm, which is explained in more depth below, is executed at most ten times. As a result of all of its executions, a rule pool is obtained: the iterative algorithm returns a rule in each iteration, which may be stored in the rule pool. Thus, the rule pool is a set of rules. Finally, after this whole process has ended, the samples in the test dataset are classified using the rules contained in the rule pool. The whole process described above can be repeated, for example, ten times if 10-fold cross-validation is used. In this case, 90% of the data would be used for training and the remaining 10% for testing. The global structure of the algorithm is shown in Fig. 1.
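A minimal sketch of the 10-fold split just described (illustrative, without stratification; the original implementation is not reproduced here):

```python
# Each fold holds out ~10% of the samples for testing; the remaining
# ~90% are used to train the iterative algorithm.
def kfold_indices(n_samples, k=10):
    folds = [list(range(i, n_samples, k)) for i in range(k)]
    for i in range(k):
        test = folds[i]
        train = [j for f in folds[:i] + folds[i + 1:] for j in f]
        yield train, test

splits = list(kfold_indices(20, k=10))
print(len(splits))        # 10 folds
print(len(splits[0][1]))  # 2 test samples per fold (10% of 20)
```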

Fig. 1. Global structure

3.2 Iterative algorithm

The iterative algorithm is described below. Since the IRL approach has been followed, each individual of the genetic algorithm corresponds to one rule. Each iteration of this algorithm includes the following steps (Fig. 2). First, the underlying genetic algorithm is evolved until the change in the fitness value is less than a certain threshold established by the user. After that, the population of individuals of the genetic algorithm is sorted, the first position corresponding to the best individual. Once the genetic algorithm has finished, the best individual is chosen. This individual is then compared to those already stored in the rule pool: if it is too similar to a rule already in the pool, it is discarded; if not, it must classify more than a certain percentage of the input data (also set by the user) to be added to the pool. Input data classified by an individual in a specific iteration of the method are marked as classified, so that the method is capable of finding, apart from general rules, more specific rules which affect only a lower percentage of the input data. Thus, already classified data will not be considered as input for the next iteration. This whole process is repeated at most ten times. If all the input data are classified before ten iterations, the execution finishes and the global method may be run again from the beginning. To sum up, each execution of the method corresponds to, at most, ten iterations of the steps described above. Each time the method is run, the rules obtained as a result are shown, together with their fitness and the classification accuracy percentages obtained on the test data. Once all of the runs have finished, a global confusion matrix is displayed, together with specificity and sensitivity values.
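The iteration just described can be condensed into a sketch. Here `evolve_ga` and `similar` stand in for the GA run and the rule-similarity check; all names are illustrative, not the authors' C++ implementation:

```python
# Sketch of one IRL-style run: evolve a rule, reject it if it is too
# similar to a stored rule or covers too little data, otherwise store it
# and mark the data it classifies.
def iterative_rule_learning(data, evolve_ga, similar,
                            min_coverage=0.05, max_iters=10):
    rule_pool = []
    remaining = list(data)
    for _ in range(max_iters):
        if not remaining:
            break                        # everything classified: stop early
        rule = evolve_ga(remaining)      # best individual of this GA run
        if any(similar(rule, r) for r in rule_pool):
            continue                     # too similar to a stored rule
        covered = [x for x in remaining if rule(x)]
        if len(covered) / len(data) <= min_coverage:
            continue                     # must classify enough of the input
        rule_pool.append(rule)
        # mark classified samples so later iterations find specific rules
        remaining = [x for x in remaining if not rule(x)]
    return rule_pool
```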

Fig. 2. Structure of one iteration

The method presented was implemented in C++ for efficiency reasons and takes any type of text file as input. The output of the method is saved in a text file as well.

4 Results and discussion

4.1 Input data

Schizophrenia data from Galician patients were used as input [17]. The data contained 48 SNPs at the DRD3 and HTR2A genes, which are associated with schizophrenia. These SNPs were encoded with the following values:

- 0 if homozygous for the first allele (homozygous: both copies of a given gene have the same allele; allele: one of a number of alternative forms of the same gene occupying a given position on a chromosome),
- 1 if heterozygous (the patient has two different alleles of a given gene),
- 2 if homozygous for the second allele, or
- 3 if unknown.

The original dataset contained 260 case subjects (genetically predisposed to schizophrenia) and 354 control subjects (not predisposed), a total of 614 patients. To perform more tests, six other datasets were derived from the original one [18] by adding control subjects generated with the HAP-SAMPLE [19] simulation tool. These data were modified to include genotyping errors (represented as the value 3), taking into account the error frequencies of the real data but choosing at random which positions were modified. These datasets included 307, 614, 1228, 1842, 2456 and 3070 simulated control subjects, and were named following the pattern 1:N, where this label represents the proportion between the real subjects (case and control) and the simulated control subjects.

4.2 Comparison with MDR

Multifactor Dimensionality Reduction (MDR) [20, 21] is a data mining approach designed to detect and characterise nonlinear interactions among discrete attributes or variables that influence a binary outcome (for example, case-control status). It is a constructive induction algorithm which reduces the original n-dimensional model to a one-dimensional model, repeating this procedure for each possible n-factor combination and selecting the combination that maximises the case-control ratio of the high-risk group. This method is considered a nonparametric alternative to traditional statistical methods. The MDR software combines attribute selection, attribute construction and classification with cross-validation.
This method has mostly been used to detect gene-gene interactions, or epistasis, in genetic studies of common human diseases [22-24] such as schizophrenia [25-27], although it can also be applied to other domains. Three graphics below show the results obtained by the different methods when the SNPs at each gene were used separately as input (Fig. 3, Fig. 4) and when all the SNPs were included (Fig. 5). These images show the classification scores obtained for the different input datasets. The proposed solution obtains better results when more noise is introduced and when rules are hard to find because they affect a very low percentage of the samples. The method presented obtains lower scores than MDR when there is no simulated data, because there is then less information and there are more genotyping errors. In addition, unlike MDR, the method proposed here only obtains rules with known values (for example, it would not consider rules like "if this SNP has the value unknown, then the subject is genetically predisposed").
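The core MDR reduction step described above can be sketched as follows; the data and genotype coding are illustrative, not MDR's actual implementation:

```python
from collections import defaultdict

# Pool multi-locus genotype cells into "high-risk" vs "low-risk" by
# comparing each cell's case/control ratio with the overall ratio,
# reducing an n-factor combination to one binary attribute.
def mdr_high_risk_cells(genotypes, status):
    """genotypes: one tuple per subject (one value per selected SNP);
    status: 1 for case, 0 for control."""
    cases = defaultdict(int)
    controls = defaultdict(int)
    for g, s in zip(genotypes, status):
        (cases if s else controls)[g] += 1
    total_cases = sum(status)
    total_controls = len(status) - total_cases
    threshold = total_cases / total_controls
    cells = set(cases) | set(controls)
    # high-risk: the cell's case/control ratio exceeds the overall ratio
    return {g for g in cells
            if controls[g] == 0 or cases[g] / controls[g] > threshold}

geno = [(0, 1), (0, 1), (2, 2), (2, 2), (0, 1), (2, 2)]
stat = [1, 1, 1, 0, 0, 0]
print(mdr_high_risk_cells(geno, stat))  # {(0, 1)}
```

Subjects falling in a high-risk cell are then predicted as cases, yielding the one-dimensional attribute whose accuracy MDR evaluates per n-factor combination.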

Fig. 3. Classification results for DRD3 SNPs

Fig. 4. Classification results for HTR2A SNPs

Fig. 5. Classification results for both genes

5 Conclusions

A method designed for association rule mining has been described. This method is based on GAs and follows the IRL approach. The proposed method was compared to MDR, obtaining better results when there is more noise in the data and the association rules are harder to find. In the future, this method will be compared to other similar ones and applied to more bioinformatics problems.

6 Acknowledgements

José A. Seoane and Cristian R. Munteanu acknowledge funding support for a research position from an Isabel Barreto grant and the Isidro Parga Pondal Program of Xunta de Galicia (Spain), respectively. This work is supported by the following projects: the Galician Network for Colorectal Cancer Research (REGICC, Ref. 2009/58), funded by the General Directorate of Research, Development and Innovation of Xunta de Galicia; the Ibero-American Network of the Nano-Bio-Info-Cogno Convergent Technologies (Ibero-NBIC Network, 209RT-0366), funded by CYTED (Spain); the grants Ref. PIO52048 and RD07/0067/0005, funded by the Carlos III Health Institute; PHR2.0: Registro Personal de Salud en Web 2.0 (ref. TSI ), funded by the Spanish Ministry of Industry, Tourism and Trade; and Development of new image analysis techniques in 2D Gel for biomedical research (ref. 10SIN105004PR), funded by Xunta de Galicia.

7 References

1. Chinchilla Moreno, A.: Las esquizofrenias. Sus hechos y valores clínicos y terapéuticos. Elsevier Masson (2007)
2. Sham, P.: Genetic epidemiology. Br Med Bull 52 (1996)
3. Sáiz, J., Fañanás, L.: Introducción: Genética y Psiquiatría. Monografías de Psiquiatría 10 (1998)
4. den Dunnen, J.T., Antonarakis, S.E.: Mutation nomenclature extensions and suggestions to describe complex mutations: a discussion. Hum Mutat 15 (2000)
5. Holland, J.H.: Adaptation in Natural and Artificial Systems. University of Michigan Press, Michigan (1975)
6.
Goldberg, D.: Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley Longman Publishing Co., Boston, MA (1989)
7. Darwin, C.: On the Origin of Species by Means of Natural Selection. John Murray, London (1859)
8. Venturini, G.: SIA: A supervised inductive algorithm with genetic search for learning attributes based concepts. In: Brazdil, P. (ed.): European Conference on Machine Learning (ECML). Springer, Vienna (1993)
9. Mata, J., Alvarez, J.L., Riquelme, J.C.: Mining Numeric Association Rules with Genetic Algorithms. In: Kurkova, V., Steele, N.C., Neruda, R., Karny, M. (eds.): 5th International

Conference on Artificial Neural Networks and Genetic Algorithms, Vol. Artificial Neural Nets and Genetic Algorithms. Springer Computer Science, Prague (2001)
10. Hoffmann, F.: Combining boosting and evolutionary algorithms for learning of fuzzy classification rules. Fuzzy Sets and Systems 141 (2004)
11. Özyer, T., Alhajj, R., Barker, K.: Intrusion detection by integrating boosting genetic fuzzy classifier and data mining criteria for rule pre-screening. Journal of Computer and Network Applications 30 (2007)
12. Abdi, M.J., Analoui, M., Aghabeigi, B., Rafiee, E., Tabatabaee, S.M.S.: Evolutionary design of a fuzzy rule base for solving the goal-shooting problem in the RoboCup 3D soccer simulation league. Lecture Notes in Computer Science 5001 (2008)
13. Dai, S.P., Gao, L., Zhu, Q., Zhu, C.: A novel genetic algorithm based on image databases for mining association rules. In: Lee, R., Chowdhury, M.U., Ray, S., Lee, T. (eds.): 6th IEEE/ACIS International Conference on Computer and Information Science (2007)
14. Yan, X.W., Zhang, C.Q., Zhang, S.C.: Genetic algorithm-based strategy for identifying association rules without specifying actual minimum support. Expert Systems with Applications 36 (2009)
15. Qodmanan, H.R., Nasiri, M., Minaei-Bidgoli, B.: Multi objective association rule mining with genetic algorithm without specifying minimum support and minimum confidence. Expert Systems with Applications 38 (2011)
16. Aguiar Pulido, V., Seoane Fernández, J.A., Freire, A., Munteanu, C.R.: Data Mining in Complex Diseases Using Evolutionary Computation. Lecture Notes in Computer Science 5517 (2009)
17. Dominguez, E., Loza, M.I., Padin, F., Gesteira, A., Paz, E., Paramo, M., Brenlla, J., Pumar, E., Iglesias, F., Cibeira, A., Castro, M., Caruncho, H., Carracedo, A., Costas, J.: Extensive linkage disequilibrium mapping at HTR2A and DRD3 for schizophrenia susceptibility genes in the Galician population.
Schizophr Res 90 (2007)
18. Aguiar-Pulido, V., Seoane, J.A., Rabunal, J.R., Dorado, J., Pazos, A., Munteanu, C.R.: Machine learning techniques for single nucleotide polymorphism-disease classification models in schizophrenia. Molecules 15 (2010)
19. Wright, F.A., Huang, H., Guan, X., Gamiel, K., Jeffries, C., Barry, W.T., de Villena, F.P., Sullivan, P.F., Wilhelmsen, K.C., Zou, F.: Simulating association studies: a data-based resampling method for candidate regions or whole genome scans. Bioinformatics 23 (2007)
20. Moore, J.H., Gilbert, J.C., Tsai, C.T., Chiang, F.T., Holden, T., Barney, N., White, B.C.: A flexible computational framework for detecting, characterizing, and interpreting statistical patterns of epistasis in genetic studies of human disease susceptibility. J Theor Biol 241 (2006)
21. Cordell, H.J.: Detecting gene-gene interactions that underlie human diseases. Nat Rev Genet 10 (2009)
22. Greene, C.S., Sinnott-Armstrong, N.A., Himmelstein, D.S., Park, P.J., Moore, J.H., Harris, B.T.: Multifactor dimensionality reduction for graphics processing units enables genomewide testing of epistasis in sporadic ALS. Bioinformatics 26 (2010)
23. Cattaert, T., Urrea, V., Naj, A.C., De Lobel, L., De Wit, V., Fu, M., Mahachie John, J.M., Shen, H., Calle, M.L., Ritchie, M.D., Edwards, T.L., Van Steen, K.: FAM-MDR: a flexible family-based multifactor dimensionality reduction technique to detect epistasis using related individuals. PLoS ONE 5
24. He, H., Oetting, W.S., Brott, M.J., Basu, S.: Pair-wise multifactor dimensionality reduction method to detect gene-gene interactions in a case-control study. Hum Hered 69 (2009) 60-70

25. Kang, S.G., Lee, H.J., Choi, J.E., Park, Y.M., Park, J.H., Han, C., Kim, Y.K., Kim, S.H., Lee, M.S., Joe, S.H., Jung, I.K., Kim, L.: Association Study between Antipsychotics-Induced Restless Legs Syndrome and Polymorphisms of Dopamine D1, D2, D3, and D4 Receptor Genes in Schizophrenia. Neuropsychobiology 57 (2008)
26. Vilella, E., Costas, J., Sanjuan, J., Guitart, M., De Diego, Y., Carracedo, A., Martorell, L., Valero, J., Labad, A., De Frutos, R., Najera, C., Molto, M.D., Toirac, I., Guillamat, R., Brunet, A., Valles, V., Perez, L., Leon, M., de Fonseca, F.R., Phillips, C., Torres, M.: Association of schizophrenia with DTNBP1 but not with DAO, DAOA, NRG1 and RGS4 nor their genetic interaction. J Psychiatr Res 42 (2008)
27. Yasuno, K., Ando, S., Misumi, S., Makino, S., Kulski, J.K., Muratake, T., Kaneko, N., Amagane, H., Someya, T., Inoko, H., Suga, H., Kanemoto, K., Tamiya, G.: Synergistic association of mitochondrial uncoupling protein (UCP) genes with schizophrenia. Am J Med Genet B Neuropsychiatr Genet 144B (2007)

Automatic substantiation of drug safety signals

Bauer-Mehren A (1), Carrascosa MC (1), Mestres J (1), Boyer S (2), Sanz F (1), Furlong LI (1)

(1) Research Programme on Biomedical Informatics (GRIB), IMIM, DCEX, Universitat Pompeu Fabra, C/Dr. Aiguader 88, Barcelona, Spain
(2) AstraZeneca, Sweden

lfurlong, fsanz, jmestres

Abstract. Drug safety issues pose serious health threats to the population and constitute a major cause of mortality worldwide. It is of great importance to unravel the molecular mechanisms by which an adverse drug reaction is elicited. We present a framework for the automatic and systematic analysis of drug adverse reactions (signal substantiation). We seek to provide a possible biological explanation by identifying connections that might explain why a drug produces a specific adverse reaction. Our approach is based on the assumption that if the disease phenotype elicited by a drug is similar to the phenotype observed in a genetic disease, then the drug acts on the same molecular processes that are altered in the disease. The substantiation concept is implemented as a workflow combining modules for in silico drug-target profiling, mining of gene-disease databases, and pathway analysis. Examples of the application of the signal substantiation workflow are presented.

Keywords: drug safety, adverse drug reactions, in silico pharmacology, systems biology

1 Introduction

Drug safety issues can arise during pre-clinical screening, clinical trials and, more importantly, after the drug is marketed and used for the first time by the general population (1). Although relatively rare once a drug is marketed, drug safety issues constitute a major cause of morbidity and mortality worldwide. Every year about 2 million patients in the US are affected by a serious adverse drug reaction (ADR), resulting in approximately fatalities, ranking ADRs between the fourth and sixth cause of death in the US, not far behind cancer and heart disease (2).
Similar figures have been estimated for other Western countries (3). Serious ADRs resulting from treatment with thalidomide prompted modern drug legislation more than 40 years ago (4). Over the past 10 years, 19 widely used marketed drugs were withdrawn after presenting unexpected side effects (1). The current and future challenges of drug development and drug utilization, and a number of recent high-impact drug safety

issues (e.g. rofecoxib), highlight the need to improve safety monitoring systems (5). Given the important implications of ADRs for both public health and drug development, unravelling the molecular mechanisms by which an ADR is elicited is of great relevance. This can be achieved by placing the drug adverse reaction in the context of current biomedical knowledge that might explain it. Given the huge amounts of data generated by omics experiments, and the ever-increasing volume of data and knowledge stored in databases and knowledge bases, bioinformatics analysis tools are essential for studying and analysing ADRs.

1.1 ADR mechanisms

Although the factors that determine susceptibility to ADRs are not completely understood, evidence accumulated over the years indicates an important role of genetic factors (6-8). ADRs can be mechanistically related to drug metabolism phenomena, leading for instance to unusual drug accumulation in the body (6). They can also be associated with inter-individual genetic variants, most notably single nucleotide polymorphisms (SNPs), in genes encoding drug-metabolizing enzymes and drug targets (6). One of the first ADRs explained by a genetic determinant was the inherited deficiency of the enzyme glucose-6-phosphate dehydrogenase, which causes severe anemia in patients treated with the antimalarial drug primaquine (1). Alternatively, an ADR can be caused by the interaction of the drug with a target different from the originally intended one (a so-called anti-target) (8). A well-known example of an anti-target ADR is provided by aspirin, whose anti-inflammatory effect, exerted by inhibition of prostaglandin production by COX-2, comes at the expense of irritation of the stomach mucosa through its unintended inhibition of COX-1 (9).
Furthermore, in addition to mechanisms related to off-target pharmacology, it is becoming evident that ADRs may often be caused by the combined action of multiple genes (9). The anticoagulant warfarin, whose anticoagulant effect varies widely between patients, is often associated with hemorrhages and leads the list of drugs with serious ADRs in the US and Europe. About 50% of the variability in warfarin response is explained by polymorphisms in the genes CYP2C9 and VKORC1 (10); a recent study identified a third gene, CYP4F2, explaining about 1.5% of dose variance (11). However, the genes accounting for the remaining variability in the response to warfarin in the population are unknown. Other ADRs may arise as a consequence of drug-drug interactions, or of interactions between the drug's action and environmental factors (6,12). Indeed, the interplay between genotype and environment observed in several aspects of health and disease also extends to drug response and safety. For example, alcohol consumption and smoking are both associated with changes in the expression of the metabolic enzyme CYP2E1, thereby affecting the pharmacokinetics of certain drugs (13).

1.2 Challenges in studying ADRs

From the above it is clear that the study of the molecular mechanisms underlying ADRs requires a synthesis of information across multiple disciplines. In particular, it requires the integration of information from a variety of knowledge domains, ranging from the chemical through the biological up to the clinical. It has already been recognized that the adequate management of knowledge is becoming a key factor for biomedical research, especially in areas that require traversing different disciplines and/or integrating diverse and heterogeneous pieces of information (14). Here, a key aspect is the integration of heterogeneous data types. Several authors have discussed the challenges of data integration in the life sciences (15,16). These problems are rooted in the inherent complexity of the biological domain, its high degree of fragmentation, the data deluge problem, and the widespread ambiguity in the naming of entities (17). On the other hand, current biomedical research questions can only be addressed computationally by using a combination of different methods. An attractive approach that has emerged in recent years is the combination of different bioinformatics analysis modules by means of workflows (18,19). This technology allows the integration of a variety of computational techniques into a processing pipeline in which inputs and outputs are standardized. Such integration has been greatly facilitated by the use of public APIs and web services allowing programmatic access to data repositories and analysis tools. Taverna is one such approach, allowing different analysis modules, shared as web services, to be integrated into a scientific workflow to perform in silico experiments (20). Similar approaches are also used for the processing of free-text documents or for combining data mining methods.
In this article we present a general framework for the systematic analysis of drug adverse reactions. The entry point of the system is a potential drug safety signal, composed of a drug and its potential associated adverse reaction. We seek to provide a possible biological explanation; we refer to this process as signal substantiation. The framework was implemented by means of software modules accessible through web services and integrated into a workflow ready to be used for automatic substantiation of drug-event pairs. Finally, we present an example study emphasizing the usefulness of the substantiation workflow.

2 Design and Implementation

The substantiation framework presented here has been implemented by means of software modules that perform specific tasks of the process. The modules were implemented as web services and combined into data processing workflows to

achieve the aforementioned signal substantiation (Figure 1). The workflow and web services are described in the following sections.

Fig. 1. Implementation of the signal substantiation framework by integrating different bioinformatics methods accessed through web services.

2.1 Signal substantiation workflow

The signal substantiation workflow seeks to establish a connection between the clinical event and the drug through different paths: (i) through proteins that are drug/metabolite targets and are also associated with the clinical event, and (ii) through proteins that are drug or metabolite targets and proteins associated with the clinical event that participate in a common biological pathway (Figure 2). Note that in the first connecting path the link between the drug and the event is established through common protein(s), while in the second path the link is established through different proteins that are part of the same biological pathway. Two SOAP web services (cglalertservice and adrpathservice), allowing access to databases and bioinformatics modules relevant for signal substantiation, have been implemented. To allow a smooth integration of the different modules in Taverna workflows, two complementary XSD schemas were developed. In the following, the implemented methods are explained in more detail.
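The two connecting paths described above amount to simple set operations over protein identifiers. The following Python sketch is purely illustrative: the function name, variable names and toy identifiers are ours, not part of the published services.

```python
def substantiate(drug_targets, event_proteins, pathway_members):
    """Illustrative sketch of the two connecting paths.

    drug_targets    -- set of protein IDs targeted by the drug or its metabolites
    event_proteins  -- set of protein IDs associated with the clinical event
    pathway_members -- dict mapping pathway name -> set of member protein IDs
    """
    # Path (i): proteins that are drug/metabolite targets AND
    # directly associated with the clinical event.
    direct = drug_targets & event_proteins

    # Path (ii): pathways containing at least one drug target and
    # at least one event-associated protein.
    shared_pathways = {
        name for name, members in pathway_members.items()
        if members & drug_targets and members & event_proteins
    }
    return direct, shared_pathways

# Toy example with hypothetical identifiers:
direct, shared = substantiate(
    drug_targets={"P1", "P2"},
    event_proteins={"P2", "P3"},
    pathway_members={"pathA": {"P1", "P3"}, "pathB": {"P4"}},
)
# direct contains P2 (common protein); pathA links P1 (target) with P3 (event protein)
```

In the actual workflow, the two input sets would come from the drug-target and gene-disease services described below, and each connection carries the supporting evidence rather than bare identifiers.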

2.1.1 getsmilefromatc (cglalertservice)

This method provides the chemical structure of a drug encoded by its ATC (Anatomical Therapeutic Chemical) code at the 7-digit level, by means of SMILES (Simplified Molecular Input Line Entry Specification).

Table 1. Databases for drug-target annotations: AffinDB, BindingDB, ChemblDB, DrugBank, hgpcrlig, IUPHARdb, MOAD, NRacl, PDSP, PubChem.

2.1.2 getuniprotlistfromsmile (cglalertservice)

This method returns a list of proteins related to the chemical structure encoded in SMILES and to its metabolites. We use known drug-target associations (Table 1) and extend them with in silico target profiling. To account for the possibility that adverse drug reactions emerge through the interaction of drug metabolites with particular targets, we extracted drug metabolites from a commercial database (GVK Biosciences) and applied in silico target profiling to them as well. Supporting information is provided regarding the association type or the binding affinity of the drug (or its metabolite) to the protein. In this regard, we combine known drug-target associations derived from the databases listed in Table 1 with in silico target profiling.

2.1.3 getdiseaseassociatedproteins (adrpathservice)

This method returns a list of proteins associated with a clinical event. It queries the DisGeNET database (21), which integrates gene-disease associations from various databases such as OMIM (22) or CTD (23) and from the literature. Supporting evidence is provided for each association, including the association type according to our gene-disease association ontology (21), publications discussing the association, and, in the case of

text-mining-derived associations, the sentence that reports the gene-disease association. DisGeNET also provides information about genetic variants (SNPs) and their association with diseases or adverse drug events. The input is the clinical event, encoded either as a list of UMLS CUI concept identifiers or as defined in the EU-ADR project (24).

2.1.4 getpathways (adrpathservice)

This method assesses whether proteins associated with the drug and with the event are annotated to the same biological pathway, by interrogating publicly available pathway and protein expression databases to obtain a cell-type-specific view of a given pathway. The input of the method is composed of two lists of UniProt identifiers, and the output is an XML document listing the pathways, the annotated proteins and their expression profiles. For user-friendly visualization of the workflow results, a Cytoscape graph (CytoscapeResultGraph) is generated, which contains the drug, the event, all their associated proteins and all supporting information. The results of the pathway analysis are provided as an HTML file.

Availability. AdrPathService, cglalertservice.

3 Results

3.1 A framework for the substantiation of drug-event pairs

The substantiation framework presented here places the drug-event pair, or signal, in the context of current knowledge of biological mechanisms that might explain it. Essentially, we search for evidence supporting a causal inference of the signal, i.e. feasible paths that connect the drug with the clinical event of the adverse reaction. The signal substantiation process can be framed as a closed knowledge discovery process, analogous to the Swanson model based on hidden literature relationships (25,26). We extend this framework by considering not only relationships found in the literature, but also relationships mined from other data sources or found by applying different bioinformatics methods (Figure 1). The process is described in detail in the following.
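Chained together, the four methods form the substantiation pipeline. The sketch below mimics that composition with local stand-in functions; the mappings inside them are invented toy data, and in the published implementation these are SOAP operations on cglalertservice/adrpathservice composed in a Taverna workflow, not Python functions.

```python
# Local stand-ins for the four web-service methods (toy data throughout).

def getsmilefromatc(atc_code):
    # Map a 7-digit ATC code to a SMILES string (invented mapping).
    return {"N05AD01": "c1ccccc1"}.get(atc_code)

def getuniprotlistfromsmile(smiles):
    # Known plus in-silico-predicted targets for the structure (invented).
    return {"c1ccccc1": ["KCNH2", "DRD2"]}.get(smiles, [])

def getdiseaseassociatedproteins(event_cui):
    # Proteins associated with the clinical event, as DisGeNET would return (invented).
    return {"C0003811": ["KCNH2", "SCN5A"]}.get(event_cui, [])

def getpathways(drug_proteins, event_proteins):
    # Pathways annotated with proteins from both input lists (invented pathway DB).
    pathway_db = {"Cardiac conduction": {"KCNH2", "SCN5A", "CACNA1C"}}
    return [name for name, members in pathway_db.items()
            if members.intersection(drug_proteins)
            and members.intersection(event_proteins)]

def substantiation_pipeline(atc_code, event_cui):
    """Compose the four calls: drug -> structure -> targets, event -> proteins,
    then look for direct overlaps and shared pathways."""
    smiles = getsmilefromatc(atc_code)
    drug_proteins = getuniprotlistfromsmile(smiles)
    event_proteins = getdiseaseassociatedproteins(event_cui)
    common = sorted(set(drug_proteins) & set(event_proteins))
    pathways = getpathways(drug_proteins, event_proteins)
    return common, pathways

common, pathways = substantiation_pipeline("N05AD01", "C0003811")
```

The standardized inputs and outputs at each step are what make this composition straightforward in a workflow engine: each stub here corresponds to one service operation with XSD-typed messages.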
For a drug-event pair, we collect information about the targets of the drug by querying publicly available databases and by applying drug-target profiling methods.

In parallel, we retrieve information about the genes and proteins associated with the clinical event from a database covering knowledge about the genetic basis of diseases (21). We then combine these two pieces of information under the following assumption: if the disease phenotype elicited by a drug is similar to the phenotype observed in a genetic disease, then the drug acts on the same molecular processes that are altered in the disease. This can be regarded as a phenocopy, a term originally coined by Goldschmidt in 1935 (27) to describe an individual whose phenotype, under a particular environmental condition, is identical to that of another individual whose phenotype is determined by the genotype. In other words, in the phenocopy the environmental condition mimics the phenotype produced by a gene. In the case of ADRs, the environmental condition is the exposure to the drug, whose effect mimics the phenotype (disease) produced by a gene in an individual. In this way, we can capitalize on all the knowledge about the genetic basis of diseases to explore mechanisms underlying ADRs.

Fig. 2. Framework for the substantiation of drug safety signals

Currently we consider two scenarios that can provide a causal inference of the signal (Figure 2). First, we look for connections between the drug and the event through their associated proteins. Here, a connection is established if a target of the drug is found to be directly associated with the clinical event. Many ADRs are caused by altered drug metabolism, for which genetic variants in metabolizing enzymes are often responsible. Consequently, we also consider drug metabolism phenomena as an underlying mechanism of the observed ADR, by assessing whether the drug metabolites target proteins known to be associated with the clinical event. Second, the

association between the drug and the clinical event can involve proteins that are related in the context of biological networks or signalling reactions. In this regard, it can be argued that the action of a drug on a protein (e.g. altering its function) has an effect on other protein(s) related to the former through signalling reactions. The final consequence of the drug action is the observed clinical event (Figure 2). The concept of signal substantiation was implemented by combining a variety of bioinformatics methods: in silico target profiling, text mining and pathway analysis. Diverse data sources and bioinformatics tools are accessed through web services, integrated into processing pipelines by means of Taverna, and the results are subsequently visualized and analyzed as graphs using Cytoscape (28) or summarized as HTML files (Figure 1). More detail about the implementation of the web services and workflows can be found in the Design and Implementation section.

3.2 Use case: antipsychotic drugs and risk of cardiac arrhythmias

In the 1990s, the occurrence of several cases of serious, life-threatening ventricular arrhythmias and sudden cardiac deaths secondary to the use of non-cardiac drugs raised considerable concern among regulators (29). In 1998, several drugs received a black-box warning in the US due to concerns regarding prolongation of the QT interval. Nowadays, it is known that many seemingly unrelated drugs can cause prolongation of the QT interval and Torsade de Pointes, which eventually might lead to fatal arrhythmias. For instance, in 2000 cisapride, a gastrointestinal drug, was withdrawn from the market due to an increased risk of QT prolongation. The first report of sudden cardiac death with an antipsychotic drug appeared in 1963 (30). Since then, several studies have found an increased risk of ventricular arrhythmias, cardiac arrest and sudden death associated with the use of antipsychotics (31,32).
This increased risk can partly be explained by the prolongation of the QT interval observed with several antipsychotic drugs. It has been suggested that the mechanism by which antipsychotics can cause QT prolongation involves the potassium channel encoded by the HERG gene, which regulates the myocyte action potential (33,34). Drugs blocking this potassium channel can slow repolarisation, which in turn might lead to prolongation of the QT interval and can eventually result in sudden cardiac death (33). Moreover, the concept of drug-induced prolongation of the QT interval is supported by the congenital long QT syndrome associated with mutations in the HERG gene (33-35). However, it would be interesting to know whether other proteins are involved in the mechanisms underlying drug-induced QT prolongation and cardiac arrhythmias in general. A set of antipsychotic drugs selected according to their risk of producing cardiac arrhythmias is shown in Table 2 (data compiled from (36) and from the QTdrugs database). The mechanisms underlying the association between cardiac arrhythmias and the selected antipsychotics were explored using the substantiation workflow. A summary of the results is given in Table 2. The association between the event (risk of cardiac arrhythmia) and the drugs Amisulpride, Ziprasidone and Haloperidol

is supported by connections between the drug and the event through several proteins. In contrast, the association between Sulpiride, considered a low-risk drug, and the event is supported only by ADRB1. Beta-adrenergic blockade is a common treatment to reduce mortality in congenital long QT syndromes (37); thus the associations involving the proteins encoded by ADRB1 and ADRB2 are probably not causative. Amisulpride is also considered a low-risk drug; however, two proteins were found that provide a mechanistic explanation for its association with cardiac arrhythmias. It is worth noting that for both proteins, allelic variants of the gene are reported to be associated with the event, in this case Short QT syndrome. Thus, only a specific subset of patients might be at risk of developing cardiac arrhythmias upon treatment with Amisulpride. An interesting finding is that Ziprasidone, Haloperidol and Amisulpride are found to be associated not only with prolongation of the QT interval, but also with shortened QT intervals and Brugada syndrome. This is especially relevant for the high-risk antipsychotics. In this respect, it is worth noting that most previous studies on the association between antipsychotic use and risk of cardiac arrhythmias focused only on prolongation of the QT interval (33). The analysis presented here highlights the need to study other types of electrocardiogram (ECG) abnormalities in relation to antipsychotic drugs. Moreover, patients with underlying conditions related to shortening of the QT interval should also be closely monitored upon treatment with antipsychotics.

Table 2. Antipsychotics with low and high risk of producing cardiac arrhythmias, the proteins that explain the connection between the drug and the event, and the specific events that are associated.
Drug Name    Risk  Proteins                                             Events
Sulpiride    low   ADRB1                                                Short QT syndrome
Amisulpride  low   AGTR1, SLC6A4, ADRB1, ADRB2                          Short QT syndrome
Ziprasidone  high  KCNH2, SLC6A4, ADRB1, ADRB2                          Short QT syndromes, Long QT syndromes, Timothy syndrome, Romano-Ward syndrome, Torsade de Pointes
Haloperidol  high  CACNA1C, KCNH1, KCNH2, SLC6A4, AGTR1, ADRB1, ADRB2   Short QT syndromes, Long QT syndromes, Timothy syndrome, Brugada syndrome, Romano-Ward syndrome

4 Discussion and conclusion

We have presented a general framework, and its implementation, for the systematic analysis of drug safety signals. The substantiation framework seeks to find a biological explanation of a drug-induced clinical event by looking for causative

connections between the drug, its targets, and their direct or indirect (through biological pathways) association with the clinical event. The substantiation framework has been implemented as a workflow that combines state-of-the-art bioinformatics methods for the integrated and automatic analysis of drug safety signals. The functionality of the approach is shown by the analysis of cardiac arrhythmias produced by a set of antipsychotic drugs. The results of the analysis, combining in silico target profiling and mining of genotype-phenotype databases, indicate that, in addition to prolongation of the QT interval, some antipsychotic drugs could induce other abnormalities in the ECG that could eventually lead to fatal arrhythmias. All in all, we have shown that automatic signal substantiation can help to systematically analyse drug safety signals and hence facilitate further expert validation. The use of web services allowed easy building of diverse workflows for signal filtering and substantiation, and furthermore allows the flexible extension and reuse of our modules. Hence, the presented modules and workflow provide a user-friendly framework for the automatic analysis of drug safety signals that can guide and facilitate a more detailed analysis by experts.

References

(1) Giacomini KM, Krauss RM, Roden DM, Eichelbaum M, Hayden MR, Nakamura Y. When good drugs go bad. Nature 2007;446(7139)
(2) Lazarou J, Pomeranz BH, Corey PN. Incidence of Adverse Drug Reactions in Hospitalized Patients: A Meta-analysis of Prospective Studies. JAMA 1998;279(15)
(3) van der Hooft CS, Sturkenboom MC, van Grootheest K, Kingma HJ, Stricker BH. Adverse drug reaction-related hospitalisations: a nationwide study in The Netherlands. Drug Saf 2006;29(2)
(4) Harmark L, van Grootheest AC. Pharmacovigilance: methods, recent developments and future perspectives. Eur J Clin Pharmacol 2008 Aug;64(8)
(5) Olsson S.
The role of the WHO programme on International Drug Monitoring in coordinating worldwide drug safety efforts. Drug Saf 1998 Jul;19(1):1-10.
(6) Gurwitz D, Motulsky AG. "Drug reactions, enzymes, and biochemical genetics": 50 years later. Pharmacogenomics 2007;8(11)
(7) Wilke R, Lin D, Roden D, Watkins P, Flockhart D, Zineh I, et al. Identifying genetic risk factors for serious adverse drug reactions: current progress and challenges. Nat Rev Drug Discov 2007 Nov;6(11)
(8) Ekins S. Predicting undesirable drug interactions with promiscuous proteins in silico. Drug Discov Today;9(6)

(9) Kawai S. Cyclooxygenase selectivity and the risk of gastro-intestinal complications of various non-steroidal anti-inflammatory drugs: a clinical consideration. Inflamm Res 1998 Oct;47 Suppl 2
(10) Chiang A, Butte A. Data-Driven Methods to Discover Molecular Determinants of Serious Adverse Drug Events. Clin Pharmacol Ther 2009;85(3)
(11) Takeuchi F, McGinnis R, Bourgeois S, Barnes C, Eriksson N, Soranzo N, et al. A Genome-Wide Association Study Confirms VKORC1, CYP2C9, and CYP4F2 as Principal Genetic Determinants of Warfarin Dose. PLoS Genet 2009;5(3)
(12) Chiang JH, Yu HC. MeKE: discovering the functions of gene products from biomedical literature via sentence alignment. Bioinformatics 2003 Jul 22;19(11)
(13) Howard LA, Miksys S, Hoffmann E, Mash D, Tyndale RF. Brain CYP2E1 is induced by nicotine and ethanol in rat and is higher in smokers and alcoholics. Br J Pharmacol 2003;138(7)
(14) Ruttenberg A, Clark T, Bug W, Samwald M, Bodenreider O, Chen H, et al. Advancing translational research with the Semantic Web. BMC Bioinformatics 2007;8:S2.
(15) Louie B, Mork P, Martin-Sanchez F, Halevy A, Tarczy-Hornoch P. Data integration and genomic medicine. J Biomed Inform 2007 Feb;40(1):5-16.
(16) Philippi S, Kohler J. Addressing the problems with life-science databases for traditional uses and systems biology. Nat Rev Genet 2006;7(6)
(17) Antezana E, Kuiper M, Mironov V. Biological knowledge management: the emerging role of the Semantic Web technologies. Brief Bioinform 2009;10(4)
(18) Gil Y, Deelman E, Ellisman M, Fahringer T, Fox G, Gannon D, et al. Examining the challenges of scientific workflows. Computer 2007;40(12)
(19) Giles J. Key biology databases go wiki. Nature 2007;445:691.
(20) Oinn T, Addis M, Ferris J, Marvin D, Senger M, Greenwood M, et al. Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 2004 Nov 22;20(17)
(21) Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI.
DisGeNET: a Cytoscape plugin to visualize, integrate, search and analyze gene-disease networks. Bioinformatics 2010 Nov 15;26(22)
(22) Hamosh A, Scott AF, Amberger JS, Bocchini CA, McKusick VA. Online Mendelian Inheritance in Man (OMIM), a knowledgebase of human genes and genetic disorders. Nucleic Acids Res 2005;33:D514-D517.

(23) Mattingly CJ, Rosenstein MC, Davis AP, Colby GT, Forrest JN Jr, Boyer JL. The Comparative Toxicogenomics Database: A Cross-Species Resource for Building Chemical-Gene Interaction Networks. Toxicol Sci 2006;92(2)
(24) Trifiro G, Pariente A, Coloma PM, Kors JA, Polimeni G, Miremont-Salame G, et al. Data mining on electronic health record databases for signal detection in pharmacovigilance: which events to monitor? Pharmacoepidemiol Drug Saf 2009 Dec;18(12)
(25) Huang RS, Duan S, Kistner EO, Hartford CM, Dolan ME. Genetic variants associated with carboplatin-induced cytotoxicity in cell lines derived from Africans. Mol Cancer Ther 2008 Sep;7(9)
(26) Huang RS, Duan S, Shukla SJ, Kistner EO, Clark TA, Chen TX, et al. Identification of genetic variants contributing to cisplatin-induced cytotoxicity by use of a genomewide approach. Am J Hum Genet 2007 Sep;81(3)
(27) Lenz W. Phenocopy. Hum Genet 1970;9(3)
(28) Shannon P, Markiel A, Ozier O, Baliga N, Wang J, Ramage D, et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res 2003;13(11):2498.
(29) Hammond TG, Carlsson L, Davis AS, Lynch WG, MacKenzie I, Redfern WS, et al. Methods of collecting and evaluating non-clinical cardiac electrophysiology data in the pharmaceutical industry: results of an international survey. Cardiovasc Res Mar;49(4):741.
(30) Glassman AH, Bigger Jr JT. Antipsychotic drugs: prolonged QTc interval, torsade de pointes, and sudden death. Am J Psychiatry 2001;158(11):1774.
(31) Ray WA, Chung CP, Murray KT, Hall K, Stein CM. Atypical antipsychotic drugs and the risk of sudden cardiac death. N Engl J Med 2009 Jan 15;360(3)
(32) Montout C, Casadebaig F, Lagnaoui R, Verdoux H, Philippe A, Begaud B, et al. Neuroleptics and mortality in schizophrenia: prospective analysis of deaths in a French cohort of schizophrenic patients. Schizophr Res Oct 1;57(2-3)
(33) Abdelmawla N, Mitchell AJ. Sudden cardiac death and antipsychotics.
Part 1: Risk factors and mechanisms. Advances in Psychiatric Treatment ;12(1):35. (34) Berger SI, Ma'ayan A, Iyengar R. Systems pharmacology of arrhythmias. Sci Signal 2010 Apr 20;3(118):ra30. (35) Sicouri S, Antzelevitch C. Sudden cardiac death secondary to antidepressant and antipsychotic drugs

Acknowledgments. This work was supported by the European Commission [EU-ADR, ICT], the Innovative Medicines Initiative [eTOX, 115002], the AGAUR [to A.B.M.] and Instituto de Salud Carlos III FEDER (CP10/00524) grants. The Research Unit on Biomedical Informatics (GRIB) is a node of the Spanish National Institute of Bioinformatics (INB) and a member of the COMBIOMED network.

Allosteric modulation of 5-HT2B receptors by celecoxib and valdecoxib. Putative involvement in cardiotoxicity.

Ainhoa Nieto, José Brea, María Isabel Cadavid, Jordi Mestres, Inés Sánchez-Sellero, Rosalía Gallego, Máximo Fraga, María Isabel Loza

Departamento de Farmacología, Facultad de Farmacia, Campus Vida, Santiago de Compostela, Spain (AN, JB, MIC, MIL). Research Programme on Biomedical Informatics (GRIB), IMIM, DCEX, Universitat Pompeu Fabra, Barcelona, Spain (JM). Departamento de Anatomía Patológica y Ciencias Forenses, Facultad de Veterinaria, Campus de Lugo, Lugo, Spain (IS-S). Departamento de Ciencias Morfológicas, Facultad de Medicina, Santiago de Compostela, Spain (RG). Departamento de Anatomía Patológica y Ciencias Forenses, Facultad de Medicina, Santiago de Compostela, Spain (MF).

Abstract. Broad-scale in vitro pharmacology profiling of new chemical entities during the early phases of drug discovery has recently become an essential tool to predict clinical adverse effects. The present work was developed within the EU-ADR project, which aims to develop and use advanced ICT technologies to demonstrate new ways of exploiting the existing wealth of clinical and biomedical data sources for better and faster detection of ADRs. In this project, text-mining tools and a virtual screening approach detected a putative interaction between celecoxib and 5-HT2B receptors. We therefore hypothesized that the cardiovascular toxicity of selective cyclooxygenase-2 inhibitors (coxibs) could be partly related to a 5-HT2B-mediated effect, and studied the interaction of a currently marketed coxib (celecoxib) and a withdrawn one (valdecoxib) with serotonin 5-HT2B receptors. Both compounds increased the potency and efficacy of 5-HT at the 5-HT2B receptor. This enhancement is consistent with an allosteric agonist behavior of both compounds at the 5-HT2B receptor for serotonin, which may contribute to the cardiovascular toxicity of these compounds.
A histological study of the effect of high doses of serotonin on heart valves was performed as a control for the subsequent evaluation of the putative drug effect on the valves. The images obtained will be processed within the COMBIOMED project in order to quantify the morphological changes induced by the treatments.

Keywords: Adverse drug reaction, preliminary safety studies, 5-HT2B allosteric modulation, heart valve damage, image analysis.

1 Introduction

Serious adverse effects resulting from treatment with thalidomide prompted modern drug legislation more than 40 years ago [1]. Since then, the mainstay of drug safety surveillance has been the collection of spontaneous Adverse Drug Reaction (ADR) reports [2,3]. The current and future challenges of drug development and drug utilization, and a number of recent high-impact drug safety issues (e.g. rofecoxib (Vioxx) and pergolide), require a re-thinking of the way safety monitoring is conducted [4]. The timely discovery of unknown or unexpected ADRs is one of its major challenges: because most drugs enter the market having been tested in a low number of exposed subjects, adverse effects are often detected too late, when millions of people have already been exposed. Although many ADRs have been detected by spontaneous reporting systems, these systems have inherent limitations that hamper signal detection. Their major weakness is that they depend entirely on the ability of a physician to, first, recognize an adverse event as being related to the drug and, subsequently, to actually report the case to the local spontaneous reporting database [5]. The greatest limitations, therefore, are under-reporting and biases due to selective reporting [6]. Investigations have shown that the percentage of ADRs being reported varies between 1 and 10%. These problems may lead to underestimation of the significance of a particular reaction and to delay in signal detection, as well as to spurious detections. Broad-scale in vitro pharmacology profiling of new chemical entities during the early phases of drug discovery has recently become an essential tool to predict clinical adverse effects [7].
Modern, relatively inexpensive assay technologies and rapidly expanding knowledge about G protein-coupled receptors, nuclear receptors, ion channels and enzymes have made it possible to implement a large number of assays addressing possible clinical liabilities. Together with other in vitro assays focusing on toxicology and bioavailability, they provide a powerful tool to aid drug development. The present work was developed within the EU-ADR project [8], which aims to develop and use advanced ICT technologies to demonstrate new ways of exploiting the existing wealth of clinical and biomedical data sources for better and faster detection of ADRs. In the framework of this project, different ADRs associated with selective inhibitors of COX-2 (coxibs) were studied.

This class of drugs was developed to avoid the gastrointestinal adverse events associated with NSAIDs as a consequence of their mechanism of action. Coxibs are indeed reported to have a lower incidence of gastrointestinal adverse effects [9]; however, the analysis of multiple clinical trials has established a clear link between the use of coxibs and the appearance of cardiotoxic effects [10]. The bioinformatics groups involved in the EU-ADR consortium identified, by means of text- and data-mining and pathway substantiation tools as well as virtual screening methodologies, a putative interaction between coxibs and serotonin 5-HT2B receptors. Many drugs with demonstrated adverse cardiovascular effects (fenfluramine, pergolide, ...) share the ability to activate the serotonin (5-HT) 5-HT2B receptor, which plays an important role in the pathogenesis of valvular heart diseases [11]. On this basis, and building on the data-mining and virtual screening results obtained within the EU-ADR project, we hypothesized that the cardiovascular toxicity of coxibs could be partly related to a 5-HT2B-mediated effect. We therefore studied the interaction of a currently marketed coxib (celecoxib) and a withdrawn one (valdecoxib) with serotonin 5-HT2B receptors.

2 Materials and methods

2.1 Radioligand binding studies. 15 μg of protein per well were incubated with 1 nM [3H]-LSD, using 50 µM 5-HT to define non-specific binding. Increasing concentrations of the compounds of interest were incubated in a total volume of 250 μl of incubation buffer (50 mM Tris-HCl, 4 mM CaCl2, 1% ascorbic acid, pH 7.4) in 96-well plates for 30 min at 37 °C with stirring. After this time, 200 μl of the mixture was transferred to a GF/C plate pretreated 24 hours earlier with 0.5% PEI and, immediately before the assay, with incubation buffer for 15 min (250 μl/well). The content was filtered and washed with 4x250 μl of cold wash buffer (50 mM Tris-HCl, pH 7.4). The filters were dried for 1 hour at 60 °C, and 30 μl/well of Universol were then added to measure radioactivity, which was detected in a Trilux MicroBeta counter.

2.2 Functional studies in isolated organ. Male Sprague-Dawley rats were used. Once the animals were killed by asphyxiation with CO2, the stomach fundus was dissected and placed in 20 ml organ-bath wells containing Krebs solution continuously aerated with carbogen at a temperature of 37 °C. After a period of tissue stabilization, priming with a 10 μM concentration of 5-HT was carried out. After 1 h of stabilization, with renewal of the well solution every 15 minutes, a first cumulative concentration-response curve (CCRC) with 5-HT was constructed, after which the individual compounds were incubated for 30 min and a new CCRC of 5-HT in their presence was built. Isometric contractions were measured with Grass FTO3C force-displacement transducers connected to a PowerLab 8/30, and the results were acquired with LabChart 6.0 software.

2.3 Treatment. Twelve male Sprague-Dawley rats were obtained from University of Santiago de Compostela Laboratories. Rats were housed in pairs in solid-bottom polycarbonate cages and placed on a 7-day study. Drinking water and feed were available ad libitum.
Six rats received daily subcutaneous injections of 5-HT (75 mg/kg for the first 4 days and 60 mg/kg thereafter) for 7 days, and 6 control rats received sterile water for injection only. Of the 12 rats, 11 (five treated and six control rats) were used for compositional morphometry and for the organ/tissue toxicity screens.
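The CCRCs described in section 2.2 are typically summarized by fitting a sigmoidal (Hill) model to estimate agonist potency (EC50) and efficacy (Emax). As an illustration only, a minimal sketch in Python; the concentration-response values below are invented for the example and are not the study's measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def hill(conc, emax, ec50, n):
    """Sigmoidal concentration-response model (basal response taken as 0)."""
    return emax * conc**n / (ec50**n + conc**n)

# Hypothetical 5-HT CCRC: concentrations (M) and % of maximal contraction
conc = np.array([1e-9, 3e-9, 1e-8, 3e-8, 1e-7, 3e-7, 1e-6, 3e-6])
resp = np.array([2.0, 6.0, 18.0, 38.0, 62.0, 83.0, 95.0, 99.0])

params, _ = curve_fit(hill, conc, resp, p0=[100.0, 1e-7, 1.0])
emax, ec50, n = params
print(f"Emax = {emax:.1f}%, EC50 = {ec50*1e9:.0f} nM, Hill slope = {n:.2f}")
```

A rightward shift of the fitted EC50 (or an increase in Emax) between curves built in the absence and presence of a modulator is the quantity compared across experiments.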

2.4 Image acquisition and processing. Heart samples were sectioned in a frontal plane with an HSR002-2 matrix (Zivic Instruments). Samples were fixed in 10% buffered formalin (0.1 M phosphate buffer, pH 7.4) and embedded in paraffin according to routine procedures (dehydration in ethanol 70, 96 and 100; clearing in xylene; paraffin at 60 °C). Serial 4 μm sections from the paraffin blocks were obtained with a Minot microtome and were processed either to assess the amount of collagen and the presence of fibrosis (Masson trichrome), or to evaluate the content of glycosaminoglycans (GAG) such as hyaluronic acid through the presence of metachromasia, evidenced by their reddish color (aniline blue, pH 5). Once stained, the sections were observed and photographed with an Olympus Provis AX70 microscope.
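One common first step in quantifying such staining is to threshold the micrograph and measure the stained-area fraction. A minimal sketch on a synthetic image; the channel choice and threshold value are illustrative assumptions, not the study's protocol:

```python
import numpy as np

def stained_fraction(red_channel, threshold=150):
    """Fraction of pixels whose red intensity exceeds `threshold`,
    as a crude proxy for metachromatically stained area."""
    mask = red_channel > threshold
    return mask.mean()

# Synthetic 8-bit "red channel": a 100x100 field with a 20x20 stained patch
img = np.full((100, 100), 80, dtype=np.uint8)
img[40:60, 40:60] = 200
fraction = stained_fraction(img)  # patch covers 400 of 10000 pixels
```

Comparing this fraction between control and treated valves is one simple way to put a number on glycosaminoglycan accumulation.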

3 Results

3.1 In-vitro agonist effect of celecoxib and valdecoxib. By means of radioligand binding assays we observed that celecoxib and valdecoxib increased the sensitivity of the 5-HT2B receptor for 5-HT. Celecoxib at 10 µM (a concentration within the therapeutic plasma range [12]) increased the ability (p < 0.01, Student's t test) of physiological concentrations of 5-HT (2.8 nM [13]) to displace the binding of [3H]-LSD at the 5-HT2B receptor. A similar significant effect (p < 0.01, Student's t test) was observed with 10 µM valdecoxib (within its therapeutic range [14]) (Figure 1).

Fig. 1. Celecoxib and valdecoxib potentiated 5-HT binding to 5-HT2B receptors. Inhibition of specific [3H]-LSD binding by 5-HT in the absence (blue column) and in the presence of either 10 µM celecoxib (red column) or 10 µM valdecoxib (green column). Values represent the mean±SD (vertical bars) of three independent experiments (n=3).

3.2 Ex-vivo agonist effect of celecoxib and valdecoxib. Isolated organ functional assays on rat stomach fundus were carried out to assess the effect of both coxibs on the 5-HT2B receptor. Celecoxib and valdecoxib produced only a slight, non-significant fundus contraction (Figure 2).
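The significance claims in 3.1 rest on a two-sample Student's t test comparing displacement with and without the coxib. A sketch with invented triplicate values (illustrative only, not the study's data):

```python
from scipy.stats import ttest_ind

# Hypothetical % displacement of [3H]-LSD by 2.8 nM 5-HT in three
# independent experiments (values are invented for illustration)
control = [1.0, 3.0, 2.0]            # 5-HT alone
with_celecoxib = [68.0, 72.0, 70.0]  # 5-HT + 10 uM celecoxib

t_stat, p_value = ttest_ind(control, with_celecoxib)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")
```

With n = 3 per group, the unpaired test with 4 degrees of freedom is what "p < 0.01, Student's t test" refers to in this kind of comparison.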

Fig. 2. Measurement of the agonist potency of the compounds at the 5-HT2B receptor. CCRC of 5-HT and of the compound under study: A) celecoxib and B) valdecoxib. Points are the mean±SD (vertical bars) of three (n=3) independent experiments.

Celecoxib and valdecoxib by themselves did not cause an appreciable effect. We also studied their modulation of the effect mediated by 5-HT. For this, we constructed CCRC of 5-HT in the absence and presence of 1 µM celecoxib or valdecoxib. We observed that both coxibs potentiated the agonist effect of 5-HT on 5-HT2B receptors in experimental animals (Figure 3).

Fig. 3. Measurement of the agonist effect of 5-HT in the absence and presence of: A) 1 µM celecoxib and B) 1 µM valdecoxib. Points are the mean±SD (vertical bars) of three (n=3) independent experiments.

3.3 Valvular toxicity induced by serotonin. We set up an assay for evaluating valvular toxicity as described by Elangbam et al. [15]. This model will make it possible to test, on cardiac valves ex vivo, the synergistic (allosteric) effects of coxibs, with the results obtained here serving as a positive control of heart valve dysfunction. Rats were treated with serotonin, and the plasma levels of serotonin as well as the response of 5-HT2B receptors in fundus were assessed. 5-HT-treated rats showed a smaller response in isolated organ assays, as well as an increased plasma serotonin concentration, compared to untreated ones (Figure 4).

Advances in Biomedical Informatics: COMBIOMED

Fig. 4. 5-HT treatment induced a desensitization of rat fundus 5-HT2B receptors. A) CCRCs of 5-HT at rat fundus from control (saline-treated) and serotonin-treated rats. B) Plasma levels of 5-HT in control and treated rats.

At the morphological level, glycosaminoglycan accumulations were observed in the treated rats, as well as a loss of stiffness in the valves (Figure 5).

Fig. 5. Valvular morphology evaluation. A) Mitral valve, control; B) mitral valve, treated; C) aortic valve, control; and D) aortic valve, treated.

4 Discussion

Drugs that inhibit COX-2, commonly known as coxibs, were developed in the late 90's with the aim of obtaining analgesics with a lower incidence of gastrointestinal adverse effects. However, an increase in the incidence of serious cardiovascular adverse events led to the withdrawal of several of these drugs, as in the case of rofecoxib (CEOXX, Recox, Vioxx), valdecoxib (Bextra) and lumiracoxib (Prexige, Stellige). Even though different mechanisms have been proposed to explain this ADR, a possible mechanism leading to the cardiovascular adverse effects of coxibs emerged from the bioinformatic analysis of different databases and was associated, by virtual screening, with the stimulation of cardiac 5-HT2B receptors, which has been reported to be associated with mitral valvulopathy [15-18]. For this reason, in the present work the interaction of two coxibs (celecoxib and valdecoxib) with the 5-HT2B receptor was characterized, to assess the possible contribution of this receptor to the cardiac toxicity of these compounds. The affinity of celecoxib and valdecoxib for human 5-HT2B receptors was evaluated by radioligand binding studies, confirming that these compounds have very low affinity for these receptors (around 50% displacement of radioligand binding at 1 mM). However, when we studied the affinity of serotonin for this receptor in the presence and absence of different concentrations of celecoxib and valdecoxib, we found that these compounds significantly increased the affinity of the 5-HT2B receptor for serotonin (p < 0.01, Student's t test). Concentrations close to the therapeutic plasma concentrations of celecoxib [12] and valdecoxib [14] favored 5-HT binding at its physiological plasma levels (2.8 nM [13]). An increase in [3H]-LSD displacement from 0% to about 70% was observed with both compounds.
These assays therefore suggested that celecoxib and valdecoxib modulate 5-HT affinity for 5-HT2B receptors by interacting at a different binding site. To assess the functional impact of the observed binding modulation, functional studies were carried out to determine the degree to which celecoxib and valdecoxib could activate the serotonin 5-HT2B receptor. The tissue used for these studies was the rat stomach fundus, which for decades has been the tissue of choice for functional studies of 5-HT2B receptors [19].

CCRC [20] of 5-HT were performed, obtaining potency values compatible with those reported in the literature for the interaction of 5-HT with 5-HT2B receptors (EC50 obtained = 145 nM vs. EC50 reported = 120 nM [21]). The CCRC of celecoxib and valdecoxib showed a small, non-significant fundus contraction (8.09% and 13.81% of the effect of 5-HT, respectively). When CCRC of 5-HT were constructed in the absence and presence of celecoxib and valdecoxib, we observed that both compounds increased both the potency and the efficacy of 5-HT at the 5-HT2B receptor. This enhancement is consistent with an allosteric agonist behavior [22] of both compounds at the 5-HT2B receptor for serotonin, which may contribute to the cardiovascular toxicity of these compounds. To ascertain whether this 5-HT modulation could lead to the observed ADRs, we conducted a study following the protocol proposed by Elangbam et al. [15] in order to develop the methodology to evaluate valvular damage. For this purpose, rats were treated with serotonin at high doses for 7 days, and it was shown that this treatment not only affects the response of the receptors (a desensitization, as seen in Figure 4) but also produces morphological changes in the valves (Figure 5). As future work we intend to carry out a treatment with therapeutically equivalent doses of coxibs and to quantify the changes that may occur in heart valves as a result of coxib treatment, through the development of image processing software in collaboration with the IMEDIR group from UDC. With all these data together, we will be able to gain a deeper knowledge about the involvement, if any, of coxib-mediated 5-HT2B modulation in heart valve failure.

5 References

1. Mann RD, Andrews EB, editors. Pharmacovigilance. Chichester: John Wiley & Sons; 2002.
2. Olsson S. The role of the WHO programme on International Drug Monitoring in coordinating worldwide drug safety efforts. Drug Saf 1998;19.
3. Ahmad SR. Adverse drug event monitoring at the Food and Drug Administration. J Gen Intern Med 2003;18.
4. Avorn J. Evaluating drug effects in the post-Vioxx world: there must be a better way. Circulation 2006;113.
5. Rodriguez EM, Staffa JA, Graham DJ. The role of databases in drug postmarketing surveillance. Pharmacoepidemiol Drug Saf 2001;10.
6. Meyboom RH, Egberts AC, Edwards IR, Hekster YA, de Koning FH, Gribnau FW. Principles of signal detection in pharmacovigilance. Drug Saf 1997;16.
7. Whitebread S, Hamon J, Bojanic D, Urban L. Keynote review: in vitro safety pharmacology profiling: an essential tool for successful drug development. Drug Discov Today 2005;10.
8. EU-ADR Project website.
9. FitzGerald GA, Patrono C. The coxibs, selective inhibitors of cyclooxygenase-2. N Engl J Med 2001;345.
10. Meade EA, Smith WL, DeWitt DL. Differential inhibition of prostaglandin endoperoxide synthase (cyclooxygenase) isozymes by aspirin and other non-steroidal anti-inflammatory drugs. J Biol Chem 1993;268.
11. Fitzgerald LW, Burn TC, Brown BS, Patterson JP, Corjay MH, Valentine PA, Sun JH, Link JR, Abbaszade I, Hollis JM, Largent BL, Hartig PR, Hollis GF, Meunier PC, Robichaud AJ, Robertson DW. Possible role of valvular serotonin 5-HT(2B) receptors in the cardiopathy associated with fenfluramine. Mol Pharmacol 2000;57.
12. EMEA. Scientific discussion for the approval of Onsenal, 2005 (_Scientific_Discussion/human/000466/WC pdf).
13. Kema IP, de Vries EG, Muskiet FA. Clinical chemistry of serotonin and metabolites. J Chromatogr B Biomed Sci Appl 2000;747.
14. FDA. Bextra labeling, accessed on the FDA webpage.
15. Elangbam CS, Job LE, Zadrozny LM, Barton JC, Yoon LW, Gates LD, Slocum N. 5-hydroxytryptamine (5HT)-induced valvulopathy: compositional valvular alterations are associated with 5HT2B receptor and 5HT transporter transcript changes in Sprague-Dawley rats. Exp Toxicol Pathol 2008;60.
16. Roth BL. Drugs and valvular heart disease. N Engl J Med 2007;356.
17. Schade R, Andersohn F, Suissa S, Haverkamp W, Garbe E. Dopamine agonists and the risk of cardiac-valve regurgitation. N Engl J Med 2007;356.
18. Zanettini R, Antonini A, Gatto G, Gentile R, Tesei S, Pezzoli G. Valvular heart disease and the use of dopamine agonists for Parkinson's disease. N Engl J Med 2007;356.
19. Vane JR. A sensitive method for the assay of 5-hydroxytryptamine. Br J Pharmacol 1957;12.
20. Van Rossum JM. Cumulative dose-response curves II. Technique for the making of dose-response curves in isolated organs and the evaluation of drug parameters. Arch Int Pharmacodyn 1963;143.
21. Villazón M, Enguix MJ, Tristán H, Honrubia MA, Brea J, Maayani S, Cadavid MI, Loza MI. Different pharmacological properties of two equipotent antagonists (clozapine and rauwolscine) for 5-HT2B receptors in rat stomach fundus. Biochem Pharmacol 2003;66.
22. Langmead CJ, Christopoulos A. Allosteric agonists of 7TM receptors: expanding the pharmacological toolbox. Trends Pharmacol Sci 2006;27.

Acknowledgments. This work has been supported by the EU-ADR project (FP7 Grant Agreement) and Combiomed (RD07/0067/0002).

G protein-coupled receptors: targets to efficiently design drugs

Leonardo Pardo, Mercedes Campillo, Gianluigi Caltabiano, Arnau Cordomí, Laura Lopez, Norma Diaz-Vergara, Ivan R. Torrecillas, Jessica Sallander, Angel Gonzalez, Julian Zachman, and Santiago Rios

Laboratori de Medicina Computacional, Unitat de Bioestadistica, Facultat de Medicina, Universitat Autonoma de Barcelona

Abstract. Drug discovery is a key field of scientific and industrial activity of exceptional importance because of its huge impact on healthcare and the economy. The R&D process needed to launch a new drug onto the market is extremely expensive and long. In spite of these difficulties, research into new drugs is intense, since the benefits are very high. It is therefore important to use new technologies in the R&D process of obtaining new compounds.

Keywords. Bioinformatics, drug design, molecular modeling, G protein-coupled receptors

1 Introduction

Membrane receptors coupled to guanine nucleotide-binding regulatory proteins (commonly known as G protein-coupled receptors, GPCRs) comprise one of the widest and most adaptable families of cellular sensors, as they are able to mediate a wide range of transmembrane signal transduction processes. GPCRs constitute one of the largest protein families in mammals and represent 2-3% of the human proteome; recent estimations assign more than 1000 members of this family to the human genome. GPCRs transduce sensory signals of external origin, such as photons, odors or pheromones, and endogenous signals such as neurotransmitters, (neuro)peptides, hormones, and many others, into the cytoplasmic side of the cell membrane. Thus, GPCRs constitute one of the most important groups of pharmaceutical targets, as around 40% of prescribed drugs act through this family of proteins. They are the targets for the majority of the best-selling drugs used today, including many blockbuster drugs such

as antihistamines (Claritin, Semprex), β-blockers (Inderal), H2-blockers (Zantac, Tagamet), opioids (morphine), and bronchodilators (Ventoline, Bricanyl).

2 Orphan GPCRs

With the completion of the human genome, many 'orphan' GPCR genes have been uncovered whose functions are unknown. These are genes that exhibit the seven-helix conformation hallmark of GPCRs but are called 'orphans' because they are activated by none of the primary messengers known to activate GPCRs in vivo: they are the targets of undiscovered transmitters. Yet, because they belong to the supergene family that has the widest regulatory role in the organism, the orphan GPCRs have generated much excitement in academia and industry. Phylogenetic analysis has shown that, among 367 human receptors for endogenous ligands, there are 126 orphan receptors. They hold much hope for revealing new intercellular interactions that will open new areas of basic research, which ultimately will lead to new therapeutic applications.

3 Pathophysiological consequences of GPCR constitutive activation

Single point mutations can render GPCRs constitutively active, i.e. able to signal without the presence of the extracellular transmitter. For a variety of GPCRs, these point mutations have been convincingly linked to human disease. Thus, GPCR malfunction is commonly translated into pathological outcomes.

4 Conventional drug design

Drug design programs aim at compounds that fully activate the receptor (agonists); produce a submaximal activity (partial agonists); block the binding of other drugs without altering the basal activity of the receptor (antagonists); or decrease the basal, agonist-independent level of signaling (inverse agonists).

5 Allosteric ligands

These drug discovery programmes have been dominated by efforts to develop agonists and inverse agonists that act at the orthosteric sites for endogenous ligands.
Such ligands often lack selectivity for specific GPCR subtypes due to the remarkable structural similarity of their drug binding pockets. Novel ligands that act at an allosteric site and enhance (positive allosteric modulators) or decrease (negative allosteric modulators) the response to orthosteric agonists might provide high selectivity and novel modes of efficacy, and may lead to novel therapeutic agents.

6 Bivalent ligands targeting GPCRs as homo- and hetero-oligomers

GPCRs have classically been described as monomeric transmembrane receptors that form a ternary complex: a ligand, the GPCR, and its associated G protein. Thus, conventional drug design targeting GPCRs has mainly focused on the inhibition of a single receptor at a usually well-defined orthosteric binding site. Nevertheless, it is now well accepted that many GPCRs oligomerize in cells. In addition, it has recently been shown that receptor activation is modulated by allosteric communication between the protomers of GPCR dimers. Thus, designing small molecules that target these physiologically relevant GPCR dimers/oligomers, or that inhibit receptor-receptor interactions, might provide new opportunities for novel drug discovery. One proposed approach to target GPCR oligomers is based on the use of so-called bivalent ligands, i.e. ligands composed of two pharmacological recognition units (pharmacophores) covalently linked through a spacer, which may simultaneously target the two receptor orthosteric binding sites on a heterodimer.

7 Virtual screening in drug discovery

Structure-based virtual screening of large compound libraries is a common technique in early-stage drug discovery at most pharmaceutical companies as well as university groups. This initiative is highly beneficial because it facilitates the rapid identification of pharmacological hits, which can later be refined into leads.

8 Multidisciplinary research is an essential driver for innovation

The successful outcome of this type of research project requires collaboration among groups with different expertise. Clearly, experimental data progressively improve the robustness of the theoretical models and their predictive character. We strongly believe that our success in science results from the combination of insights developed from both experiment and theory.
Interdisciplinary research allows knowledge to advance more efficiently.

A Generic Computational Pipeline Architecture for the Analysis of RNA-seq Short Reads

P. Ferreira, D. González, P. Ribeca, M. Sammeth and R. Guigó
Centre de Regulació Genòmica, Barcelona, Spain
May 9

1 Introduction

Recently developed high-throughput technologies for cDNA sequencing, usually called RNA-seq, have already provided valuable insights into the transcriptome characterization of several species [11]. Among these, we can find RNA-seq studies on the eukaryotic transcriptomes of H. sapiens [10, 8], M. musculus [5], A. thaliana [2], S. cerevisiae [6, 12] and S. pombe [1]. As it is inherently a quantitative technology, RNA-seq appears to be a good substitute for microarray experiments [4] for measuring gene expression. It is also particularly powerful for other tasks such as identifying novel genes, correcting annotations, and detecting low-abundance transcripts, alternative isoforms or SNPs [3]. However, the large number (typically tens of millions) of short reads generated by an RNA-seq experiment poses a non-trivial challenge for computational analysis, and in order to achieve the tasks mentioned above, a set of preparatory analysis steps has to be performed. In principle, these steps are common to all RNA-seq studies. While some of them need to be processed serially, others can be done in parallel. We propose and discuss a computational pipeline, based on previous studies and our own experience, that can be used as a methodology for the mapping and preliminary analysis of RNA-seq data. This initial processing is the basis for any further downstream analyses.

2 Method

The initial step of any RNA-seq analysis is usually the mapping of short reads onto a reference. For each read, the location or locations where it can be aligned on the

reference sequence are obtained. There has recently been a large effort in the development of efficient programs, called mappers or aligners, to perform this task. Many of them are already available to the research community (see [9] for a quick review), and more are likely to become so in the near future. The different algorithmic approaches taken by these programs result in different performances and limitations; however, the main goal is the same, and many of the features they offer are shared. The proposed pipeline makes use of some of the features that are common to most of these mappers. One of these is the ability to map with mismatches, which may be single-nucleotide differences or, in some cases, small indels. Allowing for mismatches lets a read that differs from the reference sequence by up to the maximum number of allowed differences still be mapped (see Figure 1 (A)). This is a fundamental feature when trying to align the reads to a reference that does not come from the same individual as the reads, which is the most common case: such reads are likely to have their own set of unique nucleotide differences with respect to the reference, and if no mismatches were allowed we would most likely miss an important number of valid matches. Another parameter required by some mappers is the maximum number of matches that a read may have in the reference in order to be reported. By using this, reads that map in too many places can be immediately rejected, which speeds up the search: once the threshold number of locations is reached, no additional time is spent searching for the remaining ones. Most mappers used for high-throughput sequencing analyses require an index creation step prior to the actual read mapping. This step consists of pre-processing the reference sequence in order to create from it a data structure that can be efficiently searched for read locations.
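The two mapper features just described, a mismatch tolerance and a cap on the number of reported matches, can be illustrated with a deliberately naive scan (real mappers use indexed data structures; the function and sequences below are hypothetical, for illustration only):

```python
def map_read(read, reference, max_mismatches=2, max_matches=10):
    """Naive scan: report every position where `read` aligns to
    `reference` with at most `max_mismatches` substitutions.
    Reject the read entirely once more than `max_matches` hits are found."""
    hits = []
    for pos in range(len(reference) - len(read) + 1):
        window = reference[pos:pos + len(read)]
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append(pos)
            if len(hits) > max_matches:
                return None  # maps in too many places: rejected as ambiguous
    return hits

ref = "ACGTACGTTGCAACGT"
hits = map_read("ACGA", ref, max_mismatches=1)
```

Raising `max_mismatches` recovers reads carrying individual-specific variants; lowering `max_matches` discards repetitive, uninformative reads early.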
The choice of indices may seem trivial, but it is important to decide what to include in them, as this will determine what we will and will not find in the first part of our alignment process. The most obvious index is one containing the genome sequence; however, when examining transcriptome data we also need to take into account the sequences of the splice junctions. In the proposed pipeline three indices are used: genome, splice junctions and transcriptome. Corresponding to this, we also perform three main mapping steps. Genome mapping provides the genomic locations of the input reads; in some mappers this feature may not be exhaustive, and only the best match (based on some pre-defined heuristic) is reported. Sequence reads may also be obtained from regions of the RNA that correspond to splice junctions, and therefore will not map to the genome. To verify whether this is the case, one needs to select those sequences that correspond to the neighborhoods of the splice junctions. Given the reference annotation and the genomic sequence, this is done by extracting and concatenating a segment from each of the joining exons, as exemplified by Figure 1 (B). Since this is a combinatorial approach, an exhaustive generation of all candidate splice junctions would be impracticable in most cases, so a compromise is reached by generating all combinations of exons belonging to the same gene that are biologically feasible (this means they would

be 5′-to-3′ junctions of non-overlapping exons). The length of the exon segments used depends on the read length and the number of allowed mismatches, and is typically given by SegLength = ReadLen − 2 × Mismatches. Finally, reads that do not map either to the genome or to the splice junctions can still map to the transcriptome; this is particularly true for longer reads that span more than one splice junction. In this way, transcriptome mapping may capture previously unpredicted cases. An example of a read in this situation is given in Figure 1 (C), where the read spans three exons due to the small length of the internal exon.

Figure 1: Illustration of different situations: A) mapping of a short read against the reference sequence with 2 mismatches; B) creation of a splice junction sequence by extracting segments from 2 exons; C) example case where the read spans three exons; D) split-mapped read.

The second step of the pipeline consists of filtering the mapped reads for additional analyses and recovering those reads that could not be mapped. Thus, after the mapping to the genome, transcriptome and splice junctions, the reads are divided into two sets: mapped and not-mapped.
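The candidate splice-junction generation described above can be sketched as follows. The exon coordinates, the feasibility test and the segment-length rule (a simplified reading of the SegLength formula above) are illustrative assumptions, not the paper's actual implementation:

```python
# Sketch of the candidate splice-junction library described in the text:
# for each gene, all 5'-to-3' pairs of non-overlapping exons are joined by
# concatenating a terminal segment from each. Coordinates and segment
# length are simplified assumptions for illustration only.

from itertools import combinations

def junction_library(genome, exons, read_len, mismatches):
    """exons: list of (start, end) genomic intervals of one gene, 5'->3'.
    Returns {(i, j): junction_sequence} for each feasible exon pair."""
    seg_len = read_len - 2 * mismatches  # segment taken from each exon
    junctions = {}
    for (i, (s1, e1)), (j, (s2, e2)) in combinations(enumerate(exons), 2):
        if e1 <= s2:  # feasible: donor exon ends before acceptor starts
            donor = genome[max(s1, e1 - seg_len):e1]     # 3' end of upstream exon
            acceptor = genome[s2:min(e2, s2 + seg_len)]  # 5' start of downstream exon
            junctions[(i, j)] = donor + acceptor
    return junctions

genome = "AAACCCGGGTTTAAACCCGGGTTT"
exons = [(0, 6), (9, 15), (18, 24)]  # three non-overlapping exons
lib = junction_library(genome, exons, read_len=6, mismatches=1)
print(sorted(lib))  # junctions (0,1), (0,2) and (1,2)
```

Even this toy gene yields all three pairwise junctions; for real genes with many exons the quadratic growth is why the text restricts generation to biologically feasible combinations within a gene.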

Reads that fall in splice junctions not represented in the reference annotation, or that span indel regions of the reference genome, will be contained in the not-mapped set. To detect these cases, programs called split-mappers can be applied. Such programs try to map reads by splitting them into two parts and mapping the two fragments separately to the genome. Depending on the split-mapper used, the region between the two fragments can be arbitrary (even inter-chromosomal) or bounded by a pre-defined length (see Figure 1 (D)). For the regions where such reads map to be considered candidates for new splice junctions, a set of additional requirements usually needs to be met. These may include the presence of a minimum number n of split-mapped reads clustered in the region (for instance, n ≥ 5), the presence of canonical intron splice motifs (like GT-AG) in the flanking genomic regions, additional support from other sources, or a combination of these lines of evidence.

Finally, an additional step may be performed to recover reads that still remain unmapped after the split-mapping step. This step consists of an iterative mapping applied to this reduced set of reads. Here the number of mismatches is increased to a number proportional to the length of the read, in our case up to 1 mismatch per 25 nt of read length, and the mapping against the genome is repeated, followed by split-mapping. This two-step mapping is repeated on the still-unmapped reads, and after each round the remaining reads are trimmed by a set number of nucleotides at their 3′ end (in our case 10). The recursive mapping ends when all reads are mapped or the length of the reads falls below 25 nt.

From the mapped set, a further selection of reads with a unique match is made. For tasks such as the identification of novel genes or the detection of low-abundance transcripts, the confidence of the results increases if only uniquely mapped reads are considered.
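A minimal sketch of this iterative rescue loop is given below. The mapper calls are stand-in functions (purely hypothetical, replacing real genome and split-mapper invocations); the mismatch scaling, trim size and length cutoff follow the values quoted above:

```python
# Sketch of the iterative rescue step described in the text. `map_reads`
# and `split_map` are hypothetical stand-ins for real mapper calls; each
# takes (reads, allowed_mismatches) and returns {read: location}.

def rescue_unmapped(reads, map_reads, split_map, trim=10, min_len=25):
    mapped = {}
    unmapped = list(reads)
    while unmapped:
        allowed = {r: len(r) // 25 for r in unmapped}  # 1 mismatch per 25 nt
        for mapper in (map_reads, split_map):          # genome, then split-mapping
            hits = mapper(unmapped, allowed)
            mapped.update(hits)
            unmapped = [r for r in unmapped if r not in hits]
        unmapped = [r[:-trim] for r in unmapped]       # trim 10 nt at the 3' end
        unmapped = [r for r in unmapped if len(r) >= min_len]  # stop below 25 nt
    return mapped

# Toy mappers: a read "maps" if it occurs verbatim in the reference.
ref = "A" * 30 + "CGT" * 20
genome_map = lambda reads, allowed: {r: ref.find(r) for r in reads if r in ref}
no_split = lambda reads, allowed: {}

print(rescue_unmapped(["A" * 30, "G" * 40], genome_map, no_split))
```

In this toy run the poly-A read is recovered in the first round, while the poly-G read is trimmed twice and then dropped once it falls below 25 nt, mirroring the termination condition in the text.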
Other tasks, like the calculation of genome/transcriptome coverage, can use all the mapped reads. Finally, a set of format conversion scripts is applied to convert the data from mapper-specific formats to standard formats like GFF, BED or SAM. The data is then ready to be uploaded into genome browsers for visualization and comparison, or to be used in other programs to perform specific analysis tasks.

3 Conclusions

Here we proposed and discussed an architecture for a computational pipeline for the analysis of millions of short reads obtained from high-throughput RNA-seq experiments. This is a general and flexible pipeline that combines contributions of previous studies with our own experience. It consists of three major steps: index creation, read mapping and data filtering/preparation. At the end of these steps, read data is ready for further analyses specific to the study in question. The development of this pipeline was geared towards the GEM mapping tools [7], although in practice any mapper could be used.

References

[1] J. Castle, C. Zhang, J. Shah, A. Kulkarni, A. Kalsotra, T. Cooper, and J. Johnson. Expression of 24,426 human alternative splicing events and predicted cis regulation in 48 tissues and cell lines. Nature Genetics, November 2008.
[2] R. Lister, R. O'Malley, J. Tonti-Filippini, B. Gregory, C. Berry, H. Millar, and J. Ecker. Highly integrated single-base resolution maps of the epigenome in Arabidopsis. Cell, 133(3), May 2008.
[3] S. Marguerat, B. T. Wilhelm, and J. Bähler. Next-generation sequencing: applications beyond genomes. Biochemical Society Transactions, 36(5), October 2008.
[4] J. Marioni, C. Mason, S. Mane, M. Stephens, and Y. Gilad. RNA-seq: an assessment of technical reproducibility and comparison with gene expression arrays. Genome Research, 18(9), September 2008.
[5] A. Mortazavi, B. Williams, K. McCue, L. Schaeffer, and B. Wold. Mapping and quantifying mammalian transcriptomes by RNA-seq. Nature Methods, 5(7), July 2008.
[6] U. Nagalakshmi, Z. Wang, K. Waern, C. Shou, D. Raha, M. Gerstein, and M. Snyder. The transcriptional landscape of the yeast genome defined by RNA sequencing. Science, May 2008.
[7] P. Ribeca et al. GEM mapping tools (in preparation).
[8] M. Sultan, M. Schulz, H. Richard, A. Magen, A. Klingenhoff, M. Scherf, M. Seifert, T. Borodina, A. Soldatov, D. Parkhomchuk, D. Schmidt, S. O'Keeffe, S. Haas, M. Vingron, H. Lehrach, and M. Yaspo. A global view of gene activity and alternative splicing by deep sequencing of the human transcriptome. Science, July 2008.
[9] C. Trapnell and S. Salzberg. How to map billions of short reads onto genomes. Nature Biotechnology, 27(5), 2009.
[10] E. Wang, R. Sandberg, S. Luo, I. Khrebtukova, L. Zhang, C. Mayr, S. Kingsmore, G. P. Schroth, and C. Burge. Alternative isoform regulation in human tissue transcriptomes. Nature, November 2008.
[11] Z. Wang, M. Gerstein, and M. Snyder. RNA-seq: a revolutionary tool for transcriptomics. Nature Reviews Genetics, 10(1):57–63, 2009.
[12] M. Yassour, T. Kaplan, H. Fraser, J. Levin, J. Pfiffner, X. Adiconis, G. Schroth, S. Luo, I. Khrebtukova, A. Gnirke, C. Nusbaum, D. Thompson, N. Friedman, and A. Regev. Ab initio construction of a eukaryotic transcriptome by massively parallel mRNA sequencing. Proceedings of the National Academy of Sciences of the United States of America, February 2009.

Figure 2: Architecture of the computational pipeline for the analysis of short reads.

Towards Openness in Biomedical Informatics

Victor Maojo 1, Ana Jimenez-Castellanos 1, Diana de la Iglesia 1

1 Dept. Inteligencia Artificial, Facultad de Informática, Universidad Politécnica de Madrid, Campus de Montegancedo S/N, Boadilla del Monte, Madrid, Spain. {vmaojo, ajimenez,

Abstract. Over recent years, and particularly in the context of the COMBIOMED network, our biomedical informatics (BMI) group at the Universidad Politécnica de Madrid has pursued several approaches to a fundamental issue: facilitating open access to, and retrieval of, BMI resources, including software, databases and services. In this regard, we have followed various directions: a) a text mining-based approach to automatically build a resourceome, an inventory of open resources; b) methods for heterogeneous database integration, including clinical, -omics and nanoinformatics sources; c) creating various services to provide access to different resources for African users and professionals; and d) an approach to facilitate access to open resources from research projects.

Keywords: Biomedical Informatics. Information retrieval. Web 2.0. Semantic Web.

1 Sharing information about Medical Informatics (MI) and Biomedical Informatics (BI) resources has grown dramatically over the past decade. A broad interest in these fields is leading professionals to produce new materials that can be shared and exchanged with the rest of the scientific community. To address this rapid growth, it is important to collect information and tools using automatic methods. In this regard, our group at the UPM has been working on a series of topics in the context of the COMBIOMED network and the ACTION Grid Project, in which various members of COMBIOMED also actively participated. The Human Genome Project and other -omics projects have likewise strengthened collaborative links among remote institutions that share and exchange software and data.
In contrast, most clinical databases cannot be openly accessible due to privacy issues involving confidential patient information. At the same time, there is already a large number of open-source software tools, created for tasks such as e-learning, in many disciplines. In biology, some examples are BioLogica (for genetics) and Dynamica (for kinematics). In medicine, there are currently proposals for a medical Wikipedia and various public sources of medical images. PubMed is the online, free-access gateway to Medline, a comprehensive bibliographic reference for biomedical researchers and professionals. Medline was a

pay-per-service resource for several decades, until it became freely available to the biomedical community in the 1990s. Medline has had an enormous impact on biomedical research, education and practice. Different Web technologies have been proposed for accessing and sharing remote heterogeneous information from open-source tools. In a recent publication [1], members of our group at the UPM proposed a new method to deal with this challenge. We reviewed several indexes of bioinformatics resources currently available over the Internet. One is BioPortal [2], a web-based repository of biomedical ontology resources developed by members of the US National Center for Biomedical Ontology. This application supports collaborative development of biomedical ontologies. BioPortal includes, among others, the Open Biomedical Resources (OBR) service for annotating and indexing biomedical resources. Resources are annotated using a domain ontology. Other examples of such indexes include the Bioinformatics Links Directory (BLD), a catalogue of links to bioinformatics resources, tools and databases classified into eleven major categories, where resources can be searched using keyword-based queries. Resources are classified according to the type of service they provide, such as databases, tools and (web) services. The index includes both internal and external resources. A consortium of various US National Centers for Biomedical Computing has recently developed another index of bioinformatics resources called iTools [3]. A web-based interface enables researchers to locate the resources they need using advanced search and visual navigation tools. Web-based repositories of bioinformatics resources have thus been built to facilitate access for researchers in the area.
Until now, these systems have been developed and updated manually. Our group therefore proposed a new method, recently reported in a major scientific journal and conference [4], to automate this process. Informatics tools can then become available to the biomedical community and interoperable in actual research scenarios related to the Virtual Physiological Human (VPH). We describe below the fundamentals of this method.

BIRI (BioInformatics Resource Inventory) is a web application that allows users to search for bioinformatics resources (tools, frameworks, repositories, etc.). Searches can be filtered by resource name, category or domain. Resources are classified according to a taxonomy of 9 domains and 28 categories. Domains represent the area of influence/application of resources, e.g. DNA, RNA or proteins, and categories denote the resource functionality or type, e.g. annotate, analyze or database. The taxonomy is based on other existing classifications, such as the Bioinformatics Links Directory. A novel methodology was developed to create the BIRI repository from the scientific literature, based on Natural Language Processing (NLP) and Artificial Intelligence techniques. These methods allow resources to be retrieved, discovered and indexed automatically from manuscripts published in specialized journals in the bioinformatics domain. Extracting information from published papers guarantees that only relevant, peer-reviewed resources are indexed. The name, functionality and URL of each resource are directly extracted from the text (title and abstract). Additionally, resources are automatically annotated with one or several

categories and domains according to the BIRI taxonomy, depending on the textual description contained in the manuscript. The methodology used to create the BIRI repository consists of five main phases: 1) manuscript selection and surrogate generation; 2) surrogate pre-processing; 3) information extraction; 4) resource classification; 5) curation. The BIRI approach presents several advantages over similar existing indexes: i) discovery and classification of resources is performed automatically; ii) the repository of resources can be updated by simply feeding the system with new papers; iii) additional information sources might be used, such as PubMed or Google Scholar; and iv) advanced search capabilities are provided through the web interface. Given the general methodology used in BIRI, a similar approach might be applied in other domains. Currently, some tasks beyond the automatic method must still be carried out manually, such as manuscript selection, taxonomy creation and final curation.

However, whereas sharing data and software tools is frequent in Bioinformatics, Medical Informatics is a discipline with an ongoing, long-running debate about medical Open Source Systems (OSS). One realistic future possibility is to have a pool of medical software systems which can be used on demand and paid for per use. Professionals can access these tools, use them and decide if they want to continue working with them. Such a scenario, proposed by Mandl and Kohane from Harvard, needs a platform and an infrastructure to become feasible [5]. This area of Open Source will surely become a hot topic in the coming years, particularly in the context of the Web.

2 Web 2.0 and 3.0

Over the past ten years, the idea of interactivity evolved from linking and clicking documents to creating and sharing them. Thus, the Web 2.0 has been proposed to facilitate communication and simultaneous work among different groups. Below is a summary of the differences between the Web 2.0 and its previous version:

Table 1.
Differences between the Web 1.0 and Web 2.0

    Web 1.0                  Web 2.0
    Application-based        Web-based
    Isolated                 Collaborative
    Offline                  Online
    Licensed or purchased    Free
    Single creator           Multiple collaborators

Although the use of the WWW is commonly associated with searching for information, this new Web 2.0 infrastructure has enormous potential for developers and practitioners. Medical digital libraries, distributed medical records, and Geographical Information Systems for medical issues, like Google Maps used to graphically

represent the expansion of pandemics, are among the envisioned applications to be collaboratively developed and used by health professionals. While physicians were the initial targets and users of Web-based medical applications, patients are also demanding new applications to improve the quality of medical care. Using the Web, they aim to access second medical opinions, find personalized advice, or contact their physicians or other patients directly.

A new version of the WWW, called the Semantic Web or Web 3.0, emphasizes the use of semantic-based technologies for organizing and structuring the Web by means of ontology-related technologies. Such an approach facilitates tasks such as information storage, retrieval and mining. We have also reported various semantic-based research efforts and technologies [6,7].

3 Importance of Medical Information Systems for developing countries

In this expanded context of the Web 2.0, we have carried out an analysis of activities related to BMI in Africa. For this work, collaboration with an expert from Egypt, Dr. Rada Hussein, has been fundamental. An analysis of the literature carried out by means of BIRI and related tools, combined with manual Medline and Google searches, has suggested enormous opportunities and challenges for transferring results from many previous EU research projects in BMI to African locations, improving medical practice and research. It should be remembered that, in the ICT for Health area of the EC, there have been few contacts with African BMI professionals; thus, there is great room for improvement.

Fig. 1. Members of UPM teaching computer science to young students in Burundi

Global health has experienced significant developments, but efforts for cooperation with underdeveloped countries must increase. Countries like China have recently improved their health indicators substantially, compared with rich countries, where inequalities can still be widely found.
In this context, institutions such as the WHO, a collaborator of our group, have established priorities for improving global health over various decades. These priorities depend on accurate numbers and estimations

extracted from public health systems, which are still largely unknown in many African countries.

Fig. 2. Representation of examples of potential benefits from a transfer of knowledge and systems in BMI from Europe to Africa

4 Costs of medical technology: Information systems

Research on medical technology has had an enormous impact on medicine. In fact, it is usually considered that one of the most significant features defining modern medicine is the advance of medical technologies. Within these technologies, we focus here on medical information systems. In 2002, a survey carried out by the Health Affairs journal among 225 medical internists did not rank the use of computers in medicine among the 30 most significant medical innovations of the preceding decades. Nevertheless, a few years later, another survey, carried out over the Internet by the British Medical Journal, ranked the use of computers in medicine in 10th place among the most significant advances in medicine since the journal was created in 1840.

Table 2. An extract of the results of the British Medical Journal survey in 2007 (% of votes)

    Sanitation                    15.8
    Antibiotics                   14.5
    Anesthesia                    13.9
    Vaccines                      11.8
    Discovery of DNA structure     8.8
    Germ theory                    7.4

Although these kinds of surveys should be interpreted with caution, this last result may indicate a significant shift in the perception of the use of computers in medicine.

In fact, if we consider the time that medical professionals dedicate to information management-related tasks, it was observed as early as 1966 that these rates are quite high, ranging from 95% for medical records professionals to 28% for laboratory workers. For physicians, these rates ranged from 30% to 36%. Since this survey was carried out in 1966, it can be hypothesized that the time dedicated to these tasks may be considerably higher now. Thus, information management is already assumed to be a fundamental component of modern medical practice. Nevertheless, several aspects of medical information systems will surely have a positive effect in terms of cost containment, a process which seems to have already begun:

1. Medical information systems development has reached a level of stability that allows the acceptance of many current or de facto standards by academia and industry, e.g. HL7, DICOM, UMLS, Web components, etc., facilitating systems interoperability and component reuse.

2. The cost and size of computer hardware have been decreasing continuously for decades. Currently, a 20-euro pen drive can store much more data than the heavy storage juke-boxes of 15 years ago, at less than 1% of their price and size. Plans to market personal computers in developing countries at a price below $100 have been proposed for several years.

3. There is an increasing culture of developing open-source software systems and sharing data, which has pervaded related disciplines such as genomics and bioinformatics, allowing many -omics projects to be completed ahead of schedule. At the same time, there are many public databases offering free gene, protein and disease information to the scientific community.
The completion of the Human Genome Project ahead of schedule, due to collaborative efforts and data and software sharing among researchers all over the world, triggered the development of numerous publicly available databases containing gene, protein and disease information, as well as bioinformatics tools. This number is continuously increasing and now exceeds 1,300 public databases. Security, confidentiality and ownership rights have prevented medicine from reaching similar importance and numbers, but an increasing culture of sharing data and software tools, as well as the development of techniques for issues such as reliable anonymization of patient data, could help to expand it. In this regard, an interesting trend is to store these databases using cloud computing techniques. For instance, Amazon provides free storage for some publicly available scientific databases in the genomics area.

5 The explosion of medical information

The development of the World Wide Web has created a new scenario in which people exchange huge amounts of information in all domains. The success of the World Wide Web after 1990, in what could be called the Web 1.0, as mentioned above, caused an explosion in the amount of biomedical information available to practitioners and researchers, and also to patients and the public in general. An

enormous amount of biomedical information, never seen before, has become available for health practice, policy-making and research. In recent years a new approach has met with immediate success. Web services have been defined by the W3C, the WWW consortium, as "a software system designed to support interoperable machine-to-machine interaction over a network". Web services can be accessed over the Internet and executed on a remote system. Using the appropriate standards, such as WSDL, SOAP and others, Web services have been developed for numerous applications, including in biomedicine. Many applications can be run and executed as services without strong computing expertise. Web services can be orchestrated by means of workflows, according to the needs of each user. The development of the Semantic Web, the Intelligent Web, or Web 3.0, where documents and tools can be structured, shared and integrated through intelligent semantic techniques, promises to expand the above ideas. By way of example, a special interest group called the Semantic Web for Health Care and Life Sciences Interest Group was created by the W3C to analyse the impact of the Semantic Web on the biomedical domain. In summary, one of the fundamental goals for the forthcoming years will be to structure information to facilitate search and retrieval. In this regard, regrettably, many results from past research in BMI are difficult to find, even those in the open community. Thus, we have proposed a strategy to facilitate access to such resources, as presented below.

6 A proposal for making open results from biomedical research projects easy to find and access

Wald has addressed scientific openness in a recent Science article [8], including the data and methods used for research.
Advances in software tools for bioinformatics search help [3], but just becoming aware of the open results of research projects funded by public agencies, e.g. databases, software, papers and e-books, and finding them efficiently, still proves harder than it should. In the course of producing an advanced, automatically generated on-line inventory of bioinformatics resources [1], we analyzed results from research projects publicly funded by the European Commission, Spanish agencies and the NIH. We discovered that finding the complete set of available information reported to have been generated by the projects could prove quite elusive. Non-peer-reviewed summary reports were commonplace, but specifics of electronic resources with Web locations frequently were not, even when researchers mentioned their existence as being openly available [9]. To enable searches with sophisticated text mining, publicly funded projects should provide a minimum information set including titles, authors, funding agency, annotations with concepts from ontologies or controlled vocabularies that characterize the functionalities of the resources, papers reporting significant findings using these

resources (peer-reviewed quality indicators), and their Uniform Resource Identifiers (URIs). Earlier suggestions for structuring the abstracts of papers [10] resulted in an experiment with disappointingly limited success [11]. However, providing basic information about resources from projects already on the web ought to be more straightforward. Requiring a minimum information set like the one we propose to be available online, under clearly specified standards, might help bring about more comprehensive open access, which would promote wider reuse of resources and avoid duplication in scientific projects worldwide. Agencies are increasingly requiring that papers reporting research funded by them become publicly available. Our proposal is that they require that other products of research, like open electronic resources that back up a paper's results, be made equally easily accessible. Similarly, the use of text mining techniques can help avoid duplication and plagiarism in proposals, as we have proposed previously in a communication to the journal Nature [12].

References

1. De la Calle G, García-Remesal M, Chiesa S, De la Iglesia D, Maojo V: BIRI: A New Approach for Automatically Discovering and Indexing Available Public Bioinformatics Resources from the Literature. BMC Bioinformatics, 10, 320 (2009)
2. Musen M, Shah N, Noy N, Dai B, Dorf M, Griffith N, Buntrock JD, Jonquet C, Montegut MJ, Rubin DL: BioPortal: Ontologies and Data Resources with the Click of a Mouse. AMIA Annual Symposium Proceedings (2008)
3. Dinov ID, Rubin D, Lorensen W, et al.: iTools: a Framework for Classification, Categorization and Integration of Computational Biology Resources. PLoS ONE, 3(5):e2265 (2008)
4. De la Calle G, García-Remesal M, Maojo V: A Method for Indexing Biomedical Resources over the Internet. Stud Health Technol Inform, 136 (2007)
5. Mandl KD, Kohane IS: No small change for the health information economy. N Engl J Med, 360(13) (2009)
6.
Alonso-Calvo R, Maojo V, Billhardt H, Martin-Sanchez F, García-Remesal M, Pérez-Rey D: An agent- and ontology-based system for integrating public gene, protein, and disease databases. J Biomed Inform, 40(1):17-29 (2007)
7. Pérez-Rey D, Maojo V, García-Remesal M, Alonso-Calvo R, Billhardt H, Martin-Sánchez F, Sousa A: ONTOFUSION: Ontology-based integration of genomic and clinical databases. Comput Biol Med, 36(7-8) (2006)
8. Wald C: Scientists Embrace Openness. Science Issues and Perspectives, Science (2010)
9. Maojo V, Garcia-Remesal M, Crespo J, de la Calle G, de la Iglesia D, Kulikowski C: Open results from biomedical research projects: where are they? ScienceCareers (a section of the Science journal) (2010)
10. Gerstein M, Seringhaus M, Fields S: Structured digital abstract makes text mining easy. Nature, 447(7141):142 (2007)
11. Lok C: Literature mining: Speed reading. Nature, 463(7280) (2010)
12. Maojo V, García-Remesal M, Crespo J: Detectors could spot plagiarism in research proposals. Nature, 456(7218):30 (2008)

Translational Bioinformatics: infectious diseases as a case study

Guillermo López-Campos 1, Isabel Hermosilla 1, Mª Angeles Villarrubia 1, Jose Antonio Seoane 2, Mª Carmen Ramirez-Paniagua 1, Fernando Martín-Sanchez 1, Victoria López-Alonso 1

1 Medical Bioinformatics Dept., Institute of Health Carlos III {glopez,
2 Information and Communications Technologies Department, University of A Coruna

Abstract. Genomics, functional and structural genomics, transcriptomics, proteomics, and immunomics are being exploited for the development of diagnostics and therapeutics to control infectious diseases. Understanding the dynamics of infectious diseases requires a tremendous amount of integrated comparative sequence, expression, and proteomic data from a variety of pathogens (bacteria, viruses, protozoa, fungi), vectors, reservoirs (non-human mammals, the environment) and human hosts. Bioinformatics enables the tasks of generating, linking, analyzing and applying these data for the detection of new pathogens, new virulence factors and antimicrobial resistance determinants, which is an essential public health task. In the near future, translational bioinformatics will facilitate access to systems biology data and infection pathogen-host interactions, enabling appropriate diagnosis and treatment point-of-care (POC) tests for infectious diseases.

Keywords: Bioinformatics, Databases, Systems biology, Personalized medicine, High-throughput, Microarray, Single nucleotide polymorphism (SNP), Pathways, Infectious diseases, Genomics

1 Introduction

The different genome projects, like the 1000 Genomes Project, the Microbiome Project and Genome 10K [1, 2, 3], bring with them large amounts of data that need to be managed and studied.
The discipline of bioinformatics [4, 5] offers many tools to manage this information, but the need to translate and apply to clinical practice all these new developments in basic research has driven the development of the discipline of translational bioinformatics [6]. In recent years, the uses of bioinformatics have multiplied and diversified, evolving from data storage, such as DNA and protein sequence databases, to the

1 Corresponding Author: Victoria López Alonso. Medical Bioinformatics Dept., Institute of Health Carlos III. Ctra. Majadahonda-Pozuelo Km Madrid. Spain.

management, processing and analysis of the huge amounts of data generated by today's experiments, including the most recent Next Generation Sequencing (NGS) devices [7].

Figure 1. New high-throughput genomics technologies and bioinformatics tools are already contributing to studies of infectious disease, with systems biology approaches that address the host, the pathogen, and the interactions between the two and the environment.

Once genome sequences became available, a new scenario was established: post-genomics. In the post-genomic era, different approaches have been developed based on the use and source of data: individual genomics, comparative genomics, functional genomics and proteomics. The role of bioinformatics in all of them is to manage, process and analyse data, seeking and understanding the relationship between structure and function. The integration of the information generated by genomic or proteomic studies with clinical information plays a key role in a new approach to studying diseases. The relationship between inherited traits and clinical features has long been known for monogenic diseases, and the development of different techniques deciphered the association of these traits with modifications in DNA or proteins. Nowadays, the evolution of laboratory techniques towards high-throughput methods has facilitated access to large amounts of genomic and proteomic information that might be associated with physiopathological traits. It is important to note that most prevalent diseases result from a combination of factors, both genetic and environmental. The use of this recently available genomic information, and the study of diseases in this new light, is called genome-based medicine or molecular medicine.

The synergy and collaboration of two established disciplines, bioinformatics, from the genomics side, and medical informatics, from the clinical side, gave rise to a new discipline known as biomedical informatics (BMI) [8]. The development of methods and tools within BMI has brought about many advances in the knowledge and treatment of diseases. In recent years, many clinicians and researchers have come to believe that the application of knowledge obtained in the laboratory to clinical practice, to advance the diagnosis and treatment of diseases, is not dynamic enough. This need to apply in clinical practice the new developments made in basic research, moving from bench-side to bed-side applications, has given rise to a new concept, translational biomedical research. This approach, strongly based on a clinical perspective, has driven the use of the term Translational Bioinformatics, a term adopted mainly by the American Medical Informatics Association (AMIA). Translational bioinformatics is now considered one of the main branches of biomedical informatics, together with clinical informatics and public health informatics.

Most infectious agents are bacteria, viruses, fungi, or parasites; however, only a few hundred are capable of inflicting damage on the human host. Furthermore, the spectrum of human disease caused by a particular pathogen varies greatly depending on ecologic, host-related, and infectious-agent-related factors. The relationship between novel and well-known pathogens and human/zoonotic hosts evolves continuously, and therefore we will be challenged by new pathogens, new disease spectra, and well-known pathogens that have developed alternative mechanisms to persist in their ecologic niche [9].
In this context, translational bioinformatics will help to generate a complete genomic picture of all infectious agents, emerging threats, hosts and reservoirs, enhancing data generation, integration and analysis, and finally its application at the point of care.

2. Contribution of Bioinformatics to diagnosis and therapy of infectious disease

Bioinformatics assists infectious disease control by targeting questions of immediate public health and clinical utility, for example the molecular genotyping of pathogens and hosts in a clinically focused manner. In this scenario, the data related to an infectious disease should include data from both the pathogen and the host. These data, gathered from different sources and repositories, have to be interconnected and presented to the users, in many cases clinical practitioners, in a single, easy-to-use interface. Such a system should include scientific literature, annotated genomic information, functional genomics, proteomic information and even a systems-biology view of the host-pathogen interactions. At least one genome sequence is now available for each major human pathogen. As of April 2011, over 1,700 bacterial genomes were completed, more than 10,416 were

ongoing, 296 metagenomes were available, and over 2,621 viral genomes were completed. Once the complete genome sequence of an organism is available, high-throughput approaches can be used to screen for target molecules. For a bacterial pathogen, which may have more than 4,000 genes, the genome sequence provides information on the genetic factors responsible for specific virulence phenotypes or antibiotic resistance, and the complete genetic repertoire of antigens or drug targets from which novel candidates can be identified. For viral pathogens, which may possess fewer than 10 genes, genomics can be used to define the variability that may exist between isolates. Host genetic factors also play a role in infectious disease [10, 11], and the availability of complete human genome sequences, as well as large-scale human genome projects, are valuable resources. From the outbreak of an infectious disease, metagenomics, the study of all the genetic material recovered directly from a sample, can be applied to aid the rapid identification of the causative agent [12, 13]. All these genomic data need to be linked via dynamic databases associated with a centralized collection of open-source bioinformatics tools. There are different integration systems enabling the study of host-pathogen interactions at a multilevel scale, such as PHI-base [14], PHIDIAS [15], PIG [16], IVDB (Influenza Virus Database) [17], and the NCBI Influenza Virus Database [18].

Analysis and integration of infectious disease genomics data and literature information

The genomics data gathered from new molecular technologies have to be used effectively in infectious disease management and public health surveillance. An important related topic is the management of currently available information. The literature is a major and vast information source that can be explored and analyzed in search of interesting properties and characteristics related to infectious processes.
The development of infectious-disease-focused text mining tools will enable innovative integrated outbreak detection, identification of new sets of molecular markers, and response environments based upon bioinformatics knowledge models. Diverse experimental data types for pathogen-related studies, including molecular interactions, phylogenetic classifications, genomic sequences, protein structure information, gene expression and virulence data, can be integrated from databases and users' files. A standardized vocabulary has been developed to describe isolates and the genes they contain, and is collected in Standards in Genomic Sciences [19]. Text mining is used to extract interaction data from the literature and to integrate structured data, such as that found in relational databases, with unstructured data, such as the literature, making data accessible and usable to life sciences researchers.
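As a toy illustration of the literature-mining idea described above, candidate primer-like sequences can be pulled out of free text with a couple of crude filters. The regular expression, length range and complexity threshold below are illustrative assumptions, not the published method of reference [20]:

```python
import re

# Hypothetical sketch: find substrings that look like PCR primer
# sequences (runs of A/C/G/T of typical primer length) in free text.
PRIMER_RE = re.compile(r"\b[ACGT]{18,30}\b")

def candidate_primers(text: str) -> list[str]:
    """Return primer-like A/C/G/T runs, dropping low-complexity hits."""
    hits = PRIMER_RE.findall(text.upper())
    # Discard low-complexity runs such as poly-A tails.
    return [h for h in hits if len(set(h)) >= 3]

sentence = ("The forward primer ACGTTGACCTGGAAGTCCATGAAC was used "
            "for detection, together with AAAAAAAAAAAAAAAAAAAA as a control.")
print(candidate_primers(sentence))  # → ['ACGTTGACCTGGAAGTCCATGAAC']
```

A real system would add context classification (is the match actually described as a primer or probe?) and validation against sequence databases.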

Nowadays a vast amount of information is available in public literature databases. Open access publishing has made part of this information and data available to the whole scientific community, enabling its use and exploitation through text mining techniques. Despite the efforts to set up public repositories for a wide variety of biological data, a huge amount of information remains in the papers themselves and can only be exploited by means of text mining. An example of this information is the sequences of PCR primers or different probes (such as real-time PCR probes or TaqMan probes) used for microbiological detection or characterization [20]. New open-source bioinformatics software tools are being developed that exploit Web-based services and the increasing computing power provided by academic and commercial cloud computing networks, such as the genome sequence read mapper CloudBurst [21]. Several commercial software providers and some open-source projects are already available that combine aspects of workflow and cloud computing [22], although the current limitation is Internet bandwidth.

Figure 2. For infectious disease personalized medicine it is necessary to know the causative agent of an infection, using a variety of screening approaches that focus on genome, transcriptome and proteome information (omics sciences), and also to study each person's response to the pathogen as a function of the human genotype.

Comparative genomics for diagnosis of emerging infectious diseases

The rapid identification and characterization of pathogens in clinical samples is vital to improving patient care. Clinicians need information on antibiotic susceptibilities as well as clinically relevant features such as virulence factors and toxin production. From whole-genome sequencing studies has come the idea of the pan-genome, which consists of the set

of genes conserved across all strains in a species (the core genome) plus the genes that contribute to diversity, including antibiotic resistance, virulence and transmissibility [23]. To facilitate the identification of antigenic diversity and variation, genomes of pathogens will be compared with those of related non-pathogenic strains. Genomes of the isolates under study will be compared with reference genomes (i.e. completely sequenced genomes). Genomic variability and antigenic diversity of pathogenic strains will also be addressed by analysing single nucleotide polymorphisms (SNPs) and by microarrays representing whole genomes of diverse species or strains. Most current tools for whole-genome alignment and pan-genome calculation (for instance, all-versus-all BLAST alignments) fall off dramatically in performance as data sets become very large, creating the need for new ways to visualize and navigate the output of these programs. Software such as Ergatis [24] or the bacterial annotation pipeline DIYA [25] allows programs to be run on a library of genome sequences in an automated, parallelized fashion. Many pathogens are becoming increasingly resistant to available drugs and antibiotics. Comparative genomics will be used to gain a better understanding of genome plasticity, gene pools, and the transfer of virulence and resistance determinants, as well as to develop new treatment and prevention strategies to reduce hospital infections. Comparative genome analysis coupled with bioinformatics tools that identify DNA signatures, such as Insignia [26], has been used to select unique targets within a particular lineage with high specificity and sensitivity.

Functional and structural genomics to study infection
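At its simplest, the core-genome/pan-genome distinction sketched above reduces to set operations over per-strain gene inventories. The strains and gene names below are invented for illustration; real pipelines must first cluster raw sequences into gene families before such sets exist.

```python
# Minimal sketch of the core-genome / pan-genome idea: the core genome
# is the intersection of per-strain gene sets, the pan-genome the union,
# and the accessory genome the difference between the two.
strains = {
    "strain_A": {"gyrB", "rpoB", "toxA", "mecA"},
    "strain_B": {"gyrB", "rpoB", "toxA"},
    "strain_C": {"gyrB", "rpoB", "blaZ"},
}

core_genome = set.intersection(*strains.values())
pan_genome = set.union(*strains.values())
accessory = pan_genome - core_genome  # variable genes: resistance, virulence...

print(sorted(core_genome))   # genes shared by every strain
print(sorted(accessory))     # genes present only in some strains
```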
Pathogen genes that are up-regulated during infection and/or essential for microorganism survival or pathogenesis can be identified using transcriptomics, the analysis of a near-complete set of RNA transcripts expressed by the pathogen under a specified condition. DNA microarrays [27] and ultra-high-throughput sequencing technologies [28] make it possible to characterize the transcriptome of a pathogen. The proteome of a pathogen can also be screened to identify the immunome, the complete set of pathogen proteins or epitopes that interact with the host immune system, using in vitro or in silico techniques [29]. Functional genomics, through transcriptomics and proteomics, links genotype to phenotype and has been applied to many pathogens to identify genes essential to survival or virulence. Structural genomics, the study of the three-dimensional structures of proteins produced by infectious agents, is increasingly being applied to vaccine and drug development as a result of the great amount of genome and proteome data [30]. More than 72,550 protein structures are available in public databases [31], with a focus on determining the structural basis of immune antigens as well as protein active sites and potential drug-binding sites [32]. The Functional Genomics Data Society (FGED Society) works towards providing concrete solutions for defining minimum information

specifications for reporting data in functional genomics papers; these specifications have already enabled large data sets to be used and reused to their full potential in medical research. Open access to large numbers of sequences and associated metadata allows powerful comparative genomic analyses and thus provides major insights into the characteristics of a pathogen; the Generic Feature Format (GFF3) is the standardized data interchange format for genome annotations.

Computational methods for microarrays and next-generation sequencing applications in infectious diseases

Low- or medium-density arrays are the most promising applications of microarray technology in clinical microbiology [33]. The use of broad-range or multiplex PCR, followed by microarray analysis, offers an excellent platform for rapid and efficient identification of bacterial, viral, or fungal pathogens [34]. The development of such approaches also requires the support of bioinformatics, which provides access to the sequence databases and analysis tools required for selecting and identifying the regions best suited for the design of primers or probes. In many cases the goal of these solutions in clinical microbiology is to refine the identification of the pathogenic agent down to the strain, subtype, genotype or serotype [35]. In these circumstances bioinformatics, by means of phylogenetic analysis, provides essential support for the development of such tests and tools. Some examples of this kind of approach are already commercialized for the detection of respiratory viruses, such as INFINITI RVP (respiratory viral panel) (AutoGenomics Inc, Vista, California), the ResPlex II assay (Qiagen, Germantown, Maryland), xTAG RVP (Luminex Molecular Diagnostics, Toronto, Ontario, Canada), NGEN Respiratory Virus ASR (analyte-specific reagent) (Nanogen, San Diego, California), and MultiCode-PLx RVP (EraGen Biosciences, Madison, Wisconsin).
DNA microarrays can also be used to screen libraries of pathogen mutants, by comparing isolates from before and after passage through animal models or exposure to different drugs [36]. For the diagnosis of bacterial infections, microarrays are also being incorporated into clinical laboratories for the rapid detection of antimicrobial drug resistance in several pathogens [37]. Several software programs have been developed for choosing probes representing the pan-genome of pathogens [38], genotyping of microbial species [39], comparative analysis of species and strains [40], determination of antimicrobial resistance [41] and multi-pathogen diagnostic arrays [42]. Some examples are TOFI (tool for oligonucleotide fingerprint identification), YODA (yet another oligonucleotide design application) [43], AlleleID [44] and PanArray [45]. The immunome can also be identified and screened in vitro, based on the idea that the antibodies present in the serum of a host that has been exposed to a pathogen represent a molecular fingerprint of the pathogen's immunogenic proteins and can be used to identify vaccine candidates; this screening can be performed using

protein microarrays. Protein microarrays, in which proteins from the pathogen are spotted onto a microarray chip, can also be used to characterize protein-drug interactions. Another promising field for protein microarrays is the development of ELISA-like systems for the detection of microorganisms. Based on the same approach, these devices can also be used to analyze and study the evolution of infections and host responses by examining the immune response mounted against the pathogenic agent; this approach has been used, for example, to study differences in the cross-reactivity of the immune response against HIV infection [46]. Next-generation DNA sequencing (NGS) technologies continue to improve, providing ever greater sequencing depth while requiring lower sample input amounts. In recent years the evolution of NGS techniques has shown that data generation is evolving faster than storage and computing capabilities, so data management and analysis are an emerging challenge for bioinformatics. The massive amount of data produced by next-generation sequencing requires the parallelization of processing algorithms on high-RAM workstations and the ability to store and transfer hundreds of terabytes of data efficiently. In recent years the community has become accustomed to online and web-based tools, and NGS data also represent a challenge for communication infrastructures, which must efficiently transfer data from users to the servers where the analysis software is installed. Nevertheless, there is a broad variety of online tools available for NGS data analysis and annotation [47]. In NGS analysis pipelines, tools developed for the annotation of traditionally obtained sequences coexist with newly developed tools adapted to the specific requirements of NGS, such as those related to sequence assembly.
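To illustrate why sequence assembly needs NGS-specific tooling, here is a hedged sketch of the classic greedy assembly idea: repeatedly merge the pair of reads with the longest suffix-prefix overlap. The reads are toy fragments; production assemblers use indexed overlap or de Bruijn graphs rather than this quadratic all-pairs search.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of a matching a prefix of b."""
    for size in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:size]):
            return size
    return 0

def greedy_assemble(reads):
    """Greedily merge the best-overlapping read pair until none remains."""
    reads = list(reads)
    while len(reads) > 1:
        best = max(((overlap(a, b), a, b)
                    for a in reads for b in reads if a != b),
                   key=lambda t: t[0])
        olen, a, b = best
        if olen == 0:
            break  # no pair overlaps: stop with multiple contigs
        reads.remove(a)
        reads.remove(b)
        reads.append(a + b[olen:])
    return reads

print(greedy_assemble(["ACGTTG", "TTGCAA", "CAAGGT"]))  # → ['ACGTTGCAAGGT']
```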
The different fields of application of NGS also represent another challenge for bioinformatics data analysis and integration. Metagenomic approaches, whose purpose is to identify the different populations of organisms present in a sample, pose different challenges and analyses than the de novo sequencing or resequencing of a microorganism. Lastly, the integration of NGS results with clinical record annotation is a very active topic in biomedical informatics, and tools and methods are required for this purpose. In any case, better and more established bioinformatics tools are needed for next-generation DNA sequencing: algorithms have to be more time-efficient, robust, accurate and scalable to allow meaningful analysis, with better graphical interfaces and data outputs that will speed the interpretation and publication of meaningful results.
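The metagenomic identification problem mentioned above is often approached by k-mer matching. A minimal sketch, with made-up reference fragments standing in for genomes, assigns each read to the reference sharing the most k-mers:

```python
from collections import Counter

def kmers(seq, k=5):
    """All k-length substrings of seq."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

# Made-up reference fragments; real classifiers index full genomes.
references = {
    "pathogen_X": "ACGGTTCAGGACTTACGGA",
    "commensal_Y": "TTGGCCAATTGGCCAATTG",
}
ref_index = {name: kmers(seq) for name, seq in references.items()}

def classify(read):
    """Assign a read to the reference sharing the most k-mers with it."""
    votes = Counter({name: len(kmers(read) & idx)
                     for name, idx in ref_index.items()})
    name, score = votes.most_common(1)[0]
    return name if score > 0 else "unclassified"

print(classify("TCAGGACTTA"))   # shares k-mers only with pathogen_X
print(classify("GGGGGGGGGG"))   # matches nothing: unclassified
```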

2.4. Systems biology: study of microbial populations and host-pathogen interactions

High-throughput technologies enable the measurement and cataloguing of genes, proteins, interactions, and behavior. Systems biology studies the interactions among biological elements to understand diseases at the system level, providing global insight into cellular behavior in response to diverse infections; this approach leads to the development of new effective strategies for combating bacterial infections, for example, or for improving interactions with beneficial bacteria. A plethora of tools and software for biological systems modeling have been developed and are available for download [48]. An evolution towards a dynamic view of pathogen populations is required for a full comprehension of in vivo cellular behaviour. Microbial populations are not isolated entities, and the global microbial regulatory picture is becoming increasingly complex with the recent discovery in many prokaryotic organisms of numerous sRNA molecules and protein modifications that have to be mapped and functionally characterized. Systems biology has to evolve towards more community-based approaches. The use of metabolomics to analyse the diversity and potential activity of microbial populations will also contribute to the identification of new targets for therapeutic treatment. Transcriptome analysis of the microbe and the host, together with the analysis of mutant strains, will allow the integration of regulatory pathways and the cell cycle. With a comprehensive examination of structures, functions, and the relationships between them at the molecular level, we can scale up to higher levels to gain a more complete view of how the immune system works and interacts with other systems. Host immune-pathogen interactions also have crucial impacts on pathogen evolution, pathogenesis, and immunogen design.
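The network view of host-pathogen interactions described in this section can be sketched with a plain adjacency map. The protein names below are borrowed from influenza biology purely for illustration; a real analysis would draw its interactions from resources such as PIG or VirusMINT.

```python
from collections import defaultdict

# Toy host-pathogen protein-protein interaction list.
interactions = [
    ("viral_NS1", "host_TRIM25"),
    ("viral_NS1", "host_PKR"),
    ("viral_PB2", "host_MAVS"),
    ("viral_NS1", "host_MAVS"),
]

# Build an undirected network as an adjacency map of sets.
network = defaultdict(set)
for a, b in interactions:
    network[a].add(b)
    network[b].add(a)

# Rank host proteins by how many pathogen proteins contact them;
# highly connected host proteins are candidate "hubs" of the infection.
host_degree = {n: len(network[n]) for n in network if n.startswith("host_")}
print(sorted(host_degree.items(), key=lambda kv: -kv[1]))
# → [('host_MAVS', 2), ('host_TRIM25', 1), ('host_PKR', 1)]
```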
There is an increasing number of examples of such systems-biology-based approaches for the study of host-pathogen interactions [49]. These studies give a global view of all the processes occurring during infection by monitoring the changes in both the host and the pathogen. The purpose is to move beyond merely recording the changes due to the infectious process towards an integrated analysis in which all these changes are incorporated into biological pathways and processes, providing a better and more biologically meaningful understanding of the mechanisms involved in host-pathogen interactions. Biomedical informatics plays a key role in these new approaches, using bioinformatics tools such as network representation and generation or microarray data analysis, and combining them with clinical data. This integrative step is crucial, since it provides the essential clinical phenotypic data related to the analysed pathogenic processes. Metabolic networks are inferred using computational methods, and there are several databases and tools for pathway and interaction analyses in immune responses. For example, InnateDB provides information on interactions and signaling pathways associated with the innate immune response to microbial infections in humans and mice. The Pathogen Interaction Gateway (PIG) [16] is a

database of known host-pathogen interactions, JenPep contains quantitative binding data for immunological protein-peptide interactions, and VirusMINT [50] collects data on protein interactions between viral and human proteins. For the prediction of protein localization sites in cells, the tool PSORT can be used [51], and the Database of Interacting Proteins (DIP) (http://dip.doe- [52] stores information about experimentally determined interactions between proteins. Cytoscape [53] is a software tool for visualizing molecular interaction networks and integrating them with other data. General protein-protein interaction, gene network, and pathway databases such as the Kyoto Encyclopedia of Genes and Genomes (KEGG) and Reactome are also useful for this level of study. Many modeling tools use the Systems Biology Markup Language (SBML) for portable model description; however, physiology modeling software is not yet well integrated with molecular and cellular modeling tools.

3. Opportunities and challenges

Technological advances for basic scientific discovery (such as next-generation sequencers, microarrays, mass spectrometers and cell-based assay methods) are novel techniques that increase the throughput of data, and these data need to feed clinical decision-making and surveillance tools (such as point-of-care diagnostics and rapid multipathogen assays). To reach this point, bioinformatics tools will need to be developed that allow clinicians to make rapid and accurate diagnostic and treatment decisions using classifiers that incorporate data from population, functional and structural genomics (e.g. transcriptomics, proteomics) and pathway models. Future analysis will require powerful new bioinformatics tools in conjunction with new computer systems engineered with genomic analysis in mind.

References

1. Pennisi E. Human genome 10th anniversary. Digging deep into the microbiome. Science. 25: (2011).
2.
DePristo MA, Banks E, Poplin R, Garimella KV, Maguire JR, Hartl C, Philippakis AA, Del Angel G, Rivas MA, Hanna M, McKenna A, Fennell TJ, Kernytsky AM, Sivachenko AY, Cibulskis K, Gabriel SB, Altshuler D, Daly MJ. A framework for variation discovery and genotyping using next-generation DNA sequencing data. Nat Genet. 43(5):491-8 (2011).
3. Genome 10K Community of Scientists. Genome 10K: a proposal to obtain whole-genome sequence for vertebrate species. Journal of Heredity. 100(6): (2009).
4. Trifonov EN. Earliest pages of Bioinformatics. Bioinformatics. 16(1):5-9 (2000).
5. Roberts RJ. The Early Days of Bioinformatics Publishing. Bioinformatics. 16(1):2-4 (2000).

6. Martin-Sanchez F, Hermosilla Gimeno I. Translational Bioinformatics. Stud Health Technol Inform. Vol 151 (2010).
7. Bairoch A. Serendipity in Bioinformatics, the Tribulations of a Swiss Bioinformatician through Exciting Times. Bioinformatics. Vol 16 (2000).
8. Martín-Sanchez F, et al. Synergy between medical informatics and bioinformatics: facilitating genomic medicine for future health care. Journal of Biomedical Informatics. 37(1) (2004).
9. Olano JP, Walker DH. Diagnosing emerging and reemerging infectious diseases: the pivotal role of the pathologist. Arch Pathol Lab Med. 135(1):83 (2011).
10. Casanova JL, Abel L. Human genetics of infectious diseases: A unified theory. EMBO J 26 (2007).
11. Burgner D, Jamieson SE, Blackwell JM. Genetic susceptibility to infectious diseases: Big is beautiful, but will bigger be even better? Lancet Infect Dis 6 (2006).
12. Nakamura S, Yang CS, Sakon N, Ueda M, Tougan T, et al. Direct metagenomic detection of viral pathogens using an unbiased high-throughput sequencing approach. PLoS ONE. e4219 (Epub Jan 19).
13. Bittar F, Richet H, Dubus JC, Reynaud-Gaubert M, Stremler N, et al. Molecular detection of multiple emerging pathogens in sputa from cystic fibrosis patients. PLoS ONE 3 (2008).
14. Winnenburg R, Urban M, Beacham A, Baldwin TK, Holland S, Lindeberg M, Hansen H, Rawlings C, Hammond-Kosack KE, Kohler J. PHI-base update: additions to the pathogen-host interaction database. Nucleic Acids Res 36:D (2008).
15. Xiang Z, Tian Y, He Y. PHIDIAS: a pathogen-host interaction data integration and analysis system. Genome Biology 8(7):R150 (2007).
16. Driscoll T, Dyer MD, Murali TM, Sobral BW. PIG: the pathogen interaction gateway. Nucleic Acids Res 37:D647-D650 (2009).
17. Chang S, Zhang J, Liao X, Zhu X, Wang D, Zhu J, Feng T, Zhu B, Gao GF, Wang J, Yang H, Yu J, Wang J. Influenza Virus Database (IVDB): an integrated information resource and analysis platform for influenza virus research. Nucleic Acids Res 35:D376-D380 (2007).
18.
Bao Y, Bolotov P, Dernovoy D, Kiryutin B, Zaslavsky L, Tatusova T, Ostell J, Lipman D. The Influenza Virus Resource at the National Center for Biotechnology Information. J Virol 82(2) (2008).
19. Garrity GM, Field D, Kyrpides N, Hirschman L, Sansone SA, et al. Toward a standards-compliant genomic and metagenomic publication record. OMICS 12:157 (2008).
20. García-Remesal M, Cuevas A, López-Alonso V, López-Campos G, de la Calle G, de la Iglesia D, Pérez-Rey D, Crespo J, Martín-Sánchez F, Maojo V. A method for automatically extracting infectious disease-related primers and probes from the literature. BMC Bioinformatics 11 (2010).
21. Schatz MC. CloudBurst: Highly sensitive read mapping with MapReduce. Bioinformatics 25 (2009).
22. Goecks J, et al. Galaxy: a comprehensive approach for supporting accessible, reproducible, and transparent computational research in the life sciences. Genome Biol 11:R86 (2010).
23. Bentley S. Nat Rev Microbiol. (4).
24. Orvis J, et al. Ergatis: a web interface and scalable software system for bioinformatics workflows. Bioinformatics 26 (2010).
25. Stewart AC, et al. DIYA: a bacterial annotation pipeline for any genomics lab. Bioinformatics 25 (2009).

26. Phillippy AM, Ayanbule K, Edwards NJ, Salzberg SL. Insignia: a DNA signature search web server for diagnostic assay development. Nucleic Acids Res 37:W (2009).
27. Dhiman N, Bonilla R, O'Kane DJ, Poland GA. Gene expression microarrays: A 21st century tool for directed vaccine design. Vaccine 20 (2001).
28. Morozova O, Marra MA. Applications of next-generation sequencing technologies in functional genomics. Genomics 92 (2008).
29. De Groot AS, McMurry J, Moise L. Prediction of immunogenicity: in silico paradigms, ex vivo and in vivo correlates. Curr Opin Pharmacol 8 (2008).
30. Lundstrom K. Structural genomics and drug discovery. J Cell Mol Med 11 (2007).
31. Todd AE, Marsden RL, Thornton JM, Orengo CA. Progress of structural genomics initiatives: An analysis of solved target structures. J Mol Biol 348 (2005).
32. Nicola G, Abagyan R. Structure-based approaches to antibiotic drug discovery. Curr Protoc Microbiol. Chapter 17 (2009).
33. Mancini N, Carletti S, Ghidoli N, Cichero P, Burioni R, Clementi M. The era of molecular and other non-culture-based methods in diagnosis of sepsis. Clin Microbiol Rev 23(1) (2010).
34. Mikhailovich V, Gryadunov D, Kolchinsky A, Makarov AA, Zasedatelev A. DNA microarrays in the clinic: infectious diseases. Bioessays 30(7) (2008).
35. Sakata T, Winzeler EA. Genomics, systems biology and drug development for infectious diseases. Mol Biosyst 3 (2007).
36. Lopez-Campos G, et al. MIM.
37. Monecke S, Ehricht R. Rapid genotyping of methicillin-resistant Staphylococcus aureus (MRSA) isolates using miniaturised oligonucleotide arrays. Clin Microbiol Infect 11(10) (2005).
38. Kostrzynska M, Bachand A. Application of DNA microarray technology for detection, identification, and characterization of foodborne pathogens. Can J Microbiol 52:1-8 (2006).
39. Wang Q, et al. Development of a DNA microarray for detection and serotyping of enterotoxigenic Escherichia coli. J Clin Microbiol 48 (2010).
40. Stabler RA, et al.
Comparative phylogenomics of Clostridium difficile reveals clade specificity and microevolution of hypervirulent strains. J Bacteriol 188 (2006).
41. Gryadunov D, et al. Evaluation of hybridisation on oligonucleotide microarrays for analysis of drug-resistant Mycobacterium tuberculosis. Clin Microbiol Infect 11 (2005).
42. You Y, et al. A novel DNA microarray for rapid diagnosis of enteropathogenic bacteria in stool specimens of patients with diarrhea. J Microbiol Methods 75 (2008).
43. Nordberg EK. YODA: selecting signature oligonucleotides. Bioinformatics 21 (2005).
44. Apte A, Singh S. AlleleID: a pathogen detection and identification system. Methods Mol Biol 402 (2007).
45. Phillippy AM, et al. Efficient oligonucleotide probe selection for pan-genomic tiling arrays. BMC Bioinformatics 10:293 (2009).
46. Aziz RK, Bartels D, Best AA, DeJongh M, Disz T, Edwards RA, Formsma K, Gerdes S, Glass EM, Kubal M, Meyer F, Olsen GJ, Olson R, Osterman AL, Overbeek RA, McNeil LK, Paarmann D, Paczian T, Parrello B, Pusch GD, Reich C, Stevens R, Vassieva O, Vonstein V, Wilke A, Zagnitko O. The RAST Server: Rapid Annotations using Subsystems Technology. BMC Genomics (2008).

47. Yan Q. Immunoinformatics and systems biology methods for personalized medicine. Methods Mol Biol 662 (2010).
48. Lynn DJ, Chan C, Naseer M, Yau M, Lo R, Sribnaia A, Ring G, Que J, Wee K, Winsor GL, Laird MR, Breuer K, Foroushani AK, Brinkman FS, Hancock RE. Curating the innate immunity interactome. BMC Syst Biol 4:117 (2010).
49. Bermejo-Martin JF, Martin-Loeches I, Rello J, Antón A, Almansa R, Xu L, Lopez-Campos G, Pumarola T, Ran L, Ramirez P, Banner D, Cheuk Ng D, Socias L, Loza A, Andaluz D, Maravi E, Gómez-Sánchez MJ, Gordón M, Gallegos MC, Fernandez V, Aldunate S, León C, Merino P, Blanco J, Martin-Sanchez F, Rico L, Varillas D, Iglesias V, Marcos MÁ, Gandía F, Bobillo F, Nogueira B, Rojo S, Resino S, Castro C, Ortiz de Lejarazu R, Kelvin D. Host adaptive immunity deficiency in severe pandemic influenza. Crit Care 14(5):R167 (2010).
50. Chatr-aryamontri A, Ceol A, Peluso D, et al. VirusMINT: a viral protein interaction database. Nucleic Acids Res 37:D669-D673 (2009).
51. Horton P, Park KJ, Obayashi T, et al. WoLF PSORT: protein localization predictor. Nucleic Acids Res 35:W585-W587 (2007).
52. Salwinski L, Duan XJ, et al. DIP, the Database of Interacting Proteins: a research tool for studying cellular networks of protein interactions. Nucleic Acids Res 30 (2002).
53. Cline MS, Smoot M, Cerami E, et al. Integration of biological networks and gene expression data using Cytoscape. Nat Protoc 2 (2007).

New Computer Tools in Nutritional Genomics: e-24h recall with OntoRecipe

Oscar COLTELL 1,2, Antonio FABREGAT 1,2, Eduardo AÑÍBARRO 1, María ARREGUI 1,2, Olga PORTOLÉS 1,2, Elisabet BARRERA 1, Dolores CORELLA 3,4
1 BioinfoGenómica. Department of Computing Languages and Systems. Universitat Jaume I. Castellon. Spain
2 RETIC «COMBIOMED». ISCIII. Madrid. Spain
3 Department of Preventive Medicine. School of Medicine. University of Valencia. Valencia. Spain
4 CIBER «Fisiopatología de la Obesidad y Nutrición». ISCIII. Madrid. Spain

Abstract. This paper presents a new approach to recipe management as a complementary service in a Web application implementing the "24-Hour Food Recall" in nutritional genomics. The aim of this development, called OntoRecipe, has been to speed up the recording of such food surveys as digital records. In this project, the concept of the EpiRecipe has been created as a superstructure that defines sets of recipes over standard recipes. The instantiation of EpiRecipes facilitates the execution of 24-Hour Food Recalls by allowing the customization of standard recipes. The system simulates the behavior of a light ontology using relational databases, and it is intuitive and easy to handle. OntoRecipe has been integrated into the NutriGenOntology platform, an architecture of ontologies that supports research in Nutritional Genomics.

1 Introduction

The main goal of nutritional genomics is to gain knowledge about the interactions between dietary factors and the genome that modulate phenotypic expression. This knowledge could explain the genetic basis of the interindividual response to diet and the reasons for the different clinical phenotypes observed for the same genetic variant (Corella D et al., 2009). Research into nutritional genomics is a leading subject in current biomedical research.
However, despite the huge promises made in numerous articles on this subject, we need to underscore that nutritional genomics is a discipline still in its infancy, and more progress is needed before practical tools can be developed for the prevention and treatment of diseases. To increase the validity of individual studies in nutritional genomics, it is critical to control the potential information and selection biases that may hinder replication. In experimental studies, these potential biases are minimized. However, the difficulty of conducting dietary intervention studies in large samples is a current limitation in

nutritional genomics. In observational studies (i.e., cohort, case-control, cross-sectional), the researcher does not provide the diet and has to gather nutritional information from dietary questionnaires (Li et al., 2006; Pinheiro et al., 2005). Therefore, high-quality dietary information in these epidemiological studies is crucial for minimizing information bias. Traditional dietary instruments (i.e., diet records, 24-hour recalls, food-frequency questionnaires) should be improved and tailored to the specific gene-diet interaction measured. One of the best dietary instruments is the 24-Hour Food Recall (24HFR). The 24-Hour Food Recall is an in-depth interview that collects detailed information on all foods and beverages consumed by a participant during the previous 24 hours. A single 24HFR is not considered representative of habitual diet at the individual level, but it is adequate for surveying intake in a large group and estimating group mean intakes. Repeated 24HFRs can be employed to assess a typical diet at the individual level, although this adds complexity to the data management process. Respondents are asked to report everything that they had to eat or drink (e.g. on the previous day between midnight and midnight) in an uninterrupted, free-flowing list. For each item of food or drink in the quick list, respondents are asked to provide additional detail, including: the time at which the food or drink was consumed; a full description of the food or drink, including brand name where available; any foods likely to be eaten in combination, e.g. milk in coffee; recipes and other combinations of foods, e.g. sandwiches; and the quantity consumed, based on household measures, photographs of different portion sizes of foods, or actual weights from labels or packets (Serra Majem et al., 2003).
In the 24HFR process, transforming a food into its components (energy, nutrients and non-nutrients) is a tedious task, especially when there are many precooked meals and these are not disaggregated. Additionally, nutritional study participants to whom the 24HFR is administered often report the consumption of precooked meals without specifying their composition. Traditionally, nutritional studies have used paper questionnaires to administer 24HFRs in several formats: open, semi-formal or formal forms. Figure 1 shows an example of a formal form. Paper 24HFRs are used in recent studies (Freedman et al., 2011; Takahashi et al., 2010; Jaime et al., 2010; Thakwalakwa et al., 2011). Moreover, the 24HFR is one method used to assess the validity of other dietary intake instruments such as the semi-quantitative Food Frequency Questionnaire (FFQ) (Brinkman et al., 2011; Huang et al., 2011; Paxton et al., 2011; Martin-Prevel et al., 2010).

Fig. 1. A formal 24HFR form. Source:

Currently, there are few nutritional studies that apply electronic 24HFRs. The International Vitamin A Consultative Group and Helen Keller International developed a simplified semi-quantitative dietary method to identify groups at risk of suboptimal intake, and thus deficiency, of vitamin A, including an interactive 24-hour recall method that has been specially modified to collect such information from rural populations in developing countries ( 24-hour-recall-assessing-adequacy-iron-and-zinc-intakes-developing-countries). Another case is the Automated Self-administered 24-hour Dietary Recall (ASA24) of the US National Cancer Institute (USNCI, 2011). ASA24 is a software tool that enables automated and self-administered 24-hour dietary recalls. The format and design of ASA24 are based on a modified version of the interviewer-administered Automated Multiple Pass Method (AMPM) 24-hour recall developed by the U.S. Department of Agriculture (USDA). AMPM uses multi-level food probes to accurately assess food types and amounts. AMPM was adapted to enable the development of a computer-based, self-administered 24-hour recall in which the food list from which respondents select their intakes for the previous day includes all foods available from USDA's Food and Nutrient Database for Dietary Studies (FNDDS). Although those studies apply automated data acquisition and management, the process of transforming a food into its components remains complex. To speed up this process, we have developed a recipe management system called "OntoRecipe" integrated with a Web application to administer 24HFRs. OntoRecipe is a lightweight ontology (RDF format) implemented on relational databases, in which, among other tasks, we can define standard recipes for precooked dishes.
OntoRecipe has been integrated into the general services framework called NutriGenOntology (NGO) (Fabregat et al., 2008), which was developed by the research group BioInfoGenómica. NGO is an architecture of ontologies to support research in Nutritional Genomics, with the aim of developing and validating a biomedical ontology for the formalization and integration of genomic, environmental and phenotypic data for use in Nutritional Genomics research applied to the study of cardiovascular disease and related phenotypes. NGO is currently composed of three ontologies: 1. NutriOntology: an alignment of various food composition tables (FCT) that keeps information about foods and their components (Figure 2). 2. An ontology of gene-environment interaction on intermediate phenotypes, which maintains information about genes and environmental factors and how these relate to each other. 3. An ontology for estimating cardiovascular risk according to the calibrated Framingham equation.

Fig. 2. NutriOntology: central ontology with alignments between concepts and the different FCTs containing information on foods and nutrients.

Therefore, the treatment of recipes has been largely based on the design and implementation of NGO. Since NGO already had information available about foods and their components, the initial objective was expanded to break down recipes into components (macronutrients, micronutrients, energy, water, etc.) by importing data from NutriOntology. Our aim is to describe the OntoRecipe ontology, introducing the new concept of the EpiRecipe as a virtual recipe. The next section presents the technologies and methods used in the design and development of the ontology. The section

"Development project" explains the initial requirements, the concept of the "EpiRecipe", and the problems and solutions related to the management of this concept. The section "Results" shows one of the most characteristic examples with real data. Finally, in "Conclusions" we present our conclusions and the lines along which this work will continue.

2 Methods

OntoRecipe has been developed as a Web application that uses NGO-NutriOntology, with both integrated into the same server. That server therefore imposed the same technical limitations and available resources on the OntoRecipe development as on the NGO project: the scripts had to be written in the PHP programming language, the database manager had to be MySQL Server, and all frameworks used in OntoRecipe had to be developed in PHP. Within these limitations, we chose to use XHTML and JavaScript with XAJAX for asynchronous Web user interfaces; RAP (RDF API for PHP), from pOWL, for handling ontologies; and NuSOAP to develop Web services in PHP. Moreover, Prototype has been used to facilitate the handling of the DOM (Document Object Model), and Prototype Window to generate windows within the Web browser interface. The option of implementing OntoRecipe as a pure lightweight ontology (RDF format) had a drawback: in this kind of ontology, a fixed set of properties ("slots" in frame systems) must be defined in the created classes, and users who later instantiate these classes assign a value to each slot. Since OntoRecipe needs to define variable and heterogeneous sets of properties in its classes, we decided to simulate the behavior of a lightweight ontology using relational databases, although this entails the restriction of not having an RDF file over which to infer information.
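The relational simulation of a lightweight ontology described above can be pictured as a single triple table queried through a small API. The sketch below is hypothetical: the actual OntoRecipe schema (PHP/MySQL) is not published in the paper, so Python with SQLite is used only to make the idea concrete, and the subjects and predicates are invented examples.

```python
import sqlite3

# Hypothetical sketch: an RDF-like triple table in a relational database,
# the pattern the paper describes for simulating a lightweight ontology.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE triples (subject TEXT, predicate TEXT, object TEXT)")

triples = [
    ("BreadWithTomato", "rdf:type", "EpiRecipe"),
    ("BreadWithTomato", "hasIngredient", "Bread"),
    ("BreadWithTomato", "hasIngredientFamily", "Oil"),
    ("Oil", "hasDefaultFood", "OliveOil"),
]
conn.executemany("INSERT INTO triples VALUES (?, ?, ?)", triples)

def query(subject=None, predicate=None, obj=None):
    """Simple triple-pattern match: the kind of access a programmed API exposes."""
    clauses, params = [], []
    for col, val in (("subject", subject), ("predicate", predicate), ("object", obj)):
        if val is not None:
            clauses.append(f"{col} = ?")
            params.append(val)
    where = " WHERE " + " AND ".join(clauses) if clauses else ""
    return conn.execute("SELECT * FROM triples" + where, params).fetchall()

print(query(subject="BreadWithTomato", predicate="hasIngredientFamily"))
```

Because the triples live in SQL rather than an RDF file, any "inference" has to be programmed over such pattern queries, which is exactly the limitation the next paragraph addresses.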
This limitation was overcome by implementing a query application that performs simple inferences over the simulated ontology data, so it only makes sense to connect to this ontology through the programmed access interface (API) that the implementation provides. The OntoRecipe project started from initial requirements drawn up in collaboration with the EPIGEM research group of the School of Medicine, University of Valencia, whose principal investigator is Dr. Dolores Corella.

Initial Requirements

The EPIGEM group raised a number of requirements that the OntoRecipe application necessarily had to meet in order to support the execution of 24-Hour Food Recalls in participating projects such as FITUVEROLES (Arregui et al., 2006), PREDIMED (Estruch et al., 2006), OBENUTIC (2011), etc. The main requirements proposed were as follows:

1. Generated recipes had to be organized by recipe families.
2. OntoRecipe should be able to store information about standard servings.
3. A "simple recipe" had to be generated as a compendium of food proportions.
4. OntoRecipe should allow the generation of recipes in which one or more of the ingredients are food families instead of simple foods. This makes it possible to choose, at the time of executing a 24HFR, the particular food consumed by an individual, without replicating recipes that differ only in one ingredient of the same family (e.g., coffee with skim milk or coffee with whole milk).
5. Recipes should be able to be broken down into nutrients, energy, water, etc., employing each of the food composition tables (FCT) that form the NutriOntology.
6. The management operations that the application should allow are: (1) generation of new recipes, (2) editing of existing recipes to modify their composition or name, and (3) removal of recipes, provided that the referential integrity of the data in the system's database is maintained.

The management of recipes is done in NGO, while the projects using this information are not permanently connected to the server to obtain information from NutriOntology and the recipe management service. Therefore, another requirement was to design and implement a module, "NutriGenOnto", which could be installed in all associated projects and maintain local data attached to each project server, allowing each project server to be synchronized with NGO while keeping the same API.

EpiRecipes: Definition of Recipe Sets

As the main concept in OntoRecipe, an EpiRecipe was defined as a recipe that contains one or more food families among its components. In other words, an EpiRecipe may be partly or wholly composed of food families in a certain ratio. Therefore, an EpiRecipe is the definition of a set of recipes.
For example, the recipe "bread with tomato" can be made with proportions pi of these foods: p1 of 'Bread', p2 of 'Tomato', p3 of 'Salt', and p4 of 'Extra virgin olive oil', where p1 + p2 + p3 + p4 = 100 and pi > 0. With this approach, problems can arise when, for example, not everyone uses extra virgin olive oil for this recipe. The EpiRecipe solves this problem by adding the family "Oil" with the p4 percentage: in an execution of a 24HFR, the study participant would then indicate which oil should replace the family "Oil", such as 'Olive oil'. On the other hand, one of the requirements is that EpiRecipes can be managed like any simple recipe. A problem arises, however, when EpiRecipes are instantiated in 24HFR executions: whenever we decide to change the composition of an EpiRecipe in the NGO recipe manager, the change must be propagated to all projects that make (and have made) use of it. However, because the instantiation of an EpiRecipe depends on the participant choosing a food for each EpiRecipe food family,

this process cannot be done automatically. The problem, therefore, is how to propagate the changes made in the recipe manager to EpiRecipe instances generated in the past. A method was needed so that, when new food families are added to an EpiRecipe, or when a (standard or simple) recipe is turned into an EpiRecipe, EpiRecipe instances can be generated automatically without user assistance. We solved that problem by associating a default food with every EpiRecipe food family. Thus, to propagate changes to recipes instantiated in the past from EpiRecipes that did not yet contain a given food family, the default food can be selected automatically, thereby maintaining the coherence of the participants' 24HFR responses. Moreover, this solution does not affect any of the initial requirements, so that all actions that can be performed on simple recipes can also be performed on EpiRecipes. The only drawback is that, every time we add a food family to an EpiRecipe, we must select the default food to be used when instantiation occurs automatically.
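The EpiRecipe mechanism just described can be sketched in a few lines. The class, method names and proportion values below are illustrative assumptions, not taken from the OntoRecipe code; the sketch only shows how a default food per family lets past instances be regenerated without user assistance.

```python
# Hypothetical sketch of the EpiRecipe concept: ingredients may be plain
# foods or food families, and every family carries a mandatory default food.

class EpiRecipe:
    def __init__(self, name):
        self.name = name
        # ingredient -> (proportion in %, default food; None marks a plain food)
        self.components = {}

    def add_food(self, food, proportion):
        self.components[food] = (proportion, None)

    def add_family(self, family, proportion, default_food):
        # The default food allows automatic propagation of changes to
        # instances generated before this family existed.
        self.components[family] = (proportion, default_food)

    def instantiate(self, choices=None):
        """Resolve families to concrete foods, falling back to the default."""
        choices = choices or {}
        return {
            (choices.get(item, default) if default is not None else item): prop
            for item, (prop, default) in self.components.items()
        }

bread_tomato = EpiRecipe("Bread with tomato")
bread_tomato.add_food("Bread", 55)      # proportions are made-up examples
bread_tomato.add_food("Tomato", 35)
bread_tomato.add_food("Salt", 2)
bread_tomato.add_family("Oil", 8, default_food="Olive oil")

# A participant who chose sunflower oil during the 24HFR:
print(bread_tomato.instantiate({"Oil": "Sunflower oil"}))
# A past instance regenerated after the family was added uses the default:
print(bread_tomato.instantiate())
```

The second call shows the propagation rule: when no participant choice exists for a family, the default food keeps the instantiated recipe coherent.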

3 Results

OntoRecipe can speed up the digitization of 24HFRs. The technological shift in the development of the recipe management system, using relational databases instead of ontologies, has been transparent to the user and offers all the features included in the initial requirements. The application is thus intuitive and easy to use by researchers in Nutritional Genomics (Figure 3), offering all its functionality on a single screen.

Figure 3: Recipe management system: main interface.

Moreover, managing EpiRecipes provides further advantages when running 24HFRs. One of the main advantages is the possibility of generating standard recipes. Occasionally, participants do not know the exact composition of the dishes they have consumed, so, thanks to the default food associated with the food families included in EpiRecipes, we are able to generate standard recipes to assign to EpiRecipe instances. This also makes it possible to work with 24HFRs without having to anticipate each and every recipe that may arise (Figure 4).

Figure 4: Administering a 24HFR using the recipe management system.

4 Conclusions

We have developed a recipe management system, integrated into the NutriGenOntology platform, which simulates the behaviour of a lightweight ontology using relational databases in the context of the management of 24HFRs. This system is intuitive and easy to handle and facilitates the work of researchers in Nutritional Genomics.

The recipe management system is a novel approach to the digital implementation of intake measurement instruments in Nutritional Genomics studies, since it is not limited to treating food recipes as having a rigid structure and composition, but also allows the generation of recipe instances from recipe superstructures called EpiRecipes.

5 Acknowledgements

This work was partially supported by grants COMBIOMED (RD07/0067/0006, ISCIII-FIS, Ministry of Science and Innovation), GEN E_PIA (Ministry of Science and Innovation), ACOMP/2011/145 (Conselleria de Educación-Generalitat Valenciana), OBENUTIC (BI061326, ISCIII-FIS, Ministry of Health) and CIBER "Pathophysiology of Obesity and Nutrition" (ISCIII-FIS, Ministry of Health).

6 References

Arregui M, Coltell O, Vázquez R, Fabregat A, Portolés O, Corella D. FITUVEROLES: un portal Web piloto para la determinación de fitoesteroles ingeridos en la dieta mediante cuestionarios digitalizados. Public Health Nutrition 2006; 9(7A): 255.

Brinkman MT, Kellen E, Zeegers MP, van Dongen MC, Dagnelie PC, Muls E, Buntinx F. Validation of the immidiet food frequency questionnaire in an adult Belgian population: a report from the Belgian case-control study on bladder cancer risk. Acta Clin Belg. 2011;66(1):

Corella D, Ordovas JM. Nutrigenomics in cardiovascular medicine. Circ Cardiovasc Genet. 2009;2(6):

Estruch R, Martínez-González MA, Corella D, Salas-Salvadó J, Ruiz-Gutiérrez V, Covas MI, Fiol M, Gómez-Gracia E, López-Sabater MC, Vinyoles E, Arós F, Conde M, Lahoz C, Lapetra J, Sáez G, Ros E; PREDIMED Study Investigators. Effects of a Mediterranean-style diet on cardiovascular risk factors: a randomized trial. Ann Intern Med. 2006;145(1):1-11.

Fabregat A, Arregui M, Barrera E, Portolés O, Corella D, Coltell O. NutriGeneOntology: A Biomedical Ontology for Nutrigenomics. International Conference on Biomedical Engineering and Informatics (BMEI), IEEE Conference #13886, May 2008, Sanya, China. Proceedings code PR3118. (In press).
Freedman MR, Keast DR. White potatoes, including french fries, contribute shortfall nutrients to children's and adolescents' diets. Nutr Res. 2011;31(4):

Huang YC, Lee MS, Pan WH, Wahlqvist ML. Validation of a simplified food frequency questionnaire as used in the Nutrition and Health Survey in Taiwan (NAHSIT) for the elderly. Asia Pac J Clin Nutr. 2011;20(1):

Jaime PC, Bandoni DH, Duran AC, Fisberg RM. Diet quality index adjusted for energy requirements in adults. Cad Saude Publica. 2010;26(11):

Li YP, He YN, Zhai FY, Yang XG, Hu XQ, Zhao WH, Ma GS. Comparison of assessment of food intakes by using 3 dietary survey methods. Zhonghua Yu Fang Yi Xue Za Zhi. 2006;40(4):

Martin-Prevel Y, Becquey E, Arimond M. Food group diversity indicators derived from qualitative list-based questionnaire misreported some foods compared to same indicators derived from quantitative 24-hour recall in urban Burkina Faso. J Nutr. 2010;140(11):2086S-93S.

Paxton A, Baxter SD, Fleming P, Ammerman A. Validation of the school lunch recall questionnaire to capture school lunch intake of third- to fifth-grade students. J Am Diet Assoc. 2011;111(3):

Pinheiro AC, Atalah E. Proposal of a method to assess global quality of diet. Rev Med Chil. 2005;133(2):

Project OBENUTIC (2011):

Serra Majem L, Ribas Barba L, Pérez Rodrigo C, Roman Viñas B, Aranceta Bartrina J. Dietary habits and food consumption in Spanish children and adolescents ( ): socioeconomic and demographic factors. Med Clin (Barc). 2003;121(4):

Takahashi MM, de Oliveira EP, Moreto F, Portero-McLellan KC, Burini RC. Association of dyslipidemia with intakes of fruit and vegetables and the body fat content of adults clinically selected for a lifestyle modification program. Arch Latinoam Nutr. 2010;60(2):

Thakwalakwa CM, Kuusipalo HM, Maleta KM, Phuka JC, Ashorn P, Cheung YB. The validity of a structured interactive 24-hour recall in estimating energy and nutrient intakes in 15-month-old rural Malawian children. Matern Child Nutr. 2011; doi: /j x.

USNCI (2009): The Automated Self-administered 24-hour Dietary Recall (ASA24) of the US National Cancer Institute:

INBIOMEDvision: Bridging gaps between Bioinformatics and Medical Informatics

The INBIOMEDvision Consortium 1

Abstract. INBIOMEDvision aims to become a European-wide initiative intended to monitor the evolution of the Biomedical Informatics field and to address its scientific challenges by means of collaborative efforts performed by a broad group of experts with complementary perspectives on the field. These efforts will contribute to the strength and expansion of the Biomedical Informatics scientific community, particularly in Europe. INBIOMEDvision will develop a series of services and activities to serve these purposes: an inventory of resources and initiatives, state-of-the-art reviews, prospective analyses, community-building actions, and dissemination and training activities.

Keywords: Biomedical informatics, translational bioinformatics, virtual physiological human, personalised medicine, scientific monitoring, state-of-the-art analysis, scientific prospective analyses.

The INBIOMEDvision project is funded by the European Commission under the Seventh Framework Programme (FP7/ ) and has as its main goal the promotion of Biomedical Informatics in Europe by means of the permanent monitoring of the scientific state of the art and of the existing activities in the field. The Project Consortium is coordinated by Dr. Ferran Sanz of the Research Programme on Biomedical Informatics at IMIM-University Pompeu Fabra (UPF); the project manager is the Fundació IMIM (FIMIM); and the other partners are the Center for Biological Sequence Analysis (CBS) at the Technical University of Denmark, the Medical Informatics Department (EMC) at Erasmus University Medical Center, the Biomedical Informatics Group (GIB) at the UPM School of Computer Science, the Center for Computational Science at University College London (UCL) and the Bioinformatics Unit of the Instituto de Salud Carlos III (ISCIII).
Biomedical Informatics deals with the integrative management and synergistic exploitation of the wide-ranging and inter-related scope of information that is generated and needed in healthcare settings, biomedical research institutions and the health-related industry. From both the scientific and the applied points of view, the Biomedical Informatics concept faces several challenges, such as:

1 Corresponding Author: Ferran Sanz. Research Programme on Biomedical Informatics. IMIM. Universitat Pompeu Fabra. Parc de Recerca Biomèdica de Barcelona (PRBB). Carrer Dr. Aiguader 88, E Barcelona (Spain);

The synergistic integration between the computational methods and technologies used in life sciences research (Bioinformatics) and the computer science methods and applications supporting healthcare and clinical research (Medical Informatics). This integration requires a more intense interaction between the Bioinformatics and Medical Informatics scientific communities.

The development of effective translational knowledge management approaches that facilitate a better and quicker application of (a) knowledge resulting from basic biomedical research in disease prevention, diagnosis and treatment, and (b) experience accumulated in clinical practice in biomedical research. This bidirectional information translation includes the extension of the electronic healthcare record (EHR) concept in order to incorporate and exploit new information types, e.g. from the omics and nano technologies, as well as a greater and automated incorporation of phenotypic information generated in healthcare settings into omics and molecular research.

The integration and exploitation of heterogeneous information stored in widespread repositories and diverse formats, which requires further progress in systems interoperability, as well as the development of more effective techniques for knowledge extraction, especially from documents in free-text and multilingual format. This implies a focus on all aspects related to the development, adoption and dissemination of appropriate standards and ontologies. Biomedical ontologies provide essential domain knowledge to drive data integration, information retrieval, data annotation, natural-language processing and decision support.

The development of innovative methods for the simulation and modelling of complex biological phenomena, as well as the corresponding computational applications, able to operate on a wide range of data types and across diverse length and time scales.
These computational methods and tools have to show reliable predictive capabilities that make them useful for biomedical scientists and healthcare professionals.

The inter-disciplinary domain between neuroscience and informatics (Neuroinformatics), since a critical challenge in neuroscience is organizing, managing and accessing the explosion in neuroscientific knowledge, particularly anatomical and neurophysiological knowledge.

A strong and active involvement of industrial actors, from both the IT and biomedical perspectives, and from both large international companies and small and medium-sized enterprises.

The dissemination of the concepts and challenges related to Biomedical Informatics, as well as of the scientific advances in the field, not only among the scientific community but also among the relevant stakeholders and the general population.

The need for training activities that fill the gaps that traditional disciplines and academic degrees show in facing the multi-disciplinary scientific challenges of Biomedical Informatics.

The operational objectives of INBIOMEDvision are:
1. To compile the existing knowledge on genotype and phenotype data resources and to provide an overview of the state-of-the-art methods and models that connect biological systems described at the molecular level with clinical physiopathology.
2. To consolidate a Biomedical Informatics community of researchers by bringing together and promoting the interaction between scientists from a wide range of related fields.
3. To develop and facilitate training activities able to engender new generations of scientists and professionals with the Biomedical Informatics perspective, as well as the skills to use the computational methods and tools of this field.
4. To widely disseminate Biomedical Informatics knowledge and resources.
5. To devise sustainability measures that ensure the long-term maintenance of the INBIOMEDvision activities and services after the termination of the EU-funded project.

Introduction to the analysis of biomedical data with machine learning techniques

José A. Seoane, Carlos Fernández-Lozano and Julián Dorado

Departamento de Tecnologías de la Información y las Comunicaciones. Universidade da Coruña

Abstract. Improvements in technology generate large amounts of biomedical data from different organisms and at different levels, from the genomic or proteomic level to the epidemiological level. Analysing this kind of data requires a paradigm shift, in which machine learning techniques are essential for analysing all these new sources of information. This tutorial aims to introduce students to basic machine learning techniques, using examples from biomedical data.

Keywords: machine learning

1 Introduction

The concept of machine learning can be confusing. Intuitively, it consists of getting a machine to learn, but learning is a complicated notion. Definitions of learning usually involve concepts such as gaining knowledge from study or experience, becoming aware of information or observations, storing things in memory, and so on. Some of these concepts are hard to attribute to a machine. There is another definition, "something learns when it changes its behaviour in such a way that it performs better in the future", which might fit a machine better, but it is still ambiguous. Learning implies thought and intentionality, characteristics that a machine does not have. Colloquially, we call learning without thought or intentionality "training": in learning, the intention lies with the learner, whereas in training it lies with the trainer. The algorithms presented in this tutorial consist of training machines to solve problems.
Data mining can be understood as the practical application of this machine training. Data mining consists of analysing the data held in databases with the goal of finding patterns or models. From these patterns or models it is possible to obtain predictions for new data. There is an extensive literature on data mining and machine learning, as well as several major artificial intelligence journals devoted exclusively to this topic. To go deeper into this subject,

we recommend the book by Tan et al. [1], which is the reference text on data mining, together with that of Hand and Mannila [2]. In Spanish, the book by Hernández et al. [3] is a good starting point and, finally, with a more practical orientation, there is the book by the creators of Weka, Witten and Frank [4].

Description of the data set

Before explaining the types of machine learning algorithms covered in this tutorial, we describe how the data are presented to the machine so that it can learn. The training process normally consists of presenting a series of examples or instances to a model, which adjusts its parameters so that it fits the presented examples as closely as possible. These examples are generally presented in the form of a table, in which each example forms a row and the values of each of its attributes form the columns.

Fig. 1: Table describing a set of individuals or examples (rows) and the attributes that describe them (columns)

Each example is made up of a series of values that describe it. Each such set of values is called a variable or attribute. Attributes can have different types (numeric, categorical, decimal, logical, etc.). Variables can be independent or input variables, which describe the instances and are the values from which the prediction is to be made. On the other hand, in classification problems, discussed later, there are also dependent or output variables, which are the values the model is able to predict once it has been correctly trained.
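The row-per-example, column-per-attribute table described above corresponds directly to Weka's ARFF input format. As a concrete illustration (the attribute names and values below are invented, not taken from the tutorial), the following sketch builds a tiny data set in Python and serializes it as ARFF text:

```python
# Illustrative example (invented attributes): each example is a row, each
# attribute a column; the last attribute is the class (output) variable.
examples = [
    {"age": 63, "smoker": "yes", "class": "sick"},
    {"age": 41, "smoker": "no", "class": "healthy"},
]

def to_arff(relation, attributes, rows):
    """Serialize the table in Weka's ARFF format: a header declaring each
    attribute and its type, followed by the data rows."""
    lines = [f"@relation {relation}"]
    for name, kind in attributes:
        lines.append(f"@attribute {name} {kind}")
    lines.append("@data")
    for row in rows:
        lines.append(",".join(str(row[name]) for name, _ in attributes))
    return "\n".join(lines)

attributes = [("age", "numeric"), ("smoker", "{yes,no}"), ("class", "{healthy,sick}")]
print(to_arff("patients", attributes, examples))
```

A file produced this way can be loaded directly into the Weka Explorer described in the next section.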

Types of data mining algorithms

Machine learning algorithms can be divided into two groups. First, there are those that allow values to be predicted from examples: this group is called predictive algorithms, and each example comes with its output. Within the predictive algorithms the literature distinguishes several categories, although all of them can be grouped into two broad ones: if the variable to be predicted (the dependent or output variable) is categorical, they are called classification algorithms, whereas if the output variable is continuous, they are called regression algorithms. Second, there are descriptive algorithms, in which the data are presented without any output variable and the goal of the algorithm is to describe the data. In this tutorial we will see the following descriptive algorithms: clustering algorithms, whose goal is to obtain groups such that the examples assigned to each group are similar, and association rule algorithms, whose goal is to obtain rules that define dependencies between the input attributes.

2 Weka

Throughout this tutorial we will use a system that makes it possible to try out different kinds of machine learning algorithms quickly and easily, without any programming knowledge. Weka was chosen because it is a suite that brings together a large number of machine learning algorithms, along with tools to preprocess and visualise data and to obtain and analyse the results of these algorithms. Weka is developed by the University of Waikato in New Zealand; development began in 1993, and both the environment and the machine learning algorithms have been updated ever since. Weka stands for Waikato Environment for Knowledge Analysis.
The system is developed in Java under a GNU/GPL licence, so anyone can inspect the code, modify it and develop new algorithms for the system. To install Weka, go to the following address ( ) and download the latest stable version that includes Java VM 1.5 (currently weka-3-6-2jre.exe for a Windows operating system). To check whether Java is already installed on the machine, java can be invoked from the command console (Start > Run > cmd). If the system reports that it cannot find java, it must be installed; otherwise, untick the "install JRE" box during the installation process. The Weka environment is divided into three different interfaces. First is the Explorer, which makes it easy to import data from a database, run the learning algorithms and visualise the results.

Fig. 2: Main window of the Explorer

The main Explorer window contains a series of tabs for preprocessing the data, classifying, clustering, searching for association rules, selecting relevant attributes and visualizing the data. The Experimenter makes it possible to design experiments that try out a large number of methods on several datasets and to apply statistical tests to compare the results.

Fig. 3: Experimenter configuration window

Fig. 4: Experimenter results window

Finally, the Knowledge Flow interface lets you design workflows graphically in order to automate data-analysis processes.

3 Preprocessing

It is often necessary to preprocess the data in order to improve the subsequent analysis. Preprocessing methods can be divided into several types depending on the criterion used. According to whether or not they use the output variable, they can be classified as supervised or unsupervised. The first group takes the output variable into account, so this kind of preprocessing can affect the reliability of later analyses; examples include some kinds of feature selection and some kinds of sampling. The second group ignores the output variable; data normalization is one example. Preprocessing can also be divided according to whether it operates on attributes (columns), such as normalizing a column, or on examples (rows), such as adding random examples.

Fig. 5: Types of preprocessing filters in Weka

Among the most widely used preprocessing algorithms are the following:

Aggregation: creates a new attribute from existing attributes. In Weka this is filter:unsupervised:attribute:addexpression, where the new attribute is built as an expression over the existing ones.
Discretization: transforms a continuous variable into a discrete one. In Weka this is filter:unsupervised:attribute:discretize, although a supervised variant also exists.
Dimensionality reduction: reduces the number of input attributes by combining the existing variables into new ones. This can be done with filter:supervised:attribute:attributeselection using PrincipalComponents.
Feature selection: selects the most significant of the input attributes. In Weka this is filter:supervised:attribute:attributeselection.

4 Classification

Classification tasks consist of assigning examples or instances to a category or class.

Fig. 6: Classification scheme

The learning process consists of presenting a series of examples, both the input data and the expected output for each example, to a specific learning algorithm. In this way, by induction, the classification model not only learns the examples but, if training is done correctly, can also generalize to new examples. This phase is called training. Once the model is trained, it is validated by presenting new examples that were not used for training and whose outputs are known, and comparing the output produced by the classifier with the expected output for each example.
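The training/validation protocol just described can be sketched in a few lines of Python. This is an illustrative sketch, not Weka code: the names `train_majority` and `evaluate` are ours, and the "model" is the trivial majority-class baseline, chosen only to keep the example short.

```python
def train_majority(train):
    """A trivial learning algorithm: always predict the most frequent class
    of the training set (a useful baseline, not a real model)."""
    labels = [y for _, y in train]
    majority = max(set(labels), key=labels.count)
    return lambda x: majority

def evaluate(model, test):
    """Validation: compare the classifier's output with the expected output
    on examples never seen during training."""
    hits = sum(model(x) == y for x, y in test)
    return hits / len(test)

# Disjoint training and test sets, as in the protocol above.
train = [((1,), "sick"), ((2,), "sick"), ((3,), "healthy")]
test = [((4,), "sick"), ((5,), "healthy")]
accuracy = evaluate(train_majority(train), test)
```

Any real classifier from the sections below can be dropped in place of the baseline without changing the evaluation step.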

Fig. 7: Training/test scheme of the classification process

Thus for each test example we obtain a pair (predicted output, observed output). Each algorithm is evaluated by comparing these outputs by means of the confusion matrix. The confusion matrix records the number of correctly and incorrectly classified examples according to the class they are assigned to. We thus obtain:

True positives (TP): correct assignment to the positive class
False positives (FP): incorrect assignment to the positive class
True negatives (TN): correct assignment to the negative class
False negatives (FN): incorrect assignment to the negative class

                        Predicted value
                        Positive          Negative
Actual    Positive      True positive     False negative
value     Negative      False positive    True negative

Fig. 8: Confusion matrix

From these counts several metrics are derived to evaluate the classifier. Notable among them is the accuracy:

  accuracy = (TP + TN) / total examples

together with the error rate, which is its complement.
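The two formulas can be checked with a small helper; `classification_metrics` is a name of ours, a minimal sketch rather than any library's API.

```python
def classification_metrics(tp, fp, tn, fn):
    """Accuracy and error rate from the four confusion-matrix counts."""
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total          # correct assignments over all examples
    return {"accuracy": accuracy, "error_rate": 1 - accuracy}
```

For instance, 40 TP, 10 FP, 45 TN and 5 FN give an accuracy of 0.85 and an error rate of 0.15.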

The error rate is therefore

  error rate = (FP + FN) / total examples

Another widely used measure is the area under the ROC (receiver operating characteristic) curve, used to compare classifiers whose output is not categorical. The classification models used in this tutorial are described next.

Linear regression

A model is called a regression model when the input and output variables are continuous. Regression models predict an output value from the set of input values. In linear regression the output is a linear function of the input attribute or attributes (simple or multiple linear regression). Some non-linear models become linear after a transformation (e.g. the double-log, exponential or logistic models). The error measures used for regression models are usually the regression coefficient or the mean and mean squared errors.

Logistic regression

Logistic regression is a linear prediction model for categorical response variables. It models the output as the probability of being assigned to one of the categories:

  y = 1 with probability p,   y = 0 with probability 1 - p

The response variable y follows a binomial distribution, where p is the probability of responding 1 given the input data. The prediction function must transform the linear predictor into the interval [0, 1], which is done with the logistic function:

  p = 1 / (1 + e^(-beta'x)),   where beta'x = beta0 + beta1 x1 + ... + betap xp

so that log(p / (1 - p)) equals the classical linear predictor.
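The logistic transformation and its inverse can be verified numerically; `logistic` and `logit` below are illustrative names for the functions just defined, not part of any particular library.

```python
import math

def logistic(z):
    """Map the linear predictor z = b0 + b1*x1 + ... into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def logit(p):
    """Inverse mapping: the log-odds log(p / (1 - p)), i.e. the linear predictor."""
    return math.log(p / (1.0 - p))
```

A linear predictor of 0 corresponds to p = 0.5, large negative values push p toward 0, and applying `logit` after `logistic` recovers the original predictor.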

Bayesian methods

Bayesian methods are based on probability theory. They are both descriptive and predictive, since they uncover relationships among the attributes and, at the same time, can be used as classifiers. If the classification problem is formalized statistically, X (the set of input attributes) and Y (the output) can be treated as random variables and their probabilistic relationship expressed as the conditional probability P(Y|X), given by Bayes' theorem:

  P(Y|X) = P(X|Y) P(Y) / P(X)

During training, the conditional probabilities P(Y|X) are obtained for each combination of X and Y from the training set. Once these values are available, a pattern X is classified by finding the class Y that maximizes the probability P(Y|X).

Naive Bayes

This method is a simple probabilistic classifier based on applying Bayes' theorem under a strong (naive) assumption of independence among the input attributes. It is implemented using estimator classes, and the precision values of the numeric estimators are chosen based on an analysis of the training data.

Bayesian networks

Bayesian networks are a graphical representation of dependencies for probabilistic reasoning in which the nodes represent random variables and the arcs represent direct dependence relations between the variables. A Bayesian classifier can be seen as a special case of a Bayesian network in which there is one special variable, the class, and the remaining variables are the attributes.

Decision trees

Together with decision rules, decision trees belong to the so-called comprehensible and propositional methods: comprehensible because they can be expressed in an intelligible form (a tree, in this case), and propositional because they learn models of the attribute-value kind.
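Before moving on to trees, the naive Bayes scheme described above can be sketched for categorical attributes. The name `train_naive_bayes` and the simple add-one (Laplace-style) smoothing are our choices for the sketch; Weka's NaiveBayes implementation estimates the probabilities differently.

```python
from collections import Counter, defaultdict

def train_naive_bayes(examples):
    """examples: list of (attribute_tuple, class_label) pairs.
    Estimates P(Y) and, per class, the counts of each value of each attribute,
    then classifies by maximizing P(Y) * product_i P(x_i | Y)."""
    class_counts = Counter(y for _, y in examples)
    cond = defaultdict(Counter)   # cond[(i, y)][v] = times attribute i took value v in class y
    for xs, y in examples:
        for i, v in enumerate(xs):
            cond[(i, y)][v] += 1
    n = len(examples)

    def predict(xs):
        best, best_p = None, -1.0
        for y, cy in class_counts.items():
            p = cy / n                                   # prior P(Y)
            for i, v in enumerate(xs):                   # naive independence assumption
                counts = cond[(i, y)]
                p *= (counts[v] + 1) / (cy + len(counts) + 1)   # add-one smoothing
            if p > best_p:
                best, best_p = y, p
        return best
    return predict

# Toy weather-style data: (outlook, temperature) -> play?
weather = [(("sunny", "hot"), "no"), (("sunny", "mild"), "no"),
           (("rain", "mild"), "yes"), (("rain", "cool"), "yes")]
predict = train_naive_bayes(weather)
```

The smoothing keeps an unseen attribute value from zeroing out the whole product of probabilities.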

A decision tree is a set of conditions organized hierarchically, so that the final decision is determined by following the conditions that hold from the root of the tree down to one of its leaves.

Fig. 9: Decision tree in Weka

The overfitting problem

When building a decision tree it is important to know at what depth to stop, because growing it too deep overtrains the model. A model can be said to be overtrained when the training error is low while the test error is high: it has memorized the examples but has no capacity to generalize.

Fig. 10: Error rate of a classification tree on the training and test sets as a function of the number of nodes
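The symptom just described, low training error together with high test error, can be made concrete with an extreme case: a "model" that is nothing but a lookup table of the training set. The names below are ours; this is a deliberately pathological sketch, not a real learner.

```python
def memorizing_classifier(train, default):
    """An extremely over-fitted model: a pure lookup table of the training set.
    Anything not memorized falls back to a fixed default class."""
    table = dict(train)
    return lambda x: table.get(x, default)

def error_rate(model, examples):
    """Fraction of examples the model misclassifies."""
    return sum(model(x) != y for x, y in examples) / len(examples)

train = [((0, 0), "a"), ((0, 1), "b"), ((1, 0), "b"), ((1, 1), "b")]
test = [((2, 2), "a"), ((3, 3), "a")]
model = memorizing_classifier(train, default="b")
```

The training error is exactly zero, yet the model fails on every test example: it has learned the examples, not the concept. Pruning a decision tree is precisely a defence against this behaviour.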

Fig. 11: Classification trees of different depths

Classification rules

A rule-based classifier is a technique that classifies examples using a set of if-then rules. The rule set is joined by the logical OR operator, so the complete model can be represented as

  M = (Rule 1 OR Rule 2 OR ... OR Rule N)

Each rule consists of a set of conditions joined by the logical AND operator, plus a consequent:

  R1 = if condition 1 AND condition 2 AND ... AND condition M then consequent A

Each condition is a triple (attribute, operator, value), where the operator can be =, <>, <, >, <= or >=. The consequent assigns the example that satisfies the rule to the corresponding class.
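The rule structure above translates almost literally into code. This is a sketch under our own naming (`make_rule`, `classify`) with a made-up two-attribute example (temperature, cough); it is not how any particular rule learner represents its model internally.

```python
def make_rule(conditions, label):
    """A rule: (attribute_index, operator, value) triples joined by AND,
    plus the class assigned when every condition holds."""
    ops = {"=":  lambda a, b: a == b, "<>": lambda a, b: a != b,
           "<":  lambda a, b: a < b,  ">":  lambda a, b: a > b,
           "<=": lambda a, b: a <= b, ">=": lambda a, b: a >= b}
    def fires(example):
        return all(ops[op](example[i], v) for i, op, v in conditions)
    return fires, label

def classify(rules, example, default="unknown"):
    """Rule 1 OR Rule 2 OR ...: the first rule that fires assigns the class."""
    for fires, label in rules:
        if fires(example):
            return label
    return default

# Example: (temperature, cough) -> diagnosis.
rules = [make_rule([(0, ">", 37.5), (1, "=", "yes")], "flu"),
         make_rule([(0, "<=", 37.5)], "healthy")]
```

An example covered by no rule falls through to the default class, which is how ordered rule lists usually end.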

Instance-based models

So far, learning has consisted of a training phase, in which the model learned the training examples, and a test phase, in which the model was validated. There are other kinds of learning, such as instance-based models, where learning amounts to memorizing the training set and classification is then performed directly against it for each test example. This kind of learning is called lazy learning. In this section we study instance-based models, specifically k-nearest neighbours.

In the k-nearest-neighbours method, each example is represented as a point in an n-dimensional space (as many dimensions as the examples have attributes). For each test example, the distances from that point to all the training examples are computed, and the k training examples closest to the test point are selected. The test example is then assigned the majority class among its k nearest neighbours.

This kind of learning does not need to build a model, but it can be computationally expensive, since all the distances must be computed for every test example. Its main drawback is that, because classification relies only on local information, these models are very sensitive to noise.

Artificial neural networks

Artificial neural networks are computational models inspired by biological neurons. As in the neurons of the nervous system, an artificial neural network is composed of a set of interconnected neurons that, starting from a stimulus (input), generate an output. The more a connection between neurons is used, the more that connection is reinforced. The simple neural network model shown below consists of one neuron with three inputs. Each input is linked to the neuron by a connection, and each connection has an associated weight w. This weight emulates the synaptic connection between neurons.

Fig. 12: Example of an artificial neural network (input nodes X1, X2, X3; output node Y; threshold t = 0.4)

The output is computed as the weighted sum of the inputs minus the bias; an activation function is then applied to this sum. The learning process of an artificial neural network consists of modifying the weights so that the network's output resembles the expected output as closely as possible. Initially the weights are set at random. Training consists of presenting the training examples to the network several times: the error (obtained output minus expected output) is computed and the weights are modified to try to minimize it. This process is repeated for a number of iterations called training cycles. The learning-rate parameter controls the convergence speed of training; the higher the learning rate, the more unstable the training becomes. Momentum is a mechanism that speeds up convergence when the current weight change agrees with the direction of the previous changes and slows it down otherwise.

A multilayer artificial neural network gives the network greater complexity and power for solving classification problems. This kind of network consists of an input layer, an output layer and one or more hidden layers; it is these hidden layers that allow the network to model more complex relationships between the inputs and the output. There is no established rule for choosing the number of neurons in the hidden layer. Normally a large number is chosen for the first models and it is then progressively reduced as long as the classification error does not increase.

Support vector machines

Support Vector Machines (SVM) are linear classifiers: they induce linear separators, or hyperplanes, in feature spaces of very high dimensionality. The problem lies in finding the hyperplane with the lowest error rate. In the figure there are infinitely many hyperplanes that achieve a zero error rate.

Fig. 13: Examples of separating hyperplanes

The best hyperplane is the one that maximizes the margin before a classification error occurs.

Fig. 14: Maximum-margin hyperplane

If the problem is not linearly separable, the data are transformed to increase their dimensionality. A kernel function is used to compute the scalar product of two vectors in the feature space. Several standard kernels exist (polynomial, Gaussian, sigmoid, perceptron, etc.), and problem-specific kernels can be designed (for trees, graphs, etc.).
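Stepping back to the single-neuron model of the neural-network section: the weight-update cycle described there can be sketched as a perceptron-style training loop. The function name, the step activation, and the choice of the AND function as target are ours, picked because AND is linearly separable and one neuron suffices.

```python
def train_neuron(examples, lr=0.1, epochs=20):
    """Single neuron: output = step(sum(w_i * x_i) - bias).
    Weights start at zero and are nudged toward the expected output each cycle."""
    n = len(examples[0][0])
    w = [0.0] * n
    bias = 0.0
    for _ in range(epochs):                    # training cycles
        for xs, target in examples:
            out = 1 if sum(wi * xi for wi, xi in zip(w, xs)) - bias > 0 else 0
            err = target - out                 # expected minus obtained output
            w = [wi + lr * err * xi for wi, xi in zip(w, xs)]
            bias -= lr * err
    def predict(xs):
        return 1 if sum(wi * xi for wi, xi in zip(w, xs)) - bias > 0 else 0
    return predict

# The logical AND function is linearly separable, so one neuron can learn it.
and_gate = train_neuron([((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)])
```

A problem that is not linearly separable (XOR, say) cannot be learned by this single neuron; that is exactly where the hidden layers, or the kernel trick of the SVM section, come in.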

Clustering

Cluster analysis consists of dividing the individuals into groups (clusters) whose elements are similar to one another. Clustering can be hierarchical, where each cluster may contain subclusters, or non-hierarchical.

Fig. 15: Clustering and hierarchical clustering

K-means

The k-means algorithm is the basic clustering algorithm; it groups similar elements into the same cluster. It works as follows. First, k centroids are selected at random. Then each example is assigned to a centroid, depending on its proximity to it. The centroid of each cluster is recomputed as the mean of the examples assigned to that cluster. Finally, all the steps are repeated until the centroids no longer change. This yields as many groups as centroids were selected, and each centroid acts as the representative of its cluster.

Fig. 16: Example of k-means

Proximity measures

As seen in the previous section, the criterion used to decide whether an example is closer to one centroid or another is very important. The following distances are normally used:

Manhattan: sum of the absolute differences of the attributes.
Euclidean: square root of the sum of the squared differences.
Cosine: measures cohesion, from the angle between the attribute vectors.
Chebyshev: the largest difference along any single dimension.
Others: Bregman divergences, Mahalanobis distance, etc.

Hierarchical clustering

Hierarchical clustering produces a set of nested clusters organized hierarchically, which can be visualized as a dendrogram. It can be carried out as agglomerative clustering, in which the tree is built from the leaves up to the root (initially each example is its own cluster, and clusters are progressively merged), or as divisive clustering, in which one starts at the root and keeps splitting until the leaves are reached.

Fig. 17: Example of hierarchical clustering

DBSCAN

DBSCAN is a density-based clustering algorithm, where density means the number of examples within a given radius. It locates regions of high density and separates their individuals from regions of low density. To do so, it classifies each individual as a core point, if it lies in the interior of a density-based cluster; a border point, if it is not close enough to be a core point but lies on the boundary of a cluster; or a noise point, if it is neither a core point nor a border point.
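The k-means procedure with the Euclidean distance can be sketched in a few lines. This is our own minimal version: the seed centroids are passed in explicitly instead of being chosen at random, purely to keep the example deterministic.

```python
def kmeans(points, centroids, iterations=10):
    """Basic k-means: assign each point to its nearest centroid, recompute
    each centroid as the mean of its members, repeat until stable."""
    def dist2(a, b):                           # squared Euclidean distance
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    clusters = []
    for _ in range(iterations):
        clusters = [[] for _ in centroids]
        for p in points:                       # assignment step
            nearest = min(range(len(centroids)), key=lambda i: dist2(p, centroids[i]))
            clusters[nearest].append(p)
        new = []
        for i, members in enumerate(clusters): # update step: centroid = mean
            if not members:
                new.append(centroids[i])
                continue
            new.append(tuple(sum(c) / len(members) for c in zip(*members)))
        if new == centroids:                   # stop when centroids no longer move
            break
        centroids = new
    return centroids, clusters

# Two obvious clouds of points and two seed centroids.
points = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8)]
centroids, clusters = kmeans(points, [(0, 0), (10, 10)])
```

Swapping `dist2` for a Manhattan or Chebyshev distance changes the cluster shapes the algorithm favours without touching the rest of the loop.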

Fig. 18: DBSCAN clustering

Association rules

Classification rules, seen earlier, associate a set of attributes with a class. Association rules instead find relationships among attributes, making it possible to extract patterns from large amounts of data. As with classification rules, an association rule is a probabilistic proposition about the occurrence of certain states; the difference is that on the right-hand side (the consequent), any attribute may appear rather than a class. Association rules imply co-occurrence, not causality. The classic example of association rules is the shopping basket, where a supermarket studies which items are bought together in order to place them more efficiently.

  Transaction  Items
  1            Bread, Milk
  2            Bread, Diapers, Beer, Eggs
  3            Milk, Diapers, Beer, Cola
  4            Bread, Milk, Diapers, Beer
  5            Bread, Milk, Diapers, Cola

Fig. 19: Example of association rules on a shopping list
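Using the shopping-basket transactions of Fig. 19, the key quantities behind association-rule mining, and an Apriori-style level-wise search over them, can be sketched as follows. The function names are ours and the search is a naive sketch, not an efficient implementation.

```python
transactions = [{"bread", "milk"},
                {"bread", "diapers", "beer", "eggs"},
                {"milk", "diapers", "beer", "cola"},
                {"bread", "milk", "diapers", "beer"},
                {"bread", "milk", "diapers", "cola"}]

def support(itemset):
    """Fraction of transactions containing every item of the set (coverage)."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """How often the rule holds when it applies: supp(A and C) / supp(A)."""
    return support(antecedent | consequent) / support(antecedent)

def frequent_itemsets(min_support):
    """Apriori level-wise search: keep single items with enough support,
    then grow candidate sets one item at a time while support holds."""
    items = sorted({i for t in transactions for i in t})
    frequent, size = [], 1
    current = [frozenset([i]) for i in items if support(frozenset([i])) >= min_support]
    while current:
        frequent.extend(current)
        size += 1
        candidates = {a | b for a in current for b in current if len(a | b) == size}
        current = [c for c in candidates if support(c) >= min_support]
    return frequent
```

On these baskets, {diapers} appears in 4 of 5 transactions and {diapers, beer} in 3, so the rule "diapers then beer" has coverage 0.6 and confidence 0.75, a co-occurrence pattern, not a causal claim.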

The quality measures for rules are the coverage, the number of instances that the rule predicts correctly, and the confidence, the percentage of times the rule holds when it is applicable.

The Apriori algorithm

Association-rule mining algorithms are based on measuring the frequency of an event. The Apriori algorithm searches for the itemsets that reach a given coverage. In a first step, single-item sets with the minimum coverage are built. From these, two-item sets exceeding the minimum coverage are built, and so on, until no further sets exceed the minimum coverage. Once the itemsets have been selected, the rules are extracted, keeping those with the highest confidence.

References

1. Tan P., Steinbach M. and Kumar V.: Introduction to Data Mining. Pearson, Addison-Wesley (2006)
2. Hand D., Mannila H. and Smyth P.: Principles of Data Mining. The MIT Press (2001)
3. Hernández J., Ramírez M.J. and Ferri C.: Introducción a la Minería de Datos. Pearson, Prentice Hall (2004)
4. Witten I.H. and Frank E.: Data Mining: Practical Machine Learning Tools and Techniques. Elsevier, Morgan Kaufmann (2005)

Biomedical Image Processing Methods for Diagnostic Support

Gloria Bueno 1, Oscar Déniz 1
1 VISILAB, E.T.S. Ingenieros Industriales, Universidad de Castilla-La Mancha, Avda. Camilo José Cela 3, Ciudad Real, Spain

Abstract. This tutorial develops several image-analysis methods in Matlab oriented toward medical image processing. The methods include preprocessing tools for image enhancement, tools for detecting regions of interest (ROIs), tools for characterizing those ROIs and, finally, tools for classifying them.

Keywords: Medical image processing, Matlab, pathological anatomy, oncology.

1 Introduction

The Matlab package is one of the most powerful tools in science and technology. It includes a numerical computing environment and a programming language, which makes it a highly extensible package. In 2004 it officially had one million users across disciplines such as engineering, science and economics. In education it has become especially popular for teaching algebra and numerical analysis. Its use has also spread in image processing, mainly because of the ease of use and interactivity its environment offers. Matlab includes an image processing toolbox for acquiring images and video, processing and visualizing them, and extracting useful information. This toolbox is very powerful and has already been used in real installations for epilepsy diagnosis, the study of gastrointestinal disorders with capsule microcameras, underwater imaging, and more.

1.1 Theoretical introduction

This tutorial develops several image-analysis methods in Matlab oriented toward medical image processing. The methods cover:

Basic functions for working with images
Preprocessing tools for image enhancement

Tools for capturing images from cameras
Tools for detecting regions of interest (ROIs)
Tools for characterizing those ROIs

1.2 Practical cases

The practical part of the tutorial requires no special hardware. As indicated, the tutorial will mainly use practical examples of medical images of various modalities, such as ultrasound, biopsies, CT (computed tomography), MR (magnetic resonance), etc.

Fig. 1. Examples of medical images

The theoretical methods and tools will be applied to two practical cases:
- Histology: detection of ROIs in pathological-anatomy images.
- Oncology: detection of ROIs in oncology, breast and prostate cancer.

Fig. 2. Detection of mesencephalic nuclei in autopsies at 40x (a: original image; b: segmented image)
Fig. 3. Detection of ROIs in a pelvic CT scan, with application in dosimetry.
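The tutorial itself works in Matlab's Image Processing Toolbox. As a toolbox-free, language-neutral sketch of the very first ROI-detection step (binary thresholding of a grey-level image followed by a bounding box around the segmented region), here is a pure-Python version; the function names and the toy 5x5 image are ours.

```python
def threshold(image, level):
    """Binary segmentation: 1 where the pixel value exceeds `level`, else 0."""
    return [[1 if px > level else 0 for px in row] for row in image]

def roi_bounding_box(mask):
    """Bounding box (row0, col0, row1, col1) of the segmented pixels."""
    coords = [(r, c) for r, row in enumerate(mask)
              for c, v in enumerate(row) if v]
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return min(rows), min(cols), max(rows), max(cols)

# A toy 5x5 grey-level image with one bright 2x2 region on a dark background.
image = [[10, 10,  10,  10, 10],
         [10, 10, 200, 210, 10],
         [10, 10, 205, 220, 10],
         [10, 10,  10,  10, 10],
         [10, 10,  10,  10, 10]]
mask = threshold(image, 128)
```

Real medical images of course need the enhancement and characterization steps covered in the tutorial before and after this one, but the thresholding step stays conceptually the same.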

Databases and standards for microarrays

Guillermo Hugo López Campos
Área de Bioinformática y Salud Pública, Instituto de Salud Carlos III, Ctra. Majadahonda-Pozuelo Km , 28220, Majadahonda, Spain.

Abstract. Microarrays are a common tool in biomedical studies and generate a large amount of information. Over the last decade, the efforts devoted to developing standards and public repositories have led to large amounts of information and data that can be exploited through in silico experiments. This tutorial aims to introduce students to the handling and retrieval of the data and information stored in public repositories, for later reanalysis or integration into other studies. Attendees will also be shown the requirements for submitting their own studies to these repositories, a necessary condition for their subsequent publication.

Keywords: Bioinformatics, microarrays, gene expression, databases, standardization.

1 Introduction

This section of the tutorial addresses the aspects related to the standardization of experiments performed with microarray-based techniques, and how these standards affect the storage and retrieval of the related information in public databases. At their inception, microarray experiments represented a major leap in the field of biology because of the complexity and the amount of information needed to interpret their results. In the earliest microarray studies, results were presented in very diverse ways by researchers, and likewise there was no uniformity in how assay data were supplied for validation.
Against this background, at the end of the 1990s the need arose to establish uniform criteria for the storage and distribution of the data and information related to these experiments. After several initiatives, it was finally a public consortium constituted as a scientific society, the Microarray Gene Expression Data Society (MGED) [1], that published a standard for the minimum information that must be supplied together with microarray experiments, the MIAME standard (Minimum Information About a Microarray Experiment) [2], and an XML-based data-exchange standard, MAGE-ML (MicroArray Gene Expression Markup Language) [3]. Alongside the development of these two standards, the main public data repositories following these recommendations were published, and the main scientific publishers gradually adopted the requirement that the data and related information in scientific articles be published using these standards.

These standardization efforts have in turn facilitated validation and comparability studies of microarray-based technologies, collected in the two editions of the MAQC (MicroArray Quality Control) study [4, 5]. The first phase of this work focused on analyzing the reproducibility of assays across the main technological platforms developed for microarray manufacturing and of the methodologies for generating the data. The second phase analyzed the different methodologies for data analysis. Today, microarray-based techniques and results are widely accepted and have become a popular methodology for numerous post-genomic studies in health and biomedicine.

2 Objectives of the tutorial

The main objective of this tutorial is to familiarize attendees with the main aspects of the existing standards in the microarray field, for their later use, and with the main databases currently available for storing and retrieving information related to these assays. The practical part of this section of the tutorial will cover:

The main public microarray data repositories, presenting:
  o Contents and features.
  o Search interfaces.
  o Result-presentation interfaces.
Methods for retrieving information from the databases.
Methods for uploading data to the databases.
  o Requirements in GEO.
  o Requirements in AE.

3 Practical material

The practical part is based on resources available online; the following resources will be visited:

Gene Expression Omnibus (GEO) [6]. The molecular-abundance repository developed by the US NCBI. This large database holds data from microarrays as well as from other platforms.

ArrayExpress (AE) [7]. A functional-genomics data repository holding data from microarrays and from ultrasequencing experiments. This resource is maintained by the European Bioinformatics Institute (EBI).

Gene Expression Atlas [8]. A curated, re-annotated subset of the ArrayExpress data that can be queried for individual genes across the different conditions and experiments present in ArrayExpress.

Stanford Microarray Database (SMD) [9]. The database of Stanford University in the USA, one of the pioneering centres in the development of microarray technology.

References

1. Functional Genomics Data Society (formerly MGED).
2. Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, et al. Minimum information about a microarray experiment (MIAME)-toward standards for microarray data. Nat Genet 2001 Dec;29(4).
3. Spellman PT, Miller M, Stewart J, Troup C, Sarkans U, Chervitz S, et al. Design and implementation of microarray gene expression markup language (MAGE-ML). Genome Biol 2002 Aug 23;3(9).
4. Shi L, Reid LH, Jones WD, Shippy R, Warrington JA, Baker SC, et al. The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotechnol 2006 Sep;24(9).
5. Shi L, Campbell G, Jones WD, Campagne F, Wen Z, Walker SJ, et al. The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotechnol 2010 Aug;28(8).
6. Barrett T, Troup DB, Wilhite SE, Ledoux P, Evangelista C, Kim IF, et al. NCBI GEO: archive for functional genomics data sets--10 years on. Nucleic Acids Res 2010 Nov.
7. Parkinson H, Sarkans U, Kolesnikov N, Abeygunawardena N, Burdett T, Dylag M, et al. ArrayExpress update--an archive of microarray and high-throughput sequencing-based functional genomics experiments. Nucleic Acids Res 2010 Nov.
8. Kapushesky M, Emam I, Holloway E, Kurnosov P, Zorin A, Malone J, et al. Gene expression atlas at the European Bioinformatics Institute. Nucleic Acids Res 2010 Jan;38(Database issue):D690-D.
9. Hubble J, Demeter J, Jin H, Mao M, Nitzberg M, Reddy TB, et al. Implementation of GenePattern within the Stanford Microarray Database. Nucleic Acids Res 2009 Jan;37(Database issue):D898-D901.

Tutorial 2.2. Feature selection techniques in DNA microarray domains: basic theory and the most common methods for selecting differentially expressed genes

Iñaki Inza
Intelligent Systems Group, Computer Science Faculty, University of the Basque Country, Paseo Manuel de Lardizabal 1, Donostia - San Sebastián, Gipuzkoa, Spain
Homepage:

1 Tutorial summary

The emergence of DNA microarray technology [1] at the end of the 1990s, and its popularization in recent years, are driving a quiet but steady revolution in the study of the mechanisms underlying many diseases of genetic origin. Its application to the detection of the biomarkers associated with the origin of diseases is a necessary step in modern biomedical studies. Even so, because of the huge number of genes it can monitor, its use requires basic data-analysis knowledge, and very especially knowledge of feature selection techniques [2].

Mainly from a data-analysis and data-mining point of view, the tutorial will present the basic theory of feature selection, with special emphasis on the different variants found in the literature. It will show the techniques most used in DNA microarray studies for selecting genes differentially expressed between phenotypes, recounting our research group's previous experiences in real studies. The tutorial will rely on the following tools and references: a review of the literature and of the most common tools in the feature selection area, in the reference by Saeys and colleagues [3].
This reference reviews the role of feature selection techniques not only in DNA microarray domains but also in sequence analysis, mass spectrometry, SNPs and text mining. The free-software tool WEKA [4], so popular in data-mining environments, offers a wide battery of feature selection techniques. A collection of public DNA microarray databases in the format required by WEKA is available at the following portal of the bioinformatics group of the Universidad Pablo de Olavide (Seville):

References

1. Causton, H., Quackenbush, J., Brazma, A.: Microarray Gene Expression Data Analysis. Blackwell Publishing (2003)
2. Guyon, I., Elisseeff, A.: An introduction to variable and feature selection. Journal of Machine Learning Research 3 (2003)
3. Saeys, Y., Inza, I., Larrañaga, P.: A review of feature selection techniques in bioinformatics. Bioinformatics 23(19) (2007)
4. Witten, I.H., Frank, E.: Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2005)

R + Bioconductor as a gene expression microarray analysis platform for building predictive models

Juan M Garcia-Gomez
Grupo de Informática Biomédica, Instituto de Aplicaciones de las Tecnologías de la Información y de las Comunicaciones Avanzadas (ITACA), Universitat Politècnica de València, Edificio 8G, Camino de Vera s/n, Valencia, Spain

Abstract. Working with microarrays requires a computational platform that supports the many tasks involved in processing the data and building predictive models from it. R is a statistical software package and programming language for data manipulation, computation and graphical representation. Bioconductor is a project that develops and maintains libraries for the analysis of data from the biological and health sciences. In this tutorial we will study the key functions that R and Bioconductor provide for the different tasks of gene expression microarray analysis, with particular attention to those that lead to predictive models. We will follow a sequential structure, guided by the steps needed to analyze a diagnosis problem through gene expression, starting with reading the intensity files. After quality control and preprocessing of the data, we will apply some of the feature selection techniques explained in the previous tutorial, with the final goal of obtaining and evaluating a predictive model for diagnosis based on gene expression. The tutorial will work through examples on Affymetrix 3' IVT arrays and will end with an example on Exon ST arrays.

Key words: bioinformatics, machine learning, medical decision support, R-project, Bioconductor, microarrays, gene expression, predictive models

1. Tutorial objectives

In this practical session we will use the functions available in Bioconductor and R to process and analyze Affymetrix gene expression microarray experiments, working through a complete data-mining study on gene expression data. The steps will be the following:

1. Microarray preprocessing (including reading, visualization and quality control)
2. Feature selection/extraction
3. Classification and evaluation
4. Clustering

2. Practical session material

- R
- Bioconductor
- Specific Bioconductor libraries
- A (simulated) brain tumour sample set based on Affymetrix HG-U133 Plus 2 arrays
- A colon cancer sample set based on Exon ST arrays

The basic references of the tutorial are [1,2,3].

2.1. Notes on the material
The R and Bioconductor functions depend strongly on the library versions. The versions indicated in section 2 must be installed in order to follow the tutorial exercises without incident.

3. Installing R and Bioconductor
R can be installed either directly from an installer file (e.g. an R .pkg for Mac OS X 10.5) or by compiling the sources. The direct installation sets the system up to run the R console linked against the 32-bit or 64-bit libraries. The Bioconductor core can be installed through the online biocLite script, which installs the libraries affy, affydata, affyPLM, affyQCReport, annaffy, annotate, Biobase, biomaRt, Biostrings, DynDoc, gcrma, genefilter, geneplotter, GenomicRanges, hgu95av2.db, limma, marray, multtest, vsn, xtable and their dependencies on our system.

> source("")
> biocLite()

In addition, during the tutorial we will use functions from several other Bioconductor libraries, so they must also be installed through biocLite.

> source("")
> biocLite("annotate")
> biocLite("hgu133plus2.db")
> biocLite("hgu133plus2cdf")
> biocLite("geneplotter")
> biocLite("oligo")
> biocLite("")

4. Exercise: reading 3' IVT arrays

4.1. Objective
In this exercise we will read a group of microarrays located in the invdata folder.

4.2. Notes
We will use the file invdata/corpus.txt, which contains the phenotypic information. The function read.AnnotatedDataFrame creates an object with the phenotypic information of the samples from a txt file. ReadAffy creates an object of class AffyBatch containing the microarrays. Probe is the term used for the short sequence (e.g. an oligonucleotide of 25 bases) of a gene placed in one cell of the microarray, to which a labelled sequence from the sample (the target) will hybridize. On 3' IVT arrays, each gene is represented by several informative probes (e.g. between 11 and 20, together called a probe set), named Perfect Match (PM), plus altered copies of them named Mismatch (MM).

4.3. Development
1. Run the following code to read the .cel files containing the microarrays. The result is left in the object celdata. We include information about the diagnosis of the samples in the phenoData attribute associated with each array.

> library(affy)
> library(hgu133plus2cdf)
> celfiles <- list.celfiles(path = "invdata", full.names = TRUE)
> celdata <- ReadAffy(filenames = celfiles)
> phenodatatable <- read.AnnotatedDataFrame(filename = "invdata/corpus.txt")
> sampleNames(celdata) <- sampleNames(phenodatatable)
> phenoData(celdata) <- phenodatatable

2. Typing the name of the celdata object prints a summary of its contents. How many microarrays have we loaded? What is the dimension of each microarray? We can access the second microarray using matrix notation.

> celdata
> celdata2 <- celdata[, 2]

3. With the function sampleNames you can find out the names of the samples contained in the celdata object. List the sample names.
4. With the function phenoData you can retrieve the AnnotatedDataFrame object with the phenotypic information included in the phenotype file. Which phenotypic variables are available? What is the diagnosis of each case?

> varLabels(phenoData(celdata))
> pData(celdata)$consensus_diagnosis

5. With the function geneNames you can list the names of all the genes contained in the array type of the celdata microarrays.
6. With the function indexProbes, find out where the PM positions of the gene with code 213622_at lie in the microarrays of the celdata object. Use the following lines:

> geneNames(celdata)
> indexProbes(celdata, which = "pm", genenames = "213622_at")

5. Exercise: visualization and quality control

5.1. Objective
In this exercise we will visualize the intensities of our microarrays and apply some quality checks to our data in order to discard samples that do not reach a minimum quality.

5.2. Notes
Degradation of RNA molecules starts at the 5' end. We therefore expect lower intensity in probes near the 5' end than near the 3' end, and we expect the mean degradation of each microarray to be similar to that of the rest. Consequently, observing a pattern other than stronger degradation at the 5' end, or a degradation rate in one microarray very different from the others, would indicate poor hybridization quality for that microarray.

5.3. Development
1. Use the image command to visualize the intensity matrix of the first microarray with a heat.colors scale of 32 colours.

> image(celdata[, 1], col = heat.colors(32))

2. Use the boxplot function to draw box-and-whisker plots of the intensities of each microarray in log base 2. Are the distributions of the microarrays exactly equal to each other?

> boxplot(celdata)

3. Use the hist function to plot the probability densities of the intensities (in log base 2) of the first 8 microarrays.

> hist(celdata[, 1:8])

4. Check whether there are intensity-dependent variations of the probes between cases using an MA-plot. Apply the MAplot function to samples s7, s8 and s9.

> MAplot(celdata[, 7:9], pairs = TRUE, plot.method = "smoothscatter")

5. We can explore the RNA degradation of our microarrays with the AffyRNAdeg function. In particular, the slopes of the fitted lines indicate the degradation rate of each microarray, so they should be as similar as possible. Compare the degradation rates of the samples in celdata qualitatively.

> deg <- AffyRNAdeg(celdata)
> summaryAffyRNAdeg(deg)
> plotAffyRNAdeg(deg)

6. Exercise: microarray preprocessing

6.1. Objective
We will preprocess our microarrays with the RMA (Robust Multi-array Average) methodology.

6.2. Notes
Preprocessing is performed to obtain the gene expression value associated with each gene from the probe intensity values of our microarrays. It usually includes the following steps:
1. Background correction.
2. Normalization.
3. Probe-specific correction (e.g. subtraction of the MM value).
4. Summarization of the probe set values to obtain the expression level of each gene.

The affy library includes the functions expresso and rma, among others, which carry out the complete preprocessing of a set of microarrays. Both methods convert the AffyBatch object into an ExpressionSet object.
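To make step 2 (normalization) concrete: quantile normalization, the variant used by RMA, forces every array to share the same empirical intensity distribution by replacing each value with the mean of the values at its rank across arrays. A minimal Python sketch of the idea (not the affy implementation; ties and real intensity scales are ignored for brevity):

```python
def quantile_normalize(arrays):
    """Replace each value by the mean of the values at its rank across arrays.

    `arrays` is a list of equal-length lists, one per microarray."""
    n = len(arrays[0])
    # For each array, the positions of its values in ascending order.
    order = [sorted(range(n), key=a.__getitem__) for a in arrays]
    # Reference distribution: mean of the k-th smallest values across arrays.
    ref = [sum(a[o[k]] for a, o in zip(arrays, order)) / len(arrays)
           for k in range(n)]
    out = [[0.0] * n for _ in arrays]
    for a_idx, o in enumerate(order):
        for rank, pos in enumerate(o):
            out[a_idx][pos] = ref[rank]
    return out
```

After normalization every array has exactly the same sorted values; on real data the same idea is applied to the log2 probe intensities of each array.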

6.3. Development
1. The expresso method lets you specify which method to use at each preprocessing step. The RMA methodology uses the RMA background correction, quantile normalization, takes the expression of each probe as the PM value while ignoring the MM, and performs a robust summarization based on median polish.

> esetexpresso <- expresso(celdata, normalize.method = "quantiles",
+     bgcorrect.method = "rma", pmcorrect.method = "pmonly",
+     summary.method = "medianpolish")

2. The rma function calls a C implementation of the same methodology, which greatly reduces the time and memory needed for its execution; in exchange, it does not allow swapping in different intermediate methods.

> eset <- rma(celdata)

3. Check with hist that the probability density functions obtained for the microarrays are similar under both methods.

7. Exercise: selection by differential gene expression

7.1. Objective
In this exercise we will look for the genes that are expressed with different intensity in two diagnosis groups, A and B. To do so, we will use selection by differential expression with the linear model provided by limma.

7.2. Notes
The linear model for microarrays (limma) fits the intensities of each gene (y_g) with a linear model on the experimental factors (y_g = X α_g). In our case the only factor is the tumour class, with two possible levels (A and B). Given n samples, limma estimates the coefficient matrix α of the model Y = Xα, where Y is the n × G matrix of the intensities of the G genes in the n samples; X is the n × 2 design matrix, whose first column is a vector of ones and whose second column is the factor vector, with 0 when the sample is A and 1 when it is B; and α is the 2 × G matrix with the coefficients (intercept and factor weight) of the linear model for each gene. Based on the fitted linear model, limma estimates which genes best separate the samples into classes A and B.

We will load a database with more microarrays into memory. Reading and preprocessing the .cel files takes on the order of an hour, so we start from the results that the exercise of section 6 would produce on that database.
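For a single gene, the least-squares fit of y_g = Xα_g with this 0/1 design has a closed form: the intercept is the class-A mean and the factor weight is the B-minus-A mean difference. The "fdr" adjustment applied later by topTable is the Benjamini-Hochberg procedure. A simplified Python sketch of both ideas, using ordinary least squares rather than limma's moderated statistics:

```python
def fit_gene_coefficients(y, labels):
    """Closed-form least-squares fit of y = alpha0 + alpha1 * label
    for one gene, with 0/1 group labels (0 = class A, 1 = class B)."""
    a = [v for v, lab in zip(y, labels) if lab == 0]
    b = [v for v, lab in zip(y, labels) if lab == 1]
    alpha0 = sum(a) / len(a)             # intercept: class-A mean
    alpha1 = sum(b) / len(b) - alpha0    # factor weight: B minus A
    return alpha0, alpha1

def benjamini_hochberg(pvals):
    """BH-adjusted p-values (the 'fdr' method): p * m / rank, made monotone."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adj = [0.0] * m
    prev = 1.0
    for rank in range(m, 0, -1):         # walk from largest p to smallest
        i = order[rank - 1]
        prev = min(prev, pvals[i] * m / rank)
        adj[i] = prev
    return adj
```

limma improves on the per-gene fit by moderating the variance estimates across all genes before computing the test statistics, which is what eBayes adds.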

7.3. Development
1. We load the preprocessed microarray database from an R data file.

> load("invdata/exprdata.rdata")
> exprdata

2. The phenoData function on the exprdata object returns the phenotypic information of the samples; the sampleNames function on the same object returns the vector of sample names. We will now build the design matrix of the differential expression experiment.
3. Specify the X matrix of the linear model, with n rows and 2 columns. The first column is a vector of ones; in the second column, 0 corresponds to label A and 1 to label B.

> labels <- phenoData(exprdata)$diagnosis
> design <- model.matrix(~factor(labels))
> colnames(design) <- c("intercept", "label")
> rownames(design) <- sampleNames(exprdata)

4. Fit the limma linear model to our data with lmFit and compute the differential expression statistics between the classes with eBayes, showing the genes with the largest difference between classes with topTable. We adjust the differential expression p-values for multiple testing, selecting those with an adjusted p-value below 1e-4.

> library(limma)
> fit <- lmFit(exprdata, design)
> fit <- eBayes(fit)
> difexpressedgenestable <- topTable(fit, coef = "label", number = 100,
+     adjust = "fdr")
> difexpressedgenes <- difexpressedgenestable[difexpressedgenestable[,
+     6] < 1e-04, 1]

5. Visualize the selected genes on a representation of the chromosomes. Which chromosome has the largest number of genes involved?

> library(annotate)
> newchrom <- buildChromLocation("hgu133plus2.db")
> cPlot(newchrom, c("1", "2", "3", "4", "5", "6", "7", "8", "9",
+     "10", "11", "12", "13", "14", "15", "16", "17", "18", "19",
+     "20", "21", "22", "X", "Y"))
> cColor(difexpressedgenes, "red", newchrom)

8. Exercise: supervised classification of samples

8.1. Objective
In this exercise we will train and evaluate an LDA (linear discriminant analysis) classification model based on a gene signature to discriminate between A and B tumours.
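The development of this exercise evaluates the classifier through a confusion matrix and asks for a metric combining the sensitivities of both classes. One reasonable choice, sketched here in Python rather than in the tutorial's R session, is the geometric mean of the per-class sensitivities, which falls to 0 as soon as either class is never recognised; the helper names are illustrative:

```python
def confusion_matrix(true_labels, predicted):
    """Counts of (true, predicted) label pairs."""
    counts = {}
    for t, p in zip(true_labels, predicted):
        counts[(t, p)] = counts.get((t, p), 0) + 1
    return counts

def class_sensitivity(counts, cls):
    """Fraction of samples of class `cls` predicted as `cls`."""
    total = sum(v for (t, _), v in counts.items() if t == cls)
    return counts.get((cls, cls), 0) / total

def combined_sensitivity(counts, classes=("A", "B")):
    """Geometric mean of per-class sensitivities: one possible answer to
    the 'invent a metric' question (illustrative, not from the tutorial)."""
    prod = 1.0
    for c in classes:
        prod *= class_sensitivity(counts, c)
    return prod ** (1.0 / len(classes))
```

Sweeping the prior probabilities and keeping the combination that maximizes this score is one way to answer the prior-tuning question below.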

8.2. Development
1. Train the classification model with lda, based on the differentially expressed genes selected above.

> library(MASS)
> z <- lda(factor(labels) ~ ., data.frame(t(exprs(exprdata[difexpressedgenes,
+     ]))))
> plot(z)

2. Predict the labels of the samples with the lda model. Obtain the confusion matrix of the real labels against the predicted labels.

> trainresult <- predict(z, data.frame(t(exprs(exprdata[difexpressedgenes,
+     ]))))$class
> labels <- factor(labels)
> trainresult <- factor(trainresult)
> levels(trainresult) <- levels(factor(labels))
> confusmatrix <- table(labels, trainresult)

3. What is the error estimate of the classifier?
4. Modify the prior probability used by predict to improve the sensitivity of class B.

> trainresult <- predict(z, data.frame(t(exprs(exprdata[difexpressedgenes,
+     ]))), prior = c(0.2, 0.8))$class
> labels <- factor(labels)
> trainresult <- factor(trainresult)
> levels(trainresult) <- levels(factor(labels))
> confusmatrix <- table(labels, trainresult)

5. Invent an evaluation metric that combines the sensitivities of both classes. Use it to find the best combination of prior probabilities for the classifier.
6. Optional: modify the code above to perform a leave-one-out validation and estimate the error again.

9. Exercise: unsupervised classification of genes and samples

9.1. Objective
We will perform an agglomerative hierarchical clustering of the genes and a similar one of the samples, representing both results jointly with the heatmap tool.

9.2. Development
Use the heatmap command to perform a double clustering of genes and samples. Note the use of the ColSideColors and scale parameters: what is each of them used for?

> labelcolors <- factor(labels)
> levels(labelcolors) <- rainbow(length(unique(labels)))
> mexprdata <- exprs(exprdata[difexpressedgenes, ])
> heatmap(mexprdata, col = topo.colors(100), ColSideColors =
+     as.character(labelcolors), scale = "row")

10. Exercise: reading and processing Exon ST arrays

10.1. Objective
We will use the functions of the oligo package to carry out this exercise. Because of the amount of data handled with this type of array, a machine with ample computing power and memory is recommended.

10.2. Notes
The package must be installed before running the code of this exercise, since it involves downloading 256 MB.

10.3. Development
Reading the data from the .cel files is very similar to the previous case; the functions of the oligo class make it straightforward.

> celfiles <- list.celfiles("exondata", full.names = TRUE)
> affyexonfs <- read.celfiles(celfiles)

To process the arrays at the exon level we can use the rma function at the probeset level.

> probesetsummaries <- rma(affyexonfs, target = "probeset")

We can process the microarrays at the gene level using the meta-probeset (MPS) annotation files provided by Affymetrix. Depending on the confidence level, we can choose among the core, extended and full annotations.

> probesetsummaries <- rma(affyexonfs, target = "core")

References
1. Carvalho, B.S., Irizarry, R.A.: Hands-On: A Framework for Oligonucleotide Microarrays Preprocessing. Vignettes (Bioconductor)
2. Gautier, L., Irizarry, R., Cope, L., Bolstad, B.: Description of affy. Vignettes (Bioconductor) (2011)
3. Garcia-Gomez, J.M.: Asignatura Bioinformática. Escuela Técnica Superior de Ingeniería Informática, UPV, Valencia, Spain (2011)

Genomic variant detection in DNA-seq studies

Gonzalo Gómez, PhD, Unidad de Bioinformática, CNIO

Introduction: The tutorial gives participants the theoretical basis for the analysis of genomic variants in massive sequencing (NGS) studies of cancer. It will present the main tools currently used for variant extraction, comparison and visualization. In the hands-on part, several exercises are proposed using PileLine, a tool for handling DNA variant files efficiently. Participants will need internet access in order to download PileLine and to follow the script of the proposed activities.

Tutorial contents:
A. Theoretical introduction (presentation): Introduction to genomic variants. Genomic variants and disease. Variant calling tools. Variant analysis and prediction of biological consequences.
B. Hands-on (for participants): Supervised exercises using the PileLine tool; questions and discussion.
Exercises with PileLine:
Exercise 1. Comparison between variant files. Case-control comparisons. Multiple comparisons.
Exercise 2. Variant annotation.

Annotation with SNPs. Annotation with gene identifiers.
Exercise 3. Generating inputs for mutation-consequence prediction software: generating input for the SIFT tool.
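At its core, the case-control comparison of Exercise 1 is a set operation over variant keys (chromosome, position, reference allele, alternative allele). A minimal Python sketch of the idea; the four-column tab-separated format used here is illustrative and is not PileLine's actual file format or API:

```python
def variant_key(line):
    """Build a comparable key (chrom, pos, ref, alt) from one tab-separated
    variant record. The column layout is illustrative only."""
    chrom, pos, ref, alt = line.rstrip("\n").split("\t")[:4]
    return chrom, int(pos), ref, alt

def case_only_variants(case_lines, control_lines):
    """Variants present in the case sample but absent from the control."""
    control = {variant_key(l) for l in control_lines}
    return [variant_key(l) for l in case_lines
            if variant_key(l) not in control]
```

Multiple comparisons generalize this to intersections and differences over several samples' key sets.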


IntOGen and Gitools: browsing, visualization and integrative analysis of genomic data

Sophia Derdak, PhD, Biomedical Genomics Group, GRIB, IMIM-UPF

1 Introduction
The tutorial gives participants the theoretical basis of the field of integrative cancer genomics. We will get to know the analysis capabilities of IntOGen and Gitools, two tools that provide access to, visualization of, information extraction from and manipulation of a multitude of oncogenomic data sets. In the hands-on part, several exercises will be worked through using the IntOGen browser, for which participants will need internet access.

2 Tutorial contents
A. Theoretical introduction (presentation): Introduction to cancer genetics and genomics. Public sources of genomic data. Integrative analysis of oncogenomic data. IntOGen and Gitools: the browser and the integrative data-analysis interface. The presentation will be available shortly.
B. Hands-on (for participants): Supervised exercises using the IntOGen browser; questions and discussion.
IntOGen exercises:
Exercise 1. Searching by gene:

Search for the gene Myc in IntOGen: In which tumour types is Myc altered? Which alteration types do we find? Enter the tumour type "bones and joints" for the gene Myc and open the "experiments" tab: How many copy number alteration (CNA) experiments document a significant amplification of Myc? Enter the gene expression experiment of Li Z et al.: How many samples does this experiment contain? How many of these samples overexpress the gene Myc?
Exercise 2. Searching by experiment and pathway: Find the glioblastoma experiment of the Cancer Genome Atlas Research Network: Which KEGG pathways show the most altered expression in this experiment?
Exercise 3. Prioritizing cancer genes: A mutation screening of human genes in 22 Glioblastoma multiforme tumour samples (Parsons et al., Science 2008) identified 37 candidate cancer genes with frequent mutations. Access the gene list online, load it into IntOGen using the filters option, and use IntOGen to prioritize it.
Exercise 4. Participants' suggestions

Web services and supercomputing infrastructure

Josep Gelpi

1 Introduction
The development of complex bioinformatics analyses requires combining tools that are often distributed and, especially in the NGS setting, data that are distributed as well. The bioinformatics paradigm is evolving towards the concept of workflows based on web services and on grid, and even cloud, computing tools.

2 Contents
1. Data integration in biology
2. The web-service and workflow concepts
3. Available technologies
4. Adaptation to HPC
5. Tools for NGS

Hands-on: Workflow management tools (Taverna, jORCA). Data managers and integrated tools (BioMart, Galaxy, genome browsers).

DisGeNET: visualize, integrate, search and analyze gene-disease networks

Bauer-Mehren A, Rautschka M, Sanz F, Furlong LI
Research Programme on Biomedical Informatics (GRIB), IMIM (Hospital del Mar Research Institute), Universitat Pompeu Fabra, Barcelona, Spain

Abstract. DisGeNET is a plugin for Cytoscape to query and analyze human gene-disease networks. DisGeNET allows user-friendly access to a new gene-disease database that we have developed by integrating data from several public sources. DisGeNET allows queries restricted to (i) the original data source, (ii) the association type, (iii) the disease class or (iv) specific gene(s)/disease(s). It represents gene-disease associations as bipartite graphs and provides gene-centric and disease-centric views of the data. It assists the user in interpreting and exploring the genetic basis of human diseases through a variety of built-in functions. Moreover, DisGeNET permits multicolouring of nodes (genes/diseases) according to a standard disease classification for expedient visualization. In this tutorial we will cover basic and advanced functionalities of DisGeNET.

Keywords: biological networks; network analysis; data integration; text mining; systems biology; network medicine

DisGeNET user guide: contents

1. Installation guide
   1.1. Download and install DisGeNET
   1.2. Troubleshooting
        Allocating more memory
        Download and installation problems
2. DisGeNET database
   2.1. Original data sources
   2.2. Generation of gene-disease networks
   2.3. Mapping of disease vocabularies
   2.4. Gene-disease association ontology
3. Tutorial
   3.1. Basic functions
        Generate gene-disease association network
        Generate gene or disease projection network
        Restrict the network to a certain association type
        Restrict the network to a certain disease class
        Search for a particular gene/disease or set of genes/diseases
        DisGeNET LinkOut
   3.2. DisGeNET Expand
        Expand DisGeNET networks
        Expand foreign networks
   3.3. Specific use cases
        Which are the genes annotated to breast cancer in expert curated databases?
        Do comorbidities observed in patients reflect a common genetic origin of the diseases?
        Which are the diseases that are associated to post-translational modifications such as phosphorylation?
   3.4. Analyzing DisGeNET data using external tools
        Extract data from DisGeNET database
        Build networks using the igraph library
Acknowledgements
Attribute tables
References

1. Installation guide

1.1. Download and install DisGeNET
Download DisGeNET.jar and put the jar (DisGeNET.jar) in the Cytoscape "plugins" folder (the default location on Windows is C:\Program Files\Cytoscape-v2.x\plugins). The plugin will be loaded automatically the next time Cytoscape is started and will appear as an item in the Plugins menu. You can start the plugin by clicking on Start DisGeNET. The first time you start the plugin, it will automatically download and unpack the gene-disease database (DisGeNET.db, ~326.5 MB) into a directory of your choice.

The download might take several minutes. When the download is finished, the plugin starts automatically and is ready to be used. The database folder can be changed at any time; please restart the plugin to activate the change.

1.2. Troubleshooting

Allocating more memory. Some of the networks are very large, especially when using LHGDN or ALL as the source database. In order to visualize large networks, you need to allocate more memory to Cytoscape. Memory usage depends on the number of nodes/edges and the number of attributes. For detailed information, check the Cytoscape manual for your Cytoscape version (e.g. 2.7.0).

Download and installation problems. Make sure you have write permission for the Cytoscape subfolders. If the download is interrupted with a NullPointerException (on Linux or Mac OS X), try starting Cytoscape from the command line in the installation folder instead of via the icon, e.g.: sh /Applications/Cytoscape-6.x/

2. DisGeNET database

The DisGeNET database integrates human gene-disease associations from various expert-curated databases and from text-mining derived associations, covering Mendelian, complex and environmental diseases (Bauer-Mehren, et al., 2010). The integration is performed by means of gene and disease vocabulary mapping and by using a gene-disease association ontology, as described below.

2.1. Original data sources

OMIM: Online Mendelian Inheritance in Man (OMIM) focuses on inherited or heritable diseases (Hamosh, et al., 2005). Gene-disease associations were obtained by parsing the mim2gene file for associations of type phenotype (data downloaded on June 6th, 2009). All associations were labelled phenotype, as provided in the mim2gene file, and classified as Marker in our gene-disease association ontology. In total, we obtained 2198 distinct genes and 2473 distinct disease terms, resulting in 3432 gene-disease associations. After mapping of disease vocabularies, the OMIM network contained 2417 distinct diseases.

UNIPROT: UniProt/SwissProt is a database containing curated information about protein sequence, structure and function (Apweiler, et al., 2004). Moreover, it provides information on the functional effect of sequence variants and their association to disease. We extracted this information from UniProt/SwissProt release 57.0 (March 2009) as described in (Bauer-Mehren, et al., 2009). All protein identifiers were converted to Entrez Gene identifiers in order to allow integration with the other data sources. All gene-disease associations were classified as GeneticVariation. UniProt provided 1746 distinct gene-disease associations for 1240 distinct genes and 1475 distinct diseases.

PHARMGKB: The Pharmacogenomics Knowledge Base (PharmGKB) specializes in knowledge about pharmacogenes, the genes involved in modulating drug response.
Genes are classified as pharmacogenes because they are involved either (i) in the pharmacokinetics of a drug (how the drug is absorbed, distributed, metabolized and eliminated) or (ii) in the pharmacodynamics of a drug (how the drug acts on its target and its mechanisms of action) (Altman, 2007). PharmGKB therefore covers human gene-disease associations less broadly, but was found to be complementary to the other sources, as it contains some gene-disease associations not present in the other repositories. We downloaded the relevant files on June 6th, 2009 and parsed them to extract gene-disease associations, and we furthermore used the Perl web services to obtain all available annotations and supporting information. We included 1772 associations for 79 distinct genes and 261 distinct diseases. PharmGKB associations were classified as Marker if the original label was Related, and as RegulatoryModification if the original label was Positively Related or Negatively Related.

CTD: The Comparative Toxicogenomics Database (CTD) contains manually curated information about gene-disease relationships, with a focus on understanding the effects of environmental chemicals on human health (Mattingly, et al., 2006). We downloaded the CTD_gene_disease_relations.tsv file on June 2nd, 2009 and parsed it for gene-disease associations of type marker or therapeutic (see the CTD documentation for a description of the original labels). CTD includes associations from OMIM, but with some differences: (i) for some associations extra information, such as cross-links to PubMed, is available, and (ii) some associations are missing in either of the two databases. Hence, we kept all available gene-disease associations from both sources. All CTD gene-disease associations were classified as Marker if the original label was marker and as Therapeutic if the original label was therapeutic. All cross-links to PubMed were kept. In total, CTD provided 6469 associations for 2702 distinct diseases and 3345 distinct genes.

LHGDN: The literature-derived human gene-disease network (LHGDN) is a text-mining derived database focused on extracting and classifying gene-disease associations with respect to several biomolecular conditions. It uses a machine-learning based algorithm to extract semantic gene-disease relations from a textual source of interest; the semantic gene-disease relations were extracted with F-measures of 78% (see (Bundschus, et al., 2008) for further details). More specifically, the textual source utilized here originates from Entrez Gene's GeneRIF (Gene Reference Into Function) database (Mitchell, et al., 2003). This database represents a rapidly growing knowledge repository and consists of high-quality phrases created or reviewed by MeSH indexers; each phrase refers to a particular gene in the Entrez Gene database and concisely describes its function. Using this textual repository for text mining has recently gained increasing attention due to the high quality of the textual data in the GeneRIF database (Bundschus, et al., 2008; Lu, et al., 2007; Rubinstein and Simon, 2005). LHGDN was created from a GeneRIF version of March 31st, 2009, whose phrases were restricted to the organism Homo sapiens.
We extracted all data from LHGDN and classified the original associations using our ontology. In total, LHGDN provided distinct gene-disease associations for 1850 diseases and 6154 distinct genes. The LHGDN is also available in the Linked Life Data Cloud.

Generation of gene-disease networks

Gene-disease associations were collected from several sources. The source databases use two different disease vocabularies (MIM and MeSH). Entrez Gene identifiers are used for genes (except for UniProt/SwissProt, which uses UniProt identifiers). Moreover, the kind of association differs among the databases and ranges from the generic term "related" to more specific terms such as "altered expression". In order to merge all gene-disease associations and present them in one comprehensive gene-disease network, we (i) mapped UniProt identifiers to Entrez Gene identifiers where necessary, (ii) mapped the MIM to the MeSH vocabulary where possible (see Mapping of disease vocabularies) and (iii) integrated associations through our gene-disease association ontology (see Gene-disease association ontology). We furthermore constructed different gene-disease networks for each source (OMIM, UNIPROT, PHARMGKB, CTD, LHGDN), as well as two integrated networks: CURATED (containing gene-disease associations from OMIM, UNIPROT, PHARMGKB or CTD) and ALL (containing all gene-disease associations). Our comprehensive database is also available as an SQLite database (DisGeNET.db). All gene-disease networks are represented as bipartite graphs. A bipartite graph has two types of vertices, and edges run only between vertices of unlike types (Newman, 2003). The bipartite graphs are multigraphs, in which two vertices can be connected by more than one edge. In our networks, multiple edges represent the multiple data sources reporting the gene-disease association. We generated two projections, one for the

diseases and one for the genes, using the igraph library in R (Csardi and Nepusz, 2006). The projected graphs contain only vertices of the same kind (monopartite), and two nodes are connected if they share a neighbour in the original bipartite graph. Before projecting the networks, we simplified the graphs and removed multiple edges. Hence, nodes that are connected by multiple edges in the original graph are connected by only one edge in the simplified graph. This simplification is needed in order to correctly run the projection as implemented in the igraph library. Moreover, the node degree in the simplified graphs represents the number of first neighbours.

Mapping of disease vocabularies

We used the MeSH hierarchy for disease classification. The repositories of gene-disease associations use two different disease vocabularies: MIM terms for OMIM diseases (used by OMIM, UniProt and CTD) and MeSH terms (used by CTD, PharmGKB and LHGDN). We used the UMLS Metathesaurus to map from the MIM to the MeSH vocabulary. This step was performed to merge disease terms representing the same disorder, thus reducing redundancy. We were able to map 497 MIM terms directly to MeSH using UMLS, and we additionally mapped 23 MIM terms using a string-matching approach. Briefly, we searched the UMLS Metathesaurus for MeSH terms for which at least one synonym exactly matches one of the synonyms describing the MIM term of interest. The resulting 63 matched terms were manually checked and reduced to 23 terms. For disease classification, we considered all 23 upper-level concepts of the MeSH tree branch C (Diseases), plus two concepts ("Psychological Phenomena and Processes" and "Mental Disorders") of the F branch (Psychiatry and Psychology). Moreover, we added one disease class, Unclassified, for all disease terms for which a classification was not possible. We categorized all diseases into one or more of the 26 possible disease classes.
For MeSH disease terms, we directly used their position in the MeSH hierarchy; for MIM disease terms that were not mapped to MeSH, we used the disease classification of (Goh, et al., 2007). We then mapped their disease classification to the MeSH hierarchy and extended the mapping using a disease classification available at CTD (CTD_disease_hierarchy.tsv, downloaded August 8th, 2009). In total, we were able to classify 3980 (98.39%) diseases. The disease classification allows filtering and searching of the network restricted to a disease class.

Gene-disease association ontology

For a correct integration of gene-disease association data, we developed a gene-disease association ontology. We classified all association types found in the original source databases into Association if there is a relationship between the gene/protein and the disease, and into NoAssociation if there is no association between a gene/protein and a certain disease (in other words, if there is evidence for independence between a gene/protein and a disease). The different association types from the original databases were mapped to the ontology for a seamless integration. In this study, we only considered gene-disease associations of type Association. The ontology is available online.
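The label-to-class integration described above can be sketched as a small lookup table. Only the PharmGKB and CTD rules stated in the text are shown; the function name and dictionary layout are our own illustration, not the actual DisGeNET code.

```python
# Sketch: map original source labels to classes of the gene-disease
# association ontology, following the rules stated in the text.
SOURCE_LABEL_TO_CLASS = {
    ("PHARMGKB", "Related"): "Marker",
    ("PHARMGKB", "Positively Related"): "RegulatoryModification",
    ("PHARMGKB", "Negatively Related"): "RegulatoryModification",
    ("CTD", "marker"): "Marker",
    ("CTD", "therapeutic"): "Therapeutic",
}

def classify_association(source, original_label):
    """Return the ontology class for a source-specific label (None if unmapped)."""
    return SOURCE_LABEL_TO_CLASS.get((source, original_label))
```

Unmapped labels return None, which in the integration step would flag an association type still missing from the ontology mapping.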

Figure 1: Gene-disease association ontology
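The projection step described in the Methods (simplify the multigraph, then connect two same-type nodes whenever they share a neighbour in the bipartite graph) can be sketched in plain Python. This is only an illustration of the operation; the actual analysis uses the igraph implementation.

```python
from itertools import combinations

def project(bipartite_edges, side):
    """Project a bipartite gene-disease graph onto one of its node sets.

    bipartite_edges: iterable of (gene, disease) pairs; multiple edges are
    allowed and are collapsed first, as in the simplification step above.
    side: 0 to project onto genes, 1 to project onto diseases.
    Returns undirected edges between same-type nodes that share at least
    one neighbour in the bipartite graph.
    """
    neighbours = {}
    for edge in set(bipartite_edges):          # collapse multiple edges
        node, other = edge[side], edge[1 - side]
        neighbours.setdefault(other, set()).add(node)
    projected = set()
    for shared in neighbours.values():         # nodes sharing this neighbour
        for a, b in combinations(sorted(shared), 2):
            projected.add((a, b))
    return projected
```

For example, two genes annotated to the same disease become directly connected in the gene projection, which is exactly the monopartite view the plugin offers.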

3. Tutorial

DisGeNET is a plugin for Cytoscape (Shannon, et al., 2003) to query and analyze human gene-disease networks. For this purpose, we have developed a new gene-disease association database integrating information from several expert-curated databases and a resource containing text-mining-derived associations (Bauer-Mehren, et al., 2010).

Basic functions

By selecting different data sources, association types and/or disease classes from their respective drop-down menus, you can generate different gene-disease association networks. In addition, gene-disease association networks can be generated around a specific disease or gene of interest using the search box provided with the plugin. Most of these functionalities are also available to generate disease and gene monopartite networks.

Generate a gene-disease association network

In order to obtain a gene-disease association network without any restrictions on association type and disease class, follow these steps:

- Select the source of interest, e.g. CURATED, containing information from all expert-curated databases in our database (OMIM, PHARMGKB, UNIPROT and CTD).
- Set Association Type and Disease Class to Any.
- Press Create Network.
- Apply a Cytoscape layout algorithm to generate the view of choice, e.g. select the Organic layout.

Once the network is obtained, specific information on the nodes and their relationships can be explored as detailed below:

- Select nodes and edges and check their attributes. For example, use the Cytoscape search function to query for Alzheimer Disease. For this purpose, modify the search options and select the attribute diseasename.
- Search for a particular disease, e.g. Alzheimer Disease.
- Zoom into the network and select the Alzheimer Disease node.
- More information about this node is found in the Node Attribute Browser.

All available node and edge attributes are listed in Tables 1 and 2. You may want to select the attributes to be displayed in the Node Attribute Browser or Edge Attribute Browser of the Cytoscape Data Panel. Select an edge to display information about a particular gene-disease association, such as the associationtype, the data source providing the association, supporting evidence (PubMed identifiers), etc.

Generate a gene or disease projection network

In addition to bipartite graphs representing gene-disease associations, DisGeNET allows generating monopartite networks representing the gene or the disease projection of the

gene-disease association network. In order to obtain the disease projection of the network generated from the CURATED source (described in 2.1.1), follow the instructions detailed below:

- Select the Disease Projection tab in the DisGeNET main panel.
- Select the source, e.g. CURATED.
- Press Create Network.

Restrict the network to a certain association type

Note: This option is only available for gene-disease networks.

- Select the Source, e.g. CURATED.
- Select the Association Type, e.g. Genetic variation.
- Press Create Network.

Restrict the network to a certain disease class

Note: This option is available for all types of networks. The classification is based on the disease branch of the MeSH hierarchy.

- Select the Source, e.g. CURATED.
- Select the Disease Class, e.g. Digestive System Diseases.
- Press Create Network.

Search for a particular gene/disease or set of genes/diseases

The search option included in the DisGeNET tab can be used to generate networks around a disease or gene of interest. In addition, it can be used to search for a given disease or gene of interest in a network already generated.

If "only current net" is not ticked, a network containing only associations related to the query will be created (using Create Network). If "only current net" is ticked, the corresponding node will be selected (highlighted yellow) in the current network (with active view) when pressing [Enter]. The search is restricted to the Source, Association Type and Disease Class as selected. In this example, we are searching for any kind of Alzheimer Disease (there are four different types) in the CURATED dataset, without any restriction on association type or disease class.

Note: The DisGeNET search allows the use of the wildcard symbol (*). For performance reasons, only the first 50 matching terms are listed in the drop-down box, but all are included in the generated network.

DisGeNET LinkOut

In order to get more information about a gene or a disease node, you can link out to the corresponding website (Entrez Gene, OMIM or MeSH) using the DisGeNET LinkOut function. It is available in the node context menu, which can be accessed by right-clicking a selected node. For gene nodes, a linkout to Entrez Gene is given. For disease nodes, linkouts to MeSH or OMIM (depending on the type of disease node) are given.
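Conceptually, LinkOut maps a node identifier to the corresponding external website, dispatching on the identifier prefix. The URL templates below are assumptions chosen for illustration; the plugin's actual link targets may differ.

```python
# Sketch of the LinkOut dispatch: disease nodes carry a "mesh:" or "omim:"
# prefix, gene nodes a bare Entrez Gene number. URL templates are assumed.
def linkout_url(node_id):
    """Return an external website URL for a gene or disease node identifier."""
    if node_id.startswith("mesh:"):
        return "https://meshb.nlm.nih.gov/record/ui?ui=" + node_id[5:]
    if node_id.startswith("omim:"):
        return "https://omim.org/entry/" + node_id[5:]
    # No recognized disease prefix: treat it as an Entrez Gene identifier.
    return "https://www.ncbi.nlm.nih.gov/gene/" + node_id
```

The same prefix convention is what makes the "depending on the type of disease node" behaviour above possible.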

DisGeNET Expand

In order to find all diseases/genes that are associated with a gene/disease node in an existing network, you can use the DisGeNET Expand function. It can either be used to create new DisGeNET networks using the selected nodes for the query, or to expand the existing nodes with edges found in DisGeNET. Note: the function works with one or more selected nodes. To call the function, select one or more nodes, then click the right mouse button. This opens the node context menu containing the DisGeNET LinkOut and DisGeNET Expand functions. You can then choose between DisGeNET Expand -> Expand current net and DisGeNET Expand -> Build new net.

Expand DisGeNET networks

This is a network generated with DisGeNET using OMIM as the source, Any as the Association Type and Disease Class, and PSEN2 as the search term. In OMIM, there is only one disease (Alzheimer disease-4) annotated to the gene PSEN2. The DisGeNET Expand function can be used to query for more associated diseases (click the right mouse button on the PSEN2 node to open the context menu). This function uses the whole DisGeNET database as its data source. You can either add more gene-disease associations to the current net or build a new net. The result is an expanded network in which all gene-disease associations found for PSEN2 were added. You can see that there are 5 more diseases annotated to PSEN2.
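Conceptually, Expand looks up every association for the selected node in the DisGeNET database, which is also distributed as an SQLite file (DisGeNET.db). In the sketch below, the table and column names are hypothetical, since the actual schema is not described in this text; the demo rows are dummy values.

```python
import sqlite3

def expand_node(con, gene_id):
    """Return (diseaseId, associationType, source) rows for one gene.

    Assumed, hypothetical schema: gene_disease(geneId, diseaseId,
    associationType, source). The real DisGeNET.db schema may differ.
    """
    query = ("SELECT diseaseId, associationType, source "
             "FROM gene_disease WHERE geneId = ?")
    return con.execute(query, (gene_id,)).fetchall()

# Tiny in-memory stand-in for DisGeNET.db with one dummy association:
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE gene_disease(geneId, diseaseId, associationType, source)")
con.execute("INSERT INTO gene_disease VALUES ('1234', 'mesh:D000000', 'Marker', 'CTD')")
```

"Expand current net" would then add the returned rows as new edges (and missing disease nodes) to the open network, while "Build new net" would start a fresh network from them.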

The expansion can be repeated several times. For instance, in a next step, we can expand this network by querying for more genes associated with Alzheimer disease-4 and Alzheimer Disease. This results in a large network with 373 nodes and 893 edges. It is visible that there are many more genes associated with Alzheimer Disease.

Expand foreign networks

The same functionality to expand gene or disease nodes with more associations found in DisGeNET can be used to expand foreign networks that were not created with DisGeNET but contain gene or disease nodes. In order to use the DisGeNET Expand function on nodes that were not built within DisGeNET, the node label needs to contain a valid Entrez Gene identifier or a valid disease identifier allowed by DisGeNET. Note: DisGeNET only contains human gene-disease associations and hence can only be queried with human gene identifiers. Examples of valid identifiers:

- 5080 for the PAX6 gene
- mesh:d for Alzheimer Disease
- omim: for Corneal endothelial dystrophy 2

In the following example, we show how a network not generated with DisGeNET can be expanded with DisGeNET gene-disease associations.
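A node label can be checked against the accepted identifier forms before querying. The regular expression below is an assumption inferred from the (partially truncated) examples above: a bare Entrez Gene number, "mesh:" plus a MeSH descriptor identifier, or "omim:" plus an OMIM number.

```python
import re

# Assumed identifier patterns; the plugin's actual validation may differ.
VALID_ID = re.compile(r"^(\d+|mesh:D\d+|omim:\d+)$")

def is_valid_disgenet_id(label):
    """True if a node label looks usable for querying DisGeNET."""
    return bool(VALID_ID.match(label))
```

Labels failing this check (e.g. a plain gene symbol such as PSEN2) would need to be replaced by the Entrez Gene identifier first, which is exactly what the relabeling step in the next example does.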

First, we generate a network using the File -> Import -> Network from web services function within Cytoscape. We query the Pathway Commons database for pathways containing the human gene PSEN2. To do so, first set the Data Source to Pathway Commons Web Service Client, enter PSEN2 in the Search field and select the organism Human. Press Search. We then select a pathway of interest, for instance the NOTCH signalling pathway from the Cancer Cell Map database. Double-click the pathway to retrieve it.

Advances in Biomedical Informatics: COMBIOMED

This results in a network with 113 nodes and 272 edges. The network contains the PSEN2 gene (PSN2_HUMAN). Moreover, various node attributes are available, among them the Entrez Gene identifier (biopax.xref.entrez_gene). In order to use DisGeNET Expand, we need to ensure that the node labels contain the Entrez Gene identifier, since DisGeNET uses node labels to query the database. To do so, we first create a new visual style in the VizMapper, for example called ExpandDisGeNETStyle. Then, we set the node label to the attribute containing the Entrez Gene identifiers, here biopax.xref.entrez_gene, and use the Passthrough Mapping.

Now, we can use the DisGeNET Expand function to search for gene-disease associations containing the selected node. Using the function on the PSEN2 node, we can search for all associated diseases in DisGeNET. We can either add the found associations to this net or create a new net. In the resulting network, all diseases associated with PSEN2 are added.

For Cytoscape users: You can make use of the Nested Networks functionality to add the gene-disease association networks as nested networks to the nodes. To add the gene-disease association network as a nested network to the PSEN2 gene node, right-click on the node and select Nested Network -> Set Nested Network. Then select the gene-disease association network for PSEN2 created before using DisGeNET Expand.
