Big Data in Drug Discovery




David J. Wild
Assistant Professor & Director, Cheminformatics Program
Indiana University School of Informatics and Computing
djwild@indiana.edu - http://djwild.info

Epochs in drug discovery
  Empirical: up until the 1960s
    754: first pharmacy opened in Baghdad
    Late 1800s: major pharmaceutical companies, mass production
    1900-1960: major discoveries (insulin, penicillin, "the pill")
  Rational: 1960s to 1990s
    Designing molecules to target protein active sites ("lock and key")
    Computational drug discovery
    Biggest success: HIV (RT, protease inhibitors)
  Big Experiment: 1990s to 2000s
    High-throughput screening
    Microarray assays
    Gene sequencing and the Human Genome Project
  Big Data: 2010s onwards
    Informatics-driven drug discovery
    Accepting that the body is amazingly complex and we don't understand it well
    Everything is connected

The metabolic pathways of a single cell David Wild, December 2009. Page 3 http://djwild.info

The inner life of the cell http://video.google.com/videoplay?docid=-2351549868099343381&hl=en#

Big Data in the public domain
There is now an incredibly rich resource of public information relating compounds, targets, genes, pathways, and diseases. Just for starters, the public domain holds information on:
  69 million compounds and 449,392 bioassays (PubChem)
  4,763 drugs (DrugBank)
  9 million protein sequences (SwissProt) and 58,000 3D structures (PDB)
  14 million human nucleotide sequences (EMBL)
  19 million life science publications, with 800,000 new each year (PubMed)
  A multitude of other sets (drugs, toxicogenomics, chemogenomics, SAR, ...)
Even more important are the relationships between these entities. For example, a chemical compound can be linked to a gene or a protein target in a multitude of ways:
  Biological assay with percent inhibition, IC50, etc.
  Crystal structure of a ligand/protein complex
  Co-occurrence in a paper abstract
  Computational experiment (docking, predictive model)
  Statistical relationship
  System association (e.g. involved in the same pathways or cellular processes)
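One way to picture the "multitude of ways" a compound can be linked to a target is as a set of typed evidence records per compound-target pair. The sketch below is illustrative only: the `Link` class, the evidence labels, and the placeholder identifiers (`compound:C1`, `target:T1`, ...) are assumptions made for this example and do not come from any of the databases named on the slide.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Link:
    """One piece of evidence connecting a compound to a target (hypothetical schema)."""
    compound: str
    target: str
    evidence: str  # e.g. "bioassay", "crystal_structure", "co_occurrence", "docking"
    source: str    # where the evidence was found


def group_evidence(links):
    """Map each (compound, target) pair to the set of evidence types that link it."""
    grouped = defaultdict(set)
    for link in links:
        grouped[(link.compound, link.target)].add(link.evidence)
    return dict(grouped)


# Placeholder records, not real data.
links = [
    Link("compound:C1", "target:T1", "bioassay", "PubChem"),
    Link("compound:C1", "target:T1", "co_occurrence", "PubMed"),
    Link("compound:C1", "target:T2", "docking", "computed"),
    Link("compound:C2", "target:T1", "crystal_structure", "PDB"),
]
evidence = group_evidence(links)
```

Pairs supported by several independent evidence types (here `compound:C1` and `target:T1`) are the ones an integrated resource makes easy to spot.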

PubChem growth since 2005
[Figure: two time-series charts covering 2005-01 to 2010-07. "PubChem Substance Size 2005-2010" (annotation: addition of ChemSpider; labelled values 2,824,265; 35,379,748; 56,774,950; 69,088,100). "PubChem Bioassays 2005-2010" (logarithmic axis; annotation: addition of ChEMBL; labelled value 434,635).]

Large amount of data and links for each compound

Proteins & Genes http://www.genome.jp/en/db_growth.html

Chem2Bio2RDF: The FaceBook of Drug Discovery

You are a big pile of data too!

Large-scale predictive modeling adds even more data
[Figure: range of ROCV values from three different classes of BioAssay data set, for the original models and for models built with additional inactive compounds ("improved").]
Chen, B. and Wild, D.J. PubChem BioAssays as a data source for predictive models, Journal of Molecular Graphics and Modelling, 2010; 28, 420-426.

Informatics-based drug discovery
Predicting new molecular targets for known drugs. Nature 462, 175-181 (12 November 2009)

Systems chemical biology and chemogenomics

Recent enabling technologies for SCB / Chemogenomics
  Cloud computing allows processing and data mining on a vast scale
  Semantic technologies and complex systems tools allow seamless integration and human-scale data mining
  Integrative cheminformatics & bioinformatics connects compounds, targets, genes, pathways, diseases and side effects
  Health informatics (PHRs and EHRs) allows integration of the molecular and patient models (QP)
Technology layers (from the slide diagram)
  Analysis: visualization, projection, data mining, hypothesis generation, network tools
  Integration: RDF, XML, triple stores, ontologies, SPARQL, graph algorithms
  Access: Web services, RPC, information extraction

ChemBioGrid.org: Web service infrastructure for cheminformatics
Dong, X., Gilbert, K.E., Guha, R., Heiland, R., Kim, J., Pierce, M.E., Fox, G.C. and Wild, D.J. Web service infrastructure for chemoinformatics, J. Chem. Inf. Model., 2007; 47(4), pp 1303-1307.

The Semantic Web meaning & relationships

Chem2Bio2RDF: RDF integration & SPARQL querying
Chen, B., Dong, X., Jiao, D., Wang, H., Zhu, Q., Ding, Y., Wild, D.J. Chem2Bio2RDF: a semantic framework for linking and data mining chemogenomic and systems chemical biology data. BMC Bioinformatics 2010, 11, 255.

Chem2Bio2RDF context

Chem2Bio2RDF Relationships

Linked Open Data Cloud (linkeddata.org)

Converting data into RDF
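As a rough illustration of what converting tabular data into RDF involves, the sketch below turns rows of a compound table into N-Triples lines using only the Python standard library. The base URI `http://example.org/`, the predicate names `vocab/name` and `vocab/hasTarget`, and the row fields are all assumptions made for this example; they are not the Chem2Bio2RDF vocabulary. The literal escaping covers only backslashes, double quotes and newlines.

```python
def literal(text):
    """Render a plain string as an N-Triples literal (minimal escaping only)."""
    escaped = text.replace("\\", "\\\\").replace('"', '\\"').replace("\n", "\\n")
    return f'"{escaped}"'


def to_ntriples(rows, base="http://example.org/"):
    """Emit two triples per row: a name literal and a link to the row's target."""
    lines = []
    for row in rows:
        subject = f"<{base}compound/{row['cid']}>"
        lines.append(f"{subject} <{base}vocab/name> {literal(row['name'])} .")
        lines.append(
            f"{subject} <{base}vocab/hasTarget> <{base}target/{row['target']}> ."
        )
    return lines


# Placeholder rows, not real records.
rows = [
    {"cid": "C1", "name": "example compound", "target": "T1"},
    {"cid": "C2", "name": 'compound "two"', "target": "T1"},
]
ntriples = to_ntriples(rows)
```

Each relational row becomes a small star of triples around one subject URI, and shared URIs (here `target/T1`) are what let separately converted data sets join up.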

Finding multi-target inhibitors of MAPK pathway with a SPARQL query
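The slide shows the query itself only as an image, so the following is a stand-in written in plain Python rather than SPARQL: it applies the same kind of graph pattern (compound inhibits target, target belongs to pathway) to an in-memory list of triples and keeps compounds that hit at least two distinct targets in the pathway. The predicate names `inhibits` and `in_pathway` and all identifiers are hypothetical.

```python
from collections import defaultdict


def multi_target_inhibitors(triples, pathway, min_targets=2):
    """Return compounds that inhibit at least min_targets distinct targets in pathway."""
    pathway_targets = {
        s for s, p, o in triples if p == "in_pathway" and o == pathway
    }
    hits = defaultdict(set)
    for s, p, o in triples:
        if p == "inhibits" and o in pathway_targets:
            hits[s].add(o)
    return {
        compound: sorted(targets)
        for compound, targets in hits.items()
        if len(targets) >= min_targets
    }


# Placeholder triples, not real assay data.
triples = [
    ("target:T1", "in_pathway", "pathway:MAPK"),
    ("target:T2", "in_pathway", "pathway:MAPK"),
    ("target:T3", "in_pathway", "pathway:OTHER"),
    ("compound:C1", "inhibits", "target:T1"),
    ("compound:C1", "inhibits", "target:T2"),
    ("compound:C2", "inhibits", "target:T1"),
    ("compound:C2", "inhibits", "target:T3"),
    ("compound:C3", "inhibits", "target:T3"),
]
result = multi_target_inhibitors(triples, "pathway:MAPK")
```

In a triple store the same pattern would be expressed as a join over two triple patterns with a grouping step; the Python version just makes the join explicit.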

Finding compounds with similar polypharmacology using SPARQL
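One simple way to make "similar polypharmacology" concrete is to compare the sets of targets each compound is linked to. The sketch below ranks compounds by Jaccard similarity of their target sets; this measure is an assumption chosen for illustration and is not stated on the slide, and the profiles are placeholders.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def most_similar(query, profiles, top_n=3):
    """Rank other compounds by target-set similarity to the query compound."""
    scores = [
        (other, jaccard(profiles[query], targets))
        for other, targets in profiles.items()
        if other != query
    ]
    # Highest similarity first; ties broken by compound identifier.
    return sorted(scores, key=lambda pair: (-pair[1], pair[0]))[:top_n]


# Placeholder target profiles, not real data.
profiles = {
    "compound:C1": {"target:T1", "target:T2", "target:T3"},
    "compound:C2": {"target:T1", "target:T2"},
    "compound:C3": {"target:T3", "target:T4"},
    "compound:C4": {"target:T5"},
}
ranked = most_similar("compound:C1", profiles)
```

The target sets themselves would come from a query against the integrated data; only the comparison step is shown here.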

Projecting queries into chemical space
  GTM / MDS projection and embedding of all PubChem using clouds
  Plotting and embedding unknown compounds with SCB property labels
  Dynamic querying and projection into chemical space

Projecting queries into chemical space
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL

Doppler Radar Plot: Kinase Specificity
Choi, J.Y., Bae, S.H., Qiu, J., Fox, G., Chen, B., Wild, D.J. Browsing Large Scale Cheminformatics Data with Dimension Reduction. Emerging Computational Methods for the Life Sciences Workshop, ACM Symposium for High Performance Distributed Computing, Jun 21-25, 2010, Chicago IL

Doppler Radar Plot Kinase Specificity

Chem2Bio2RDF Dashboard: finding paths

Pathfinder
Entities shown: NFKB1, Glucocorticoid Receptor, Triamcinolone, Dexamethasone
http://ella.slis.indiana.edu/~yuysun/flex/pathfinder.html
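Finding paths between two entities in a linked data set can be sketched as a bounded breadth-first search over an undirected graph. This is a generic illustration, not the algorithm behind the dashboard or the Pathfinder tool; the edges use placeholder identifiers rather than real drug, target or gene relationships.

```python
from collections import deque


def build_graph(edges):
    """Build an undirected adjacency map from (node, node) pairs."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)
    return graph


def find_paths(graph, start, goal, max_hops=3):
    """Return all simple paths from start to goal with at most max_hops edges.

    Breadth-first expansion means shorter paths are returned first.
    """
    paths = []
    queue = deque([[start]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            paths.append(path)
            continue
        if len(path) - 1 >= max_hops:
            continue
        for neighbour in sorted(graph.get(node, ())):
            if neighbour not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [neighbour])
    return paths


# Placeholder edges, not real relationships.
edges = [
    ("drug:A", "target:T1"),
    ("drug:B", "target:T1"),
    ("drug:A", "article:P1"),
    ("article:P1", "gene:G1"),
    ("gene:G1", "drug:B"),
    ("drug:A", "target:T2"),
]
graph = build_graph(edges)
paths = find_paths(graph, "drug:A", "drug:B")
```

The hop limit matters in practice: in a densely connected network the number of simple paths grows very quickly with path length.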

Dynamic exploration with clouds and Cytoscape
  Virtuoso runs Chem2Bio2RDF queries on the cloud
  Cytoscape plugins give access to Chem2Bio2RDF, LPG and chemical structure visualization
  Dynamic exploration in Cytoscape

Hydrocortisone and Dexamethasone links
Fig. Use case 1. Network diagram of the paths obtained between Hydrocortisone and Dexamethasone using ChemBioScape. The DrugBank interaction data contains information about every drug's targets. In this case, DB00741 and DB01234 share common targets through several different DrugBank interaction IDs.

Tolcapone and Entacapone links
Fig. Use case 2. Tolcapone and Entacapone are connected to each other through DrugBank interactions 2348 and 1962. The two drugs also appear together in PubMed articles 8119326 and 8223912 via their CIDs (compound IDs).

Isoniazid and Ethionamide: replicating paper results
Banerjee, A., Dubnau, E., Quemard, A., Balasubramanian, V., Um, K., Wilson, T., et al.: inhA, a gene encoding a target for isoniazid and ethionamide in Mycobacterium tuberculosis. Science, 263(5144), 227-230 (1994).

WENDI: exploring compound knowledge space
Doxorubicin (anthracycline antibiotic)

WENDI v1.0 - insights from the literature

WENDI v2.0 - Automated reasoning with RDF
  Simple OWL ontology for relationships
  Large RDF network expands out from the query
  RDF inference engines applied & results filtered / prioritized
Example network (from the slide diagram)
  QUERY similar_to CID 86427 active_against AID 328 contains_term Breast Cancer
  QUERY similar_to CID 8642 contained_in PubMed 12856 contains_term Breast Cancer
  QUERY predicted_inactive_against HER2 contains_term Breast Cancer
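The idea of expanding a network out from the query and inferring associations can be illustrated with a few lines of Python. This is a toy walk over triples, not the OWL/RDF inference engine WENDI uses: it follows edges outward from the query node and reports every `contains_term` edge it reaches, together with the chain of triples that supports it. The triples below are one reading of the labels on the slide diagram, and the rule and function name are assumptions made for this example.

```python
def infer_term_associations(triples, query="QUERY", max_depth=3):
    """Collect (term, supporting chain) pairs reachable from the query node.

    A chain ends as soon as it crosses a contains_term edge; chains are
    limited to max_depth triples so the walk always terminates.
    """
    out_edges = {}
    for s, p, o in triples:
        out_edges.setdefault(s, []).append((p, o))

    found = []

    def walk(node, chain):
        for predicate, obj in out_edges.get(node, []):
            step = chain + [(node, predicate, obj)]
            if predicate == "contains_term":
                found.append((obj, step))
            elif len(step) < max_depth:
                walk(obj, step)

    walk(query, [])
    return found


# Triples transcribed from the slide diagram (one possible reading of it).
triples = [
    ("QUERY", "similar_to", "CID 86427"),
    ("CID 86427", "active_against", "AID 328"),
    ("AID 328", "contains_term", "Breast Cancer"),
    ("QUERY", "similar_to", "CID 8642"),
    ("CID 8642", "contained_in", "PubMed 12856"),
    ("PubMed 12856", "contains_term", "Breast Cancer"),
    ("QUERY", "predicted_inactive_against", "HER2"),
    ("HER2", "contains_term", "Breast Cancer"),
]
associations = infer_term_associations(triples)
```

Each returned chain is the evidence trail for an inferred association, which is what makes later filtering and prioritization possible.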

WENDI OWL/RDF Network & Inferred Associations

Semantic text mining of journal articles
Jiao, D. and Wild, D.J. Extraction of CYP Chemical Interactions from Biomedical Literature Using Natural Language Processing Methods, Journal of Chemical Information and Modeling, 2009; 49(2), pp 263-269.

Chemical & Biological Literature Extraction

Validating topics by experimental relationships
  Topic 26: cell, expression, cancer, tumor, ...
  Related disease: DNA Damage, Melanoma, Glioblastoma
  Diagram node types: Drug, Target
  Legend: unproved link; link proved by c2b2r_chemogenomics

Bio-LDA III
  Entropy: in information theory, entropy is a measure of the uncertainty associated with a random variable. Here we compute the bio-term entropies over topics.
  Kullback-Leibler divergence (KL divergence): a non-symmetric measure of the difference between two probability distributions. Here we use the KL divergence as the non-symmetric distance measure for two bio-terms over topics.
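Both quantities are short to compute once each bio-term is represented as a probability distribution over topics. For a term with topic probabilities p_1..p_n, the entropy is H(p) = -sum_i p_i log p_i, and the KL divergence from p to q is D(p || q) = sum_i p_i log(p_i / q_i). The sketch below uses base-2 logarithms and a small floor `eps` on q to avoid division by zero; both choices, and the example distributions, are assumptions made for illustration rather than details of the Bio-LDA implementation.

```python
import math


def entropy(p):
    """Shannon entropy (in bits) of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)


def kl_divergence(p, q, eps=1e-12):
    """Kullback-Leibler divergence D(p || q) in bits.

    q is floored at eps so a zero in q does not cause a division error.
    """
    return sum(
        pi * math.log2(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0
    )


# Placeholder topic distributions for two hypothetical bio-terms.
term_a = [0.5, 0.5]
term_b = [0.25, 0.75]

spread_evenly = entropy([0.25, 0.25, 0.25, 0.25])  # 2.0 bits for four equal topics
a_to_b = kl_divergence(term_a, term_b)
b_to_a = kl_divergence(term_b, term_a)
```

A term spread evenly over topics has high entropy and is less specific; `a_to_b` and `b_to_a` differ, which is the non-symmetry the slide refers to.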

Combining path finding and Bio-LDA
  Detect semantic association: path-finding algorithm over millions of RDF triples from Chem2Bio2RDF
  Assess semantic association: Bio-LDA model, entropy and KL divergence
  Additional knowledge base: 50, 100 and 200 topics built from 336,899 recent MEDLINE abstracts, which contain 13,338 distinct bio-terms

Summary
  Drug discovery is entering a new era that is arguably centered on informatics analysis of the vast amount of biological and chemical data now being produced, and which looks at the effect of drugs on biological systems as a whole. This new approach underlies the new fields of systems chemical biology and chemogenomics.
  Analyzing this data, and particularly the relationships between compounds, drugs, proteins, genes, diseases, pathways and people, promises to provide important understanding of the nature of disease and treatment.
  The Semantic Web provides an effective framework for logically managing the data, and cloud computing provides a physical framework for computation and searching.
  Early-stage methods developed at Indiana allow integrated access to this data, path finding between any two points, visualization in chemical space and with network tools, and advanced handling of the scholarly literature.
  Critical next steps include ranking and intelligent filtering of paths and relationships to provide aggregate evidence-based approaches, and integration of NGS and patient data.

Cheminformatics group at Indiana University http://djwild.info