Big Data Healthcare. Fei Wang Associate Professor Department of Computer Science and Engineering School of Engineering University of Connecticut



Similar documents
Basic Concepts of DNA, Proteins, Genes and Genomes

RNA & Protein Synthesis

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE E15

Translation Study Guide

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Structure and Function of DNA

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Genetics Lecture Notes Lectures 1 2

Replication Study Guide

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

The Spectrum of Biomedical Informatics and the UAB Informatics Institute

GenBank, Entrez, & FASTA

Transcription and Translation of DNA

Protein Synthesis How Genes Become Constituent Molecules

Human Genome Organization: An Update. Genome Organization: An Update

Regents Biology REGENTS REVIEW: PROTEIN SYNTHESIS

Genomes and SNPs in Malaria and Sickle Cell Anemia

Molecular Genetics. RNA, Transcription, & Protein Synthesis

From DNA to Protein

Integrating Bioinformatics, Medical Sciences and Drug Discovery

The National Institute of Genomic Medicine (INMEGEN) was

13.2 Ribosomes & Protein Synthesis

DNA, RNA, Protein synthesis, and Mutations. Chapters

Genetic information (DNA) determines structure of proteins DNA RNA proteins cell structure enzymes control cell chemistry ( metabolism )

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

BioBoot Camp Genetics

To be able to describe polypeptide synthesis including transcription and splicing

13.4 Gene Regulation and Expression

Anforderungen der Life-Science Industrie an die Hochschulen. Hans Widmer Novartis Institutes for BioMedical Research

Cancer Genomics: What Does It Mean for You?

The sequence of bases on the mrna is a code that determines the sequence of amino acids in the polypeptide being synthesized:

Ms. Campbell Protein Synthesis Practice Questions Regents L.E.

CCR Biology - Chapter 9 Practice Test - Summer 2012

ISTEP+: Biology I End-of-Course Assessment Released Items and Scoring Notes

Bioinformatics Resources at a Glance

School of Nursing. Presented by Yvette Conley, PhD

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

2. True or False? The sequence of nucleotides in the human genome is 90.9% identical from one person to the next. False (it s 99.

SNP Essentials The same SNP story

The EcoCyc Curation Process

How To Get A Masters Degree In Biomedical Informatics

Thymine = orange Adenine = dark green Guanine = purple Cytosine = yellow Uracil = brown

INTERNATIONAL CONFERENCE ON HARMONISATION OF TECHNICAL REQUIREMENTS FOR REGISTRATION OF PHARMACEUTICALS FOR HUMAN USE Q5B

Name: Date: Period: DNA Unit: DNA Webquest

PRACTICE TEST QUESTIONS

DNA and the Cell. Version 2.3. English version. ELLS European Learning Laboratory for the Life Sciences

Computational Drug Repositioning by Ranking and Integrating Multiple Data Sources

The role of big data in medicine

2. The number of different kinds of nucleotides present in any DNA molecule is A) four B) six C) two D) three

SAP HANA Enabling Genome Analysis

The Human Genome Project

Big Data Analytics and Healthcare

Recombinant DNA and Biotechnology

The Steps. 1. Transcription. 2. Transferal. 3. Translation

RNA and Protein Synthesis

BIOINF 525 Winter 2016 Foundations of Bioinformatics and Systems Biology

Bob Jesberg. Boston, MA April 3, 2014

Health Informatics for Medical Librarians. Ana D. Cleveland and Donald B. Cleveland. Table of Contents

Subject Area(s) Biology. Associated Unit Engineering Nature: DNA Visualization and Manipulation. Associated Lesson Imaging the DNA Structure

Essentials of Real Time PCR. About Sequence Detection Chemistries

RNA Structure and folding

Hacking Brain Disease for a Cure

DNA Damage and Repair

Annex to the Accreditation Certificate D-PL according to DIN EN ISO/IEC 17025:2005

Computational Biomarker Discovery in the Big Data Era: from Translational Biomedical Informatics to Systems Medicine

MUTATION, DNA REPAIR AND CANCER

The Extension of the DICOM Standard to Incorporate Omics

Genetics Module B, Anchor 3

Core Facility Genomics

From DNA to Protein. Proteins. Chapter 13. Prokaryotes and Eukaryotes. The Path From Genes to Proteins. All proteins consist of polypeptide chains

Integration of Genetic and Familial Data into. Electronic Medical Records and Healthcare Processes

Teacher Guide: Have Your DNA and Eat It Too ACTIVITY OVERVIEW.

BASIC STATISTICAL METHODS FOR GENOMIC DATA ANALYSIS

Vad är bioinformatik och varför behöver vi det i vården? a bioinformatician's perspectives

Overview. Overarching observations

IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS

Gene Switches Teacher Information

Long-Term Effects of Drug Addiction

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Regulatory Issues in Genetic Testing and Targeted Drug Development

Name Class Date. Figure Which nucleotide in Figure 13 1 indicates the nucleic acid above is RNA? a. uracil c. cytosine b. guanine d.

Biological Sequence Data Formats

Web-Based Genomic Information Integration with Gene Ontology

Healthcare Analytics. Aryya Gangopadhyay UMBC

Genetic Testing in Research & Healthcare

Hidden Markov Models in Bioinformatics. By Máthé Zoltán Kőrösi Zoltán 2006

Course Requirements for the Ph.D., M.S. and Certificate Programs

Umm AL Qura University MUTATIONS. Dr Neda M Bogari

Microarray Technology

Brochure More information from

Activity 4 Long-Term Effects of Drug Addiction

Electronic Medical Records and Genomics: Possibilities, Realities, Ethical Issues to Consider

DNA Insertions and Deletions in the Human Genome. Philipp W. Messer

Molecular Biology And Biotechnology

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future

Big Data An Opportunity or a Distraction? Signal or Noise?

12.1 The Role of DNA in Heredity

European Medicines Agency

Transcription:

Big Data Healthcare Fei Wang Associate Professor Department of Computer Science and Engineering School of Engineering University of Connecticut

Healthcare Is in Crisis Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.

Groves, P., Kayyali, B., Knott, D. & Kuilen, S. V. (2013). The Big-Data Revolution in Healthcare. MicKinsey & Company Report. Healthcare Data Activity (Claims) and Cost Data Owner: Payers, Providers Example: Utilization of care, Cost estimates Clinical Data Owner: Providers Example: Electronic Health Records, Medical Images Pharmaceutical R&D Data Owner: Pharmaceutical Companies, Academia Example: Clinical Trials, High- Throughput-Screening Libraries Patient Behavior and Sentiment Data Owner: Consumers and Stakeholders Example: Patient Behaviors and Preferences, Sensory Data

Groves, P., Kayyali, B., Knott, D. & Kuilen, S. V. (2013). The Big-Data Revolution in Healthcare. MicKinsey & Company Report. Healthcare Data Activity (Claims) and Cost Data Owner: Payers, Providers Example: Utilization of care, Cost estimates Pharmaceutical R&D Data Owner: Pharmaceutical Companies, Academia Example: Clinical Trials, High- Throughput-Screening Libraries Clinical Data Owner: Providers Example: Electronic Health Records, Medical Images There is approximately 500 petabytes of healthcare data in existence today and that number is expected to skyrocket to more than 25,000 petabytes within the next seven years. Patient Behavior and Sentiment Data Owner: Consumers and Stakeholders Example: Patient Behaviors and Preferences, Sensory Data

Resources Knowledge Data

Electronic Health Records An Electronic Health Record (EHR) is an evolving concept defined as a systematic collection of electronic health information about individual patients or populations Jensen, Peter B., Lars J. Jensen, and SØren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).

Medical Imaging X-ray Computed Tomography (CT) Positron Emission Tomography (PET) Magnetic Resonance Imaging (MRI)

Physiology Physiology is the scientific study of function in living systems. A sub-discipline of biology, its focus is in how organisms, organ systems, organs, cells, and bio-molecules carry out the chemical or physical functions that exist in a living system

Drug: Chemical Compounds A chemical compound is a pure chemical substance consisting of two or more different chemical elements that can be separated into simpler substances by chemical reactions https://pubchem.ncbi.nlm.nih.gov/

Drug: Protein Targets The term biological target is frequently used in pharmaceutical research to describe the native protein in the body whose activity is modified by a drug resulting in a desirable therapeutic effect. In this context, the biological target is often referred to as a drug target. http://www.uniprot.org/help/uniprotkb

Gene DNA: A long molecule that looks like a twisted ladder. It is made of four types of simple units and the sequence of these units carries information, just as the sequence of letters carries information on a page. Gene expression: The process in which the information encoded in a gene is converted into a form useful for the cell. The first step is transcription, which produces a messenger RNA molecule complementary to the DNA molecule on which a gene is encoded. For protein-coding genes, the second step is translation, in which the messenger RNA is read by the ribosome to produce a protein. Gene: A segment of DNA. Genes are like sentences made of the "letters" of the nucleotide alphabet, between them genes direct the physical development and behavior of an organism. Genes are like a recipe or instruction book, providing information that an organism needs so it can build or do something - like making an eye or a leg, or repairing a wound. A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly within a population (e.g. 1%) in which a single nucleotide A, T, C or G in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. All definitions from Wikipedia

Patient Survey

Online Social Media

Environmental Data

Research Field Genomics Clinical! Genomics Medicine Biomedical Informatics Bioinformatics Medical Informatics Informatics

Disease Network Phenotype/ Genotype Association Gene Network Drug Network Personalized Medicine Drug Repurposing Genomic Medicine Patient Network DNA Prognostication Sequencing

Personalized Medicine! PM:$the$right$pa-ent$with$the$right$drug$at$the$right$dose$at$the$ right$-me.$ In$his$2015$State$of$the$Union$address,$President$Obama$stated$his$inten-on$ to$infuse$funds$into$a$united$states$na-onal$"precision*medicine*ini-a-ve * For$pa-ents,$safer$and$more$effec-ve$treatments$ For$doctors,$reduce$wasted$-me$and$resources$with$fu-le$treatments$ For$pharms,$lower$cost$marke-ng$due$to$targeted$pa-ents,$faster$clinical$ trials,$less$focus$on$animal$trials$

Patient Similarity and Drug Similarity! Pa#ent'Similarity'analy#cs:'Find'pa#ents' who'display'similar'clinical'characteris#c' to'the'pa#ent'of'interest'! Resul#ng'insights:'medical'prognosis,'risk' stra#fica#on,'care'planning'(especially' for'pa#ents'has'mul#ple'diseases)'! Drug'Similarity'analy#cs:'Find'drugs' which'display'similar'pharmacological' characteris#c'to'the'drug'of'interest'! Resul#ng'insights:'drug'reposi#oning,' sideaeffect'predic#on,'drugadrug' interac#on'predic#on' How'to'leverage'both'pa#ent'similarity'and'drug'similarity' for'personalized'medicine?'

Information Sources Drug) Calculate)drug/pa=ent) similari=es) Chemical)Structure) Target)Proteins) Side5effect)Keywords) Pa=ent) Demographic) SNP)

Joint Factorization

Joint Factorization Nota4ons#and#symbols#of#the# methodology#! We#aim#to#analyze#the#drug2pa4ent#network#by#minimizing#the#following#objec4ve:# J = J 0 + λ 1 J 1 + λ 2 J 2! The#reconstruc4on#loss#of#observed#drug2pa4ent#associa4ons:# J 0 = Θ UΛV T F 2! The#reconstruc4on#loss#of#drug#similari4es:# K J 1 = d ω k D k UU T 2 2 F +δ 1 ω 2 k=1! The#reconstruc4on#loss#of#pa4ent#similari4es:# K J 2 = s π l S l VV T 2 2 F +δ 2 π 2 l=1 Similar#drugs/pa4ent#(latent#groups)#have#similar#behaviors# Reconstruct#integrated# drug/pa4ent#networks#! Pu@ng#everything#together,#we#obtained#the#op4miza4on#problem#to#be#resolved:# ####min U,V,Λ,Θ,ω,π J,#subject#to#U 0,#V 0,#Λ 0,#ω 0,#ω T 1=1,#π 0,#π T 1=1,#P Ω (Θ)=#P Ω (R)#

Case Study! Background:,Glioblastoma,mul4forme,(GBM),is,the,most,common, and,most,aggressive,malignant,primary,brain,tumor,in,humans.,! Raw,data:,GBM,data,from,The,Cancer,Genome,Atlas,(TCGA),website.,! How,to,define,an, effec4ve,treatment?, Median,survival,without,treatment,is,4½,months.,Median,survival, with,standardloflcare,radia4on,and,chemotherapy,is,15,months., We,define,treatments,for,pa4ents,who,live,for,more,than,15, months,(i.e.,,450,days),are,effec4ve.,! Final,data:,118,pa4ents,,41,dis4nct,drugs,and,261,known,effec4ve, pa4entldrug,associa4ons., by,matching,pa4ents,who,have,clinical,informa4on,(e.g.,,age,, race,,gender),,treatment,informa4on,,genomic,informa4on,(i.e.,, SNPs),,and,a,posi4ve,response,to,treatment,(i.e.,,live,for,more, than,15,months),,we,got,118,pa4ents.,! Task:,For,a,given,pa4ent,,predict,a,personalized,treatment,of,GBM.,

Basic Information! A"pa%ent"takes"2.2"drugs"at"the"same"%me"on"average"! A"drug"is"used"by"6.37"pa%ents"on"average"! Some"drugs"are"more"popular"than"others" Drug" #"of"pa%ents"use" Temozolomide" 62" Carmus%ne" 28" Procarbazine" 24" Dexamethasone" 21" Irinotecan" 17" Erlo%nib" 15" Lomus%ne" 11" Etoposide" 10"! Temozolomide"slows"or"stops"the"growth"of"cancer"cells,"which" causes"them"to"die."

Patient and Drug Similarities! Drug%similarity%evalua/on% Chemical%structure:%each%drug%was%represented%by%an%881< dimen/onal%pubchem%fingerprint.%tanimoto%coefficient%(tc)%of% two%fingerprints%as%chemical%structure%similarity.%tc(a,b)%=% A% % B / A% %B.%! Pa/ent%similarity%evalua/on% Demographics:%based%on%pa/ent s%risk%factors%of%gbm,%derive% demographic%vectors%based%on%gender%(male,%female),%race%(asian,% African%American,%Caucasian),%and%age%(<41,%41~50,%51~60,%61~70,% >70).% SNPs:%9507%mutated%genes%in%the%data,%each%pa/ent%was% represented%by%a%9507<dimensional%binary%profile%<%elements% encode%for%the%presence%or%absence%of%each%gene%by%1%or%0% respec/vely.% Use%Jaccard%similarity%to%measure%both%pa/ent%demographics%and% SNPs%similari/es.%

Label Propagation F* = (1 µ )( I µ W) 1 Label&Propaga,on&on&pa,ent&similarity& Y F,&Y& W& Pa,ent& similarity& pa,ents&propagate&their&known&effec,ve& treatments&to&other&pa,ents&based&on&the& pa,ent&network.& Label&Propaga,on&on&drug&similarity& F,&Y& W& Drug& Similarity& drugs&propagate&their&target&effec,ve&pa,ents&to& other&drugs&based&on&the&drug&network.& LP&method&doesn t&take&the&considera,on&of&both&drug&and&pa,ent&at&the&same&,me.&

Performance Comparison

Drug Repositioning! Drug%reposi+oning%(also&known&as&Drug%repurposing,&Drug%reprofiling,&Therapeu+c%Switching%and&Drug%re-tasking)&is&the& applica3on&of&known&drugs&and&compounds&to&new&indica3ons& (i.e.,&new&diseases). Drug%% Original%indica+on% New%indica+on% Viagra& Hypertension& Erec3le&dysfunc3on& Wellbutrin& Depression& Smoking&cessa3on& Thalidomide& An3eme3c& Mul3ple&Myeloma&! The&reposi3oned&drug&has&already&passed&a&significant&number&of& toxicity&and&other&tests,&its&safety&is&known&and&the&risk&of&failure& for&reasons&of&adverse&toxicology&are&reduced.&

Why Drug Repositioning Ashburn(TT,(Thor(KB.("Drug(reposi5oning:(iden5fying(and(developing(new(uses(for(exis5ng(drugs."(Nature(reviews(Drug(discovery(3.8((2004):(673K683.(

Computational Drug Repositioning Most%of%current%methods%only%focus%on%one%aspect%of%drug/disease% ac6vi6es.%% Few%methods%consider%both%drug%informa6on%and%disease%informa6on.% Few%methods%can%determine%interpretable%importance%of%different% informa6on%sources%during%the%predic6on.% Dudley(JT,(Deshpande(T,(BuMe(AJ.("Exploi5ng(drugKdisease(rela5onships(for(computa5onal(drug(reposi5oning."(Brief(Bioinform((2011)(12(4):303K11.(

Drug & Disease Information Drug) Calculate)drug/disease) similari?es) Chemical)Structure) Target)Proteins) Side5effect)Keywords) Disease) Phenotype/Symptom) Ontology) Disease)Gene)

Drug Similarities!! Drug!Similarity!of!Chemical!Structures!D chem.!we!calculated!the!drug!pairwise!similarity!based!on!a! chemical!structure!fingerprint!corresponding!to!the!881!chemical!substructures!defined!in!pubchem! database.!the!pairwise!chemical!similarity!between!two!drugs!d!and!d'!is!computed!as!the!tanimoto! coefficient!of!their!chemical!fingerprints:! D chem dd, ' hd ( ) hd ( ') = hd ( ) + hd ( ') hd ( ) hd ( ')! Drug!Similarity!of!Target!Proteins!D target.!we!collected!all!target!proteins!for!each!drug!from! DrugBank.!Then!we!calculated!the!pairwise!drug!target!similarity!between!drugs!d!and!d'!based!on! the!average!of!sequence!similariges!of!their!target!protein!sets:! target D d,d ' = 1 P(d) P(d ') P(d ) P(d ') SW (P i (d),p j (d '))! Drug!Similarity!of!Side!Effects!D se.!we!obtained!side!effect!keywords!from!sider,!an!online!database! containing!drug!side!effect!informagon!extracted!from!package!inserts!using!text!mining!methods.! The!pairwise!side!effect!similarity!between!two!drugs!d!and!d'!is!computed!as!the!Tanimoto! coefficient!of!their!side!effect!profiles:! i=1 j=1 D se dd, ' = ed ( ) ed ( ') ed ( ) + ed ( ') ed ( ) ed ( ') Smith(TF,(Waterman(MS,(Burks(C.(The(sta5s5cal(distribu5on(of(nucleic(acid(similari5es.(Nucleic(Acids(Res(1985;(13(2):645K665.

Disease Similarities! Disease&Similarity&of&Phenotypes&S pheno.&the&disease&phenotypic&similarity&was&constructed&by& iden:fying&similarity&between&the&mesh&terms&appearing&in&the&medical&descrip:on&("full&text"&and& "clinical&synopsis"&fields)&of&diseases&from&omim&database.&the&pairwise&disease&phenotype& similarity&between&two&diseases&s&and&s'&is&computed&as&the&cosine&of&the&angle&between&their& feature&vectors:& S pheno ss' = K K i=1 m(s) i m(s') i i=1 m2 (s) i K i=1 m 2 (s') i! Disease&Similarity&of&Disease&Ontology&S do.&the&disease&ontology&(do)&is&an&open&source&ontological& descrip:on&of&human&disease,&organized&from&a&clinical&perspec:ve&of&disease&e:ology&and&loca:on.& The&seman:c&similarity&of&two&diseases&s&and&s'&is&defined&as&the&informa:on&content&of&their&lowest& common&ancestor&by:& S do ss' = log min p x x C(s,s')! Disease&Similarity&of&Disease&Genes&S gene.&diseasekcausing&aberra:ons&in&the&normal&func:on&of&a& gene&define&that&gene&as&a&disease&gene.&we&collected&all&disease&genes&for&each&disease&from& "phenotypekgene&rela:onships"&field&from&omim&database.&then&we&calculated&the&pairwise&disease& similarity&between&diseases&s&and&s'&based&on&the&average&of&sequence&similari:es&of&their&disease& gene&sets:& S gene ss' = 1 G(s) G(s') G(s) G(s') i=1 j=1 SW (G i (s),g j (s')) We&are&also&calcula:ng&gene& similarity&based&on&go&

Joint Factorization

Joint Factorization Nota:ons#and#symbols#of#the# methodology#! We#aim#to#analyze#the#drug2disease#network#by#minimizing#the#following#objec:ve:# J = J 0 + λ 1 J 1 + λ 2 J 2! The#reconstruc:on#loss#of#observed#drug2disease#associa:ons:# J 0 = Θ UΛV T F 2! The#reconstruc:on#loss#of#drug#similari:es:# K J 1 = d ω k D k UU T 2 2 F +δ 1 ω 2 k=1! The#reconstruc:on#loss#of#disease#similari:es:# K J 2 = s π l S l VV T 2 2 F +δ 2 π 2 l=1 Similar#Drugs/diseases#(latent#groups)#have#similar#behaviors# Reconstruct#integrated# drug/disease#networks#! Pu?ng#everything#together,#we#obtained#the#op:miza:on#problem#to#be#resolved:# ####min U,V,Λ,Θ,ω,π J,#subject#to#U 0,#V 0,#Λ 0,#ω 0,#ω T 1=1,#π 0,#π T 1=1,#P Ω (Θ)=#P Ω (R)#

Data Set! Benchmark*dataset*was*extracted*from*NDF5RT,*spanning*3,250*treatment* associa@ons*between*799*drugs*and*719*diseases*! Three*799 799*matrices*were*used*to*represent*drug*similari@es*between*799* drugs*from*different*perspec@ves*! Three*719 719*matrices*were*used*to*represent*disease*similari@es*between* 719*human*diseases*from*different*perspec@ves*

Average ROC Curve

Information Source Importance

Repositioning Examples (a) Top 10 drugs predicted for AD (b) Top 10 drugs predicted for SLE Drug Prediction Score Clinical Evidence? Drug Prediction Score Clinical Evidence? Selegiline* 0.7091 Desoximetasone 0.7409 No Carbidopa 0.6924 No Azathioprine* 0.7269 Amantadine 0.6897 No Leflunomide 0.7078 Yes Procyclidine 0.6826 No Fluorometholone 0.7054 No Valproic Acid* 0.6745 Reposi'oning* candidates* Triamcinolone* 0.6862 Metformin 0.6543 Yes Beclomethasone 0.6522 No Bexarotene 0.6426 Yes Etodolac 0.6445 No Neostigmine 0.6385 No Hydroxychloroquine* 0.6374 Galantamine* 0.6348 Nelfinavir 0.6371 Yes Nilvadipine 0.6159 Yes Mercaptopurine 0.6150 No * denotes the drug is known and approved to treat the disease

Challenges

Data Acquisition

Data Characteristics Scale Dimension ality Heterogen eity Timeevolving

Data Quality Sparsity Noise Incomplete ness Irregularity

Dat Privacy

Acknowledgement 49

fei_wang@uconn.edu