Big Data Healthcare Fei Wang Associate Professor Department of Computer Science and Engineering School of Engineering University of Connecticut
Healthcare Is in Crisis Hersh, W., Jacko, J. A., Greenes, R., Tan, J., Janies, D., Embi, P. J., & Payne, P. R. (2011). Health-care hit or miss? Nature, 470(7334), 327.
Groves, P., Kayyali, B., Knott, D. & Kuilen, S. V. (2013). The Big-Data Revolution in Healthcare. MicKinsey & Company Report. Healthcare Data Activity (Claims) and Cost Data Owner: Payers, Providers Example: Utilization of care, Cost estimates Clinical Data Owner: Providers Example: Electronic Health Records, Medical Images Pharmaceutical R&D Data Owner: Pharmaceutical Companies, Academia Example: Clinical Trials, High- Throughput-Screening Libraries Patient Behavior and Sentiment Data Owner: Consumers and Stakeholders Example: Patient Behaviors and Preferences, Sensory Data
Groves, P., Kayyali, B., Knott, D. & Kuilen, S. V. (2013). The Big-Data Revolution in Healthcare. MicKinsey & Company Report. Healthcare Data Activity (Claims) and Cost Data Owner: Payers, Providers Example: Utilization of care, Cost estimates Pharmaceutical R&D Data Owner: Pharmaceutical Companies, Academia Example: Clinical Trials, High- Throughput-Screening Libraries Clinical Data Owner: Providers Example: Electronic Health Records, Medical Images There is approximately 500 petabytes of healthcare data in existence today and that number is expected to skyrocket to more than 25,000 petabytes within the next seven years. Patient Behavior and Sentiment Data Owner: Consumers and Stakeholders Example: Patient Behaviors and Preferences, Sensory Data
Resources Knowledge Data
Electronic Health Records An Electronic Health Record (EHR) is an evolving concept defined as a systematic collection of electronic health information about individual patients or populations Jensen, Peter B., Lars J. Jensen, and SØren Brunak. "Mining electronic health records: towards better research applications and clinical care." Nature Reviews Genetics (2012).
Medical Imaging X-ray Computed Tomography (CT) Positron Emission Tomography (PET) Magnetic Resonance Imaging (MRI)
Physiology Physiology is the scientific study of function in living systems. A sub-discipline of biology, its focus is in how organisms, organ systems, organs, cells, and bio-molecules carry out the chemical or physical functions that exist in a living system
Drug: Chemical Compounds A chemical compound is a pure chemical substance consisting of two or more different chemical elements that can be separated into simpler substances by chemical reactions https://pubchem.ncbi.nlm.nih.gov/
Drug: Protein Targets The term biological target is frequently used in pharmaceutical research to describe the native protein in the body whose activity is modified by a drug resulting in a desirable therapeutic effect. In this context, the biological target is often referred to as a drug target. http://www.uniprot.org/help/uniprotkb
Gene DNA: A long molecule that looks like a twisted ladder. It is made of four types of simple units and the sequence of these units carries information, just as the sequence of letters carries information on a page. Gene expression: The process in which the information encoded in a gene is converted into a form useful for the cell. The first step is transcription, which produces a messenger RNA molecule complementary to the DNA molecule on which a gene is encoded. For protein-coding genes, the second step is translation, in which the messenger RNA is read by the ribosome to produce a protein. Gene: A segment of DNA. Genes are like sentences made of the "letters" of the nucleotide alphabet, between them genes direct the physical development and behavior of an organism. Genes are like a recipe or instruction book, providing information that an organism needs so it can build or do something - like making an eye or a leg, or repairing a wound. A Single Nucleotide Polymorphism (SNP, pronounced snip; plural snips) is a DNA sequence variation occurring commonly within a population (e.g. 1%) in which a single nucleotide A, T, C or G in the genome (or other shared sequence) differs between members of a biological species or paired chromosomes. All definitions from Wikipedia
Patient Survey
Online Social Media
Environmental Data
Research Field Genomics Clinical! Genomics Medicine Biomedical Informatics Bioinformatics Medical Informatics Informatics
Disease Network Phenotype/ Genotype Association Gene Network Drug Network Personalized Medicine Drug Repurposing Genomic Medicine Patient Network DNA Prognostication Sequencing
Personalized Medicine! PM:$the$right$pa-ent$with$the$right$drug$at$the$right$dose$at$the$ right$-me.$ In$his$2015$State$of$the$Union$address,$President$Obama$stated$his$inten-on$ to$infuse$funds$into$a$united$states$na-onal$"precision*medicine*ini-a-ve * For$pa-ents,$safer$and$more$effec-ve$treatments$ For$doctors,$reduce$wasted$-me$and$resources$with$fu-le$treatments$ For$pharms,$lower$cost$marke-ng$due$to$targeted$pa-ents,$faster$clinical$ trials,$less$focus$on$animal$trials$
Patient Similarity and Drug Similarity! Pa#ent'Similarity'analy#cs:'Find'pa#ents' who'display'similar'clinical'characteris#c' to'the'pa#ent'of'interest'! Resul#ng'insights:'medical'prognosis,'risk' stra#fica#on,'care'planning'(especially' for'pa#ents'has'mul#ple'diseases)'! Drug'Similarity'analy#cs:'Find'drugs' which'display'similar'pharmacological' characteris#c'to'the'drug'of'interest'! Resul#ng'insights:'drug'reposi#oning,' sideaeffect'predic#on,'drugadrug' interac#on'predic#on' How'to'leverage'both'pa#ent'similarity'and'drug'similarity' for'personalized'medicine?'
Information Sources Drug) Calculate)drug/pa=ent) similari=es) Chemical)Structure) Target)Proteins) Side5effect)Keywords) Pa=ent) Demographic) SNP)
Joint Factorization
Joint Factorization Nota4ons#and#symbols#of#the# methodology#! We#aim#to#analyze#the#drug2pa4ent#network#by#minimizing#the#following#objec4ve:# J = J 0 + λ 1 J 1 + λ 2 J 2! The#reconstruc4on#loss#of#observed#drug2pa4ent#associa4ons:# J 0 = Θ UΛV T F 2! The#reconstruc4on#loss#of#drug#similari4es:# K J 1 = d ω k D k UU T 2 2 F +δ 1 ω 2 k=1! The#reconstruc4on#loss#of#pa4ent#similari4es:# K J 2 = s π l S l VV T 2 2 F +δ 2 π 2 l=1 Similar#drugs/pa4ent#(latent#groups)#have#similar#behaviors# Reconstruct#integrated# drug/pa4ent#networks#! Pu@ng#everything#together,#we#obtained#the#op4miza4on#problem#to#be#resolved:# ####min U,V,Λ,Θ,ω,π J,#subject#to#U 0,#V 0,#Λ 0,#ω 0,#ω T 1=1,#π 0,#π T 1=1,#P Ω (Θ)=#P Ω (R)#
Case Study! Background:,Glioblastoma,mul4forme,(GBM),is,the,most,common, and,most,aggressive,malignant,primary,brain,tumor,in,humans.,! Raw,data:,GBM,data,from,The,Cancer,Genome,Atlas,(TCGA),website.,! How,to,define,an, effec4ve,treatment?, Median,survival,without,treatment,is,4½,months.,Median,survival, with,standardloflcare,radia4on,and,chemotherapy,is,15,months., We,define,treatments,for,pa4ents,who,live,for,more,than,15, months,(i.e.,,450,days),are,effec4ve.,! Final,data:,118,pa4ents,,41,dis4nct,drugs,and,261,known,effec4ve, pa4entldrug,associa4ons., by,matching,pa4ents,who,have,clinical,informa4on,(e.g.,,age,, race,,gender),,treatment,informa4on,,genomic,informa4on,(i.e.,, SNPs),,and,a,posi4ve,response,to,treatment,(i.e.,,live,for,more, than,15,months),,we,got,118,pa4ents.,! Task:,For,a,given,pa4ent,,predict,a,personalized,treatment,of,GBM.,
Basic Information! A"pa%ent"takes"2.2"drugs"at"the"same"%me"on"average"! A"drug"is"used"by"6.37"pa%ents"on"average"! Some"drugs"are"more"popular"than"others" Drug" #"of"pa%ents"use" Temozolomide" 62" Carmus%ne" 28" Procarbazine" 24" Dexamethasone" 21" Irinotecan" 17" Erlo%nib" 15" Lomus%ne" 11" Etoposide" 10"! Temozolomide"slows"or"stops"the"growth"of"cancer"cells,"which" causes"them"to"die."
Patient and Drug Similarities! Drug%similarity%evalua/on% Chemical%structure:%each%drug%was%represented%by%an%881< dimen/onal%pubchem%fingerprint.%tanimoto%coefficient%(tc)%of% two%fingerprints%as%chemical%structure%similarity.%tc(a,b)%=% A% % B / A% %B.%! Pa/ent%similarity%evalua/on% Demographics:%based%on%pa/ent s%risk%factors%of%gbm,%derive% demographic%vectors%based%on%gender%(male,%female),%race%(asian,% African%American,%Caucasian),%and%age%(<41,%41~50,%51~60,%61~70,% >70).% SNPs:%9507%mutated%genes%in%the%data,%each%pa/ent%was% represented%by%a%9507<dimensional%binary%profile%<%elements% encode%for%the%presence%or%absence%of%each%gene%by%1%or%0% respec/vely.% Use%Jaccard%similarity%to%measure%both%pa/ent%demographics%and% SNPs%similari/es.%
Label Propagation F* = (1 µ )( I µ W) 1 Label&Propaga,on&on&pa,ent&similarity& Y F,&Y& W& Pa,ent& similarity& pa,ents&propagate&their&known&effec,ve& treatments&to&other&pa,ents&based&on&the& pa,ent&network.& Label&Propaga,on&on&drug&similarity& F,&Y& W& Drug& Similarity& drugs&propagate&their&target&effec,ve&pa,ents&to& other&drugs&based&on&the&drug&network.& LP&method&doesn t&take&the&considera,on&of&both&drug&and&pa,ent&at&the&same&,me.&
Performance Comparison
Drug Repositioning! Drug%reposi+oning%(also&known&as&Drug%repurposing,&Drug%reprofiling,&Therapeu+c%Switching%and&Drug%re-tasking)&is&the& applica3on&of&known&drugs&and&compounds&to&new&indica3ons& (i.e.,&new&diseases). Drug%% Original%indica+on% New%indica+on% Viagra& Hypertension& Erec3le&dysfunc3on& Wellbutrin& Depression& Smoking&cessa3on& Thalidomide& An3eme3c& Mul3ple&Myeloma&! The&reposi3oned&drug&has&already&passed&a&significant&number&of& toxicity&and&other&tests,&its&safety&is&known&and&the&risk&of&failure& for&reasons&of&adverse&toxicology&are&reduced.&
Why Drug Repositioning Ashburn(TT,(Thor(KB.("Drug(reposi5oning:(iden5fying(and(developing(new(uses(for(exis5ng(drugs."(Nature(reviews(Drug(discovery(3.8((2004):(673K683.(
Computational Drug Repositioning Most%of%current%methods%only%focus%on%one%aspect%of%drug/disease% ac6vi6es.%% Few%methods%consider%both%drug%informa6on%and%disease%informa6on.% Few%methods%can%determine%interpretable%importance%of%different% informa6on%sources%during%the%predic6on.% Dudley(JT,(Deshpande(T,(BuMe(AJ.("Exploi5ng(drugKdisease(rela5onships(for(computa5onal(drug(reposi5oning."(Brief(Bioinform((2011)(12(4):303K11.(
Drug & Disease Information Drug) Calculate)drug/disease) similari?es) Chemical)Structure) Target)Proteins) Side5effect)Keywords) Disease) Phenotype/Symptom) Ontology) Disease)Gene)
Drug Similarities!! Drug!Similarity!of!Chemical!Structures!D chem.!we!calculated!the!drug!pairwise!similarity!based!on!a! chemical!structure!fingerprint!corresponding!to!the!881!chemical!substructures!defined!in!pubchem! database.!the!pairwise!chemical!similarity!between!two!drugs!d!and!d'!is!computed!as!the!tanimoto! coefficient!of!their!chemical!fingerprints:! D chem dd, ' hd ( ) hd ( ') = hd ( ) + hd ( ') hd ( ) hd ( ')! Drug!Similarity!of!Target!Proteins!D target.!we!collected!all!target!proteins!for!each!drug!from! DrugBank.!Then!we!calculated!the!pairwise!drug!target!similarity!between!drugs!d!and!d'!based!on! the!average!of!sequence!similariges!of!their!target!protein!sets:! target D d,d ' = 1 P(d) P(d ') P(d ) P(d ') SW (P i (d),p j (d '))! Drug!Similarity!of!Side!Effects!D se.!we!obtained!side!effect!keywords!from!sider,!an!online!database! containing!drug!side!effect!informagon!extracted!from!package!inserts!using!text!mining!methods.! The!pairwise!side!effect!similarity!between!two!drugs!d!and!d'!is!computed!as!the!Tanimoto! coefficient!of!their!side!effect!profiles:! i=1 j=1 D se dd, ' = ed ( ) ed ( ') ed ( ) + ed ( ') ed ( ) ed ( ') Smith(TF,(Waterman(MS,(Burks(C.(The(sta5s5cal(distribu5on(of(nucleic(acid(similari5es.(Nucleic(Acids(Res(1985;(13(2):645K665.
Disease Similarities! Disease&Similarity&of&Phenotypes&S pheno.&the&disease&phenotypic&similarity&was&constructed&by& iden:fying&similarity&between&the&mesh&terms&appearing&in&the&medical&descrip:on&("full&text"&and& "clinical&synopsis"&fields)&of&diseases&from&omim&database.&the&pairwise&disease&phenotype& similarity&between&two&diseases&s&and&s'&is&computed&as&the&cosine&of&the&angle&between&their& feature&vectors:& S pheno ss' = K K i=1 m(s) i m(s') i i=1 m2 (s) i K i=1 m 2 (s') i! Disease&Similarity&of&Disease&Ontology&S do.&the&disease&ontology&(do)&is&an&open&source&ontological& descrip:on&of&human&disease,&organized&from&a&clinical&perspec:ve&of&disease&e:ology&and&loca:on.& The&seman:c&similarity&of&two&diseases&s&and&s'&is&defined&as&the&informa:on&content&of&their&lowest& common&ancestor&by:& S do ss' = log min p x x C(s,s')! Disease&Similarity&of&Disease&Genes&S gene.&diseasekcausing&aberra:ons&in&the&normal&func:on&of&a& gene&define&that&gene&as&a&disease&gene.&we&collected&all&disease&genes&for&each&disease&from& "phenotypekgene&rela:onships"&field&from&omim&database.&then&we&calculated&the&pairwise&disease& similarity&between&diseases&s&and&s'&based&on&the&average&of&sequence&similari:es&of&their&disease& gene&sets:& S gene ss' = 1 G(s) G(s') G(s) G(s') i=1 j=1 SW (G i (s),g j (s')) We&are&also&calcula:ng&gene& similarity&based&on&go&
Joint Factorization
Joint Factorization Nota:ons#and#symbols#of#the# methodology#! We#aim#to#analyze#the#drug2disease#network#by#minimizing#the#following#objec:ve:# J = J 0 + λ 1 J 1 + λ 2 J 2! The#reconstruc:on#loss#of#observed#drug2disease#associa:ons:# J 0 = Θ UΛV T F 2! The#reconstruc:on#loss#of#drug#similari:es:# K J 1 = d ω k D k UU T 2 2 F +δ 1 ω 2 k=1! The#reconstruc:on#loss#of#disease#similari:es:# K J 2 = s π l S l VV T 2 2 F +δ 2 π 2 l=1 Similar#Drugs/diseases#(latent#groups)#have#similar#behaviors# Reconstruct#integrated# drug/disease#networks#! Pu?ng#everything#together,#we#obtained#the#op:miza:on#problem#to#be#resolved:# ####min U,V,Λ,Θ,ω,π J,#subject#to#U 0,#V 0,#Λ 0,#ω 0,#ω T 1=1,#π 0,#π T 1=1,#P Ω (Θ)=#P Ω (R)#
Data Set! Benchmark*dataset*was*extracted*from*NDF5RT,*spanning*3,250*treatment* associa@ons*between*799*drugs*and*719*diseases*! Three*799 799*matrices*were*used*to*represent*drug*similari@es*between*799* drugs*from*different*perspec@ves*! Three*719 719*matrices*were*used*to*represent*disease*similari@es*between* 719*human*diseases*from*different*perspec@ves*
Average ROC Curve
Information Source Importance
Repositioning Examples (a) Top 10 drugs predicted for AD (b) Top 10 drugs predicted for SLE Drug Prediction Score Clinical Evidence? Drug Prediction Score Clinical Evidence? Selegiline* 0.7091 Desoximetasone 0.7409 No Carbidopa 0.6924 No Azathioprine* 0.7269 Amantadine 0.6897 No Leflunomide 0.7078 Yes Procyclidine 0.6826 No Fluorometholone 0.7054 No Valproic Acid* 0.6745 Reposi'oning* candidates* Triamcinolone* 0.6862 Metformin 0.6543 Yes Beclomethasone 0.6522 No Bexarotene 0.6426 Yes Etodolac 0.6445 No Neostigmine 0.6385 No Hydroxychloroquine* 0.6374 Galantamine* 0.6348 Nelfinavir 0.6371 Yes Nilvadipine 0.6159 Yes Mercaptopurine 0.6150 No * denotes the drug is known and approved to treat the disease
Challenges
Data Acquisition
Data Characteristics Scale Dimension ality Heterogen eity Timeevolving
Data Quality Sparsity Noise Incomplete ness Irregularity
Dat Privacy
Acknowledgement 49
fei_wang@uconn.edu