GENETIC DATA ANALYSIS

Similar documents

Biomedical Big Data and Precision Medicine

Environmental Health Science. Brian S. Schwartz, MD, MS

Differential privacy in health care analytics and medical research An interactive tutorial

Factors for success in big data science

PharmaSUG2011 Paper HS03

Smarter Research. Joseph M. Jasinski, Ph.D. Distinguished Engineer IBM Research

Workshop on Establishing a Central Resource of Data from Genome Sequencing Projects

Big Data Analytics for Healthcare

Big Data Analytics and Healthcare

Butler Memorial Hospital Community Health Needs Assessment 2013

Electronic Health Record (EHR) Data Analysis Capabilities

School of Nursing. Presented by Yvette Conley, PhD

Electronic Medical Record Mining. Prafulla Dawadi School of Electrical Engineering and Computer Science

How To Create A Health Analytics Framework

Big Data and Oncology Care Quality Improvement in the United States

Presentation by: Ahmad Alsahaf. Research collaborator at the Hydroinformatics lab - Politecnico di Milano MSc in Automation and Control Engineering

TABLE 4: STAGE 2 MEANINGFUL USE OBJECTIVES AND ASSOCIATED MEASURES SORTED BY CORE AND MENU SET

Genomics and Health Data Standards: Lessons from the Past and Present for a Genome-enabled Future

Early mortality rate (EMR) in Acute Myeloid Leukemia (AML)

Stage 1 vs. Stage 2 Comparison Table for Eligible Hospitals and CAHs Last Updated: August, 2012

Big Data for Population Health

A Multi-locus Genetic Risk Score for Abdominal Aortic Aneurysm

Logistic Regression (1/24/13)

NGS and complex genetics

Big Data Analytics in Healthcare In pursuit of the Triple Aim with Analytics. David Wiggin, Director, Industry Marketing, Teradata 20 November, 2014

Meaningful Use Criteria for Eligible Hospitals and Eligible Professionals (EPs)

Meaningful Use. Medicare and Medicaid EHR Incentive Programs

National Cancer Institute

IMPLEMENTING BIG DATA IN TODAY S HEALTH CARE PRAXIS: A CONUNDRUM TO PATIENTS, CAREGIVERS AND OTHER STAKEHOLDERS - WHAT IS THE VALUE AND WHO PAYS

Supplemental Technical Information

Principles on Health Care Reform

ITT Advanced Medical Technologies - A Programmer's Overview

Aggregate data available; release of county or case-based data requires approval by the DHMH Institutional Review Board

Acceleration for Personalized Medicine Big Data Applications

Bayesian Penalized Methods for High Dimensional Data

Meaningful Use: Registration, Attestation, Workflow Tips and Tricks

Medicare Advantage Stars: Are the Grades Fair?

Disease Prevention to Reduce New Hampshire Healthcare Claims and Costs: A Data Mining Approach

Using Large Datasets for Population-based Health Services Research

Lecture 6: Single nucleotide polymorphisms (SNPs) and Restriction Fragment Length Polymorphisms (RFLPs)

Electronic health records to study population health: opportunities and challenges

TABLE B5: STAGE 2 OBJECTIVES AND MEASURES

Big Data for Population Health and Personalised Medicine through EMR Linkages

Heritability: Twin Studies. Twin studies are often used to assess genetic effects on variation in a trait

Cloud-Based Big Data Analytics in Bioinformatics

How Can Institutions Foster OMICS Research While Protecting Patients?

Saving healthcare costs by implementing new genetic risk tests for early detection of cancer and prevention of cardiovascular diseases

Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

Likelihood of Cancer

How To Be A Health Care Worker

How To Plan Healthy People 2020

Overview of Vital Records and Public Health Informatics in CDPH

Building risk prediction models - with a focus on Genome-Wide Association Studies. Charles Kooperberg

Meaningful Use. Goals and Principles

Data mining on the EPIRARE survey data

Carolina s Journey: Turning Big Data Into Better Care. Michael Dulin, MD, PhD

International Journal of Computer Science Trends and Technology (IJCST) Volume 2 Issue 3, May-Jun 2014

Information for patients and the public and patient information about DNA / Biobanking across Europe

The National Institute of Genomic Medicine (INMEGEN) was

The Future of the Electronic Health Record. Gerry Higgins, Ph.D., Johns Hopkins

SOLUTION BRIEF. IMAT Enhances Clinical Trial Cohort Identification. imatsolutions.com

Contact Information: West Texas Health Information Technology Regional Extension Center th Street MS 6232 Lubbock, Texas

Consultation Response Medical profiling and online medicine: the ethics of 'personalised' healthcare in a consumer age Nuffield Council on Bioethics

HealthCare Partners of Nevada. Heart Failure

Uncovering Value in Healthcare Data with Cognitive Analytics. Christine Livingston, Perficient Ken Dugan, IBM

SNP Essentials The same SNP story

TRACKS GENETIC EPIDEMIOLOGY

Data Driven Approaches to Prescription Medication Outcomes Analysis Using EMR

Big Data and Graph Analytics in a Health Care Setting

Large Gene Interaction Analytics at University at Buffalo, SUNY

If you were diagnosed with cancer today, what would your chances of survival be?

How To Analyze Health Data

Meaningful Use - The Basics

Hospital Information Systems HIS

N Basic, including 100% Part B coinsurance. Basic including 100% Part B coinsurance* Basic including 100% Part B coinsurance

Research Skills for Non-Researchers: Using Electronic Health Data and Other Existing Data Resources

Delivering the power of the world s most successful genomics platform

MedInsight Healthcare Analytics Brief: Population Health Management Concepts

Wasteful spending in the U.S. health care. Strategies for Changing Members Behavior to Reduce Unnecessary Health Care Costs

Utilizing Public Data to Successfully Target Population for Prevention

Chronic Disease and Nursing:

Upstate New York adults with diagnosed type 1 and type 2 diabetes and estimated treatment costs

EMR Name/ Model. meridianemr 4.2 CCHIT 2011 certified

Secondary Uses of Data for Comparative Effectiveness Research

PCHC FACTS ABOUT HEALTH CONDITIONS AND MOOD DIFFICULTIES

Massachusetts Population

for patients, for families,

Transcription:

GENETIC DATA ANALYSIS 1

Genetic Data: Future of Personalized Healthcare To achieve personalization in Healthcare, there is a need for more advancements in the field of Genomics. The human genome is made up of DNA which consists of four different chemical building blocks (called bases and abbreviated A, T, C, and G). It contains 3 billion pairs of bases and the particular order of As, Ts, Cs, and Gs is extremely important. Size of a single human genome is about 3GB. Thanks to the Human Genome Project (1990-2003) To determine the complete sequence of the DNA bases The total cost was around $3 billion. 2

Genetic Data The whole genome sequencing data is currently being annotated and not many analytics have been applied so far since the data is relatively new. Several publicly available genome repositories. http://aws.amazon.com/1000genomes/ It costs around $5000 to get a complete genome. It is still in the research phase. Heavily used in the cancer biology. In this tutorial, we will focus on Genome-Wide Association Studies (GWAS). It is more relevant to healthcare practice. Some clinical trials have already started using GWAS. Most of the computing literature (in terms of analytics) is available for the GWAS. It is still in rudimentary stage for whole genome sequences. 3

Genome-Wide Association Studies (GWAS) Genome-wide association studies (GWAS) are used to identify common genetic factors that influence health and disease. These studies normally compare the DNA of two groups of participants: people with the disease (cases) and similar people without (controls). (One million Loci) Single nucleotide polymorphisms (SNPs) are DNA sequence variations that occur when a single nucleotide (A,T,C,or G) in the genome sequence differs between individuals. SNPs occur every 100 to 300 bases along the 3-billion-base human genome. 4

Important Computational Challenges in GWAS Epistasis Modeling GOAL: To understand complex relationship between genotype and phenotype by identifying SNP-SNP interactionss. METHODS: Exhaustive, Stochastic, and Heuristic. High-dimensional SNP (Variable) Selection GOAL: To extract the SNPs that are significantly associated with the phenotype outcome. To obtain a set of reduced number of SNPs that are statistically significant to the genotype-phenotype relationship. METHODS: Sparse Linear Methods and Random Forests. 5

Epistasis Modeling For simple Mendelian diseases, single SNPs can explain phenotype very well. The complex relationship between genotype and phenotype is inadequately described by marginal effects of individual SNPs. Increasing empirical evidence suggests that interactions among loci contribute broadly to complex traits. The difficulty in the problem of detecting SNP pair interactions is the heavy computational burden. To detect pairwise interactions from 300,000 SNPs genotyped in thousands of samples, a total of 4.5 X 10 statistical tests are needed. Since a huge number of possible combinations are tested, a large proportion of significant associations are expected to be false positives. Thus, reducing the number of false positives while retaining the significance power is another challenge. 6

Epistasis Detection Methods Exhaustive Enumerates all K-locus interactions among SNPs. Efficient implementations mostly aiming at reducing computations by eliminating unnecessary calculations. Non-Exhaustive Stochastic: randomized search. Performance lowers when the # SNPs increase. Heuristic: greedy methods that do not guarantee optimal solution. Shang, Junliang, et al. "Performance analysis of novel methods for detecting epistasis." BMC bioinformatics 12.1 (2011): 475. 7

Variable Selection in SNP (High-Dimensional) Data In Genome-wide association study (GWAS), Single Nucleotide Polymorphism (SNPs) can be considered as features (usually in the range of hundreds of thousands). Not all the features are significantly associated with the phenotype outcome. A reduced number of features that are statistically significant might help us to better understand the genotypephenotype relationship. A subset of features may produce predictive models with better accuracy. It removes the noisy, irrelevant and redundant features and finally, reduces the computational and memory usage complexity. Two popular methods are: Sparse methods Random Forests 8

Sparse Methods for SNP Data Analysis Successful identification of SNPs strongly predictive of disease promises a better understanding of the biological mechanisms underlying the disease. Scalability: The number of features is too high to be handled by traditional feature selection / ranking methods. Sparse linear methods have been used to fit the genotype data and obtain a selected set of SNPs. Minimizing the squared loss function (L) of N individuals and p variables (SNPs) is used for linear regression and is defined as 1 L( N p T 2 0, ) ( y i 0 x i ) 2 i 1 j 1 where x i p are inputs for the i th sample, y N is the N vector of outputs, β 0 is the intercept, β p is a p-vector of model weights, and λ is user penalty. j 9

Packages for Lasso methods for SNP Data Analysis R package glmnet 1.7 with logistic loss (binomial family) implemented as a Fortran library Link: http://www.jstatsoft.org/v33/i01/paper LIBLINEAR 1.8 [8] c, with l1 -penalised squared hinge loss (model 5), implemented in C++ Link: http://www.csie.ntu.edu.tw/~cjlin/liblinear/ SparSNP package http://bioinformatics.research.nicta.com.au/software/sparsnp HyperLasso, logistic regression with the double exponential (DE) prior (equivalent to lasso), implemented in C++. HyperLasso implements cyclical coordinate descent as well. Software: http://www.ebi.ac.uk/projects/bargen/ Paper: http://www.ncbi.nlm.nih.gov/pmc/articles/pmc2464715/ 10

Random Forests Feature importance measure is calculated by permuting each variable. Algorithm High-level Description Bootstrap sampling of the data. Create trees with bootstrap samples. For each tree The classification error rate is calculated using out-of-bag (oob) samples. For each variable in the tree, permute the variable values and compute the error rate. The increase in this value is a indication of the variable s importance. Aggregate the error and importance measures from all trees to determine overall error rate and Variable Importance measures. 11

RESOURCES 12

Public Resources for Genetic (SNP) Data The Wellcome Trust Case Control Consortium (WTCCC) is a group of 50 research groups across the UK which was established in 2005. Data available at http://www.wtccc.org.uk/ Seven different diseases: bipolar disorder (1868), coronary heart disease (1926), Crohn's disease (1748), hypertension (1952), rheumatoid arthritis (1860), type I diabetes (1963) or type II diabetes (1924). Around 3,000 healthy controls common for these disorders. The individuals were genotyped using Affymetrix chip and obtained approximately 500K SNPs. The database of Genotypes and Phenotypes (dbgap) maintained by National Center of Biotechnology Information (NCBT) at NIH. Data available at http://www.ncbi.nlm.nih.gov/gap 13

Structured EHR Data Repositories Dataset Link Description Texas Hospital Inpatient Discharge Framingham Health Care Data Set http://www.dshs.state.tx.us/thcic/ho spitals/inpatientpudf.shtm http://www.framinghamheartstudy.o rg/share/index.html Medicare Basic Stand Alone http://resdac.advantagelabs.com/c Claim Public Use Files ms-data/files/bsa-puf VHA Medical SAS Datasets http://www.virec.research.va.gov/m edsas/overview.htm Nationwide Inpatient Sample CA Patient Discharge Data MIMIC II Clinical Database http://www.hcupus.ahrq.gov/nisoverview.jsp http://www.oshpd.ca.gov/hid/produ cts/patdischargedata/publicdatas et/index.html http://mimic.physionet.org/database.html Patient: hospital location, admission type/source, claims, admit day, age, icd9 codes + surgical codes Genetic dataset for cardiovascular disease Inpatient, skilled nursing facility, outpatient, home health agency, hospice, carrier, durable medical equipment, prescription drug event, and chronic conditions on an aggregate level Patient care encounters primarily for Veterans: inpatient/outpatient data from VHA facilities Discharge data from 1051 hospitals in 45 states with diagnosis, procedures, status, demographics, cost, length of stay Discharge data for licensed general acute hospital in CA with demographic, diagnostic and treatment information, disposition, total charges ICU data including demographics, diagnosis, clinical measurements, lab results, interventions, notes Thanks to Prof. Joydeep Ghosh from UT Austin for providing this information 14

Publicly Available Medical Image Repositories Image database Name Moda lities No. Of patients No. Of Images Size Of Data Notes/Applications Download Link Cancer Imaging Archive Database Digital Mammog raphy database Public Lung Image Database Image CLEF Database MS Lesion Segment ation ADNI Database CT DX CR 1010 244,527 241 GB Lesion Detection and classification, Accelerated Diagnostic Image Decision, Quantitative image assessment of drug response DX 2620 9,428 211 GB Research in Development of Computer Algorithm to aid in screening CT 119 28,227 28 GB Identifying Lung Cancer by Screening Images PET CT MRI US unknown 306,549 316 GB Modality Classification, Visual Image Annotation, Scientific Multimedia Data Management MRI 41 145 36 GB Develop and Compare 3D MS Lesion Segmentation Techniques MRI PET 2851 67,871 16 GB Define the progression of Alzheimer s disease https://public.cancerimagingarchive.net/ ncia/databasketdisplay.jsf http://marathon.csee.usf.edu/mammogr aphy/database.html https://eddie.via.cornell.edu/crpf.html http://www.imageclef.org/2013/medical http://www.ia.unc.edu/msseg/download.php http://adni.loni.ucla.edu/datasamples/acscess-data/ 15

Epidemiology Data The Surveillance Epidemiology and End Results Program (SEER) at NIH. Publishes cancer incidence and survival data from population-based cancer registries covering approximately 28% of the population of the US. Collected over the past 40 years (starting from January 1973 until now). Contains a total of 7.7M cases and >350,000 cases are added each year. Collect data on patient demographics, tumor site, tumor morphology and stage at diagnosis, first course of treatment, and follow-up for vital status. Usage: Widely used for understanding disparities related to race, age, and gender. Can not be used for predictive analysis, but mostly used for studying trends. Medicare data for SEER patients is already linked. SEER Database is available at http://seer.cancer.gov/ SEER-Medicare Linked Database available at http://healthservices.cancer.gov/seermedicare/ 16

Public Health and Behavior Data Repositories Dataset Link Description Behavioral Risk Factor Surveillance System (BRFSS) Ohio Hospital Inpatient/Outpatient Data US Mortality Data Human Mortality Database http://www.cdc.gov/brfss/technical_ infodata/index.htm http://publicapps.odh.ohio.gov/pwh /PWHMain.aspx?q=021813114232 Healthcare survey data: smoking, alcohol, lifestyle (diet, exercise), major diseases (diabetes, cancer), mental illness Hospital: number of discharges, transfers, length of stay, admissions, transfers, number of patients with specific procedure codes http://www.cdc.gov/nchs/data_acce Mortality information on countylevel ss/cmf.htm http://www.mortality.org/ Birth, death, population size by country Utah Public Health Database http://ibis.health.utah.gov/query Summary statistics for mortality, charges, discharges, length of stay on a county-level basis Dartmouth Atlas of Health Care http://www.dartmouthatlas.org/tools Post discharge events, chronically /downloads.aspx ill care, surgical discharge rate Thanks to Prof. Joydeep Ghosh from UT Austin for providing this information. 17

Conclusion Healthcare is a data-rich domain. As more and more data is being collected, there will be increasing demand for efficient data analytics. As the EHR data keeps growing at a rapid pace, more research on building new scalable predictive modeling platforms is required. An effective modeling platform for healthcare analytics research must integrate techniques from various disciplines such as information extraction, statistics, data mining, and visualization. Unraveling the Big Data related complexities can provide many insights about making the right decisions at the right time for the patients. Efficiently utilizing the colossal healthcare data repositories can yield some immediate returns in terms of patient outcomes and lowering care costs. 18

Acknowledgements Funding Sources National Science foundation National Institutes of Health Susan G. Komen for the Cure Delphinus Medical Technologies IBM Research 19

Thank You Questions and Comments Feel free to email questions or suggestions to jimeng@cs.cmu.edu reddy@cs.wayne.edu 20