Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

Size: px
Start display at page:

Download "Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD"

Transcription

1 Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD Optum Labs Cambridge, MA, USA Statistical Methods and Machine Learning ISPOR International Meeting, Montreal, Canada

2 Overview Explosion in Data Availability Traditional Methods for Analyzing Observational Data Machine Learning Methods Widely used outside of health care especially in consumer retail Many methods Model development and testing approach Is more data better? Traditional focus on prediction versus estimation of treatment effects How Can We Find the Best of Both? Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 3 The Growing Availability of Data

3 Market Context Velocity Complexity Tests and Treatments (Medical, Lab, Pharmacy Claims, Standardized Costs) Health Risk Assessments Socioeconomic (Race, Income, Education, Language, ) Vital Signs Medication Orders Admissions, Discharges, Transfers Patient Health Survey (PHQ-9) Health Survey Measurement (SF-12, SF-36) Care Coaching Engagements Evidence Based Medicine (Recommended Care Pathways) Mobile Applications / Social Networking Medical Research Genomic Volume Future Variety Gartner model, adapted Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 5 Examples of Data Partnerships m Multi-Stakeholder Life Sciences/Data and Analytics PCORI CDRN PCORI PCORnet FDA Sentinel m Government Life Sciences/Payer Delivery System/Partners Payer/IT Life Sciences/PBM Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 6

4 Traditional Health Services Research and Epidemiological Methods Statistical Analysis of Observational Data Good methods for developing well-matched control groups but no magic bullets--e.g., propensity score. These methods control only for observables. Do not control for endogeneity or confounding. Johnson, M., Crown, W., Martin, B., Dormuth, C., Siebert U Good Research Practices for Comparative Effectiveness Research: Analytic Methods to Improve Causal Inference from Nonrandomized Studies of Treatment Effects Using Secondary Data Sources. Report of the ISPOR Retrospective Database Analysis Task Force Part III. Value in Health 12(8): Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 8

5 Machine Learning Methods Machine Learning Methods Many methods: Classification Trees Neural Networks Random Forests Ridge and Lasso Regression Support Vector Machines And many others Hastie T., Tibshirani R., Friedman J. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2 nd Edition. New York: Springer. Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 10

6 Basic Approach Use learning datasets to develop highly accurate classification algorithm. Apply algorithm to another dataset to predict classification. Rules should be as simple as possible while maintaining accuracy. Should be able to classify data without human intervention Should be efficient with very large datasets Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 11 Rob Schapire. Machine Learning Algorithms for Classification.

7 K-Fold Cross-Validation Randomly divide the full dataset into learning/validation datasets Randomly divide the learning/validation data into K equal subsamples (typically 5 or 10) For each subsample K, fit the data using the other K-1 subsamples Estimate the prediction error (e.g., sum of squared errors) for subsample K using the models estimated from the other K-1 subsamples Pick the model specification that generates the lowest average cross validation error Estimate the final model using the full dataset Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 13 Classification and Regression Trees Advantages Easily handle huge datasets Can include both qualitative and quantitative predictor variables Very good for missing or sparse data Small trees are easy to interpret Disadvantages Large trees are difficult to interpret Overall prediction performance tends to be poor Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 14

8 Rob Schapire. Machine Learning Algorithms for Classification. Rob Schapire. Machine Learning Algorithms for Classification.

9 Approach Pick a rule to subset data Using the rule, divide data into subsets Keep repeating until remaining subsets are almost pure (e.g, measured by entropy or gini index) Usual approach is to build a very large tree and then prune it back Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 17 Neural Networks Y 1 Y 2 Outcome Layer Y i =e z /(1+e z ) Z 1 Z 2 Z 3 Z 4 Hidden Layer Z i = f(b k X k ) X 1 X 2 X 3 Input Layer Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 18

10 Prediction Is Not the Same as Estimating Treatment Effects Some machine learning methods (e.g., Ridge and Lasso regression) use regression methods with a penalty term to adjust for the danger of overfitting. Enables application of the machine learning approach to the estimation of treatment effects. But we know that results from observational studies can be sensitive to spurious correlations and methodological approach Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 19 Things with strong trends will tend to be highly correlated (1) Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 20

11 Things With Strong Trends Will Tend to Be Highly Correlated (2) Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 21 Small Sample Sizes Can Generate Some Really Weird Findings Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 22

12 But Even With Big Data You Have To Be Careful! MI Outcome (Unmatched) MI Outcome (After Matching) HR=2.11 ( ) 111% (46%-204%) Risk Increase HR=0.69 ( ) 31% (7%-48%) Risk Reduction Cumulative Incidence Statin Initiators Statin Non-Initiators Cumulative Incidence Statin Non-Initiators Statin Initiators Months of Follow-Up Months of Follow-Up Seeger, John, Alexander Walker, Paige Williams, Gordon Saperia, Frank Sacks (2003) A Propensity Score-Matched Cohort Study of the Effect of Statins, Mainly Fluvastatin, on the Occurrence of Acute Myocardial Infarction. Am J. Cardiol 92: Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 23 Bigger Samples Don t Reduce Bias N=200 N=10,000 (,z)=0 (z,e)= Estimation Error iv ols iv ols Crown, W., Henk, H., VanNess D. Some Cautions on the Use of Instrumental Variables (IV) Estimators in Outcomes Research: How Bias in IV Estimators is Affected by Instrument Strength, Instrument Contamination, and Sample Size. Value in Health 14: , Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 24

13 EHR/Claims Linkages Can Help Reduce Missing Variable Bias Relevant information Claims alone EHR alone Linked data Clinical data and severity measures + + Retail/specialty drugs across treatment settings + + Leakage + + Patient-reported outcomes Selection biases due to payer type + + Longitudinality of patient follow-up + ++ Self-pay data + + Coding biases + + Unstructured data + + Timing of events + + Continuous coverage + ++ Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 25 Data from disparate sources can be linked and de-identified Name Address Birthdate SSN Phone, etc. Direct identifiers (EMR / Clinical) Primary hash Shared Salt Code (same for all contributors) Data is then hashed by contributors at their site. Name Address Birthdate Member ID Phone, etc. Direct identifiers (Insurer example) Primary hash Secondary hash Uses Confidential Salt De-identification Statistically de-identified views Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 26

14 Summary Rapid expansion in data (volume, velocity, and variety) Machine learning approaches focus on prediction but some can also be used to estimate treatment effects Machine learning methods offer opportunities for speed to answer but traditional challenges with observational data do not go away More data doesn t help with bias problems unless it helps with control variables through data linkage For treatment effect estimation still need to think about possible sources of bias and their implications for methodology and data used for model building Confidential property of Optum. Do not distribute or reproduce without express permission from Optum. 27 Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD Optum Labs Cambridge, MA, USA

HIPAA and Big Data Twenty Third National HIPAA Summit. March 17, 2015 Mitchell W. Granberg, Optum Chief Privacy Officer

HIPAA and Big Data Twenty Third National HIPAA Summit. March 17, 2015 Mitchell W. Granberg, Optum Chief Privacy Officer HIPAA and Big Data Twenty Third National HIPAA Summit March 17, 2015 Mitchell W. Granberg, Optum Chief Privacy Officer Overview HIPAA and Big Data Big Data Definitions Big Data and Health Care Benefits

More information

PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA

PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA IMS Symposium at ISPOR at Montreal June 2 nd, 2014 Agenda Topic Presenter Time Introduction:

More information

Data Mining. Nonlinear Classification

Data Mining. Nonlinear Classification Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15

More information

FULL COVERAGE FOR PREVENTIVE MEDICATIONS AFTER MYOCARDIAL INFARCTION IMPACT ON RACIAL AND ETHNIC DISPARITIES

FULL COVERAGE FOR PREVENTIVE MEDICATIONS AFTER MYOCARDIAL INFARCTION IMPACT ON RACIAL AND ETHNIC DISPARITIES FULL COVERAGE FOR PREVENTIVE MEDICATIONS AFTER MYOCARDIAL INFARCTION IMPACT ON RACIAL AND ETHNIC DISPARITIES Niteesh K. Choudhry, MD, PhD Harvard Medical School Division of Pharmacoepidemiology and Pharmacoeconomics

More information

Data Mining Practical Machine Learning Tools and Techniques

Data Mining Practical Machine Learning Tools and Techniques Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea

More information

Secondary Uses of Data for Comparative Effectiveness Research

Secondary Uses of Data for Comparative Effectiveness Research Secondary Uses of Data for Comparative Effectiveness Research Paul Wallace MD Director, Center for Comparative Effectiveness Research The Lewin Group Paul.Wallace@lewin.com Disclosure/Perspectives Training:

More information

Lecture 13: Validation

Lecture 13: Validation Lecture 3: Validation g Motivation g The Holdout g Re-sampling techniques g Three-way data splits Motivation g Validation techniques are motivated by two fundamental problems in pattern recognition: model

More information

Big Data Analytics for SCADA

Big Data Analytics for SCADA ENERGY Big Data Analytics for SCADA Machine Learning Models for Fault Detection and Turbine Performance Elizabeth Traiger, Ph.D., M.Sc. 14 April 2016 1 SAFER, SMARTER, GREENER Points to Convey Big Data

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 11 Sajjad Haider Fall 2013 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Marcus Wilson, PharmD. First Plenary Session

Marcus Wilson, PharmD. First Plenary Session Moderator First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? Marcus Wilson, PharmD HealthCore Wilmington, DE, USA Speakers First Plenary Session THE USE OF "BIG DATA"

More information

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013

Model selection in R featuring the lasso. Chris Franck LISA Short Course March 26, 2013 Model selection in R featuring the lasso Chris Franck LISA Short Course March 26, 2013 Goals Overview of LISA Classic data example: prostate data (Stamey et. al) Brief review of regression and model selection.

More information

Data Mining - Evaluation of Classifiers

Data Mining - Evaluation of Classifiers Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010

More information

Using 'Big Data' to Estimate Benefits and Harms of Healthcare Interventions

Using 'Big Data' to Estimate Benefits and Harms of Healthcare Interventions Using 'Big Data' to Estimate Benefits and Harms of Healthcare Interventions Experience with ICES and CNODES DAVID HENRY, PROFESSOR OF HEALTH SYSTEMS DATA, UNIVERSITY OF TORONTO, SENIOR SCIENTIST, INSTITUTE

More information

Regularized Logistic Regression for Mind Reading with Parallel Validation

Regularized Logistic Regression for Mind Reading with Parallel Validation Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland

More information

Decision Trees from large Databases: SLIQ

Decision Trees from large Databases: SLIQ Decision Trees from large Databases: SLIQ C4.5 often iterates over the training set How often? If the training set does not fit into main memory, swapping makes C4.5 unpractical! SLIQ: Sort the values

More information

Package acrm. R topics documented: February 19, 2015

Package acrm. R topics documented: February 19, 2015 Package acrm February 19, 2015 Type Package Title Convenience functions for analytical Customer Relationship Management Version 0.1.1 Date 2014-03-28 Imports dummies, randomforest, kernelfactory, ada Author

More information

How is Big Data Different? A Paradigm Shift

How is Big Data Different? A Paradigm Shift How is Big Data Different? A Paradigm Shift Jennifer Clarke, Ph.D. Associate Professor Department of Statistics Department of Food Science and Technology University of Nebraska Lincoln ASA Snake River

More information

COMMON METHODOLOGICAL ISSUES FOR CER IN BIG DATA

COMMON METHODOLOGICAL ISSUES FOR CER IN BIG DATA COMMON METHODOLOGICAL ISSUES FOR CER IN BIG DATA Harvard Medical School and Harvard School of Public Health sharon@hcp.med.harvard.edu December 2013 1 / 16 OUTLINE UNCERTAINTY AND SELECTIVE INFERENCE 1

More information

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction

New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.

More information

Leveraging electronic health records for predictive modeling of surgical complications. Grant Weller

Leveraging electronic health records for predictive modeling of surgical complications. Grant Weller Leveraging electronic health records for predictive modeling of surgical complications Grant Weller ISCB 2015 Utrecht NL August 26, 2015 Collaborators: David W. Larson, MD; Jenna Lovely, PharmD, RPh; Berton

More information

How To Make A Credit Risk Model For A Bank Account

How To Make A Credit Risk Model For A Bank Account TRANSACTIONAL DATA MINING AT LLOYDS BANKING GROUP Csaba Főző csaba.fozo@lloydsbanking.com 15 October 2015 CONTENTS Introduction 04 Random Forest Methodology 06 Transactional Data Mining Project 17 Conclusions

More information

Electronic health records to study population health: opportunities and challenges

Electronic health records to study population health: opportunities and challenges Electronic health records to study population health: opportunities and challenges Caroline A. Thompson, PhD, MPH Assistant Professor of Epidemiology San Diego State University Caroline.Thompson@mail.sdsu.edu

More information

Cross Validation. Dr. Thomas Jensen Expedia.com

Cross Validation. Dr. Thomas Jensen Expedia.com Cross Validation Dr. Thomas Jensen Expedia.com About Me PhD from ETH Used to be a statistician at Link, now Senior Business Analyst at Expedia Manage a database with 720,000 Hotels that are not on contract

More information

Data Mining Methods: Applications for Institutional Research

Data Mining Methods: Applications for Institutional Research Data Mining Methods: Applications for Institutional Research Nora Galambos, PhD Office of Institutional Research, Planning & Effectiveness Stony Brook University NEAIR Annual Conference Philadelphia 2014

More information

Data Mining Techniques Chapter 6: Decision Trees

Data Mining Techniques Chapter 6: Decision Trees Data Mining Techniques Chapter 6: Decision Trees What is a classification decision tree?.......................................... 2 Visualizing decision trees...................................................

More information

EXECUTIVE SUMMARY...1 II.

EXECUTIVE SUMMARY...1 II. EXTENDING COMPARATIVE EFFECTIVENESS RESEARCH AND MEDICAL PRODUCT SAFETY SURVEILLANCE CAPABILITY THROUGH LINKAGE OF ADMINISTRATIVE CLAIMS DATA WITH ELECTRONIC HEALTH RECORDS: A SENTINEL-PCORnet COLLABORATION

More information

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris Class #6: Non-linear classification ML4Bio 2012 February 17 th, 2012 Quaid Morris 1 Module #: Title of Module 2 Review Overview Linear separability Non-linear classification Linear Support Vector Machines

More information

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University AIRPO

More information

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully

More information

Predictive Data modeling for health care: Comparative performance study of different prediction models

Predictive Data modeling for health care: Comparative performance study of different prediction models Predictive Data modeling for health care: Comparative performance study of different prediction models Shivanand Hiremath hiremat.nitie@gmail.com National Institute of Industrial Engineering (NITIE) Vihar

More information

Understanding Diagnosis Assignment from Billing Systems Relative to Electronic Health Records for Clinical Research Cohort Identification

Understanding Diagnosis Assignment from Billing Systems Relative to Electronic Health Records for Clinical Research Cohort Identification Understanding Diagnosis Assignment from Billing Systems Relative to Electronic Health Records for Clinical Research Cohort Identification Russ Waitman Kelly Gerard Daniel W. Connolly Gregory A. Ator Division

More information

Data Mining Algorithms Part 1. Dejan Sarka

Data Mining Algorithms Part 1. Dejan Sarka Data Mining Algorithms Part 1 Dejan Sarka Join the conversation on Twitter: @DevWeek #DW2015 Instructor Bio Dejan Sarka (dsarka@solidq.com) 30 years of experience SQL Server MVP, MCT, 13 books 7+ courses

More information

L13: cross-validation

L13: cross-validation Resampling methods Cross validation Bootstrap L13: cross-validation Bias and variance estimation with the Bootstrap Three-way data partitioning CSCE 666 Pattern Analysis Ricardo Gutierrez-Osuna CSE@TAMU

More information

Model Validation Techniques

Model Validation Techniques Model Validation Techniques Kevin Mahoney, FCAS kmahoney@ travelers.com CAS RPM Seminar March 17, 2010 Uses of Statistical Models in P/C Insurance Examples of Applications Determine expected loss cost

More information

ICPSR Summer Program

ICPSR Summer Program ICPSR Summer Program Data Mining Tools for Exploring Big Data Department of Statistics Wharton School, University of Pennsylvania www-stat.wharton.upenn.edu/~stine Modern data mining combines familiar

More information

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes Knowledge Discovery and Data Mining Lecture 19 - Bagging Tom Kelsey School of Computer Science University of St Andrews http://tom.host.cs.st-andrews.ac.uk twk@st-andrews.ac.uk Tom Kelsey ID5059-19-B &

More information

Model Combination. 24 Novembre 2009

Model Combination. 24 Novembre 2009 Model Combination 24 Novembre 2009 Datamining 1 2009-2010 Plan 1 Principles of model combination 2 Resampling methods Bagging Random Forests Boosting 3 Hybrid methods Stacking Generic algorithm for mulistrategy

More information

Benchmarking of different classes of models used for credit scoring

Benchmarking of different classes of models used for credit scoring Benchmarking of different classes of models used for credit scoring We use this competition as an opportunity to compare the performance of different classes of predictive models. In particular we want

More information

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw

Healthcare data analytics. Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Healthcare data analytics Da-Wei Wang Institute of Information Science wdw@iis.sinica.edu.tw Outline Data Science Enabling technologies Grand goals Issues Google flu trend Privacy Conclusion Analytics

More information

Knowledge Discovery and Data Mining

Knowledge Discovery and Data Mining Knowledge Discovery and Data Mining Unit # 10 Sajjad Haider Fall 2012 1 Supervised Learning Process Data Collection/Preparation Data Cleaning Discretization Supervised/Unuspervised Identification of right

More information

Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University

Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University Machine Learning Methods for Causal Effects Susan Athey, Stanford University Guido Imbens, Stanford University Introduction Supervised Machine Learning v. Econometrics/Statistics Lit. on Causality Supervised

More information

Smarter Healthcare@IBM Research. Joseph M. Jasinski, Ph.D. Distinguished Engineer IBM Research

Smarter Healthcare@IBM Research. Joseph M. Jasinski, Ph.D. Distinguished Engineer IBM Research Smarter Healthcare@IBM Research Joseph M. Jasinski, Ph.D. Distinguished Engineer IBM Research Our researchers work on a wide spectrum of topics Basic Science Industry specific innovation Nanotechnology

More information

DATA MINING TECHNIQUES AND APPLICATIONS

DATA MINING TECHNIQUES AND APPLICATIONS DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,

More information

Safety & Effectiveness of Drug Therapies for Type 2 Diabetes: Are pharmacoepi studies part of the problem, or part of the solution?

Safety & Effectiveness of Drug Therapies for Type 2 Diabetes: Are pharmacoepi studies part of the problem, or part of the solution? Safety & Effectiveness of Drug Therapies for Type 2 Diabetes: Are pharmacoepi studies part of the problem, or part of the solution? IDEG Training Workshop Melbourne, Australia November 29, 2013 Jeffrey

More information

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,

More information

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set

Overview. Evaluation Connectionist and Statistical Language Processing. Test and Validation Set. Training and Test Set Overview Evaluation Connectionist and Statistical Language Processing Frank Keller keller@coli.uni-sb.de Computerlinguistik Universität des Saarlandes training set, validation set, test set holdout, stratification

More information

Research Skills for Non-Researchers: Using Electronic Health Data and Other Existing Data Resources

Research Skills for Non-Researchers: Using Electronic Health Data and Other Existing Data Resources Research Skills for Non-Researchers: Using Electronic Health Data and Other Existing Data Resources James Floyd, MD, MS Sep 17, 2015 UW Hospital Medicine Faculty Development Program Objectives Become more

More information

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) (707.004) Roman Kern. KTI, TU Graz 2015-03-05 Ensemble Methods Knowledge Discovery and Data Mining 2 (VU) (707004) Roman Kern KTI, TU Graz 2015-03-05 Roman Kern (KTI, TU Graz) Ensemble Methods 2015-03-05 1 / 38 Outline 1 Introduction 2 Classification

More information

Using multiple models: Bagging, Boosting, Ensembles, Forests

Using multiple models: Bagging, Boosting, Ensembles, Forests Using multiple models: Bagging, Boosting, Ensembles, Forests Bagging Combining predictions from multiple models Different models obtained from bootstrap samples of training data Average predictions or

More information

ID: FDA-2015-N-2048-0001:

ID: FDA-2015-N-2048-0001: October 26, 2015 Division of Dockets Management (HFA-305) Food and Drug Administration 5630 Fishers Lane Rm. 1061 Rockville, MD 20852 Submitted electronically via regulations.gov Re: Docket ID: FDA-2015-N-2048-0001:

More information

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008

Ensemble Learning Better Predictions Through Diversity. Todd Holloway ETech 2008 Ensemble Learning Better Predictions Through Diversity Todd Holloway ETech 2008 Outline Building a classifier (a tutorial example) Neighbor method Major ideas and challenges in classification Ensembles

More information

Competency 1 Describe the role of epidemiology in public health

Competency 1 Describe the role of epidemiology in public health The Northwest Center for Public Health Practice (NWCPHP) has developed competency-based epidemiology training materials for public health professionals in practice. Epidemiology is broadly accepted as

More information

IN THE CITY OF NEW YORK Decision Risk and Operations. Advanced Business Analytics Fall 2015

IN THE CITY OF NEW YORK Decision Risk and Operations. Advanced Business Analytics Fall 2015 Advanced Business Analytics Fall 2015 Course Description Business Analytics is about information turning data into action. Its value derives fundamentally from information gaps in the economic choices

More information

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms

Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Comparing the Results of Support Vector Machines with Traditional Data Mining Algorithms Scott Pion and Lutz Hamel Abstract This paper presents the results of a series of analyses performed on direct mail

More information

Car Insurance. Havránek, Pokorný, Tomášek

Car Insurance. Havránek, Pokorný, Tomášek Car Insurance Havránek, Pokorný, Tomášek Outline Data overview Horizontal approach + Decision tree/forests Vertical (column) approach + Neural networks SVM Data overview Customers Viewed policies Bought

More information

Big Data Analytics for Healthcare

Big Data Analytics for Healthcare Big Data Analytics for Healthcare Jimeng Sun Chandan K. Reddy Healthcare Analytics Department IBM TJ Watson Research Center Department of Computer Science Wayne State University 1 Healthcare Analytics

More information

Statistics for BIG data

Statistics for BIG data Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before

More information

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376 Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.

More information

Data Mining Analysis (breast-cancer data)

Data Mining Analysis (breast-cancer data) Data Mining Analysis (breast-cancer data) Jung-Ying Wang Register number: D9115007, May, 2003 Abstract In this AI term project, we compare some world renowned machine learning tools. Including WEKA data

More information

Data Mining: Overview. What is Data Mining?

Data Mining: Overview. What is Data Mining? Data Mining: Overview What is Data Mining? Recently * coined term for confluence of ideas from statistics and computer science (machine learning and database methods) applied to large databases in science,

More information

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning.

CS 2750 Machine Learning. Lecture 1. Machine Learning. http://www.cs.pitt.edu/~milos/courses/cs2750/ CS 2750 Machine Learning. Lecture Machine Learning Milos Hauskrecht milos@cs.pitt.edu 539 Sennott Square, x5 http://www.cs.pitt.edu/~milos/courses/cs75/ Administration Instructor: Milos Hauskrecht milos@cs.pitt.edu 539 Sennott

More information

Using Predictive Analytics to Reduce COPD Readmissions

Using Predictive Analytics to Reduce COPD Readmissions Using Predictive Analytics to Reduce COPD Readmissions Agenda Information about PinnacleHealth Today s Environment PinnacleHealth Case Study Questions? PinnacleHealth System Non-profit, community teaching

More information

Following are detailed competencies which are addressed to various extents in coursework, field training and the integrative project.

Following are detailed competencies which are addressed to various extents in coursework, field training and the integrative project. MPH Epidemiology Following are detailed competencies which are addressed to various extents in coursework, field training and the integrative project. Biostatistics Describe the roles biostatistics serves

More information

Biomedical Big Data and Precision Medicine

Biomedical Big Data and Precision Medicine Biomedical Big Data and Precision Medicine Jie Yang Department of Mathematics, Statistics, and Computer Science University of Illinois at Chicago October 8, 2015 1 Explosion of Biomedical Data 2 Types

More information

Predictive Modeling and Big Data

Predictive Modeling and Big Data Predictive Modeling and Presented by Eileen Burns, FSA, MAAA Milliman Agenda Current uses of predictive modeling in the life insurance industry Potential applications of 2 1 June 16, 2014 [Enter presentation

More information

Data Mining Classification: Decision Trees

Data Mining Classification: Decision Trees Data Mining Classification: Decision Trees Classification Decision Trees: what they are and how they work Hunt s (TDIDT) algorithm How to select the best split How to handle Inconsistent data Continuous

More information

Cross-validation for detecting and preventing overfitting

Cross-validation for detecting and preventing overfitting Cross-validation for detecting and preventing overfitting Note to other teachers and users of these slides. Andrew would be delighted if ou found this source material useful in giving our own lectures.

More information

Predicting Student Persistence Using Data Mining and Statistical Analysis Methods

Predicting Student Persistence Using Data Mining and Statistical Analysis Methods Predicting Student Persistence Using Data Mining and Statistical Analysis Methods Koji Fujiwara Office of Institutional Research and Effectiveness Bemidji State University & Northwest Technical College

More information

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel

Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision

More information

Machine Learning Big Data using Map Reduce

Machine Learning Big Data using Map Reduce Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories

More information

Speaker Second Plenary Session ELECTRONIC HEALTH RECORDS FOR INFORMED HEALTH CARE IN ASIA- PACIFIC: LEARNING FROM EACH OTHER.

Speaker Second Plenary Session ELECTRONIC HEALTH RECORDS FOR INFORMED HEALTH CARE IN ASIA- PACIFIC: LEARNING FROM EACH OTHER. Speaker Second Plenary Session ELECTRONIC HEALTH RECORDS FOR INFORMED HEALTH CARE IN ASIA- PACIFIC: LEARNING FROM EACH OTHER Naoto Kume, PhD Kyoto University Kyoto Prefecture, Japan ISPOR 2014.09.08 11:15am-12:45pm

More information

Data Mining Applications in Higher Education

Data Mining Applications in Higher Education Executive report Data Mining Applications in Higher Education Jing Luan, PhD Chief Planning and Research Officer, Cabrillo College Founder, Knowledge Discovery Laboratories Table of contents Introduction..............................................................2

More information

Car Insurance. Prvák, Tomi, Havri

Car Insurance. Prvák, Tomi, Havri Car Insurance Prvák, Tomi, Havri Sumo report - expectations Sumo report - reality Bc. Jan Tomášek Deeper look into data set Column approach Reminder What the hell is this competition about??? Attributes

More information

Data mining is used to develop models for the early prediction of freshmen GPA. Since

Data mining is used to develop models for the early prediction of freshmen GPA. Since 1 USING DATA MINING TO PREDICT FRESHMEN OUTCOMES Nora Galambos, PhD Senior Data Scientist Office of Institutional Research, Planning & Effectiveness Stony Brook University Abstract Data mining is used

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

Use advanced techniques for summary and visualization of complex data for exploratory analysis and presentation.

Use advanced techniques for summary and visualization of complex data for exploratory analysis and presentation. MS Biostatistics MS Biostatistics Competencies Study Development: Work collaboratively with biomedical or public health researchers and PhD biostatisticians, as necessary, to provide biostatistical expertise

More information

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification

Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Feature vs. Classifier Fusion for Predictive Data Mining a Case Study in Pesticide Classification Henrik Boström School of Humanities and Informatics University of Skövde P.O. Box 408, SE-541 28 Skövde

More information

Certified in Public Health (CPH) Exam CONTENT OUTLINE

Certified in Public Health (CPH) Exam CONTENT OUTLINE NATIONAL BOARD OF PUBLIC HEALTH EXAMINERS Certified in Public Health (CPH) Exam CONTENT OUTLINE April 2014 INTRODUCTION This document was prepared by the National Board of Public Health Examiners for the

More information

Speech and Network Marketing Model - A Review

Speech and Network Marketing Model - A Review Jastrzȩbia Góra, 16 th 20 th September 2013 APPLYING DATA MINING CLASSIFICATION TECHNIQUES TO SPEAKER IDENTIFICATION Kinga Sałapa 1,, Agata Trawińska 2 and Irena Roterman-Konieczna 1, 1 Department of Bioinformatics

More information

Automatic Resolver Group Assignment of IT Service Desk Outsourcing

Automatic Resolver Group Assignment of IT Service Desk Outsourcing Automatic Resolver Group Assignment of IT Service Desk Outsourcing in Banking Business Padej Phomasakha Na Sakolnakorn*, Phayung Meesad ** and Gareth Clayton*** Abstract This paper proposes a framework

More information

Machine Learning Capacity and Performance Analysis and R

Machine Learning Capacity and Performance Analysis and R Machine Learning and R May 3, 11 30 25 15 10 5 25 15 10 5 30 25 15 10 5 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 0 2 4 6 8 101214161822 100 80 60 40 100 80 60 40 100 80 60 40 30 25 15 10 5 25 15 10

More information

Joseph M. Juran Center for Research in Supply Chain, Operations, and Quality

Joseph M. Juran Center for Research in Supply Chain, Operations, and Quality Joseph M. Juran Center for Research in Supply Chain, Operations, and Quality Professor Kevin Linderman Academic Co-Director of Joseph M. Juran Center for Research in Supply Chain, Operations, and Quality

More information

Big Challenges of Big Data - What are the statistical tasks for the precision medicine era?

Big Challenges of Big Data - What are the statistical tasks for the precision medicine era? Big Challenges of Big Data - What are the statistical tasks for the precision medicine era? Oct 18, 2015 Yu Shyr, Ph.D. Vanderbilt Center for Quantitative Sciences Highlights Overview of the BIG data in

More information

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign

Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Role of Customer Response Models in Customer Solicitation Center s Direct Marketing Campaign Arun K Mandapaka, Amit Singh Kushwah, Dr.Goutam Chakraborty Oklahoma State University, OK, USA ABSTRACT Direct

More information

SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care

SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care PHEMI Health Systems Process Automation and Big Data Warehouse http://www.phemi.com SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care Bringing Privacy and Performance to Big

More information

Comparison of Data Mining Techniques used for Financial Data Analysis

Comparison of Data Mining Techniques used for Financial Data Analysis Comparison of Data Mining Techniques used for Financial Data Analysis Abhijit A. Sawant 1, P. M. Chawan 2 1 Student, 2 Associate Professor, Department of Computer Technology, VJTI, Mumbai, INDIA Abstract

More information

Data mining and statistical models in marketing campaigns of BT Retail

Data mining and statistical models in marketing campaigns of BT Retail Data mining and statistical models in marketing campaigns of BT Retail Francesco Vivarelli and Martyn Johnson Database Exploitation, Segmentation and Targeting group BT Retail Pp501 Holborn centre 120

More information

304 Predictive Informatics: What Is Its Place in Healthcare?

304 Predictive Informatics: What Is Its Place in Healthcare? close window ANNUAL CONFERENCE AND EXHIBITION APRIL 4-8, 2009 / CHICAGO www.himssconference.org View PowerPoint Presentation Print PowerPoint Presentation Roundtable 304 Predictive Informatics: What Is

More information

Treatment of Low Risk MDS. Overview. Myelodysplastic Syndromes (MDS)

Treatment of Low Risk MDS. Overview. Myelodysplastic Syndromes (MDS) Overview Amy Davidoff, Ph.D., M.S. Associate Professor Pharmaceutical Health Services Research Department, Peter Lamy Center on Drug Therapy and Aging University of Maryland School of Pharmacy Clinical

More information

Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0

Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Technical Approaches for Protecting Privacy in the PCORnet Distributed Research Network V1.0 Guidance Document Prepared by: PCORnet Data Privacy Task Force Submitted to the PMO Approved by the PMO Submitted

More information

SOLUTION BRIEF. SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care

SOLUTION BRIEF. SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care SOLUTION BRIEF SAP/PHEMI Big Data Warehouse and the Transformation to Value-Based Health Care Bringing Privacy and Performance to Big Data with SAP HANA and PHEMI Central Objectives Every healthcare organization

More information

Neural Networks for Sentiment Detection in Financial Text

Neural Networks for Sentiment Detection in Financial Text Neural Networks for Sentiment Detection in Financial Text Caslav Bozic* and Detlef Seese* With a rise of algorithmic trading volume in recent years, the need for automatic analysis of financial news emerged.

More information

Supervised Learning (Big Data Analytics)

Supervised Learning (Big Data Analytics) Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used

More information

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning

How to use Big Data in Industry 4.0 implementations. LAURI ILISON, PhD Head of Big Data and Machine Learning How to use Big Data in Industry 4.0 implementations LAURI ILISON, PhD Head of Big Data and Machine Learning Big Data definition? Big Data is about structured vs unstructured data Big Data is about Volume

More information

One Statistician s Perspectives on Statistics and "Big Data" Analytics

One Statistician s Perspectives on Statistics and Big Data Analytics One Statistician s Perspectives on Statistics and "Big Data" Analytics Some (Ultimately Unsurprising) Lessons Learned Prof. Stephen Vardeman IMSE & Statistics Departments Iowa State University July 2014

More information

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4. Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics

More information

FDA s Sentinel Initiative: Current Status and Future Plans. Janet Woodcock M.D. Director, CDER, FDA

FDA s Sentinel Initiative: Current Status and Future Plans. Janet Woodcock M.D. Director, CDER, FDA FDA s Sentinel Initiative: Current Status and Future Plans Janet Woodcock M.D. Director, CDER, FDA FDA Amendments Act of 2007 Section 905: Active Postmarket Risk Identification and Analysis Establish a

More information

The Data Mining Process

The Data Mining Process Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data

More information

Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data

Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data Imputation Methods to Deal with Missing Values when Data Mining Trauma Injury Data Kay I Penny Centre for Mathematics and Statistics, Napier University, Craiglockhart Campus, Edinburgh, EH14 1DJ k.penny@napier.ac.uk

More information

Weather forecast prediction: a Data Mining application

Weather forecast prediction: a Data Mining application Weather forecast prediction: a Data Mining application Ms. Ashwini Mandale, Mrs. Jadhawar B.A. Assistant professor, Dr.Daulatrao Aher College of engg,karad,ashwini.mandale@gmail.com,8407974457 Abstract

More information