Predictive Big Data Analytics: Imaging-Genetics Fundamentals, Research Challenges, & Opportunities. Outline



Ivo D. Dinov. Statistics Online Computational Resource Michigan Institute for Data Science School of Nursing University of Michigan


Predictive Big Data Analytics: Imaging-Genetics Fundamentals, Research Challenges, & Opportunities

Ivo D. Dinov
Statistics Online Computational Resource, Michigan Institute for Data Science, Health Behavior and Biological Sciences, University of Michigan
www.socr.umich.edu

Outline: Big, Deep & Dark Data; Big Data Analytics; Applications to Neurodegenerative Disease; Data Dashboarding; Compressive Big Data Analytics (CBDA)

Big, Deep & Dark Data

Big: Size + Complexity + Variability + Heterogeneity
Deep: Covering + Multi-source + Multi-dimensional
Dark: Incomplete + Incongruent + Concealed

Data: Process & Model Representation (Native Process vs. Model Representation)

Characteristics of Big Biomedical Data: size, incompleteness, expected increase, and a mixture of quantitative & qualitative estimates.

Example: analyzing observational data of 1,000's of Parkinson's disease patients based on 10,000's of signature biomarkers derived from multi-source imaging, genetics, clinical, physiologic, phenomics and demographic data elements.

Software developments, student training, service platforms and methodological advances associated with Big Data Discovery Science all present exciting opportunities for learners, educators, researchers, practitioners and policy makers.

Dinov, et al. (2014)

Outline: Big Data Analytics

The Big Data → Information → Knowledge → Action (BD → I → K → A) pipeline:

Big Data         | Information        | Knowledge              | Action
Raw observations | Processed data     | Maps, models           | Actionable decisions
Data aggregation | Data fusion        | Causal inference       | Treatment regimens
Data scrubbing   | Summary stats      | Networks, analytics    | Forecasts, predictions
Semantic mapping | Derived biomarkers | Linkages, associations | Healthcare outcomes

Dinov, GigaScience, 2016

Big Data Analytics Resourceome: http://socr.umich.edu/docs/bd2k/bigdataresourceome.html

Kryder's Law: Exponential Growth of Data

[Figure: increase of imaging resolution (1 µm - 1 cm; Cryo_Byte, Cryo_Short, Cryo_Color series) and growth of Neuroimaging (GB), Genomics_BP (GB), and Moore's Law (1000's of transistors/CPU), 1985-2019 (estimated). Dinov, et al., 2014]

Data volume increases faster than computational power.

Big Data Challenges:
o Inference: high-throughput; expeditive, adaptive
o Modeling: constraints, bio, optimization, computation
o Complexity: volume, heterogeneity
o Representation: incompleteness (space, time, measure)

Energy & Life Span of Big Data

Energy encapsulates the holistic information content included in the data (exponentially increasing with time), which may often represent a significant portion of the joint distribution of the underlying healthcare process.

Life span refers to the relevance and importance of the data (exponentially decreasing with time).

[Figure: exponential growth of Big Data (size, complexity, importance) vs. exponential value decay of static Big Data after its data-fixation time point. Dinov, JMSI, 2016]

End-to-end Pipeline Workflow Solutions
Dinov, et al., 2014, Front. Neuroinform.; Dinov, et al., Brain Imaging & Behavior, 2013

End-to-end Pipeline Workflow Solutions
Dinov, et al., 2014, Front. Neuroinform.; Dinov, et al., Brain Imaging & Behavior, 2013

Outline: Applications to Neurodegenerative Disease

Predictive Big Data Analytics in Parkinson's Disease

o A unique Big Data archive: the Parkinson's Progression Markers Initiative (PPMI).
o Defining data characteristics: large size, incongruency, incompleteness, complexity, multiplicity of scales, and heterogeneity of information-generating sources (imaging, genetics, clinical, demographic).
o Approach: introduce methods for rebalancing imbalanced cohorts; utilize a wide spectrum of classification methods to generate consistent and powerful phenotypic predictions; generate reproducible machine-learning-based classification that enables the reporting of model parameters and diagnostic forecasting based on new data.
o Results of machine-learning-based classification show significant power to predict Parkinson's disease in the PPMI subjects (consistent accuracy, sensitivity, and specificity exceeding 96%, confirmed using internal 5-fold statistical cross-validation). Clinical (e.g., Unified Parkinson's Disease Rating Scale (UPDRS) scores), demographic (e.g., age), genetics (e.g., rs34637584, chr12), and derived neuroimaging biomarker (e.g., cerebellum shape index) data all contributed to the predictive analytics and diagnostic forecasting.
o Model-free Big Data machine-learning-based classification methods (e.g., adaptive boosting, support vector machines) outperform model-based techniques (GEE, GLM, MEM) in terms of predictive precision and reliability (e.g., forecasting patient diagnosis).
o UPDRS scores play a critical role in predicting diagnosis, which is expected given the clinical definition of Parkinson's disease. Even without longitudinal UPDRS data, however, the accuracy of model-free machine-learning-based classification exceeds 80%.
o The methods, software and protocols developed here are openly shared and can be employed to study other neurodegenerative disorders (e.g., Alzheimer's, Huntington's).
Predictive Big Data Analytics: Applications to Parkinson's Disease

[Figure: varplot illustrating the critical predictive data elements (Y axis) and their impact scores (X axis) for the AdaBoost classifier, Controls vs. Patients prediction.]

ML classifier | accuracy | sensitivity | specificity | positive predictive value | negative predictive value | log odds ratio (LOR)
AdaBoost      | 0.996324 | 0.994141    | 0.998264    | 0.9980392                 | 0.9948097                 | 11.4882058
SVM           | 0.985294 | 0.994140    | 0.977431    | 0.9750958                 | 0.9946996                 | 8.902166

(in review)
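For readers who want to reproduce the flavor of this AdaBoost-vs-SVM comparison, here is a minimal, hedged sketch using scikit-learn. The PPMI data are access-restricted, so a synthetic `make_classification` matrix stands in for the real imaging/genetics/clinical features; all sizes and hyperparameters are illustrative, not the slide's exact configuration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

# Synthetic stand-in for the PPMI feature matrix:
# rows = subjects, columns = imaging/genetic/clinical biomarkers.
X, y = make_classification(n_samples=400, n_features=50, n_informative=10,
                           class_sep=2.0, random_state=0)

results = {}
for name, clf in [("AdaBoost", AdaBoostClassifier(n_estimators=100, random_state=0)),
                  ("SVM", SVC(kernel="rbf", C=1.0))]:
    # 5-fold cross-validated accuracy, mirroring the slide's validation scheme.
    scores = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    results[name] = scores.mean()
    print(f"{name}: mean 5-fold accuracy = {results[name]:.3f}")
```

On real cohorts one would also report sensitivity, specificity, PPV/NPV and LOR per fold, as the table above does.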

Predictive Big Data Analytics: Applications to Parkinson's Disease

Best Generalized Linear (Logistic Regression) model*: UPDRS summaries (Parts II and III) along with Age are predictive of PD diagnosis.

Term                                                       | Estimate | Std. Dev. | Z     | P(>|Z|)
(Intercept)                                                | 1.38E+01 | 1.18E+03  | 0.012 | 0.9907
L_caudate_ComputeArea                                      | 1.93E-03 | 1.16E-03  | 1.656 | 0.0977
R_caudate_ComputeArea                                      | 1.57E-03 | 9.61E-04  | 1.631 | 0.103
Age                                                        | 5.06E-02 | 2.19E-02  | 2.313 | 0.0207
UPDRS_Part_I_Summary_Score_Baseline                        | 4.30E-01 | 2.25E-01  | 1.909 | 0.0563
UPDRS_Part_II_Patient_Questionnaire_Summary_Score_Baseline | 7.26E-01 | 1.73E-01  | 4.186 | 2.84E-05
UPDRS_Part_III_Summary_Score_Baseline                      | 2.58E-01 | 4.03E-02  | 6.399 | 1.57E-10
COGSTATEMild                                               | 1.01E+01 | 1.18E+03  | 0.009 | 0.9932
COGSTATENormal                                             | 1.33E+01 | 1.18E+03  | 0.011 | 0.9911
FNCDTCOGYes                                                | 2.69E+00 | 1.54E+00  | 1.745 | 0.081

* The resulting Logit models had low predictive value and high Akaike Information Criterion (AIC) values, and remained largely unchanged when groups of covariates were excluded to reduce model complexity.

Predictive Big Data Analytics: Applications to Parkinson's Disease

Best Machine-Learning-Based Classification Results (average measures of 5-fold cross-validation):

Data Strata                 | Classifier | FP  | TP  | TN  | FN  | Accuracy | Sensitivity | Specificity | positive predictive value | negative predictive value | log odds ratio (LOR)
balanced control/PD         | ada        | 0.2 | 102 | 115 | 0.6 | 0.99632  | 0.99414     | 0.99826     | 0.998039                  | 0.9948097                 | 11.4882
balanced control/PD         | svm        | 2.6 | 102 | 113 | 0.6 | 0.98529  | 0.99414     | 0.97743     | 0.975096                  | 0.9946996                 | 8.902166
balanced control/PD&SWEDD   | svm        | 3.4 | 101 | 112 | 1   | 0.97978  | 0.99023     | 0.97049     | 0.967557                  | 0.9911348                 | 8.1120092
balanced control/PD&SWEDD   | ada        | 1.4 | 101 | 114 | 1.8 | 0.98529  | 0.98242     | 0.98785     | 0.986275                  | 0.9844291                 | 8.4214
unbalanced control/PD       | ada        | 0.2 | 25  | 54  | 0.8 | 0.98753  | 0.96875     | 0.99634     | 0.992                     | 0.9855072                 | 9.039789
unbalanced control/PD&SWEDD | ada        | 1.2 | 24  | 61  | 1.8 | 0.96599  | 0.92969     | 0.98083     | 0.952                     | 0.971519                  | 6.516987

(in review)
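A logistic-regression fit behind a table like the one above can be sketched with plain Newton/IRLS iterations in NumPy. The covariates (`age`, `updrs3`) and their effect sizes are simulated for illustration; this is not the PPMI model, just the fitting machinery.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 300
age = rng.normal(60, 10, n)
updrs3 = rng.normal(20, 8, n)                  # UPDRS Part III-like score
lin = -8 + 0.05 * age + 0.25 * updrs3          # simulated true log-odds of PD
y = (rng.random(n) < 1 / (1 + np.exp(-lin))).astype(float)

# Newton/IRLS iterations for the logit model: y ~ intercept + age + updrs3.
X = np.column_stack([np.ones(n), age, updrs3])
beta = np.zeros(3)
for _ in range(25):
    p = 1 / (1 + np.exp(-X @ beta))
    W = p * (1 - p)                            # IRLS weights
    beta += np.linalg.solve(X.T @ (W[:, None] * X), X.T @ (y - p))

# Log-likelihood and AIC, the criterion used for model comparison on the slide.
p = np.clip(1 / (1 + np.exp(-X @ beta)), 1e-12, 1 - 1e-12)
ll = np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))
aic = 2 * len(beta) - 2 * ll
print("coefficients:", beta, "AIC:", aic)
```

Standard errors (and hence the Z and P(>|Z|) columns) come from the inverse of the final weighted information matrix, which the loop already forms.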

Predictive Big Data Analytics: Applications to Parkinson's Disease

[Figure: frequency plot of the data elements that appear as more reliable predictors of subject diagnosis, compared across PD_vs_HC and PD+SWEDD_vs_HC cohorts, rebalanced vs. imbalanced designs, and with vs. without clinical data. (in review)]

Outline: Data Dashboarding

SOCR Big Data Dashboard
http://socr.umich.edu/html5/dashboard

o Web service combining and integrating multi-source socioeconomic and medical datasets
o Big data analytic processing
o Interface for exploratory navigation, manipulation and visualization
o Adding/removing visual queries and interactive exploration of multivariate associations
o Powerful HTML5 technology enabling mobile on-demand computing

Husain, et al., J Big Data, 2015

SOCR Dashboard (Exploratory Big Data Analytics): Data Fusion
http://socr.umich.edu/html5/dashboard

SOCR Dashboard (Exploratory Big Data Analytics): Data QC
http://socr.umich.edu/html5/dashboard

SOCR Dashboard (Exploratory Big Data Analytics): Associations

Outline: Compressive Big Data Analytics (CBDA)

Big Data Analytics & Compressive Sensing

o Currently there is no analytical foundation for systematic representation of Big Data that facilitates the handling of data complexities and, at the same time, enables joint modeling, information extraction, high-throughput and adaptive scientific inference.
o Compressive Big Data Analytics (CBDA) borrows some of the compelling ideas for representation, reconstruction, recovery and data denoising recently developed for compressive sensing.
o In compressive sensing, a sparse (incomplete) set of observations is acquired and one looks for a high-fidelity estimate of the complete dataset. Sparse data (or signals) can be described as observations with small support, i.e., small magnitude of the zero "norm" $\|x\|_0$, which counts the non-trivial elements.

Dinov, JMSI, 2016

Big Data Analytics & Compressive Sensing

o Define the nested sets of sparse signals $\Sigma_k = \{x : \|x\|_0 \le k\}$, where the data $x$, as a vector or tensor, has at most $k$ non-trivial elements. Note that if $x \in \Sigma_k$ and $k \le k'$, then $x \in \Sigma_{k'}$.
o If $\Phi = [\phi_1, \phi_2, \ldots, \phi_n]$ represents an orthonormal basis, the data may be expressed as $x = \Phi\theta$, where $\theta_i = \langle x, \phi_i \rangle$, i.e., $\theta = \Phi^T x$. Even if $x$ is not strictly sparse, its representation $\theta$ may be sparse. For each dataset, we can assess and quantify the error of approximating $x$ by an optimal estimate $\hat{x} \in \Sigma_k$ by computing $\sigma_k(x) = \min_{\hat{x} \in \Sigma_k} \|x - \hat{x}\|$.

Big Data Analytics & Compressive Sensing

o In compressive sensing, if $x \in \mathbb{R}^n$ and we have a data stream generating $m$ linear measurements, we can represent $y = Ax$, where $A$ is a dimensionality-reducing matrix ($m \ll n$), i.e., $A : \mathbb{R}^n \to \mathbb{R}^m$.
o The null space of $A$ is $N(A) = \{z : Az = 0\}$. $A$ uniquely represents all $x \in \Sigma_k$ if and only if $N(A)$ contains no vectors in $\Sigma_{2k}$.
o The spark of a matrix $A$ represents the smallest number of columns of $A$ that are linearly dependent. If $A$ is a random matrix whose entries are independent and identically distributed (continuous), then $\mathrm{spark}(A) = m + 1$ with probability 1.
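The best $k$-term approximation error $\sigma_k(x)$ defined above is easy to compute directly: keep the $k$ largest-magnitude entries and measure the norm of what remains. A minimal sketch (the example vector is made up):

```python
import numpy as np

def sigma_k(x, k):
    """Best k-term approximation error: keep the k largest |entries|, measure the rest."""
    xhat = np.zeros_like(x)
    keep = np.argsort(np.abs(x))[-k:]      # indices of the k largest magnitudes
    xhat[keep] = x[keep]
    return np.linalg.norm(x - xhat)

x = np.array([5.0, -3.0, 0.1, 0.05, 0.0, -0.02])   # compressible: 2 dominant entries
print(sigma_k(x, 2))   # small: only the tiny entries remain in the residual
print(sigma_k(x, 6))   # 0.0: keeping everything is a perfect "approximation"
```

For a compressible signal, $\sigma_k(x)$ decays quickly in $k$, which is exactly what makes sparse recovery worthwhile.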

Big Data Analytics & Compressive Sensing

o If the entries of $A$ are chosen according to a sub-Gaussian distribution, then with high probability there exists $\delta_k \in (0,1)$ such that

  $(1 - \delta_k)\|x\|_2^2 \le \|Ax\|_2^2 \le (1 + \delta_k)\|x\|_2^2$   (1)

for all $x \in \Sigma_k$ (RIP = restricted isometry property).
o When we know that the original signal is sparse, to reconstruct $x$ given the observed measurements $y$, we can solve the optimization problem: $\hat{x} = \arg\min \|x\|_0$ subject to $Ax = y$.

Big Data Analytics & Compressive Sensing

o Linear programming may be used to solve the optimization problem if we replace the zero norm by its more tractable convex approximation, the $\ell_1$ norm: $\hat{x} = \arg\min \|x\|_1$ subject to $Ax = y$.
o Given that $A$ has the above property with $\delta_{2k} < \sqrt{2} - 1$, the solution satisfies $\|\hat{x} - x\|_2 \le C\,\sigma_k(x)_1/\sqrt{k}$.
o Thus, in compressive sensing applications, if $x \in \Sigma_k$ and $A$ satisfies the RIP condition (1), we can recover any sparse signal exactly (as $\sigma_k(x) = 0$) using only $m = O(k \log(n/k))$ observations.
o Finally, if $A$ is random (e.g., chosen according to a Gaussian distribution) and $\Phi$ is an orthonormal basis, then $A\Phi$ will also have a Gaussian distribution, and if $m$ is large, $A\Phi$ will also satisfy the RIP condition (1) with high probability.
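The $\ell_1$ relaxation above is an ordinary linear program (split $x = u - v$ with $u, v \ge 0$), so a toy recovery experiment fits in a few lines. This is an illustrative sketch with an arbitrary random Gaussian sensing matrix and a synthetic 3-sparse signal, not a production solver:

```python
import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(1)
n, m, k = 40, 20, 3                        # ambient dim, measurements (m << n), sparsity
A = rng.normal(size=(m, n)) / np.sqrt(m)   # random Gaussian sensing matrix
x_true = np.zeros(n)
x_true[rng.choice(n, k, replace=False)] = 3 * rng.normal(size=k)
y = A @ x_true                             # the m linear measurements

# Basis pursuit (min ||x||_1 s.t. Ax = y) as an LP over [u; v], x = u - v, u, v >= 0.
c = np.ones(2 * n)
res = linprog(c, A_eq=np.hstack([A, -A]), b_eq=y, bounds=(0, None))
x_hat = res.x[:n] - res.x[n:]
print("recovery error:", np.linalg.norm(x_hat - x_true))   # near zero w.h.p.
```

With $m = 20$ Gaussian measurements of a 3-sparse signal in $\mathbb{R}^{40}$ we are well inside the regime $m = O(k\log(n/k))$, so exact recovery is the overwhelmingly likely outcome.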

Big Data Analytics & Compressive Sensing

[Figure: schematic of compressive sensing — the natural process, its observed proxy, the prism (dimension-reducing matrix), and the orthonormal basis.]

Compressive Big Data Analytics (CBDA)

o The foundation for Compressive Big Data Analytics (CBDA) involves:
o iteratively generating random (sub)samples from the Big Data collection;
o then using classical techniques to obtain model-based or non-parametric inference based on each sample;
o next, computing likelihood estimates (e.g., probability values quantifying effects, relations, sizes);
o and repeating the (re)sampling and inference steps many times (with or without using the results of previous iterations as priors for subsequent steps).

Dinov, J Med Stat & Info, in press.
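The CBDA loop above can be sketched on simulated data: repeatedly subsample rows (instances) and columns (data elements), run a cheap classical inference on each chunk, and tally the evidence across iterations. All names, sizes, and the crude OLS-score threshold are illustrative assumptions, not the published protocol:

```python
import numpy as np

rng = np.random.default_rng(2)
N, P = 5000, 200                           # "big" table: N instances, P data elements
X = rng.normal(size=(N, P))
beta_true = np.zeros(P)
beta_true[:5] = 1.0                        # only the first 5 elements are informative
y = X @ beta_true + rng.normal(size=N)

hits = np.zeros(P)                         # evidence tally per data element
n_iter, n_rows, n_cols = 200, 300, 20
for _ in range(n_iter):
    rows = rng.choice(N, n_rows, replace=False)   # random (sub)sample of instances
    cols = rng.choice(P, n_cols, replace=False)   # random subset of data elements
    # classical inference on the small chunk: OLS coefficients as a crude score
    coef, *_ = np.linalg.lstsq(X[np.ix_(rows, cols)], y[rows], rcond=None)
    hits[cols] += np.abs(coef) > 0.5       # flag elements with large estimated effects

# Informative elements should be flagged far more often than noise elements.
print("informative:", hits[:5].mean(), " noise:", hits[5:].mean())
```

The final tallies are exactly the kind of aggregated, resampling-based evidence that the bootstrapping step on the next slide turns into joint probabilities and error estimates.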

Compressive Big Data Analytics (CBDA)

o Finally, bootstrapping techniques may be employed to quantify joint probabilities, estimate likelihoods, predict associations, identify trends, forecast future outcomes, or assess the accuracy of findings.
o The goals of compressive sensing and compressive big data analytics are different:
o CS aims to obtain a stochastic estimate of a complete dataset using sparsely sampled incomplete observations;
o CBDA attempts to obtain a quantitative joint inference characterizing likelihoods, tendencies, prognoses, or relationships.
o However, a common objective of both problem formulations is the optimality (e.g., reliability, consistency) of their corresponding estimates.

Compressive Big Data Analytics (CBDA)

o Suppose we represent the (observed) Big Data as a large matrix $X \in \mathbb{R}^{n \times p}$, where $n$ = sample size (instances) and $p$ = number of data elements (e.g., time, space, measurements, etc.).
o To formulate the problem in an analytical framework, let's assume $L$ is a low-rank matrix representing the mean or background data features, $D$ is a (known or unknown) design or dictionary matrix, $S$ is a sparse parameter matrix with small support ($\|S\|_0 \ll np$), $E$ denotes the model error term, and $P_\Omega(\cdot)$ is a sampling operator generating incomplete data over the indexing pairs of instances and data elements $\Omega \subset \{1, 2, \ldots, n\} \times \{1, 2, \ldots, p\}$.
o In this generalized model setting, the problem formulation involves estimation of $L$ and $S$ (and $D$, if it is unknown) according to this model representation:

  $P_\Omega(X) = P_\Omega(L + DS + E)$   (2)

Compressive Big Data Analytics (CBDA)

o Having quick, reliable and efficient estimates of $L$, $S$ and $D$ would allow us to make inference, compute likelihoods (e.g., p-values), predict trends, forecast outcomes, and adapt the model to obtain revised inference using new data.
o When $D$ is known, the model in equation (2) is jointly convex in $L$ and $S$, and there exist iterative solvers based on sub-gradient recursion (e.g., the alternating direction method of multipliers).
o However, in practice, the size of Big Data sets presents significant computational problems, related to slow algorithm convergence, for estimating these components that are critical for the final study inference.

Compressive Big Data Analytics (CBDA)

o One strategy for tackling this optimization problem is to use a random Gaussian sub-sampling matrix (much as in the compressive sensing protocol) to reduce the rank of the observed data and then solve the minimization using least squares.
o This partitioning of the difficult general problem into smaller chunks has several advantages: it reduces the hardware and computational burden, enables algorithmic parallelization of the global solution, and ensures feasibility of the analytical results.
o Because of the stochastic nature of the index sampling, this approach may have desirable analytical properties such as predictable asymptotic behavior, limited error bounds, and estimate optimality and consistency characteristics.
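The Gaussian sub-sampling idea can be illustrated on a plain least-squares problem: compress the tall system with a random matrix (here called `Q`, an illustrative name) and compare the sketched solution with the full one. The dimensions are arbitrary and this is a generic sketching demo, not the slide's exact construction:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 10000, 50, 500                   # n observations compressed to q << n rows
D = rng.normal(size=(n, p))                # design/dictionary matrix
s_true = rng.normal(size=p)
x = D @ s_true + 0.01 * rng.normal(size=n) # noisy observations

Q = rng.normal(size=(q, n)) / np.sqrt(q)   # random Gaussian sub-sampling/sketch matrix
s_full, *_ = np.linalg.lstsq(D, x, rcond=None)            # full least-squares solve
s_sketch, *_ = np.linalg.lstsq(Q @ D, Q @ x, rcond=None)  # compressed solve

print("sketch vs. full solution gap:", np.linalg.norm(s_sketch - s_full))
```

The compressed solve touches a $q \times p$ system instead of an $n \times p$ one, which is the computational saving the slide is after; the random projection keeps the two solutions close with high probability.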

Compressive Big Data Analytics (CBDA)

[Figure: data structure (representation) vs. sample data (instance).]

o One can design an algorithm that searches for and keeps only the most informative data elements by requiring that the derived estimates represent optimal approximations to the data within a specific sampling index subspace $\Omega$.
o We want to investigate whether CBDA inference estimates can be shown to obey error bounds similar to the upper-bound results for point embeddings in high dimensions (e.g., the Johnson-Lindenstrauss lemma) or the restricted isometry property.
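The Johnson-Lindenstrauss-type embedding referenced above is easy to check numerically: push a point cloud through a scaled random Gaussian map and inspect the pairwise squared-distance ratios. The sizes below are arbitrary demonstration choices:

```python
import numpy as np
from itertools import combinations

rng = np.random.default_rng(4)
m, n, k = 40, 2000, 1000                   # m points in R^n, embedded into R^k
pts = rng.normal(size=(m, n))
Psi = rng.normal(size=(k, n)) / np.sqrt(k) # scaled random Gaussian embedding
proj = pts @ Psi.T

# Ratio of embedded to original squared distances, over all point pairs.
ratios = [np.sum((proj[i] - proj[j]) ** 2) / np.sum((pts[i] - pts[j]) ** 2)
          for i, j in combinations(range(m), 2)]
print("distance-ratio range:", min(ratios), max(ratios))  # both near 1
```

Shrinking $k$ widens the ratio range, in line with the $k \ge 4\ln m/(\varepsilon^2/2 - \varepsilon^3/3)$ trade-off stated on the next slide.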

Compressive Big Data Analytics (CBDA)

o The Johnson-Lindenstrauss lemma guarantees that for any $0 < \varepsilon < 1$, a set of $m$ points in $\mathbb{R}^n$ can be linearly embedded by $\Psi : \mathbb{R}^n \to \mathbb{R}^k$, for $k \ge \frac{4 \ln m}{\varepsilon^2/2 - \varepsilon^3/3}$, almost preserving their pairwise distances, i.e., $(1 - \varepsilon)\|u - v\|^2 \le \|\Psi(u) - \Psi(v)\|^2 \le (1 + \varepsilon)\|u - v\|^2$.
o The restricted isometry property ensures that if $\delta_{2k} < \sqrt{2} - 1$ and the estimate $\hat{x} = \arg\min \|x\|_1$ subject to $Ax = y$, where $A$ satisfies the RIP condition (1), then the data reconstruction is reasonable, i.e., $\|\hat{x} - x\|_2 \le C\,\sigma_k(x)_1/\sqrt{k}$.
o Can we develop iterative space-partitioning CBDA algorithms that either converge to a fixed point or generate estimates that are close to their corresponding inferential parameters?

Acknowledgments

Funding — NIH: P50 NS091856, P30 DK089503, P20 NR015331, U54 EB020406; NSF: 1416953, 0716055, 1023115. http://socr.umich.edu

Collaborators
SOCR: Alexandr Kalinin, Selvam Palanimalai, Syed Husain, Matt Leventhal, Ashwini Khare, Rami Elkest, Abhishek Chowdhury, Patrick Tan, Gary Chan, Andy Foglia, Pratyush Pati, Brian Zhang, Juana Sanchez, Dennis Pearl, Kyle Siegrist, Rob Gould, Jingshu Xu, Nellie Ponarul, Ming Tang, Asiyah Lin, Nicolas Christou
LONI/INI: Arthur Toga, Roger Woods, Jack Van Horn, Zhuowen Tu, Yonggang Shi, David Shattuck, Elizabeth Sowell, Katherine Narr, Anand Joshi, Shantanu Joshi, Paul Thompson, Luminita Vese, Stan Osher, Stefano Soatto, Seok Moon, Junning Li, Young Sung, Fabio Macciardi, Federica Torri, Carl Kesselman
UMich AD/PD Centers: Cathie Spino, Ben Hampstead, Bill Dauer