Targeted Learning with Big Data
|
|
|
- Warren Norton
- 10 years ago
- Views:
Transcription
1 Targeted Learning with Big Data Mark van der Laan UC Berkeley Center for Philosophy and History of Science Revisiting the Foundations of Statistics in the Era of Big Data: Scaling Up to Meet the Challenge February 20, 2014
2 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
3 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
4 Foundations of the statistical estimation problem Observed data: Realizations of random variables with a probability distribution. Statistical model: Set of possible distributions for the data-generating distribution, defined by actual knowledge about the data. e.g. in an RCT, we know the probability of each subject receiving treatment. Statistical target parameter: Function of the data-generating distribution that we wish to learn from the data. Estimator: An a priori-specified algorithm that takes the observed data and returns an estimate of the target parameter. Benchmarked by a dissimilarity-measure (e.g., MSE) w.r.t target parameter. Inference: Establish limit distribution and corresponding statistical inference.
5 Causal inference Non-testable assumptions in addition to the assumptions defining the statistical model. (e.g. the no unmeasured confounders assumption). Defines causal quantity and establishes identifiability under these assumptions. This process generates interesting statistical target parameters. Allows for causal interpretation of statistical parameter/estimand. Even if we don t believe the non-testable causal assumptions, the statistical estimation problem is still the same, and estimands still have valid statistical interpretations.
6 Targeted learning Define valid (and thus LARGE) statistical semi parametric models and interesting target parameters. Exactly deals with statistical challenges of high dimensional and large data sets (Big Data). Avoid reliance on human art and nonrealistic (e.g., parametric) models Plug-in estimator based on targeted fit of the (relevant part of) data-generating distribution to the parameter of interest Semiparametric efficient and robust Statistical inference Has been applied to: static or dynamic treatments, direct and indirect effects, parameters of MSMs, variable importance analysis in genomics, longitudinal/repeated measures data with time-dependent confounding, censoring/missingness, case-control studies, RCTs, networks.
7 Targeted Learning Book Springer Series in Statistics van der laan & Rose targetedlearningbook.com
8 First Chapter by R.J.C.M. Starmans Models, Inference, and Truth provides historical philosophical perspective on Targeted Learning. Discusses the erosion of the notion of model and truth throughout history and the resulting lack of unified approach in statistics. It stresses the importance of a reconciliation between machine learning and statistical inference, as provided by Targeted Learning.
9 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
10 Two stage methodology Super learning (SL) van der Laan et al. (2007),Polley et al. (2012),Polley and van der Laan (2012) Uses a library of candidate estimators (e.g. multiple parametric models, machine learning algorithms like neural networks, RandomForest, etc.) Builds data-adaptive weighted combination of estimators using cross validation Targeted maximum likelihood estimation (TMLE) van der Laan and Rubin (2006) Updates initial estimate, often a Super Learner, to remove bias for the parameter of interest Calculates final parameter from updated fit of the data-generating distribution
11 Super learning No need to chose a priori a particular parametric model or machine learning algorithm for a particular problem Allows one to combine many data-adaptive estimators into one improved estimator. Grounded by oracle results for loss-function based cross-validation (Van Der Laan and Dudoit (2003),van der Vaart et al. (2006)). Loss function needs to be bounded. Performs asymptotically as well as best (oracle) weighted combination, or achieves parametric rate of convergence.
12 Super learning Figure: Relative Cross-Validated Mean Squared Error (compared to main terms least squares regression)
13 Super learning
14 TMLE algorithm
15 TMLE algorithm: Formal Template
16 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
17 General Longitudinal Data Structure We observe n i.i.d. copies of a longitudinal data structure O = (L(0), A(0),..., L(K), A(K), Y = L(K + 1)), where A(t) denotes a discrete valued intervention node, L(t) is an intermediate covariate realized after A(t 1) and before A(t), t = 0,..., K, and Y is a final outcome of interest. For example, A(t) = (A 1 (t), A 2 (t)) could be a vector of two binary indicators of censoring and treatment, respectively.
18 Likelihood and Statistical Model The probability distribution P 0 of O can be factorized according to the time-ordering as K+1 K P 0 (O) = P 0 (L(t) Pa(L(t))) P 0 (A(t) Pa(A(t))) t=0 K+1 t=0 Q 0 g 0, Q 0,L(t) (O) t=0 K g 0,A(t) (O) t=0 where Pa(L(t)) ( L(t 1), Ā(t 1)) and Pa(A(t)) ( L(t), Ā(t 1)) denote the parents of L(t) and A(t) in the time-ordered sequence, respectively. The g 0 -factor represents the intervention mechanism: e..g, treatment and right-censoring mechanism. Statistical Model: We make no assumptions on Q 0, but could
19 Statistical Target Parameter: G-computation Formula for Post-Intervention Distribution Let P d (l) = K+1 t=0 Q d L(t) ( l(t)), (1) where Q d L(t) ( l(t)) = Q L(t) (l(t) l(t 1), Ā(t 1) = d(t 1)). Let L d = (L(0), L d (1),..., Y d = L d (K + 1)) denote the random variable with probability distribution P d. This is the so called G-computation formula for the post-intervention distribution corresponding with the dynamic intervention d.
20 Example: When to switch a failing drug regimen in HIV-infected patients Observed data on unit O = (L(0), A(0), L(1), A(1),..., L(K), A(K), A 2 (K)Y ), where L(0) is baseline history, A(t) = (A 1 (t), A 2 (t)), A 1 (t) is indicator of switching drug regimen, A 2 (t) is indicator of being right-censored, t = 0,..., K, and Y is indicator of observing death by time K + 1. Define interventions nodes A(0),..., A(K) and interventions dynamic rules d θ that switch when CD4-count drops below θ, and enforces no-censoring. Our target parameter is defined as projection of (E(Y d θ(t)) : t, d) onto a working model m β (θ) with parameter β.
21 A Sequential Regression G-computation Formula (Bang, Robins, 2005) By the iterative conditional expectation rule (tower rule), we have E P d Y d = E... E(E(Y d L d (K)) L d (K 1))... L(0)). In addition, the conditional expectation, given L d (K) is equivalent with conditioning on L(K), Ā(K 1) = d(k 1).
22 In this manner, one can represent E P d Y d as an iterative conditional expectation, first take conditional expectation, given L d (K) (equivalent with L(K), Ā(K 1)), then take the conditional expectation, given L d (K 1) (equivalent with L(K 1), Ā(K 2)), and so on, until the conditional expectation given L(0), and finally take the mean over L(0). We developed a targeted plug-in estimator/tmle of general summary measures of dose-response curves (EY d : d D) (Petersen et al., 2013, van der Laan, Gruber 2012).
23 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
24 Variable Importance: Problem Description (Diaz, Hubbard) Around 800 patients that entered the emergency room with severe trauma About 80 physiological and clinical variables were measured at 0, 6, 12, 24, 48, and 72 hours after admission Objective is predicting the most likely medical outcome of a patient (e.g., survival), and provide an ordered list of the covariates that drive this prediction (variable importance). This will help doctors decide what variables are relevant at each time point. Variables are subject to missingness Variables are continuous, variable importance parameter is Ψ(P 0 ) E 0 {E 0 (Y A + δ, W ) E 0 (Y A, W )} for user-given value δ.
25 Variable Importance: Results * * * * * * * * * * HCT HGB APC CREA ISS PTT GCS HR SBP PLTS RR BUN PT DDIM PF12NM INR FVIII BDE PC ATIII FV Hour 72 Hour 48 Hour 24 Hour 12 Hour 06 Hour 00 Figure: Effect sizes and significance codes
26 TMLE with Genomic Data 570 case-control samples on spina bifida We want to identify associated genes to spina bifida from 115 SNPs. In the original paper Shaw et. al. 2009, a univariate analysis was performed. The original analysis missed rs and rs in TYMS gene because they are closely linked with counteracting effects on spina bifida. With TMLE, signals from these two SNPs were recovered. In TMEL, Q0 was obtained from LASSO, and g(w ) is obtained from a simple regression of SNP on its two flanking markers to account for confounding effects of neighborhood markers.
27 TMLE p-values for SNPs log10 p value rs hcv rs rs rs rs rs rs rs hcv rs rs hcv hcv hcv hcv hcv hcv hcv rs hcv rs hcv rs hcv rs rs hcv hcv hcv hcv hcv hcv hcv hcv hcv rs hcv hcv hcv rs rs hcv rs rs hcv rs rs rs rs rs10380 rs rs9332 hcv hcv hcv rs hcv hcv rs hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv rs rs rs hcv rs rs rs rs hcv rs rs rs hcv rs hcv hcv hcv rs rs rs rs hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv hcv rs rs rs MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFR MTHFD2 MTHFD2 MTHFD2 MTHFD2 MTHFD2 MTHFD2 MTHFD2 MTHFD2 R R R R R R R R R R R R R BHMT2 BHMT2 BHMT2 BHMT2 BHMT2 BHMT2 BHMT2 BHMT BHMT BHMT BHMT BHMT BHMT BHMT BHMT DHFR DHFR DHFR DHFR DHFR DHFR DHFR DHFR DHFR NOS3 NOS3 NOS3 FOLR1 FOLR1 FOLR1 FOLR2 FOLR2 FOLR2 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 MTHFD1 TYMS TYMS TYMS TYMS CBS CBS CBS CBS CBS CBS CBS CBS CBS RFC1 RFC1 RFC1 RFC1 RFC1 RFC1
28 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
29 Targeted Learning of Data Dependent Target Parameters (vdl, Hubbard, 2013) Define algorithms that map data into a target parameter: Ψ Pn : M IR d, thereby generalizing the notion of target parameters. Develop methods to obtain statistical inference for Ψ Pn (P 0 ) or 1/V V v=1 Ψ P n,v (P 0 ), where P n,v is the empirical distribution of parameter-generating sample corresponding with v-th sample split. We have developed cross-validated TMLE for the latter data dependent target parameter, without any additional conditions. In particular, this generalized framework allows us to generate a subset of target parameters among a massive set of candidate target parameters, while only having to deal with multiple testing for the data adaptively selected set of target parameters. Thus, much more powerful than regular multiple testing for a fixed set of null hypotheses.
30 Online Targeted MLE: Ongoing work Order data if not ordered naturally. Partition in subsets numbered from 1 to K. Initiate initial estimator and TMLE based on first subset. Update initial estimator based on second subset, and update TMLE based on second subset. Iterate till last subset. Final estimator is average of all stage specific TMLEs. In this manner, for each subset number of calculations is bounded by number of observations in subset and total computation time increases linearly in number of subsets. One can still prove asymptotic efficiency of this online TMLE.
31 Outline 1 Targeted Learning 2 Two stage methodology: Super Learning+ TMLE 3 Definition of Estimation Problem for Causal Effects of Multiple Time Point Interventions 4 Variable importance analysis examples of Targeted Learning 5 Scaling up Targeted Learning to handle Big Data 6 Concluding remarks
32 Concluding remarks Sound foundations of statistics are in place (Data is random variable, Model, Target Parameter, Inference based on Limit Distribution), but these have eroded over many decades. However, MLE had to be revamped into TMLE to deal with large models. Big Data asks for development of fast TMLE without giving up on statistical properties: e.g., Online TMLE. Big Data asks for research teams that consists of top statisticians and computer scientists, beyond subject matter experts. Philosophical soundness of proposed methods are hugely important and should become a norm.
33 References I H. Bang and J. Robins. Doubly robust estimation in missing data and causal inference models. Biometrics, 61: , M. Petersen, J. Schwab, S. Gruber, N. Blaser, M. Schomaker, and M. van der Laan. Targeted minimum loss based estimation of marginal structural working models. Journal of Causal Inference, submitted, E. Polley and M. van der Laan. SuperLearner: Super Learner Prediction, URL R package version E. Polley, S. Rose, and M. van der Laan. Super learning. Chapter 3 in Targeted Learning, van der Laan,Rose (2012), 2012.
34 References II M. Schnitzer, E. Moodie, M. van der Laan, R. Platt, and M. Klein. Marginal structural modeling of a survival outcome with targeted maximum likelihood estimation. Biometrics, to appear, M. Van Der Laan and S. Dudoit. Unified cross-validation methodology for selection among estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle inequalities and examples. UC Berkeley Division of Biostatistics Working Paper Series, page 130, M. J. van der Laan and S. Gruber. Targeted Minimum Loss Based Estimation of Causal Effects of Multiple Time Point Interventions. The International Journal of Biostatistics, 8(1), Jan ISSN doi: /
35 References III M. J. van der Laan and D. Rubin. Targeted Maximum Likelihood Learning. The International Journal of Biostatistics, 2(1), Jan ISSN doi: / M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical applications in genetics and molecular biology, 6(1), Jan ISSN doi: / A. van der Vaart, S. Dudoit, and M. van der Laan. Oracle inequalities for multi-fold cross-validation. Stat Decis, 24(3): , 2006.
Many research questions in epidemiology are concerned. Estimation of Direct Causal Effects ORIGINAL ARTICLE
ORIGINAL ARTICLE Maya L. Petersen, Sandra E. Sinisi, and Mark J. van der Laan Abstract: Many common problems in epidemiologic and clinical research involve estimating the effect of an exposure on an outcome
Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
Overview. Longitudinal Data Variation and Correlation Different Approaches. Linear Mixed Models Generalized Linear Mixed Models
Overview 1 Introduction Longitudinal Data Variation and Correlation Different Approaches 2 Mixed Models Linear Mixed Models Generalized Linear Mixed Models 3 Marginal Models Linear Models Generalized Linear
Personalized Predictive Medicine and Genomic Clinical Trials
Personalized Predictive Medicine and Genomic Clinical Trials Richard Simon, D.Sc. Chief, Biometric Research Branch National Cancer Institute http://brb.nci.nih.gov brb.nci.nih.gov Powerpoint presentations
Logistic Regression (1/24/13)
STA63/CBB540: Statistical methods in computational biology Logistic Regression (/24/3) Lecturer: Barbara Engelhardt Scribe: Dinesh Manandhar Introduction Logistic regression is model for regression used
Statistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
Statistics in Retail Finance. Chapter 6: Behavioural models
Statistics in Retail Finance 1 Overview > So far we have focussed mainly on application scorecards. In this chapter we shall look at behavioural models. We shall cover the following topics:- Behavioural
Bootstrapping Big Data
Bootstrapping Big Data Ariel Kleiner Ameet Talwalkar Purnamrita Sarkar Michael I. Jordan Computer Science Division University of California, Berkeley {akleiner, ameet, psarkar, jordan}@eecs.berkeley.edu
BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA
BIG DATA: CONVENTIONAL METHODS MEET UNCONVENTIONAL DATA Harvard Medical School & Harvard School of Public Health [email protected] October 14, 2014 1 / 7 THE SETTING Unprecedented advances in
Average Redistributional Effects. IFAI/IZA Conference on Labor Market Policy Evaluation
Average Redistributional Effects IFAI/IZA Conference on Labor Market Policy Evaluation Geert Ridder, Department of Economics, University of Southern California. October 10, 2006 1 Motivation Most papers
Missing data and net survival analysis Bernard Rachet
Workshop on Flexible Models for Longitudinal and Survival Data with Applications in Biostatistics Warwick, 27-29 July 2015 Missing data and net survival analysis Bernard Rachet General context Population-based,
Machine Learning Methods for Causal Effects. Susan Athey, Stanford University Guido Imbens, Stanford University
Machine Learning Methods for Causal Effects Susan Athey, Stanford University Guido Imbens, Stanford University Introduction Supervised Machine Learning v. Econometrics/Statistics Lit. on Causality Supervised
Introduction to General and Generalized Linear Models
Introduction to General and Generalized Linear Models General Linear Models - part I Henrik Madsen Poul Thyregod Informatics and Mathematical Modelling Technical University of Denmark DK-2800 Kgs. Lyngby
False Discovery Rates
False Discovery Rates John D. Storey Princeton University, Princeton, USA January 2010 Multiple Hypothesis Testing In hypothesis testing, statistical significance is typically based on calculations involving
Organizing Your Approach to a Data Analysis
Biost/Stat 578 B: Data Analysis Emerson, September 29, 2003 Handout #1 Organizing Your Approach to a Data Analysis The general theme should be to maximize thinking about the data analysis and to minimize
Adaptive Online Gradient Descent
Adaptive Online Gradient Descent Peter L Bartlett Division of Computer Science Department of Statistics UC Berkeley Berkeley, CA 94709 bartlett@csberkeleyedu Elad Hazan IBM Almaden Research Center 650
Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague. http://ida.felk.cvut.
Machine Learning and Data Analysis overview Jiří Kléma Department of Cybernetics, Czech Technical University in Prague http://ida.felk.cvut.cz psyllabus Lecture Lecturer Content 1. J. Kléma Introduction,
Statistical issues in the analysis of microarray data
Statistical issues in the analysis of microarray data Daniel Gerhard Institute of Biostatistics Leibniz University of Hannover ESNATS Summerschool, Zermatt D. Gerhard (LUH) Analysis of microarray data
Analysis of Longitudinal Data for Inference and Prediction. Patrick J. Heagerty PhD Department of Biostatistics University of Washington
Analysis of Longitudinal Data for Inference and Prediction Patrick J. Heagerty PhD Department of Biostatistics University of Washington 1 ISCB 2010 Outline of Short Course Inference How to handle stochastic
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups
Model-Based Recursive Partitioning for Detecting Interaction Effects in Subgroups Achim Zeileis, Torsten Hothorn, Kurt Hornik http://eeecon.uibk.ac.at/~zeileis/ Overview Motivation: Trees, leaves, and
STATISTICS COURSES UNDERGRADUATE CERTIFICATE FACULTY. Explanation of Course Numbers. Bachelor's program. Master's programs.
STATISTICS Statistics is one of the natural, mathematical, and biomedical sciences programs in the Columbian College of Arts and Sciences. The curriculum emphasizes the important role of statistics as
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh
Modern Optimization Methods for Big Data Problems MATH11146 The University of Edinburgh Peter Richtárik Week 3 Randomized Coordinate Descent With Arbitrary Sampling January 27, 2016 1 / 30 The Problem
Multivariate Normal Distribution
Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues
Regression Modeling Strategies
Frank E. Harrell, Jr. Regression Modeling Strategies With Applications to Linear Models, Logistic Regression, and Survival Analysis With 141 Figures Springer Contents Preface Typographical Conventions
STATISTICA Formula Guide: Logistic Regression. Table of Contents
: Table of Contents... 1 Overview of Model... 1 Dispersion... 2 Parameterization... 3 Sigma-Restricted Model... 3 Overparameterized Model... 4 Reference Coding... 4 Model Summary (Summary Tab)... 5 Summary
MATHEMATICAL METHODS OF STATISTICS
MATHEMATICAL METHODS OF STATISTICS By HARALD CRAMER TROFESSOK IN THE UNIVERSITY OF STOCKHOLM Princeton PRINCETON UNIVERSITY PRESS 1946 TABLE OF CONTENTS. First Part. MATHEMATICAL INTRODUCTION. CHAPTERS
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
Towards running complex models on big data
Towards running complex models on big data Working with all the genomes in the world without changing the model (too much) Daniel Lawson Heilbronn Institute, University of Bristol 2013 1 / 17 Motivation
Statistical Machine Learning
Statistical Machine Learning UoC Stats 37700, Winter quarter Lecture 4: classical linear and quadratic discriminants. 1 / 25 Linear separation For two classes in R d : simple idea: separate the classes
Data Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
Linda Staub & Alexandros Gekenidis
Seminar in Statistics: Survival Analysis Chapter 2 Kaplan-Meier Survival Curves and the Log- Rank Test Linda Staub & Alexandros Gekenidis March 7th, 2011 1 Review Outcome variable of interest: time until
Nominal and ordinal logistic regression
Nominal and ordinal logistic regression April 26 Nominal and ordinal logistic regression Our goal for today is to briefly go over ways to extend the logistic regression model to the case where the outcome
Handling missing data in Stata a whirlwind tour
Handling missing data in Stata a whirlwind tour 2012 Italian Stata Users Group Meeting Jonathan Bartlett www.missingdata.org.uk 20th September 2012 1/55 Outline The problem of missing data and a principled
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13
Missing Data: Part 1 What to Do? Carol B. Thompson Johns Hopkins Biostatistics Center SON Brown Bag 3/20/13 Overview Missingness and impact on statistical analysis Missing data assumptions/mechanisms Conventional
Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD
Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD Optum Labs Cambridge, MA, USA Statistical Methods and Machine Learning ISPOR International
Basics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu [email protected] Modern machine learning is rooted in statistics. You will find many familiar
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model
Overview of Violations of the Basic Assumptions in the Classical Normal Linear Regression Model 1 September 004 A. Introduction and assumptions The classical normal linear regression model can be written
Example application (1) Telecommunication. Lecture 1: Data Mining Overview and Process. Example application (2) Health
Lecture 1: Data Mining Overview and Process What is data mining? Example applications Definitions Multi disciplinary Techniques Major challenges The data mining process History of data mining Data mining
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION
PATTERN RECOGNITION AND MACHINE LEARNING CHAPTER 4: LINEAR MODELS FOR CLASSIFICATION Introduction In the previous chapter, we explored a class of regression models having particularly simple analytical
CONTENTS OF DAY 2. II. Why Random Sampling is Important 9 A myth, an urban legend, and the real reason NOTES FOR SUMMER STATISTICS INSTITUTE COURSE
1 2 CONTENTS OF DAY 2 I. More Precise Definition of Simple Random Sample 3 Connection with independent random variables 3 Problems with small populations 8 II. Why Random Sampling is Important 9 A myth,
CAUSALLY MOTIVATED ATTRIBUTION FOR ONLINE ADVERTISING
CAUSALLY MOTIVATED ATTRIBUTION FOR ONLINE ADVERTISING ABSTRACT Brian Dalessandro M6D Research, NYC,USA [email protected] Claudia Perlich M6D Research, NYC,USA [email protected] In many online advertising campaigns,
2DI36 Statistics. 2DI36 Part II (Chapter 7 of MR)
2DI36 Statistics 2DI36 Part II (Chapter 7 of MR) What Have we Done so Far? Last time we introduced the concept of a dataset and seen how we can represent it in various ways But, how did this dataset came
Detection of changes in variance using binary segmentation and optimal partitioning
Detection of changes in variance using binary segmentation and optimal partitioning Christian Rohrbeck Abstract This work explores the performance of binary segmentation and optimal partitioning in the
Least Squares Estimation
Least Squares Estimation SARA A VAN DE GEER Volume 2, pp 1041 1045 in Encyclopedia of Statistics in Behavioral Science ISBN-13: 978-0-470-86080-9 ISBN-10: 0-470-86080-4 Editors Brian S Everitt & David
Exact Nonparametric Tests for Comparing Means - A Personal Summary
Exact Nonparametric Tests for Comparing Means - A Personal Summary Karl H. Schlag European University Institute 1 December 14, 2006 1 Economics Department, European University Institute. Via della Piazzuola
Example: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
TOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
The Data Mining Process
Sequence for Determining Necessary Data. Wrong: Catalog everything you have, and decide what data is important. Right: Work backward from the solution, define the problem explicitly, and map out the data
Segmentation models and applications with R
Segmentation models and applications with R Franck Picard UMR 5558 UCB CNRS LBBE, Lyon, France [email protected] http://pbil.univ-lyon1.fr/members/fpicard/ INED-28/04/11 F. Picard (CNRS-LBBE)
STA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! [email protected]! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
SAS Certificate Applied Statistics and SAS Programming
SAS Certificate Applied Statistics and SAS Programming SAS Certificate Applied Statistics and Advanced SAS Programming Brigham Young University Department of Statistics offers an Applied Statistics and
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC)
large-scale machine learning revisited Léon Bottou Microsoft Research (NYC) 1 three frequent ideas in machine learning. independent and identically distributed data This experimental paradigm has driven
A Basic Introduction to Missing Data
John Fox Sociology 740 Winter 2014 Outline Why Missing Data Arise Why Missing Data Arise Global or unit non-response. In a survey, certain respondents may be unreachable or may refuse to participate. Item
GI01/M055 Supervised Learning Proximal Methods
GI01/M055 Supervised Learning Proximal Methods Massimiliano Pontil (based on notes by Luca Baldassarre) (UCL) Proximal Methods 1 / 20 Today s Plan Problem setting Convex analysis concepts Proximal operators
Elements of statistics (MATH0487-1)
Elements of statistics (MATH0487-1) Prof. Dr. Dr. K. Van Steen University of Liège, Belgium December 10, 2012 Introduction to Statistics Basic Probability Revisited Sampling Exploratory Data Analysis -
A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY. Workshop
A LONGITUDINAL AND SURVIVAL MODEL WITH HEALTH CARE USAGE FOR INSURED ELDERLY Ramon Alemany Montserrat Guillén Xavier Piulachs Lozada Riskcenter - IREA Universitat de Barcelona http://www.ub.edu/riskcenter
With Big Data Comes Big Responsibility
With Big Data Comes Big Responsibility Using health care data to emulate randomized trials when randomized trials are not available Miguel A. Hernán Departments of Epidemiology and Biostatistics Harvard
Predicting Customer Default Times using Survival Analysis Methods in SAS
Predicting Customer Default Times using Survival Analysis Methods in SAS Bart Baesens [email protected] Overview The credit scoring survival analysis problem Statistical methods for Survival
R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models
Faculty of Health Sciences R 2 -type Curves for Dynamic Predictions from Joint Longitudinal-Survival Models Inference & application to prediction of kidney graft failure Paul Blanche joint work with M-C.
School of Public Health and Health Services Department of Epidemiology and Biostatistics
School of Public Health and Health Services Department of Epidemiology and Biostatistics Master of Public Health and Graduate Certificate Biostatistics 0-04 Note: All curriculum revisions will be updated
Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection
Directions in Statistical Methodology for Multivariable Predictive Modeling Frank E Harrell Jr University of Virginia Seattle WA 19May98 Overview of Modeling Process Model selection Regression shape Diagnostics
A simple to implement algorithm for natural direct and indirect effects in survival studies with a repeatedly measured mediator
A simple to implement algorithm for natural direct and indirect effects in survival studies with a repeatedly measured mediator Susanne Strohmaier 1, Nicolai Rosenkranz 2, Jørn Wetterslev 2 and Theis Lange
Introduction to Online Learning Theory
Introduction to Online Learning Theory Wojciech Kot lowski Institute of Computing Science, Poznań University of Technology IDSS, 04.06.2013 1 / 53 Outline 1 Example: Online (Stochastic) Gradient Descent
Jordan Chamberlain Brooks
Jordan Chamberlain Brooks CURRICULUM VITAE February 2015 CONTACT Life Expectancy Project 1439-17th Avenue San Francisco, CA 94122-3402 Phone: (415) 731-0276 Fax: (415) 731-0290 Cell: (510) 507-1418 [email protected]
Simple and efficient online algorithms for real world applications
Simple and efficient online algorithms for real world applications Università degli Studi di Milano Milano, Italy Talk @ Centro de Visión por Computador Something about me PhD in Robotics at LIRA-Lab,
Data mining on the EPIRARE survey data
Deliverable D1.5 Data mining on the EPIRARE survey data A. Coi 1, M. Santoro 2, M. Lipucci 2, A.M. Bianucci 1, F. Bianchi 2 1 Department of Pharmacy, Unit of Research of Bioinformatic and Computational
Polynomial Neural Network Discovery Client User Guide
Polynomial Neural Network Discovery Client User Guide Version 1.3 Table of contents Table of contents...2 1. Introduction...3 1.1 Overview...3 1.2 PNN algorithm principles...3 1.3 Additional criteria...3
PS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng [email protected] The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
CS229 Project Report Automated Stock Trading Using Machine Learning Algorithms
CS229 roject Report Automated Stock Trading Using Machine Learning Algorithms Tianxin Dai [email protected] Arpan Shah [email protected] Hongxia Zhong [email protected] 1. Introduction
Study Design and Statistical Analysis
Study Design and Statistical Analysis Anny H Xiang, PhD Department of Preventive Medicine University of Southern California Outline Designing Clinical Research Studies Statistical Data Analysis Designing
Lecture 15 Introduction to Survival Analysis
Lecture 15 Introduction to Survival Analysis BIOST 515 February 26, 2004 BIOST 515, Lecture 15 Background In logistic regression, we were interested in studying how risk factors were associated with presence
Classification and Regression by randomforest
Vol. 2/3, December 02 18 Classification and Regression by randomforest Andy Liaw and Matthew Wiener Introduction Recently there has been a lot of interest in ensemble learning methods that generate many
Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
ALLERGY/MOLD Gene & Variation rsid # Risk Allele Your Alleles and Results HLA rs7775228 C TT -/- HLA rs2155219 T GG -/-
DETOX CYP1A1*2C A4889G rs1048943 C TT -/- CYP1A1*4 C2453A rs1799814 T GG -/- CYP1A2 C164A rs762551 C AA -/- CYP1B1 L432V rs1056836 C CC +/+ CYP1B1 N453S rs1800440 C TT -/- CYP1B1 R48G rs10012 C no call
Machine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
Bayesian Penalized Methods for High Dimensional Data
Bayesian Penalized Methods for High Dimensional Data Joseph G. Ibrahim Joint with Hongtu Zhu and Zakaria Khondker What is Covered? Motivation GLRR: Bayesian Generalized Low Rank Regression L2R2: Bayesian
Differential privacy in health care analytics and medical research An interactive tutorial
Differential privacy in health care analytics and medical research An interactive tutorial Speaker: Moritz Hardt Theory Group, IBM Almaden February 21, 2012 Overview 1. Releasing medical data: What could
Data, Measurements, Features
Data, Measurements, Features Middle East Technical University Dep. of Computer Engineering 2009 compiled by V. Atalay What do you think of when someone says Data? We might abstract the idea that data are
Statistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before
CCNY. BME I5100: Biomedical Signal Processing. Linear Discrimination. Lucas C. Parra Biomedical Engineering Department City College of New York
BME I5100: Biomedical Signal Processing Linear Discrimination Lucas C. Parra Biomedical Engineering Department CCNY 1 Schedule Week 1: Introduction Linear, stationary, normal - the stuff biology is not
Data Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
An Application of the G-formula to Asbestos and Lung Cancer. Stephen R. Cole. Epidemiology, UNC Chapel Hill. Slides: www.unc.
An Application of the G-formula to Asbestos and Lung Cancer Stephen R. Cole Epidemiology, UNC Chapel Hill Slides: www.unc.edu/~colesr/ 1 Acknowledgements Collaboration with David B. Richardson, Haitao
Statistical Models in R
Statistical Models in R Some Examples Steven Buechler Department of Mathematics 276B Hurley Hall; 1-6233 Fall, 2007 Outline Statistical Models Structure of models in R Model Assessment (Part IA) Anova
Data Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
Tree based ensemble models regularization by convex optimization
Tree based ensemble models regularization by convex optimization Bertrand Cornélusse, Pierre Geurts and Louis Wehenkel Department of Electrical Engineering and Computer Science University of Liège B-4000
An extension of the factoring likelihood approach for non-monotone missing data
An extension of the factoring likelihood approach for non-monotone missing data Jae Kwang Kim Dong Wan Shin January 14, 2010 ABSTRACT We address the problem of parameter estimation in multivariate distributions
Machine Learning. Term 2012/2013 LSI - FIB. Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34
Machine Learning Javier Béjar cbea LSI - FIB Term 2012/2013 Javier Béjar cbea (LSI - FIB) Machine Learning Term 2012/2013 1 / 34 Outline 1 Introduction to Inductive learning 2 Search and inductive learning
Problem of Missing Data
VASA Mission of VA Statisticians Association (VASA) Promote & disseminate statistical methodological research relevant to VA studies; Facilitate communication & collaboration among VA-affiliated statisticians;
Handling missing data in large data sets. Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza
Handling missing data in large data sets Agostino Di Ciaccio Dept. of Statistics University of Rome La Sapienza The problem Often in official statistics we have large data sets with many variables and
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.
Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví Pavel Kříž Seminář z aktuárských věd MFF 4. dubna 2014 Summary 1. Application areas of Insurance Analytics 2. Insurance Analytics
Agenda. Mathias Lanner Sas Institute. Predictive Modeling Applications. Predictive Modeling Training Data. Beslutsträd och andra prediktiva modeller
Agenda Introduktion till Prediktiva modeller Beslutsträd Beslutsträd och andra prediktiva modeller Mathias Lanner Sas Institute Pruning Regressioner Neurala Nätverk Utvärdering av modeller 2 Predictive
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries
Basic Statistics and Data Analysis for Health Researchers from Foreign Countries Volkert Siersma [email protected] The Research Unit for General Practice in Copenhagen Dias 1 Content Quantifying association
A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS
A SURVEY ON CONTINUOUS ELLIPTICAL VECTOR DISTRIBUTIONS Eusebio GÓMEZ, Miguel A. GÓMEZ-VILLEGAS and J. Miguel MARÍN Abstract In this paper it is taken up a revision and characterization of the class of
Some Essential Statistics The Lure of Statistics
Some Essential Statistics The Lure of Statistics Data Mining Techniques, by M.J.A. Berry and G.S Linoff, 2004 Statistics vs. Data Mining..lie, damn lie, and statistics mining data to support preconceived
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected]
SPSS TRAINING SESSION 3 ADVANCED TOPICS (PASW STATISTICS 17.0) Sun Li Centre for Academic Computing [email protected] IN SPSS SESSION 2, WE HAVE LEARNT: Elementary Data Analysis Group Comparison & One-way
How to assess the risk of a large portfolio? How to estimate a large covariance matrix?
Chapter 3 Sparse Portfolio Allocation This chapter touches some practical aspects of portfolio allocation and risk assessment from a large pool of financial assets (e.g. stocks) How to assess the risk
