How To Understand Data Science



Similar documents
Big Data Big Knowledge?

MS1b Statistical Data Mining

Statistics Graduate Courses

Statistics for BIG data

Learning outcomes. Knowledge and understanding. Competence and skills

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Classification Problems

CHARACTERISTICS IN FLIGHT DATA ESTIMATION WITH LOGISTIC REGRESSION AND SUPPORT VECTOR MACHINES

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

An Introduction to Data Mining

Big Data Hope or Hype?

Microsoft Azure Machine learning Algorithms

Model Combination. 24 Novembre 2009

Machine Learning: Overview

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Programme du parcours Clinical Epidemiology UMR 1. Methods in therapeutic evaluation A Dechartres/A Flahault

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

Data Mining Practical Machine Learning Tools and Techniques

Big Data Analytics and Optimization

HT2015: SC4 Statistical Data Mining and Machine Learning

MACHINE LEARNING IN HIGH ENERGY PHYSICS

Linear Classification. Volker Tresp Summer 2015

Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

STATISTICS and PROBABILITY Definition Statistics is the science and practice of developing human knowledge through the use of empirical data

Get to Know the IBM SPSS Product Portfolio

Introduction to Data Mining Techniques

Machine Learning for Data Science (CS4786) Lecture 1

TDS - Socio-Environmental Data Science

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

Statistics, Big Data and Data Science!?

Predicting Customer Default Times using Survival Analysis Methods in SAS

Opportunities and Limitations of Big Data

An Introduction to Advanced Analytics and Data Mining

Statistical Models in Data Mining

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Big Data Analytics and Optimization

Azure Machine Learning, SQL Data Mining and R

STA 4273H: Statistical Machine Learning

PREDICTIVE ANALYTICS: PROVIDING NOVEL APPROACHES TO ENHANCE OUTCOMES RESEARCH LEVERAGING BIG AND COMPLEX DATA

Machine Learning and Data Analysis overview. Department of Cybernetics, Czech Technical University in Prague.

Big Data Challenges. technology basics for data scientists. Spring Jordi Torres, UPC - BSC

Machine Learning for Medical Image Analysis. A. Criminisi & the InnerEye MSRC

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Data mining on the EPIRARE survey data

Course Requirements for the Ph.D., M.S. and Certificate Programs

Improving Demand Forecasting

MSCA Introduction to Statistical Concepts

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

A Study Of Bagging And Boosting Approaches To Develop Meta-Classifier

Statistics in Applications III. Distribution Theory and Inference

Information Management course

On the effect of data set size on bias and variance in classification learning

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Data Analytics at NICTA. Stephen Hardy National ICT Australia (NICTA)

AUTOMATION OF ENERGY DEMAND FORECASTING. Sanzad Siddique, B.S.

Marketing Mix Modelling and Big Data P. M Cain

Data Mining and Machine Learning in Bioinformatics

Predict Influencers in the Social Network

Chapter 6. The stacking ensemble approach

Predictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD

Probabilistic Linear Classification: Logistic Regression. Piyush Rai IIT Kanpur

INVESTIGATIONS INTO EFFECTIVENESS OF GAUSSIAN AND NEAREST MEAN CLASSIFIERS FOR SPAM DETECTION

Better credit models benefit us all

New Work Item for ISO Predictive Analytics (Initial Notes and Thoughts) Introduction

Title. Introduction to Data Mining. Dr Arulsivanathan Naidoo Statistics South Africa. OECD Conference Cape Town 8-10 December 2010.

Benchmarking of different classes of models used for credit scoring

How is Big Data Different? A Paradigm Shift

Predictive Data modeling for health care: Comparative performance study of different prediction models

Simple and efficient online algorithms for real world applications

Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model

Evaluation of Machine Learning Techniques for Green Energy Prediction

SAS Software to Fit the Generalized Linear Model

Lecture/Recitation Topic SMA 5303 L1 Sampling and statistical distributions

MSCA Introduction to Statistical Concepts

Certificate Program in Applied Big Data Analytics in Dubai. A Collaborative Program offered by INSOFE and Synergy-BI

Machine learning for algo trading

WebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Regression Modeling Strategies

COURSE PLAN BDA: Biomedical Data Analysis Master in Bioinformatics for Health Sciences Academic Year Qualification.

Bayesian Statistical Analysis in Medical Research

BINOMIAL OPTIONS PRICING MODEL. Mark Ioffe. Abstract

Statistical Challenges with Big Data in Management Science

Blue Yonder Research Papers

Data Mining and Exploration. Data Mining and Exploration: Introduction. Relationships between courses. Overview. Course Introduction

Adequacy of Biomath. Models. Empirical Modeling Tools. Bayesian Modeling. Model Uncertainty / Selection

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Event driven trading new studies on innovative way. of trading in Forex market. Michał Osmoła INIME live 23 February 2016

Machine Learning.

Detection. Perspective. Network Anomaly. Bhattacharyya. Jugal. A Machine Learning »C) Dhruba Kumar. Kumar KaKta. CRC Press J Taylor & Francis Croup

Lecture 6: Logistic Regression

Supervised Learning (Big Data Analytics)

Introduction to Big Data with Apache Spark UC BERKELEY

Bayesian networks - Time-series models - Apache Spark & Scala

Self-Organising Data Mining

Data Mining and Neural Networks in Stata

Gerard Mc Nulty Systems Optimisation Ltd BA.,B.A.I.,C.Eng.,F.I.E.I

A Statistician s Introductory View on Big Data and Data Science (Version 7 as of )

Predicting required bandwidth for educational institutes using prediction techniques in data mining (Case Study: Qom Payame Noor University)

Transcription:

EBPI Epidemiology, Biostatistics and Prevention Institute Big Data Science Torsten Hothorn 2014-03-31

The end of theory The End of Theory: The Data Deluge Makes the Scientific Method Obsolete (Chris Anderson, Wired Magazine 16.07) Petabytes allow us to say: Correlation is enough. University of Zurich, EBPI 2014-03-31 Big Data Science Page 2

Big data science Big data Data science Predictive modelling Business intelligence Machine learning (parts of) Artificial intelligence; neural networks (parts of) Pattern recognition Knowledge discovery in data (KDD)... University of Zurich, EBPI 2014-03-31 Big Data Science Page 3

Big data science Big data revolution Data science Predictive modelling Business intelligence Machine learning (parts of) Artificial intelligence; neural networks (parts of) Pattern recognition Knowledge discovery in data (KDD)... University of Zurich, EBPI 2014-03-31 Big Data Science Page 4

Big data in journal titles Number of papers 0 50 100 150 200 250 300 2004 2005 2006 2007 2008 2009 2010 2011 2012 2013 Year published (Source: Web of Science) University of Zurich, EBPI 2014-03-31 Big Data Science Page 5

Big data science Big data revolution Data science Predictive modelling Business intelligence Machine learning (parts of) Artificial intelligence; neural networks (parts of) Pattern recognition Knowledge discovery in data (KDD)... University of Zurich, EBPI 2014-03-31 Big Data Science Page 6

But what about... Statistics? Interestingly, Andersons article starts with the famous quote of George Box All models are wrong, but some are useful. Anderson uses 8 times the term statistic* in his 1336 words long article. So, what is the connection between big data etc. and statistics now and what is the future of statistics? Whoever wishes to foresee the future must consult the past. (Machiavelli) University of Zurich, EBPI 2014-03-31 Big Data Science Page 7

Statistics Statistics is the science (the art?) of collecting, analysing, interpreting and communicating data. The word statistics refers to state statisticum (lat) regarding the state statista (ital) statesman, politician So, originally (and, to a large extend, still today), statistics is concerned with data describing the population, economy, administration etc. of a state. This is where the bean counter connotation comes from. University of Zurich, EBPI 2014-03-31 Big Data Science Page 8

Early Zürich statistics Johann Heinrich Waser (statistician in Zürich, 1742-1780) published a book with the title Swiss Blood and French Money containing data about the Zürich war fonds with a publisher in Göttingen (1780). He was accused of treason, sentenced to death and executed in Zürich in 1780. University of Zurich, EBPI 2014-03-31 Big Data Science Page 9

Early Zürich statistics Johann Heinrich Waser (statistician in Zürich, 1742-1780) published a book with the title Swiss Blood and French Money containing data about the Zürich war fonds with a publisher in Göttingen (1780). He was accused of treason, sentenced to death and executed in Zürich in 1780 (the year the NZZ was founded). University of Zurich, EBPI 2014-03-31 Big Data Science Page 10

Rings a bell? University of Zurich, EBPI 2014-03-31 Big Data Science Page 11

Statistics in academia Scientists (working empirically) have a hypothesis/theory and thus a (probabilistic) model an experiment and thus data Statistical methods use the data to estimate free parameters in the model assess their uncertainty and provide means to falsify a theory and/or to formulate a better theory Estimation is performed by either optimisation (frequentists, this talk) or integration (Bayesians, not really today). University of Zurich, EBPI 2014-03-31 Big Data Science Page 12

Models for conditional distributions As an example, suppose a theory states that one or more explanatory variables X affect the distribution of a (so-called response ) variable Y. We are interested if and how the conditional distribution of Y given X = x (Y X = x) P Y X =x depends on x through a function f (x): ξ(y X = x) = f (x) := arg min E } {{ } Y,X ρ(y, f (X )) f statistical model } {{ } minimisation problem University of Zurich, EBPI 2014-03-31 Big Data Science Page 13

Statistical decision theory Abraham Wald (1902-1950) established statistical decision theory; in a nutshell, a statistical model is defined by the minimal expected loss E Y,X (ρ(y, f (X ))). Statistical decision theory is the common foundation of statistics, machine learning, neural networks, pattern recognition, KDD, etc. But the language is different in comput[er,ational] science and statistics. University of Zurich, EBPI 2014-03-31 Big Data Science Page 14

Same thing, different name Machine learning supervised learning Statistics regression ξ(y X = x) = f (x) target variable response variable Y attribute, feature explanatory variable, covariate X hypothesis model, regression function f University of Zurich, EBPI 2014-03-31 Big Data Science Page 15

Same thing, different name instances, examples samples, observations, realisations (Y i, X i ) P (Y,X ), i = 1,..., n learning estimation, fitting ˆf = arg min f n ρ(y i, f (X i )) + λpen(f ) i=1 classification prediction ˆf (x) generalisation error risk E Y,X ρ(y, ˆf (X )) University of Zurich, EBPI 2014-03-31 Big Data Science Page 16

So, what s the difference? ρ (and thus ξ, the optimisation problem and optimiser) is often different causing much confusion. For binary Y, the loss ρ is hinge loss, exponential loss log-density binomial distribution monotone non monotone Loss 0 1 2 3 4 5 6 ρ 0 1 ρ SVM ρ exp ρ log lik Loss 0 1 2 3 4 5 6 ρ 0 1 ρ L2 ρ L1 3 2 1 0 1 2 3 3 2 1 0 1 2 3 (2y 1)f (2y 1)f University of Zurich, EBPI 2014-03-31 Big Data Science Page 17

So, what s the difference? Traditionally, machine learners are more interested in black box classification, i.e. ˆf (x) or even only Ŷ. Statisticians focus on interpretation, i.e., look at ˆf (x) = x ˆβ (linear model) or ˆf (x) = J f j (x) j=1 (additive model) Have a strong background in optimisation. Have a strong background in modelling. University of Zurich, EBPI 2014-03-31 Big Data Science Page 18

Some history The median regression model ρ(y, f (X )) = Y f (X ) f (x) = Median(Y X = x) was suggested by Boscovic and Laplace in the late 18th century. The optimisation problem ˆf = arg min f n i=1 Y i f (X i ) is (was?) hard to solve. University of Zurich, EBPI 2014-03-31 Big Data Science Page 19

Some history The mean regression model ρ(y, f (X )) = Y f (X ) 2 f (x) = E(Y X = x). was suggested only a little later by Legendre and Gauß. Why? Because ˆf = arg min f n i=1 Y i f (X i ) 2 was relatively easy to compute with f (x) = x β. University of Zurich, EBPI 2014-03-31 Big Data Science Page 20

Some history Carl-Friedrich Gauß (1777-1855), the great-grandfather of statistics, replaced a not-so-nice loss function with a nice one and suggested a fast optimisation algorithm (Gaussian elimination). So he was actually a machine learner! We see this pattern over and over again. University of Zurich, EBPI 2014-03-31 Big Data Science Page 21

Same model, different optimiser Machine learning Statistics artificially neural networks additive/nonlinear logistic regression support vector machines generalised mixed/additive models boosting generalised additive models decision trees regression trees random forests random forests random forests? University of Zurich, EBPI 2014-03-31 Big Data Science Page 22

Working together (cited 5189 times) Talking to each other really helps. University of Zurich, EBPI 2014-03-31 Big Data Science Page 23

What s different in big data? Doug Laney (2001), a META Group/Gartner (!) employee: Big data is high volume, high velocity, and/or high variety information assets that require new forms of processing to enable enhanced decision making, insight discovery and process optimisation. Wikipedia has Big data uses inductive statistics and concepts from nonlinear system identification to infer laws (regressions, nonlinear relationships, and causal effects) from large data sets to reveal relationships, dependencies, and to perform predictions of outcomes and behaviours. In other words: Statistics for (large) data sets from multiple unplanned retrospective observational studies / sources. University of Zurich, EBPI 2014-03-31 Big Data Science Page 24

Not much! One of the most shattering examples of re-selling existing statistical technology under a new name is A/B testing. (Source: smashingmagazine.com) University of Zurich, EBPI 2014-03-31 Big Data Science Page 25

Not much! This is a permutation test, most of the time applied incorrectly. And with big data, the test will always be significant anyways. University of Zurich, EBPI 2014-03-31 Big Data Science Page 26

What s the technical challenge? Problem: RAM too small for data; can t load all the data to compute something. This has been the rule with all data over the last 300 years, not the exception. Solution: (Finite) sampling and assessment of variability: go back to STA101. Good news for statisticians: you can bootstrap from the true instead of the empirical distribution. University of Zurich, EBPI 2014-03-31 Big Data Science Page 27

Bias & missings are much bigger problems Y X 1 X 2 1 2 3 observed Observations i n Variables University of Zurich, EBPI 2014-03-31 Big Data Science Page 28

Opportunities We may have enough data to model the whole conditional distribution P Y X =x and not just some real-valued functional ξ(p Y X =x ) like the mean, for example by conditional transformation models (Hothorn, Kneib, Bühlmann, 2014). This allows probabilistic forecasts (Gneiting & Katzfuss, 2014). Funny: In biometry, Kaplan-Meier estimates and (to a certain extend) the Cox model for survival times always looked at the whole conditional distribution! University of Zurich, EBPI 2014-03-31 Big Data Science Page 29

Opportunities Big data instead of meta-analysis: The PRO-ACT data base has time-course information of more than 8500 ALS patients from multiple clinical trials. Use this pooled data to model ALS disease progression (Hothorn & Jung, 2014) instead of somehow merging multiple analyses. Merge different data sources (police records, road information systems, weather records, satellite images, browsing surveys) to model spatial and temporal distribution of wildlife-vehicle collisions (Hothorn et al, 2012). University of Zurich, EBPI 2014-03-31 Big Data Science Page 30

Can we learn something? Statisticians are rather hesitant to new models and techniques because partially educated and employed for policing science (sample size? power? analysis plan? significance?). In the 1990ies, statisticians lost track of microbiology; now there is bioinformatics. However, it seems statisticians are still needed. Think (lack of) reproducibility (Lancet Jan 11 series Increasing value, reducing waste ; Hothorn & Leisch, 2011). Is p <.05 necessary and sufficient for reproducibility? Statistics needs better marketing. The trademark of my own field, biometry, was hijacked by people scanning fingerprints and irises. University of Zurich, EBPI 2014-03-31 Big Data Science Page 31

And data science? (from R-blogger Drew Conway) University of Zurich, EBPI 2014-03-31 Big Data Science Page 32

And data science? Nate Silver @ twitter after his JSM 2013 talk Data scientist is just a sexed up word for statistician. Just do good work and call yourself whatever you want. Just make sure your grant agency gets the point. Thank you very much! University of Zurich, EBPI 2014-03-31 Big Data Science Page 33

And data science? Nate Silver @ twitter after his JSM 2013 talk Data scientist is just a sexed up word for statistician. Just do good work and call yourself whatever you want. Just make sure your grant agency gets the point. Thank you very much! University of Zurich, EBPI 2014-03-31 Big Data Science Page 34

And data science? Nate Silver @ twitter after his JSM 2013 talk Data scientist is just a sexed up word for statistician. Just do good work and call yourself whatever you want. Just make sure your grant agency gets the point. Thank you very much! University of Zurich, EBPI 2014-03-31 Big Data Science Page 35

References Hothorn & Leisch (2011) http://dx.doi.org/10.1093/bib/bbq084 Hothorn, Brandl & Müller (2012) http://dx.doi.org/10.1371/journal.pone.0029510 Gneiting & Katzfuss (2014) http://dx.doi.org/10.1146/ annurev-statistics-062713-085831 Hothorn & Jung (2014) http://dx.doi.org/10.3109/21678421.2014.893361 Hothorn, Kneib & Bühlmann (2014) http://dx.doi.org/10.1111/rssb.12017 University of Zurich, EBPI 2014-03-31 Big Data Science Page 36