One Statistician s Perspectives on Statistics and "Big Data" Analytics



Similar documents
Supervised Learning (Big Data Analytics)

How is Big Data Different? A Paradigm Shift

Fast Analytics on Big Data with H20

Data Mining Methods: Applications for Institutional Research

Data Mining. Nonlinear Classification

CI6227: Data Mining. Lesson 11b: Ensemble Learning. Data Analytics Department, Institute for Infocomm Research, A*STAR, Singapore.

Data Mining Practical Machine Learning Tools and Techniques

An Overview of Data Mining: Predictive Modeling for IR in the 21 st Century

Faculty of Science School of Mathematics and Statistics

Knowledge Discovery and Data Mining

Local classification and local likelihoods

Car Insurance. Prvák, Tomi, Havri

KATE GLEASON COLLEGE OF ENGINEERING. John D. Hromi Center for Quality and Applied Statistics

ANALYTICS CENTER LEARNING PROGRAM

Class #6: Non-linear classification. ML4Bio 2012 February 17 th, 2012 Quaid Morris

2015 Workshops for Professors

Model Combination. 24 Novembre 2009

Scalable Developments for Big Data Analytics in Remote Sensing

Why do statisticians "hate" us?

Gerry Hobbs, Department of Statistics, West Virginia University

Classification and Regression by randomforest

Leveraging Ensemble Models in SAS Enterprise Miner

Why Ensembles Win Data Mining Competitions

Government of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence

CS570 Data Mining Classification: Ensemble Methods

How To Make A Credit Risk Model For A Bank Account

Making Sense of the Mayhem: Machine Learning and March Madness

Data Mining Introduction

Knowledge Discovery and Data Mining

A Statistician s View of Big Data

Speaker First Plenary Session THE USE OF "BIG DATA" - WHERE ARE WE AND WHAT DOES THE FUTURE HOLD? William H. Crown, PhD

Regression Modeling Strategies

SAS Certificate Applied Statistics and SAS Programming

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING

Implications of Big Data for Statistics Instruction 17 Nov 2013

An Introduction to Data Mining. Big Data World. Related Fields and Disciplines. What is Data Mining? 2/12/2015

Drugs store sales forecast using Machine Learning

Supervised and unsupervised learning - 1

IS YOUR DATA WAREHOUSE SUCCESSFUL? Developing a Data Warehouse Process that responds to the needs of the Enterprise.

Machine learning for algo trading

Azure Machine Learning, SQL Data Mining and R

COPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments

Putting Big Data To Work. Bill Franks Chief Analytics Officer, Teradata August 2014

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

Homework Assignment 7

BIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics

What is Data Science? Data, Databases, and the Extraction of Knowledge Renée November 2014

Better credit models benefit us all

Statistical Distributions in Astronomy

Principles of Data Mining by Hand&Mannila&Smyth

Beating the MLB Moneyline

Course Descriptions: Undergraduate/Graduate Certificate Program in Data Visualization and Analysis

Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets

MHI3000 Big Data Analytics for Health Care Final Project Report

Machine Learning and Data Mining. Fundamentals, robotics, recognition

Knowledge Discovery and Data Mining

Big-data Analytics: Challenges and Opportunities

Introduction to nonparametric regression: Least squares vs. Nearest neighbors

Defending Networks with Incomplete Information: A Machine Learning Approach. Alexandre

IMPORTANCE OF QUANTITATIVE TECHNIQUES IN MANAGERIAL DECISIONS

Some Essential Statistics The Lure of Statistics

Risk pricing for Australian Motor Insurance

HT2015: SC4 Statistical Data Mining and Machine Learning

Data Mining: Overview. What is Data Mining?

Ensemble Methods. Knowledge Discovery and Data Mining 2 (VU) ( ) Roman Kern. KTI, TU Graz

Proposal for New Program: BS in Data Science: Computational Analytics

Knowledge Discovery and Data Mining. Bootstrap review. Bagging Important Concepts. Notes. Lecture 19 - Bagging. Tom Kelsey. Notes

What is Data Science? Girl Develop It! Meetup Renée M. P. Teate, March 2015

Cross Validation techniques in R: A brief overview of some methods, packages, and functions for assessing prediction models.

Data UNC. Vinayak Deshpande

Promises and Pitfalls of Big-Data-Predictive Analytics: Best Practices and Trends

Practical Data Science with Azure Machine Learning, SQL Data Mining, and R

Prerequisites. Course Outline

From Raw Data to. Actionable Insights with. MATLAB Analytics. Learn more. Develop predictive models. 1Access and explore data

Chapter 6. The stacking ensemble approach

Statistics Graduate Courses

Risk vs. Reward: A recipe for success

THE RISE OF THE BIG DATA: WHY SHOULD STATISTICIANS EMBRACE COLLABORATIONS WITH COMPUTER SCIENTISTS XIAO CHENG. (Under the Direction of Jeongyoun Ahn)

The aims: Chapter 14: Usability testing and field studies. Usability testing. Experimental study. Example. Example

How To Understand The Theory Of Probability

Combining GLM and datamining techniques for modelling accident compensation data. Peter Mulquiney

How To Solve The Kd Cup 2010 Challenge

Statistical Machine Learning

Statistics, Data Mining and Machine Learning in Astronomy: A Practical Python Guide for the Analysis of Survey Data. and Alex Gray

Transcription:

One Statistician s Perspectives on Statistics and "Big Data" Analytics Some (Ultimately Unsurprising) Lessons Learned Prof. Stephen Vardeman IMSE & Statistics Departments Iowa State University July 2014 Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 1 / 16

My Background in "Modern Analytics"/"Statistical Learning" Teaching PhD and MS Courses in the Area 1 3 iterations of a 3-credit PhD-level course since Spring 09 1 1-credit summer course beyond 1st 2 versions of the PhD course 1st iteration of a 3-credit MS-level shared course (with Morris and Wu) Spring 14 Substantial (400-500 hours of) External Work with Corporate Advanced Analytics Groups Personal Participation on (Mixed Participant HS Through PhD) Predictive Analytics Teams at ISU Netflix Challenge Several kaggle Contests ("compete as a data scientist for fortune, fame, and fun" AND you learn stuff!) Observation of Successful Student Teams 1 Typed notes and.pdf slides avaialbe at http://www.analyticsiowa.com/coursematerials/modern-multivariate-statistical-learning/ Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 2 / 16

Some (Indirect) Evidence of Real (System) Effectiveness 2013 and 2014 Prudsys AG Data Mining Cup Student Team Performance 5th Place International Finish (99 Entrants) in 2013 2 and 1st Place International Finish (98 Entrants 28 Countries) in 2014 34 by ISU Graduate Student Teams Composed Mostly of Statistics Students 2 http://www.news.iastate.edu/news/2013/07/25/datamining 3 http://www.news.iastate.edu/news/2014/07/10/data-miners 4 http://www.data-mining-cup.de/dmc-wettbewerb/preistraeger/ Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 3 / 16

Generalities About "Big Data Analytics" There is Nothing New Under the Sun... at a Fundamental Level There Really Are NO Surprises Statistical Theory Still Matters Linear Models and Linear Algebra Still Matter (Like Most Things) Most People Will Need to "Do It" to Learn It Learning to "Do It" is Perhaps Harder than Learning "Small Data" Statistics But Only Because of the Scale Involved Practice Does Raise Interesting Research Questions (That Can Have Some Emphases Different From Those Statisticians Are Used To) Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 4 / 16

Some More Specific Topics for Discussion Here Up-Front Work Data Handling and Handlers Analytical Computing Systems and Practices The Relevance of Decision Theory and Linear Models The (New But Limited) Relevance of Non-Parametric Methods (Smoothing, etc.) and Highly Flexible Parametric Methods Increased Attention to Model Bias and Over-Fit in Predictive Analytics The Role of "Ensembles" The Power of Statistical Theory and First Principles in Practical Data Analyses The Necessity of Demystification/Clear Thinking Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 5 / 16

Up-Front Work Before Variables x 11 x 12 x 1p y 1 x 21 x 22 x 2p y 2 Cases....... x N1 x N2 x Np y N typically comes a huge amount of work. Assembling Data from Disparate Sources Useful Definition of x s and Their Computation (just as in small data contexts) this can t be profitably done "synthetically" 80% of corporate project hours and 2/3 of student team project hours serious data handling and record keeping issues must be faced Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 6 / 16

Data Handlers and Data Handling Particularly in Light of the Up-Front Matters, in an Effi cient Corporate "Production Mode" a Set of Core Data Analysts is supported by at least an equal number of "data techs" who function more or less like scientific "lab techs" is supported by a good-sized dedicated IT operation (personnel, hardware, and appropriate software) interfaces with a much larger set of less technical business analysts/data scientists placed in business units Currently, Much of Our Favorite Software Can Fail to Be Adequate to "Simple" Up-Front Processing R versus MatLab R versus Python need for Perl, SQL Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 7 / 16

Analytical Computing Systems and Practices For Some Kinds of "Big Data" Predictive Modeling, R Can Be Perfectly Adequate it s usually p rather than N that is a problem (if there is one) MatLab and Python can work where R does not For Production-Scale (Time Series Type) Corporate Forecasting, SAS FS is Currently the Only Game in Town and Serious IT Support is Essential Attention to Standard (CS-type) Code-Sharing and Dataset-Sharing, Versioning and Communication Practice and Corresponding Resources are Essential Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 8 / 16

The Relevance of Decision Theory and Linear Models (After Once Passing Quite Out of Vogue During My Professional Lifetime) Decision Theory is Back loss, risk, optimality are the right constructs for guiding analytics... and modern problem size makes it possible to really use them it tells what is optimal and provides a framework for thinking clearly about what are obstacles... for example, it promises that in general in predictive analytics Err = minimum expected loss possible + modeling penalty + fitting penalty (that among other things puts realistic limits on what is possible) Directly or Indirectly Linear Theory Remains at the Center of Methodological Progress and Practice real understanding of everything in modern analytics, from shrinkage to smoothing to ensembles to kernels, turns on linear theory Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 9 / 16

The New but Limited Relevance of Non-Parametric Methods And Highly Flexible Parametric Methods Splines and Kernel Smoothers, etc. Begin to Look Promising When N is Big big p and the curse of dimensionality dampens enthusiasm some clever piecing together of low-dimensional smoothers (e.g. additive models) must be adopted to progress Tree-based Methods Become Effective... But (EVEN IN RF Form) Are No Silver Bullet Neural (and Other) "Networks" Are Highly Flexible And Could Be Alternative Smoothers, But Ad Hoc Fitting Makes Effi cacy Hard to Assess Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 10 / 16

Necessary Increased Attention to Model Bias and Over-Fit in Predictive Analytics Small Data Statistical Practice Has Typically Focused on "Quality of Fit" For Fixed Predictors "Fitting Bias and Variance" Big Data Statistical Practice Typically Pays Far More Attention to "Model Quality/Class" "Model Bias" Flexibility of a Prediction Method MUST Be Balanced Against Propensity for Over-Fit some version of a test set external to a training must be used to assess fit cross-validation (CVE) and bootstrapping (OOBE) are the only real options Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 11 / 16

The Role of "Ensembles" For SEL (Unless One Predictor IS the Conditional Mean Function) A Linear Combination of Two Predictors That Beats Both this motivates fitting a number of predictors and looking for good linear combinations another version of this is "boosting" which for SEL amounts to successively fitting to residuals and correcting a current predictor by some part of the fit to residuals BUT NOTICE THAT ONE IS ONLY ATTEMPTING TO BETTER APPROXIMATE THE CONDITIONAL MEAN FUNCTION! (There is NO Magic Here) For Classification the Combining of Predictors Might Be Accomplished By Combining "Voting Functions" or "Votes" the former creates a set of classifiers bigger than either produced by two constituent voting functions, and thus again allows for improved classification the latter is less defensible but is often effective BUT AGAIN, THERE IS NO MAGIC HERE Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 12 / 16

The Power of Statistical Theory and First Principles in Practical Data Analyses Basic Statistical Theory Wisely Used Beats Ad Hoc Methodology Every Day EXAMPLE: A Post-Mortem on the DMC (Classification Problem) Win Strongly Suggests that the Margin of Victory Was Theory-Driven Real Variable Encoding of Categorical Information. Suppose that x = (x 1, x 2,..., x k ) is a set of categorical predictors. What real function(s) of x can serve as practical numerical predictors? N-P Theory and/or suffi ciency considerations point at (log) likelihood ratios that can (up to some nontrivial number of cells) be estimated as essentially ratios of cell counts ln ˆf [( ) ] 1 (x) ˆf 0 (x) = ln N0 N1 (x) +.5 N 1 N 0 (x) +.5 This is MUCH preferable in practice to a set of cell indicators. Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 13 / 16

The Necessity of Demystification/Clear Thinking Silliness abounds. Folklore in machine learning is sometimes given more credence than careful thought. One clear example is the widely held belief that "majority voting by independent classifiers improves error rates." That s just not necessarily true, and obscures the fact that a vector of K classifiers in a 2-class problem takes values in {0, 1} K and that (again) the minimum risk function of the vector is a NP test (based on the likelihood ratio). Another example are persistent claims that some particular methodology (packaged by person or group X) is universally best. Common sense (and "no free lunch" theorems if one really must appeal to them) says that s silly. Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 14 / 16

In The End Modern "Big Data Analytics" is fun and a realm where the discipline of Statistics has much to offer. Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 15 / 16

Thanks Thanks for your attention! Vardeman (Iowa State University) Perspectives on "Big Data" Analytics July 2014 16 / 16