How is Big Data Different? A Paradigm Shift




Jennifer Clarke, Ph.D.
Associate Professor
Department of Statistics
Department of Food Science and Technology
University of Nebraska-Lincoln
ASA Snake River Chapter, Meridian, ID, May 29, 2015
J. Clarke, UNL Paradigm Shift 1/10

Outline
Shift 1. Trickle to Firehose
Shift 2. Experiment to Observation?
Shift 3. Low Dimension (n > p) to High Dimension (n < p)
Shift 4. Modeling to Prediction

Trickle to Firehose
As statisticians we enjoy working with data: preprocessing, visualizing, modeling, inference, analysis, sharing, etc.
These activities become difficult or impossible when the amount of data is large.
Each piece of analysis requires considerable time and effort, and consideration of computational expense.
What do we do when data is so large it can't fit on one computer?
What about garbage in, garbage out?
We rethink data as dynamic and emphasize random sampling.
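One way to take a random sample from data that arrives as a stream and cannot be held in memory is reservoir sampling. The sketch below is an illustration added here, not something from the slides; the function name `reservoir_sample` and its parameters are hypothetical, and it assumes the stream can be read one item at a time.

```python
import random


def reservoir_sample(stream, k, seed=None):
    """Return a uniform random sample of up to k items from an iterable.

    Uses one pass and O(k) memory (the classic "Algorithm R"), so the
    full stream never needs to fit on one machine.
    """
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Item i replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


# Example: sample 5 records from a "firehose" of one million.
sample = reservoir_sample(iter(range(1_000_000)), k=5, seed=42)
```

Each item ends up in the sample with equal probability, which keeps the usual sampling-based inference available even when the data keep arriving.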

Trickle to Firehose
Think about distributed processing. Can analysis be done in parallel or online? (Hadoop, MapReduce)
Data gets bigger, not smaller, during preprocessing, so plan ahead.
Computer scientists and system administrators are your friends.
Learn new computational skills (Python, SQL) and learn from others (social media).
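As a small illustration of an analysis that can be done in parallel, the sketch below computes a mean and variance in MapReduce style: each chunk is summarized independently (the "map" step) and the summaries are merged pairwise (the "reduce" step). It runs on one machine in plain Python and does not use Hadoop; the function names `summarize` and `merge` are hypothetical, and chunks are assumed to be non-empty.

```python
from functools import reduce


def summarize(chunk):
    """Map step: reduce one non-empty chunk to (count, mean, sum of squared deviations)."""
    n = len(chunk)
    mean = sum(chunk) / n
    m2 = sum((x - mean) ** 2 for x in chunk)
    return n, mean, m2


def merge(a, b):
    """Reduce step: combine two chunk summaries without revisiting the raw data."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    m2 = m2_a + m2_b + delta ** 2 * n_a * n_b / n
    return n, mean, m2


# Example: four chunks that could live on four different machines.
data = [float(v) for v in range(1, 101)]
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
n, mean, m2 = reduce(merge, map(summarize, chunks))
variance = m2 / n  # population variance of the full data set
```

Because `merge` is associative, the summaries can be combined in any order, which is the property that lets the work be spread across machines.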

Experiment to Observation?
Much of Big Data is collected without thought. Period.
Collecting is easy, analysis is hard.
Experimental design has played a limited role, but is critical to inference and prediction.
Sample size may not correlate with data size, e.g., genomics, social networks.
Need for modern experimental design with detailed data provenance.

Experiment to Observation?
Do we need statistics? YES.
"To consult the statistician after an experiment is finished is often merely to ask him to conduct a post-mortem examination. He can perhaps say what the experiment died of." (Fisher)
Still relevant: sampling populations, confounders, multiple testing, bias, and overfitting.
Usually not randomization.
Unstructured data: information that either does not have a pre-defined data model and/or is not organized in a predefined manner. (www.mapr.com)

Low Dimension (n > p) to High Dimension (n < p)
Figure out ways to make p smaller: variable selection, variable summarization, variable sampling.
Variable selection: sure independence screening (Fan and Lv 2008), LASSO (Tibshirani 1996), regularization.
In the context of linear regression, $E(y \mid X = x) = \alpha + \beta^{\top} x$, the LASSO estimates are chosen as
$$(\hat{\alpha}, \hat{\beta}) = \arg\min_{\alpha, \beta} \sum_{i=1}^{n} (y_i - \alpha - \beta^{\top} x_i)^2 \quad \text{subject to} \quad \sum_{j} |\beta_j| \le t.$$
For a suitable $\lambda \ge 0$ corresponding to $t$, this is equivalent to minimizing
$$\frac{1}{2n} \sum_{i=1}^{n} (y_i - \alpha - \beta^{\top} x_i)^2 + \lambda \sum_{j} |\beta_j|.$$
One can summarize variables by PCA, cluster medoids, etc.
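The penalized form of the LASSO objective can be minimized by cyclic coordinate descent with soft-thresholding. The sketch below is one common way to do this, written in plain Python for illustration; it is not taken from the slides, and the names `soft_threshold`, `lasso_cd`, and `simulate` (a toy data generator) are hypothetical. It runs a fixed number of sweeps and does no convergence check.

```python
import random


def soft_threshold(z, lam):
    """Shrink z toward zero by lam; values within [-lam, lam] become exactly 0."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, n_iter=200):
    """Minimize (1/2n) * sum (y_i - alpha - beta.x_i)^2 + lam * sum |beta_j|."""
    n, p = len(X), len(X[0])
    x_mean = [sum(row[j] for row in X) / n for j in range(p)]
    y_mean = sum(y) / n
    # Centering removes the (unpenalized) intercept from the inner problem.
    cols = [[row[j] - x_mean[j] for row in X] for j in range(p)]
    resid = [yi - y_mean for yi in y]  # residuals at beta = 0
    beta = [0.0] * p
    for _ in range(n_iter):
        for j, col in enumerate(cols):
            z = sum(c * c for c in col) / n
            if z == 0.0:
                continue  # constant column carries no information
            # Covariance of column j with the partial residual (beta_j removed).
            rho = sum(c * (r + beta[j] * c) for c, r in zip(col, resid)) / n
            new = soft_threshold(rho, lam) / z
            diff = new - beta[j]
            if diff != 0.0:
                resid = [r - diff * c for c, r in zip(col, resid)]
                beta[j] = new
    alpha = y_mean - sum(b * m for b, m in zip(beta, x_mean))
    return alpha, beta


def simulate(n, seed=0):
    """Toy data (an assumption for illustration): y depends on the first of 3 variables."""
    rng = random.Random(seed)
    X = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(n)]
    y = [2.0 + 3.0 * row[0] + rng.gauss(0, 0.1) for row in X]
    return X, y


X, y = simulate(200)
alpha, beta = lasso_cd(X, y, lam=0.1)
```

With a large enough `lam`, every coefficient is set exactly to zero, which is what makes the penalty usable for variable selection.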

Low Dimension (n > p) to High Dimension (n < p)
Variable sampling (huh?). Can model subsets of data where subsets are random samples of observations AND variables.
Overfitting is a BIG problem.
Correct for multiple testing: FDR, Westfall-Young.
Problem drives solution... don't hit all the nails.
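For the FDR correction, one widely used procedure is the Benjamini-Hochberg step-up rule. The slides do not say which FDR method is meant, so treat the sketch below as one possible reading; the function name `benjamini_hochberg` and the example p-values are made up for illustration. The Westfall-Young resampling approach is not shown.

```python
def benjamini_hochberg(p_values, q=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Controls the false discovery rate at level q under independence
    (or positive dependence) of the tests.
    """
    m = len(p_values)
    # Indices of the p-values from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    k_max = 0
    for rank, i in enumerate(order, start=1):
        # Largest rank k with p_(k) <= k * q / m.
        if p_values[i] <= rank * q / m:
            k_max = rank
    rejected = [False] * m
    for i in order[:k_max]:
        rejected[i] = True
    return rejected


# Illustrative p-values from ten hypothetical tests.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205, 0.212, 0.216]
decisions = benjamini_hochberg(p, q=0.05)
```

Here five p-values fall below 0.05, but only the two smallest survive the correction.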

Shift 4. Modeling to Prediction
Idea: When data is unstructured or otherwise complex, model uncertainty is high.
If the goal is prediction, average or ensemble. Think bagging, boosting. This will reduce variability without increasing bias (while accounting for model uncertainty).
Focus on accurate prediction and sequential/prequential analysis (interactive).
Approach inference by assessing and dissecting predictor.
Uncertainty and variability based on random sampling and/or permutation.
Error rate depends on correlation between models and strength of individual models.
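A minimal sketch of bagging, assuming a single numeric predictor and one-split regression stumps as the base model: each stump is fit to a bootstrap resample of the observations and the predictions are averaged. All function names here are hypothetical and the toy data are made up; it illustrates the averaging idea only, not boosting.

```python
import random
from statistics import mean


def fit_stump(xs, ys):
    """Fit a one-split regression stump by least squares; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in sorted(set(xs))[:-1]:  # every split leaves both sides non-empty
        left = [y for x, y in zip(xs, ys) if x <= t]
        right = [y for x, y in zip(xs, ys) if x > t]
        lm, rm = mean(left), mean(right)
        sse = sum((y - lm) ** 2 for y in left) + sum((y - rm) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, lm, rm)
    if best is None:  # all x equal: fall back to a constant model
        m = mean(ys)
        return xs[0], m, m
    return best[1:]


def predict_stump(stump, x):
    t, lm, rm = stump
    return lm if x <= t else rm


def bag_stumps(xs, ys, n_models=25, seed=0):
    """Fit one stump per bootstrap resample (sampling observations with replacement)."""
    rng = random.Random(seed)
    n = len(xs)
    models = []
    for _ in range(n_models):
        idx = [rng.randrange(n) for _ in range(n)]
        models.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return models


def predict_bagged(models, x):
    """Ensemble prediction: the average of the individual stump predictions."""
    return mean(predict_stump(m, x) for m in models)


# Toy step-function data.
xs = [i / 10 for i in range(20)]
ys = [0.0 if x < 1.0 else 1.0 for x in xs]
models = bag_stumps(xs, ys, n_models=15, seed=1)
low, high = predict_bagged(models, 0.2), predict_bagged(models, 1.8)
```

Averaging helps most when the individual models are accurate and their errors are weakly correlated, which is why resampling observations (and, in some methods, variables) is used to make the models differ.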

Shift 4. Modeling to Prediction
Linear models are like donkeys. Treat them right and they'll carry you, grudgingly, as far as they can. They will bray when they are unhappy but you will know how to feed and water them. They don't travel far.
Neural Networks are like a big pile of snakes. Some are poisonous though most aren't. There are different sizes and colors. You can grab one that looks right but since you can't see the whole thing it will coil around in unexpected ways.
Trees are like foxes. They're clever and run around all over the place. Catch the ones that will solve your problem.
Ensemble methods are packrats. They always have something that works well but it may be a mess. This can save time catching snakes, chasing foxes, or beating donkeys. But, you don't know what you've really got.