Fitting Big Data into Business Statistics

Similar documents
Fitting Big Data into Business Statistics

Getting Ready for Big Data

Implications of Big Data for Statistics Instruction 17 Nov 2013

Data Mining Introduction

Demand Management Where Practice Meets Theory

Overview. Data Mining. Predicting Stock Market Returns. Predicting Health Risk. Wharton Department of Statistics. Wharton

Lecture 10: Regression Trees

Customer and Business Analytic

CS 4204 Computer Graphics

Simple Predictive Analytics Curtis Seare

Data Analysis. Using Excel. Jeffrey L. Rummel. BBA Seminar. Data in Excel. Excel Calculations of Descriptive Statistics. Single Variable Graphs

Acknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues

Regression 3: Logistic Regression

The Predictive Data Mining Revolution in Scorecards:

Street Address: 1111 Franklin Street Oakland, CA Mailing Address: 1111 Franklin Street Oakland, CA 94607

Time Series Analysis. 1) smoothing/trend assessment

STAT 370: Probability and Statistics for Engineers [Section 002]

Data Mining Techniques Chapter 6: Decision Trees

Big Data. A Lot of Opportunities. For Producing Wrong Results. Data Science Symposium MMag. Dr. Günther Eibl

ECLT5810 E-Commerce Data Mining Technique SAS Enterprise Miner -- Regression Model I. Regression Node

B490 Mining the Big Data. 0 Introduction

The Forgotten JMP Visualizations (Plus Some New Views in JMP 9) Sam Gardner, SAS Institute, Lafayette, IN, USA

430 Statistics and Financial Mathematics for Business

Audit Data Analytics. Bob Dohrer, IAASB Member and Working Group Chair. Miklos Vasarhelyi Phillip McCollough

How To Build A Predictive Model In Insurance

Chapter 7: Simple linear regression Learning Objectives

Sunnie Chung. Cleveland State University

Precalculus REVERSE CORRELATION. Content Expectations for. Precalculus. Michigan CONTENT EXPECTATIONS FOR PRECALCULUS CHAPTER/LESSON TITLES

2. What is the general linear model to be used to model linear trend? (Write out the model) = or

Data Mining Techniques Chapter 5: The Lure of Statistics: Data Mining Using Familiar Tools

16 : Demand Forecasting

Trust Accounting Rules!

Forecasting the first step in planning. Estimating the future demand for products and services and the necessary resources to produce these outputs

Big Data Hope or Hype?

Vector HelpDesk - Administrator s Guide

Big Data + Predictive Analytics = Actionable Business Insights: Consider Big Data as the Most Important Thing for Business since the Internet

Data Mining and Visualization

Data Mining: Overview. What is Data Mining?

Probability and Statistics

A Whitepaper of Marketing Questions and Answers Creating Landing Pages that Sell (with Bob Bly)

Big Data in Transportation Engineering

Teaching Business Statistics through Problem Solving

CS171 Visualization. The Visualization Alphabet: Marks and Channels. Alexander Lex [xkcd]

BELGIAN BIG DATA SURVEY Walter Vanherle ba4all Chairman 13 October

APPLICATION FOR PART-TIME EMPLOYMENT AS A TUTOR TUTOR IN THE DOLCIANI MATHEMATICS LEARNING CENTER

Automated Biosurveillance Data from England and Wales,

Should we Really Care about Building Business. Cycle Coincident Indexes!

Introduction to Data Mining

CART 6.0 Feature Matrix

Algebra Academic Content Standards Grade Eight and Grade Nine Ohio. Grade Eight. Number, Number Sense and Operations Standard

Application of SAS! Enterprise Miner in Credit Risk Analytics. Presented by Minakshi Srivastava, VP, Bank of America

Cross Validation. Dr. Thomas Jensen Expedia.com

Startup guide for Zimonitor

Selection bias in secondary analysis of electronic health record data. Sebastien Haneuse, PhD

PITFALLS IN TIME SERIES ANALYSIS. Cliff Hurvich Stern School, NYU

Social Networking Sites like Facebook, MSN

ONLINE EXTERNAL AND SURVEY STUDIES

Data Visualization Handbook

UNIFY YOUR (BIG) DATA

Model Validation Techniques

NCSS Statistical Software Principal Components Regression. In ordinary least squares, the regression coefficients are estimated using the formula ( )

Insurance Analytics - analýza dat a prediktivní modelování v pojišťovnictví. Pavel Kříž. Seminář z aktuárských věd MFF 4.

So you want to create an a Friend action

Using Excel (Microsoft Office 2007 Version) for Graphical Analysis of Data

PES 1110 Fall 2013, Spendier Lecture 27/Page 1

Data Mining and Analysis With Excel PivotTables and The QI Macros By Jay Arthur, The KnowWare Man

Section 6: Model Selection, Logistic Regression and more...

5 Steps to Designing Omni-channel Fulfillment Operations

A Basic Guide to Modeling Techniques for All Direct Marketing Challenges

MTH 140 Statistics Videos

INTRODUCTION PARKING REPORTING: OVERVIEW PARKING REPORTING: STATISTICS MY DOMAINS: PARKING OPTIMIZATIONS...

Domestic Services Skips Skip Bags Ancillary Products Support Payments Route Management Maps SMS

1 Maximum likelihood estimation

GLM I An Introduction to Generalized Linear Models

Part 1: Background - Graphing

Big Data Analytics. An Introduction. Oliver Fuchsberger University of Paderborn 2014

Business Intelligence and Process Modelling

How much of a difference should I expect? The bill pay screen and menu will have an enhanced appearance; however the functionality will be the same!

Remarkably simple HR Software

LAGUARDIA COMMUNITY COLLEGE CITY UNIVERSITY OF NEW YORK DEPARTMENT OF MATHEMATICS, ENGINEERING, AND COMPUTER SCIENCE

A Guide to Using Excel in Physics Lab

Credit Risk Analysis Using Logistic Regression Modeling

Data Mining Practical Machine Learning Tools and Techniques

Introduction to Time Series Analysis and Forecasting. 2nd Edition. Wiley Series in Probability and Statistics

DEMYSTIFYING BIG DATA. What it is, what it isn t, and what it can do for you.

This Symposium brought to you by

Mendham Township School District Reading Curriculum Kindergarten

The Great Debate Correlation vs Causation

IBM SPSS Direct Marketing 22

SAP Predictive Analytics Roadmap Charles Gadalla SAP SESSION CODE: #####

Appendix 2.1 Tabular and Graphical Methods Using Excel

0 Introduction to Data Analysis Using an Excel Spreadsheet

Hyper-targeted. Customer Retention with Customer360

Data Preprocessing. Week 2

Creating Online Surveys with Qualtrics Survey Tool

Shear Force and Moment Diagrams

Online Algorithms: Learning & Optimization with No Regret.

Exercise 1.12 (Pg )

Hypothesis Testing for Beginners

Wilson Area School District Planned Course Guide

Transcription:

Fitting Big Data into Business Statistics Bob Stine,

Issues from Big Data Changes to the curriculum? New skills for the four Vs of big data? volume, variety, velocity, validity Is it a zero sum game? 2

Changes to Curriculum One semester course 3

Changes to Curriculum One semester course Typical curriculum Descriptive statistics Basic probability, random variables Sampling Inference Regression Top level outline remains Retain major sequence Adapt what goes underneath Offer a few suggestions... 3

Changes: Descriptive 4

Changes: Descriptive Greater variety Social networks, text, spatial -0.0278 0.119 More categorical variables, more bins Rotated Log Component 2 Amazon visitors referred by 11,142 sites. Same chart. Richer data. 1992 2001 2011 4

Changes: Descriptive Greater variety Social networks, text, spatial -0.0278 0.119 More categorical variables, more bins Rotated Log Component 2 Amazon visitors referred by 11,142 sites. Same chart. Richer data. 1992 2001 2011 Validity: lower data quality Eg. Missing data, coding errors, wrong labels 4

Changes: Probability 5

Changes: Probability More coverage of dependence That big sample might not be so big! Credit default recession of 2008! Hurricane insurance Auto insurance 5

Changes: Probability More coverage of dependence That big sample might not be so big! Credit default recession of 2008! Hurricane insurance Auto insurance Greater awareness of multiplicity Boole s inequality, Bonferroni p P(max z >1.96) 1 5 25 100 0.05 0.23 0.72 0.99 5

Multiplicity Cartoon xkcd 6

Multiplicity Cartoon xkcd 6

Multiplicity Cartoon xkcd 6

Changes: Sampling 7

Changes: Sampling Return to designed experiments! Velocity: Detecting changes A/B testing in web design Wired, 2012 Transactional vs sampled data Big data often derive from monitoring transactions, such as billing Don t forget dependence! Big good 7

Changes: Inference What s it mean to be statistically significant? Effect size, economic value H0: µa = µb t=3.2 with p-val 0.002 8

Changes: Inference What s it mean to be statistically significant? Effect size, economic value H0: µa = µb t=3.2 with p-val 0.002 8

Changes: Inference What s it mean to be statistically significant? Effect size, economic value H0: µa = µb t=3.2 with p-val 0.002 Another reason to prefer CIs over tests? 8

Changes: Regression 9

Changes: Regression What will happen to this regression if the sample size increases from 50 to 100,000? Idealized sampling from population, just more. 9

Changes: Regression What will happen to this regression if the sample size increases from 50 to 100,000? Idealized sampling from population, just more. 9

Changes: Regression Outliers less important with more data? CLT has to work with millions! 10

Changes: Regression Outliers less important with more data? CLT has to work with millions! Example Estimating standard errors n=10,000 with 9,999 at x 0 and one at x = 1. 10

Changes: Regression Outliers less important with more data? CLT has to work with millions! Example Estimating standard errors n=10,000 with 9,999 at x 0 and one at x = 1. 10

Changes: Regression Extrapolation gets a lot easier when you have more unfamiliar variables xkcd 11

Changes: Regression Why pretend everything is linear? 12

Changes: Regression Why pretend everything is linear? Regression estimates E(Y X) Smoothing via local averages so simple 12

Changes: Regression Why pretend everything is linear? Regression estimates E(Y X) Smoothing via local averages so simple Effect Size? 12

Changes: Regression Deciding what to use in a model Wide data tables Substantive choices Business analytics? Data table does not always have what you need. 13

Changes: Regression Deciding what to use in a model Wide data tables Substantive choices Business analytics? Data table does not always have what you need. Automated search Modern versions of stepwise regression Concern Over-fitting, building on multiplicity foundation Some notion of cross validation 13

Changes: Regression Deciding what to use in a model Wide data tables Substantive choices Business analytics? Data table does not always have what you need. Automated search Modern versions of stepwise regression Concern Over-fitting, building on multiplicity foundation Some notion of cross validation Where to find the time? 13

Exploit Technology Greater reliance on software Still need basic examples, but only illustrative 14

Exploit Technology Greater reliance on software Still need basic examples, but only illustrative Automated tools Animating regression models using profile tools 14

Issues from Big Data Changes to the curriculum? Not at the top level New skills for the four Vs of big data? Yes. Examples include More categorical, multiplicity, effect sizes, model choice Does it have to be a zero sum game? Key ideas remain. Opportunity to revise and update Rich examples that span semester 15

What about... Host of other data issues Database systems, layout Information technology Computer science! Hadoop, MapReduce,... Opportunity to collaborate Change management, implementation Information systems Supply change Marketing 16

Closing Remarks Graphics remain important Maybe more important than ever Communication remains essential Having clear sense of where an analysis is going is essential with big data. Easy to get lost Business analytics 17

Closing Remarks Graphics remain important Maybe more important than ever Communication remains essential Having clear sense of where an analysis is going is essential with big data. Easy to get lost Business analytics Thanks! 17