Getting Ready for Big Data




Implications for intro stats
Bob Stine, www-stat.wharton.upenn.edu/~stine

Change is upon us...
  Session topics
    Shifting away from classical methods
    Communication skills
    Data visualization
    Business analytics
    Predictive analytics
    Sports analytics
    Analytics in curriculum
  Rather than discuss a BA course, consider implications of big data for intro courses

Big Data?
  Examples
    Scanner data captured at retail transaction
    Credit card, financial transactions
    Health records and genetic testing
    Social media, web visits
  Characteristics
    Volume, variety, velocity, veracity
    Often not collected with statistics in mind
  Oops, we're not in Kansas anymore

Big Data Changes Things
  Huge number of observations
    All patient outcomes for a state in a year, all sales transactions, every web query
    Everything seems statistically significant: p-values of 1.0e-122
  But
    Effect size
      Substantive versus statistical significance
    Dependence
      Are those observations independent?
      Hurricane versus car insurance
      Behavior of credit markets, mortgages in 2008
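The gap between statistical and substantive significance can be shown with a small calculation. The sketch below is illustrative only: it assumes a two-sample z-test with a known common standard deviation, and the effect size (0.01 standard deviations) and sample sizes are hypothetical values, not figures from the talk.

```python
import math

def two_sample_z_test(mean_diff, sigma, n_per_group):
    """Return (z, two-sided p-value) for a difference in means between two
    equal-sized groups, assuming a known common standard deviation sigma."""
    standard_error = sigma * math.sqrt(2.0 / n_per_group)
    z = mean_diff / standard_error
    # Two-sided normal tail probability: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return z, p_value

# Hypothetical effect: the group means differ by 1% of a standard deviation.
effect = 0.01
sigma = 1.0
for n in (100, 10_000, 10_000_000):
    z, p = two_sample_z_test(effect, sigma, n)
    print(f"n per group = {n:>10,}  z = {z:6.2f}  p = {p:.3g}")
```

The effect is the same tiny difference in every row; only the sample size changes, and with it the p-value moves from unremarkable to astronomically small.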

Big Data Changes Things
  Data snooping, hypothesis discovery
    Wide data sets offer many choices
    Find important sales patterns
    Beer and diapers
    Model fits data very well
  Multiplicity
    Look for items bought together in scanner data
    1000 items produce about 500,000 pairs
    Voter surveys include 1000s of questions related to preferences
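The pair count for scanner data follows from the formula n(n - 1)/2. As a rough sketch, the code below counts the pairs for 1000 items and, under the added assumption that every pair is tested at the 0.05 level with no real association anywhere, the number of pairs expected to look significant by chance. The 0.05 threshold is an illustrative choice, not a value from the talk.

```python
import math

def count_pairs(n_items):
    """Number of unordered pairs of distinct items: n * (n - 1) / 2."""
    return math.comb(n_items, 2)

n_pairs = count_pairs(1000)
alpha = 0.05  # assumed per-test significance level (illustrative)

# If no pair has any real association, each test still has probability
# alpha of a false positive, so the expected count is alpha * n_pairs.
expected_false_positives = alpha * n_pairs

print(f"pairs among 1000 items: {n_pairs:,}")
print(f"expected chance findings at alpha={alpha}: {expected_false_positives:,.0f}")
```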

Implications for Intro Stat
  Most students will have only one or maybe two semesters of exposure to statistics
  Promotional opportunity
    Attract some to more majors
    Provide practical knowledge for others
  Address issues for big data in this context
    Dependence
    Multiplicity
    Effect size
    Others
  Zero-sum game

Getting Ready for Big Data
  Have a question to motivate, guide, and control the modeling and statistical analysis
    What question are we trying to answer?
    Too easy to spend hours wandering in big data without a clear objective
  Importance in intro courses
    Why am I doing this? Who cares? Why does this matter?
    Common metaphors: TST, MMMM

Getting Ready for Big Data
  Data is happy to generate many, many hypotheses
    Testing response to stimulus letters
    Multiplicity (simultaneous inference)
  Importance in intro courses
    Examples for regression models
    Stock market
    Simple remedies are easy to teach (e.g. Bonferroni p-values)
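A Bonferroni adjustment is short enough to show in full. The sketch below multiplies each p-value by the number of tests and caps the result at 1; the function name and the example p-values are hypothetical, chosen only for illustration.

```python
def bonferroni_adjust(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests,
    capped at 1.0. Compare the adjusted values to the usual threshold."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]

# Hypothetical p-values from three tests run on the same data set.
raw = [0.01, 0.04, 0.5]
adjusted = bonferroni_adjust(raw)
for p, p_adj in zip(raw, adjusted):
    print(f"raw p = {p:.2f}  adjusted p = {p_adj:.2f}")
```

Equivalently, one can leave the p-values alone and compare each to alpha divided by the number of tests.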

xkcd
  Others have noticed...
  Source of publication bias in journals
  Economist article

Getting Ready for Big Data
  Big Data don't always measure what you think they measure
    Units, time lags, codebooks
    Data preparation is key (95% rule)
    Mailing list example is full of these problems
  Importance in intro courses
    Give students data that is more realistic
    Missing values, vague definitions
    Too much, too soon?

Getting Ready for Big Data
  Large data sets typically gathered as part of transaction processing, not for analysis
    Repurposed accounting records
    Justify that sparkling new data warehouse
  Importance in intro courses
    Always ask: What would be the ideal data to answer my question?
    Compare that to the data that you have

Getting Ready for Big Data
  Dependence often makes large data sets much smaller
    Predicting credit behavior in US: dependent customers
    Repeated measurements (longitudinal)
    Tukey story
  Importance in intro courses
    Carefully define assumption of independent observations
    Divisor n is not number of cases, but number of independent cases
    Relevant source of variation
    Common examples: lurking variable
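One common way to quantify how dependence shrinks a data set is the design effect for equal-sized clusters, n_eff = n / (1 + (m - 1) * rho), where m is the number of records per cluster and rho is the intraclass correlation. The talk does not name this formula; it is swapped in here as an illustration, and the record counts and correlation below are hypothetical.

```python
def effective_sample_size(n_records, records_per_cluster, intraclass_corr):
    """Approximate number of independent cases when records fall into
    equal-sized clusters (e.g. repeated measurements on one customer).

    Uses the design effect 1 + (m - 1) * rho, an assumption of this sketch.
    """
    design_effect = 1.0 + (records_per_cluster - 1) * intraclass_corr
    return n_records / design_effect

# Hypothetical: 1,000,000 records, 100 per customer (10,000 customers).
n_records, per_customer = 1_000_000, 100
for rho in (0.0, 0.1, 0.5, 1.0):
    n_eff = effective_sample_size(n_records, per_customer, rho)
    print(f"rho = {rho:.1f}  effective n = {n_eff:,.0f}")
```

With rho = 0 every record counts; with rho = 1 the million records carry no more information than the 10,000 customers.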

Getting Ready for Big Data
  Results may not generalize
    On-line experiment on weekday not descriptive of weekend (can imagine other factors)
    Text model of one author not applicable to others
    Transfer learning problem
  Importance in intro courses
    Sampling from what population?
    Does same population exist?
    Population drift
    Dynamics of election polls

Place for Classical Methods
  Surveys and sampling still make sense
    Billions of credit card transactions each year
    Do you need to see them all to track prices?
    DoE analysis of prices for ethanol fuels
  Experimental design remains essential
    Hard to beat that randomized experiment
    Google ad response measurement
    Trivial to do experiment
    Generalize?
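The case for sampling can be made with the standard error of a sample mean under simple random sampling without replacement, which depends almost entirely on the sample size rather than the population size. The sketch below uses the usual finite population correction; the population size, standard deviation, and sample sizes are hypothetical values chosen for illustration.

```python
import math

def se_of_sample_mean(pop_sd, pop_size, sample_size):
    """Standard error of the mean of a simple random sample drawn without
    replacement, including the finite population correction."""
    fpc = math.sqrt((pop_size - sample_size) / (pop_size - 1))
    return pop_sd / math.sqrt(sample_size) * fpc

# Hypothetical: one billion transactions with a price SD of 20 (any currency).
population = 1_000_000_000
for n in (1_000, 10_000, 1_000_000):
    se = se_of_sample_mean(20.0, population, n)
    print(f"sample of {n:>9,}: standard error of mean price = {se:.4f}")
```

Under these assumptions a sample of 10,000 already pins the mean price down to about 0.2, a tiny fraction of the full data.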

Thanks!