Data Science and Prediction*




Vasant Dhar
Professor; Editor-in-Chief, Big Data; Co-Director, Center for Business Analytics, NYU Stern; Faculty, Center for Data Science, NYU
*Article in Communications of the ACM, Vol. 56 No. 12, December 2013
http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltext

Themes: What Makes Prediction Hard?
- Noise (physical versus social systems): "Physics has 3 theories that explain 99% of observed phenomena, whereas Psychology has 99 theories that explain 3% of observed phenomena."
- Not knowing the right question to ask (formulation): "If only you knew what to ask, I'd show you something really interesting!"
- Not having the right/enough data (observational versus experimental): "Am I looking under the lamppost for the key because?"
- Combining machine and human intelligence: "Surely human and machine intelligence can augment each other?"
- Believing in the analysis! "I don't believe the result. Go find the mistake!"
@vasantdhar @digitalarun

The Data Landscape and Applications
- Financial Markets: What will the market do tomorrow? Will the retail sector pull back within a month?
- Healthcare: Who will become sick in the near future? How will some respond to a medication?
- Marketing: Who will respond to what offer? Is a customer likely to attrit shortly?
- Social/Product Networks: Will demand for XXX go up next week given the activity of its neighbors? How should I craft my message so that it spreads through the network, i.e. where should I seed it?
- What is the sentiment in a collection of textual data? Does the sentiment have any predictive power?

Data Science and Prediction
- Data Science is the study of the generalizable extraction of knowledge from data.*
- A key epistemic requirement for new knowledge (and its "actionability") is its ability to predict and not just explain.
*Dhar, V., Data Science and Prediction, Communications of the ACM, Vol. 56 No. 12, December 2013. http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltex

1. Noise
- Physical Systems: theory is expected to be "complete".
- Social/Health Systems: incomplete models intended to be partial approximations of reality, often based on assumptions of human behavior known to be simplistic.

What is Noise Anyway? How would you order these on the continuum?

Skill and Luck in Baseball: Batting Avg, Singles, and Strikeout YoY Correlation

Baseball Metrics By Skill and Luck

Is This Ordering Credible?
- Reversion to the mean exists in activities that combine skill and luck.
- It is useful to know where the problem lies on the continuum above.
- Knowing where we lie in the continuum allows us to anticipate outcomes.
- Illusion of control is a factor in luck situations!

Disentangling Skill and Luck: The Formula
Variance(observed) = Variance(skill) + Variance(luck)
Variance(skill) = Variance(observed) - Variance(luck)
- Variance of winning percentages of teams
- For win/loss outcomes, stdev = sqrt(p(1-p)/n), where p = probability of the outcome (i.e. a win) and n = number of cases (i.e. games)
- This is observable from the data
- Depends on sample size
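The decomposition above can be tried on a small example. The sketch below is illustrative only: the win percentages, the 162-game season, and the function name decompose_variance are assumptions made for the example, not data or code from the talk. It takes the luck variance to be p(1-p)/n with p = 0.5, the variance of a team's winning percentage if every game were a coin flip.

```python
from statistics import pvariance


def decompose_variance(win_pcts, games_per_team, p=0.5):
    """Split the observed variance of win percentages into luck and skill parts.

    Assumes each game is an independent win/loss trial with win probability p,
    so the luck-only variance of a win percentage over n games is p*(1-p)/n.
    """
    var_observed = pvariance(win_pcts)        # observable from the data
    var_luck = p * (1 - p) / games_per_team   # depends on sample size
    var_skill = var_observed - var_luck       # what is left over
    return var_observed, var_luck, var_skill


# Hypothetical league: seven teams, 162 games each (made-up numbers).
win_pcts = [0.62, 0.56, 0.52, 0.50, 0.48, 0.44, 0.38]
var_observed, var_luck, var_skill = decompose_variance(win_pcts, games_per_team=162)

print(f"observed={var_observed:.5f} luck={var_luck:.5f} skill={var_skill:.5f}")
print(f"share of variance attributable to luck: {var_luck / var_observed:.0%}")
```

With fewer games per team the luck term grows, so the same spread of win percentages would say less about skill.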

Prediction in Noisy Domains (Markets)*
[Figure: actuals versus predictions from a model, with "the more important predictions" marked in two regions.]
*From: Dhar, V., Prediction in financial markets: The case for small disjuncts, ACM Transactions on Intelligent Systems and Technology, Volume 2, No. 3, April 2011

2. Asking the Right Question
- Patterns emerge before reasons for them become apparent.
- Asking the right question is therefore critical: "If only you knew what question to ask me, I'd give you very interesting answers from the data."
- Keep moving on? Dig for causality?

What is the Right Question Here?*
[Figure: timeline (TIME) spanning clean period, diagnosis, and outcome period.]
- Are complications associated with the yellow meds?
- Or with the gray meds?
- Or the yellows in the absence of the blues?
- Or is it more than three yellows or three blues?
- Or is it the greens in quick succession?
- Or does it have to do with lifestyle choices? (i.e. bias? Gather more data?)
*Dhar, V., Data Science and Prediction, Communications of the ACM, Vol. 56 No. 12, December 2013. http://cacm.acm.org/magazines/2013/12/169933-data-science-and-prediction/fulltex

High Level View of Model Discovery
[Figure: a population of decision rules or trading strategies (i.e. the "question") evolves over iterations. At each step the better part breeds and the worse part drops out. Plot: solution quality (best, average, worst) versus iterations.]

Solutions Can Represent Arbitrary Data Structures
- Arrays and sequences, e.g. 1 0 0 1 1
- Trees, e.g. an expression tree over a, b, c with the operators + and -
- Boolean expressions, e.g. X1 > 0 AND X1 < 0.8 (i.e. X1 between 0 and 0.8)
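The "better part breeds, worse part drops out" loop over array-encoded solutions can be sketched as a small evolutionary search. This is a minimal illustration under stated assumptions, not the procedure used in the talk: the function evolve, the bit-string encoding, the one-point crossover, the mutation rate, and the toy fitness function are all hypothetical choices.

```python
import random


def evolve(fitness, n_bits=5, pop_size=20, iterations=30, mutation_rate=0.1, seed=0):
    """Evolve bit-string candidates: the better half breeds, the worse half drops out."""
    rng = random.Random(seed)
    population = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # (best, average, worst) solution quality per iteration
    for _ in range(iterations):
        population.sort(key=fitness, reverse=True)
        scores = [fitness(c) for c in population]
        history.append((scores[0], sum(scores) / len(scores), scores[-1]))
        survivors = population[: pop_size // 2]  # worse part drops out
        children = []
        while len(survivors) + len(children) < pop_size:
            mom, dad = rng.sample(survivors, 2)  # better part breeds
            cut = rng.randint(1, n_bits - 1)     # one-point crossover
            child = mom[:cut] + dad[cut:]
            child = [1 - b if rng.random() < mutation_rate else b for b in child]
            children.append(child)
        population = survivors + children
    return max(population, key=fitness), history


# Toy fitness (an assumption): number of positions matching a hidden target rule.
target = [1, 0, 0, 1, 1]


def toy_fitness(candidate):
    return sum(1 for a, b in zip(candidate, target) if a == b)


best, history = evolve(toy_fitness)
print("best candidate:", best, "quality:", toy_fitness(best))
```

Because the surviving half is carried over unchanged, the best quality recorded in history never decreases from one iteration to the next. In a real application the fitness function would score a decision rule or trading strategy on data, and it should be evaluated out of sample.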

An Interesting Pattern?

3. Observational Data and Acquiring New Data
- Observational data may answer questions without explicitly asking anyone anything!
- It may also require an understanding of how the data are being generated.
- Are there natural experiments that are reflected in the data, or are the data somehow biased through self-selection?
- Is it possible to run natural experiments to get additional data to answer the question?

Have We Collected The Right Data?

Is More Always Better?*
- Is more data always better?
- How much should you pay for external data?**
*See Provost et al. article in Big Data journal (volume 2, issue 1, 2014)
**See Dalessandro et al. article in Big Data journal (volume 2, issue 2, 2014)

4. Prediction as Epistemic Criterion
In assessing whether new knowledge is actionable for decision making, the criterion is predictive power, not just the ability to explain the past.

Validation Framework for Prediction

                        Across Time: NO                  Across Time: YES
Across Universe: NO     Out of sample                    Out of sample, out of time
Across Universe: YES    Out of sample, out of universe   Out of sample, out of time, out of universe
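One way to make the four validation cells concrete is to partition records by a time cutoff and by membership in the training universe. The sketch below is an assumption-laden illustration: the record schema (entity and time keys), the function name validation_splits, the holdout fraction, and the toy data are hypothetical and do not come from the talk.

```python
import random


def validation_splits(records, cutoff_time, train_universe, holdout_fraction=0.3, seed=0):
    """Partition records into train plus the four kinds of test sets.

    records: list of dicts with 'entity' and 'time' keys (hypothetical schema).
    """
    in_scope = []
    splits = {"out_of_time": [], "out_of_universe": [], "out_of_time_and_universe": []}
    for r in records:
        later = r["time"] > cutoff_time
        outside = r["entity"] not in train_universe
        if later and outside:
            splits["out_of_time_and_universe"].append(r)
        elif later:
            splits["out_of_time"].append(r)
        elif outside:
            splits["out_of_universe"].append(r)
        else:
            in_scope.append(r)
    # Same period, same universe: hold out a random part as the plain out-of-sample set.
    random.Random(seed).shuffle(in_scope)
    n_holdout = int(len(in_scope) * holdout_fraction)
    splits["out_of_sample"] = in_scope[:n_holdout]
    splits["train"] = in_scope[n_holdout:]
    return splits


# Toy data: two entities observed at times 0..9; train only on entity "A" up to time 6.
records = [{"entity": e, "time": t} for e in ("A", "B") for t in range(10)]
splits = validation_splits(records, cutoff_time=6, train_universe={"A"})
for name, rows in splits.items():
    print(name, len(rows))
```

A model that holds up only on the plain out-of-sample set, but not out of time or out of universe, has passed the weakest of the four tests.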

Prediction vs. Explanation: Varying Model Complexity
[Figure: Future Performance Versus Complexity. x-axis: complexity (0 to 1); y-axis: future performance. Annotations: a zone with (small) correlation between in-sample and out-of-sample performance, and an overfitting region.]
Note: Complexity measure is illustrative only, based on the learning method used.
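The gap between in-sample fit and future performance can be reproduced with a toy experiment. The sketch below is not the learning method behind the figure: it uses k-nearest-neighbor regression on synthetic data and treats smaller k as higher complexity, and all names and parameters are assumptions chosen for illustration.

```python
import math
import random


def knn_predict(train, x, k):
    """Average the targets of the k training points closest to x."""
    nearest = sorted(train, key=lambda pt: abs(pt[0] - x))[:k]
    return sum(y for _, y in nearest) / k


def mse(train, points, k):
    """Mean squared error of k-NN predictions on the given points."""
    return sum((knn_predict(train, x, k) - y) ** 2 for x, y in points) / len(points)


rng = random.Random(0)


def noisy(x):
    # Synthetic signal plus noise (an assumption made for this example).
    return math.sin(2 * math.pi * x) + rng.gauss(0, 0.3)


train = [(i / 40, noisy(i / 40)) for i in range(40)]
test = [((i + 0.5) / 40, noisy((i + 0.5) / 40)) for i in range(40)]

# Smaller k means a more flexible (more complex) model.
for k in (40, 20, 10, 5, 3, 1):
    print(f"k={k:2d} in-sample={mse(train, train, k):.3f} out-of-sample={mse(train, test, k):.3f}")
```

At k = 1 the model reproduces the training data exactly, so in-sample error is zero while out-of-sample error is not. In-sample fit therefore says little about future performance at the complex end of the scale.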

Prediction Errors and Big Data*
Sources of error in predictive models:
1. Mis-specification of the model. Big data admits a larger space of functional forms.
2. Using a sample to estimate the model. With big data, the sample is a good estimate of the population.
3. Randomness.
Predictive modeling attempts to minimize the combination of the first two errors.
*Adapted from Shmueli, G. To explain or to predict? Statistical Science 25, 3 (Aug. 2010), 289-310.
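One standard way to formalize the three sources of error is the bias-variance decomposition of expected squared error. The notation is an assumption added here, not taken from the slide: y = f(x) + ε with E[ε] = 0 and Var(ε) = σ², and f̂ is the model estimated from a sample.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{1. mis-specification (bias}^2\text{)}}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{2. estimation from a sample (variance)}}
  + \underbrace{\sigma^2}_{\text{3. randomness}}
```

Only the first two terms depend on the modeling choices. The third is a floor that no model can remove.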

Making Predictions Usable in Noisy Domains*
- Why bother making predictions, i.e. with so many false positives?
[Figure: ROC Curve For Diabetes Prediction. x-axis: False Positive Rate (FPR), 0 to 1; y-axis: True Positive Rate (TPR), 0 to 1.]
- It all depends on the costs of being wrong and the aggregate pickup.*
*From Big Data and Predictive Analytics in Healthcare, Big Data Journal, Volume 2, Number 3, Sep 2014 http://online.liebertpub.com/doi/pdfplus/10.1089/big.2014.1525
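The point that usefulness depends on the costs of being wrong can be illustrated by picking the operating point on an ROC curve that minimizes total cost. The scores, labels, cost values, and function names below are made up for the example and are not from the cited study.

```python
def roc_points(scores, labels):
    """Return (threshold, FPR, TPR) for each distinct score used as a cutoff."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= thr and y == 0)
        points.append((thr, fp / neg, tp / pos))
    return points


def cheapest_threshold(scores, labels, cost_fp, cost_fn):
    """Pick the cutoff whose false positives and false negatives cost the least."""
    pos = sum(labels)
    neg = len(labels) - pos
    best = None
    for thr, fpr, tpr in roc_points(scores, labels):
        cost = cost_fp * fpr * neg + cost_fn * (1 - tpr) * pos
        if best is None or cost < best[1]:
            best = (thr, cost)
    return best


# Hypothetical model scores and true outcomes (1 = condition present).
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0, 1, 0]

# Missing a case is 10x worse than a false alarm: a low cutoff is cheapest.
print(cheapest_threshold(scores, labels, cost_fp=1, cost_fn=10))
# A false alarm is 10x worse than a miss: a high cutoff is cheapest.
print(cheapest_threshold(scores, labels, cost_fp=10, cost_fn=1))
```

The same ROC curve yields very different operating points under the two cost assumptions, which is why a model with many false positives can still be worth using.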

If Patterns Emerge Before Reasons Are Apparent...
- does it imply you're better off rebuilding the model periodically?
- when is it necessary to understand the reasons for the patterns (i.e. causality)?

If Prediction is Important...
- is problem formulation different than if the purpose is explanation?
- is it about the right tradeoff between the structure of the problem and the complexity of the model?
- how do you ensure against common mistakes such as data snooping or leakage in building predictive models?
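One common form of leakage is letting future data influence preprocessing. The sketch below is a minimal illustration, with made-up numbers and hypothetical helper names: it fits standardization statistics once on the training period only, and once on training plus future data.

```python
from statistics import mean, pstdev


def fit_scaler(values):
    """Learn standardization statistics from the given values only."""
    return mean(values), pstdev(values)


def apply_scaler(values, scaler):
    """Standardize values with previously learned statistics."""
    mu, sigma = scaler
    return [(v - mu) / sigma for v in values]


# Hypothetical feature values: a training period and a later period.
train = [1.0, 2.0, 3.0, 4.0]
future = [10.0, 12.0]

clean = fit_scaler(train)           # uses only information available at training time
leaky = fit_scaler(train + future)  # leakage: statistics have seen the future

print("clean scaler:", clean)
print("leaky scaler:", leaky)
print("future values under clean scaling:", apply_scaler(future, clean))
```

The leaky statistics already reflect the later regime, so any backtest built on them would overstate how well the model handles data it has not seen.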

Transparency in Complex Systems
- How can you describe the learned model?
- Is it possible to envision when the model will and won't perform well?

What's the Explanation? What happened here and why?

Current Areas of Inquiry with Big Data
- Unstructured data
- WATSON: leveraging human-curated data and AI to synthesize answers
- Deciding what data to keep, aggregate, etc.
- New risk factors based on text and other data at higher frequency
- News and social media
- Digital IQ of companies
- Understanding the relationship between noise, system behavior, expected predictive accuracy, and performance bounds
- Understanding how actionable results will change future data, making future insights harder and less actionable!

Thank You! @vasantdhar @digitalarun