STATISTICS for DECISION-MAKERS. P. Richard Hahn

Statistics for Decision-Makers
COPYRIGHT P. RICHARD HAHN
ISBN INFO ISBN 13:
ALL RIGHTS RESERVED

Contents

I Core Concepts

1 Exploiting statistical patterns
How to predict well on average.
1.1 Basic probability
1.2 Random variables
1.3 Expected value (averages)
1.4 Expected utility maximization
1.5 Bayes' rule: refining your reference set
1.6 The best linear predictor

2 Learning statistical patterns from data
How to make data-driven predictions.
2.1 Empirical distributions vs. "true" distributions
2.2 Estimand, estimator, estimate
2.3 Empirical utility maximization

3 Assessing sampling variability
How to judge the reliability of a data-driven prediction rule.
3.1 Sampling variation and sampling distributions
3.2 Null hypotheses
3.3 Permutation tests
3.4 Boot-strapping
3.5 Over-fitting and regularization

II Linear prediction

4 Linear regression
Finding trend lines in data.
4.1 Estimating the best linear predictor
4.2 Least-squares
4.3 R-squared
4.4 Confidence intervals (and hypothesis tests)
4.5 Data transformations

5 Multiple linear regression
Finding linear trends when there are multiple factors.
5.1 R-squared with more than one predictor
5.2 Interactions

6 Logistic regression
How to predict binary outcomes.
6.1 Link functions
6.2 Classification rules
6.3 Odds ratios and log-odds

III Beyond prediction

7 Experimental design
Guidelines for data collection.
7.1 Controlled randomized experiments
7.2 Power calculation
7.3 Controlling for confounding

8 Coping with sampling bias
How policy evaluation differs from straight prediction.
8.1 "Natural experiments" and instrumental variables
8.2 Regression discontinuity design

9 Causal regret analysis
How to make sense of statistical information for one-time decisions.

Preface

This book aims to communicate core ideas from probability and statistics (distributions, expected value, conditional probability, sampling variability and sampling bias) towards the goal of making practical use of statistical data. Readers of this book should not expect to come away with a technical understanding of how to apply modern data analytic methods to massive databases. What I do hope to deliver is a clear picture of how such methods work on a conceptual level, a flavor of the variety of situations where they might profitably be applied, and a useful mental vocabulary for thinking about the various data streams you interact with on a daily basis in your work and your life. While there is a proliferation of books documenting that individuals and institutions are using data to guide their decisions, this book aims to fill a gap by explaining the basic logic behind how exactly data ought to inform our decision making.

Outline

This book is divided into three parts, each with three chapters.

Part one presents the foundational concepts underpinning statistical data analysis. The first chapter concerns what to do when you need to make a decision based on uncertain information. Our prototypical decision will be a prediction of some sort. (Later we will consider more general decision-making scenarios.) The classic example of an applied prediction scenario would be picking stocks. You have to decide which stocks to pick, and the eventual payoff will depend on some future outcome. The key idea of this first chapter is the idea of an average. When making predictions in random environments, you can't hope to be right every time, so you have to think about selecting strategies that lead to good average performance. Accordingly, defining what "average" means is important. This first chapter is essentially a primer on the basic ideas of probability, which is a language for describing patterns that emerge when one looks at many random events in aggregate.

Chapter two is about how to find patterns that allow you to characterize randomness (more specifically, probability distributions) in processes you might care about. The whole idea of an average presumes that even random events have some structure. For example, although which specific people happen to die in car crashes in Illinois in a given year is essentially random, the total number of motor vehicle fatalities might be relatively stable from year to year. In the first chapter, we pretend such features are known to us at the outset. The second chapter turns to the problem of determining such patterns directly from data. The chapter closes by introducing the notion of a linear prediction rule, which is a powerful technique for describing relationships between two quantities (such as the price of gas in a country and that country's unemployment rate) which hold approximately.

The third chapter focuses on determining how much we should trust the patterns we find in data. For example, it might seem like higher gas prices associate strongly with higher unemployment, but is the pattern we observe real, or just a fluke?

Part two covers linear regression, which refers to the process of finding linear prediction rules from observed data. This method is the workhorse of applied statistical analysis. This part includes a chapter on how to find linear prediction rules when there are multiple factors influencing the outcome we are trying to predict (multiple linear regression), as well as a chapter that extends the basic method to predicting yes/no outcomes, such as who is going to win a (two-party) election or tonight's Bulls-Pacers game, or whether or not a given patient has diabetes.

Part three looks at how to extend these ideas beyond the pure prediction setting, to cases where we might be interested in policy or managerial interventions. It turns out that a whole separate set of delicate issues crops up when we want to mess with the system we're studying (such as the economy) rather than just passively make predictions about it. Things also get more subtle when we try to apply statistical reasoning to one-shot decisions, such as what diet you should stick to if you're pregnant. Unlike an investing strategy, most people won't face such a decision enough times to make the statistical information a reliable guide to future outcomes.

Note to the reader

Two more things. First, this book has formulas and equations here and there. I empathize with the anxiety that formulas provoke in a lot of folks. (A pet peeve of mine is when formulas are used to impress rather than to express ideas clearly and compactly.) With this common aversion in mind, I've tried to keep my equations and symbols and such to a bare minimum, but it turns out that the minimum in this case is not none. So I encourage you to face this hurdle with the knowledge that sticking with it will pay dividends. Achieving comfort with mathematical notation is challenging in much the same way that learning to play the piano or speak a foreign language is challenging, and is similarly worthwhile.

Second, this is not a textbook. It is a chatty guided tour through the key ideas underpinning data analysis for decision-making. My selection of topics, choice of examples and ordering of material are all in service of a narrative designed to make the case that statistical data analysis is 1) broadly useful and 2) not rocket science. So while much of the material will overlap with a more traditional statistics text, do not be alarmed if the territory seems markedly different from what you expected or have seen previously in a statistics book.