STATISTICS for DECISION-MAKERS
P. Richard Hahn

Contents

I Core Concepts

1 Exploiting statistical patterns
How to predict well on average.
    1.1 Basic probability
    1.2 Random variables
    1.3 Expected value (averages)
    1.4 Expected utility maximization
    1.5 Bayes rule: refining your reference set
    1.6 The best linear predictor

2 Learning statistical patterns from data
How to make data-driven predictions.
    2.1 Empirical distributions vs. "true" distributions
    2.2 Estimand, estimator, estimate
    2.3 Empirical utility maximization

3 Assessing sampling variability
How to judge the reliability of a data-driven prediction rule.
    3.1 Sampling variation and sampling distributions
    3.2 Null hypotheses
    3.3 Permutation tests
    3.4 Boot-strapping
    3.5 Over-fitting and regularization

II Linear prediction

4 Linear regression
Finding trend lines in data.
    4.1 Estimating the best linear predictor
    4.2 Least-squares
    4.3 R-squared
    4.4 Confidence intervals (and hypothesis tests)
    4.5 Data transformations

5 Multiple linear regression
Finding linear trends when there are multiple factors.
    5.1 R-squared with more than one predictor
    5.2 Interactions

6 Logistic regression
How to predict binary outcomes.
    6.1 Link functions
    6.2 Classification rules
    6.3 Odds ratios and log-odds

III Beyond prediction

7 Experimental design
Guidelines for data collection.
    7.1 Controlled randomized experiments
    7.2 Power calculation
    7.3 Controlling for confounding

8 Coping with sampling bias
How policy evaluation differs from straight prediction.
    8.1 "Natural experiments" and instrumental variables
    8.2 Regression discontinuity design

9 Causal regret analysis
How to make sense of statistical information for one-time decisions.

Preface

This book aims to communicate core ideas from probability and statistics (distributions, expected value, conditional probability, sampling variability and sampling bias) towards the goal of making practical use of statistical data. Readers of this book should not expect to come away with a technical understanding of how to apply modern data analytic methods to massive databases. What I do hope to deliver is a clear picture of how such methods work on a conceptual level, a flavor of the variety of situations where they might profitably be applied, and a useful mental vocabulary for thinking about the various data streams you interact with on a daily basis in your work and your life. While there is a proliferation of books documenting that individuals and institutions are using data to guide their decisions, this book aims to fill a gap in explaining the basic logic behind how exactly data ought to inform our decision making.

Outline

This book is divided into three parts, each with three chapters.

Part one presents the foundational concepts underpinning statistical data analysis. The first chapter concerns what to do when you need to make a decision based on uncertain information. Our prototypical decision will be a prediction of some sort. (Later we will consider more general decision-making scenarios.) The classic example of an applied prediction scenario would be picking stocks: you have to decide which stocks to pick, and the eventual payoff will depend on some future outcome. The key idea of this first chapter is the idea of an average. When making predictions in random environments, you can't hope to be right every time, so you have to think about selecting strategies that lead to good average performance. Accordingly, defining what "average" means is important. This first chapter is essentially a primer on the basic ideas of probability, which is a language for describing patterns that emerge when one looks at many random events in aggregate.

Chapter two is about how to find patterns that allow you to characterize randomness (more specifically, probability distributions) in processes you might care about. The whole idea of an average presumes that even random events have some structure. For example, although which specific people happen to die in car crashes in Illinois in a given year is essentially random, the total number of motor vehicle fatalities might be relatively stable from year to year. In the first chapter, we pretend such features are known to us at the outset. The second chapter turns to the problem of determining such patterns directly from data. The chapter closes by introducing the notion of a linear prediction rule, which is a powerful technique for describing relationships between two quantities (such as the price of gas in a country and that country's unemployment rate) that hold approximately.

The third chapter focuses on determining how much we should trust the patterns we find in data. For example, it might seem like higher gas prices are strongly associated with higher unemployment, but is the pattern we observe real, or just a fluke?

Part two covers linear regression, which refers to the process of finding linear prediction rules from observed data. This method is the workhorse of applied statistical analysis. This part includes a chapter on how to find linear prediction rules when there are multiple factors influencing the outcome we are trying to predict (multiple linear regression), as well as a chapter that extends the basic method to predicting yes/no outcomes, such as who is going to win a (two-party) election or tonight's Bulls-Pacers game, or whether or not a given patient has diabetes.

Part three looks at how to extend these ideas beyond the pure prediction setting, where we might be interested in policy/managerial interventions. It turns out that a whole separate set of delicate issues crops up when we want to mess with the system we're studying (such as the economy) rather than just passively make predictions about it. Things also get more subtle when we try to apply statistical reasoning to one-shot decisions, such as what diet you should stick to if you're pregnant. Unlike an investing strategy, most people won't face such a decision enough times to make the statistical information a reliable guide to future outcomes.

Note to the reader

Two more things. First, this book has formulas and equations here and there. I empathize with the anxiety that formulas provoke in a lot of folks. (A pet peeve of mine is when formulas are used to impress rather than to express ideas clearly and compactly.) With this common aversion in mind, I've tried to keep my equations and symbols and such to a bare minimum, but it turns out that the minimum in this case is not none. So I encourage you to face this hurdle with the knowledge that sticking with it will pay dividends. Achieving comfort with mathematical notation is challenging in much the same way that learning to play the piano or speak a foreign language is challenging, and is similarly worthwhile.

Second, this is not a textbook. It is a chatty guided tour through the key ideas underpinning data analysis for decision-making. My selection of topics, choice of examples and ordering of material are all in service of a narrative designed to make the case that statistical data analysis is 1) broadly useful and 2) not rocket science. So while much of the material will overlap with a more traditional statistics text, do not be alarmed if the territory seems markedly different from what you expected or have seen previously in a statistics book.