Opportunities and Limitations of Big Data
Karl Schmedders
University of Zurich and Swiss Finance Institute
«Big Data: Little Ethics?» HWZ-Darden-Conference, June 4, 2015

On fortune.com this morning:
Apple's Tim Cook launches blistering attack on Facebook and Google
- Blistering speech stresses Apple as a company that doesn't want your data.
Source: https://fortune.com/2015/06/03/tim-cook-attacks-facebook-googlegovernment-privacy-speech/
Old Rules vs. Modern Myths
- Sample Size vs. The n=all myth
- Causation vs. Correlation vs. The End of Theory myth
- Model Fitting vs. Machine Learning dangers

Tracking Influenza with Google Flu Trends
- Google Flu Trends: Track the spread of influenza across the US by analyzing the top 50 million search terms
- Centers for Disease Control and Prevention: Track the spread of influenza across the US by analyzing the reports from doctors
Advantages
- Google is much faster than the CDC
  - Google Flu Trends: 1 day
  - CDC: > 1 week
- Flu Trends based on millions of people (n=all, n=infinity)
- Approach is quick, accurate, cheap and theory-free

When Google got flu wrong (Nature, 2013)
- Massive overestimation of flu season 2012/13
- Earlier problems in 2009
What went wrong?
- Widespread media coverage of the severity of the US flu season
- Reports may have triggered many flu-related searches by people who were not ill
- Well-known old problem: wrong sample
- Sample Bias

Classical Example: U.S. Presidential Election 1936
- Forecasts of election outcomes: The Literary Digest (sample: 2.4 million people) vs. George Gallup (sample: 3,000 people)
- Prediction Literary Digest: 43% Roosevelt
- Prediction George Gallup: 61% Roosevelt
- Actual outcome: 61% Roosevelt
- The Literary Digest sent out forms to people from a list of automobile registrations and telephone directories
- Who had a phone in 1936?
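The 1936 example can be imitated with a small simulation. This is only a sketch under invented assumptions: the share of phone/car owners (25%) and the support rates in each group (43% and 67%) are hypothetical values chosen so that the overall share lands near the 61% on the slide; they are not historical data. The function name `simulate_polls` is also just illustrative.

```python
import random

def simulate_polls(seed=0, population_size=200_000,
                   biased_n=40_000, random_n=3_000):
    """Compare a large biased sample with a small random sample.

    All rates below are hypothetical and only serve the illustration.
    """
    rng = random.Random(seed)

    # Assumed population: 25% own a phone or car; owners support the
    # candidate at 43%, everyone else at 67% (overall about 61%).
    population = []
    for _ in range(population_size):
        owner = rng.random() < 0.25
        supports = rng.random() < (0.43 if owner else 0.67)
        population.append((owner, supports))

    true_share = sum(s for _, s in population) / population_size

    # Biased poll: only owners can be reached, no matter how many
    # forms are mailed out.
    owners = [s for o, s in population if o]
    biased_sample = rng.sample(owners, biased_n)
    biased_estimate = sum(biased_sample) / biased_n

    # Small poll drawn uniformly at random from the whole population.
    random_sample = rng.sample(population, random_n)
    random_estimate = sum(s for _, s in random_sample) / random_n

    return true_share, biased_estimate, random_estimate

if __name__ == "__main__":
    true_share, biased, small = simulate_polls()
    print(f"true share:              {true_share:.3f}")
    print(f"biased sample (40,000):  {biased:.3f}")
    print(f"random sample (3,000):   {small:.3f}")
```

Under these assumptions the large sample stays far from the true share however big it gets, while the small random sample lands close to it: sample size does not cure sample bias.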
Still relevant today
Why were the Israeli election polls so wrong?
http://edition.cnn.com/2015/03/18/middleeast/israel-election-polls/
"The Internet does not represent the state of Israel and the people of Israel," he said, referring to modern statistical methods. "It represents panels, and the panels are biased strongly to the center -- Tel Aviv, better-educated, more participants in this kind of conversation." (Avi Degani, Tel Aviv University)

Why Sampling is Still Very Important
- Self-reported user data is often a biased sample
- Growth in noise is swamping the signal that businesses hope to find in the data
- n = all (or n = infinity) is wishful thinking
[Diagram: Target Population, Available Data, Sample]
- Objective: Unbiased analysis
- Problem: User data likely biased
- Solution:
  - Do NOT use all your user data (may be too much anyway)
  - Sample from available data
  - Check for representativeness

Old Rules vs. Modern Myths
- Sample Size vs. The n=all myth
- Causation vs. Correlation vs. The End of Theory myth
- Model Fitting vs. Machine Learning dangers
The End of Theory?
- "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" (Chris Anderson, Wired Magazine, 2008)
- "All models are wrong, and increasingly you can succeed without them." (Peter Norvig, Google's research director)
- "[F]aced with massive data, this approach to science -- hypothesize, model, test -- is becoming obsolete."

Did you really mean that?
- "There is now a better way. Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot."
- Referring to "The End of Theory": "This revolutionary notion has now entered not just the popular imagination, but also the research practices of corporations, states, journalists and academics." (Mark Graham, The Guardian, 2012)
Spurious Correlation
Source: Correlation or Causation (Business Week, 2011)

And there is much more
- Super Bowl and Dow Jones Industrial Average: in 1979-89 and 1990-98 the Super Bowl correctly forecasted the sign of the annual DJIA return
- Hot-hand fallacy: "The Hot Hand in Basketball: On the Misperception of Random Sequences." Gilovich, Tversky, Vallone. Cognitive Psychology, 1985.
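How easily a search through enough unrelated series produces a striking correlation can be sketched as follows. This is an illustrative simulation, not the analysis behind the Business Week chart: the series are pure Gaussian noise, and the names `pearson` and `best_spurious_correlation` as well as the defaults (10 yearly observations, 10,000 candidate series) are assumptions made for the example.

```python
import random
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

def best_spurious_correlation(n_candidates=10_000, n_years=10, seed=1):
    """Correlate one noise series with many others and keep the strongest.

    Every series is independent noise, so any correlation found is spurious.
    """
    rng = random.Random(seed)
    target = [rng.gauss(0, 1) for _ in range(n_years)]
    best = 0.0
    for _ in range(n_candidates):
        candidate = [rng.gauss(0, 1) for _ in range(n_years)]
        r = pearson(target, candidate)
        if abs(r) > abs(best):
            best = r
    return best

if __name__ == "__main__":
    for k in (10, 100, 10_000):
        r = best_spurious_correlation(n_candidates=k)
        print(f"{k:>6} candidates: strongest correlation {r:+.2f}")
```

The more candidate series are searched, the stronger the best match becomes, although none of them has anything to do with the target. Without a model there is no way to tell such a match from a meaningful one.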
Lack of models
- Without a model we cannot distinguish between spurious and meaningful correlations
- Lack of models makes data less useful than it might be; big data insights will be limited
- Nate Silver describes "The End of Theory" as "categorically the wrong attitude"
- Sound theoretical understanding of statistical application is absolutely necessary

Old Rules vs. Modern Myths
- Sample Size vs. The n=all myth
- Causation vs. Correlation vs. The End of Theory myth
- Model Fitting vs. Machine Learning dangers
Top 500 supercomputers are getting faster ...
Source: http://top500.org/statistics/perfdevel/

... while storage of data is getting cheaper
Source: http://www.jcmit.com/disk2012.htm
Big Data and Business Analytics
- More storage
- More data
- Faster computers
- New sophisticated methods

Aside: Reproducibility of Statistical Results
- Numerous cases in the biomedical field of statistical results from clinical trials that cannot be reproduced in separate studies
- In 2011, Bayer researchers reported that they were able to reproduce the results of only 17 of 67 published studies they examined
- In 2012, Amgen researchers reported that they were able to reproduce the results of only 6 of 53 published cancer studies
- In 2014, a review of Tamiflu found that while it made flu symptoms disappear a bit sooner, it did not stop serious complications or keep people out of the hospital
Fundamental Flaw
- Publication of only successful trials introduces bias
- Success in study may have been purely accidental
- With enough trials, at least one will sooner or later be successful

How to become a Guru to some Investors ...
- Financial advisor sends letters to 10,240 = 10 x 2^10 potential clients, with half (5,120) predicting a particular stock will go up, and the other half predicting it will go down
- One month later, the advisor sends letters only to the 5,120 investors who were previously sent the correct prediction, with half (2,560) letters predicting a certain security will go up, and the other half predicting it will go down
- The advisor continues this process for 10 months.
... in a fraudulent way
- Ten investors will have been sent ten consecutive correct predictions!
- They may be so impressed by the advisor's ten consecutive spot-on predictions that they will entrust to him/her all of their assets...
- Final ten investors are unaware of the thousands of wrong predictions

Statistical Overfitting
- "I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (Enrico Fermi)
- "If you torture the data long enough, it will confess." (Ronald H. Coase)
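The arithmetic of the guru scheme can be checked in a few lines. The sketch below only replays the halving described on the slides; the function name `guru_scheme` is made up for the example.

```python
def guru_scheme(initial_recipients=10_240, months=10):
    """Number of recipients who have so far seen only correct predictions.

    Each month half of the remaining recipients are told "up" and half
    "down", so exactly one half receives the correct prediction.
    """
    survivors = initial_recipients
    history = []
    for _ in range(months):
        survivors //= 2
        history.append(survivors)
    return history

if __name__ == "__main__":
    history = guru_scheme()
    for month, count in enumerate(history, start=1):
        print(f"after prediction {month:2d}: {count:5d} recipients with a perfect record")
    print("recipients who saw at least one wrong prediction:", 10_240 - history[-1])
```

After ten rounds, 10,240 / 2^10 = 10 investors remain, and 10,230 others have seen at least one wrong prediction. The ten see a perfect track record that required no forecasting skill at all.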
Backtest Overfitting in Finance
- Backtesting of an investment strategy: use historical market data to assess performance
- Backtest overfitting: develop a trading strategy ("model") that is sufficiently complex and has many degrees of freedom to fit the data almost perfectly
- Computers can analyze millions or billions of variations of a strategy, so sooner or later you will find a great match

Strategy vs. Underlying Asset (Pseudo random)
Source: http://datagrid.lbl.gov/backtest/ (Thanks to David H. Bailey)
After Iteration 23

After Iteration 148

After Iteration 3496

After convergence
... and on a new data set
Source: http://datagrid.lbl.gov/backtest/ (Thanks to David H. Bailey)

Too much of a good thing?
"[C]omputers operating on big data can generate nonsense faster than ever before!" (David H. Bailey, LBNL & UC Davis)
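The effect in the backtest slides can be imitated with a rough sketch. It is not the optimizer behind the datagrid.lbl.gov demo: instead of iteratively tuning one strategy, it tries many random long/short strategies (plain random search) on pseudo-random returns, keeps the best in-sample performer, and then applies it to fresh data. The function name `backtest_overfit_demo` and all parameter values are assumptions for the illustration.

```python
import random

def backtest_overfit_demo(n_strategies=2_000, n_days=250, seed=42):
    """Select the best of many random strategies in-sample, then test it
    on new data. Returns (in-sample return, out-of-sample return)."""
    rng = random.Random(seed)

    # Pseudo-random daily returns with zero mean: there is nothing to learn.
    in_sample = [rng.gauss(0, 0.01) for _ in range(n_days)]
    out_of_sample = [rng.gauss(0, 0.01) for _ in range(n_days)]

    def total_return(positions, returns):
        # Sum of daily returns earned by being long (+1) or short (-1).
        return sum(p * r for p, r in zip(positions, returns))

    best_positions, best_in = None, float("-inf")
    for _ in range(n_strategies):
        # A "strategy" here is just a random long/short position per day.
        positions = [rng.choice((-1, 1)) for _ in range(n_days)]
        performance = total_return(positions, in_sample)
        if performance > best_in:
            best_positions, best_in = positions, performance

    best_out = total_return(best_positions, out_of_sample)
    return best_in, best_out

if __name__ == "__main__":
    best_in, best_out = backtest_overfit_demo()
    print(f"best strategy, backtest (in-sample): {best_in:+.2%}")
    print(f"same strategy on a new data set:     {best_out:+.2%}")
```

Because the returns are noise, the impressive backtest result comes entirely from selecting among many variations, and there is no reason for it to carry over to the new data set.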
Old Rules vs. Modern Myths
- Sample Size vs. The n=all myth
- Causation vs. Correlation vs. The End of Theory myth
- Model Fitting vs. Machine Learning dangers