Opportunities and Limitations of Big Data




Karl Schmedders
University of Zurich and Swiss Finance Institute
«Big Data: Little Ethics?» HWZ-Darden-Conference, June 4, 2015

On fortune.com this morning:
Apple's Tim Cook launches blistering attack on Facebook and Google
Blistering speech stresses Apple as a company that doesn't want your data.
Source: https://fortune.com/2015/06/03/tim-cook-attacks-facebook-googlegovernment-privacy-speech/

Old Rules vs. Modern Myths
Sample Size vs. the "n=all" myth
Causation vs. Correlation vs. the "End of Theory" myth
Model Fitting vs. Machine Learning dangers

Tracking Influenza with Google Flu Trends
Google Flu Trends: tracks the spread of influenza across the US by analyzing the top 50 million search terms
Centers for Disease Control and Prevention (CDC): tracks the spread of influenza across the US by analyzing reports from doctors

Advantages
Google is much faster than the CDC
Google Flu Trends: 1 day
CDC: > 1 week
Flu Trends is based on millions of people (n=all, n=infinity)
The approach is quick, accurate, cheap and theory-free

When Google got flu wrong (Nature, 2013)
Massive overestimation of the 2012/13 flu season
Earlier problems in 2009

What went wrong?
Widespread media coverage of the severity of the US flu season
Reports may have triggered many flu-related searches by people who were not ill
Well-known old problem: wrong sample
Sample Bias

Classical Example: U.S. Presidential Election 1936
Forecasts of election outcomes: The Literary Digest (sample: 2.4 million people) vs. George Gallup (sample: 3,000 people)
Prediction Literary Digest: 43% Roosevelt
Prediction George Gallup: 61% Roosevelt
Actual outcome: 61% Roosevelt
The Literary Digest sent out forms to people from a list of automobile registrations and telephone directories
Who had a phone in 1936?
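The 1936 comparison can be illustrated with a small simulation. This is only a sketch under invented assumptions: the population size, the probabilities of appearing on a phone/automobile list (0.25 for supporters, 0.55 for non-supporters), and the function name simulate_polls are all hypothetical and not taken from the slides. Only the 61% support level and the 3,000-person random sample come from the example above.

```python
import random


def simulate_polls(population_size=200_000, true_support=0.61,
                   big_sample=50_000, small_sample=3_000, seed=0):
    """Compare a large biased sample with a small random sample."""
    rng = random.Random(seed)

    # Hypothetical population: each voter is a (supporter, on_list) pair.
    population = []
    for _ in range(population_size):
        supporter = rng.random() < true_support
        # Assumption: supporters are less likely to own a phone or a car,
        # so they are under-represented on the mailing list.
        on_list = rng.random() < (0.25 if supporter else 0.55)
        population.append((supporter, on_list))

    # Large poll: drawn only from people who appear on the list.
    listed = [supporter for supporter, on_list in population if on_list]
    biased = rng.sample(listed, min(big_sample, len(listed)))

    # Small poll: drawn uniformly at random from the whole population.
    everyone = [supporter for supporter, _ in population]
    unbiased = rng.sample(everyone, small_sample)

    return {
        "truth": sum(everyone) / population_size,
        "large_biased": sum(biased) / len(biased),
        "small_random": sum(unbiased) / len(unbiased),
    }


if __name__ == "__main__":
    for name, share in simulate_polls().items():
        print(f"{name}: {share:.3f}")
```

Under these assumptions the large sample misses the true share by a wide margin no matter how many forms are returned, while the small random sample lands close to it. Sample size does not repair a biased sampling frame.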

Still relevant today
Why were the Israeli election polls so wrong?
http://edition.cnn.com/2015/03/18/middleeast/israel-election-polls/
"The Internet does not represent the state of Israel and the people of Israel," he said, referring to modern statistical methods. "It represents panels, and the panels are biased strongly to the center -- Tel Aviv, better-educated, more participants in this kind of conversation." (Avi Degani, Tel Aviv University)

Why Sampling is Still Very Important
Self-reported user data is often a biased sample
Growth in noise is swamping the signal that businesses hope to find in the data
"n = all" (or "n = infinity") is wishful thinking

Objective: Unbiased analysis
Problem: User data likely biased
Solution: Do NOT use all your user data (may be too much anyway)
Sample from available data
Check for representativeness
[Diagram labels: Available Data, Sample, Target Population]

Old Rules vs. Modern Myths
Sample Size vs. the "n=all" myth
Causation vs. Correlation vs. the "End of Theory" myth
Model Fitting vs. Machine Learning dangers
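The advice to sample from the available data and check for representativeness might look like the sketch below. Everything in it is illustrative: the helper names share_gaps and stratified_sample, the "region" attribute, and the 50/50 target shares are assumptions, not part of the talk. It also corrects only for the one attribute it stratifies on. Bias in unobserved attributes remains.

```python
import random
from collections import Counter, defaultdict


def share_gaps(records, key, target_shares):
    """Observed share minus target share for each group (0 means representative)."""
    counts = Counter(record[key] for record in records)
    total = len(records)
    return {group: counts.get(group, 0) / total - share
            for group, share in target_shares.items()}


def stratified_sample(records, key, target_shares, n, seed=0):
    """Draw about n records so that group shares match the target population."""
    rng = random.Random(seed)
    by_group = defaultdict(list)
    for record in records:
        by_group[record[key]].append(record)

    sample = []
    for group, share in target_shares.items():
        quota = round(n * share)
        if quota > len(by_group[group]):
            raise ValueError(f"not enough records in group {group!r}")
        sample.extend(rng.sample(by_group[group], quota))
    return sample


# Hypothetical user data: 80% urban users, while the target population is 50/50.
users = [{"region": "urban"}] * 800 + [{"region": "rural"}] * 200
target = {"urban": 0.5, "rural": 0.5}

print(share_gaps(users, "region", target))   # large gaps: user data is skewed
subset = stratified_sample(users, "region", target, n=300)
print(share_gaps(subset, "region", target))  # gaps vanish for this attribute
```

Using fewer records than are available is deliberate here: the smaller subset matches the target population on the checked attribute, which the full user data does not.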

The End of Theory?
"The End of Theory: The Data Deluge Makes the Scientific Method Obsolete" (Chris Anderson, Wired Magazine, 2008)
"All models are wrong, and increasingly you can succeed without them." (Peter Norvig, Google's research director)
"[F]aced with massive data, this approach to science - hypothesize, model, test - is becoming obsolete."

Did you really mean that?
"There is now a better way. Petabytes allow us to say: 'Correlation is enough.' We can stop looking for models. We can analyze the data without hypotheses about what it might show. We can throw the numbers into the biggest computing clusters the world has ever seen and let statistical algorithms find patterns where science cannot."
Referring to "The End of Theory": "This revolutionary notion has now entered not just the popular imagination, but also the research practices of corporations, states, journalists and academics." (Mark Graham, The Guardian, 2012)

Spurious Correlation
Source: Correlation or Causation (Business Week, 2011)

And there is much more
Super Bowl and Dow Jones Industrial Average
1979-89 and 1990-98: Super Bowl correctly forecasted sign of annual DJIA return
Hot-hand fallacy
"The Hot Hand in Basketball: On the Misperception of Random Sequences." Gilovich, Tversky, Vallone. Cognitive Psychology, 1985.
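A back-of-the-envelope calculation shows why indicators like the Super Bowl one are unsurprising. The sketch assumes each candidate indicator is an independent fair coin flip with respect to the sign of the annual return. The function name and the example numbers (10 years, 1,000 candidate indicators) are hypothetical.

```python
def prob_at_least_one_perfect(n_years, n_candidates):
    """Chance that at least one of n_candidates coin-flip indicators
    'predicts' the sign of the return correctly in all n_years."""
    p_single = 0.5 ** n_years              # one given indicator is right every year
    return 1.0 - (1.0 - p_single) ** n_candidates


if __name__ == "__main__":
    # A single pre-specified indicator is very unlikely to be right 10 times in a row ...
    print(prob_at_least_one_perfect(10, 1))
    # ... but among 1,000 unrelated indicators, a perfect record is more likely than not.
    print(prob_at_least_one_perfect(10, 1000))
```

Searching a large pool of unrelated series for a good fit will usually turn one up, which is why a correlation found by searching is weak evidence without a model behind it.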

Lack of models
Without a model we cannot distinguish between spurious and meaningful correlations
Lack of models makes data less useful than it might be; big data insights will be limited
Nate Silver describes "The End of Theory" as "categorically the wrong attitude"
Sound theoretical understanding of statistical application is absolutely necessary

Old Rules vs. Modern Myths
Sample Size vs. the "n=all" myth
Causation vs. Correlation vs. the "End of Theory" myth
Model Fitting vs. Machine Learning dangers

Top 500 supercomputers are getting faster
Source: http://top500.org/statistics/perfdevel/

... while storage of data is getting cheaper
Source: http://www.jcmit.com/disk2012.htm

Big Data and Business Analytics
More storage
More data
Faster computers
New sophisticated methods

Aside: Reproducibility of Statistical Results
Numerous cases in the biomedical field of statistical results from clinical trials that cannot be reproduced in separate studies
In 2011, Bayer researchers reported that they were able to reproduce the results of only 17 of 67 published studies they examined
In 2012, Amgen researchers reported that they were able to reproduce the results of only 6 of 53 published cancer studies
In 2014, a review of Tamiflu found that while it made flu symptoms disappear a bit sooner, it did not stop serious complications or keep people out of the hospital

Fundamental Flaw
Publication of only successful trials introduces bias
Success in a study may have been purely accidental
With enough trials, at least one will sooner or later be successful

How to become a Guru to some Investors
Financial advisor sends letters to 10,240 = 10 x 2^10 potential clients, with half (5,120) predicting a particular stock will go up, and the other half predicting it will go down
One month later, the advisor sends letters only to the 5,120 investors who were previously sent the correct prediction, with half (2,560) of the letters predicting a certain security will go up, and the other half predicting it will go down
The advisor continues this process for 10 months.

... in a fraudulent way
Ten investors will have been sent ten consecutive correct predictions!
They may be so impressed by the advisor's ten consecutive spot-on predictions that they will entrust to him/her all of their assets...
The final ten investors are unaware of the thousands of wrong predictions

Statistical Overfitting
"I remember my friend Johnny von Neumann used to say, with four parameters I can fit an elephant, and with five I can make him wiggle his trunk." (Enrico Fermi)
"If you torture the data long enough, it will confess." (Ronald H. Coase)
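The arithmetic of the advisor's mailing scheme described above can be checked in a few lines. The function name guru_scheme and the tally of total letters sent are additions for illustration. The starting figure of 10,240 recipients and the ten monthly rounds come from the example.

```python
def guru_scheme(initial_recipients=10_240, rounds=10):
    """Track how many recipients have seen only correct predictions."""
    remaining = initial_recipients
    letters_sent = 0
    history = []
    for _ in range(rounds):
        letters_sent += remaining   # everyone still on the list gets a letter
        remaining //= 2             # only the half that got the correct call stays
        history.append(remaining)
    return remaining, letters_sent, history


if __name__ == "__main__":
    survivors, letters, history = guru_scheme()
    print(history)     # 5120, 2560, ..., 20, 10
    print(survivors)   # 10 investors saw ten correct calls in a row
    print(letters)     # 20460 letters sent in total
```

No forecasting skill enters the calculation at any point: the ten "convinced" investors are produced purely by selection.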

Backtest Overfitting in Finance
Backtesting of an investment strategy: use historical market data to assess performance
Backtest overfitting: develop a trading strategy ("model") that is sufficiently complex and has many degrees of freedom to fit the data almost perfectly
Computers can analyze millions or billions of variations of a strategy, so sooner or later you will find a great match

Strategy vs. Underlying Asset (Pseudo random)
Source: http://datagrid.lbl.gov/backtest/ (Thanks to David H. Bailey)
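Backtest overfitting can be reproduced on purely random data. The sketch below is not the tool referenced in the source link. It is a simplified stand-in in which a "strategy" is just a random sequence of long/short positions, and the return volatility (1% per day), the number of strategies, and all names are assumptions chosen for illustration.

```python
import random


def backtest_overfitting_demo(n_strategies=1000, n_days=250, seed=0):
    """Pick the best of many random strategies in-sample, then test it on new data."""
    rng = random.Random(seed)

    # Pseudo-random daily returns with zero mean: by construction there is
    # nothing to predict, in-sample or out-of-sample.
    in_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]
    out_of_sample = [rng.gauss(0.0, 0.01) for _ in range(n_days)]

    def pnl(positions, returns):
        # Total return of holding position p (+1 long, -1 short) on each day.
        return sum(p * r for p, r in zip(positions, returns))

    best_positions, best_in = None, float("-inf")
    total_in = 0.0
    for _ in range(n_strategies):
        positions = [rng.choice((-1, 1)) for _ in range(n_days)]
        score = pnl(positions, in_sample)
        total_in += score
        if score > best_in:
            best_positions, best_in = positions, score

    best_out = pnl(best_positions, out_of_sample)
    return best_in, best_out, total_in / n_strategies


if __name__ == "__main__":
    best_in, best_out, mean_in = backtest_overfitting_demo()
    print(f"best strategy, in-sample return:     {best_in:+.3f}")
    print(f"same strategy, out-of-sample return: {best_out:+.3f}")
    print(f"average strategy, in-sample return:  {mean_in:+.3f}")
```

The selected strategy looks strong on the data it was chosen on simply because it is the maximum of many noisy trials. On fresh data it has no edge in expectation, since the returns are random by construction.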

After Iteration 23

After Iteration 148

After Iteration 3496

After convergence

... and on a new data set
Source: http://datagrid.lbl.gov/backtest/ (Thanks to David H. Bailey)

Too much of a good thing?
David H. Bailey (LBNL & UC Davis): "[C]omputers operating on big data can generate nonsense faster than ever before!"

Old Rules vs. Modern Myths
Sample Size vs. the "n=all" myth
Causation vs. Correlation vs. the "End of Theory" myth
Model Fitting vs. Machine Learning dangers