Why Big Data is not Big Hype in Economics and Finance?




Why Big Data is not Big Hype in Economics and Finance? Ariel M. Viale, Marshall E. Rinker School of Business, Palm Beach Atlantic University, West Palm Beach, April 2015

1. The Big Data Hype
2. Big Data as a Resourceful Toolbox
3. Big Data or Big Mistake?

What is Big Data? A vague term often thrown around by people with something to sell (Harford, 2014). After the success of Google's Flu Trends, it has been taken for granted as a quick, accurate, cheap, and theory-free way to understand the world through data. More generally, what is referred to as big data is what we know as found data, i.e., the digital exhaust of web searches, credit card payments, mobile phones, etc.: any data set that is relatively cheap to collect given its size, is less structured, has high dimensionality, and can be updated in real time.

The Four Pillars of the Faith: 1) It gets uncannily accurate results. 2) Causation has been knocked off its pedestal. 3) N = All, consequently sampling does not matter. 4) The numbers speak for themselves (Wired), so it is theory-free!

The Four Pillars of the Faith: Accuracy. Four years after the Google Flu Trends publication in Nature, a flu outbreak claimed an unexpected victim: Google Flu Trends itself. When the slow-and-steady data from the CDC arrived, they showed that Google's estimates were overstated by almost a factor of two.

The Four Pillars of the Faith: Theory-free. Theory-free analysis of correlations is inevitably fragile: if you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world using data.

The Four Pillars of the Faith: N = All. When it comes to data, size isn't everything, and N = All is not a good description of found data sets. Take Twitter as an example: Twitter users are not representative of the population as a whole. As any well-trained economist knows, a randomly chosen sample might not reflect the underlying population (sampling error), and the sample might not have been randomly chosen at all (sample selection bias). N = All is an assumption rather than a fact about the data.
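
To make the selection-bias point concrete, here is a toy simulation (my own illustration, not from the talk): if the probability of landing in a "found" data set rises with the quantity being measured, the naive sample mean is biased upward no matter how large the sample gets.

```python
# Toy simulation of sample selection bias in "found" data. Entirely
# illustrative: richer individuals are assumed more likely to be observed.
import numpy as np

rng = np.random.default_rng(42)
income = rng.lognormal(mean=10, sigma=0.8, size=100_000)  # the population

# Found data: probability of being observed rises with income.
p_observed = income / income.max()
found = income[rng.random(income.size) < p_observed]

print(f"population mean: {income.mean():,.0f}")
print(f"found-data mean: {found.mean():,.0f} (biased upward, despite {found.size:,} obs)")
```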

The Four Pillars of the Faith: Statistical Sorcery? Without careful analysis, the ratio of genuine patterns and correlations to spurious ones (the signal-to-noise ratio) quickly tends to zero.
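
A standard multiple-comparisons demonstration (again my own sketch, not from the talk) shows how quickly spurious correlations appear: scan enough pure-noise predictors and some will look strongly related to the target by chance alone.

```python
# Pure noise in, "patterns" out: with 10,000 noise predictors and 100
# observations, some sample correlations with a noise target look large.
import numpy as np

rng = np.random.default_rng(0)
n_obs, n_features = 100, 10_000
X = rng.normal(size=(n_obs, n_features))  # noise predictors
y = rng.normal(size=n_obs)                # noise target

# Sample correlation of each predictor column with y.
Xc = (X - X.mean(axis=0)) / X.std(axis=0)
yc = (y - y.mean()) / y.std()
corr = Xc.T @ yc / n_obs

print(f"largest |correlation| found in pure noise: {np.abs(corr).max():.2f}")
print(f"predictors with |corr| > 0.25: {(np.abs(corr) > 0.25).sum()}")
```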

Helpful Data-Driven Predictive Tools: Machine learning and pattern recognition analysis. Clustering analysis and classification algorithms. Neural networks. Directed acyclic graphs. Bayesian networks, etc.
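
As a minimal sketch of two of these tools on synthetic data (assuming scikit-learn is installed; none of this is from the talk): k-means clustering recovers group structure without labels, and a simple classifier is evaluated on a held-out sample.

```python
# Clustering and classification on synthetic two-group data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 1, (200, 2)),   # group 0
               rng.normal(3, 1, (200, 2))])  # group 1
y = np.repeat([0, 1], 200)

# Unsupervised: recover the two groups without using the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Supervised: fit on one half of the data, score on the held-out half.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
clf = LogisticRegression().fit(X_tr, y_tr)
print("held-out accuracy:", clf.score(X_te, y_te))
```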

With the Same Old Problems: Overfitting. Stationarity. The Lucas Critique: if a predictive model is used to decide on a policy intervention, the final result may not be what the model predicts, because the policy change is anticipated and behavior changes. It is somewhat ironic, considering that some of these techniques were developed in computer science precisely to gain insight into the problem of causality: Judea Pearl's book Causality is foundational reading in Artificial Intelligence and Machine Learning.
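
Overfitting is easy to demonstrate (a toy of my own, not from the talk): a high-degree polynomial fits the training sample almost perfectly yet predicts a held-out sample worse than the simple linear model that actually generated the data.

```python
# Overfitting: compare train vs. held-out error for a line and a
# high-degree polynomial when the true relation is linear plus noise.
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(-1, 1, 40)
y = 2 * x + rng.normal(0, 0.5, 40)          # true relation is linear
x_tr, y_tr, x_te, y_te = x[:20], y[:20], x[20:], y[20:]

for degree in (1, 12):
    coeffs = np.polyfit(x_tr, y_tr, degree)  # least-squares polynomial fit
    mse_tr = np.mean((np.polyval(coeffs, x_tr) - y_tr) ** 2)
    mse_te = np.mean((np.polyval(coeffs, x_te) - y_te) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_tr:.3f}, held-out MSE {mse_te:.3f}")
```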

Some Recent Applications
- Use of government administrative (big) data. For example, Piketty and Saez (2003) used IRS data to derive a historical series of income shares for top-percentile US households and gain insight into income inequality.
- New measures of private economic activity. For example, the Billion Prices Project (BPP), developed by Alberto Cavallo and Roberto Rigobon at MIT, publishes an alternative measure of retail price inflation obtained from online retail websites in more than fifty countries.
- Improving government policymaking. For example, the Federal Reserve made its FRED service publicly available and integrated it with popular software such as Office, EViews, and Quandl.
- Use of highly granular data to reveal the role of specific institutional details and micro-level variation that would otherwise be difficult to isolate. For example, in understanding markets, the new hype in Finance relies on High Frequency Trading (HFT) data and market microstructure models to better understand the price discovery process.
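
As a hedged sketch of tapping the FRED service mentioned above from Python (assuming the pandas-datareader package is installed and a network connection is available; CPIAUCSL is FRED's series ID for the US CPI, all urban consumers):

```python
# Pull the US CPI from FRED and compute year-over-year inflation.
import pandas_datareader.data as web

cpi = web.DataReader("CPIAUCSL", "fred", start="2000-01-01")
inflation = cpi.pct_change(12) * 100   # year-over-year percent change
print(inflation.tail())
```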

How can we benefit from Big Data without making a big mistake? Treat it as another important resource for anyone analyzing data, not as a silver bullet. Keep in mind that some of the conceptual approaches, statistical methods, and challenges of Big Data are old, familiar ones to economists.

Challenges:
- Data access: the TAQ and TORQ HFT databases from the NYSE and NASDAQ are only accessible through WRDS. Other data sets are proprietary, e.g., FOREX signed order flow from dealers.
- Data processing: handling and cleaning messy data requires specific algorithms and deep knowledge about the data, for example the Lee & Ready method used to classify trades in time-stamped HFT data (a sketch follows below). Most of the techniques require programming skills with software capable of managing large data sets: SQL, R, SAS, Matlab, Python, etc.
- Asking the right questions: the only way to avoid the trap of spurious inference is formal training in the conceptual frameworks that seek to explain the relations driving the data. Theory does matter! Formal statistical robustness checks and methods are a must. As an example, in Finance, when it comes to HFT data, we rely on formal, sometimes heavyweight, econometrics and two canonical market microstructure models: 1) the Glosten-Milgrom dealer model; and 2) Kyle's model of the informed trader. Both are rooted in microeconomics.
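
The Lee & Ready (1991) rule mentioned above classifies each trade as buyer- or seller-initiated by comparing the trade price to the prevailing quote midpoint, falling back on a tick test at the midpoint. A minimal sketch follows; the function and argument names are illustrative, not from any specific TAQ toolkit.

```python
def lee_ready(price, midpoint, prev_price):
    """Sign a trade: +1 buyer-initiated, -1 seller-initiated, 0 undetermined."""
    if price > midpoint:       # above the quote midpoint: a buy
        return 1
    if price < midpoint:       # below the quote midpoint: a sell
        return -1
    # Exactly at the midpoint: tick test against the previous trade price.
    if price > prev_price:
        return 1
    if price < prev_price:
        return -1
    return 0                   # zero tick at the midpoint: undetermined

# Trade at 10.02 against quotes 10.00/10.03 (midpoint 10.015) -> a buy.
print(lee_ready(price=10.02, midpoint=10.015, prev_price=10.01))
```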

Liran Einav and Jonathan Levin (2013). The data revolution and economic analysis. NBER Working Paper No. 19035, NBER and Stanford University.

Tim Harford (2014). Big data: Are we making a big mistake? Financial Times, March 28, 2014.