Why Big Data is not Big Hype in Economics and Finance?
Ariel M. Viale
Marshall E. Rinker School of Business, Palm Beach Atlantic University
West Palm Beach, April 2015
1. The Big Data Hype
2. Big Data as a Resourceful Toolbox
3. Big Data or Big Mistake?
What is Big Data?
A vague term often thrown around by people with something to sell (Harford, 2014). After the success of Google's Flu Trends, it has been taken for granted as a quick, accurate, cheap, and theory-free method to understand the world through data. More generally, what is referred to as big data is what we know as found data, i.e., the digital exhaust of web searches, credit card payments, mobile phones, etc.: any data set that is relatively cheap to collect given its size, is less structured, has high dimensionality, and can be updated in real time.
The Four Pillars of the Faith
1. It gets uncannily accurate results.
2. Causation has been knocked off its pedestal.
3. N = All, so sampling does not matter.
4. The numbers speak for themselves (Wired). So it is theory-free!
The Four Pillars of the Faith: Accuracy
Four years after the Google Flu Trends publication in Nature, a flu outbreak claimed an unexpected victim: Google Flu Trends itself. When the slow-and-steady data from the CDC arrived, they showed that Google's estimates were overstated by almost a factor of two.
The Four Pillars of the Faith: Theory-free
Theory-free analysis of correlations is inevitably fragile. If you have no idea what is behind a correlation, you have no idea what might cause that correlation to break down. Statisticians have spent the past 200 years figuring out what traps lie in wait when we try to understand the world using data.
The Four Pillars of the Faith: N = All
When it comes to data, size isn't everything. N = All is not a good description of found data sets. Take Twitter as an example: Twitter users are not representative of the population as a whole. As any well-trained economist knows, a randomly chosen sample might not reflect the underlying population (sampling error), and the sample might not have been randomly chosen at all (sample selection bias). N = All is an assumption, not a fact about the data.
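To see why found data can mislead even at enormous scale, here is a minimal Python sketch (entirely synthetic data, with an arbitrary participation rule standing in for selection bias): when the propensity to appear in a "found" sample correlates with the outcome of interest, the found sample's mean stays biased no matter how large n grows, while a small random sample does not.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic population: outcome y (e.g., income in arbitrary units).
N = 1_000_000
y = rng.normal(50, 10, N)

# "Found" data: the probability of appearing in the data set rises with y
# (a hypothetical logistic participation rule, i.e., selection on the outcome).
p_found = 1 / (1 + np.exp(-(y - 55) / 5))
found = rng.random(N) < p_found

# A modest random sample, for comparison.
random_sample = rng.choice(y, size=2_000, replace=False)

print(f"population mean:              {y.mean():.2f}")
print(f"found-data mean (n={found.sum()}):  {y[found].mean():.2f}")  # biased despite huge n
print(f"random-sample mean (n=2000):  {random_sample.mean():.2f}")   # close despite small n
```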
The Four Pillars of the Faith: Statistical Sorcery?
Without careful analysis, the ratio of genuine patterns and correlations to spurious ones (the signal-to-noise ratio) quickly tends to zero.
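A minimal sketch of this multiple-comparisons problem (pure noise by construction, so every "finding" is spurious): screening thousands of random series against a random target still produces a steady stream of correlations that test as significant at the conventional 5% level.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

T, K = 100, 5_000                      # 100 observations, 5,000 candidate predictors
target = rng.normal(size=T)            # the series we want to "explain" (pure noise)
candidates = rng.normal(size=(K, T))   # candidate predictors (also pure noise)

hits = 0
for x in candidates:
    r, p = stats.pearsonr(x, target)
    if p < 0.05:
        hits += 1

# With no true signal at all, roughly 5% of the correlations still come out
# "significant" -- all of them spurious patterns mined from noise.
print(f"{hits} of {K} noise predictors are 'significant' at the 5% level")
```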
Helpful Data-Driven Predictive Tools
- Machine learning and pattern recognition analysis.
- Clustering analysis and classification algorithms (a minimal clustering sketch follows this list).
- Neural networks.
- Directed acyclic graphs.
- Bayesian networks, etc.
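As one concrete instance of the toolbox above, a minimal, hedged sketch of clustering with scikit-learn (synthetic two-dimensional data; the number of clusters is assumed known here, which a real application would have to justify):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)

# Synthetic "found" data: three loose groups in two dimensions,
# standing in for, say, customers described by two features.
centers = np.array([[0, 0], [5, 5], [0, 6]])
X = np.vstack([c + rng.normal(scale=1.0, size=(200, 2)) for c in centers])

# Fit k-means with k = 3 (assumed known; in practice k must be chosen,
# e.g., via silhouette scores or domain knowledge).
km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X)

print("cluster sizes:", np.bincount(km.labels_))
print("estimated centers:\n", km.cluster_centers_.round(2))
```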
With Same Old Problems
- Overfitting (an overfitting sketch follows this list).
- Stationarity.
- The Lucas critique: if the predictive model is used to decide on a policy intervention, the final result may not be what the model predicts, because the policy change is anticipated and behavior changes.
It is somewhat ironic, considering that some of these techniques were developed in computer science precisely to get insight into the problem of causality. Judea Pearl's book Causality is foundational reading in artificial intelligence and machine learning.
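To make the overfitting point concrete, a minimal sketch (synthetic data, polynomial degrees chosen arbitrarily): a high-degree polynomial fits the small training sample almost perfectly yet predicts held-out data far worse than the simple model.

```python
import numpy as np

rng = np.random.default_rng(3)

# True relation is linear plus noise; we only observe a small sample.
def make_data(n):
    x = rng.uniform(-3, 3, n)
    y = 1.5 * x + rng.normal(scale=2.0, size=n)
    return x, y

x_train, y_train = make_data(20)
x_test, y_test = make_data(200)

for degree in (1, 15):  # simple model vs. wildly flexible model
    # polyfit may warn about poor conditioning at degree 15; that is the point.
    coefs = np.polyfit(x_train, y_train, degree)
    mse_train = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    mse_test = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {mse_train:7.2f}, test MSE {mse_test:10.2f}")
```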
Some Recent Applications
- Use of government administrative (big) data. For example, Piketty and Saez (2003) used IRS data to derive a historical series of income shares for top-percentile earners among US households and gain insight into income inequality.
- New measures of private economic activity. For example, the Billion Prices Project (BPP), developed by Alberto Cavallo and Roberto Rigobon at MIT, publishes an alternative measure of retail price inflation obtained from online retail websites in more than fifty countries.
- Improving government policymaking. For example, the Federal Reserve made its FRED service publicly available and integrated it into popular software such as Office, EViews, and Quandl (a minimal data-retrieval sketch follows this list).
- Use of highly granular data to reveal the role of specific institutional details and micro-level variation that would otherwise be difficult to isolate. For example, in understanding markets, the new hype in finance relies on high-frequency trading (HFT) data and market microstructure models to better understand the price discovery process.
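As an illustration of how accessible FRED data now are, a minimal Python sketch using pandas-datareader (the series code CPIAUCSL, US CPI for all urban consumers, and the date range are chosen for the example; network access is required):

```python
from datetime import datetime

import pandas_datareader.data as web

# Pull the US CPI series straight from FRED.
# "CPIAUCSL" is a standard FRED series code, used here for illustration.
start = datetime(2000, 1, 1)
end = datetime(2015, 4, 1)
cpi = web.DataReader("CPIAUCSL", "fred", start, end)

# Year-over-year retail price inflation, in percent.
inflation = cpi["CPIAUCSL"].pct_change(12) * 100
print(inflation.dropna().tail())
```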
How can we benefit from Big Data without making a big mistake? By treating it as another important resource for anyone analyzing data, not as a silver bullet. Keep in mind that many of the conceptual approaches, statistical methods, and challenges associated with Big Data are old, familiar ones to economists.
Challenges:
- Data access: the TAQ and TORQ HFT databases from the NYSE and NASDAQ are only accessible through WRDS. Other data sets are proprietary, e.g., signed order flow from FOREX dealers.
- Data processing: handling and cleaning messy data requires specific algorithms and deep knowledge of the data, for example the Lee and Ready method used to sign time-stamped HFT trades (see the sketch below). Most of the techniques require programming skills with software capable of managing large data sets: SQL, R, SAS, Matlab, Python, etc.
- Asking the right questions: the only way to avoid the trap of spurious inference is to get formal training in the conceptual frameworks that seek to explain the relations driving the data. Theory does matter! Formal statistical robustness checks and methods are a must. As an example, in finance, when it comes to HFT data we rely on formal, sometimes heavyweight, econometrics and two canonical market microstructure models: (1) the Glosten-Milgrom dealer model; and (2) Kyle's model of the informed trader. Both are rooted in microeconomics.
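A minimal sketch of the Lee and Ready trade-classification logic (simplified: the quote rule with a tick-test fallback, applied to hypothetical, already-merged trade and quote records; real implementations must also handle the timing alignment of trades against lagged quotes):

```python
def classify_trade(price, bid, ask, prev_prices):
    """Sign a trade as buyer- (+1) or seller-initiated (-1), Lee-Ready style.

    Quote rule: trades above the quote midpoint are buys, below are sells.
    Tick test: midpoint trades are signed by the most recent price change.
    Returns 0 if the direction cannot be determined.
    """
    mid = (bid + ask) / 2.0
    if price > mid:
        return 1
    if price < mid:
        return -1
    # Midpoint trade: fall back on the tick test over prior prices.
    for prev in reversed(prev_prices):
        if price > prev:
            return 1
        if price < prev:
            return -1
    return 0  # no prior price change to compare against


# Hypothetical trades: (price, bid, ask), processed in time order.
trades = [(10.03, 10.00, 10.04), (10.02, 10.00, 10.04), (10.01, 10.00, 10.04)]
history = []
for price, bid, ask in trades:
    print(price, "->", classify_trade(price, bid, ask, history))
    history.append(price)
```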
References
Liran Einav and Jonathan Levin (2013). The data revolution and economic analysis. NBER Working Paper No. 19035, NBER and Stanford University.
Tim Harford (2014). Big data: Are we making a big mistake? Financial Times, March 28, 2014.