LEARNING FROM BIG DATA

Transcription

1 LEARNING FROM BIG DATA Mattias Villani Division of Statistics and Machine Learning Department of Computer and Information Science Linköping University MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 1 / 17

2 WHAT IS BIG DATA? Volume: the scale of the data (financial transactions, supermarket scanners). Velocity: continuously streaming data (stock trades, news and social media). Variety: highly varying data structures (Wall Street Journal articles, network data). Veracity: varying data quality (tweets, online surveys). Volatility: constantly changing patterns (trade data, telecom data).

3 CENTRAL BANKS CAN USE BIG DATA TO... Estimate fine-grained economic models more accurately. Estimate models for networks and flows in networks. Construct fast economic indices: scanner data for inflation; job ads and text from social media for fine-grained unemployment; streaming order data for economic activity. Improve quality and transparency in decision making: summarizing news articles, visualization. Improve central bank communication: is the message getting through? Sentiments, credibility, expectations.

4 SOME RECENT BIG DATA PAPERS IN ECONOMICS Varian (2014). Big data: new tricks for econometrics. Journal of Economic Perspectives. Heston and Sinha (2014). News versus Sentiment: Comparing Textual Processing Approaches for Predicting Stock Returns. Bholat et al. (2015). Handbook in text mining for central banks. Bank of England. Bajari et al. (2015). Machine Learning Methods for Demand Estimation. AER.

5 COMPUTATIONALLY BIG DATA Data are computationally big if they are used in a context where computations are a serious impediment to analysis. Even rather small data sets can be computationally demanding when the model is very complex and time-consuming to estimate. Computational dilemma: model complexity increases with large data. Large data have the potential to reveal the poor fit of simple models; with large data one can estimate more complex and more detailed models; with many observations we can estimate the effects of more (explanatory) variables. The big question in statistics and machine learning: how to estimate complex models on large data?

6 LARGE DATA REVEALS TOO SIMPLISTIC MODELS [Figure: default rate and EBITDA/TA; legend: data, logistic fit, logistic spline fit.] Giordani, Jacobson, Villani and von Schedvin. Journal of Financial and Quantitative Analysis.

7 BAYESIAN LEARNING Bayesian methods: combine data information with other sources; avoid overfitting by imposing smoothness where data are sparse; connect nicely to prediction and decision making; handle model uncertainty naturally; are beautiful; are time-consuming (MCMC).

8 DISTRIBUTED LEARNING FOR BIG DATA Big data = data that do not fit in a single machine's RAM. Distributed computations: Matlab: distributed arrays. Python: distarray. R: DistributedR. Parallel distributed MCMC algorithms: distribute data across several machines. Learn on each machine separately. MapReduce. Combine the inferences from each machine in a correct way.

9 DISTRIBUTED MCMC Asymptotically Exact, Embarrassingly Parallel MCMC, by Neiswanger, Wang, and Xing. [Figure 1. Bayesian logistic regression posterior ovals: posterior probability mass ovals for the first 2-dimensional marginal of the posterior, the M subposteriors, the subposterior density product (via the parametric procedure), and the subposterior average (via the subpostavg procedure), for M=10 subsets (left) and M=20 subsets (right).]
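
The combination step can be illustrated in the simplest possible case. The following is a minimal sketch, assuming a Gaussian mean with known variance and a flat prior, so that every subposterior is Gaussian and the product of subposterior densities has a closed form. It mirrors the parametric density-product idea named in the figure but is not the authors' code; the function names and the simulated data are hypothetical.

```python
import numpy as np

def subposterior(y, sigma2):
    """Gaussian subposterior (mean, variance) of the mean of y under a flat prior."""
    return y.mean(), sigma2 / len(y)

def combine_gaussians(means, variances):
    """Product of Gaussian densities: precisions add, means are precision-weighted."""
    precisions = 1.0 / np.asarray(variances)
    var = 1.0 / precisions.sum()
    mean = var * (precisions * np.asarray(means)).sum()
    return mean, var

# Hypothetical data: 10,000 observations split across M = 10 "machines".
rng = np.random.default_rng(0)
sigma2 = 4.0
y = rng.normal(loc=1.5, scale=np.sqrt(sigma2), size=10_000)
shards = np.array_split(y, 10)

# Learn on each shard separately, then combine.
subs = [subposterior(s, sigma2) for s in shards]
mean_c, var_c = combine_gaussians([m for m, _ in subs], [v for _, v in subs])

# Reference: posterior computed on the full data on one machine.
mean_full, var_full = subposterior(y, sigma2)
print(mean_c, var_c, mean_full, var_full)
```

In this conjugate toy case the combined result matches the full-data posterior up to floating-point error. Each subposterior alone is M times wider than the full posterior, which is why simply averaging the subposterior densities would overstate the uncertainty.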

10 MULTI-CORE PARALLEL COMPUTING Multi-core parallel computing can be combined with distributed computing. Available in all high-level languages: Matlab's Parallel Computing Toolbox (parfor etc.); Python: multiprocessing module, joblib module etc.; R: parallel library. Communication overheads can easily overwhelm gains from parallelism.
Real speedup (Sahni and Thanvantri, 1996) can be defined as
$$\mathrm{RealSpeedup}(I, P) = \frac{\mathrm{Time}(I, Q_b, 1)}{\mathrm{Time}(I, Q_p, P)}, \tag{6}$$
where $\mathrm{Time}(I, Q_b, 1)$ is the time it takes to solve the task $I$ using the best sequential program $Q_b$, and $\mathrm{Time}(I, Q_p, P)$ is the time it takes to solve task $I$ using program $Q_p$ on $P$ processors. In the paper's measurements, sparse LDA is the fastest sampler on one core, so it is taken as the best sequential program $Q_b$, and the PC-LDA sampler and AD-LDA are compared as $Q_p$.
[Table 4: Summary of the speedup experiments. Enron data; PC-LDA and sparse AD-LDA samplers; α = 0.1, β = 0.01.]
[Figure: speedup to reach convergence (left) and speedup measured in total running time (right), averaged over five seeds and compared with the fastest implementation on one core; AD-LDA and PC-LDA on the Enron and NIPS datasets, plotted against the number of cores.]
Magnusson, Jonsson, Villani and Broman (2015). Parallelizing LDA using Partially Collapsed Gibbs Sampling.
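
Real speedup divides the run time of the best sequential program on one core by the run time of the parallel program on P cores. The sketch below shows the arithmetic only. The timings are made-up numbers chosen for illustration, not measurements from the paper, and `real_speedup` is a hypothetical helper name.

```python
def real_speedup(time_best_sequential, time_parallel):
    """Real speedup: best sequential run time divided by parallel run time."""
    if time_best_sequential <= 0 or time_parallel <= 0:
        raise ValueError("run times must be positive")
    return time_best_sequential / time_parallel

# Hypothetical run times in seconds (NOT from the paper).
t_best_sequential = 100.0                         # best sequential program, 1 core
t_parallel = {1: 120.0, 4: 40.0, 8: 30.0}         # parallel program on P cores

speedups = {p: real_speedup(t_best_sequential, t) for p, t in t_parallel.items()}
for p, s in speedups.items():
    print(f"P={p}: real speedup {s:.2f}")
```

With these made-up numbers the parallel program is slower than the best sequential one when run on a single core (speedup below 1), which is the kind of overhead the slide warns about, and doubling the cores from 4 to 8 gives far less than double the speedup.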

11 TOPIC MODELS Probabilistic model for text. Popular for summarizing documents. Input: a collection of documents. Output: K topics (probability distributions over the vocabulary) and topic proportions for each document. A topic is formally defined as a distribution over a fixed vocabulary; for example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. The per-document topic distributions are drawn from a Dirichlet distribution, whose result is used to allocate the words of the document to different topics. Technically, the model assumes that the topics are generated first, before the documents.
[Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. The topics and topic assignments in this figure are illustrative; they are not fit from real data. Panels: Topics; Documents; Topic proportions and assignments. Example topics: gene, dna, genetic; life, evolve, organism; brain, neuron, nerve; data, number, computer.]
Blei (2012). Probabilistic Topic Models. Communications of the ACM, April 2012, vol. 55, no. 4.
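
The generative process described in the figure caption can be sketched in a few lines. This is a minimal illustration, assuming symmetric Dirichlet priors and a fixed document length; the function name `generate_corpus`, the corpus sizes, and the prior values are hypothetical choices, and the code simulates documents rather than fitting topics to data.

```python
import numpy as np

def generate_corpus(n_docs, doc_len, n_topics, vocab_size, alpha, beta, seed=0):
    """Simulate documents from the LDA generative process (word ids, not strings)."""
    rng = np.random.default_rng(seed)
    # Topics are generated first: each topic is a distribution over the vocabulary.
    topics = rng.dirichlet(np.full(vocab_size, beta), size=n_topics)
    # Each document gets its own distribution over topics.
    theta = rng.dirichlet(np.full(n_topics, alpha), size=n_docs)
    docs = []
    for d in range(n_docs):
        # For each word position: choose a topic assignment ...
        z = rng.choice(n_topics, size=doc_len, p=theta[d])
        # ... then choose the word from the corresponding topic.
        words = np.array([rng.choice(vocab_size, p=topics[k]) for k in z])
        docs.append(words)
    return topics, theta, docs

# Hypothetical sizes and prior values, chosen only to keep the example small.
topics, theta, docs = generate_corpus(
    n_docs=5, doc_len=20, n_topics=3, vocab_size=50, alpha=0.5, beta=0.5
)
print(theta.round(2))   # topic proportions for each document
print(docs[0])          # word ids of the first simulated document
```

Inference runs this process in reverse: given only the documents, it recovers `topics` and `theta`, which are the two outputs listed on the slide.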

12 GPU PARALLEL COMPUTING Graphics cards (GPUs) for parallel computing on thousands of cores. Neuroimaging: brain activity time series in one million 3D pixels. From Eklund, Dufort, Villani and LaConte (2014). Frontiers in Neuroinformatics. GPU-enabled functions in Matlab's Parallel Computing Toolbox. PyCUDA in Python. gputools in R. Getting impressive performance still takes a lot of nitty-gritty low-level work: low-level CUDA or OpenCL, plus putting the data in the right place.

13 TALL DATA Tall data = many observations, not many variables. Approximate Bayes: VB, EP, ABC, INLA... Recent idea: efficient random subsampling of the data in algorithms that eventually give the full-data inference. Especially useful when the likelihood is costly (e.g. optimizing agents).
[Figure: Relative Inefficiency Factors (RIF, left panel) and the corresponding Relative Effective Draws (RED, right panel) for adaptive and non-adaptive PMCMC(1) with a random walk Metropolis proposal.]
[Figure 5. Marginal posterior distributions for MCMC (solid blue line) vs PMCMC (dashed red line) using a RWM proposal.]
Excerpt from the paper: "... algorithm used here is highly vectorized and the computational cost decreases rather slowly with increased step sizes. Other numerical methods have a computational cost which is linear in the tuning parameter, and for such problems our subsampling method with PPS-weights based on a relaxed tuning parameter would be dramatically better than MCMC on the full sample."
From Quiroz, Villani and Kohn (2014). Speeding up MCMC by efficient subsampling.
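
The basic idea behind subsampling can be shown with the simplest unbiased estimator of the full-data log-likelihood. This is a minimal sketch, assuming a Gaussian model and simple random sampling with replacement; it deliberately swaps in uniform sampling for the PPS weights used in the paper, and all names and data are hypothetical.

```python
import numpy as np

def loglik_terms(y, mu, sigma=1.0):
    """Per-observation Gaussian log-likelihood contributions."""
    return -0.5 * np.log(2 * np.pi * sigma**2) - 0.5 * ((y - mu) / sigma) ** 2

def subsampled_loglik(y, mu, m, rng):
    """Estimate the full-data log-likelihood from m randomly drawn observations."""
    idx = rng.integers(0, len(y), size=m)           # uniform, with replacement
    return len(y) * loglik_terms(y[idx], mu).mean()  # scale the subsample mean up to n

# Hypothetical tall data set: 100,000 observations, one parameter.
rng = np.random.default_rng(1)
y = rng.normal(loc=0.3, scale=1.0, size=100_000)

full = loglik_terms(y, 0.3).sum()   # touches every observation
estimates = np.array([subsampled_loglik(y, 0.3, 1_000, rng) for _ in range(200)])
print(full, estimates.mean(), estimates.std())
```

Each estimate touches 1% of the data. The estimator is unbiased for the log-likelihood but not for the likelihood itself, and its variance matters a great deal inside an MCMC acceptance step; handling both issues is what the more refined sampling schemes are for, and is beyond this sketch.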

14 WIDE DATA Wide data = many variables, comparatively few observations. Variable selection. Stochastic Search Variable Selection (SSVS). Shrinkage (ridge regression, lasso, elastic net, horseshoe). Big VARs.
[Table I, "Marginal evidence of importance", from Model Uncertainty in Growth Regressions. Columns: Regressors; BMA Post. Prob.; Sala-i-Martin CDF(0). Numeric values not recoverable. Regressors in table order: 1 GDP level in ...; 2 Fraction Confucian; 3 Life expectancy; 4 Equipment investment; 5 Sub-Saharan dummy; 6 Fraction Muslim; 7 Rule of law; 8 Number of years open economy; 9 Degree of capitalism; 10 Fraction Protestant; 11 Fraction GDP in mining; 12 Non-equipment investment; 13 Latin American dummy; 14 Primary school enrollment, ...; 15 Fraction Buddhist; 16 Black market premium; 17 Fraction Catholic; 18 Civil liberties; 19 Fraction Hindu; 20 Primary exports, ...; 21 Political rights; 22 Exchange rate distortions.]
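
Shrinkage is what makes estimation possible when there are more variables than observations. The following is a minimal sketch of ridge regression in closed form on simulated wide data; the helper name `ridge`, the dimensions, the coefficients, and the penalty values are all hypothetical, and the other methods on the slide (lasso, elastic net, horseshoe, SSVS) are not covered.

```python
import numpy as np

def ridge(X, y, lam):
    """Ridge estimate: solve (X'X + lam * I) b = X'y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Hypothetical wide data: 30 observations, 100 variables, 3 of them relevant.
rng = np.random.default_rng(2)
n, p = 30, 100
X = rng.normal(size=(n, p))
beta_true = np.zeros(p)
beta_true[:3] = [2.0, -1.0, 0.5]
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# X'X alone is singular here (rank at most 30), so least squares has no unique
# solution; adding lam * I makes the system solvable.
b_small = ridge(X, y, lam=1.0)
b_large = ridge(X, y, lam=100.0)
print(np.linalg.norm(b_small), np.linalg.norm(b_large))
```

A larger penalty pulls all coefficients harder towards zero. Ridge shrinks but never sets a coefficient exactly to zero, which is one reason the slide lists it alongside methods that also select variables.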

15 WIDE DATA Many other models in the machine learning literature are of interest: trees, random forests, support vector machines etc.
[Table 1, "Model Comparison: Prediction Error". Columns: Validation (RMSE, Std. Err.); Out-of-Sample (RMSE, Std. Err.); Weight. Rows: Linear; Stepwise; Forward Stagewise; Lasso; Random Forest; SVM; Bagging; Logit; Combined. Numeric values not recoverable. Total observations: 1,510,563; % of total: 15.0% and 25.0%.]
Bajari et al. (2015). Machine Learning Methods for Demand Estimation. AER.

16 ONLINE LEARNING Streaming data: scanners, internet text, trading of financial assets etc. How to learn as data come in sequentially? Fixed vs time-varying parameters. State space models: $y_t = f(x_t) + \epsilon_t$, $x_t = g(x_{t-1}) + h(z_t) + \nu_t$. Dynamic topic models. Kalman or particle filters. Dynamic variable selection. How to detect changes in the system online?
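
Sequential learning in a state space model can be sketched with the Kalman filter in its simplest form. This is a minimal illustration, assuming the linear Gaussian special case f(x) = x, g(x) = x and h = 0 (a local level model) with known noise variances; the function name, the variances, and the simulated data are hypothetical.

```python
import numpy as np

def kalman_local_level(y, obs_var, state_var, m0=0.0, p0=1e6):
    """Filtered means and variances for y_t = x_t + eps_t, x_t = x_{t-1} + nu_t."""
    m, p = m0, p0                       # current state mean and variance
    means, variances = [], []
    for obs in y:                       # one pass, one observation at a time
        p = p + state_var               # predict: the state drifts, uncertainty grows
        k = p / (p + obs_var)           # Kalman gain: how much to trust the new data
        m = m + k * (obs - m)           # update the mean towards the observation
        p = (1.0 - k) * p               # update: uncertainty shrinks
        means.append(m)
        variances.append(p)
    return np.array(means), np.array(variances)

# Hypothetical stream: a slowly drifting level observed with heavy noise.
rng = np.random.default_rng(3)
x = np.cumsum(rng.normal(scale=0.1, size=500))
y = x + rng.normal(scale=1.0, size=500)

means, variances = kalman_local_level(y, obs_var=1.0, state_var=0.01)
print(np.mean((means - x) ** 2), np.mean((y - x) ** 2))
```

Each update uses only the previous mean, the previous variance and the new observation, so the cost per data point is constant no matter how long the stream runs. With nonlinear f or g this closed-form update is no longer available, which is where the particle filters mentioned on the slide come in.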

17 CONCLUDING REMARKS Big traditional data (e.g. micro panels) are clearly useful for central banks. It remains to be seen whether more exotic data (text, networks, internet searches etc.) can play an important role in analysis and communication. Big data will motivate more complex models. Big data + complex models = computational challenges. Economists do not have enough competence for dealing with big data. Computer scientists, statisticians and numerical mathematicians will be needed in central banks. Economics is not machine learning: not only predictions matter. How to fuse economic theory and big data?