LEARNING FROM BIG DATA
|
|
- Ilene Gordon
- 8 years ago
- Views:
Transcription
1 LEARNING FROM BIG DATA Mattias Villani Division of Statistics and Machine Learning Department of Computer and Information Science Linköping University MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 1 / 17
2 WHAT IS BIG DATA? Volume - the scale of the data. Financial transactions Supermarket scanners Velocity - continously streaming data. Stock trades News and social media Variety - highly varying data structures. Wall street journal articles Network data Veracity - varying data quality. Tweets Online surveys Volatility - constantly changing patterns. Trade data Telecom data MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 2 / 17
3 CENTRAL BANKS CAN USE BIG DATA TO... estimate fine grained economic models more accurately. estimate models for networks and flow in network. construct fast economic indicies: Scanner data for inflation Job adds and text from social media for fine grained unemployment Streaming order data for economic activity improve quality and transparency in decision making. Summarizing news articles. Visualization. improve central banks communication. Is the message getting through? Sentiments. Credibility. Expectations. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 3 / 17
4 SOME RECENT BIG DATA PAPER IN ECONOMICS Varian (2014). Big data: new tricks for econometrics. Journal of Economic Perspectives. Heston and Sinha (2014). News versus Sentiment: Comparing Textual Processing Approaches for Predicting Stock Returns. Bholat et al. (2015). Handbook in text mining for central banks. Bank of England. Bajari et al. (2015). Machine Learning Methods for Demand Estimation. AER. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 4 / 17
5 COMPUTATIONALLY BIG DATA Data are computationally big if they are used in a context where computations are a serious impediment to analysis. Even rather small data sets can be computationally demanding when the model is very complex and time-consuming. Computational dilemma: model complexity increases with large data: large data have the potential to reveal poor fit of simple models with large data one can estimate more complex and more detailed models. with many observations we can estimate the effect from more (explanatory) variables. The big question in statistics and machine learning: how to estimate complex models on large data? MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 5 / 17
6 LARGE DATA REVEALS TOO SIMPLISTIC MODELS EBITDA/TA Data Logistic fit Logistic spline fit Default rate Giordani, Jacobson, Villani and von Schedvin. Journal of Financial and Quantitative Analysis, MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 6 / 17
7 BAYESIAN LEARNING Bayesian methods combine data information with other sources... avoid overfitting by imposing smoothness where data are sparse... connect nicely to prediction and decision making... natural handling of model uncertainty... are beautiful... are time-consuming. MCMC. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 7 / 17
8 DISTRIBUTED LEARNING FOR BIG DATA Big data = data that does not fit on a single machine s RAM. Distributed computations: Matlab: distributed arrays. Python: distarray. R: DistributedR. Parallel distributed MCMC algorithms Distribute data across several machines. Learn on each machine separately. MapReduce Combine the inferences from each machine in a correct way. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 8 / 17
9 DISTRIBUTED MCMC ASYMPTOTICALLY EXACT, EMBARRASSINGLY PARALLEL MCMC Dimension Dimension Dimension 1 Subposteriors (M=10) Posterior Subposterior Density Product Subposterior Average Dimension 1 Subposteriors (M=20) Posterior Subposterior Density Product Subposterior Average Figure 1. Bayesian logistic regression posterior ovals. We show the posterior Asymptotically Exact, Embarrassingly Parallel MCMC by Neiswanger, Wang, and Xing, % probability mass ovals for the first 2-dimensional marginal of the posterior, the M subposteriors, the subposterior density product (via the parametric procedure), and the subposterior average (via the subpostavg procedure). We MATTIAS VILLANI show M=10 (STIMA, subsets LIU) (left) and LEARNING M=20FROM subsets BIG DATA (right). The subposterior density 9 / 17
10 nron PC-LDA 20, 100 1,4,8,16,32,64 α = 0.1, β = 0.01 nron Sparse AD-LDA 20, 100 4,8,16,32,64 α = 0.1, β = 0.01 MULTI-CORE Table 4: SummaryPARALLEL of the speedup experiments. COMPUTING Multi-core parallel computing. Can be combined with distributed computing. RealSpeedup(I, P ) = Available in all high-level T ime(i, languages: Q b, 1) T ime(i, Q p, P ), (6) Matlab s parallel computing toolbox. parfor etc. Python: multiprocessing module, joblib module etc R: Parallel library. up (Sahni and Thanvantri, 1996) can be defined as:, Q b, 1) is the time it takes to solve the task I using the best sequential program. and T ime(i, Q p, P ) is the time it takes to solve task I using program Q p on In our measurements, sparse LDA is the fastest sampler on one core for all, so we will take sparse LDA to be the best sequential program Q b and compare -LDA sampler and AD-LDA as Q p in our measurements for real speedup. Communication overheads can easily overwhelm gains from parallelism. 5 Topics Algorithm AD LDA PC LDA Times faster Dataset enron nips No. Cores No. Cores l speedup Magnusson, to reachjonsson, convergence Villani and (left) Broman and speedup (2015). Parallelizing measured inlda total using running Partially time Collapsed Gibbs Sampling. LDAMATTIAS averaged VILLANI over(stima, five seeds LIU) compared with LEARNING the fastest FROM BIG implementation DATA on one 10 / 17
11 TOPIC MODELS article, entitled Seeking Life s Bare (Genetic) Necessities, is about using data analysis to determine the number of genes an organism needs to survive (in an evolutionary sense). By hand, we have highlighted different words that are used in the article. Words about data analysis, such as model is fleshed out later.) We formally define a topic to be a distribution over a fixed vocabulary. For example, the genetics topic has words about genetics with high probability and the evolutionary biology topic has words about evolutionary biology with high probability. We assume that these chosen from the per-document distribution over topics (step #2a). b In the example article, the distribution over topics would place probability on genetics, data analysis, and b We should explain the mysterious name, latent Dirichlet allocation. The distribution that is computer and prediction, are highlighted in blue; words topics are specified before any data used to draw the per-document topic distributions in step #1 (the cartoon histogram in Figure Probabilistic model about evolutionary for text. has been generated. Popular for summarizing documents. a Now for each 1) is called a Dirichlet distribution. In the generative process for LDA, the result of the Dirichlet biology, such as life and organism, Input: are ahighlighted collection in pink; words of about documents. a Technically, the model assumes that the topics are generated first, before the documents. different topics. Why latent? Keep reading. is used to allocate the words of the document to genetics, such as sequenced and Output: K topics - probability distributions over the vocabulary. Figure 1. The intuitions behind latent Dirichlet allocation. We assume that some number of topics, which are distributions over words, exist for the whole collection (far left). Each document is assumed to be generated as follows. First choose a distribution over the topics (the histogram at right); then, for each word, choose a topic assignment (the colored coins) and choose the word from the corresponding topic. Topic proportions for each document. The topics and topic assignments in this figure are illustrative they are not fit from real data. See Figure 2 for topics fit from data. Topics Documents Topic proportions and assignments gene dna genetic.,, life evolve organism.,, brain neuron nerve data number computer.,, communications of the acm april 2012 vol. 55 no. 4 Blei (2012). Probabilistic Topic Models. Communications of the ACM. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 11 / 17
12 GPU PARALLEL COMPUTING Graphics cards (GPU) for parallel computing on thousands of cores. Neuroimaging: brain activity time series in one million 3D pixels. From Eklund, Dufort, Villani and LaConte (2014). Frontiers of Neuroinformatics. GPU-enabled functions in Matlab s Parallel Computing Toolbox. PyCUDA in Python. gputools in R. Still lots of nitty-gritty low level things to get impressive performance: Low-level CUDA or OpenCL + putting the data in the right place. MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 12 / 17
13 MCMC PMCMC TALL DATA Tall data = many observations, not many variables. Approximate Bayes: VB, EP, ABC, INLA... Recent idea: efficient random subsampling of the data in algorithms that eventually give the full data inference. Especially useful when likelihood is costly (e.g. optimizing agents). C FOR LARGE DATA PROBLEMS 18 SPEEDING UP MCMC 23 rs (RIF) non adaptive adaptive 8 Relative Effective Draws (RED) 6 β λ2 β λ3 RED β λ4 β λ5 β 7 β 8 β 0 β 1 β 2 β 3 β 4 β 5 β 6 β 7 β β λ6 τ 2 l shows the Relative Inefficiency Factors (RIF) for ow bar) and PMCMC(1) non adaptive (red bar) for ith a random walk Metropolis proposal. The right ding Relative Effective Draws (RED). Figure 5. Marginal posterior distributions for MCMC (solid blue line) vs PMCMC (dashed red line) using a RWM proposal. algorithm used here is highly vectorized and the computational cost decreases rather slowly From Quiroz, Villani and Kohn (2014). Speeding up MCMC by efficient subsampling. with increased step sizes. Other numerical methods have a computational cost which is linear in the tuning parameter, and for such problems our subsampling method with PPS-weights based on a relaxed tuning parameter would be dramatically better than MCMC on the full MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 13 / 17 sample.
14 WIDE DATA Wide data = many variables, comparatively few observation. Variable selection. Stochastic Search Variable Selection (SSVS). Shrinkage (ridge regression, lasso, elastic net, horseshoe). Big VARs. Model Uncertainty in Growth Regressions 9 Table I. Marginal evidence of importance BMA Sala-i-Martin Regressors Post.Prob. CDF(0) 1 GDP level in Fraction Confucian Life expectancy Equipment investment Sub-Saharan dummy Fraction Muslim Rule of law Number of Years open economy Degree of Capitalism Fraction Protestant Fraction GDP in mining Non-Equipment Investment Latin American dummy Primary School Enrollment, Fraction Buddhist Black Market Premium Fraction Catholic Civil Liberties Fraction Hindu Primary exports, Political Rights MATTIAS VILLANI (STIMA, LIU) 22 Exchange LEARNING rate distortions FROM BIG DATA / 17
15 WIDE DATA Many other models in the machine learning literature are of interest: trees, random forest, support vector machines etc. VOL. VOL NO. ISSUE MACHINE LEARNING METHODS FOR DEMAND ESTIMATION 5 Table 1 Model Comparison: Prediction Error Validation Out-of-Sample RMSE Std. Err. RMSE Std. Err. Weight Linear % Stepwise % Forward Stagewise % Lasso % Random Forest % SVM % Bagging % Logit % Combined % # of Obs 226, ,980 Total Obs 1,510,563 % of Bajari Totalet al. (2015). Machine 15.0% Learning Methods for 25.0% Demand Estimation. AER. yet commonly used in economics, we think Belloni, Alexandre, Victor Cher- MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 15 / 17
16 ONLINE LEARNING Streaming data. Scanners, internet text, trading of financial assets etc How to learn as data come in sequentially? Fixed vs time-varying parameters. State space models: y t = f (x t ) + ɛ t x t = g(x t 1 ) + h(z t ) + ν t Dynamic topic models. Kalman or particle filters. Dynamic variable selection. How to detect changes in the system online? MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 16 / 17
17 CONCLUDING REMARKS Big traditional data (e.g. micro panels) are clearly useful for central banks. Remains to be seen if more exotic data (text, networks, internet searches etc) can play an important role in analysis and communication. Big data will motivate more complex models. Big data + complex models = computational challenges. Economists do not have enough competence for dealing with big data. Computer scientists, statisticians, numerical mathematicians will be needed in central banks. Economics is not machine learning: not only predictions matter. How to fuse economic theory and big data? MATTIAS VILLANI (STIMA, LIU) LEARNING FROM BIG DATA 17 / 17
Machine Learning Methods for Demand Estimation
Machine Learning Methods for Demand Estimation By Patrick Bajari, Denis Nekipelov, Stephen P. Ryan, and Miaoyu Yang Over the past decade, there has been a high level of interest in modeling consumer behavior
More informationSampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data
Sampling via Moment Sharing: A New Framework for Distributed Bayesian Inference for Big Data (Oxford) in collaboration with: Minjie Xu, Jun Zhu, Bo Zhang (Tsinghua) Balaji Lakshminarayanan (Gatsby) Bayesian
More informationBayes and Naïve Bayes. cs534-machine Learning
Bayes and aïve Bayes cs534-machine Learning Bayes Classifier Generative model learns Prediction is made by and where This is often referred to as the Bayes Classifier, because of the use of the Bayes rule
More informationEconomic Commentaries
n Economic Commentaries Data and statistics are a cornerstone of the Riksbank s work. In recent years, the supply of data has increased dramatically and this trend is set to continue as an ever-greater
More informationHT2015: SC4 Statistical Data Mining and Machine Learning
HT2015: SC4 Statistical Data Mining and Machine Learning Dino Sejdinovic Department of Statistics Oxford http://www.stats.ox.ac.uk/~sejdinov/sdmml.html Bayesian Nonparametrics Parametric vs Nonparametric
More informationBig Data, Statistics, and the Internet
Big Data, Statistics, and the Internet Steven L. Scott April, 4 Steve Scott (Google) Big Data, Statistics, and the Internet April, 4 / 39 Summary Big data live on more than one machine. Computing takes
More informationMACHINE LEARNING IN HIGH ENERGY PHYSICS
MACHINE LEARNING IN HIGH ENERGY PHYSICS LECTURE #1 Alex Rogozhnikov, 2015 INTRO NOTES 4 days two lectures, two practice seminars every day this is introductory track to machine learning kaggle competition!
More informationLinear regression methods for large n and streaming data
Linear regression methods for large n and streaming data Large n and small or moderate p is a fairly simple problem. The sufficient statistic for β in OLS (and ridge) is: The concept of sufficiency is
More informationBayesian networks - Time-series models - Apache Spark & Scala
Bayesian networks - Time-series models - Apache Spark & Scala Dr John Sandiford, CTO Bayes Server Data Science London Meetup - November 2014 1 Contents Introduction Bayesian networks Latent variables Anomaly
More informationTOWARD BIG DATA ANALYSIS WORKSHOP
TOWARD BIG DATA ANALYSIS WORKSHOP 邁 向 巨 量 資 料 分 析 研 討 會 摘 要 集 2015.06.05-06 巨 量 資 料 之 矩 陣 視 覺 化 陳 君 厚 中 央 研 究 院 統 計 科 學 研 究 所 摘 要 視 覺 化 (Visualization) 與 探 索 式 資 料 分 析 (Exploratory Data Analysis, EDA)
More informationLearning outcomes. Knowledge and understanding. Competence and skills
Syllabus Master s Programme in Statistics and Data Mining 120 ECTS Credits Aim The rapid growth of databases provides scientists and business people with vast new resources. This programme meets the challenges
More informationMachine Learning for Data Science (CS4786) Lecture 1
Machine Learning for Data Science (CS4786) Lecture 1 Tu-Th 10:10 to 11:25 AM Hollister B14 Instructors : Lillian Lee and Karthik Sridharan ROUGH DETAILS ABOUT THE COURSE Diagnostic assignment 0 is out:
More informationData Mining Yelp Data - Predicting rating stars from review text
Data Mining Yelp Data - Predicting rating stars from review text Rakesh Chada Stony Brook University rchada@cs.stonybrook.edu Chetan Naik Stony Brook University cnaik@cs.stonybrook.edu ABSTRACT The majority
More informationMS1b Statistical Data Mining
MS1b Statistical Data Mining Yee Whye Teh Department of Statistics Oxford http://www.stats.ox.ac.uk/~teh/datamining.html Outline Administrivia and Introduction Course Structure Syllabus Introduction to
More informationSTA 4273H: Statistical Machine Learning
STA 4273H: Statistical Machine Learning Russ Salakhutdinov Department of Statistics! rsalakhu@utstat.toronto.edu! http://www.cs.toronto.edu/~rsalakhu/ Lecture 6 Three Approaches to Classification Construct
More informationNon-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning
Non-negative Matrix Factorization (NMF) in Semi-supervised Learning Reducing Dimension and Maintaining Meaning SAMSI 10 May 2013 Outline Introduction to NMF Applications Motivations NMF as a middle step
More informationMachine Learning Big Data using Map Reduce
Machine Learning Big Data using Map Reduce By Michael Bowles, PhD Where Does Big Data Come From? -Web data (web logs, click histories) -e-commerce applications (purchase histories) -Retail purchase histories
More informationData Analytics at NICTA. Stephen Hardy National ICT Australia (NICTA) shardy@nicta.com.au
Data Analytics at NICTA Stephen Hardy National ICT Australia (NICTA) shardy@nicta.com.au NICTA Copyright 2013 Outline Big data = science! Data analytics at NICTA Discrete Finite Infinite Machine Learning
More informationBayesian Penalized Methods for High Dimensional Data
Bayesian Penalized Methods for High Dimensional Data Joseph G. Ibrahim Joint with Hongtu Zhu and Zakaria Khondker What is Covered? Motivation GLRR: Bayesian Generalized Low Rank Regression L2R2: Bayesian
More informationSupervised Learning (Big Data Analytics)
Supervised Learning (Big Data Analytics) Vibhav Gogate Department of Computer Science The University of Texas at Dallas Practical advice Goal of Big Data Analytics Uncover patterns in Data. Can be used
More informationStatistics Graduate Courses
Statistics Graduate Courses STAT 7002--Topics in Statistics-Biological/Physical/Mathematics (cr.arr.).organized study of selected topics. Subjects and earnable credit may vary from semester to semester.
More informationCentre for Central Banking Studies
Centre for Central Banking Studies Technical Handbook No. 4 Applied Bayesian econometrics for central bankers Andrew Blake and Haroon Mumtaz CCBS Technical Handbook No. 4 Applied Bayesian econometrics
More informationGovernment of Russian Federation. Faculty of Computer Science School of Data Analysis and Artificial Intelligence
Government of Russian Federation Federal State Autonomous Educational Institution of High Professional Education National Research University «Higher School of Economics» Faculty of Computer Science School
More informationData Mining. Nonlinear Classification
Data Mining Unit # 6 Sajjad Haider Fall 2014 1 Nonlinear Classification Classes may not be separable by a linear boundary Suppose we randomly generate a data set as follows: X has range between 0 to 15
More informationData Mining: An Overview. David Madigan http://www.stat.columbia.edu/~madigan
Data Mining: An Overview David Madigan http://www.stat.columbia.edu/~madigan Overview Brief Introduction to Data Mining Data Mining Algorithms Specific Eamples Algorithms: Disease Clusters Algorithms:
More informationBIOINF 585 Fall 2015 Machine Learning for Systems Biology & Clinical Informatics http://www.ccmb.med.umich.edu/node/1376
Course Director: Dr. Kayvan Najarian (DCM&B, kayvan@umich.edu) Lectures: Labs: Mondays and Wednesdays 9:00 AM -10:30 AM Rm. 2065 Palmer Commons Bldg. Wednesdays 10:30 AM 11:30 AM (alternate weeks) Rm.
More informationMSCA 31000 Introduction to Statistical Concepts
MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced
More informationParallelization Strategies for Multicore Data Analysis
Parallelization Strategies for Multicore Data Analysis Wei-Chen Chen 1 Russell Zaretzki 2 1 University of Tennessee, Dept of EEB 2 University of Tennessee, Dept. Statistics, Operations, and Management
More informationLecture 3: Linear methods for classification
Lecture 3: Linear methods for classification Rafael A. Irizarry and Hector Corrada Bravo February, 2010 Today we describe four specific algorithms useful for classification problems: linear regression,
More informationBayesian Statistics: Indian Buffet Process
Bayesian Statistics: Indian Buffet Process Ilker Yildirim Department of Brain and Cognitive Sciences University of Rochester Rochester, NY 14627 August 2012 Reference: Most of the material in this note
More informationLeveraging Ensemble Models in SAS Enterprise Miner
ABSTRACT Paper SAS133-2014 Leveraging Ensemble Models in SAS Enterprise Miner Miguel Maldonado, Jared Dean, Wendy Czika, and Susan Haller SAS Institute Inc. Ensemble models combine two or more models to
More informationBayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University caizhua@gmail.com
Bayesian Machine Learning (ML): Modeling And Inference in Big Data Zhuhua Cai Google Rice University caizhua@gmail.com 1 Syllabus Bayesian ML Concepts (Today) Bayesian ML on MapReduce (Next morning) Bayesian
More informationNew Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Introduction
Introduction New Work Item for ISO 3534-5 Predictive Analytics (Initial Notes and Thoughts) Predictive analytics encompasses the body of statistical knowledge supporting the analysis of massive data sets.
More informationAcknowledgments. Data Mining with Regression. Data Mining Context. Overview. Colleagues
Data Mining with Regression Teaching an old dog some new tricks Acknowledgments Colleagues Dean Foster in Statistics Lyle Ungar in Computer Science Bob Stine Department of Statistics The School of the
More informationService courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.
Course Catalog In order to be assured that all prerequisites are met, students must acquire a permission number from the education coordinator prior to enrolling in any Biostatistics course. Courses are
More information14.10.2014. Overview. Swarms in nature. Fish, birds, ants, termites, Introduction to swarm intelligence principles Particle Swarm Optimization (PSO)
Overview Kyrre Glette kyrrehg@ifi INF3490 Swarm Intelligence Particle Swarm Optimization Introduction to swarm intelligence principles Particle Swarm Optimization (PSO) 3 Swarms in nature Fish, birds,
More informationAn Introduction to Data Mining
An Introduction to Intel Beijing wei.heng@intel.com January 17, 2014 Outline 1 DW Overview What is Notable Application of Conference, Software and Applications Major Process in 2 Major Tasks in Detail
More informationA Statistician s View of Big Data
A Statistician s View of Big Data Max Kuhn, Ph.D (Pfizer Global R&D, Groton, CT) Kjell Johnson, Ph.D (Arbor Analytics, Ann Arbor MI) What Does Big Data Mean? The advantages and issues related to Big Data
More informationTackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.
Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult
More informationStatistics for BIG data
Statistics for BIG data Statistics for Big Data: Are Statisticians Ready? Dennis Lin Department of Statistics The Pennsylvania State University John Jordan and Dennis K.J. Lin (ICSA-Bulletine 2014) Before
More informationLearning Gaussian process models from big data. Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu
Learning Gaussian process models from big data Alan Qi Purdue University Joint work with Z. Xu, F. Yan, B. Dai, and Y. Zhu Machine learning seminar at University of Cambridge, July 4 2012 Data A lot of
More informationExample: Credit card default, we may be more interested in predicting the probabilty of a default than classifying individuals as default or not.
Statistical Learning: Chapter 4 Classification 4.1 Introduction Supervised learning with a categorical (Qualitative) response Notation: - Feature vector X, - qualitative response Y, taking values in C
More informationBig Data and Marketing
Big Data and Marketing Professor Venky Shankar Coleman Chair in Marketing Director, Center for Retailing Studies Mays Business School Texas A&M University http://www.venkyshankar.com venky@venkyshankar.com
More informationParallel Data Mining. Team 2 Flash Coders Team Research Investigation Presentation 2. Foundations of Parallel Computing Oct 2014
Parallel Data Mining Team 2 Flash Coders Team Research Investigation Presentation 2 Foundations of Parallel Computing Oct 2014 Agenda Overview of topic Analysis of research papers Software design Overview
More informationMammoth Scale Machine Learning!
Mammoth Scale Machine Learning! Speaker: Robin Anil, Apache Mahout PMC Member! OSCON"10! Portland, OR! July 2010! Quick Show of Hands!# Are you fascinated about ML?!# Have you used ML?!# Do you have Gigabytes
More informationRegularized Logistic Regression for Mind Reading with Parallel Validation
Regularized Logistic Regression for Mind Reading with Parallel Validation Heikki Huttunen, Jukka-Pekka Kauppi, Jussi Tohka Tampere University of Technology Department of Signal Processing Tampere, Finland
More informationMSCA 31000 Introduction to Statistical Concepts
MSCA 31000 Introduction to Statistical Concepts This course provides general exposure to basic statistical concepts that are necessary for students to understand the content presented in more advanced
More informationMedical Image Processing on the GPU. Past, Present and Future. Anders Eklund, PhD Virginia Tech Carilion Research Institute andek@vtc.vt.
Medical Image Processing on the GPU Past, Present and Future Anders Eklund, PhD Virginia Tech Carilion Research Institute andek@vtc.vt.edu Outline Motivation why do we need GPUs? Past - how was GPU programming
More information11. Time series and dynamic linear models
11. Time series and dynamic linear models Objective To introduce the Bayesian approach to the modeling and forecasting of time series. Recommended reading West, M. and Harrison, J. (1997). models, (2 nd
More informationPublication List. Chen Zehua Department of Statistics & Applied Probability National University of Singapore
Publication List Chen Zehua Department of Statistics & Applied Probability National University of Singapore Publications Journal Papers 1. Y. He and Z. Chen (2014). A sequential procedure for feature selection
More informationScalable Developments for Big Data Analytics in Remote Sensing
Scalable Developments for Big Data Analytics in Remote Sensing Federated Systems and Data Division Research Group High Productivity Data Processing Dr.-Ing. Morris Riedel et al. Research Group Leader,
More informationMachine Learning for Medical Image Analysis. A. Criminisi & the InnerEye team @ MSRC
Machine Learning for Medical Image Analysis A. Criminisi & the InnerEye team @ MSRC Medical image analysis the goal Automatic, semantic analysis and quantification of what observed in medical scans Brain
More informationFast Analytics on Big Data with H20
Fast Analytics on Big Data with H20 0xdata.com, h2o.ai Tomas Nykodym, Petr Maj Team About H2O and 0xdata H2O is a platform for distributed in memory predictive analytics and machine learning Pure Java,
More informationCOPYRIGHTED MATERIAL. Contents. List of Figures. Acknowledgments
Contents List of Figures Foreword Preface xxv xxiii xv Acknowledgments xxix Chapter 1 Fraud: Detection, Prevention, and Analytics! 1 Introduction 2 Fraud! 2 Fraud Detection and Prevention 10 Big Data for
More informationApplied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets
Applied Data Mining Analysis: A Step-by-Step Introduction Using Real-World Data Sets http://info.salford-systems.com/jsm-2015-ctw August 2015 Salford Systems Course Outline Demonstration of two classification
More informationBayesian Hidden Markov Models for Alcoholism Treatment Tria
Bayesian Hidden Markov Models for Alcoholism Treatment Trial Data May 12, 2008 Co-Authors Dylan Small, Statistics Department, UPenn Kevin Lynch, Treatment Research Center, Upenn Steve Maisto, Psychology
More informationGeneralizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel
Generalizing Random Forests Principles to other Methods: Random MultiNomial Logit, Random Naive Bayes, Anita Prinzie & Dirk Van den Poel Copyright 2008 All rights reserved. Random Forests Forest of decision
More informationBayesX - Software for Bayesian Inference in Structured Additive Regression
BayesX - Software for Bayesian Inference in Structured Additive Regression Thomas Kneib Faculty of Mathematics and Economics, University of Ulm Department of Statistics, Ludwig-Maximilians-University Munich
More informationPetrel TIPS&TRICKS from SCM
Petrel TIPS&TRICKS from SCM Knowledge Worth Sharing Histograms and SGS Modeling Histograms are used daily for interpretation, quality control, and modeling in Petrel. This TIPS&TRICKS document briefly
More informationLogistic Regression. Jia Li. Department of Statistics The Pennsylvania State University. Logistic Regression
Logistic Regression Department of Statistics The Pennsylvania State University Email: jiali@stat.psu.edu Logistic Regression Preserve linear classification boundaries. By the Bayes rule: Ĝ(x) = arg max
More informationAnalytics on Big Data
Analytics on Big Data Riccardo Torlone Università Roma Tre Credits: Mohamed Eltabakh (WPI) Analytics The discovery and communication of meaningful patterns in data (Wikipedia) It relies on data analysis
More informationBig Data - Lecture 1 Optimization reminders
Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Big Data - Lecture 1 Optimization reminders S. Gadat Toulouse, Octobre 2014 Schedule Introduction Major issues Examples Mathematics
More informationMachine learning for algo trading
Machine learning for algo trading An introduction for nonmathematicians Dr. Aly Kassam Overview High level introduction to machine learning A machine learning bestiary What has all this got to do with
More informationSoftware support for economic research at CNB
Software support for economic research at CNB Modern tools for financial analysis and modeling František Brázdik Macroeconomic Forecasting Division frantisek.brazdik@cnb.cz Czech National Bank May 23,
More informationData Mining. SPSS Clementine 12.0. 1. Clementine Overview. Spring 2010 Instructor: Dr. Masoud Yaghini. Clementine
Data Mining SPSS 12.0 1. Overview Spring 2010 Instructor: Dr. Masoud Yaghini Introduction Types of Models Interface Projects References Outline Introduction Introduction Three of the common data mining
More informationIntroduction to Big Data with Apache Spark UC BERKELEY
Introduction to Big Data with Apache Spark UC BERKELEY This Lecture Exploratory Data Analysis Some Important Distributions Spark mllib Machine Learning Library Descriptive vs. Inferential Statistics Descriptive:»
More informationPS 271B: Quantitative Methods II. Lecture Notes
PS 271B: Quantitative Methods II Lecture Notes Langche Zeng zeng@ucsd.edu The Empirical Research Process; Fundamental Methodological Issues 2 Theory; Data; Models/model selection; Estimation; Inference.
More informationSearch Taxonomy. Web Search. Search Engine Optimization. Information Retrieval
Information Retrieval INFO 4300 / CS 4300! Retrieval models Older models» Boolean retrieval» Vector Space model Probabilistic Models» BM25» Language models Web search» Learning to Rank Search Taxonomy!
More information1 Maximum likelihood estimation
COS 424: Interacting with Data Lecturer: David Blei Lecture #4 Scribes: Wei Ho, Michael Ye February 14, 2008 1 Maximum likelihood estimation 1.1 MLE of a Bernoulli random variable (coin flips) Given N
More informationBig data in macroeconomics Lucrezia Reichlin London Business School and now-casting economics ltd. COEURE workshop Brussels 3-4 July 2015
Big data in macroeconomics Lucrezia Reichlin London Business School and now-casting economics ltd COEURE workshop Brussels 3-4 July 2015 WHAT IS BIG DATA IN ECONOMICS? Frank Diebold claimed to have introduced
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationBig Data at Spotify. Anders Arpteg, Ph D Analytics Machine Learning, Spotify
Big Data at Spotify Anders Arpteg, Ph D Analytics Machine Learning, Spotify Quickly about me Quickly about Spotify What is all the data used for? Quickly about Spark Hadoop MR vs Spark Need for (distributed)
More informationTree Ensembles: The Power of Post- Processing. December 2012 Dan Steinberg Mikhail Golovnya Salford Systems
Tree Ensembles: The Power of Post- Processing December 2012 Dan Steinberg Mikhail Golovnya Salford Systems Course Outline Salford Systems quick overview Treenet an ensemble of boosted trees GPS modern
More informationDistributed forests for MapReduce-based machine learning
Distributed forests for MapReduce-based machine learning Ryoji Wakayama, Ryuei Murata, Akisato Kimura, Takayoshi Yamashita, Yuji Yamauchi, Hironobu Fujiyoshi Chubu University, Japan. NTT Communication
More informationDistributed Bayesian Posterior Sampling via Moment Sharing
Distributed Bayesian Posterior Sampling via Moment Sharing Minjie Xu 1, Balaji Lakshminarayanan 2, Yee Whye Teh 3, Jun Zhu 1, and Bo Zhang 1 1 State Key Lab of Intelligent Technology and Systems; Tsinghua
More informationBasics of Statistical Machine Learning
CS761 Spring 2013 Advanced Machine Learning Basics of Statistical Machine Learning Lecturer: Xiaojin Zhu jerryzhu@cs.wisc.edu Modern machine learning is rooted in statistics. You will find many familiar
More informationUsing Data Mining for Mobile Communication Clustering and Characterization
Using Data Mining for Mobile Communication Clustering and Characterization A. Bascacov *, C. Cernazanu ** and M. Marcu ** * Lasting Software, Timisoara, Romania ** Politehnica University of Timisoara/Computer
More informationModeling and Analysis of Call Center Arrival Data: A Bayesian Approach
Modeling and Analysis of Call Center Arrival Data: A Bayesian Approach Refik Soyer * Department of Management Science The George Washington University M. Murat Tarimcilar Department of Management Science
More informationDATA MINING TECHNIQUES AND APPLICATIONS
DATA MINING TECHNIQUES AND APPLICATIONS Mrs. Bharati M. Ramageri, Lecturer Modern Institute of Information Technology and Research, Department of Computer Application, Yamunanagar, Nigdi Pune, Maharashtra,
More informationLocation matters. 3 techniques to incorporate geo-spatial effects in one's predictive model
Location matters. 3 techniques to incorporate geo-spatial effects in one's predictive model Xavier Conort xavier.conort@gear-analytics.com Motivation Location matters! Observed value at one location is
More informationCompression algorithm for Bayesian network modeling of binary systems
Compression algorithm for Bayesian network modeling of binary systems I. Tien & A. Der Kiureghian University of California, Berkeley ABSTRACT: A Bayesian network (BN) is a useful tool for analyzing the
More informationLasso on Categorical Data
Lasso on Categorical Data Yunjin Choi, Rina Park, Michael Seo December 14, 2012 1 Introduction In social science studies, the variables of interest are often categorical, such as race, gender, and nationality.
More informationProbabilistic Models for Big Data. Alex Davies and Roger Frigola University of Cambridge 13th February 2014
Probabilistic Models for Big Data Alex Davies and Roger Frigola University of Cambridge 13th February 2014 The State of Big Data Why probabilistic models for Big Data? 1. If you don t have to worry about
More informationMaster of Mathematical Finance: Course Descriptions
Master of Mathematical Finance: Course Descriptions CS 522 Data Mining Computer Science This course provides continued exploration of data mining algorithms. More sophisticated algorithms such as support
More informationWhy Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization. Learning Goals. GENOME 560, Spring 2012
Why Taking This Course? Course Introduction, Descriptive Statistics and Data Visualization GENOME 560, Spring 2012 Data are interesting because they help us understand the world Genomics: Massive Amounts
More informationMachine Learning Logistic Regression
Machine Learning Logistic Regression Jeff Howbert Introduction to Machine Learning Winter 2012 1 Logistic regression Name is somewhat misleading. Really a technique for classification, not regression.
More informationBOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING
BOOSTED REGRESSION TREES: A MODERN WAY TO ENHANCE ACTUARIAL MODELLING Xavier Conort xavier.conort@gear-analytics.com Session Number: TBR14 Insurance has always been a data business The industry has successfully
More informationPrinciples of Data Mining by Hand&Mannila&Smyth
Principles of Data Mining by Hand&Mannila&Smyth Slides for Textbook Ari Visa,, Institute of Signal Processing Tampere University of Technology October 4, 2010 Data Mining: Concepts and Techniques 1 Differences
More informationUsing In-Memory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
More informationHUAWEI Advanced Data Science with Spark Streaming. Albert Bifet (@abifet)
HUAWEI Advanced Data Science with Spark Streaming Albert Bifet (@abifet) Huawei Noah s Ark Lab Focus Intelligent Mobile Devices Data Mining & Artificial Intelligence Intelligent Telecommunication Networks
More informationWebFOCUS RStat. RStat. Predict the Future and Make Effective Decisions Today. WebFOCUS RStat
Information Builders enables agile information solutions with business intelligence (BI) and integration technologies. WebFOCUS the most widely utilized business intelligence platform connects to any enterprise
More informationThe Artificial Prediction Market
The Artificial Prediction Market Adrian Barbu Department of Statistics Florida State University Joint work with Nathan Lay, Siemens Corporate Research 1 Overview Main Contributions A mathematical theory
More informationPredictive Analytics Techniques: What to Use For Your Big Data. March 26, 2014 Fern Halper, PhD
Predictive Analytics Techniques: What to Use For Your Big Data March 26, 2014 Fern Halper, PhD Presenter Proven Performance Since 1995 TDWI helps business and IT professionals gain insight about data warehousing,
More informationData Mining - Evaluation of Classifiers
Data Mining - Evaluation of Classifiers Lecturer: JERZY STEFANOWSKI Institute of Computing Sciences Poznan University of Technology Poznan, Poland Lecture 4 SE Master Course 2008/2009 revised for 2010
More informationDirichlet Processes A gentle tutorial
Dirichlet Processes A gentle tutorial SELECT Lab Meeting October 14, 2008 Khalid El-Arini Motivation We are given a data set, and are told that it was generated from a mixture of Gaussian distributions.
More informationHow To Understand The Theory Of Probability
Graduate Programs in Statistics Course Titles STAT 100 CALCULUS AND MATR IX ALGEBRA FOR STATISTICS. Differential and integral calculus; infinite series; matrix algebra STAT 195 INTRODUCTION TO MATHEMATICAL
More informationData Mining Practical Machine Learning Tools and Techniques
Ensemble learning Data Mining Practical Machine Learning Tools and Techniques Slides for Chapter 8 of Data Mining by I. H. Witten, E. Frank and M. A. Hall Combining multiple models Bagging The basic idea
More informationScaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce
Scaling Bayesian Network Parameter Learning with Expectation Maximization using MapReduce Erik B. Reed Carnegie Mellon University Silicon Valley Campus NASA Research Park Moffett Field, CA 94035 erikreed@cmu.edu
More informationBayesian Predictive Profiles with Applications to Retail Transaction Data
Bayesian Predictive Profiles with Applications to Retail Transaction Data Igor V. Cadez Information and Computer Science University of California Irvine, CA 92697-3425, U.S.A. icadez@ics.uci.edu Padhraic
More informationProbabilistic topic models for sentiment analysis on the Web
University of Exeter Department of Computer Science Probabilistic topic models for sentiment analysis on the Web Chenghua Lin September 2011 Submitted by Chenghua Lin, to the the University of Exeter as
More information