Towards running complex models on big data


Towards running complex models on big data
Working with all the genomes in the world without changing the model (too much)
Daniel Lawson
Heilbronn Institute, University of Bristol, 2013

Motivation
For large datasets, statistics doesn't work: estimates get worse as we get more data! (for linear compute)
Simple analytics can extract many useful features, e.g. PageRank [Page 1999], K-medians clustering, etc.
- Informative in practice, and still hard to get working!
- Exploit averaging over lots of data
But many interesting quantities are subtle... or local, so we only have a small amount of data about them.
There is always a place for models that more closely match reality.

Example applications
1. Genetics: model-based clustering for all the genomes in the world
(2. Cyber security: finding timing coincidences in event processes on a large graph)
In both cases we have a complex model per pair of objects, but can use a simpler model to find out whether they are at all close.
Linked by use of a similarity model. Other applications exist.

Similarity: a worked example
p(D | S) p(S | Q) p(Q)
[Diagram: data D (N x L), similarity S (N x N), structure Q (N x K)]
Compare N items about which we have a large amount of data D (trivial extension: to M = O(N) other items).
Similarity S(i, j) is computationally costly to evaluate.
S is structured by a model Q, i.e. the similarity model p(D | S) separates the data D from the structure model p(S | Q).
If rows of Q sum to 1 this is a mixture model; if only one element per row is non-zero it is a partitioning.
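
A minimal sketch of this structure (illustrative only, not from the talk): a membership matrix Q whose rows sum to one gives a mixture model, a one-hot Q gives a partitioning, and one simple way a structure model can induce similarity is through shared membership.

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 3  # toy sizes; the talk works with N in the thousands

# Mixture model: each row of Q sums to 1 (soft cluster memberships).
Q_mix = rng.dirichlet(alpha=np.ones(K), size=N)

# Partitioning: exactly one non-zero entry per row (hard assignment).
Q_part = np.eye(K)[rng.integers(0, K, size=N)]

# One simple way a structure model p(S | Q) can induce similarity:
# items are similar in proportion to their shared cluster membership.
S_expected = Q_mix @ Q_mix.T

print(Q_mix.sum(axis=1))   # rows sum to 1 -> mixture model
print(Q_part.sum(axis=1))  # rows sum to 1, single non-zero entry -> partitioning
print(S_expected.shape)    # (N, N) similarity structure
```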

Random or convenience filtering
See big data* as better sampling of data. Why not throw away elements from D?
- Convenience sampling: what can we actually measure?
- Systematic sampling: retain every n-th data point
- Simple random sampling: retain a fraction p
- Stratified sampling, etc.
For example: use L' < L, use N' < N.
Inference shouldn't get worse with more data!
*Big data: any data that can't be processed in memory on a single high-spec computer
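
The filtering options above can be written down directly; a minimal numpy sketch (sizes and names are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
N, L = 1000, 5000
D = rng.normal(size=(N, L))  # toy data matrix: N items, L features

# Systematic sampling: retain every n-th feature.
n = 10
D_systematic = D[:, ::n]

# Simple random sampling: retain each feature independently with probability p.
p = 0.1
D_random = D[:, rng.random(L) < p]

# Stratified sampling over items: take the same number from each known stratum.
strata = rng.integers(0, 5, size=N)
keep = np.concatenate([rng.choice(np.where(strata == s)[0], size=50, replace=False)
                       for s in range(5)])
D_stratified = D[keep]

print(D_systematic.shape, D_random.shape, D_stratified.shape)
```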

Emulated Likelihood Models (ELMs)
Fundamental idea: replace p(D | S, θ) in p(D | S, θ) p(θ) p(S | Q, φ) p(Q, φ) using S* computed and S̃ emulated similarities, s.t. S = S* ∪ S̃:
p̂(D | S, θ) = p(D | S* ∪ S̃, θ) = ∫ p(D | S* ∪ S̃, θ, ψ) p(S̃ | S*, ψ) dψ
- S_ij is costly to compute, and needed for S → Q
- But similarities are highly structured (e.g. clusters)
- So we can emulate S̃_ij rather than computing it
- Choose S* to be approximately sufficient for p(D | S, θ)
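
A rough illustration of the split S = S* ∪ S̃ (a sketch with assumed toy models, not the talk's genetics model): a few similarity rows are computed exactly from an expensive model, and the remaining entries are emulated from a cheap similarity computed on thinned data.

```python
import numpy as np

rng = np.random.default_rng(2)
N, L, L_thin = 200, 2000, 50
D = rng.normal(size=(N, L))

def expensive_similarity(i, j):
    # Stand-in for the costly per-pair model: correlation on the full data.
    return np.corrcoef(D[i], D[j])[0, 1]

# Cheap similarity from a thinned dataset D' with L' << L features.
S_cheap = np.corrcoef(D[:, :L_thin])

# Compute a small set of exact entries S* (T full rows).
T = 20
rows = rng.choice(N, size=T, replace=False)
S_star = {(i, j): expensive_similarity(i, j) for i in rows for j in range(N)}

# Emulate everything else: regression from cheap to exact similarity.
x = np.array([S_cheap[i, j] for (i, j) in S_star])
y = np.array(list(S_star.values()))
slope, intercept = np.polyfit(x, y, deg=1)

def emulated_similarity(i, j):
    # S~_ij: predicted instead of computed.
    return slope * S_cheap[i, j] + intercept

print(emulated_similarity(0, 1))
```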

The oracle
Which NT elements S* of S should we evaluate? A natural solution uses oracle knowledge of the true S:
- Use a loss function L(S*)
- Seek argmin_{S*(T)} L(S*(T))
Choices for L might be:
- KL divergence between the true and the emulated posterior
- A loss based on model usage (e.g. is the clustering correct? control the false positive rate, etc.)
Choose T based on acceptable loss.

Building blocks of a real Emulated Likelihood Model
The oracle is too costly. We instead must:
- Iteratively choose the next S_ij to add to S*
- From a limited set produced by a restriction operator R(S*)
- Construct an estimator L̂ for L
- Decide on the next point to evaluate using argmin_{S_ij} E[L̂(S* ∪ S_ij)]
- We can consider different histories to evaluate performance
Stopping rule: convergence of L̂
L̂ can be implicit from R, if it returns points ordered by E(L̂)
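
A minimal sketch of this loop (the restriction operator, loss estimator, and stopping threshold here are illustrative assumptions, not the talk's implementation):

```python
import numpy as np

rng = np.random.default_rng(3)
N = 300
S_cheap = np.corrcoef(rng.normal(size=(N, 40)))  # crude similarity guiding the choices

def expensive_row(i):
    # Stand-in for computing an exact row S*_i of similarities.
    return S_cheap[i] + rng.normal(scale=0.05, size=N)

def restriction(evaluated):
    # R(S*): only consider a small subset of the unevaluated rows.
    pool = np.setdiff1d(np.arange(N), evaluated)
    return rng.choice(pool, size=min(20, pool.size), replace=False)

def estimated_loss(candidate, evaluated):
    # L-hat: proxy loss, lower when the candidate is far from everything
    # already evaluated (so evaluating it should reduce the loss most).
    if len(evaluated) == 0:
        return 0.0
    return -np.min(np.linalg.norm(S_cheap[evaluated] - S_cheap[candidate], axis=1))

evaluated, S_star, history = [], {}, []
for t in range(50):
    candidates = restriction(evaluated)
    nxt = min(candidates, key=lambda c: estimated_loss(c, evaluated))
    S_star[int(nxt)] = expensive_row(nxt)
    history.append(estimated_loss(nxt, evaluated))
    evaluated.append(int(nxt))
    # Stopping rule: the estimated loss has (approximately) converged.
    if t > 5 and abs(history[-1] - history[-2]) < 1e-3:
        break

print(f"evaluated {len(evaluated)} of {N} rows")
```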

The emulator
A machine learning emulator, with the usual caveats: rarely optimal, rarely unbiased.
Prediction error estimated using online cross validation.
Respects computational constraints:
- L >> N: consider O(N² + LN) algorithms
- N >> L: consider O(LN) algorithms
- Massive data: consider O(LN^α + N) algorithms with α < 1.
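
One way to read "online cross validation" (a sketch of an assumed prequential scheme, not necessarily the one used in the talk): each newly computed exact row is first predicted by the current emulator, the prediction error is recorded, and only then is the row added to the training set.

```python
import numpy as np

rng = np.random.default_rng(4)
N = 200
S_cheap = np.corrcoef(rng.normal(size=(N, 30)))

def expensive_row(i):
    return S_cheap[i] + rng.normal(scale=0.05, size=N)

xs, ys, cv_errors = [], [], []
for i in rng.permutation(N)[:40]:
    truth = expensive_row(i)
    if len(xs) >= 2:
        # Predict the new row before training on it: an online estimate
        # of the emulator's prediction error, at no extra cost.
        slope, intercept = np.polyfit(np.concatenate(xs), np.concatenate(ys), deg=1)
        pred = slope * S_cheap[i] + intercept
        cv_errors.append(np.mean((pred - truth) ** 2))
    xs.append(S_cheap[i])
    ys.append(truth)

print("online CV error estimate:", np.mean(cv_errors))
```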

Fast finestructure - outline
[Diagram: emulator structure. Data D (N x L) gives computed rows S* (N x T) and the full similarity S (N x N); thinned data D' (N x L') gives a simple-model similarity S' (N x N); decision-model, thinning-model, simple-model and emulator-model parameters θ are set by empirical Bayes.]
Ŝ becomes the data for inferring Q:
[ p(D | Ŝ, θ̂) ] p(S | Q, φ) p(Q, φ)

Fast finestructure - decision and emulation
Decision R: choose the item to evaluate, i_t, as the point most distant from all evaluated points. We evaluate S in entire rows.
Emulation: predict S_i via a mixture model, S̃_i = MM(S*_i), and regression, Ŝ_ij = LM(S̃_ij, S'_ij).
Implicit loss function L: minimise the maximum prediction error.
- Finds outliers and clusters
- S* forms the convex hull of S
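
The "most distant" decision is a farthest-point (max-min) rule; a small sketch, using distances between rows of an assumed cheap similarity matrix as the metric:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 500
X = np.corrcoef(rng.normal(size=(N, 60)))  # cheap similarity rows as coordinates

def next_item(evaluated):
    # Farthest-point rule: pick the row whose minimum distance to the
    # already-evaluated rows is largest (finds outliers and new clusters first).
    dists = np.linalg.norm(X[:, None, :] - X[None, evaluated, :], axis=2)
    min_dist = dists.min(axis=1)
    min_dist[evaluated] = -np.inf  # never re-pick an evaluated row
    return int(np.argmax(min_dist))

evaluated = [0]
for _ in range(9):
    evaluated.append(next_item(evaluated))
print(evaluated)  # rows of S to compute exactly, in this order
```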

Fast finestructure - in practice
Computation of S costs O(N²L); the emulated approach costs O(N²L' + NT²L').
Current datasets: L = 10,000,000, N = 5000; with L' = 10,000 and T = 100 the predicted saving ratio is 100.
We can save up to a factor of 10000 by reducing T; what remains is the cost of the linearised model on the reduced dataset.
L will not grow beyond this, but N will - we can introduce an emulation step for S'.

Fast finestructure - Parallel MCMC algorithm
A parallel tempering algorithm for when MCMC parallelises poorly:
- Evaluate the unlinked model S'
- Master node: perform MCMC clustering to find Q̂_t using Ŝ_t, when there are t rows of S* computed
- Worker nodes compute S*_i in the order chosen by the master
Stopping rule: the posterior distribution of Q̂ converges
- No new information is added when increasing t
- (Or, if the MCMC is slower than the evaluation of S*, sometime afterwards)
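
A sketch of the master/worker division of labour described above (plain Python threads, a random evaluation order, and no real MCMC; purely illustrative):

```python
import numpy as np
from queue import Queue
from threading import Thread

rng = np.random.default_rng(6)
N = 200
S_cheap = np.corrcoef(rng.normal(size=(N, 30)))
work, done = Queue(), Queue()

def worker():
    # Worker nodes: compute exact similarity rows in the order chosen by the master.
    local_rng = np.random.default_rng()
    while True:
        i = work.get()
        if i is None:
            return
        done.put((i, S_cheap[i] + local_rng.normal(scale=0.05, size=N)))

threads = [Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()

for i in rng.permutation(N)[:20]:  # stand-in for the master's decision rule
    work.put(int(i))

S_star = {}
for _ in range(20):
    i, row = done.get()
    S_star[i] = row
    # Master node: here one would update the MCMC clustering Q-hat_t with the
    # t rows computed so far, and stop once its posterior has converged.

for _ in threads:
    work.put(None)
for t in threads:
    t.join()
print(f"computed {len(S_star)} exact rows")
```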

Emulated Likelihood Models for general Bayesian problems
General emulation for big (but not so big) problems:
p̂(D | S, θ) = ∫ p(D | S* ∪ S̃, θ) p(S̃ | S*, θ, ψ) dψ
- i.e. we can use θ to emulate S̃(θ), e.g. regression in (S, θ) space
- A Gaussian Process for S̃_ij(θ) is a natural choice
- If S̃ is an unbiased estimator of S, this is a pseudo-marginal approach (and hence targets the correct posterior)
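
A sketch of the Gaussian Process idea (assumed RBF kernel, toy one-dimensional θ, and a synthetic S_ij(θ); not the talk's model): emulate a similarity entry as a function of θ from a handful of exact evaluations.

```python
import numpy as np

rng = np.random.default_rng(7)

def true_Sij(theta):
    # Stand-in for an expensive similarity entry that depends on parameters theta.
    return np.sin(theta) + 0.05 * rng.normal()

def rbf(a, b, ell=1.0):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ell ** 2)

# A few exact evaluations of S_ij(theta).
theta_train = np.linspace(0.0, 5.0, 8)
y = np.array([true_Sij(t) for t in theta_train])

# GP posterior mean and variance at new theta (zero mean, RBF kernel, small noise).
theta_new = np.linspace(0.0, 5.0, 50)
K = rbf(theta_train, theta_train) + 1e-4 * np.eye(len(theta_train))
Ks = rbf(theta_new, theta_train)
mean = Ks @ np.linalg.solve(K, y)
var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)

print(mean[:5])
print(var[:5])
```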

Discussion
Goal: approximate answers to the right questions, using exact answers to the wrong questions.
- Machine learning is increasingly important for large datasets
- Statistical modelling still has a place
Proposed the Emulated Likelihood Model:
- Full statistical modelling
- Machine learning algorithms used for the calculation
- Statistical estimation of parameters is retained, but approximated