HT2015: SC4 Statistical Data Mining and Machine Learning



Dino Sejdinovic
Department of Statistics, Oxford
http://www.stats.ox.ac.uk/~sejdinov/sdmml.html

Bayesian Nonparametrics: Parametric vs Nonparametric Models
Parametric models have a fixed, finite number of parameters, regardless of the dataset size. In the Bayesian setting, given the parameter vector θ, the predictions are independent of the data D:
p(x, θ | D) = p(θ | D) p(x | θ).
Parameters can be thought of as a data summary: all information flows from the data to the predictions through the parameters. Model-based learning (e.g., a mixture of K multivariate normals).
Nonparametric models allow the number of parameters to grow with the dataset size. Equivalently, the predictions depend on the data directly (as well as on the hyperparameters). Memory-based learning (e.g., kernel density estimation).
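To make the memory-based point concrete, here is a minimal sketch (my own illustration, not from the slides) of kernel density estimation: the predictive density is computed directly from all stored data points rather than through a fixed finite parameter vector. The Gaussian kernel and the bandwidth are assumptions.

```python
import numpy as np

# Minimal KDE sketch: "memory-based" learning, predictions use the data directly.
def kde(x_query, data, bandwidth=0.3):
    """Gaussian kernel density estimate evaluated at each point of x_query."""
    diffs = (x_query[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs**2) / np.sqrt(2 * np.pi)
    return kernels.mean(axis=1) / bandwidth

data = np.random.default_rng(0).normal(size=100)   # stored "training" sample
grid = np.linspace(-4, 4, 200)
density = kde(grid, data)                          # density estimate on a grid
```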

Bayesian Nonparametrics: Dirichlet Process
We have seen that a conjugate prior over a probability mass function (π_1, ..., π_K) is a Dirichlet distribution Dir(α_1, ..., α_K). Can we create a prior over probability distributions on ℝ?
Dirichlet process DP(α, H), with concentration parameter α > 0 and base distribution H, a probability distribution on ℝ.
A random probability distribution F is said to follow a Dirichlet process if, when restricted to any finite partition, it has a Dirichlet distribution, i.e., for any partition A_1, ..., A_K of ℝ,
(F(A_1), ..., F(A_K)) ~ Dir(αH(A_1), ..., αH(A_K)).
[Figure: a partition A_1, A_2, A_3, ..., A_K of the real line.]
The stick-breaking construction allows us to draw from a Dirichlet process:
1. Draw s_1, s_2, ... i.i.d. from H.
2. Draw v_1, v_2, ... i.i.d. from Beta(1, α).
3. Set w_1 = v_1, w_2 = v_2(1 − v_1), ..., w_j = v_j ∏_{l=1}^{j−1} (1 − v_l), ...
Then ∑_{l=1}^{∞} w_l δ_{s_l} ~ DP(α, H).
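As a concrete illustration of the stick-breaking construction, here is a short sampler; the truncation level T of the infinite sum, the seed, and the choice H = N(0, 1) are my own assumptions, not part of the lecture.

```python
import numpy as np

# Truncated stick-breaking draw from DP(alpha, H), with H = N(0, 1) assumed.
def stick_breaking_draw(alpha, T=1000, seed=0):
    """Return atoms s_l and weights w_l of a (truncated) draw F ~ DP(alpha, H)."""
    rng = np.random.default_rng(seed)
    s = rng.normal(0.0, 1.0, size=T)                 # step 1: atoms s_l ~ H
    v = rng.beta(1.0, alpha, size=T)                 # step 2: v_l ~ Beta(1, alpha)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))  # step 3: w_l = v_l * prod_{k<l}(1 - v_k)
    return s, w / w.sum()                            # renormalise the truncated weights

atoms, weights = stick_breaking_draw(alpha=10.0)
# F is the discrete distribution sum_l weights[l] * delta_{atoms[l]}; sample from it:
x = np.random.default_rng(1).choice(atoms, size=500, p=weights)
```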

Bayesian Nonparametrics: Dirichlet Process and a Posterior over Distributions
Given data D = {x_i}_{i=1}^n drawn i.i.d. from F, x_i ∈ ℝ^p, we put a prior DP(α, H) on F.
The posterior p(F | D) is DP(α + n, H̄), where
H̄ = n/(n+α) · F̂ + α/(n+α) · H, and F̂ = (1/n) ∑_{i=1}^n δ_{x_i} is the empirical distribution.
But how do we reason about this posterior? Answer: sample from it!
[Figure: top left: a draw from DP(10, N(0, 1)); top right: resulting cdf; bottom left: draws from the posterior based on n = 25 observations from a N(5, 1) distribution (red); bottom right: Bayesian posterior mean (blue), empirical cdf (black). Example and figure by L. Wasserman.]
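Following the "sample from it" advice, here is a sketch of drawing from the posterior DP(α + n, H̄); it reuses the stick-breaking idea and assumes, as in the figure, H = N(0, 1), α = 10, and Gaussian toy data. Sampling an atom from H̄ amounts to picking an observed data point with probability n/(n + α) and a fresh draw from H otherwise.

```python
import numpy as np

# Sketch: one draw from the posterior DP(alpha + n, H_bar), truncated at T atoms.
def posterior_dp_draw(x_obs, alpha, T=1000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(x_obs)
    # Atoms from H_bar = n/(n+alpha) * F_hat + alpha/(n+alpha) * H, with H = N(0, 1).
    from_data = rng.random(T) < n / (n + alpha)
    atoms = np.where(from_data, rng.choice(x_obs, size=T), rng.normal(0.0, 1.0, size=T))
    v = rng.beta(1.0, alpha + n, size=T)             # stick proportions for concentration alpha + n
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))
    return atoms, w / w.sum()

x_obs = np.random.default_rng(1).normal(5.0, 1.0, size=25)   # n = 25 observations, as in the figure
atoms, weights = posterior_dp_draw(x_obs, alpha=10.0)
```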

Bayesian Nonparametrics: Dirichlet Process Mixture Models
In mixture models for clustering, we had to pick the number of clusters K. Can we automatically infer K from the data? Just use an infinite mixture model:
g(x) = ∑_{k=1}^{∞} π_k p(x | θ_k).
The following generative process defines an implicit prior on g:
1. Draw F ~ DP(α, H).
2. Draw θ_1, ..., θ_n | F i.i.d. from F.
3. Draw x_i | θ_i ~ p(· | θ_i).
F is discrete, so the θ_i will contain ties, and the ties form clusters. The posterior distribution is more involved, but can be sampled from [1].
[1] Radford Neal, 2000: Markov Chain Sampling Methods for Dirichlet Process Mixture Models.
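A small simulation of this generative process (my own sketch; it assumes univariate Gaussian components with unit variance, a base distribution H = N(0, 3²) over component means, and a truncated stick-breaking draw of F):

```python
import numpy as np

# Generative process of a DP mixture: F ~ DP(alpha, H), theta_i ~ F, x_i ~ p(. | theta_i).
def dp_mixture_sample(n, alpha=1.0, T=500, seed=0):
    rng = np.random.default_rng(seed)
    means = rng.normal(0.0, 3.0, size=T)                        # atoms of F drawn from H
    v = rng.beta(1.0, alpha, size=T)
    w = v * np.concatenate(([1.0], np.cumprod(1.0 - v)[:-1]))   # stick-breaking weights
    theta = rng.choice(means, size=n, p=w / w.sum())            # theta_i ~ F (discrete, so ties occur)
    x = rng.normal(theta, 1.0)                                  # x_i | theta_i ~ N(theta_i, 1)
    return x, theta

x, theta = dp_mixture_sample(n=200, alpha=1.0)
num_clusters = len(np.unique(theta))   # clusters = distinct theta values among the ties
```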

Regression
We are given a dataset D = {(x_i, y_i)}_{i=1}^n, x_i ∈ ℝ^p, y_i ∈ ℝ.
Regression: learn the underlying real-valued function f(x).

Different Flavours of Regression
We can model the response y_i as a noisy version of the underlying function f evaluated at input x_i:
y_i | f, x_i ~ N(f(x_i), σ²).
Appropriate loss: L(y, f(x)) = (y − f(x))².
Frequentist parametric approach: model f as f_θ for some parameter vector θ; fit θ by ML / ERM with squared loss (linear regression).
Frequentist nonparametric approach: model f as an unknown parameter taking values in an infinite-dimensional space of functions; fit f by regularized ML / ERM with squared loss (kernel ridge regression).
Bayesian parametric approach: model f as f_θ for some parameter vector θ; put a prior on θ and compute the posterior p(θ | D) (Bayesian linear regression).
Bayesian nonparametric approach: treat f as a random variable taking values in an infinite-dimensional space of functions; put a prior over functions f ∈ F and compute the posterior p(f | D) (Gaussian process regression).
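To make one of these flavours concrete, here is a minimal sketch of kernel ridge regression (the frequentist nonparametric flavour); the RBF kernel, its lengthscale, the regularisation parameter, and the toy data are my own assumptions:

```python
import numpy as np

# Kernel ridge regression: regularized ERM with squared loss over a space of functions.
def rbf_kernel(a, b, lengthscale=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def kernel_ridge_fit_predict(x, y, x_test, lam=0.1):
    K = rbf_kernel(x, x)
    alpha = np.linalg.solve(K + lam * np.eye(len(x)), y)   # dual coefficients
    return rbf_kernel(x_test, x) @ alpha                   # fitted f evaluated at x_test

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=30)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(30)
x_test = np.linspace(0, 1, 100)
f_hat = kernel_ridge_fit_predict(x, y, x_test)
```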

Just work with the function values at the inputs: f = (f(x_1), ..., f(x_n)).
What properties of the function can we incorporate?
Multivariate normal prior on f: f ~ N(0, K). Use a kernel function k to define K: K_ij = k(x_i, x_j). The prior p(f) encodes our prior knowledge about the function.
Expect regression functions to be smooth: if x and x′ are close by, then f(x) and f(x′) have similar values, i.e. they are strongly correlated:
(f(x), f(x′))ᵀ ~ N( (0, 0)ᵀ, [ k(x, x), k(x, x′) ; k(x′, x), k(x′, x′) ] ).
In particular, we want k(x, x′) ≈ k(x, x) = k(x′, x′) when x and x′ are close.
Model: f ~ N(0, K), y_i | f_i ~ N(f_i, σ²).
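To make the smoothness intuition concrete, here is a small check (my own illustration; the squared-exponential kernel and its lengthscale are assumptions, since the slide does not fix a particular k): nearby inputs get strongly correlated function values under the prior, distant inputs do not.

```python
import numpy as np

# Prior correlation of f(x) and f(x') under an assumed squared-exponential kernel.
def k(x, x_prime, lengthscale=0.2):
    return np.exp(-0.5 * (x - x_prime)**2 / lengthscale**2)

for x, x_prime in [(0.50, 0.52), (0.50, 0.90)]:
    cov = np.array([[k(x, x),       k(x, x_prime)],
                    [k(x_prime, x), k(x_prime, x_prime)]])
    corr = cov[0, 1] / np.sqrt(cov[0, 0] * cov[1, 1])
    print(f"x = {x}, x' = {x_prime}: prior corr(f(x), f(x')) = {corr:.3f}")
# close inputs -> correlation near 1; distant inputs -> correlation near 0
```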

What does a multivariate normal prior mean?
Imagine the x_i form an infinitesimally dense grid of the data space. Simulate prior draws f ~ N(0, K) and plot f_i vs x_i for i = 1, ..., n.
The corresponding prior over functions is called a Gaussian process (GP): any finite collection of its evaluations has a (multivariate) Gaussian distribution.
[Figure: several draws from a GP prior over inputs in [0, 1].]
http://www.gaussianprocess.org/
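A short sketch of exactly this experiment (my own code; the RBF kernel, its lengthscale, and the jitter added for numerical stability are assumptions):

```python
import numpy as np
import matplotlib.pyplot as plt

# Simulate draws f ~ N(0, K) on a dense grid and plot f_i against x_i.
n = 200
x = np.linspace(0, 1, n)
lengthscale = 0.1
K = np.exp(-0.5 * (x[:, None] - x[None, :])**2 / lengthscale**2)
K += 1e-8 * np.eye(n)                                        # jitter for numerical stability

rng = np.random.default_rng(0)
f_draws = rng.multivariate_normal(np.zeros(n), K, size=5)    # five functions from the GP prior

for f in f_draws:
    plt.plot(x, f)
plt.xlabel("x"); plt.ylabel("f(x)"); plt.title("Draws from a GP prior")
plt.show()
```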

Carl Rasmussen, tutorial on Gaussian processes at NIPS 2006.
Bayesian Learning
Different kernels lead to different function characteristics.

Bayesian Learning
Model: f | x ~ N(0, K), y | f ~ N(f, σ²I).
Posterior distribution:
f | y ~ N( K(K + σ²I)⁻¹ y, K − K(K + σ²I)⁻¹ K ).
Posterior predictive distribution: suppose x* is a set of test inputs. We can extend the model to include the function values f* at the test inputs:
(f, f*) | x, x* ~ N( 0, [ K_xx, K_xx* ; K_x*x, K_x*x* ] ),   y | f ~ N(f, σ²I),
where K_xx* is the matrix with (i, j)-th entry k(x_i, x*_j). Some manipulation of multivariate normals gives:
f* | y ~ N( K_x*x (K_xx + σ²I)⁻¹ y, K_x*x* − K_x*x (K_xx + σ²I)⁻¹ K_xx* ).
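A compact implementation of the posterior predictive formula above (my own sketch; the RBF kernel, the noise level σ², and the toy data are assumptions):

```python
import numpy as np

# GP regression: posterior predictive mean and covariance at test inputs x_star.
def rbf(a, b, lengthscale=0.1):
    return np.exp(-0.5 * (a[:, None] - b[None, :])**2 / lengthscale**2)

def gp_posterior(x, y, x_star, sigma2=0.01):
    K_xx = rbf(x, x) + sigma2 * np.eye(len(x))    # K_xx + sigma^2 I
    K_sx = rbf(x_star, x)                         # K_{x* x}
    K_ss = rbf(x_star, x_star)                    # K_{x* x*}
    mean = K_sx @ np.linalg.solve(K_xx, y)
    cov = K_ss - K_sx @ np.linalg.solve(K_xx, K_sx.T)
    return mean, cov

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=20)
y = np.sin(2 * np.pi * x) + 0.1 * rng.standard_normal(20)
x_star = np.linspace(0, 1, 100)
mean, cov = gp_posterior(x, y, x_star)
std = np.sqrt(np.clip(np.diag(cov), 0.0, None))   # pointwise predictive standard deviation
```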

Bayesian Learning
[Figure: GP regression example plots.]

Bayesian Learning
GP regression demo: http://www.tmpl.fi/gp/

Summary
A whirlwind journey through data mining and machine learning techniques:
Unsupervised learning: PCA, MDS, Isomap, hierarchical clustering, K-means, mixture modelling, the EM algorithm, Dirichlet process mixtures.
Supervised learning: LDA, QDA, naïve Bayes, logistic regression, SVMs, kernel methods, k-NN, deep neural networks, Gaussian processes, decision trees, and ensemble methods: random forests, bagging, stacking, dropout and boosting.
Conceptual frameworks: prediction, performance evaluation, generalization, overfitting, regularization, model complexity, validation and cross-validation, the bias-variance tradeoff.
Theory: decision theory, statistical learning theory, convex optimization, Bayesian vs. frequentist learning, parametric vs. nonparametric learning.
Further resources: Machine Learning Summer Schools, videolectures.net. Conferences: NIPS, ICML, UAI, AISTATS. Mailing list: ml-news.
Thank You!