Dealing with large datasets

Dealing with large datasets (by throwing away most of the data). Alan Heavens, Institute for Astronomy, University of Edinburgh, with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor Whittley. Data-Intensive Research Workshop, NeSC, 15 March 2010.

Data. Data: a list of quantitative measurements. Write the data as a vector x. The data have errors (referred to as noise) n.

Modelling. In this talk, I will assume there is a good model for the data and for the noise. Model: a theoretical framework, typically with parameters in it. Forward modelling: given a model and values for its parameters, we calculate the expected value of the data: μ = ⟨x⟩. Key quantity: the noise covariance matrix, C_ij = ⟨n_i n_j⟩.
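The forward-modelling step above can be sketched as follows. This is an illustrative toy model (the function `forward_model`, its two parameters and the wavelength grid are assumptions for the example, not the talk's code): parameters in, expected data μ out, with noise of known covariance added on top.

```python
import numpy as np

# Toy forward model: predict the expected data mu from parameters,
# then simulate observed data x = mu + noise with covariance C.
rng = np.random.default_rng(3)

def forward_model(age, mass, wavelengths):
    # Hypothetical two-parameter "spectrum": amplitude set by mass,
    # shape set by age.
    return mass * np.exp(-wavelengths / age)

wl = np.linspace(1.0, 5.0, 8)
mu = forward_model(age=2.0, mass=1.5, wavelengths=wl)  # expected data, mu = <x>
sigma = 0.05
x = mu + rng.normal(0.0, sigma, size=wl.size)          # data = model + noise n
C = sigma**2 * np.eye(wl.size)                          # noise covariance C_ij = <n_i n_j>
print(x)
```

Here the noise is uncorrelated, so C is diagonal; correlated noise would simply fill in the off-diagonal entries.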

Examples. Galaxy spectra: the flux measurements are the summed starlight from stars of a given age. Simplest model: 2 parameters (age, mass).

Example: cosmology. Model: Big Bang theory. Parameters: expansion rate, density of ordinary matter, density of dark matter, dark energy content... (around 15 parameters).

Inverse problem. Parameter estimation: given some data and a model, what are the most likely values of the parameters, and what are their errors? Model selection: given some data, which is the most likely model? (Big Bang vs Steady State, or a braneworld model.)

Best-fitting parameters. Usually minimise a penalty function. For Gaussian errors, and no prior prejudice about the parameters, minimise χ². For independent data points, χ² = Σ_i (x_i − μ_i)² / σ_i²; for correlated noise, χ² = Σ_{i,j} (x_i − μ_i) C⁻¹_ij (x_j − μ_j), where C_ij = ⟨n_i n_j⟩. Brute-force minimisation may be slow: the dataset size N may be large (N, N² or N³ scaling), or the parameter space may be large (exponential dependence).
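A minimal sketch of the correlated-noise χ² above (the covariance and data here are made up for illustration). Solving the linear system avoids forming C⁻¹ explicitly, which matters when N is large:

```python
import numpy as np

# chi^2 = (x - mu)^T C^{-1} (x - mu) for correlated Gaussian noise.
rng = np.random.default_rng(0)
N = 5
mu = np.linspace(1.0, 2.0, N)                  # model prediction
C = 0.1 * np.eye(N) + 0.02                     # covariance: correlated noise
x = mu + rng.multivariate_normal(np.zeros(N), C)  # simulated data

r = x - mu
chi2 = r @ np.linalg.solve(C, r)               # solve C z = r, then r.z
print(chi2)
```

For N data points and a good model, χ² should come out of order N; minimising it over the model parameters gives the best fit.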

Dealing with large parameter spaces. Don't explore it all: generate a chain of random points in parameter space. The most common technique is MCMC (Markov Chain Monte Carlo). Asymptotically, the density of points is proportional to the posterior probability of the parameters. This generates a posterior distribution: prob(parameters | data). (Most astronomical analysis is Bayesian.) Variants: Hamiltonian Monte Carlo, Nested Sampling, Gibbs Sampling...
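The MCMC idea can be shown in a few lines with a Metropolis-Hastings sampler. This is a generic one-parameter sketch (the data, step size and chain length are arbitrary choices), not code from the talk:

```python
import numpy as np

# Metropolis-Hastings: sample the posterior of a mean parameter theta
# given Gaussian data with known unit noise and a flat prior.
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=100)

def log_post(theta):
    # flat prior, so log posterior = log likelihood (up to a constant)
    return -0.5 * np.sum((data - theta) ** 2)

theta = 0.0
chain = []
for _ in range(5000):
    prop = theta + rng.normal(0.0, 0.3)        # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                            # accept; otherwise stay put
    chain.append(theta)

chain = np.array(chain[1000:])                  # discard burn-in
print(chain.mean())
```

Asymptotically the chain density traces the posterior, so the chain mean converges to the posterior mean (here, the sample mean of the data).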

Large data sets. What scope is there to reduce the size of the dataset? Why? Faster analysis. Can we do this without losing accuracy? Depending on where the information is coming from, the answer is often yes.

Karhunen-Loève compression. If the information comes from the scatter about the mean, then if the data have variable noise, or are correlated, we can perform linear compression: y = B x. Construct the matrix B so that the elements of y are uncorrelated, and ordered in increasing uselessness; the modes' information then adds: 1/σ² = Σ_{modes k} 1/σ_k². Only limited compression is usually possible (Vogeley & Szalay 1995; Tegmark, Taylor & Heavens 1997). Quadratic compression is more effective.

Fisher Matrix. How is B determined? The key concept is the Fisher matrix, which gives the expected parameter errors. Steps: diagonalise the covariance matrix C of the data; divide each mode by its r.m.s. (so now C = I); rotate again until each mode gives uncorrelated information on a parameter, by solving the generalised eigenvalue problem (∂C/∂p_i) b = λ C b; order the modes, and throw the worst ones away.
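The eigenvalue step can be sketched numerically. This assumes a known noise covariance C and its derivative ∂C/∂p with respect to one parameter (both random here, purely for illustration); SciPy's `eigh` solves the generalised problem directly:

```python
import numpy as np
from scipy.linalg import eigh

# Generalised eigenproblem (dC/dp) b = lambda C b; keep the modes with
# the largest |lambda|, i.e. the most information per mode.
rng = np.random.default_rng(2)
N = 6
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)        # noise covariance (positive definite)
D = rng.normal(size=(N, N))
dC = D + D.T                       # stand-in for dC/dp (symmetric)

lam, B = eigh(dC, C)               # solves dC b = lam C b
order = np.argsort(-np.abs(lam))   # most informative modes first
B_kept = B[:, order[:3]]           # keep 3 modes, throw the rest away
y = B_kept.T @ rng.normal(size=N)  # compressed data y = B^T x
print(lam[order])
```

Discarding the small-|λ| modes is the "throw the worst ones away" step: those modes carry almost no information about the parameter.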

Massive Data Compression. If the information comes from the mean, rather than the scatter, much more radical surgery is possible. A dataset of size N can be reduced to size M (the number of parameters), sometimes without loss of accuracy. This can be a massive reduction: e.g. a galaxy spectrum goes from 2000 numbers to 10; the microwave background power spectrum from 2000 to 15. Likelihood calculations are at least N/M times faster.

MOPED* algorithm (patented). Consider a weight vector: y_1 = b_1 · x. Choose b_1 such that the likelihood of y_1 is as sharply peaked as possible in the direction of parameter 1. Repeat (subject to some constraints) for all M parameters. The dataset is reduced to size M, independent of N. (*Massively-Optimised Parameter Estimation and Data compression.)

MOPED weighting vectors. MOPED automatically calculates the optimum weights for each data point. In many cases, the errors from the compressed dataset are no larger than those from the entire dataset. It is NOT obvious that this is possible. Example: a set of data points from which you want to estimate the mean (of the population from which the sample is drawn). If all errors are the same, then b = (1/N, 1/N, ...), i.e. average the data.
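A sketch of a single MOPED-style weight vector, using the published form b ∝ C⁻¹ ∂μ/∂θ (normalised); the function name and test values here are illustrative, not the patented implementation. For the slide's mean-estimation example with equal, uncorrelated errors, the weight comes out uniform, i.e. it averages the data:

```python
import numpy as np

def moped_weight(C, dmu):
    # Weight for one parameter: b = C^{-1} dmu / sqrt(dmu^T C^{-1} dmu),
    # where dmu is the derivative of the model mean w.r.t. the parameter.
    Cinv_dmu = np.linalg.solve(C, dmu)
    return Cinv_dmu / np.sqrt(dmu @ Cinv_dmu)

# Mean-estimation check: equal errors => uniform weights.
N = 4
C = np.eye(N)                 # identical, uncorrelated errors
dmu = np.ones(N)              # d(mean model)/d(mean) = 1 for every point
b = moped_weight(C, dmu)      # uniform vector, here (1/2, 1/2, 1/2, 1/2)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(b, b @ x)               # b.x is proportional to the sample mean
```

With unequal errors or correlated noise, the same formula automatically down-weights the noisy points, which is exactly the "optimum weights" claim on the slide.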

Examples: galaxy spectra. ~100,000 galaxies. Data compressed by 99%. Analysis time reduced from 400 years to a few weeks. [Figure: MOPED weighting vectors]

Medical imaging: registration. Stroke lesion MRI scans: 512 × 512 × 100 voxels, so N ≈ 2.6 × 10⁷. Affine distortions: M = 12. [Figure: image distortions]

Summary. Astronomical datasets can be large, but the set of interesting quantities may be small. With a good model for the data, carefully-designed (and massive) data compression can hugely speed up analysis, with no loss of accuracy. Such a situation is quite typical, with applications elsewhere: see the Blackford Analysis stand in the Research Village.