Dealing with large datasets

Dealing with large datasets (by throwing away most of the data). Alan Heavens, Institute for Astronomy, University of Edinburgh, with Ben Panter, Rob Tweedie, Mark Bastin, Will Hossack, Keith McKellar, Trevor Whittley. Data-Intensive Research Workshop, NeSC, 15 March 2010.

Data. Data: a list of quantitative measurements. Write the data as a vector x. The data have errors (referred to as noise) n.

Modelling. In this talk, I will assume there is a good model for the data and for the noise. Model: a theoretical framework, typically with parameters in it. Forward modelling: given a model and values for its parameters, we calculate the expected value of the data: μ = ⟨x⟩. Key quantity: the noise covariance matrix, C_ij = ⟨n_i n_j⟩.
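The forward-modelling step above can be sketched as follows. This is an illustrative toy model (the function `forward_model`, its two parameters and the wavelength grid are assumptions for the example, not the talk's code): parameters in, expected data μ out, with noise of known covariance added on top.

```python
import numpy as np

# Toy forward model: predict the expected data mu from parameters,
# then simulate observed data x = mu + noise with covariance C.
rng = np.random.default_rng(3)

def forward_model(age, mass, wavelengths):
    # Hypothetical two-parameter "spectrum": amplitude set by mass,
    # shape set by age.
    return mass * np.exp(-wavelengths / age)

wl = np.linspace(1.0, 5.0, 8)
mu = forward_model(age=2.0, mass=1.5, wavelengths=wl)  # expected data, mu = <x>
sigma = 0.05
x = mu + rng.normal(0.0, sigma, size=wl.size)          # data = model + noise n
C = sigma**2 * np.eye(wl.size)                          # noise covariance C_ij = <n_i n_j>
print(x)
```

Here the noise is uncorrelated, so C is diagonal; correlated noise would simply fill in the off-diagonal entries.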

Examples. Galaxy spectra: the flux measurements are the summed starlight from stars of a given age. Simplest model: 2 parameters (age, mass).

Example: cosmology. Model: Big Bang theory. Parameters: expansion rate, density of ordinary matter, density of dark matter, dark energy content... (around 15 parameters).

Inverse problem. Parameter estimation: given some data and a model, what are the most likely values of the parameters, and what are their errors? Model selection: given some data, which is the most likely model? (Big Bang vs Steady State, or a braneworld model.)

Best-fitting parameters. Usually minimise a penalty function. For Gaussian errors, and no prior prejudice about the parameters, minimise χ². For independent data points, χ² = Σ_i (x_i − μ_i)² / σ_i²; for correlated noise, χ² = Σ_{i,j} (x_i − μ_i) C⁻¹_ij (x_j − μ_j), where C_ij = ⟨n_i n_j⟩. Brute-force minimisation may be slow: the dataset size N may be large (N, N² or N³ scaling), or the parameter space may be large (exponential dependence).
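A minimal sketch of the correlated-noise χ² above (the covariance and data here are made up for illustration). Solving the linear system avoids forming C⁻¹ explicitly, which matters when N is large:

```python
import numpy as np

# chi^2 = (x - mu)^T C^{-1} (x - mu) for correlated Gaussian noise.
rng = np.random.default_rng(0)
N = 5
mu = np.linspace(1.0, 2.0, N)                  # model prediction
C = 0.1 * np.eye(N) + 0.02                     # covariance: correlated noise
x = mu + rng.multivariate_normal(np.zeros(N), C)  # simulated data

r = x - mu
chi2 = r @ np.linalg.solve(C, r)               # solve C z = r, then r.z
print(chi2)
```

For N data points and a good model, χ² should come out of order N; minimising it over the model parameters gives the best fit.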

Dealing with large parameter spaces. Don't explore it all: generate a chain of random points in parameter space. The most common technique is MCMC (Markov Chain Monte Carlo). Asymptotically, the density of points is proportional to the posterior probability of the parameters. This generates a posterior distribution: prob(parameters | data). (Most astronomical analysis is Bayesian.) Variants: Hamiltonian Monte Carlo, Nested Sampling, Gibbs Sampling...
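The MCMC idea can be shown in a few lines with a Metropolis-Hastings sampler. This is a generic one-parameter sketch (the data, step size and chain length are arbitrary choices), not code from the talk:

```python
import numpy as np

# Metropolis-Hastings: sample the posterior of a mean parameter theta
# given Gaussian data with known unit noise and a flat prior.
rng = np.random.default_rng(1)
data = rng.normal(3.0, 1.0, size=100)

def log_post(theta):
    # flat prior, so log posterior = log likelihood (up to a constant)
    return -0.5 * np.sum((data - theta) ** 2)

theta = 0.0
chain = []
for _ in range(5000):
    prop = theta + rng.normal(0.0, 0.3)        # symmetric random-walk proposal
    if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
        theta = prop                            # accept; otherwise stay put
    chain.append(theta)

chain = np.array(chain[1000:])                  # discard burn-in
print(chain.mean())
```

Asymptotically the chain density traces the posterior, so the chain mean converges to the posterior mean (here, the sample mean of the data).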

Large data sets. What scope is there to reduce the size of the dataset? Why? Faster analysis. Can we do this without losing accuracy? Depending on where the information is coming from, the answer is often yes.

Karhunen-Loève compression. If the information comes from the scatter about the mean, then if the data have variable noise, or are correlated, we can perform linear compression: y = B x. Construct the matrix B so that the elements of y are uncorrelated, and ordered in increasing uselessness; the modes' information then adds: 1/σ² = Σ_{modes k} 1/σ_k². Only limited compression is usually possible (Vogeley & Szalay 1995; Tegmark, Taylor & Heavens 1997). Quadratic compression is more effective.

Fisher Matrix. How is B determined? The key concept is the Fisher matrix, which gives the expected parameter errors. Steps: diagonalise the covariance matrix C of the data; divide each mode by its r.m.s. (so now C = I); rotate again until each mode gives uncorrelated information on a parameter, by solving the generalised eigenvalue problem (∂C/∂p_i) b = λ C b; order the modes, and throw the worst ones away.
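The eigenvalue step can be sketched numerically. This assumes a known noise covariance C and its derivative ∂C/∂p with respect to one parameter (both random here, purely for illustration); SciPy's `eigh` solves the generalised problem directly:

```python
import numpy as np
from scipy.linalg import eigh

# Generalised eigenproblem (dC/dp) b = lambda C b; keep the modes with
# the largest |lambda|, i.e. the most information per mode.
rng = np.random.default_rng(2)
N = 6
A = rng.normal(size=(N, N))
C = A @ A.T + N * np.eye(N)        # noise covariance (positive definite)
D = rng.normal(size=(N, N))
dC = D + D.T                       # stand-in for dC/dp (symmetric)

lam, B = eigh(dC, C)               # solves dC b = lam C b
order = np.argsort(-np.abs(lam))   # most informative modes first
B_kept = B[:, order[:3]]           # keep 3 modes, throw the rest away
y = B_kept.T @ rng.normal(size=N)  # compressed data y = B^T x
print(lam[order])
```

Discarding the small-|λ| modes is the "throw the worst ones away" step: those modes carry almost no information about the parameter.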

Massive Data Compression. If the information comes from the mean, rather than the scatter, much more radical surgery is possible. A dataset of size N can be reduced to size M (the number of parameters), sometimes without loss of accuracy. This can be a massive reduction: e.g. a galaxy spectrum goes from 2000 numbers to 10; the microwave background power spectrum from 2000 to 15. Likelihood calculations are at least N/M times faster.

MOPED* algorithm (patented). Consider a weight vector: y_1 = b_1 · x. Choose b_1 such that the likelihood of y_1 is as sharply peaked as possible in the direction of parameter 1. Repeat (subject to some constraints) for all M parameters. The dataset is reduced to size M, independent of N. (*Massively-Optimised Parameter Estimation and Data compression.)

MOPED weighting vectors. MOPED automatically calculates the optimum weights for each data point. In many cases, the errors from the compressed dataset are no larger than those from the entire dataset. It is NOT obvious that this is possible. Example: a set of data points from which you want to estimate the mean (of the population from which the sample is drawn). If all errors are the same, then b = (1/N, 1/N, ...), i.e. average the data.
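A sketch of a single MOPED-style weight vector, using the published form b ∝ C⁻¹ ∂μ/∂θ (normalised); the function name and test values here are illustrative, not the patented implementation. For the slide's mean-estimation example with equal, uncorrelated errors, the weight comes out uniform, i.e. it averages the data:

```python
import numpy as np

def moped_weight(C, dmu):
    # Weight for one parameter: b = C^{-1} dmu / sqrt(dmu^T C^{-1} dmu),
    # where dmu is the derivative of the model mean w.r.t. the parameter.
    Cinv_dmu = np.linalg.solve(C, dmu)
    return Cinv_dmu / np.sqrt(dmu @ Cinv_dmu)

# Mean-estimation check: equal errors => uniform weights.
N = 4
C = np.eye(N)                 # identical, uncorrelated errors
dmu = np.ones(N)              # d(mean model)/d(mean) = 1 for every point
b = moped_weight(C, dmu)      # uniform vector, here (1/2, 1/2, 1/2, 1/2)
x = np.array([1.0, 2.0, 3.0, 4.0])
print(b, b @ x)               # b.x is proportional to the sample mean
```

With unequal errors or correlated noise, the same formula automatically down-weights the noisy points, which is exactly the "optimum weights" claim on the slide.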

Examples: galaxy spectra. ~100,000 galaxies. Data compressed by 99%. Analysis time reduced from 400 years to a few weeks. [Figure: MOPED weighting vectors]

Medical imaging: registration. Stroke lesion MRI scans: 512 × 512 × 100 voxels, so N ≈ 2.6 × 10⁷. Affine distortions: M = 12. [Figure: image distortions]

Summary. Astronomical datasets can be large, but the set of interesting quantities may be small. With a good model for the data, carefully-designed (and massive) data compression can hugely speed up analysis, with no loss of accuracy. Such a situation is quite typical, with applications elsewhere: see the Blackford Analysis stand in the Research Village.