Multivariate analyses & decoding Kay Henning Brodersen Computational Neuroeconomics Group Department of Economics, University of Zurich Machine Learning and Pattern Recognition Group Department of Computer Science, ETH Zurich http://people.inf.ethz.ch/bkay
Why multivariate? Multivariate approaches simultaneously consider brain activity in many locations. PET prediction: Haxby et al. (2001) Science; Lautrup et al. (1994) Supercomputing in Brain Research
Why multivariate? Multivariate approaches can utilize information jointly encoded in multiple voxels. This is because multivariate distance measures can account for correlations between voxels.
Why multivariate? Multivariate approaches can exploit a sampling bias in voxelized images to reveal interesting activity on a subvoxel scale. Boynton (2005) Nature Neuroscience
Why multivariate? Multivariate approaches can utilize hidden quantities such as coupling strengths. [Figure: driving inputs u, neural population activity z, and resulting fMRI signal change (%) over time, generated by a dynamic causal model.] Neural state equation: ż = (A + Σ_{i=1}^{m} u_i B^(i) + Σ_{j=1}^{n} z_j D^(j)) z + C u. Stephan et al. (2008) NeuroImage
Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches
Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches
Modelling terminology 1 Encoding vs. decoding An encoding model (or generative model) relates context to brain activity. A decoding model (or recognition model) relates brain activity to context. Encoding model g: X_t → Y_t; decoding model h: Y_t → X_t, where the context (condition, stimulus, response) X_t ∈ R^d and the BOLD signal Y_t ∈ R^v.
Modelling terminology 2 Prediction vs. inference The goal of prediction is to find a highly accurate encoding or decoding function. The goal of inference is to decide between competing hypotheses about structure-function mappings in the brain. Prediction: predicting a cognitive state using a brain-machine interface; predicting a subject-specific diagnostic status. Inference: comparing a model that links distributed neuronal activity to a cognitive state with a model that does not; weighing the evidence for sparse coding vs. dense coding. Predictive density: p(X_new | Y_new, X, Y) = ∫ p(X_new | Y_new, θ) p(θ | X, Y) dθ. Marginal likelihood: p(X | Y) = ∫ p(X | Y, θ) p(θ) dθ.
Modelling terminology 3 Univariate vs. multivariate A univariate model considers a single voxel at a time (context X_t ∈ R^d, BOLD signal Y_t ∈ R). A multivariate model considers many voxels at once (context X_t ∈ R^d, BOLD signal Y_t ∈ R^v). In the univariate case, the implicit likelihood of the data factorizes over voxels, p(Y_t | X_t) = ∏_{i=1}^{v} p(Y_{t,i} | X_t). Spatial dependencies between voxels are introduced afterwards, through random field theory. This enables multivariate inferences over voxels (i.e., cluster-level or set-level inference). Multivariate models relax the assumption about independence of voxels. They enable inference about distributed responses without requiring focal activations or certain topological response features. They can therefore be more powerful than univariate analyses.
Modelling terminology 4 Regression vs. classification In a regression model, the dependent variable is continuous: independent variables (regressors) → f → dependent variable ∈ R (regressand). In a classification model, the dependent variable is categorical (e.g., binary): independent variables (features) → f → dependent variable ∈ {0, 1, …} (label)
Modelling terminology 5 Goodness of fit vs. complexity A measure of goodness of fit assesses how well the model fits available data. A measure of complexity assesses the amount of structure and the degree of flexibility of a model, including, among other things, its number of parameters. [Figure: data sampled from a smooth truth, fitted with 1, 4, and 10 parameters, illustrating underfitting, an optimal fit, and overfitting.] We typically wish to find the model that generalizes best to new data. This is the model that optimally trades off goodness of fit and complexity. Bishop (2007) PRML
Summary of modelling terminology General Linear Model (GLM): mass-univariate encoding model for regressing context onto brain activity and inferring on topological response features. Dynamic Causal Modelling (DCM): multivariate encoding model for comparing alternative connectivity hypotheses. Classification: based on multivariate decoding models for predicting a categorical context label from brain activity. Multivariate Bayes (MVB): multivariate encoding model for comparing alternative coding hypotheses.
Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches
Constructing a classifier In classification, we aim to predict a target variable X from data Y, h: Y_t → X_t ∈ {1, …, K}. Most classifiers are designed to estimate the unknown probability of an example belonging to a particular class: h(Y_t) = arg max_k p(X_t = k | Y_t, X, Y). Generative classifiers use Bayes' rule to estimate p(X_t | Y_t) ∝ p(Y_t | X_t) p(X_t), e.g., Gaussian naïve Bayes, linear discriminant analysis. Discriminative classifiers estimate p(X_t | Y_t) directly, without Bayes' theorem, e.g., logistic regression, the relevance vector machine. Discriminant classifiers estimate h(Y_t) directly, e.g., the support vector machine, Fisher's linear discriminant.
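The generative route can be sketched directly from Bayes' rule with a Gaussian naïve Bayes classifier. This is a toy stand-in for illustration (all function names are hypothetical, not SPM or any library API):

```python
import numpy as np

def fit_gnb(Y, X):
    """Gaussian naive Bayes: estimate the class priors p(X=k) and
    per-feature Gaussians p(Y_i | X=k) from training data."""
    params = {}
    for k in np.unique(X):
        Yk = Y[X == k]
        params[k] = (len(Yk) / len(Y),         # prior p(X=k)
                     Yk.mean(axis=0),          # per-voxel means
                     Yk.var(axis=0) + 1e-6)    # per-voxel variances
    return params

def predict_gnb(params, y):
    # Bayes' rule: pick k maximizing log p(X=k) + sum_i log p(y_i | X=k).
    def log_post(k):
        prior, mu, var = params[k]
        return np.log(prior) - 0.5 * np.sum(np.log(2 * np.pi * var)
                                            + (y - mu) ** 2 / var)
    return max(params, key=log_post)

rng = np.random.default_rng(1)
Y = np.vstack([rng.normal(0, 1, (50, 4)), rng.normal(2, 1, (50, 4))])
X = np.array([0] * 50 + [1] * 50)
model = fit_gnb(Y, X)
```

The "naïve" part is the factorized likelihood over features, i.e., precisely the conditional-independence assumption a fully multivariate model would relax.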
Support vector machines The support vector machine (SVM) is a discriminant classifier. Training: find a hyperplane with a maximal margin to the nearest examples on either side. Test: assign a new example to the class corresponding to its side of the plane. [Figure: linear SVM with weight vector w and offset b; nonlinear SVM.] SVMs are used in many domains of application. Efficiency: SVMs are fast and easy to use. Performance: SVMs usually perform well compared to other classifiers. Flexibility: the need for vectorial representations of examples is replaced by a similarity measure, defined via a kernel function k(Y_i, Y_j). Vapnik (1999) Springer; Schölkopf et al. (2002) MIT Press
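The maximum-margin idea can be illustrated with subgradient descent on the L2-regularized hinge loss. This is a toy sketch (all names hypothetical, numpy only), not a substitute for a proper SVM library:

```python
import numpy as np

def train_linear_svm(Y, X, lam=0.01, lr=0.1, epochs=200):
    """Linear SVM sketch: subgradient descent on the regularized hinge
    loss lam/2 ||w||^2 + mean(max(0, 1 - X*(Yw + b))).
    Labels X must be +1 or -1."""
    n, d = Y.shape
    w, b = np.zeros(d), 0.0
    for _ in range(epochs):
        margins = X * (Y @ w + b)
        viol = margins < 1                  # examples inside the margin
        grad_w = lam * w - (X[viol, None] * Y[viol]).sum(axis=0) / n
        grad_b = -X[viol].sum() / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Two well-separated classes of simulated "voxel patterns".
rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(-2, 0.5, (20, 2)), rng.normal(2, 0.5, (20, 2))])
X = np.array([-1.0] * 20 + [1.0] * 20)
w, b = train_linear_svm(Y, X)
pred = np.sign(Y @ w + b)
```

Only the margin violators contribute to the gradient, which is the optimization-side reflection of the fact that the solution depends only on the support vectors.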
Generalizability of a classifier Typically, we have many more voxels than observations. This means that there are infinitely many models that enable perfect classification of the available data. But these models might have overfit the data. (Overfitting is usually not an issue in GLM analyses, where the number of regressors is much smaller than the number of observations.) We want to find a classification model h: Y → X that generalizes well to new data. Given some training data, we might consider the probability P(h(Y_test) = X_test | Y_train, X_train). However, this quantity is dependent on the training data. So instead we should consider the generalizability E_training[P(h(Y_test) = X_test | Y_train, X_train)], which we can approximate using cross-validation.
Cross-validation Cross-validation is a resampling procedure that can be used to estimate the generalizability of a classifier. [Figure: 100 examples split into folds; in each fold, one example serves as the test example while the remaining examples are used for training.]
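The leave-one-out variant of the procedure in the figure can be sketched as follows. The nearest-class-mean rule inside the loop is an arbitrary stand-in classifier chosen for brevity, and all names are hypothetical:

```python
import numpy as np

def loo_accuracy(Y, X):
    """Leave-one-out cross-validation: each of the n examples serves as
    the test set exactly once, and the classifier (here a nearest-class-
    mean rule) is retrained on the remaining n-1 examples."""
    n = len(X)
    correct = 0
    for t in range(n):
        train = np.arange(n) != t           # hold out example t
        Ytr, Xtr = Y[train], X[train]
        means = {k: Ytr[Xtr == k].mean(axis=0) for k in np.unique(Xtr)}
        pred = min(means, key=lambda k: np.linalg.norm(Y[t] - means[k]))
        correct += pred == X[t]
    return correct / n

rng = np.random.default_rng(0)
Y = np.vstack([rng.normal(0, 1, (15, 3)), rng.normal(4, 1, (15, 3))])
X = np.array([0] * 15 + [1] * 15)
acc = loo_accuracy(Y, X)
```

Crucially, everything that depends on the data (here the class means) is re-estimated inside each fold; estimating it once on all data would constitute double-dipping.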
Trial-by-trial classification of fMRI data [Figure: analysis pipeline.] 1 Feature extraction: from the fMRI timeseries, extract a trials × voxels data matrix (e.g., voxel values). 2 Feature selection: choose which features enter the classifier. 3 Classification: train (learn a mapping Y_train → θ from the labels X_train) and test (apply the learned mapping to Y_test), yielding an accuracy estimate [% correct].
Target questions in classification studies A Overall classification accuracy [bar plot: accuracy (%) per classification task, e.g., left or right button? truth or lie? healthy or diseased?; chance level 50%]. B Spatial deployment of informative regions. C Temporal evolution of informativeness [plot: accuracy (%) over intra-trial time; accuracy rises above chance before the participant indicates the decision]. D Characterization of distributed activity: inferring a representational space and extrapolation to novel classes. Mitchell et al. (2008) Science; Pereira et al. (2009) NeuroImage; Brodersen et al. (2009) The New Collection
Preprocessing for classification The most principled approach is to deconvolve the BOLD signal using a GLM. [Figure: trial-by-trial design matrix with one regressor per trial and trial phase, plus confounds.] This approach results in one beta image per trial and phase.
Performance evaluation [Figure: trial-wise classification outcomes (correct/incorrect) for trials 1…n within subjects 1…m, sampled from a population; how should these outcomes be combined for inference?] Brodersen et al. (in preparation)
Performance evaluation Single-subject study The most common approach is to assess how likely the obtained number of correctly classified trials could have occurred by chance: p = P(X ≥ k | H_0) = 1 − B(k − 1 | n, π_0), where p is the probability of observing the obtained performance by chance, k the number of correctly classified trials, n the total number of trials, π_0 the probability of getting a single trial right by chance, and B the binomial cumulative distribution function. In publications, this approach is referred to as a binomial test. It is based on the assumption that, under the null hypothesis, the classifier produces random predictions.
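The binomial tail above can be computed exactly; a minimal sketch (hypothetical function name, standard library only):

```python
from math import comb

def binomial_p(k, n, pi0=0.5):
    """p = P(X >= k | H0): the chance of getting at least k of n test
    trials right if the classifier guesses with accuracy pi0."""
    return sum(comb(n, j) * pi0 ** j * (1 - pi0) ** (n - j)
               for j in range(k, n + 1))

# e.g. 60 correctly classified trials out of 100, at a 50% chance level
p = binomial_p(60, 100)
```

For 60/100 correct this gives p ≈ 0.03, i.e., just significant at the conventional 5% level, which is why such studies need many trials.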
Performance evaluation Group study The most common approach is to assess the probability with which the observed subject-wise sample accuracies were sampled from a distribution with a mean equal to chance: t = √m (π̄ − π_0) / σ, p = 1 − t_{m−1}(t), where p is the probability of observing the obtained performance by chance, m the number of subjects, π̄ the sample mean of subject-wise sample accuracies, σ the sample standard deviation of subject-wise sample accuracies, π_0 the probability of getting a single trial right by chance, and t_{m−1} the cumulative Student's t-distribution with m − 1 d.o.f. This approach represents a random-effects analysis of classification outcomes, based on the additional assumption that the mean of sample accuracies is approximately Normal.
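The t-statistic above is a one-sample t-test on the subject-wise accuracies; a minimal sketch (hypothetical function name), with the resulting t compared against a Student's t-distribution with m − 1 degrees of freedom:

```python
import numpy as np

def group_t(accuracies, pi0=0.5):
    """One-sample t-statistic on subject-wise sample accuracies,
    testing whether their mean differs from the chance level pi0."""
    acc = np.asarray(accuracies, dtype=float)
    m = acc.size
    return np.sqrt(m) * (acc.mean() - pi0) / acc.std(ddof=1)

# twelve subjects with accuracies clearly above the 50% chance level;
# the two-sided 5% critical value for 11 d.o.f. is about 2.20
t = group_t([0.61, 0.58, 0.67, 0.55, 0.72, 0.63,
             0.59, 0.66, 0.70, 0.57, 0.62, 0.64])
```

Note that only the m subject-level summaries enter the test; the trial-level uncertainty within each subject is discarded, which is what a mixed-effects analysis would additionally model.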
Spatial deployment of informative voxels Approach 1: consider the entire brain, and find out which voxels are jointly discriminative: based on a classifier with a constraint on sparseness in features (Hampton & O'Doherty 2007; Grosenick et al. 2008, 2009); based on Gaussian processes (Marquand et al. 2010 NeuroImage; Lomakina et al., in preparation). Approach 2: at each voxel, consider a small local environment, and compute a discriminability score: based on a CCA (Nandy & Cordes 2003 Magn Reson Med); based on a classifier, on Euclidean distances, or on Mahalanobis distances (Kriegeskorte et al. 2006, 2007a, 2007b; Serences & Boynton 2007 J Neurosci); based on the mutual information.
Temporal evolution of discriminability Example: decoding which button the subject pressed. [Figure: classification accuracy over intra-trial time in motor cortex and frontopolar cortex, relative to decision and response.] Soon et al. (2008) Nature Neuroscience
Pattern characterization Example: decoding the identity of the person speaking to the subject in the scanner. [Figure: voxel-wise fingerprint plot (one plot per class).] Formisano et al. (2008) Science
Issues to be aware of (as researcher or reviewer) Experimental design: classification induces constraints on the experimental design. When estimating trial-wise beta values, we need longer ITIs (typically 8–15 s). At the same time, we need many trials (typically 100+). Classes should be balanced; if they are imbalanced, we can resample the training set, constrain the classifier, or report the balanced accuracy. Construction of examples: estimation of beta images is the preferred approach; covariates should be included in the trial-by-trial design matrix. Temporal autocorrelation: in trial-by-trial classification, exclude trials around the test trial from the training set. Avoiding double-dipping: any feature selection and tuning of classifier settings should be carried out on the training set only. Performance evaluation: do random-effects or mixed-effects inference, and correct for multiple tests.
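The balanced accuracy mentioned above is simply the mean of the per-class accuracies; a minimal sketch (hypothetical names):

```python
import numpy as np

def balanced_accuracy(true, pred):
    """Mean of the per-class accuracies; unlike the plain accuracy, it
    is not inflated when one class dominates the test set."""
    true, pred = np.asarray(true), np.asarray(pred)
    per_class = [np.mean(pred[true == k] == k) for k in np.unique(true)]
    return float(np.mean(per_class))

# 90 of 100 test trials belong to class 0: a classifier that always
# answers 0 scores 90% plain accuracy but only 50% balanced accuracy.
true = [0] * 90 + [1] * 10
pred = [0] * 100
bacc = balanced_accuracy(true, pred)
```

With balanced classes the two measures coincide; with imbalanced classes the balanced accuracy keeps a degenerate majority-class classifier at chance level.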
Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches
Multivariate Bayes Multivariate analyses in SPM are not framed in terms of classification problems. Instead, SPM brings multivariate analyses into the conventional inference framework of hierarchical models and their inversion. Multivariate Bayes (MVB) can be used to address two questions: (1) Is there a link between X and Y? Using cross-validation (as seen earlier) or using model comparison (new). (2) What is the form of the link between X and Y? Smooth or sparse coding (many voxels vs. few voxels)? Category-specific representations that are functionally selective or functionally segregated?
Conventional inference framework [Figure: graphical models.] Classical encoding model: X as a cause of Y, g: X → Y. Bayesian decoding model: X as a consequence of Y, h: Y → X. Friston et al. (2008) NeuroImage
Lessons from the Neyman-Pearson lemma Is there a link between X and Y? To test for a statistical dependency between a contextual variable X and the BOLD signal Y, we compare H_0: there is no dependency, against H_a: there is some dependency. Which statistical test? 1. Define a test size α (the probability of falsely rejecting H_0, i.e., 1 − specificity). 2. Choose the test with the highest power β (the probability of correctly rejecting H_0, i.e., sensitivity). The Neyman-Pearson lemma: the most powerful test of size α is to reject H_0 when the likelihood ratio Λ exceeds a critical value u, Λ(Y) = p(Y | X) / p(Y) = p(X | Y) / p(X), with u chosen such that P(Λ(Y) ≥ u | H_0) = α. The null distribution of the likelihood ratio, p(Λ(Y) | H_0), can be determined nonparametrically or under parametric assumptions. This lemma underlies both classical statistics and Bayesian statistics (where Λ(Y) is known as a Bayes factor). Neyman & Pearson (1933) Phil Trans Roy Soc London
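The lemma can be illustrated numerically in a toy Gaussian setting (made up for illustration, not the fMRI case): the critical value u is read off the null distribution of the likelihood ratio so that the test has the desired size α.

```python
import numpy as np

rng = np.random.default_rng(0)

def log_lr(y, mu=1.0):
    # log Lambda(y) = log[p(y | Ha) / p(y | H0)] for one Gaussian
    # observation, with H0: y ~ N(0,1) vs. Ha: y ~ N(mu,1).
    return mu * y - mu ** 2 / 2

# choose u so that P(Lambda >= u | H0) = alpha ...
alpha = 0.05
null_lr = log_lr(rng.normal(0.0, 1.0, 100_000))
u = np.quantile(null_lr, 1 - alpha)

# ... then "reject H0 when Lambda(y) >= u" has empirical size alpha,
# and its power is the rejection rate under Ha
size = np.mean(log_lr(rng.normal(0.0, 1.0, 100_000)) >= u)
power = np.mean(log_lr(rng.normal(1.0, 1.0, 100_000)) >= u)
```

Because Λ is monotone in y here, the likelihood-ratio test reduces to a simple threshold on y itself; the lemma guarantees no other size-α test has higher power.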
Lessons from the Neyman-Pearson lemma In summary: 1. Inference about how the brain represents context variables reduces to model comparison. 2. To establish that a link exists between some context X and activity Y, the direction of the mapping is not important. 3. Testing the accuracy of a classifier is not based on Λ and is therefore suboptimal. Neyman & Pearson (1933) Phil Trans Roy Soc London; Kass & Raftery (1995) J Am Stat Assoc; Friston et al. (2009) NeuroImage
Priors help to regularize the inference problem Mapping brain activity onto a context variable is ill-posed: there is an infinite number of equally likely solutions. We therefore require constraints (priors) to estimate the voxel weights β. SPM comes with several alternative coding hypotheses, specified in terms of spatial priors on voxel weights, p(β), after transformations Y → YU and β → βU. Null: U = ∅. Spatial vectors: U = I. Smooth vectors: U(x_i, x_j) ∝ exp(−‖x_i − x_j‖² / (2σ²)). Singular vectors: UDV^T = RY^T. Support vectors: U = RY^T. Friston et al. (2008) NeuroImage
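The "smooth vectors" hypothesis can be written down directly as a covariance over voxel weights. This is an illustrative sketch with hypothetical names, not SPM's actual implementation:

```python
import numpy as np

def smooth_prior_cov(coords, sigma=2.0):
    """Covariance matrix implementing the 'smooth vectors' hypothesis:
    U(x_i, x_j) proportional to exp(-||x_i - x_j||^2 / (2 sigma^2)),
    so that nearby voxels receive correlated weights."""
    coords = np.asarray(coords, dtype=float)
    d2 = ((coords[:, None, :] - coords[None, :, :]) ** 2).sum(axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

coords = [[0, 0, 0], [2, 0, 0], [10, 0, 0]]   # three voxel positions (mm)
U = smooth_prior_cov(coords)
```

The hypotheses then differ only in this prior covariance: the identity for independent "spatial vectors", a distance-dependent kernel for smooth coding, and data-derived patterns for the singular- and support-vector variants.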
Multivariate Bayes: example MVB can be illustrated using SPM's attention-to-motion example dataset (Büchel & Friston 1999 Cerebral Cortex; Friston et al. 2008 NeuroImage). This dataset is based on a simple block design. Each block is a combination of some of the following three factors: photic (there is some visual stimulus), motion (there is motion), attention (subjects are paying attention). We form a design matrix by convolving box-car functions with a canonical haemodynamic response function. [Figure: design matrix with regressors photic, motion, attention, constant; blocks of 10 scans.]
Multivariate Bayes: example After having specified and estimated a design, we use the Results button. Next, we select the contrast of interest.
Multivariate Bayes: example We place the cursor onto the region of interest.
Multivariate Bayes: example Multivariate Bayes can be invoked from within the Multivariate section. We specify the region of interest as a sphere around the cursor. We examine the sparse coding hypothesis.
Multivariate Bayes: example To display results, we use the button for Bayesian model selection (BMS).
Multivariate Bayes: example [Figure: observed vs. predicted responses (motion).] MVB-based predictions closely match the observed responses. But crucially, they don't perfectly match them; a perfect match would indicate overfitting.
Multivariate Bayes: example The weights attributed to each voxel in the sphere are sparse and multimodal. This suggests sparse coding. [Figure: voxel weights; log evidence = 3.]
Multivariate Bayes: example MVB may outperform conventional point classifiers when using a more appropriate coding hypothesis.
Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches
Recall: challenges for multivariate approaches ❶ Variable selection Given tens of thousands of voxels and very few trials of data, how do we find those brain regions that are jointly informative of some variable of interest? ❷ Neurobiological interpretability How can we obtain results that are mechanistically interpretable in the context of the underlying neurobiological system?
Identification / inferring a representational space Approach: 1. estimation of an encoding model; 2. nearest-neighbour classification or voting. Mitchell et al. (2008) Science
Reconstruction / optimal decoding Approach: 1. estimation of an encoding model; 2. model inversion. Paninski et al. (2007) Progr Brain Res; Pillow et al. (2008) Nature; Miyawaki et al. (2009) Neuron
Model-based classification by generative embedding [Figure: pipeline.] Step 1: model inversion (measurements from an individual subject → subject-specific inverted generative model). Step 2: kernel construction (subject representation in the generative score space). Step 3: classification (separating hyperplane fitted to discriminate between groups). Step 4: interpretation (jointly discriminative connection strengths, e.g., R.HG → L.HG, L.MGB → L.MGB). Brodersen, Haiss, Ong, Jung, Tittgemeyer, Buhmann, Weber, Stephan (2011) NeuroImage; Brodersen, Schofield, Leff, Ong, Lomakina, Buhmann, Stephan (2011) PLoS Comput Biol
Summary 1. Foundations. Multivariate methods can uncover and exploit information jointly encoded by multiple voxels. Remember the distinction between prediction and inference, encoding and decoding, univariate and multivariate, and classification and regression. 2. Classification. Classification studies typically aim to examine (i) overall discriminability, (ii) the spatial deployment of informative regions, (iii) the temporal evolution of discriminative activity, and (iv) the nature of the distributed activity. 3. Multivariate Bayes. Multivariate Bayes offers an alternative scheme that maps multivariate patterns of activity onto brain states within the conventional statistical framework of hierarchical models and their inversion. 4. Model-based approaches. Model-based approaches aim to augment previous methods by neurobiological interpretability and are likely to become very fruitful in the future.