Multivariate analyses & decoding


Multivariate analyses & decoding Kay Henning Brodersen Computational Neuroeconomics Group Department of Economics, University of Zurich Machine Learning and Pattern Recognition Group Department of Computer Science, ETH Zurich http://people.inf.ethz.ch/bkay

Why multivariate? Multivariate approaches simultaneously consider brain activity in many locations. Haxby et al. (2001) Science; Lautrup et al. (1994) Supercomputing in Brain Research

Why multivariate? Multivariate approaches can utilize information jointly encoded in multiple voxels. This is because multivariate distance measures can account for correlations between voxels.

Why multivariate? Multivariate approaches can exploit a sampling bias in voxelized images to reveal interesting activity on a subvoxel scale. Boynton (2005) Nature Neuroscience

Why multivariate? Multivariate approaches can utilize hidden quantities such as coupling strengths. [Figure: driving inputs u_1, u_2, neural population activity z_1, z_2, z_3, and the resulting fMRI signal change (%).] Neural state equation: dz/dt = (A + Σ_i u_i B^(i) + Σ_j z_j D^(j)) z + C u. Stephan et al. (2008) NeuroImage
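The neural state equation above can be simulated directly. The following is a minimal sketch in Python/NumPy of the bilinear case (without the nonlinear D terms), with toy A, B, C matrices and a boxcar input chosen purely for illustration, and simple forward Euler integration rather than the integrators used in SPM:

```python
import numpy as np

# Toy 2-region system, dz/dt = (A + u(t) B) z + C u(t)
# (bilinear special case of the DCM state equation; all values illustrative)
A = np.array([[-0.5,  0.0],
              [ 0.3, -0.5]])     # fixed connectivity: self-decay plus a forward connection
B = np.array([[0.0, 0.0],
              [0.4, 0.0]])       # modulatory input strengthens the forward connection
C = np.array([1.0, 0.0])         # driving input enters region 1 only

dt, T = 0.01, 10.0
t = np.arange(0.0, T, dt)
u = ((t % 4.0) < 2.0).astype(float)   # boxcar input: 2 s on, 2 s off

z = np.zeros((t.size, 2))
for i in range(1, t.size):            # forward Euler integration
    dz = (A + u[i - 1] * B) @ z[i - 1] + C * u[i - 1]
    z[i] = z[i - 1] + dt * dz
```

With negative self-connections the system is stable, so the hidden states rise during input blocks and decay back toward zero between them.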

Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches


Modelling terminology 1 Encoding vs. decoding An encoding model (or generative model) relates context to brain activity. A decoding model (or recognition model) relates brain activity to context (e.g., condition, stimulus, response). Encoding model g: X_t → Y_t; decoding model h: Y_t → X_t, with context X_t ∈ R^d and BOLD signal Y_t ∈ R^v.

Modelling terminology 2 Prediction vs. inference The goal of prediction is to find a highly accurate encoding or decoding function. The goal of inference is to decide between competing hypotheses about structure-function mappings in the brain. Examples of prediction: predicting a cognitive state using a brain-machine interface; predicting a subject-specific diagnostic status. Examples of inference: comparing a model that links distributed neuronal activity to a cognitive state with a model that does not; weighing the evidence for sparse coding vs. dense coding. Predictive density: p(X_new | Y_new, X, Y) = ∫ p(X_new | Y_new, θ) p(θ | X, Y) dθ. Marginal likelihood: p(X | Y) = ∫ p(X | Y, θ) p(θ) dθ.

Modelling terminology 3 Univariate vs. multivariate A univariate model considers a single voxel at a time (context X_t ∈ R^d, BOLD signal Y_t ∈ R). A multivariate model considers many voxels at once (context X_t ∈ R^d, BOLD signal Y_t ∈ R^v). In the mass-univariate case, the implicit likelihood of the data factorizes over voxels, p(Y_t | X_t) = Π_{i=1}^{v} p(Y_{t,i} | X_t). Spatial dependencies between voxels are introduced afterwards, through random field theory. This enables multivariate inferences over voxels (i.e., cluster-level or set-level inference). Multivariate models relax the assumption of independence between voxels. They enable inference about distributed responses without requiring focal activations or particular topological response features. They can therefore be more powerful than univariate analyses.

Modelling terminology 4 Regression vs. classification In a regression model, the dependent variable is continuous: independent variables (regressors) → f → dependent variable ∈ R (regressand). In a classification model, the dependent variable is categorical (e.g., binary): independent variables (features) → f → dependent variable ∈ {0, 1, …} (label).

Modelling terminology 5 Goodness of fit vs. complexity A measure of goodness of fit assesses how well the model fits the available data. A measure of complexity assesses the amount of structure and the degree of flexibility of a model, including, among other things, its number of parameters. [Figure: fits of models with 1 parameter (underfitting), 4 parameters (optimal), and 10 parameters (overfitting) to data generated from a true function.] We typically wish to find the model that generalizes best to new data. This is the model that optimally trades off goodness of fit and complexity. Bishop (2007) PRML
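The trade-off on this slide can be reproduced numerically. Here is a small sketch in the spirit of Bishop's polynomial example; the sinusoidal ground truth, noise level, and sample size are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)        # ground truth (illustrative)

x_train = np.linspace(0.0, 1.0, 12)
y_train = f(x_train) + rng.normal(0.0, 0.2, x_train.size)  # noisy training data
x_test = np.linspace(0.0, 1.0, 200)

errors = {}
for n_params in (1, 4, 10):                 # 1, 4, 10 parameters, as on the slide
    deg = n_params - 1                      # a degree-d polynomial has d + 1 parameters
    coefs = np.polyfit(x_train, y_train, deg)
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - f(x_test)) ** 2)
    errors[n_params] = (train_mse, test_mse)
```

Training error always falls as flexibility grows, but test error is minimized by the intermediate model: goodness of fit alone is a misleading model-selection criterion.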

Summary of modelling terminology General Linear Model (GLM): mass-univariate encoding model for regressing brain activity onto context and inferring on topological response features. Dynamic Causal Modelling (DCM): multivariate encoding model for comparing alternative connectivity hypotheses. Classification: based on multivariate decoding models for predicting a categorical context label from brain activity. Multivariate Bayes (MVB): multivariate encoding model for comparing alternative coding hypotheses.

Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches

Constructing a classifier In classification, we aim to predict a target variable X from data Y, h: Y_t → X_t ∈ {1, …, K}. Most classifiers are designed to estimate the unknown probability of an example belonging to a particular class: h(Y_t) = arg max_k p(X_t = k | Y_t, X, Y). Generative classifiers use Bayes' rule to estimate p(X_t | Y_t) ∝ p(Y_t | X_t) p(X_t), e.g., Gaussian Naïve Bayes, Linear Discriminant Analysis. Discriminative classifiers estimate p(X_t | Y_t) directly, without Bayes' theorem, e.g., logistic regression, the Relevance Vector Machine. Discriminant classifiers estimate h(Y_t) directly, e.g., the Support Vector Machine, Fisher's Linear Discriminant.

Support vector machines The support vector machine (SVM) is a discriminant classifier. Training: find a hyperplane with a maximal margin to the nearest examples on either side. Test: assign a new example to the class corresponding to its side of the plane. [Figure: linear SVM with weight vector w and offset b; nonlinear SVM.] SVMs are used in many domains of application. Efficiency: SVMs are fast and easy to use. Performance: SVMs usually perform well compared to other classifiers. Flexibility: the need for vectorial representations of examples is replaced by a similarity measure, defined via a kernel function k(Y_i, Y_j). Vapnik (1999) Springer; Schölkopf et al. (2002) MIT Press
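As a concrete illustration of the training and test steps, here is a linear SVM fitted to toy two-class data with scikit-learn; the data and hyperparameters are invented for the example:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# two Gaussian classes in 2D, 40 examples each (illustrative data)
X = np.vstack([rng.normal(-1.0, 1.0, (40, 2)),
               rng.normal( 1.0, 1.0, (40, 2))])
y = np.repeat([0, 1], 40)

# training: fit the maximal-margin hyperplane
clf = SVC(kernel="linear", C=1.0).fit(X, y)

# only the examples near the margin (the support vectors) define the solution
n_sv = clf.support_vectors_.shape[0]

# test: classify by which side of the hyperplane a point falls on
train_acc = clf.score(X, y)
```

Swapping in `kernel="rbf"` gives a nonlinear SVM via the kernel trick, without ever representing examples explicitly in the induced feature space.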

Generalizability of a classifier Typically, we have many more voxels than observations. This means that there are infinitely many models that enable perfect classification of the available data. But these models might have overfit the data. (Overfitting is usually not an issue in GLM analyses, where the number of regressors is much smaller than the number of observations.) We want to find a classification model h: Y → X that generalizes well to new data. Given some training data, we might consider the probability P(h(Y_test) = X_test | Y_train, X_train). However, this quantity depends on the particular training data. So instead we should consider the generalizability E_training[P(h(Y_test) = X_test | Y_train, X_train)], which we can approximate using cross-validation.

Cross-validation Cross-validation is a resampling procedure that can be used to estimate the generalizability of a classifier. [Figure: leave-one-out cross-validation over 100 examples; in each of the 100 folds, a different example serves as the test example and the remaining 99 are used for training.]
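The leave-one-out scheme in the figure can be sketched in a few lines. The toy data and the simple nearest-class-mean classifier below are illustrative stand-ins for real voxel features and a real classifier:

```python
import numpy as np

rng = np.random.default_rng(0)
# toy data: two classes of 50 examples with 5 features and shifted means
X = np.vstack([rng.normal(0.0, 1.0, (50, 5)),
               rng.normal(1.0, 1.0, (50, 5))])
y = np.repeat([0, 1], 50)

correct = 0
for i in range(len(y)):                       # one fold per example
    train = np.ones(len(y), dtype=bool)
    train[i] = False                          # hold out example i for testing
    mu0 = X[train & (y == 0)].mean(axis=0)    # class means from the training folds only
    mu1 = X[train & (y == 1)].mean(axis=0)
    pred = int(np.linalg.norm(X[i] - mu1) < np.linalg.norm(X[i] - mu0))
    correct += (pred == y[i])

accuracy = correct / len(y)                   # cross-validated accuracy estimate
```

Crucially, the classifier is re-estimated inside every fold from the training examples only, so the held-out example never influences its own prediction.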

Trial-by-trial classification of fMRI data (1) Feature extraction from the fMRI timeseries, e.g., voxels, yielding a trials-by-voxels data matrix. (2) Feature selection. (3) Classification. Train: learn a mapping Y_train → θ → X_train from the labelled training trials (e.g., A A B A B A A B A A A B A). Test: apply the learned mapping to Y_test, yielding an accuracy estimate [% correct].

Target questions in classification studies A Overall classification accuracy: is accuracy above chance (e.g., 50%) for a given classification task (left or right button? truth or lie? healthy or diseased?). B Spatial deployment of informative regions. C Temporal evolution of informativeness: e.g., accuracy rises above chance during the trial, before the participant indicates the decision. D Characterization of distributed activity: inferring a representational space and extrapolating to novel classes. Mitchell et al. (2008) Science; Pereira et al. (2009) NeuroImage; Brodersen et al. (2009) The New Collection

Preprocessing for classification The most principled approach is to deconvolve the BOLD signal using a GLM with a trial-by-trial design matrix (one column per trial and trial phase, plus confounds). This approach results in one beta image per trial and phase.

Performance evaluation [Figure: trial-wise classification outcomes (correct/incorrect) for subjects 1, 2, 3, …, m, m+1, from which we wish to infer on the population.] Brodersen et al. (in preparation)

Performance evaluation Single-subject study The most common approach is to assess how likely it is that the obtained number of correctly classified trials could have occurred by chance: p = P(X ≥ k | H0) = 1 − B(k − 1; n, π0), where p is the probability of observing the obtained performance by chance, k is the number of correctly classified trials, n is the total number of trials, π0 is the probability of getting a single trial right by chance, and B is the binomial cumulative distribution function. In publications, this approach is referred to as a binomial test. It is based on the assumption that, under the null hypothesis, the classifier produces random predictions.
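This binomial test is a one-liner via the binomial survival function; a sketch with made-up numbers (n = 100 trials, k = 62 correct, chance level π0 = 0.5):

```python
from scipy.stats import binom

n = 100      # total number of trials
k = 62       # correctly classified trials
pi0 = 0.5    # probability of getting a single trial right by chance

# p = P(X >= k | H0) = 1 - B(k - 1; n, pi0), i.e. the binomial survival function
p_value = binom.sf(k - 1, n, pi0)
```

Here 62/100 correct yields p ≈ 0.01, whereas an accuracy near chance would give a large p value, reflecting the assumption that under H0 the classifier guesses each trial with probability π0.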

Performance evaluation Group study The most common approach is to assess the probability with which the observed subject-wise sample accuracies were sampled from a distribution with a mean equal to chance: t = √m (π̄ − π0) / σ, p = 1 − T_{m−1}(t), where p is the probability of observing the obtained performance by chance, m is the number of subjects, π̄ is the sample mean of the subject-wise sample accuracies, σ is their sample standard deviation, π0 is the probability of getting a single trial right by chance, and T_{m−1} is the cumulative Student's t-distribution with m − 1 d.o.f. This approach represents a random-effects analysis of classification outcomes, based on the additional assumption that the mean of the sample accuracies is approximately Normal.
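The group-level t-test above takes a few lines; the subject-wise accuracies below are invented for illustration:

```python
import numpy as np
from scipy import stats

# sample accuracies for m = 6 subjects (illustrative values), chance level 0.5
acc = np.array([0.61, 0.55, 0.72, 0.58, 0.66, 0.63])
pi0 = 0.5

m = acc.size
t = np.sqrt(m) * (acc.mean() - pi0) / acc.std(ddof=1)  # t statistic from the slide
p_one_sided = stats.t.sf(t, df=m - 1)                  # P(T >= t) under H0

# equivalently, via scipy's one-sample t-test (two-sided p, halved for one side)
t2, p_two = stats.ttest_1samp(acc, pi0)
```

Because each subject contributes a single summary statistic (the sample accuracy), this is a random-effects analysis: between-subject variability enters through the sample standard deviation.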

Spatial deployment of informative voxels Approach 1: consider the entire brain, and find out which voxels are jointly discriminative. Based on a classifier with a sparseness constraint on the features: Hampton & O'Doherty (2007); Grosenick et al. (2008, 2009). Based on Gaussian processes: Marquand et al. (2010) NeuroImage; Lomakina et al. (in preparation). Approach 2: at each voxel, consider a small local environment, and compute a discriminability score: based on CCA (Nandy & Cordes (2003) Magn. Reson. Med.), based on a classifier, based on Euclidean distances, based on Mahalanobis distances (Kriegeskorte et al. (2006, 2007a, 2007b); Serences & Boynton (2007) J Neuroscience), or based on the mutual information.

Temporal evolution of discriminability Example: decoding which button the subject pressed. [Figure: classification accuracy over intra-trial time in motor cortex and frontopolar cortex, relative to decision and response.] Soon et al. (2008) Nature Neuroscience

Pattern characterization Example: decoding the identity of the person speaking to the subject in the scanner. [Figure: voxel-wise fingerprint plots, one plot per class.] Formisano et al. (2008) Science

Issues to be aware of (as researcher or reviewer) Classification induces constraints on the experimental design. When estimating trial-wise beta values, we need longer ITIs (typically 8–15 s). At the same time, we need many trials (typically 100+). Classes should be balanced. If they are imbalanced, we can resample the training set, constrain the classifier, or report the balanced accuracy. Construction of examples: estimation of beta images is the preferred approach; covariates should be included in the trial-by-trial design matrix. Temporal autocorrelation: in trial-by-trial classification, exclude trials around the test trial from the training set. Avoiding double-dipping: any feature selection and tuning of classifier settings should be carried out on the training set only. Performance evaluation: do random-effects or mixed-effects inference, and correct for multiple tests.
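The balanced accuracy mentioned above is simply the mean of the per-class accuracies. A small sketch with a deliberately imbalanced toy example shows why it matters:

```python
import numpy as np

def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class accuracies; insensitive to class imbalance."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    per_class = [np.mean(y_pred[y_true == c] == c) for c in np.unique(y_true)]
    return float(np.mean(per_class))

# imbalanced set: 90 examples of class 0, 10 of class 1;
# a degenerate classifier that always predicts class 0
y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.zeros(100, dtype=int)

plain_acc = float(np.mean(y_pred == y_true))       # looks impressive: 0.90
bal_acc = balanced_accuracy(y_true, y_pred)        # reveals chance-level: 0.50
```

Plain accuracy rewards the classifier for exploiting the imbalance, while the balanced accuracy correctly reports chance-level performance.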

Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches

Multivariate Bayes Multivariate analyses in SPM are not framed in terms of classification problems. Instead, SPM brings multivariate analyses into the conventional inference framework of hierarchical models and their inversion. Multivariate Bayes (MVB) can be used to address two questions. (1) Is there a link between X and Y? Using cross-validation (as seen earlier) or using model comparison (new). (2) What is the form of the link between X and Y? Smooth or sparse coding (many voxels vs. few voxels)? Category-specific representations that are functionally selective or functionally segregated?

Conventional inference framework Classical encoding model: X as a cause of Y, g(·): X → Y. Bayesian decoding model: X as a consequence of Y, h(·): Y → X. Friston et al. (2008) NeuroImage

Lessons from the Neyman-Pearson lemma Is there a link between X and Y? To test for a statistical dependency between a contextual variable X and the BOLD signal Y, we compare H0: there is no dependency, against Ha: there is some dependency. Which statistical test? 1. Define a test size α (the probability of falsely rejecting H0; 1 − α is the specificity). 2. Choose the test with the highest power β (the probability of correctly rejecting H0, i.e., the sensitivity). The Neyman-Pearson lemma: the most powerful test of size α is to reject H0 when the likelihood ratio Λ exceeds a critical value u, Λ(Y) = p(Y | X) / p(Y) = p(X | Y) / p(X), with u chosen such that P(Λ(Y) ≥ u | H0) = α. The null distribution of the likelihood ratio, p(Λ(Y) | H0), can be determined nonparametrically or under parametric assumptions. This lemma underlies both classical statistics and Bayesian statistics (where Λ(Y) is known as a Bayes factor). Neyman & Pearson (1933) Phil Trans Roy Soc London
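The lemma can be illustrated by simulation. The Gaussian H0/Ha densities below are an illustrative choice, not part of the slides; the critical value u is calibrated on H0 samples so that the test has size α, and the power is then the rejection rate under Ha:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 100_000
y_h0 = rng.normal(0.0, 1.0, n)   # data simulated under H0: N(0, 1)
y_ha = rng.normal(1.0, 1.0, n)   # data simulated under Ha: N(1, 1)

def log_lr(y):
    # log likelihood ratio log p(y | Ha) / p(y | H0); here it reduces to y - 0.5
    return norm.logpdf(y, loc=1.0) - norm.logpdf(y, loc=0.0)

alpha = 0.05
u = np.quantile(log_lr(y_h0), 1.0 - alpha)  # critical value: P(LR >= u | H0) = alpha
power = np.mean(log_lr(y_ha) > u)           # rejection rate under Ha
```

By the lemma, no other test of size α on this pair of hypotheses can exceed this power (here roughly 0.26, matching the analytic value 1 − Φ(z_{0.95} − 1)).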

Lessons from the Neyman-Pearson lemma In summary: 1. Inference about how the brain represents context variables reduces to model comparison. 2. To establish that a link exists between some context X and activity Y, the direction of the mapping is not important. 3. Testing the accuracy of a classifier is not based on Λ and is therefore suboptimal. Neyman & Pearson (1933) Phil Trans Roy Soc London; Kass & Raftery (1995) J Am Stat Assoc; Friston et al. (2009) NeuroImage

Priors help to regularize the inference problem Mapping brain activity onto a context variable is ill-posed: there is an infinite number of equally likely solutions. We therefore require constraints (priors) to estimate the voxel weights β. SPM comes with several alternative coding hypotheses, specified in terms of spatial priors on the voxel weights, p(β), after the transformations Y → YU and β → βU. Null: U = 0. Spatial vectors: U = I. Smooth vectors: U(x_i, x_j) ∝ exp(−(x_i − x_j)² / (2σ²)). Singular vectors: UDVᵀ = RYᵀ. Support vectors: U = RYᵀ. Friston et al. (2008) NeuroImage

Multivariate Bayes: example MVB can be illustrated using SPM's attention-to-motion example dataset (Buechel & Friston 1999 Cerebral Cortex; Friston et al. 2008 NeuroImage). This dataset is based on a simple block design. Each block is a combination of some of the following three factors: photic (there is some visual stimulus), motion (there is motion), attention (subjects are paying attention). We form a design matrix (columns: photic, motion, attention, constant) by convolving box-car functions with a canonical haemodynamic response function, with blocks of 10 scans.

Multivariate Bayes: example After having specified and estimated a design, we use the Results button. Next, we select the contrast of interest.

Multivariate Bayes: example We place the cursor onto the region of interest.

Multivariate Bayes: example Multivariate Bayes can be invoked from within the Multivariate section. We specify the region of interest as a sphere around the cursor. We examine the sparse coding hypothesis.

Multivariate Bayes: example To display results, we use the button for Bayesian model selection (BMS).

Multivariate Bayes: example MVB-based predictions of the motion regressor closely match the observed responses. But crucially, they do not match them perfectly; a perfect match would indicate overfitting.

Multivariate Bayes: example The weights attributed to each voxel in the sphere are sparse and multimodal. This suggests sparse coding. log evidence = 3

Multivariate Bayes: example MVB may outperform conventional point classifiers when using a more appropriate coding hypothesis.

Outline 1 Foundations 2 Classification 3 Multivariate Bayes 4 Further model-based approaches

Recall: challenges for multivariate approaches ❶ Variable selection Given tens of thousands of voxels and very few trials of data, how do we find those brain regions that are jointly informative of some variable of interest? ❷ Neurobiological interpretability How can we obtain results that are mechanistically interpretable in the context of the underlying neurobiological system?

Identification / inferring a representational space Approach: 1. estimation of an encoding model; 2. nearest-neighbour classification or voting. Mitchell et al. (2008) Science

Reconstruction / optimal decoding Approach: 1. estimation of an encoding model; 2. model inversion. Paninski et al. (2007) Progr Brain Res; Pillow et al. (2008) Nature; Miyawaki et al. (2009) Neuron

Model-based classification by generative embedding Step 1: model inversion (measurements from an individual subject → subject-specific inverted generative model). Step 2: kernel construction (subject representation in the generative score space). Step 3: classification (separating hyperplane fitted to discriminate between groups). Step 4: interpretation (jointly discriminative connection strengths, e.g., L.MGB → L.MGB and R.HG → L.HG). Brodersen, Haiss, Ong, Jung, Tittgemeyer, Buhmann, Weber, Stephan (2011) NeuroImage; Brodersen, Schofield, Leff, Ong, Lomakina, Buhmann, Stephan (2011) PLoS Comput Biol

Summary 1. Foundations. Multivariate methods can uncover and exploit information jointly encoded by multiple voxels. Remember the distinctions between prediction and inference, encoding and decoding, univariate and multivariate, and classification and regression. 2. Classification. Classification studies typically aim to examine (i) overall discriminability, (ii) the spatial deployment of informative regions, (iii) the temporal evolution of discriminative activity, and (iv) the nature of the distributed activity. 3. Multivariate Bayes. Multivariate Bayes offers an alternative scheme that maps multivariate patterns of activity onto brain states within the conventional statistical framework of hierarchical models and their inversion. 4. Model-based approaches. Model-based approaches aim to augment the previous methods with neurobiological interpretability and are likely to become very fruitful in the future.