Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant


Overview of lecture

1. Introduction
2. Univariate analysis
3. Unsupervised multivariate analysis
   - Principal components analysis (PCA)
   - Interpreting scores and loadings plots
   - Statistical tests on PCA scores data
4. Supervised multivariate analysis
   - Partial least squares discriminant analysis (PLS-DA)
   - Partial least squares regression (PLS-R)
5. Data standards, databases, and NBAF-B data analysis workflow

1. Introduction

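Output from spectral processing (Jon & Ulf)

- NMR: X matrix of NMR signal intensities, with m samples (rows: sample 1, sample 2, sample 3, ..., m) and n bins (columns: bin 1, bin 2, bin 3, ..., n)
- MS: X matrix of MS signal intensities, with m samples (rows: sample 1, sample 2, sample 3, ..., m) and n peaks (columns: peak 1, peak 2, peak 3, ..., n)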

Output from spectral processing: X and Y matrices

X matrix of signal intensities (one row per sample label, one column per peak):

peak 1   peak 2   peak 3
5.0      16.6     103.1
8.4      18.2     69.9
9.2      14.4     91.3
1.3      10.0     98.5
4.9      25.9     48.1
7.9      21.3     59.7

EITHER Y matrix = treatment group labels = discrete variable
OR Y matrix = separate non-metabolic measurement for each sample = continuous variable

2. Univariate statistical analysis

Univariate statistical analysis

- X matrix of signal intensities (one row per sample label, one column per peak)
- Y matrix = treatment group labels = discrete variable
- Each peak is tested separately: t-test or ANOVA, with false discovery rate (FDR) correction

But what p-value is significant?

Univariate testing of each peak is possible, but it must be done with caution!

Typically, if we conduct a single univariate statistical test, then p<0.05 is considered a significant result. But this threshold carries a false positive rate of 5%: when there is no real difference, 1 in every 20 tests will still give a significant result by chance.

Imagine our dataset contains 1000 metabolites, so we conduct 1000 univariate tests. If p<0.05 is taken as significant and none of the metabolites truly differ, we expect to incorrectly report about 50 metabolites as significantly different. Unacceptable error rate!

Correction for multiple testing
- Adjust for multiple testing using the false discovery rate (FDR) of Benjamini and Hochberg, Journal of the Royal Statistical Society, Series B 57: 289-300 (1995).
- FDR controls the expected proportion of incorrectly rejected null hypotheses (type I errors) among all rejected hypotheses.
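The Benjamini-Hochberg adjustment can be sketched in a few lines. This is a minimal illustration, not the NBAF-B implementation: the function name `benjamini_hochberg` and the five p-values are hypothetical, chosen only to show how the adjustment changes which peaks are called significant.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from five per-peak t-tests (not data from the lecture).
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)

# Peaks whose adjusted p-value stays at or below a 5% false discovery rate.
significant = [q <= 0.05 for q in q_values]
```

With these made-up values, four peaks pass the raw p<0.05 threshold but only the first two remain significant after the adjustment.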

Typical output for NBAF-B MS dataset

Multivariate statistical analysis

- X matrix of signal intensities (one row per sample label, one column per peak)
- Y matrix = treatment group labels = discrete variable
- Analyse the data in its entirety, using unsupervised and supervised analyses

3. Unsupervised multivariate statistical analysis

- What is principal components analysis (PCA)?
- Scores plots and their interpretation
- Loadings plots and their interpretation
- Statistical tests on scores data

What is unsupervised multivariate statistics?

- Multivariate statistical analysis deals with large numbers of variables (e.g. metabolites) simultaneously.
- Unsupervised means that the analysis algorithm has no knowledge of the identities of the samples; the algorithm looks at the innate variation in the dataset.
- Many unsupervised methods exist; we focus on principal components analysis.
- PCA is widely used in omics analyses, including metabolomics, proteomics and transcriptomics.

Multivariate data and PCA

- It is common to find correlated variables in multivariate data, which means there is redundancy in the information provided by these variables.
- PCA exploits this redundancy, enabling us to:
  - pick out patterns (relationships) in the variables
  - reduce the dimensionality of a data set without a significant loss of information
- PCA is a projection technique.

Concept behind principal components analysis

- Consider 50 different fish.
- For each fish, measure length and breadth.

Principal components analysis (continued)

Plotting length vs. breadth shows a clear relationship between these two variables. Multivariate!

Fish   Length   Breadth
1      105      93
2      82       73
3      121      111
:      :        :
50     95       101

[Figure: scatter plot of breadth (y-axis) against length (x-axis)]

Principal components analysis (continued)

Create a new axis, PC1 (principal component 1), that accounts for the largest proportion of the data's variance:

PC1 = p1 × length + p2 × breadth

For each data point, project the two original variables (length, breadth) onto the one new variable (PC1).

[Figure: scatter plot of breadth against length with the PC1 axis drawn through the points]

Principal components analysis (continued)

Projecting onto PC1 gives a simpler dataset, with one variable per fish instead of two:

Fish   PC1
1      8
2      4
3      12
:      :
50     6
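As a rough sketch of how the loadings p1 and p2 and the PC1 scores could be computed, the block below uses only the four fish rows that are visible on the slide (the full dataset has 50), so its scores are not expected to match the illustrative PC1 values in the table. It relies on the closed-form eigenvalue of a symmetric 2x2 covariance matrix; all variable names are illustrative.

```python
import math

# The four fish rows visible on the slide (length, breadth); the full set has 50.
fish = [(105, 93), (82, 73), (121, 111), (95, 101)]
n = len(fish)

# Mean-centre each variable.
mean_l = sum(l for l, _ in fish) / n
mean_b = sum(b for _, b in fish) / n
centred = [(l - mean_l, b - mean_b) for l, b in fish]

# Sample covariance matrix [[s_ll, s_lb], [s_lb, s_bb]].
s_ll = sum(l * l for l, _ in centred) / (n - 1)
s_bb = sum(b * b for _, b in centred) / (n - 1)
s_lb = sum(l * b for l, b in centred) / (n - 1)

# Largest eigenvalue of a symmetric 2x2 matrix (closed form).
half_trace = (s_ll + s_bb) / 2
eig1 = half_trace + math.sqrt(((s_ll - s_bb) / 2) ** 2 + s_lb ** 2)

# Matching eigenvector, normalised to unit length: the PC1 loadings (p1, p2).
# This form assumes the two variables are correlated (s_lb != 0).
v = (s_lb, eig1 - s_ll)
norm = math.hypot(*v)
p1, p2 = v[0] / norm, v[1] / norm

# PC1 score of each fish = p1 * length + p2 * breadth, on the centred data.
scores = [p1 * l + p2 * b for l, b in centred]

# Fraction of the total variance captured by PC1.
explained = eig1 / (s_ll + s_bb)
```

Because length and breadth are strongly correlated in these rows, `explained` comes out above 0.9, which is the sense in which one new variable can replace two with little loss of information.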

PCA step by step: from dataset to variable space

- One sample (i) is one point in variable space (axes: var. 1, var. 2, var. 3).
- The dataset (many samples) yields a swarm of points in "variable space".

PCA step by step: mean centering

Move the centre of the swarm of points to the origin of variable space (0,0,0).

[Figure: the swarm of points in variable space (var. 1, var. 2, var. 3) before centering, at the original mean, and after centering, at the new mean]

PCA step by step: PC1

The first principal component (PC1) is set to describe the largest variation in the data, which is the same as the direction in which the points spread most in the variable space. The score value for point i is the signed distance from the origin to the projection of the point onto the 1st component. PC1 is the first variable in a new coordinate system that describes the variation in the data.

[Figure: PC1 axis through the swarm of points in variable space, with the PC1 score of point i marked]

PCA step by step: PC2

The second principal component (PC2) is set to describe the largest remaining variation in the data, in a direction perpendicular (orthogonal) to PC1. The score value for point i is the signed distance from the origin to the projection of the point onto the 2nd component.

[Figure: PC1 and PC2 axes through the swarm of points in variable space, with the PC2 score of point i marked]
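The step-by-step description can be written compactly. The following is a sketch in standard PCA notation; the symbols $x_i$ (sample i as a vector in variable space), $\bar{x}$ (mean of the m samples), $p_1, p_2$ (unit-length directions of PC1 and PC2) and $t_{i1}, t_{i2}$ (scores of sample i) are introduced here for illustration and do not appear on the slides.

```latex
\begin{aligned}
\tilde{x}_i &= x_i - \bar{x}, \qquad \bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i
  && \text{(mean centering)} \\
p_1 &= \arg\max_{\|p\|=1} \sum_{i=1}^{m} \left(\tilde{x}_i^{\top} p\right)^2,
  \qquad t_{i1} = \tilde{x}_i^{\top} p_1
  && \text{(PC1 and its scores)} \\
p_2 &= \arg\max_{\|p\|=1,\; p \perp p_1} \sum_{i=1}^{m} \left(\tilde{x}_i^{\top} p\right)^2,
  \qquad t_{i2} = \tilde{x}_i^{\top} p_2
  && \text{(PC2 and its scores)}
\end{aligned}
```

Each score is the signed length of the projection of the centred sample onto the corresponding component, which is the "distance to the origin" described in the text.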

PCA scores plot: analysis of LC-MS metabolomics data

[Figure: scores plot with PC1 axis (labelled DIURNAL DIFFERENCE) and PC2 axis (labelled GENDER DIFFERENCE); sample groups: Female Night, Female Day, Male Night, Male Day]

The relative distances among individual samples in the scores plot represent the similarities/differences between those samples.

PCA scores plots of ¹H NMR spectra of foot muscle

Metabolic changes? Examine the PC1 loadings.

PCA loadings plot

[Figure: loadings on PC1 (y-axis) against variable number, e.g. metabolites (x-axis)]

- Peaks/metabolites with positive PC1 loadings are ELEVATED in diseased abalone.
- Peaks/metabolites with negative PC1 loadings are ELEVATED in healthy abalone.

Identify which metabolites (or peaks) are responsible for the pattern of samples in the scores plot.

PCA loadings plot (continued)

[Figure: loadings on PC1 (y-axis) against variable number, e.g. metabolites (x-axis), with regions marked for diseased abalone and healthy abalone]

Labelled peaks: homarine (p<0.001), formate (p<0.001), ATP (p<0.001), tryptophan (p<0.001), tyrosine (p<0.001).

Typical output for NBAF-B MS dataset

Summary of PCA scores and loadings plots

PCA scores plot
- shows the relationship between samples
- shows the major underlying unbiased structures in your data

PCA loadings plot
- identifies which metabolites (or peaks) are responsible for the structures in the scores plot

Significance testing on PCA scores data (1)

[Figure: PC1 scores for healthy samples compared with PC1 scores for diseased samples]

Significance testing on PCA scores data (2)

Sample       PC1 score
Healthy 1    -3.4
Healthy 2    -4.2
Healthy 3    -2.0
Healthy 4    -2.4
Healthy 5    -1.2
Healthy 6    -3.1
Diseased 1   6.2
Diseased 2   8.1
Diseased 3   2.5
Diseased 4   7.8
Diseased 5   7.3

- t-test on scores data (p<0.001 for abalone)
- Unsupervised analysis found significant separation of groups
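The t-test on these scores can be sketched with the standard library alone. The block below computes Student's two-sample t statistic with a pooled variance; the choice of the pooled (equal-variance) form is an assumption, since the slide does not say which variant was used. A full p-value would normally come from a statistics package such as `scipy.stats.ttest_ind`.

```python
import math
from statistics import mean, variance

# PC1 scores listed on the slide.
healthy = [-3.4, -4.2, -2.0, -2.4, -1.2, -3.1]
diseased = [6.2, 8.1, 2.5, 7.8, 7.3]

n1, n2 = len(healthy), len(diseased)
df = n1 + n2 - 2  # degrees of freedom for the pooled test

# Pooled variance (assumes both groups share a common variance).
pooled_var = ((n1 - 1) * variance(healthy) + (n2 - 1) * variance(diseased)) / df

# Student's two-sample t statistic for the difference in mean PC1 score.
std_err = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
t_stat = (mean(diseased) - mean(healthy)) / std_err

print(f"t = {t_stat:.2f} on {df} degrees of freedom")
```

The statistic comes out at about 8.7 on 9 degrees of freedom, well above the two-sided 0.1% critical value of roughly 4.78 from standard t tables, which is consistent with the p<0.001 quoted on the slide.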

Summary of Principal Components Analysis

1. PCA is a common unsupervised method for analysing metabolomics datasets.
2. The aim is to identify the metabolic similarities and differences between the samples.
3. It is an excellent initial approach for screening a dataset, identifying outliers, and getting a feel for the structure of your data.
4. It can be used to determine which metabolites discriminate between different groups of samples (although it is not as powerful as supervised methods).

4. Supervised multivariate statistical analysis

- Partial least squares discriminant analysis (PLS-DA)
- Partial least squares regression (PLS-R)

What is supervised multivariate statistics?

- Supervised means that the analysis algorithm has prior knowledge of the identities of the samples (classification) or of some other continuous property of the samples (regression).
- It is used to build multivariate models that can predict the identity of an unknown sample (classification) or predict a continuous variable for that unknown sample (regression).
- It is a powerful tool for discovering biomarkers.
- Many supervised methods exist; we focus on partial least squares (PLS) based methods.
- PLS-based methods are widely used in metabolomics.

Partial least squares discriminant analysis (PLS-DA)

- X matrix of signal intensities (one row per sample label, one column per peak)
- Y matrix = treatment group labels

PLS-DA seeks to discriminate different groups of samples (classification), and to discover the relevant biomarkers.

Example of PLS-DA: predict dog breed from urine

[Figure: Miniature Schnauzer (MS) dogs and Labrador dogs, with known MS dogs and known Labrador dogs marked]

What are the urinary metabolic differences between the two dog breeds? Determined using partial least squares discriminant analysis (PLS-DA).

Partial least squares regression (PLS-R)

X matrix of signal intensities (one row per sample label, one column per peak):

peak 1   peak 2   peak 3
5.0      16.6     103.1
8.4      18.2     69.9
9.2      14.4     91.3
1.3      10.0     98.5
4.9      25.9     48.1
7.9      21.3     59.7

Y matrix = separate non-metabolic measurement for each sample = continuous variable

PLS-R seeks to discover relevant biomarkers that can predict the continuous variable (regression).
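To make the idea concrete, here is a minimal sketch of PLS regression with a single latent component (the first step of the NIPALS-style PLS1 algorithm). It reads the intensities above as six samples by three peaks, which is an assumption about the slide layout, and the Y values are invented purely for illustration. No variable scaling or cross-validation is applied, both of which a real analysis would normally include.

```python
import math

# Intensities from the slide, read as 6 samples x 3 peaks (layout is an assumption).
X = [[5.0, 16.6, 103.1], [8.4, 18.2, 69.9], [9.2, 14.4, 91.3],
     [1.3, 10.0, 98.5], [4.9, 25.9, 48.1], [7.9, 21.3, 59.7]]
# Hypothetical continuous Y values, invented purely for illustration.
y = [12.0, 20.0, 15.0, 9.0, 27.0, 23.0]

m, n = len(X), len(X[0])

# Mean-centre X (column-wise) and y.
x_mean = [sum(row[j] for row in X) / m for j in range(n)]
y_mean = sum(y) / m
Xc = [[row[j] - x_mean[j] for j in range(n)] for row in X]
yc = [v - y_mean for v in y]

# Weight vector: the direction in peak space with maximal covariance with y.
w = [sum(Xc[i][j] * yc[i] for i in range(m)) for j in range(n)]
w_norm = math.sqrt(sum(v * v for v in w))
w = [v / w_norm for v in w]

# Scores: projection of each centred sample onto the weight vector.
t = [sum(Xc[i][j] * w[j] for j in range(n)) for i in range(m)]

# Regress y on the scores, then express the model as one coefficient per peak.
q = sum(ti * yi for ti, yi in zip(t, yc)) / sum(ti * ti for ti in t)
coef = [q * wj for wj in w]

# Fitted values and r^2 on the training data (not cross-validated).
y_pred = [y_mean + q * ti for ti in t]
ss_res = sum((a - b) ** 2 for a, b in zip(y, y_pred))
ss_tot = sum(v * v for v in yc)
r2 = 1 - ss_res / ss_tot
```

Peaks with large absolute values in `coef` contribute most to the prediction, which is the sense in which PLS-R points to candidate biomarkers. The `r2` here is a fit to the training samples only and should not be compared with a cross-validated value.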

Example dataset for PLS-R (continuous variable = neonate production)

38 individual adult daphnids (in 5 treatment groups)

[Figure: total no. of neonates produced per daphnid]

Example of PLS-R (continuous variable = neonate production)

- r² (CV) = 0.937
- Optimal PLS model: 107 peaks in mass spectra (out of ca. 4000 total)

5. Data standards, databases, and NBAF-B data analysis workflow

Data standards in omics science

Transcriptomics
- MIAME: Minimum information about a microarray experiment
- MAGE: Microarray gene expression - data exchange format

Proteomics
- PSI-OM: Proteomics standards initiative - object model

Metabolomics

Databases in omics science

- Transcriptomics
- Proteomics
- Metabolomics? No publicly available databases to store metabolomics measurements (yet)

Median Relative Standard Deviation for QC

[Figure: no. of peaks (y-axis) against RSD from 3 analytical reps (%) (x-axis)]

Median RSD is a simple measure of reproducibility.
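A short sketch of the calculation, assuming the RSD of a peak is 100 × sample standard deviation / mean across its analytical replicates. The intensities and the helper name `rsd_percent` are hypothetical and serve only to show the arithmetic.

```python
from statistics import mean, median, stdev

# Hypothetical intensities: each row is one peak measured in 3 analytical replicates.
replicates = [
    [100.0, 110.0, 90.0],
    [50.0, 52.0, 48.0],
    [10.0, 14.0, 6.0],
]


def rsd_percent(values):
    """Relative standard deviation (%) = 100 * sample SD / mean."""
    return 100.0 * stdev(values) / mean(values)


# One RSD per peak, then the median across all peaks as the QC summary.
rsd_per_peak = [rsd_percent(peak) for peak in replicates]
median_rsd = median(rsd_per_peak)
```

Using the median rather than the mean keeps a few very noisy low-intensity peaks from dominating the summary.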

Typical NBAF-B data analysis workflow

- NMR or MS spectral processing
- Initial PCA of dataset with QC samples: is the data of high technical quality?
- Initial PCA of dataset without QC samples: are there biological outliers?
- Further multivariate statistics (PLS-DA or PLS-R, depending on biological question)
- Univariate statistics on NMR or MS data (with FDR)
- Metabolite identification using software tools (MI-Pack, Chenomx, etc.)
- Metabolite identification via further MS and/or NMR experiments