Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

1 Statistical Analysis NBAF-B Metabolomics Masterclass Mark Viant

2 Overview of lecture: 1. Introduction; 2. Univariate analysis; 3. Unsupervised multivariate analysis: principal components analysis (PCA), interpreting scores and loadings plots, statistical tests on PCA scores data; 4. Supervised multivariate analysis: partial least squares discriminant analysis (PLS-DA), partial least squares regression (PLS-R); 5. Data standards, databases, and NBAF-B data analysis workflow

3 1. Introduction

4 Output from spectral processing (Jon & Ulf). NMR: X matrix of signal intensities, samples 1 to m by bins 1 to n. MS: X matrix of signal intensities, samples 1 to m by peaks 1 to n.

5 Output from spectral processing: X and Y matrices. X matrix of signal intensities (samples by peaks), with a label for each sample. EITHER Y matrix = treatment group labels = discrete variable, OR Y matrix = separate non-metabolic measurement for each sample = continuous variable.

6 2. Univariate statistical analysis

7 Univariate statistical analysis. X matrix of signal intensities (samples by peaks), with a label for each sample; Y matrix = treatment group labels = discrete variable. Apply a t-test or ANOVA to each peak, with false discovery rate (FDR) correction.

8 But what p-value is significant? Yes, this is possible, but it must be done with caution! Typically, if we conduct a single univariate statistical test, then p<0.05 is considered a significant result. But this is associated with a false positive rate of 5%: when there is no real difference, 1 in every 20 tests gives a false result. Imagine our dataset contains 1000 metabolites, so we conduct 1000 univariate tests. If p<0.05 is significant and none of the metabolites truly differ, we expect to incorrectly say that about 50 metabolites (1000 × 0.05) are significantly different when they are not. Unacceptable error rate! Adjustment for multiple testing: false discovery rate (FDR) by Benjamini and Hochberg, Journal of the Royal Statistical Society, Series B 57 (1995). Controls the expected proportion of incorrectly rejected null hypotheses (type I errors) among all rejected null hypotheses.
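The Benjamini-Hochberg adjustment can be sketched in a few lines. This is a minimal illustration assuming NumPy; the function name `benjamini_hochberg` and the five p-values are hypothetical and not taken from the lecture or from any NBAF-B tool.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values (q-values) in input order."""
    p = np.asarray(p_values, dtype=float)
    m = p.size
    order = np.argsort(p)                          # indices, smallest p first
    ranked = p[order] * m / np.arange(1, m + 1)    # p_(i) * m / i
    # Enforce monotonicity: q_(i) = min over j >= i of p_(j) * m / j
    ranked = np.minimum.accumulate(ranked[::-1])[::-1]
    q = np.empty(m)
    q[order] = np.minimum(ranked, 1.0)             # cap at 1, restore input order
    return q

# Hypothetical p-values from five univariate tests (illustration only)
p_values = [0.001, 0.008, 0.039, 0.041, 0.60]
q_values = benjamini_hochberg(p_values)
significant = q_values < 0.05
```

With these made-up values, four tests pass the unadjusted p<0.05 threshold but only two remain significant at an FDR of 5%.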

9 Typical output for NBAF-B MS dataset

10 Multivariate statistical analysis. X matrix of signal intensities (samples by peaks), with a label for each sample; Y matrix = treatment group labels = discrete variable. Analyse data in its entirety: unsupervised and supervised analyses.

11 3. Unsupervised multivariate statistical analysis What is principal components analysis (PCA)? Scores plots and their interpretation Loadings plots and their interpretation Statistical tests on scores data

12 What is unsupervised multivariate statistics? Multivariate statistical analysis deals with large numbers of variables (e.g. metabolites) simultaneously. Unsupervised means that the analysis algorithm has no knowledge of the identities of the samples; the algorithm looks at the innate variation in the dataset. There are many unsupervised methods; we focus on principal components analysis. Widely used in omics analyses, including metabolomics, proteomics and transcriptomics.

13 Multivariate data and PCA. It is common to find correlated variables in multivariate data, which means there is redundancy in the information provided by these variables. PCA exploits this redundancy, enabling us to: - pick out patterns (relationships) in the variables; - reduce the dimensionality of a data set without a significant loss of information. PCA is a projection technique.

14 Concept behind principal components analysis: consider 50 different fish. For each fish, measure length and breadth.

15 Principal components analysis - continued. Plotting length vs. breadth shows a clear relationship between these two variables. Multivariate!!! [Figure: scatter plot of breadth against length; table of Fish, Length and Breadth.]

16 Principal components analysis - continued. Plotting length vs. breadth shows a clear relationship between these two variables. Create a new axis (PC1 = principal component 1) that accounts for the largest proportion of the data's variance: PC1 = p1 × length + p2 × breadth. For each data point, project the two original variables (length, breadth) onto the one new variable (PC1). [Figure: scatter plot of breadth against length with the PC1 axis drawn through the points.]

17 Principal components analysis - continued. Plotting length vs. breadth shows a clear relationship between these two variables. Simpler dataset!!! [Figure: scatter plot of breadth against length; table listing a PC1 value for each of the 50 fish.]
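The fish example can be reproduced numerically. The sketch below assumes NumPy and simulated measurements (the lecture's actual fish data are not available here); it finds the coefficients p1 and p2 as the leading eigenvector of the 2 × 2 covariance matrix and replaces the two original variables with one PC1 value per fish.

```python
import numpy as np

# Simulated measurements for 50 fish (hypothetical values, illustration only)
rng = np.random.default_rng(0)
length = rng.uniform(10.0, 30.0, size=50)
breadth = 0.4 * length + rng.normal(0.0, 0.5, size=50)   # breadth tracks length

X = np.column_stack([length, breadth])
Xc = X - X.mean(axis=0)                    # centre each variable on zero

cov = np.cov(Xc, rowvar=False)             # 2 x 2 covariance matrix
eigvals, eigvecs = np.linalg.eigh(cov)     # eigenvalues in ascending order
p = eigvecs[:, -1]                         # (p1, p2): direction of largest variance

pc1 = Xc @ p                               # PC1 = p1*length + p2*breadth (centred)
explained = eigvals[-1] / eigvals.sum()    # fraction of total variance kept by PC1
```

Because length and breadth are strongly correlated in this simulation, `explained` is close to 1, so the single PC1 column carries almost all of the information in the original two columns.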

18 PCA step by step: from dataset to variable space. [Figure: one sample (i) plotted in variable space with axes var. 1, var. 2 and var. 3.] The dataset (many samples) yields a swarm of points in "variable space".

19 PCA step by step: mean centering. Move the centre of the swarm of points to the origin of variable space (0,0,0). [Figure: the swarm of points in var. 1, var. 2, var. 3 space at its original mean and at its new mean.]

20 PCA step by step. [Figure: PC1 axis drawn through the swarm of points in var. 1, var. 2, var. 3 space, with the PC1 score of point i marked.] The first principal component (PC1) is set to describe the largest variation in the data, which is the same as the direction in which the points spread most in the variable space. The score value for point i is the distance from the projection of the point on the 1st component to the origin. PC1 is the first variable in a new coordinate system that describes the variation in the data.

21 PCA step by step. [Figure: PC1 and PC2 axes drawn through the swarm of points in var. 1, var. 2, var. 3 space, with the PC2 score of point i marked.] The second principal component (PC2) is set to describe the largest remaining variation in the data, in a direction perpendicular (orthogonal) to PC1. The score value for point i is the distance from the projection of the point on the 2nd component to the origin.
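The step-by-step description (mean centering, then PC1, then an orthogonal PC2) corresponds to a singular value decomposition of the centred data. The following is a sketch assuming NumPy and random data; the helper name `pca` is hypothetical, and real metabolomics data would normally be normalised and scaled first, which is not shown.

```python
import numpy as np

def pca(X, n_components=2):
    """PCA by SVD: returns scores, loadings and explained variance fractions."""
    X = np.asarray(X, dtype=float)
    Xc = X - X.mean(axis=0)                    # mean centering: swarm moves to origin
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    loadings = Vt[:n_components].T             # columns are the PC1, PC2 directions
    scores = Xc @ loadings                     # signed distance along each PC axis
    explained = s[:n_components] ** 2 / np.sum(s ** 2)
    return scores, loadings, explained

# Hypothetical dataset: 20 samples measured on 3 variables with unequal spread
rng = np.random.default_rng(1)
X = rng.normal(size=(20, 3)) * np.array([5.0, 2.0, 0.5])
scores, loadings, explained = pca(X)
```

Plotting `scores[:, 0]` against `scores[:, 1]` gives a scores plot; the rows of `loadings` show how much each original variable contributes to each component.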

22 PCA scores plot: analysis of LC-MS metabolomics data. [Figure: scores plot of PC1 against PC2 with groups Female Night, Female Day, Male Night and Male Day, annotated "GENDER DIFFERENCE" and "DIURNAL DIFFERENCE".] The relative distances among individual samples in the scores plot represent the similarities/differences between those samples.

23 PCA scores plots of 1H NMR spectra of foot muscle. Metabolic changes??? PC1 loadings.

24 PCA loadings plot. [Figure: loadings on PC1 plotted against variable number (e.g. metabolites).] Peaks/metabolites with positive PC1 loadings are ELEVATED in diseased abalone. Peaks/metabolites with negative PC1 loadings are ELEVATED in healthy abalone. Identify which metabolites (or peaks) are responsible for the pattern of samples in the scores plot.

25 PCA loadings plot. [Figure: loadings on PC1 plotted against variable number (e.g. metabolites), with regions marked diseased abalone and healthy abalone and labelled peaks: homarine (p<0.001), formate (p<0.001), ATP (p<0.001), tryptophan (p<0.001), tyrosine (p<0.001).] Identify which metabolites (or peaks) are responsible for the pattern of samples in the scores plot.

26 Typical output for NBAF-B MS dataset

27 Summary of PCA scores and loadings plots. The PCA scores plot shows the relationship between samples and the major underlying unbiased structures in your data. The PCA loadings plot identifies which metabolites (or peaks) are responsible for the structures in the scores plot.

28 Significance testing on PCA scores data (1). [Figure: PC1 scores for healthy and PC1 scores for diseased.]

29 Significance testing on PCA scores data (2). [Table: PC1 score for each healthy and diseased sample; the only surviving entry is "Diseased 5 7.3".] t-test on scores data (p<0.001 for abalone). Unsupervised analysis found significant separation of groups.
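A group comparison on PC1 scores can be sketched as follows. The scores are invented for illustration (the values from the lecture's table are not recoverable). To keep the example dependent on NumPy only, the p-value comes from an exact permutation test on the Student t statistic rather than from the t-distribution, which is a substitution and not the calculation used in the lecture.

```python
from itertools import combinations

import numpy as np

def t_statistic(a, b):
    """Two-sample Student t statistic with pooled variance."""
    na, nb = len(a), len(b)
    pooled = ((na - 1) * np.var(a, ddof=1) + (nb - 1) * np.var(b, ddof=1)) / (na + nb - 2)
    return (np.mean(a) - np.mean(b)) / np.sqrt(pooled * (1.0 / na + 1.0 / nb))

def exact_permutation_p(a, b):
    """Two-sided p-value: share of all relabellings with |t| >= the observed |t|."""
    values = np.concatenate([a, b])
    n = len(values)
    observed = abs(t_statistic(a, b))
    count = total = 0
    for idx in combinations(range(n), len(a)):   # every way to pick group a
        mask = np.zeros(n, dtype=bool)
        mask[list(idx)] = True
        t_perm = abs(t_statistic(values[mask], values[~mask]))
        count += int(t_perm >= observed - 1e-12)
        total += 1
    return count / total

# Hypothetical PC1 scores (illustration only)
healthy = np.array([-6.1, -5.4, -7.0, -4.8, -5.9, -6.5])
diseased = np.array([6.8, 7.3, 5.9, 8.1, 6.4])

t = t_statistic(healthy, diseased)
p = exact_permutation_p(healthy, diseased)
```

In everyday use, `scipy.stats.ttest_ind(healthy, diseased)` returns the t statistic together with the conventional t-distribution p-value.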

30 Summary of Principal Components Analysis 1. PCA is a common unsupervised method for analysing metabolomics datasets. 2. The aim is to identify the metabolic similarities and differences between the samples. 3. Excellent initial approach for screening a dataset, identifying outliers, and getting a feel for the structure of your data. 4. Can be used to determine which metabolites discriminate between different groups of samples (although not as powerful as supervised methods).

31 4. Supervised multivariate statistical analysis Partial least squares discriminant analysis (PLS-DA) Partial least squares regression (PLS-R)

32 What is supervised multivariate statistics? Supervised means that the analysis algorithm has prior knowledge of the identities of the samples (classification) or of some other continuous property of the samples (regression). Used to build multivariate models that can predict the identity of an unknown sample (classification) or predict a continuous variable for that unknown sample (regression). Powerful tool for discovering biomarkers. Many supervised methods exist; we focus on partial least squares (PLS) based methods. Widely used in metabolomics.

33 Partial least squares discriminant analysis (PLS-DA). X matrix of signal intensities (samples by peaks), with a label for each sample; Y matrix = treatment group labels. PLS-DA seeks to discriminate different groups of samples (classification), and to discover the relevant biomarkers.

34 Example of PLS-DA: predict dog breed from urine. Miniature Schnauzer (MS) dogs and Labrador dogs. [Figure: known MS dogs and known Labrador dogs.]

35 What are urinary metabolic differences between two dog breeds? Determined using partial least squares discriminant analysis (PLS-DA)

36 Partial least squares regression (PLS-R). X matrix of signal intensities (samples by peaks), with a label for each sample; Y matrix = separate non-metabolic measurement for each sample = continuous variable. PLS-R seeks to discover relevant biomarkers that can predict the continuous variable (regression).

37 Example dataset for PLS-R (continuous variable = neonate production). 38 individual adult daphnids (in 5 treatment groups). [Figure: total no. of neonates produced per daphnid.]

38 Example of PLS-R (continuous variable = neonate production). r2 (CV) = [value missing]. Optimal PLS model: 107 peaks in mass spectra (out of ca. [value missing] total).
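A PLS regression with a single continuous Y variable and a cross-validated r2 can be sketched with the NIPALS-style PLS1 algorithm. This is an illustration on simulated data, assuming NumPy. The function names, the data, the choice of two components and the leave-one-out scheme are all assumptions, and r2 (CV) is computed here as 1 - PRESS/TSS, which is one common definition; the lecture does not state which software or definition was used.

```python
import numpy as np

def pls1_fit(X, y, n_components):
    """Fit PLS1 (one y variable); returns x_mean, y_mean and regression coefficients."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    x_mean, y_mean = X.mean(axis=0), y.mean()
    Xk, yk = X - x_mean, y - y_mean
    W, P, q = [], [], []
    for _ in range(n_components):
        w = Xk.T @ yk
        w /= np.linalg.norm(w)           # weight vector: direction in X that covaries with y
        t = Xk @ w                       # scores for this component
        tt = t @ t
        p = Xk.T @ t / tt                # X loadings
        c = yk @ t / tt                  # y loading
        Xk = Xk - np.outer(t, p)         # deflate X
        yk = yk - c * t                  # deflate y
        W.append(w); P.append(p); q.append(c)
    W, P, q = np.array(W).T, np.array(P).T, np.array(q)
    coef = W @ np.linalg.solve(P.T @ W, q)
    return x_mean, y_mean, coef

def pls1_predict(model, X):
    x_mean, y_mean, coef = model
    return (np.asarray(X, dtype=float) - x_mean) @ coef + y_mean

def loo_r2(X, y, n_components):
    """Leave-one-out cross-validated r2, computed as 1 - PRESS/TSS."""
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    n = len(y)
    preds = np.empty(n)
    for i in range(n):
        keep = np.arange(n) != i         # hold out sample i
        model = pls1_fit(X[keep], y[keep], n_components)
        preds[i] = pls1_predict(model, X[i:i + 1])[0]
    return 1.0 - np.sum((y - preds) ** 2) / np.sum((y - y.mean()) ** 2)

# Simulated data: 38 samples, 50 peaks driven by 2 latent factors (hypothetical)
rng = np.random.default_rng(2)
latent = rng.normal(size=(38, 2))
X = latent @ rng.normal(size=(2, 50)) + 0.1 * rng.normal(size=(38, 50))
y = 3.0 * latent[:, 0] - 2.0 * latent[:, 1] + 0.1 * rng.normal(size=38)

model = pls1_fit(X, y, n_components=2)
r2_cv = loo_r2(X, y, n_components=2)
```

The same fitting routine can serve as a simple two-group PLS-DA if y is coded as 0 and 1 for the two groups and predictions are thresholded at 0.5.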

39 5. Data standards, databases, and NBAF-B data analysis workflow

40 Data standards in omics science. Transcriptomics: MIAME (Minimum information about a microarray experiment); MAGE (Microarray gene expression, data exchange format). Proteomics: PSI-OM (Proteomics standards initiative, object model). Metabolomics

41 Databases in omics science. Transcriptomics. Proteomics. Metabolomics? No publicly available databases to store metabolomics measurements (yet).

42 Median relative standard deviation (RSD) for QC. [Figure: no. of peaks against RSD from 3 analytical reps (%).] Median RSD is a simple measure of reproducibility.
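The median RSD can be computed directly from replicate QC measurements. This is a sketch assuming NumPy, with a made-up matrix of 3 analytical replicates by 3 peaks; the function name `median_rsd` is hypothetical.

```python
import numpy as np

def median_rsd(intensities):
    """Per-peak RSD (%) and its median for a (replicates x peaks) intensity matrix."""
    x = np.asarray(intensities, dtype=float)
    rsd = 100.0 * x.std(axis=0, ddof=1) / x.mean(axis=0)   # sample SD / mean, per peak
    return float(np.median(rsd)), rsd

# Hypothetical QC intensities: 3 analytical replicates (rows) x 3 peaks (columns)
qc = np.array([
    [100.0, 50.0, 200.0],
    [110.0, 55.0, 190.0],
    [90.0, 45.0, 210.0],
])
median_value, per_peak = median_rsd(qc)
```

Here the three peaks have RSDs of 10%, 10% and 5%, so the median RSD is 10%.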

43 Typical NBAF-B data analysis workflow: (1) NMR or MS spectral processing; (2) initial PCA of dataset with QC samples: is the data of high technical quality?; (3) initial PCA of dataset without QC samples: are there biological outliers?; (4) further multivariate statistics (PLS-DA or PLS-R, depending on biological question); (5) univariate statistics on NMR or MS data (with FDR); (6) metabolite identification using software tools (MI-Pack, Chenomx, etc.); (7) metabolite identification via further MS and/or NMR experiments.


More information

Gene Expression Analysis

Gene Expression Analysis Gene Expression Analysis Jie Peng Department of Statistics University of California, Davis May 2012 RNA expression technologies High-throughput technologies to measure the expression levels of thousands

More information

Chapter 7: Simple linear regression Learning Objectives

Chapter 7: Simple linear regression Learning Objectives Chapter 7: Simple linear regression Learning Objectives Reading: Section 7.1 of OpenIntro Statistics Video: Correlation vs. causation, YouTube (2:19) Video: Intro to Linear Regression, YouTube (5:18) -

More information

0BComparativeMarkerSelection Documentation

0BComparativeMarkerSelection Documentation 0BComparativeMarkerSelection Documentation Description: Author: Computes significance values for features using several metrics, including FDR(BH), Q Value, FWER, Feature-Specific P-Value, and Bonferroni.

More information

Strategies in data integration to predict fish susceptibility to toxicants

Strategies in data integration to predict fish susceptibility to toxicants Strategies in data integration to predict fish susceptibility to toxicants Fernando Ortega and Francesco Falciani School of Biosciences The University of Birmingham NERC PGP Project Kevin Chipman Mark

More information

Multivariate Normal Distribution

Multivariate Normal Distribution Multivariate Normal Distribution Lecture 4 July 21, 2011 Advanced Multivariate Statistical Methods ICPSR Summer Session #2 Lecture #4-7/21/2011 Slide 1 of 41 Last Time Matrices and vectors Eigenvalues

More information

Environmental Remote Sensing GEOG 2021

Environmental Remote Sensing GEOG 2021 Environmental Remote Sensing GEOG 2021 Lecture 4 Image classification 2 Purpose categorising data data abstraction / simplification data interpretation mapping for land cover mapping use land cover class

More information

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 123 CHAPTER 7 BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES 7.1 Introduction Even though using SVM presents

More information

Unit 26: Small Sample Inference for One Mean

Unit 26: Small Sample Inference for One Mean Unit 26: Small Sample Inference for One Mean Prerequisites Students need the background on confidence intervals and significance tests covered in Units 24 and 25. Additional Topic Coverage Additional coverage

More information

Introduction to Statistics and Quantitative Research Methods

Introduction to Statistics and Quantitative Research Methods Introduction to Statistics and Quantitative Research Methods Purpose of Presentation To aid in the understanding of basic statistics, including terminology, common terms, and common statistical methods.

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining 1 Why Data Mining? Explosive Growth of Data Data collection and data availability Automated data collection tools, Internet, smartphones, Major sources of abundant data Business:

More information

Multivariate Analysis. Overview

Multivariate Analysis. Overview Multivariate Analysis Overview Introduction Multivariate thinking Body of thought processes that illuminate the interrelatedness between and within sets of variables. The essence of multivariate thinking

More information

Thermo Scientific SIEVE Software for Differential Expression Analysis

Thermo Scientific SIEVE Software for Differential Expression Analysis m a s s s p e c t r o m e t r y Thermo Scientific SIEVE Software for Differential Expression Analysis Automated, label-free, semi-quantitative analysis of proteins, peptides, and metabolites based on comparisons

More information

Normality Testing in Excel

Normality Testing in Excel Normality Testing in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com

More information

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon

t Tests in Excel The Excel Statistical Master By Mark Harmon Copyright 2011 Mark Harmon t-tests in Excel By Mark Harmon Copyright 2011 Mark Harmon No part of this publication may be reproduced or distributed without the express permission of the author. mark@excelmasterseries.com www.excelmasterseries.com

More information

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression

Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Principle Component Analysis and Partial Least Squares: Two Dimension Reduction Techniques for Regression Saikat Maitra and Jun Yan Abstract: Dimension reduction is one of the major tasks for multivariate

More information

Factors affecting online sales

Factors affecting online sales Factors affecting online sales Table of contents Summary... 1 Research questions... 1 The dataset... 2 Descriptive statistics: The exploratory stage... 3 Confidence intervals... 4 Hypothesis tests... 4

More information

Overview of Factor Analysis

Overview of Factor Analysis Overview of Factor Analysis Jamie DeCoster Department of Psychology University of Alabama 348 Gordon Palmer Hall Box 870348 Tuscaloosa, AL 35487-0348 Phone: (205) 348-4431 Fax: (205) 348-8648 August 1,

More information

Introduction to Regression and Data Analysis

Introduction to Regression and Data Analysis Statlab Workshop Introduction to Regression and Data Analysis with Dan Campbell and Sherlock Campbell October 28, 2008 I. The basics A. Types of variables Your variables may take several forms, and it

More information

Multivariate Analysis of Ecological Data

Multivariate Analysis of Ecological Data Multivariate Analysis of Ecological Data MICHAEL GREENACRE Professor of Statistics at the Pompeu Fabra University in Barcelona, Spain RAUL PRIMICERIO Associate Professor of Ecology, Evolutionary Biology

More information

Statistics Review PSY379

Statistics Review PSY379 Statistics Review PSY379 Basic concepts Measurement scales Populations vs. samples Continuous vs. discrete variable Independent vs. dependent variable Descriptive vs. inferential stats Common analyses

More information

Basics of microarrays. Petter Mostad 2003

Basics of microarrays. Petter Mostad 2003 Basics of microarrays Petter Mostad 2003 Why microarrays? Microarrays work by hybridizing strands of DNA in a sample against complementary DNA in spots on a chip. Expression analysis measure relative amounts

More information

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk

Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Analysing Questionnaires using Minitab (for SPSS queries contact -) Graham.Currell@uwe.ac.uk Structure As a starting point it is useful to consider a basic questionnaire as containing three main sections:

More information

Introduction to mass spectrometry (MS) based proteomics and metabolomics

Introduction to mass spectrometry (MS) based proteomics and metabolomics Introduction to mass spectrometry (MS) based proteomics and metabolomics Tianwei Yu Department of Biostatistics and Bioinformatics Rollins School of Public Health Emory University September 10, 2015 Background

More information

Effects of Intelligent Data Acquisition and Fast Laser Speed on Analysis of Complex Protein Digests

Effects of Intelligent Data Acquisition and Fast Laser Speed on Analysis of Complex Protein Digests Effects of Intelligent Data Acquisition and Fast Laser Speed on Analysis of Complex Protein Digests AB SCIEX TOF/TOF 5800 System with DynamicExit Algorithm and ProteinPilot Software for Robust Protein

More information

Machine Learning with MATLAB David Willingham Application Engineer

Machine Learning with MATLAB David Willingham Application Engineer Machine Learning with MATLAB David Willingham Application Engineer 2014 The MathWorks, Inc. 1 Goals Overview of machine learning Machine learning models & techniques available in MATLAB Streamlining the

More information

Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario

Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario Statistics 112 Regression Cheatsheet Section 1B - Ryan Rosario I have found that the best way to practice regression is by brute force That is, given nothing but a dataset and your mind, compute everything

More information

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data CMPE 59H Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data Term Project Report Fatma Güney, Kübra Kalkan 1/15/2013 Keywords: Non-linear

More information

Factor Analysis. Chapter 420. Introduction

Factor Analysis. Chapter 420. Introduction Chapter 420 Introduction (FA) is an exploratory technique applied to a set of observed variables that seeks to find underlying factors (subsets of variables) from which the observed variables were generated.

More information

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout

SPSS Resources. 1. See website (readings) for SPSS tutorial & Stats handout Analyzing Data SPSS Resources 1. See website (readings) for SPSS tutorial & Stats handout Don t have your own copy of SPSS? 1. Use the libraries to analyze your data 2. Download a trial version of SPSS

More information

Module 5: Statistical Analysis

Module 5: Statistical Analysis Module 5: Statistical Analysis To answer more complex questions using your data, or in statistical terms, to test your hypothesis, you need to use more advanced statistical tests. This module reviews the

More information

Psyc 250 Statistics & Experimental Design. Correlation Exercise

Psyc 250 Statistics & Experimental Design. Correlation Exercise Psyc 250 Statistics & Experimental Design Correlation Exercise Preparation: Log onto Woodle and download the Class Data February 09 dataset and the associated Syntax to create scale scores Class Syntax

More information

Diagnosis of Students Online Learning Portfolios

Diagnosis of Students Online Learning Portfolios Diagnosis of Students Online Learning Portfolios Chien-Ming Chen 1, Chao-Yi Li 2, Te-Yi Chan 3, Bin-Shyan Jong 4, and Tsong-Wuu Lin 5 Abstract - Online learning is different from the instruction provided

More information

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition

Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Bowerman, O'Connell, Aitken Schermer, & Adcock, Business Statistics in Practice, Canadian edition Online Learning Centre Technology Step-by-Step - Excel Microsoft Excel is a spreadsheet software application

More information