Statistical Analysis for Microarray & Next-Generation Sequencing Studies

Statistical Analysis for Microarray & Next-Generation Sequencing Studies: Using the right tools and interpreting the results
Jonathan Gerstenhaber, Field Application Specialist, Partek Inc.

Who is Partek?
Founded in 1993.
Based in St. Louis, MO, USA.
Focused on genomics.
Thousands of customers worldwide.
Building tools for both biologists and bioinformaticians.
Copyright Partek Inc

What is Partek Genomics Suite?
Desktop software, no server required.
Supports multiple assays.
Supports all assay providers.
Enables integrated genomics.
Advanced statistics.
Rapid development.
Focus on technical support.
Competitively priced.

ONE Software, Any Platform
Partek Genomics Suite, including RT-PCR.

ONE Software for Any Assay
Partek Genomics Suite: Gene Expression, RNA-seq, sRNA-seq, Exon, ChIP-Seq, ChIP-chip & methylation, miRNA, Copy Number, DNA-seq.

Linear Workflows Help Users through Analysis
Import, QA/QC, Analysis, Visualization, Biological Interpretation, Integrated Genomics.
Additional menu options.

Integrated Genomics
Genome: Copy Number + AsCN, Loss of Heterozygosity, Cytogenetics, Association, DNA-seq.
Transcriptome: Gene Expression, Exon/Alternative Splicing, DGE & mRNA-seq, TaqMan RT-PCR, SAGE.
Regulation: miRNA arrays, sRNA-seq, Tiling arrays, ChIP-seq, MeDIP-seq.

Three major goals today
1. How to analyze data in the most powerful way possible.
2. What to do when your experimental design is not balanced.
3. How to interpret your results to give you the most powerful answers.

Build complex models to explain experiments
Larger experiments can be analyzed more powerfully when all experimental variables are taken into account.
This study of Down syndrome has very complex behaviors that necessitate powerful analytical methods.

Evolving models: colon cancer vs. normal
T-test: a one-factor comparison of tumor vs. normal. The data do not appear especially significant.
Paired t-test: two factors, comparing tumor vs. normal while controlling for patient-to-patient differences.
3-way ANOVA: compares tumor vs. normal just as a t-test does, controls for patient-to-patient and male-female differences, and lets us ask new questions: does colon cancer affect men and women the same way?
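The gain from moving from an unpaired to a paired comparison can be sketched with a few lines of Python. The expression values below are hypothetical (they are not from the colon cancer study), and the functions are a plain textbook implementation, not Partek's code. Patients differ a lot from one another, but each patient's tumor is consistently a little higher than that patient's normal tissue.

```python
import math
from statistics import mean, stdev

def unpaired_t(a, b):
    """Student's t statistic with a pooled variance (equal-variance assumption)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * stdev(a) ** 2 + (nb - 1) * stdev(b) ** 2) / (na + nb - 2)
    return (mean(a) - mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def paired_t(a, b):
    """Paired t statistic: a one-sample t-test on the within-patient differences."""
    diffs = [x - y for x, y in zip(a, b)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

# Hypothetical log2 expression of one gene in four patients (same patient order).
normal = [5.0, 7.0, 9.0, 11.0]
tumor = [5.5, 7.6, 9.4, 11.5]

t_unpaired = unpaired_t(tumor, normal)  # patient-to-patient spread swamps the shift
t_paired = paired_t(tumor, normal)      # patient differences cancel within each pair

print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

The tumor-minus-normal shift is the same in both calculations; only the error term changes, which is the point the slide makes about controlling for patient-to-patient differences.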

Significant interaction

RNA sequencing analysis is just as easy in Partek
The RNA-Seq workflow will take you through import and transcript abundance estimation.
The resulting estimates can be analyzed using ANOVA for complex designs or to remove potential batch effects like lane or flow channel.
Fit data to known transcripts.

Batch effects
Large experiments can become more complex due to batch or other nuisance variables.

Remove noise & highlight biology: Trefoil Factor 1
Since the treatments were perfectly balanced with the batches, the batch can be completely removed from the data.
With a simple 2-way ANOVA, this gene was #228 on the gene list and would not pass multiple test correction for significance. With a 3-way ANOVA including batch, it was #2 on the gene list.

Factor           2-way ANOVA   3-way ANOVA
Treatment        0.00391497    3.43275E-07
Time             0.396031      0.00964938
Treatment*Time   0.100862      3.56752E-05

Batch effect remover
Appreciating that the ANOVA is successfully partitioning all our factors apart doesn't make for better images straight away.
If you want to see your data the way ANOVA sees your data, use the Batch Effect Remover.

Batch effect remover
The Batch Effect Remover does not make the data any better. Analyzing the batch-corrected data yields the same results; even fold change will not be altered.
Batch-corrected data should not be used as input into other types of analysis, as it has already been fit to the ANOVA model.
[Figure panels: Original Results; Batch Corrected Results]
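Why the fold change survives batch removal in a balanced design can be illustrated with a small sketch. This is an assumption-laden toy, not Partek's Batch Effect Remover: it simply subtracts each batch's mean and adds back the grand mean, on hypothetical log2 values where every batch holds the same number of treated and control samples.

```python
from statistics import mean, pstdev

# Hypothetical log2 values: (treatment, batch, value). The design is balanced:
# each batch contains two treated and two control samples.
samples = [
    ("treated", "batch1", 8.0), ("treated", "batch1", 8.2),
    ("control", "batch1", 7.0), ("control", "batch1", 7.2),
    ("treated", "batch2", 10.0), ("treated", "batch2", 10.2),
    ("control", "batch2", 9.0), ("control", "batch2", 9.2),
]

grand_mean = mean([v for _, _, v in samples])
batch_means = {
    b: mean([v for _, bb, v in samples if bb == b])
    for b in {batch for _, batch, _ in samples}
}

# Center every batch on the grand mean.
corrected = [(t, b, v - batch_means[b] + grand_mean) for t, b, v in samples]

def treatment_difference(rows):
    """Mean log2 difference, treated minus control (the log2 fold change)."""
    treated = [v for t, _, v in rows if t == "treated"]
    control = [v for t, _, v in rows if t == "control"]
    return mean(treated) - mean(control)

def treated_spread(rows):
    """Spread of the treated samples, which batch differences inflate."""
    return pstdev([v for t, _, v in rows if t == "treated"])

diff_before = treatment_difference(samples)
diff_after = treatment_difference(corrected)
spread_before = treated_spread(samples)
spread_after = treated_spread(corrected)
```

Because both treatment groups have the same batch composition, the batch offsets cancel in the treated-minus-control difference, so only the scatter around each group changes. In an unbalanced design this cancellation would not hold.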

Model building: Am I making it better, or bigger?
Factors should be more significant than error.
They should help explain more of the error from the pie.

Model fitness and significance
ANOVA models, like lines fit in Excel, have a significance and a fitness!
This is a less graphical method, best suited to cases where you have a particular gene of interest that you are looking to optimize.

Excerpt from ANOVA report (Analysis of Variance):

Source    DF   Sum of Squares   Mean Square   F       p-value
Model     11   4.27274          0.3884        12.29   0.00075
Error      8   0.25276          0.0315
C Total   19   4.52551
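The columns of the ANOVA report are related by simple arithmetic, which the following sketch reproduces from the sums of squares and degrees of freedom shown in the excerpt. The R-squared line is an addition for illustration (the report excerpt does not list it); it is the usual "fitness" analogue of the R-squared of a line fit in Excel.

```python
# Values taken from the ANOVA report excerpt.
ss_model, df_model = 4.27274, 11
ss_error, df_error = 0.25276, 8

# Mean square = sum of squares / degrees of freedom.
ms_model = ss_model / df_model
ms_error = ss_error / df_error

# F compares variation explained by the model with unexplained variation.
f_stat = ms_model / ms_error

# Fraction of total variation explained by the model (not shown in the report).
r_squared = ss_model / (ss_model + ss_error)

print(f"MS model = {ms_model:.4f}, MS error = {ms_error:.4f}")
print(f"F = {f_stat:.2f}, R^2 = {r_squared:.3f}")
```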

Contrast vs. t-test: power
Contrasts allow specific comparisons to be made between groups in the ANOVA without filtering data and running a t-test.
This allows us to have more degrees of freedom.
When small experiments are run, our analysis is limited because we have a poor variance estimate.
A contrast allows us to calculate variance from all groups, even when comparing only two of them!
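A minimal sketch of the degrees-of-freedom argument, using hypothetical values for three groups of three samples. The t-test on groups A and B pools variance from those two groups only; the contrast compares the same two means but pools the error variance across every group in the one-way ANOVA. This is a textbook calculation, not Partek's implementation.

```python
import math
from statistics import mean, variance

# Hypothetical log2 expression values for one gene in three groups.
groups = {
    "A": [7.1, 7.4, 6.9],
    "B": [8.0, 8.3, 7.8],
    "C": [7.5, 7.7, 7.4],
}

def pooled_variance(samples):
    """Pooled within-group variance and its degrees of freedom."""
    df = sum(len(g) - 1 for g in samples)
    ss = sum((len(g) - 1) * variance(g) for g in samples)
    return ss / df, df

def t_statistic(a, b, error_variance):
    """t statistic for mean(a) - mean(b) given an error variance estimate."""
    return (mean(a) - mean(b)) / math.sqrt(error_variance * (1 / len(a) + 1 / len(b)))

a, b = groups["A"], groups["B"]

# t-test: variance estimated from the two compared groups only.
ttest_var, ttest_df = pooled_variance([a, b])
# Contrast: same comparison, variance estimated from all groups.
contrast_var, contrast_df = pooled_variance(list(groups.values()))

print(f"t-test:   t = {t_statistic(a, b, ttest_var):.2f} on {ttest_df} df")
print(f"contrast: t = {t_statistic(a, b, contrast_var):.2f} on {contrast_df} df")
```

With more groups in the experiment the contrast keeps gaining degrees of freedom, while the filtered t-test stays at the size of the two groups being compared.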

Contrast vs. t-test: new comparisons
[Figure panels: T-test view; ANOVA view]

Assumptions of ANOVA
1. Sample groups are independent. Build the best models possible to describe sample relations.
2. Variance is equal within different treatment groups. Designing balanced experiments, with similar numbers of treated and control samples, will keep variance similar between groups.
3. Data are normally distributed (bell shaped) within different treatment groups. For array data, this is one major reason we log-transform the data.

Imbalance: REML
The default ANOVA in Partek is called Method of Moments (MoM).
It is especially fast and works very well on balanced designs.
Experiments that are very unbalanced can actually become effectively underpowered.
[Figure panels: 2-way MoM ANOVA; 2-way MoM ANOVA (excluding non-paired samples); 2-way REML ANOVA]

Imbalance: REML
REML is designed to handle data that is imbalanced or incomplete.
In some of these cases, Method of Moments will be unable to present an unbiased result, and Partek will output "?" within the results.
Switching to REML will remedy the situation, but will also remove the p-values for random effects.

Imbalance: Welch's
When data is balanced between groups (# of controls = # of treated), variance can be within 3x without problems when using an ANOVA.
When groups are of very different sizes, equal variance can become a concern. This is why it is ideal to design balanced experiments.
Available from the Stat menu, Welch's ANOVA allows the comparison of multiple groups when variance is unequal.
Unfortunately, it is limited to a single factor at a time.
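The idea behind Welch's correction can be sketched for the simplest case of two groups, where Welch's ANOVA reduces to Welch's t-test. Each group keeps its own variance, and the degrees of freedom are reduced by the Welch-Satterthwaite formula. The data are hypothetical, and this is a textbook calculation rather than the routine behind Partek's Stat menu.

```python
import math
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of mean(a)
    vb = variance(b) / len(b)  # squared standard error of mean(b)
    t = (mean(a) - mean(b)) / math.sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df

# Hypothetical unbalanced experiment: many tight controls, few variable treated.
control = [7.0, 7.1, 6.9, 7.0, 7.2, 6.8, 7.1, 6.9]
treated = [8.5, 10.0, 7.2]

t, df = welch_t(treated, control)
print(f"Welch t = {t:.2f} on {df:.1f} df (pooled t-test would use {len(control) + len(treated) - 2})")
```

The reduced degrees of freedom reflect that the small, noisy group dominates the uncertainty, something a pooled variance would hide.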

Parametric tests versus nonparametric

Parametric test           Nonparametric test
T-test                    Mann-Whitney
ANOVA                     Kruskal-Wallis
Repeated measures ANOVA   Friedman

Parametric tests assume a normal distribution, but yield more powerful results. It is best to transform the data so that it is approximately normal.
Nonparametric tests can operate even when the data is of unknown distribution, so long as the shape is the same for all samples.

# of samples   Minimum p-value
10             0.009
9              0.014
8              0.02

You need many samples to get significance.
Significance is independent of degree of change! This can lead to increased false discovery, especially in small experiments.
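The floor on a rank-based p-value can be reproduced with a short calculation. The slide does not say how its table was computed; as an assumption, the sketch below uses a two-sided Mann-Whitney test with the samples split as evenly as possible and the normal approximation (no continuity correction), which gives values close to the table. The exact permutation floor is shown alongside for comparison.

```python
import math

def min_p_normal_approx(n1, n2):
    """Smallest two-sided Mann-Whitney p-value (U = 0) under the normal approximation."""
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = mean_u / sd_u  # distance of U = 0 from its expected value, in SDs
    return math.erfc(z / math.sqrt(2))  # two-sided normal tail probability

def min_p_exact(n1, n2):
    """Smallest two-sided exact p-value: only 2 of all rank splits are this extreme."""
    return 2 / math.comb(n1 + n2, n1)

# Total sample counts from the table, split as evenly as possible (assumption).
for n1, n2 in [(5, 5), (4, 5), (4, 4)]:
    print(n1 + n2, round(min_p_normal_approx(n1, n2), 3), round(min_p_exact(n1, n2), 4))
```

However large the fold change, perfectly separated groups cannot score below this floor, which is why the rank-based p-value says nothing about the degree of change.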

Power analysis to appreciate our experiment
Power analysis can give the effective dynamic range of an experiment.
The blue line represents the current experiment (colon cancer vs. normal).
With 20 samples, we detect 90% of genes that changed 1.8 fold as statistically significant, but only 10% of the genes that changed 1.1 fold.

Power analysis for pilot studies
What if I had started with only 2 patients (2 colon cancer + 2 normal samples)?
Determine the ideal experiment size to detect genes of a specific fold change.
What if I needed to detect genes changed only 75%, or 1.75 fold?
To significantly detect 90% of the genes that changed 1.75 fold, I will need 20 samples.
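How power depends on fold change and sample size can be sketched with a normal approximation for a two-group comparison on the log2 scale. This is not the method Partek uses, and the standard deviation `sigma` is a hypothetical value chosen for illustration, so the numbers will not match the slides; only the shape of the relationship is the point.

```python
import math
from statistics import NormalDist

def approx_power(fold_change, sigma, n_per_group, alpha=0.05):
    """Approximate two-sided power to detect a fold change (normal approximation)."""
    delta = math.log2(fold_change)               # effect size on the log2 scale
    se = sigma * math.sqrt(2 / n_per_group)      # SE of the difference in group means
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    shift = delta / se
    normal = NormalDist()
    return normal.cdf(shift - z_crit) + normal.cdf(-shift - z_crit)

sigma = 0.5  # hypothetical per-gene SD of log2 expression

for fc in (1.1, 1.75, 1.8):
    for n in (2, 10):
        print(f"fold {fc}, {n} per group: power ~ {approx_power(fc, sigma, n):.2f}")
```

Small fold changes stay hard to detect even as samples are added, while large ones reach high power quickly, which matches the "effective dynamic range" reading of the power curve.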

What is a p-value and what is FDR?
Comparison: treated vs. control.
P-value 0.01: a 1% chance that data like this would appear if treatment = control.
P-value 0.99: a 99% chance of seeing a difference at least this large if treatment = control.
P-value 0.2: a 20% chance of seeing a difference at least this large if treatment = control.
A lack of low p-values does not indicate that the groups are necessarily equal.

What about FDR?
FDR is a measure of the false positive potential of a gene list. It is not a correction of an individual gene's level of significance; rather, it helps us keep the whole picture in mind.
A coin which lands heads up 5 times in a row has a p-value of 0.03. Yet if I were to give you 1000 coins to flip 5 times each, you would expect it to happen over 30 times! Here is where FDR helps.
A gene list with an FDR of 0.05 has 5% predicted false positives.
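The coin example is plain arithmetic, worked through below. The last two lines extend it to a gene list; the list size of 200 is a hypothetical number added for illustration.

```python
# Probability that a fair coin lands heads 5 times in a row.
p_five_heads = 0.5 ** 5                        # 0.03125, quoted as 0.03 on the slide

# Flip 1000 fair coins 5 times each: how many show 5 heads purely by chance?
n_coins = 1000
expected_by_chance = n_coins * p_five_heads    # 31.25 coins

# FDR describes a list, not a single item: a list with FDR 0.05 is predicted
# to contain 5% false positives.
list_size = 200                                # hypothetical gene list size
predicted_false_positives = 0.05 * list_size   # 10 genes

print(p_five_heads, expected_by_chance, predicted_false_positives)
```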


What is FDR: Storey's q-value
On the left, random data is compared with a t-test. On the right, colon cancer and normal tissue are compared. A histogram is generated showing how many genes there are at different significance levels in the dataset.
Random data: genes appear equally often at every significance level, significant or not.
Significant data: significant genes will lead to an increased number of genes at low p-values.
Storey's q-value: the non-significant genes tell us how many genes are likely falsely discovered. In this case, 10% of the genes at a p-value of 0.05.

What is FDR: Storey's q-value
On the left, random data is compared with a t-test. On the right, Down syndrome and normal individuals are compared. A histogram is generated showing how many genes there are at different significance levels in the dataset.
Random data: genes appear equally often at every significance level, significant or not.
Significant data: significant genes will lead to an increased number of genes at low p-values.
Storey's q-value: the non-significant genes tell us how many genes are likely falsely discovered.
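The histogram argument can be turned into a small calculation. The sketch below is a simplified version of the q-value idea, under stated assumptions: the proportion of truly null genes (`pi0`) is estimated from the flat, non-significant part of the p-value histogram using a single fixed cutoff `lam = 0.5`, whereas published implementations typically tune or smooth over this cutoff. The p-values are hypothetical, and this is not Partek's code.

```python
def storey_q_values(p_values, lam=0.5):
    """Simplified q-values with a fixed-lambda estimate of the null proportion."""
    m = len(p_values)
    # Null p-values are uniform, so the count above lam estimates the null fraction.
    pi0 = min(1.0, sum(p > lam for p in p_values) / (m * (1 - lam)))

    order = sorted(range(m), key=lambda i: p_values[i])
    q = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so each q-value is the smallest
    # estimated FDR of any list that includes that gene.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        estimated_fdr = pi0 * m * p_values[i] / rank
        running_min = min(running_min, estimated_fdr)
        q[i] = running_min
    return pi0, q

# Hypothetical p-values: four strong signals plus six unremarkable genes.
p_values = [0.001, 0.002, 0.003, 0.004, 0.2, 0.4, 0.6, 0.8, 0.9, 0.95]
pi0, q_values = storey_q_values(p_values)
print(f"estimated null proportion: {pi0:.2f}")
print([round(q, 3) for q in q_values])
```

On random data the histogram is flat, `pi0` is close to 1, and every q-value stays large; a pile-up of small p-values lowers `pi0` and pulls the q-values of the top genes down.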

FDR in ChIP sequencing analysis
Split the entire genome into 100 bp windows.
Fit a distribution to the number of reads randomly occurring in each window.
Use this distribution to estimate background significance and false discovery, and determine an FDR cutoff when detecting binding events.
[Figure: ChIP-Seq FDR. Storey's q-value (1 down to 0.0000001) against detected region reads (1 to 8), with the ZTNB (background binding) fit.]

FDR and downstream analysis
FDR at the gene level can be useful when looking for biomarkers and when false positives are particularly worrisome, but when looking downstream, excessive filtering after ANOVA can be harmful.

FDR on genome
Cutoff value   # of significant p-values
FDR 0.01       2
FDR 0.05       10

FDR on chromosome 21
Cutoff value   # of significant p-values
FDR 0.01       22
FDR 0.05       53

Downstream analysis can filter out false positives

Positional enrichment of genes passing FDR 0.05
Cytoband   Enrichment score   Enrichment p-value
chr21q21   10.5922            2.51E-05
chr21q22   9.22367            9.87E-05
chr2p16    4.14997            0.0157649
chr17q22   4.0454             0.0175027

Positional enrichment of genes passing p-value 0.01
Cytoband   Enrichment score   Enrichment p-value
chr21q22   36.7939            1.05E-16
chrXq27    9.89962            5.02E-05
chr15q11   9.51423            7.38E-05
chr4q35    8.56832            0.000190032
chr21q21   7.94009            0.000356174
chr14q32   6.72957            0.00119505
chr21q11   6.02404            0.00241988

All but one of the Down syndrome critical region genes are located on chr21q22.

Model selection
So far we have used ANOVA to look for genes that are altered by a disease, but we have not answered the question: can these altered genes predict disease state?
A tutorial is available from the Tutorials page.

Partek model selection: Part 1
Choose methods to find the best genes.

Partek model selection: Part 2
Choose how to classify samples.

Partek model selection: Part 3
And detect the significance of the model using cross validation.
If I had different samples, would I find different lead genes?
If I had different samples, would I find a different best model?
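The two questions above are why gene selection has to be repeated inside every cross-validation fold rather than done once on all samples. The sketch below illustrates this with leave-one-out cross validation, a mean-difference gene ranking and a nearest-centroid classifier. All three choices, the function names and the data are assumptions made for illustration; the slides do not state which selection methods, classifiers or validation scheme were used.

```python
from statistics import mean

def top_genes(X, y, k):
    """Rank genes by absolute difference in class means; return the top k indices."""
    def score(g):
        a = [row[g] for row, label in zip(X, y) if label == 1]
        b = [row[g] for row, label in zip(X, y) if label == 0]
        return abs(mean(a) - mean(b))
    return sorted(range(len(X[0])), key=score, reverse=True)[:k]

def predict(X, y, genes, sample):
    """Nearest centroid: pick the class whose mean profile is closest."""
    distance = {}
    for label in (0, 1):
        rows = [row for row, lab in zip(X, y) if lab == label]
        centroid = [mean([r[g] for r in rows]) for g in genes]
        distance[label] = sum((sample[g] - c) ** 2 for g, c in zip(genes, centroid))
    return min(distance, key=distance.get)

def loocv_accuracy(X, y, k):
    """Leave-one-out accuracy with gene selection redone on each training set."""
    correct = 0
    for i in range(len(X)):
        train_X, train_y = X[:i] + X[i + 1:], y[:i] + y[i + 1:]
        genes = top_genes(train_X, train_y, k)  # the held-out sample is never seen
        correct += predict(train_X, train_y, genes, X[i]) == y[i]
    return correct / len(X)

# Hypothetical data: 6 samples x 3 genes; only gene 0 separates the classes.
X = [[1.0, 5.0, 3.0], [1.2, 4.8, 3.1], [0.9, 5.1, 2.9],
     [3.0, 5.0, 3.0], [3.1, 4.9, 3.2], [2.9, 5.2, 2.8]]
y = [0, 0, 0, 1, 1, 1]

accuracy = loocv_accuracy(X, y, k=1)
print(f"cross-validated accuracy: {accuracy:.2f}")
```

Selecting genes on the full dataset first and cross-validating only the classifier would let the held-out samples influence the gene list and make the accuracy estimate too optimistic.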

Quick example of colon cancer prediction
ANOVA was used to rank the genes by significance.
On the left, with the 50 genes passing an FDR of 0.02 (2 genes passed 0.01), we see decent separation between groups. While these genes are differential, we are not sure that they are diagnostic.
On the right, predictability was used instead of FDR to choose the lead set. 15 genes were chosen, and they are estimated to correctly predict tumorigenesis 86.25% of the time.

Partek GS: A Complete Solution
Any genomic assay: Microarray & Next-Gen Seq.
Any platform.
Advanced statistical analysis.
Biological interpretation.
Integrated genomics.
Competitive price.
Join us online. Get your FREE trial today! www.partek.com
FREE Data Analysis Webinars: www.partek.com/webinars