Introduction to data analysis: Supervised analysis



Similar documents
Statistical issues in the analysis of microarray data

False Discovery Rates

Gene Expression Analysis

Time series experiments

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Model Selection. Introduction. Model Selection

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

Comparing Functional Data Analysis Approach and Nonparametric Mixed-Effects Modeling Approach for Longitudinal Data Analysis

Validation and Replication

MIC - Detecting Novel Associations in Large Data Sets. by Nico Güttler, Andreas Ströhlein and Matt Huska

Integrating DNA Motif Discovery and Genome-Wide Expression Analysis. Erin M. Conlon

Prediction Analysis of Microarrays in Excel

Gene expression analysis. Ulf Leser and Karin Zimmermann

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Statistical Analysis. NBAF-B Metabolomics Masterclass. Mark Viant

Two-Way ANOVA tests. I. Definition and Applications...2. II. Two-Way ANOVA prerequisites...2. III. How to use the Two-Way ANOVA tool?...

Cancer Biostatistics Workshop Science of Doing Science - Biostatistics

A direct approach to false discovery rates

RNA-seq. Quantification and Differential Expression. Genomics: Lecture #12

Expression Quantification (I)

Analysis Issues II. Mary Foulkes, PhD Johns Hopkins University

SAM Significance Analysis of Microarrays Users guide and technical document

Introduction to Hypothesis Testing. Hypothesis Testing. Step 1: State the Hypotheses

Knowledge Discovery and Data Mining

Package GSA. R topics documented: February 19, 2015

HYPOTHESIS TESTING: POWER OF THE TEST

NOVEL GENOME-SCALE CORRELATION BETWEEN DNA REPLICATION AND RNA TRANSCRIPTION DURING THE CELL CYCLE IN YEAST IS PREDICTED BY DATA-DRIVEN MODELS

Tutorial for proteome data analysis using the Perseus software platform

US News Rankings (2015) America s Best Graduate Schools Published March 2014

Identification of Differentially Expressed Genes with Artificial Components the acde Package

ANOVA ANOVA. Two-Way ANOVA. One-Way ANOVA. When to use ANOVA ANOVA. Analysis of Variance. Chapter 16. A procedure for comparing more than two groups

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Minería de Datos ANALISIS DE UN SET DE DATOS.! Visualization Techniques! Combined Graph! Charts and Pies! Search for specific functions

How to report the percentage of explained common variance in exploratory factor analysis

A PARADIGM FOR DEVELOPING BETTER MEASURES OF MARKETING CONSTRUCTS

Data Mining in CRM & Direct Marketing. Jun Du The University of Western Ontario jdu43@uwo.ca

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Open Access Research on Application of Neural Network in Computer Network Security Evaluation. Shujuan Jin *

Package empiricalfdr.deseq2

Quantitative proteomics background

Module 12. Software Project Monitoring and Control. Version 2 CSE IIT, Kharagpur

Exercise with Gene Ontology - Cytoscape - BiNGO

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Three Methods for ediscovery Document Prioritization:

Charge Card Administration. Accessing the Travel Card Cardholder Training in the Learning Management System (LMS)

Exploratory data analysis for microarray data

Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach

Practical Differential Gene Expression. Introduction

Frequently Asked Questions (FAQ)

Epigenetic variation and complex disease risk

II. DISTRIBUTIONS distribution normal distribution. standard scores

Identify & Engage: Analyzing your Donor Data to Discover and Prioritize Major Gift. Prospects. Oregon Nonprofit Leaders Conference, April 2014

Survey Research: Choice of Instrument, Sample. Lynda Burton, ScD Johns Hopkins University

Tutorial 5: Hypothesis Testing

Basic Analysis of Microarray Data

Descriptive Statistics

Package ERP. December 14, 2015

Package dunn.test. January 6, 2016

Gene Enrichment Analysis

Section 14 Simple Linear Regression: Introduction to Least Squares Regression

RT 2 Profiler PCR Array: Web-Based Data Analysis Tutorial

How To Cluster

Chapter 8 Hypothesis Testing Chapter 8 Hypothesis Testing 8-1 Overview 8-2 Basics of Hypothesis Testing

Review & AI Lessons learned while using Artificial Intelligence April 2013

College Readiness LINKING STUDY

Using MS Excel to Analyze Data: A Tutorial

SAMPLING & INFERENTIAL STATISTICS. Sampling is necessary to make inferences about a population.

FACULTY OF MEDICAL SCIENCE

Is log ratio a good value for measuring return in stock investments

How To Identify Noisy Variables In A Cluster

This module introduces the use of informal career assessments and provides the student with an opportunity to use an interest and skills checklist.

A Streamlined Workflow for Untargeted Metabolomics

HLA data analysis in anthropology: basic theory and practice

diagnosis through Random

National Cancer Institute

CHAPTER 14 NONPARAMETRIC TESTS

A Property & Casualty Insurance Predictive Modeling Process in SAS

Course on Functional Analysis. ::: Gene Set Enrichment Analysis - GSEA -

Application of Analytical Hierarchy Process (AHP) in productivity of costs of Quality

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Quality Assessment of Exon and Gene Arrays

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Investigating the genetic basis for intelligence

Transcription:

Introduction to data analysis: Supervised analysis Introduction to Microarray Technology course May 2011 Solveig Mjelstad Olafsrud solveig@microarray.no Most slides adapted/borrowed from presentations by Kjell Petersen & Anne Kristin Stavrum (NMC, Bergen)

Overview What is differential expression? Scoring a gene for differential expression T test SAM Rank Product What does the result of the analysis look like? Measure of significance We re producing lists Cut offs and prioritizations

What do they look like?

Supervised analysis Goal: To find differentially expressed genes between two groups of samples. Group: Different treatement, timepoint, phenotype etc. What is differential expression? Measurements before (1.5, 0.8, 1.2) and after (2.1, 1.7, 1.5) treatment We are looking for systematic differences in expression levels Are the sub groups significantly different? We need a model to help us decide! Gjennomsnitt kontrollgruppe Gjennomsnitt behandlet gruppe

Differetial expression Basic strategy: Find mean expression level for each gene within each of the two groups calculate fold change Find the variance of each gene within each of the two groups Use a statistical test for to two classes to decide whether expression is different in the two groups Simplistic forumula:

T test Assumes normal distribution of data Assumes student t distribution of t scores Problem: Only valid with univariate tests We are testing thousands of genes at the same time multipple testing Probably not student t distribution of scores

Significance analysis of Microarrays (SAM) Assumes normal distribution of data Makes no assumption about the distribution of the scores Advantages: Eliminate occurance of accidentally large t values due to accidantally small within group variance Effectively introduces a fold change criterion Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.

Rank product Does no assumption on distribution of data No calculation of variance across array Scores genes according to their rank in multiple comparisons Breitling, R. et al., Rank products: a simple, yet powerful, new method to Detect differentially regulated genes in replicated microarray experiments. FEBS Letters 2004. 573(1-3): p.83-92.

Significance of scores The p value is defined as the probability of a gene obtaining the score by chance Assuming only one gene has been tested Does not take into accont that when multiple genes are tested, the probability of randomly obtaining a high score for any of the tested genes increases The p value should therefore be corrected for multiple testing E.g. Bonferroni correction (very strict) Many other methods to correct for multiple testing, but it is hard to select the right one case by case Common to work with significance of lists instead of single genes

What does the results look like? 1.995 4.117 360415 5.333 1.443 4.14 359725 5.495 2.47 4.167 304092 6.045 2.56 4.186 rcg26536 6.253 1.673 4.196 rcg22278 6.476 1.515 4.2 307947 4.724 4.343 315095 4.318 1.788 4.411 rcg48508 5.66 3.182 4.525 313504 6.476 2.168 4.552 56227 0 0 1.421 4.655 rcg34061 q verdi FDR Fold Change Score Gen_ID

False discovery rate (FDR) FDR refers to the number of genes on a ranked gene list that is expected to be false positives If the p value of gene #200 in a ranked list is 0.001 and we have analysed 20 000 genes, Expect 20 000 * 0.001 = 20 genes to be false positives FDR = False positives /Rank *100% = 20/200*100% = 10 %

FDR value and q value 1.995 4.117 360415 5.333 1.443 4.14 359725 5.495 2.47 4.167 304092 6.045 2.56 4.186 rcg26536 6.253 1.673 4.196 rcg22278 6.476 1.515 4.2 307947 4.724 4.343 315095 4.318 1.788 4.411 rcg48508 5.66 3.182 4.525 313504 6.476 2.168 4.552 56227 0 0 1.421 4.655 rcg34061 q verdi FDR Fold Change Score Gen_ID

Where do we cut? Commonly used strategies: Sufficiently large fold change Suitably samll p value or q value Any strategy results in a random cut off There is no perfect cut off where every gene above the cut off is truly differentially expressed, while every gene below the cut off is not differentially expressed

Don t use absolute cut offs Use q values (alternatively FDR or corrected p values) To guide your work: FDR estimates is acceptable because we want to screen and search for biological knowledge, looking for emerging pictures/treds Never use statistics alone, only as input together with your interpretations of the data/teh biological picture you see Do you believe it? Enought to do follow up expreiments? We re working on lists, don t chop off of the top and forget the rest. Distribution of related genes in the whole list

Questions?