I Just Received My Microarray Data, Now What? Danny Park MGH-PGA (ParaBioSys) Sat April 24, 2004

I Just Received My Microarray Data. Where did this come from? A description of the process from RNA to raw data. What do I do now? A description of how to analyze your data.

Demystifying the Core Facility: Researcher -> RNA (precious) -> Microarray Core Facility (black magic) -> strange numbers and pictures

Demystifying the Core Facility: Researcher -> RNA -> QC, RT & label -> labeled cDNA -> hybridize -> slides -> scan, segment -> images & data files -> upload -> analysis DB

RT & Labeling: approximately 20 µg total RNA required. Reference sample and test sample mRNA (poly-A tail, AAAAAAAA) are reverse transcribed from an oligo-dT primer (TTTT) with aminoallyl-dUTP, dATP, dCTP, dGTP, and reverse transcriptase. RNA is then removed by RNase treatment or NaOH hydrolysis, leaving amine-modified (NH2) cDNA. The cDNA is coupled to N-hydroxysuccinimide-activated fluorescent dye (Cy3 or Cy5), giving reference labeled cDNA and test labeled cDNA.

Hybridization: synthesized oligonucleotides in 384-well plates -> robotic printing -> microarray. Reference labeled cDNA + test labeled cDNA -> combine -> hybridization on the Genomic Solutions Hybridization Station (PerkinElmer).

Scanning: hybridized microarray; laser 1 and laser 2 excitation; emission; Axon Instruments GenePix 4000B; monochrome pictures combined.

Scanning: 20 µg total RNA, macrophage RAW. Cy5: 100 ng/ml LPS, 2 h. Cy3: no treatment.

Segmentation: scanned image -> segmentation software -> numerical data

Segmentation

Core Facility Demystified! Researcher -> RNA -> QC, RT & label -> labeled cDNA -> hybridize -> slides -> scan, segment -> images & data files -> upload -> analysis DB

Core Facility Demystified? Researcher -> RNA -> QC, RT & label -> labeled cDNA -> hybridize -> slides -> scan, segment (?) -> images & data files -> upload -> analysis DB

What Do I Do Now? (data analysis) What was I asking? Remember your experimental design. How do I analyze the data? Learn some typical filters, transformations, and statistics; learn the necessary software tools; consult a biostatistician.

What Was I Asking? Typically: which genes changed expression patterns when I did ___? Common ___s: binary conditions (knock out, treatment, etc.); unordered discrete scales (multiple types of treatment or mutations); continuous scales (time courses, levels of treatment, etc.). My focus: binary conditions (aka "diagnostic experiments").

Diagnostic Experiments: two-sample comparison with N replicates. KO vs. WT; treated vs. untreated; diseased vs. normal; etc. Question of interest: which genes or groups are (most) differentially expressed?

Software Tools? BASE (BioArray Software Environment): data storage and distribution; simple filtering, normalization, averaging, and statistics; export/download results to other tools. Other tools: R, Bioconductor (complex statistics); MS Excel (general); TIGR Multi Experiment Viewer (clustering); GenMAPP (ontologies).

Analyzing a Diagnostic Experiment: filter out bad spots; adjust low intensities; normalize (correct for non-linearities and dye inconsistencies); calculate average ratios and significance values per gene; rank, sort, filter, squint, sift data; validate results.

Filtering bad spots Why?

Filtering bad spots

Filtering bad spots Filter
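
The slides do not say which criteria the filter uses, so the following is only a minimal sketch under assumed rules: drop any spot that was flagged during segmentation, and drop spots where neither channel rises clearly above local background. The `Spot` fields, the `keep_spot` helper, and the `min_ratio` value of 2.0 are hypothetical names and numbers chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Spot:
    gene: str
    test: float        # Cy5 (test) foreground intensity
    control: float     # Cy3 (reference) foreground intensity
    background: float  # local background (assumed shared by both channels)
    flagged: bool      # True if the spot was marked bad during segmentation


def keep_spot(spot, min_ratio=2.0):
    """Keep unflagged spots where at least one channel clears the background."""
    if spot.flagged:
        return False
    floor = min_ratio * spot.background  # assumed signal-to-background cutoff
    return spot.test >= floor or spot.control >= floor


spots = [
    Spot("gene_a", 1200.0, 900.0, 100.0, False),  # clean, bright spot
    Spot("gene_b", 150.0, 120.0, 100.0, False),   # both channels near background
    Spot("gene_c", 5000.0, 4000.0, 100.0, True),  # bright but flagged
]
good = [s for s in spots if keep_spot(s)]
print([s.gene for s in good])
```

Only `gene_a` survives here. A real filter would be tuned to the scanner, the segmentation software's flag conventions, and the slide quality.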

Adjusting low intensities: Why? (Plot: T vs. C.)

Adjusting low intensities: Why? (Plot: T vs. C.) LOWESS normalization works on log(T), log(C). Poof! Data is gone!

Adjusting low intensities (plots: T vs. C)

Adjusting low intensities: intensity limit
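
One way to read the "intensity limit" idea is as a floor: any intensity below the limit is raised to the limit, so the logarithm is always defined and spots with weak signal in both channels collapse toward a ratio of 1 instead of producing wild ratios. This is a sketch of that reading, not necessarily what the core's software does; `adjust_low`, `log_ratio`, and the limit of 100 are assumptions.

```python
import math


def adjust_low(intensity, limit=100.0):
    """Raise any intensity below the (assumed) limit up to the limit."""
    return max(intensity, limit)


def log_ratio(test, control, limit=100.0):
    """log2(T/C) after flooring both channels at the limit."""
    return math.log2(adjust_low(test, limit) / adjust_low(control, limit))


# A negative background-subtracted value would otherwise have no logarithm.
print(log_ratio(-5.0, 400.0))
# Two weak channels both hit the floor, so the spot reports no change.
print(log_ratio(50.0, 20.0))
```

The trade-off is that fold changes of genes near the limit are understated, which is usually preferable to losing the spot or reporting a ratio driven by noise.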

Normalization: Why? The data are not perfectly centered around zero. Does that imply that nearly all genes are down-regulated? More likely, there are dye effects.

Normalization: Why? Regional variations. Up-regulated (red) and down-regulated (green) genes should be randomly distributed across the slide (but they're not). Green corner!

Normalization LOWESS

Normalization Thoughts: There are many different ways to normalize data: global median, LOWESS, LOESS, etc.; by print tip, spatial, etc. Choose one wisely. BUT: don't expect it to fix bad data! It won't make up for a lack of replicates. It won't make up for horrible slides.
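
Of the methods named above, global median normalization is the simplest to show in a few lines, so it is used here as a stand-in. It is not LOWESS: it only shifts all log ratios by a constant and cannot correct intensity-dependent or regional effects. The function name and the example values are made up for illustration.

```python
import statistics


def median_normalize(log_ratios):
    """Shift log2 ratios so that their median is zero (global median method)."""
    center = statistics.median(log_ratios)
    return [m - center for m in log_ratios]


# Log ratios skewed toward negative values, as if one dye were systematically dimmer.
raw = [-1.5, -0.5, -0.4, 0.1, 0.6]
normalized = median_normalize(raw)
print(statistics.median(raw), statistics.median(normalized))
```

The underlying assumption is that most genes do not change, so the typical log ratio should be zero. If that does not hold for an experiment, this method (and LOWESS as well) can remove real signal.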

Average Fold Ratios: Why? (Plot: Fibroblast growth factor 9, 20 replicates, T vs. C.) You don't care about spots up- or down-regulated. You care about genes up- or down-regulated. Data is highly variable, so do a lot of replicates.

Average Fold Ratios: per-spot ratios -> fold ratio average -> per-gene ratios
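
A small sketch of going from per-spot ratios to per-gene ratios. It assumes the averaging is done on log2 ratios (a geometric mean), which the slides do not state; the gene names and values are invented, and all ratios must be positive.

```python
import math
from collections import defaultdict


def per_gene_fold(spot_ratios):
    """Average log2 ratios per gene, then convert back to a fold ratio."""
    logs = defaultdict(list)
    for gene, ratio in spot_ratios:
        logs[gene].append(math.log2(ratio))
    return {gene: 2 ** (sum(v) / len(v)) for gene, v in logs.items()}


spot_ratios = [
    ("gene_a", 0.5), ("gene_a", 2.0),    # one replicate down 2x, one up 2x
    ("gene_b", 0.25), ("gene_b", 0.25),  # consistently down 4x
]
folds = per_gene_fold(spot_ratios)
print(folds)
```

Averaging in log space treats a 2-fold increase and a 2-fold decrease symmetrically, so `gene_a` averages to 1.0 (no change). A plain arithmetic mean of 0.5 and 2.0 would give 1.25 and suggest up-regulation.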

Statistical Significance: Why? Which gene is more likely to be down-regulated? Fibroblast growth factor 9, ratio: 0.6. ETS-related transcription factor, ratio: 0.6.

Statistical Significance: Why? (Plots, T vs. C: Fibroblast growth factor 9, 20 replicates; ETS-related transcription factor.)

Statistical Significance: Why? (Plots, T vs. C: Fibroblast growth factor 9, 20 replicates; ETS-related transcription factor.) Same average fold ratio, but the gene on the right has almost as many replicates going up as it does down!

Statistical Significance: Why? (Plots, T vs. C: Fibroblast growth factor 9, 20 replicates; ETS-related transcription factor.) Same average fold ratio, but different P-values! Fibroblast growth factor 9, t-test P-value: 0.0012%. ETS-related transcription factor, t-test P-value: 13%.

Statistical Significance: variability data -> t-test -> P-values
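
To make the "same average, different P-values" point concrete, here is a sketch of a one-sample t-test on per-replicate log2 ratios against zero (no change). The slides do not say which t-test variant was used, so that choice is an assumption, and the replicate values are invented. The P-value is obtained by numerically integrating the t density so the example needs only the standard library; in practice a statistics package such as SciPy (`scipy.stats.ttest_1samp`) would normally be used.

```python
import math
import statistics


def t_statistic(log_ratios):
    """One-sample t statistic for H0: mean log ratio is 0 (needs nonzero spread)."""
    sem = statistics.stdev(log_ratios) / math.sqrt(len(log_ratios))
    return statistics.mean(log_ratios) / sem


def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_coef = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_coef - (df + 1) / 2 * math.log1p(x * x / df))


def two_sided_p(t, df, steps=2000):
    """Approximate two-sided P-value: 1 - 2 * integral of the density over [0, |t|]."""
    upper = abs(t)
    if upper == 0.0:
        return 1.0
    h = upper / steps  # Simpson's rule; steps must be even
    total = t_pdf(0.0, df) + t_pdf(upper, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return max(0.0, 1.0 - 2.0 * total * h / 3)


consistent = [-0.80, -0.70, -0.75, -0.72, -0.78]  # replicates agree
noisy = [-2.00, 0.50, -1.50, 0.60, -1.35]         # same mean, replicates disagree
p_consistent = two_sided_p(t_statistic(consistent), len(consistent) - 1)
p_noisy = two_sided_p(t_statistic(noisy), len(noisy) - 1)
print(p_consistent, p_noisy)
```

Both lists have the same mean log ratio, but only the consistent one yields a small P-value. With thousands of genes tested at once, some small P-values will occur by chance, which is one reason the results still need validation.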

Statistical Significance Thoughts: There are many different statistical significance metrics: t-test (P-values), SAM (T values), Wilcoxon rank-sum test, ANOVA (F-statistics), and many more. Choose one (or more!) wisely. BUT: don't let it make decisions for you! There will always be false positive and false negative hits. Ultimately, biological significance matters.

Statistical Significance #'s: How Should We Use Them? To sort and rank data. To reduce a data set of 1000s of genes to 10s or 100s. With annotations and biological insight. As a guide in selecting which genes to validate more precisely (qPCR).
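
As a rough illustration of using the numbers to sort and shrink a gene list, the sketch below keeps genes that pass both a P-value cutoff and a minimum fold change, then ranks them by P-value. The cutoffs (`max_p`, `min_abs_log2`), the `shortlist` helper, and the gene values are all hypothetical; sensible thresholds depend on the experiment.

```python
def shortlist(results, max_p=0.01, min_abs_log2=1.0):
    """Filter (gene, log2_fold, p_value) tuples, then rank by P-value."""
    hits = [r for r in results
            if r[2] <= max_p and abs(r[1]) >= min_abs_log2]
    return sorted(hits, key=lambda r: r[2])


results = [
    ("gene_a", -1.8, 0.00002),  # strong, consistent change
    ("gene_b", -0.7, 0.13),     # moderate change, not significant
    ("gene_c", 2.1, 0.004),     # strong change, significant
    ("gene_d", 0.2, 0.001),     # significant but tiny change
]
candidates = shortlist(results)
print([gene for gene, _, _ in candidates])
```

The output is a list of candidates to inspect alongside annotations and to validate (for example by qPCR), not a final answer.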

Analysis Pipeline: Summary. Filter out bad spots; adjust low intensities; normalize (correct for non-linearities and dye inconsistencies); calculate average ratios and significance values per gene; rank, sort, filter, squint, sift data; validate results.

Analysis Pipeline: Summary. Filter out bad spots; adjust low intensities; normalize (correct for non-linearities and dye inconsistencies); calculate average ratios and significance values per gene; rank, sort, filter, squint, sift data; validate results. There's no one way to do it, just many variations on a theme!

The Biologist's Creed (adapted from the US Marine Corps): This is my microarray analysis pipeline. There are many like it, but this one is mine. It is my life. I must master it as I must master my life. Without me, my pipeline is useless. Without my pipeline, I am useless. My pipeline and I know that what counts in research is not the P-values we choose, the normalization parameters we pick, or the pretty plots we generate. We know that it is the validations that count. We will validate.

The Biologist's Creed (adapted from the US Marine Corps): My pipeline is human, even as I am human, because it is my life. Thus, I will learn it as a brother. I will learn its weaknesses, its strengths, its parts, its statistical assumptions, its usage, its variations, and their effects on my conclusions. I will keep my pipeline clean and ready, even as I am clean and ready. We will become part of each other.

The End! What Have We Covered? The path from RNA samples to numeric data; typical steps & concerns in data scrubbing; typical analysis of diagnostic experiments.

The End! What Have We Not Covered? Different flavors of filters, normalizations, and statistical significance metrics (10:45a); analysis of time course & multiple treatment experiments (1:00p); clustering, visualization methods (1:00p); step-by-step tutorial of software (1:45p).

Acknowledgements. MGH Lipid Metabolism Unit: Mason Freeman, Harry Björkbacka. MGH Molecular Biology Bioinformatics Group: Chuck Cooper, Xiaowei Wang. Harvard School of Public Health Biostatistics: Xiaoman Li. MGH Microarray Core: Glenn Short, Jocelyn Burke, Najib El Messadi, Jason Frietas, Zhiyong Ren. BU BioMolecular Engineering Research Center: Temple Smith, Gabriel Eichler, Sean Quinlan, Prashanth Vishwanath.