Analysis of gene expression data. Ulf Leser and Philippe Thomas

Similar documents
Measuring gene expression (Microarrays) Ulf Leser

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

How many of you have checked out the web site on protein-dna interactions?

Basic Analysis of Microarray Data

Microarray Data Analysis. A step by step analysis using BRB-Array Tools

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Introduction To Real Time Quantitative PCR (qpcr)

Microarray Technology

Gene expression analysis. Ulf Leser and Karin Zimmermann

Correlation of microarray and quantitative real-time PCR results. Elisa Wurmbach Mount Sinai School of Medicine New York

Row Quantile Normalisation of Microarrays

Software and Methods for the Analysis of Affymetrix GeneChip Data. Rafael A Irizarry Department of Biostatistics Johns Hopkins University

Quality Assessment of Exon and Gene Arrays

Recombinant DNA and Biotechnology

Lecture 11 Data storage and LIMS solutions. Stéphane LE CROM

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Consistent Assay Performance Across Universal Arrays and Scanners

Gene Expression Analysis

Core Facility Genomics

REAL TIME PCR USING SYBR GREEN

RT 2 Profiler PCR Array: Web-Based Data Analysis Tutorial

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Comparative genomic hybridization Because arrays are more than just a tool for expression analysis

Next Generation Sequencing: Technology, Mapping, and Analysis

Biotechnology: DNA Technology & Genomics

Introduction to Pattern Recognition

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Gene Mapping Techniques

Biotechnology and Recombinant DNA (Chapter 9) Lecture Materials for Amy Warenda Czura, Ph.D. Suffolk County Community College

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology

Essentials of Real Time PCR. About Sequence Detection Chemistries

ncounter Leukemia Fusion Gene Expression Assay Molecules That Count Product Highlights ncounter Leukemia Fusion Gene Expression Assay Overview

New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.

Materials and Methods. Blocking of Globin Reverse Transcription to Enhance Human Whole Blood Gene Expression Profiling

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Analyzing microrna Data and Integrating mirna with Gene Expression Data in Partek Genomics Suite 6.6

Activity 4 Long-Term Effects of Drug Addiction

Single Nucleotide Polymorphisms (SNPs)

SICKLE CELL ANEMIA & THE HEMOGLOBIN GENE TEACHER S GUIDE

Microarray Data Mining: Puce a ADN

Gene Expression Assays

CCR Biology - Chapter 9 Practice Test - Summer 2012

Predictive Gene Signature Selection for Adjuvant Chemotherapy in Non-Small Cell Lung Cancer Patients

Technical Note. Roche Applied Science. No. LC 18/2004. Assay Formats for Use in Real-Time PCR

Long-Term Effects of Drug Addiction

DNA Fingerprinting. Unless they are identical twins, individuals have unique DNA

MICROARRAY DATA ANALYSIS TOOL USING JAVA AND R

ALLEN Mouse Brain Atlas

Exiqon Array Software Manual. Quick guide to data extraction from mircury LNA microrna Arrays

Real-time PCR: Understanding C t

OriGene Technologies, Inc. MicroRNA analysis: Detection, Perturbation, and Target Validation

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

FACULTY OF MEDICAL SCIENCE

IKDT Laboratory. IKDT as Service Lab (CRO) for Molecular Diagnostics

Thermo Scientific DyNAmo cdna Synthesis Kit for qrt-pcr Technical Manual

Forensic DNA Testing Terminology

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

A Primer of Genome Science THIRD

SNP Essentials The same SNP story

Biology Behind the Crime Scene Week 4: Lab #4 Genetics Exercise (Meiosis) and RFLP Analysis of DNA

Factors for success in big data science

PreciseTM Whitepaper

QPCR Applications using Stratagene s Mx Real-Time PCR Platform

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Antibody Function & Structure

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Quantitative proteomics background

Biological Sciences Initiative. Human Genome

Name Date Period. 2. When a molecule of double-stranded DNA undergoes replication, it results in

Genetics Lecture Notes Lectures 1 2

Genetics Module B, Anchor 3

PRACTICAL APPROACH TO ECOTOXICOGENOMICS

Single-Cell DNA Sequencing with the C 1. Single-Cell Auto Prep System. Reveal hidden populations and genetic diversity within complex samples

Scottish Qualifications Authority

Lecture 2: Descriptive Statistics and Exploratory Data Analysis

Translation Study Guide

Description: Molecular Biology Services and DNA Sequencing

Final Project Report

GenBank, Entrez, & FASTA

RNA & Protein Synthesis

Protein Protein Interaction Networks

Real-Time PCR Vs. Traditional PCR

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

Chapter 5: Organization and Expression of Immunoglobulin Genes

Analysis of Illumina Gene Expression Microarray Data

HiPer RT-PCR Teaching Kit

Recombinant DNA & Genetic Engineering. Tools for Genetic Manipulation

Identification of rheumatoid arthritis and osteoarthritis patients by transcriptome-based rule set generation

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Single-Cell Whole Genome Sequencing on the C1 System: a Performance Evaluation

The Scientific Data Mining Process

Transcription:

Analysis of gene expression data Ulf Leser and Philippe Thomas

This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser: Bioinformatics, Wintersemester 2010/2011 2

Protein synthesis Gene expression has 2 phases: Transcription (DNA -> mrna) (40 nt /second) Translation (mrna -> protein) (40 aa/second) Wikipedia AG:Proteomics Algorithms and Simulations Tübingen Ulf Leser: Bioinformatics, Wintersemester 2010/2011 3

mrna quantification Reporter Gene (GFP) Northern blotting Real time PCR Microarray High throughput (multiple genes with one experiment) http://www.agilent.com/about/newsroom/ lsca/imagelibrary/images/lsca_160_array_slide.jpg http://www.dddmag.com/uploadedimages/ rticles/2007_10/dd7odisnp5.jpg Ulf Leser: Bioinformatics, Wintersemester 2010/2011 4

This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser: Bioinformatics, Wintersemester 2010/2011 5

Microarray Experiment I Comparative Between experiment not between gene Measure mrna expression For all genes At a specific time point One sample Assumes that mrna expression correlates with protein synthesis Not entirely true Trend towards next generation sequencing But: Microarrays are an important and prominent technology Ulf Leser: Bioinformatics, Wintersemester 2010/2011 6

Microarray Experiment II Find differentially expressed genes between groups Healthy vs. sick Tissue /Cell types Development state Embryo, Child, Adolescent, Adult, Cell development Environment Heat shock, Nutrition, Therapy Disease subtypes ALL vs. AML 40% chemotherapy resistant in colon cancer Co-regulation of genes Similar gene-pattern -> similar function? Similar gene-pattern -> similar regulation? Wikipedia Ulf Leser: Bioinformatics, Wintersemester 2010/2011 7

RNA-Hybridisation Hybridization is the process of unifying two complementary singlestranded chains of DNA or RNA two one doublestranded molecule. http://encyclopedia2.thefreedictionary.com/micro+array Ulf Leser: Bioinformatics, Wintersemester 2010/2011 8

Application flow Probe / Target Sample Sample preparation and labeling/color RNA-Hybridisation Chip washing Scanning Sample-cDNA hybridizes with Probe-cDNA (overnight) Remove non-hybridised Sample-cDNA Scan array with a laser beam TIFF Picture Image recognition Detection of light intensity and image segmentation Raw data Analysis Normalisation and data analysis Ulf Leser: Bioinformatics, Wintersemester 2010/2011 9

Probe preparation Replicate cdna library (PCR) Spot on array Each spot represents a transcript/gene (idealized) Array-Layout: Redundancy, controls, maximum number of spots, S1 S2 S3 S4 S5 S6 Z1 G11 G11 G21............ Z2 G12.................. Z3 G13.................. Z4 G14.................. Z5 G15.................. Z6 G16.......................................... Ulf Leser: Bioinformatics, Wintersemester 2010/2011 10

Sample preparation Isolate cells in condition X RNA extraction Synthesize to cdna cdna labeling with colorized nucleotides Labeled nucleotides Biotinylated Oligo-dT RNA-Hybridisation Ulf Leser: Bioinformatics, Wintersemester 2010/2011 11

Differences in Technology Spotted Array Spotted on a microscope slide Spacing between two spots is ~120um 5ng of DNA/spot needed Many labs have required equipment Customization of probes (selection according to user need) Fewer Probes/Genes represented Probes are longer Ulf Leser: Bioinformatics, Wintersemester 2010/2011 12

Differences in Technology Two Color Array I Two samples with one array Two different colors Cy3 / Cy5 Laser uses two different wave length Usually spotted Cross hybridisation between the two samples possible Sample A (Red) Channel Sample B (Green) Channel Ulf Leser: Bioinformatics, Wintersemester 2010/2011 13

Differences in Technology Two Color Array II Sample A: red Sample B: green Ratio Red/Green Black: No signal in both samples Red: High expression in A Green: High expression in B Yellow: Equal expression Ulf Leser: Bioinformatics, Wintersemester 2010/2011 14

Differences in Technology Oligo Microarray Robustness between one array type Often one sample/chip Twice as many chips as two-color No cross hybridization between samples Short Probes (25nt 80nt) 11-20 Probes is one probeset represent one gene transcript scattered over array Up to 4 Mio probes / Chip Perfect match and mismatch probes Affymetrix Ulf Leser: Bioinformatics, Wintersemester 2010/2011 15

Differences in Technology Oligo Microarray Calculate photo mask Light exposure: Exposed array-cells can bind to X Enrich Chip with base X {A,C,G,T} Wash unbound bases http://www.koreanbio.org/biocourse/images/e/e0/photolithography.gif Ulf Leser: Bioinformatics, Wintersemester 2010/2011 16

Differences in Technology Oligo Microarray Good quality control No customization of probes Selection of good oligos is difficult Probe self-hybridization Probe should be unique for a transcript Minimize the number of light expose cycles (reduce cost) Ulf Leser: Bioinformatics, Wintersemester 2010/2011 17

Comparison Oligo Chips Densely packed Companies sell kits for every use case Robust reproducible results Easy to handle in tools like R (probe annotation) No customization of probes Not available for all species cdna Arrays cdna covers a longer mrna fragment Good customization Very time intensive Less spots / genes limited redundancy Error prone workflow Results difficult to compare Only available for species with EST (cdna transcripts) Ulf Leser: Bioinformatics, Wintersemester 2010/2011 18

Replicates In order to exclude technical or biological bias, replicated measurements are exploited: Technical Replicates: Same sample hybridized against several arrays Statistical estimation of systematic effects Biological Replicates: Different sample sources are used They allow to estimate biological noise and reduce the randomness of the measurement. Ulf Leser: Bioinformatics, Wintersemester 2010/2011 19

Advances in technology Gene-Expression profiling Usually referred to as Microarray or GeneChip Measure expression level of all genes Exon array Each exon of a gene is measured individually SNP array Identifying single nucleotide polymorphism among alleles within or between populations. ChIP-on-chip Detects DNA fragments specifically bound to a protein (e.g. transcription factor) Ulf Leser: Bioinformatics, Wintersemester 2010/2011 20

Advances in technology Ulf Leser: Bioinformatics, Wintersemester 2010/2011 21

Challenges Patient data has a high variance Different genetic background Mixture of cells from different tissues Cells are in different stages (cell cycle; cell development) Gene representation Very low mrna level Gene active during a short life time (embryonic stage, M-stage) Genes not represented on Microarray Annotation of a genome evolves over time; oligo-array is constant Environment has influence on hybridization quality Noise: Technical replicates never produce the same data Ulf Leser: Bioinformatics, Wintersemester 2010/2011 22

Challenges Transient data Select appropriate time point Signaling might be very fast for some processes Intermediate steps are lost Cause end effect Tumors have high cell proliferation Biological interpretation difficult High number of transcripts Multiple test correction Choice of statistical test Time series results in day/night work and might result in a completely lost data set Ulf Leser: Bioinformatics, Wintersemester 2010/2011 23

This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser: Bioinformatics, Wintersemester 2010/2011 24

Data visualization, quality control Boxplot Estimate the homogeneity of data Ulf Leser: Bioinformatics, Wintersemester 2010/2011 25

Data visualization, quality control Scatter plot Each point represents one transcript in two experimental settings Ulf Leser: Bioinformatics, Wintersemester 2010/2011 26

MA-Plot I Fold Change or m-value Fold change is the log2-ratio between two values Log-values are symmetric Visual interpretation (Difference between 4 to 16 vs. 0.25 to 0.0625) Mathematical issues For example: Value FC( Value1/ Value = 2) log 2 Value 1 2 512 FC(512 /1024) = log2 = 1 1024 123 FC(123/123) = log2 = 0 123 FC(512 / 256) = log 2 512 256 = + 1 512 FC(512 /1024) = = 0.5 1024 123 FC(123/123) = = 1 123 FC(512 / 256) = 512 256 = 2 Ulf Leser: Bioinformatics, Wintersemester 2010/2011 27

MA-Plot II A-Value is the logged intensity mean value 1 A = 2 1 + 2 Value 2 ( log ( Value ) log ( )) Note that this scatter plot is a 45 rotated version with subsequent scaling of the normal scatter plot. 2 Ulf Leser: Bioinformatics, Wintersemester 2010/2011 28

This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser: Bioinformatics, Wintersemester 2010/2011 29

Normalization Microarrays are comparative experiments Distinguish between biological from technical variation Measurements between two experiments are not directly comparable Minimize the influence systematic errors on the experiment Sample preparation Different quantities of RNA Probe affinity Fluorescence detection non-linear self fluorescence of microarray surface Experimentator variability No method to handle bad RNA quality Ulf Leser: Bioinformatics, Wintersemester 2010/2011 30

Normalization G and C bindings have a higher energy than A and T bindings. Furthermore, the binding specificity depends on the nucleotide positions in the probe. mathworks.com (based on Naef and Magnasco, Phys. Rev., 2004) Ulf Leser: Bioinformatics, Wintersemester 2010/2011 31

Normalization Global scaling mrna in a sample Assumption: cells contain same proportion of RNA Measure total mrna Divide intensities by this value Reference gene Assumption: These genes are similar expressed across tissue Selection of Housekeeping genes Divide intensities by this value Ulf Leser: Bioinformatics, Wintersemester 2010/2011 32

Non linear methods Lowess I Dye-bias in two color array Green channel appears consistently brighter then red channel Intensity based Fit simple models to localized subsets Needs no global function of any form to fit a model to the data It requires large, densely sampled data sets in order to produce good models Ulf Leser: Bioinformatics, Wintersemester 2010/2011 33

Non linear methods Lowess II Ulf Leser: Bioinformatics, Wintersemester 2010/2011 34

Quantile normalization I Impose same empirical distribution of entities to each array Each hybridization is thus the transformation of an underlying common distribution Usually outperforms linear methods Sophisticated methods like RMA use quantile normalization Ulf Leser: Bioinformatics, Wintersemester 2010/2011 35

Quantile normalization II 1. Given a matrix X where p x n where each array is a column and each transcript is a row 2. Sort each column of X separately to give X sort 3. Take the mean, across rows, of X sort and create X sort 4. Get X n by rearranging each column of X sort to have the same ordering as the corresponding input vector Ulf Leser: Bioinformatics, Wintersemester 2010/2011 36

Quantile normalization III 1. Given a matrix X where p x n where each array is a column and each transcript is a row Sort by column Array 1 Array 2 Array 3 Gene1 1 6 8 Gene2 2 5 9 Gene 3 3 4 7 Array 1 Array 2 Array 3 Gene1 1 4 7 Gene2 2 5 8 Gene 3 3 6 9 Row wise mean Array 1 Array 2 Array 3 Gene1 4 4 4 Gene2 5 5 5 Gene 3 6 6 6 Rearrange Array 1 Array 2 Array 3 Gene1 4 6 5 Gene2 5 5 6 Gene 3 6 4 4 Rearrange Ulf Leser: Bioinformatics, Wintersemester 2010/2011 37

Quantile normalization IV Before Array 1 Array 2 Array 3 Gene1 1 6 8 Gene2 2 5 9 Gene 3 3 4 7 After Array 1 Array 2 Array 3 Gene1 4 6 5 Gene2 5 5 6 Gene 3 6 4 4 Ulf Leser: Bioinformatics, Wintersemester 2010/2011 38

Quantile normalization V Before After Ulf Leser: Bioinformatics, Wintersemester 2010/2011 39

Gene expression matrix Sample Gene Ulf Leser: Bioinformatics, Wintersemester 2010/2011 40

Conclusion Different techniques cdna: cdna library Oligo: artificial Oligos Problems Image recognition Normalization Comparative tool Findings about interplay of genes in pathways Often used Ulf Leser: Bioinformatics, Wintersemester 2010/2011 41

This Lecture Protein synthesis Microarray Idea Technologies Applications Problems Quality control Normalization Analysis next week! Ulf Leser: Bioinformatics, Wintersemester 2010/2011 42