Expression Quantification (I)

Similar documents
Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

Standards, Guidelines and Best Practices for RNA-Seq V1.0 (June 2011) The ENCODE Consortium

Frequently Asked Questions Next Generation Sequencing

RNA-seq. Quantification and Differential Expression. Genomics: Lecture #12

Discovery and Quantification of RNA with RNASeq Roderic Guigó Serra Centre de Regulació Genòmica (CRG)

New Technologies for Sensitive, Low-Input RNA-Seq. Clontech Laboratories, Inc.

From Reads to Differentially Expressed Genes. The statistics of differential gene expression analysis using RNA-seq data

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Gene Expression Analysis

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

Basic processing of next-generation sequencing (NGS) data

Computational Genomics. Next generation sequencing (NGS)

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Institute of Actuaries of India Subject CT3 Probability and Mathematical Statistics

BIOL 3200 Spring 2015 DNA Subway and RNA-Seq Data Analysis

Challenges associated with analysis and storage of NGS data

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

RNAseq / ChipSeq / Methylseq and personalized genomics

Services. Updated 05/31/2016

Analysis of ChIP-seq data in Galaxy

G E N OM I C S S E RV I C ES

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

RNA-Seq Tutorial 1. John Garbe Research Informatics Support Systems, MSI March 19, 2012

Probability Calculator

Data Analysis & Management of High-throughput Sequencing Data. Quoclinh Nguyen Research Informatics Genomics Core / Medical Research Institute

Introduction to next-generation sequencing data

STAT 360 Probability and Statistics. Fall 2012

People have thought about, and defined, probability in different ways. important to note the consequences of the definition:

2WB05 Simulation Lecture 8: Generating random variables

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

RNA Express. Introduction 3 Run RNA Express 4 RNA Express App Output 6 RNA Express Workflow 12 Technical Assistance

Next Generation Sequencing

PrimePCR Assay Validation Report

Next Generation Sequencing: Technology, Mapping, and Analysis

mrna NGS Data Analysis Report

PrimePCR Assay Validation Report

SAS Software to Fit the Generalized Linear Model

NGS data analysis. Bernardo J. Clavijo

Molecular Genetics: Challenges for Statistical Practice. J.K. Lindsey

Next generation DNA sequencing technologies. theory & prac-ce

MATHEMATICAL METHODS OF STATISTICS

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Practical Differential Gene Expression. Introduction

STA 4273H: Statistical Machine Learning

Advances in RainDance Sequence Enrichment Technology and Applications in Cancer Research. March 17, 2011 Rendez-Vous Séquençage

E3: PROBABILITY AND STATISTICS lecture notes

How Sequencing Experiments Fail

edger: differential expression analysis of digital gene expression data User s Guide Yunshun Chen, Davis McCarthy, Mark Robinson, Gordon K.

Quantitative proteomics background

GeneProf and the new GeneProf Web Services

CIDANE: comprehensive isoform discovery and abundance estimation

AGILENT S BIOINFORMATICS ANALYSIS SOFTWARE

From the help desk: hurdle models

PROBABILITY AND STATISTICS. Ma To teach a knowledge of combinatorial reasoning.

bitter is de pil Linos Vandekerckhove, MD, PhD

High Throughput Sequencing Data Analysis using Cloud Computing

Understanding West Nile Virus Infection

Basics of Statistical Machine Learning

Statistical Rules of Thumb

Final Project Report

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Overview of Monte Carlo Simulation, Probability Review and Introduction to Matlab

Analysis of Bayesian Dynamic Linear Models

Normalization of RNA-Seq

Math 431 An Introduction to Probability. Final Exam Solutions

Homework 4 - KEY. Jeff Brenion. June 16, Note: Many problems can be solved in more than one way; we present only a single solution here.

PreciseTM Whitepaper

A survey of best practices for RNA-seq data analysis

SEQUENCING. From Sample to Sequence-Ready

Using Galaxy for NGS Analysis. Daniel Blankenberg Postdoctoral Research Associate The Galaxy Team

Real-time PCR: Understanding C t

NGS Data Analysis: An Intro to RNA-Seq

TruSeq Custom Amplicon v1.5

MATH4427 Notebook 2 Spring MATH4427 Notebook Definitions and Examples Performance Measures for Estimators...

Tutorial for proteome data analysis using the Perseus software platform

Gene Expression Assays

Service courses for graduate students in degree programs other than the MS or PhD programs in Biostatistics.

Time series experiments

To be able to describe polypeptide synthesis including transcription and splicing

Curriculum Map Statistics and Probability Honors (348) Saugus High School Saugus Public Schools

Thermo Scientific DyNAmo cdna Synthesis Kit for qrt-pcr Technical Manual

Partek Methylation User Guide

Graphical Modeling for Genomic Data

6 PROBABILITY GENERATING FUNCTIONS

Analysis of gene expression data. Ulf Leser and Philippe Thomas

MEU. INSTITUTE OF HEALTH SCIENCES COURSE SYLLABUS. Biostatistics

business statistics using Excel OXFORD UNIVERSITY PRESS Glyn Davis & Branko Pecar

TGC AT YOUR SERVICE. Taking your research to the next generation

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Linear regression methods for large n and streaming data

Part 2: One-parameter models

FOR REFERENCE PURPOSES

Multivariate Statistical Inference and Applications

Transcription:

Expression Quantification (I) Mario Fasold, LIFE, IZBI

Sequencing Technology One Illumina HiSeq 2000 run produces 2 times (paired-end) ca. 1,2 Billion reads ca. 120 GB FASTQ file

RNA-seq protocol

Task Obtain some value (estimate) representing the true abundance of transcripts in vivo

Overview 1. Basic Expression Measures 2. Statistical Poisson Model 1. Single Isoform 2. Multiple Isoform 3. Statistical Inference 4. Results 3. Paired-End Sequencing 4. Negative-Binomial

Expression Quantification What properties should the expression values have? Values should be comparable Between transcripts Between samples (-> differential expression)

Problems in Expression Quantification Differences in transcript length

Problems in Expression Quantification Fluctuations in sequencing depth

Basic Measure: RPKM RPKM/FPKM = Reads/Fragments per kilobase of exon model per million mapped reads FPKM 10 9 x wl x = the number of reads mapped onto the gene's exons w = total number of reads in the experiment l = the sum of the exons in base pairs.

However, there are isoforms...

Typical number of isoforms 7%

Notation fg,1 fg,2 ng=2

kf kf lf lf

f2 f1 f1 f2 f1 f1 f1 f2 Set of all isoforms in sample Total length of isoforms f1 f1 f1 f1 f1 f1 f1 f2 f2 f2

f2 f1 f1 f2 f1 f1 f1 f2 f1 f1 f1 f1 f1 f1 f1 f2 f2 f2 Expression value Probablity of a read coming from isoform f

Model Assumptions Reads sampled independently (-> unique reads) Read sampling probability uniform over all sequences (uniform coverage) Each transcript processed independently and then sequenced

Model (1) Let w be total number of reads Given isoform f and region of length l in f Let X be a random variable representing the number of reads falling in that region Then X~bin(w, Θfl)

Coin Toss gives Binomial Dist n

Poisson Model Each gene g handled separately Exons only shared as a whole isoforms either share an exon, or they don t Let X~Poisson( ) be a random variable representing the number of reads falling in some region of interest in g

Poisson Model Each gene g handled separately Exons only shared as a whole isoforms either share an exon, or they don t Let X~Poisson( ) be a random variable representing the number of reads falling in some region of interest in g Then for some exon j: λ= cij is 1 if isoform i contains exon j, 0 otherwise ai is the sampling rate

Estimation of expression

Single isoform case

Multiple isoform case No closed-form solution available for ML if n>1 Likelihood function is concave any local maximum is also a global one Use numerical methods to find maximum Here: coordinate-wise hill climbing

Fisher information matrix We now have a point estimate for the expression index according to some Poisson model How reliable is the estimate? We need some confidence interval to test for DE The distribution of the estimate can be approximated with a normal distribution with mean (true parameter) and covariance (inverse Fisher information matrix)

Results Use RNA-seq dataset from Mortavazi et al.(2008): three mouse tissue samples with 2 replicates 60-80 M reads each Mapped with SeqMap tool from the same authors to mm9 mrna annotations from mm9 RefSeq mrna database

Differentially expressed isoforms

From isoforms to gene expression: which reads to count?

From isoforms to gene expression Exon intersection similar to microarrays, but they can reduce statistical power as reads are not considered Exon union underestimated expression for alternatively spliced genes This paper used sum of all reads = exon union

Statistical inference Problem: Fisher information matrix is degenerated especially for isoforms with low expression Need to regularize covariance matrix Use Importance Sampling method to simulate from posterior distribution

Summary: isoform expression Isoform expression inference in RNA-seq possible using Poisson model Inferences agree with detailed inspection Exon junction reads reduce confidence interval

Paired-end sequencing P5 grafting forward read primer DNA insert index read primer P7 grafting index reverse read primer

Paired-end sequencing Forward read Reverse Read Index read

Informative Reads

Incorporation in the Poisson Model Model insert size using sampling rate ai

Overdispersion and Negative Binomial Poisson distribution (one parameter) cannot account for (high) biological variability across samples Most RNA-seq studies have not enough replicates to estimate variability using a permutation-derived approach Model variance using e.g.negative binomial distribution

Summary and Challenges Straightforward statistical model for isoform expression available However, reads are not truly random and uniform High peaks and 3 bias in read distributions Positive correlations in read distributions between tissues Isoform-level annotations not complete Most studies comprise few biological replicates