8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Transcription:

Experimental Design & Intro to NGS Data Analysis
Ryan Peters, Field Application Specialist, Partek, Incorporated

Agenda
Experimental Design; Examples; ANOVA; What assays are possible?; NGS Analytical Process; Alignment of NGS Data; Challenges of NGS Analysis; Partek Flow Demonstration

Examples
Shoe Example; Breast Cancer Example; Rat Example (Experimental Design); Tips on setting up your next experiment

The Role of Experimental Design
The goal of statistics is to find signals in a sea of noise. The goal of experimental design is to reduce that noise so that true biological signals can be found with as small a sample size as possible.

Shoe Example
Question: Do shoes affect height?
Hypothesis: Yes, shoes affect height.
Assay: Measure the height of 10 people with and without shoes. (Change only one variable.)
Sample size: 10 people at Partek (5 male, 5 female).
Analysis: Use a two-sample t-test to see whether there is a difference between the means of the two groups, with shoes and without shoes.

t-test
A simple t-test does not have the power to correctly identify this pattern, because it assumes that multiple samples from the same individual are independent when they are not.
p = 0.51; fold-change = 1.02
Conclusion: No statistically significant difference in height due to shoes.

Paired t-test
The paired t-test provides substantially more statistical power by removing person-to-person differences from the noise.
p(shoes) = 1e-5; p(person) = 2e-9

Introducing Gender
Once person is known, gender is already known; thus the p-value for shoes remains unchanged. We get the estimate of the gender effect for free!

Add Gender (3-way ANOVA)
p(shoes) = 1e-5; p(gender) = 0.04; p(person) = 2e-9
It appears (p = 0.04) that men (at Partek) are significantly taller than women.
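The contrast between the two-sample and the paired t-test in the shoe example can be sketched in a few lines of Python. This is an illustration only: the heights are made-up numbers (not the Partek measurements), the function names are hypothetical, and only the t statistics are computed, since p-values would need a t-distribution table or a statistics library.

```python
import math
import statistics

def two_sample_t(a, b):
    """Equal-variance two-sample t statistic (treats every measurement as independent)."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def paired_t(a, b):
    """Paired t statistic: works on within-person differences, so person-to-person
    height differences drop out of the noise term."""
    diffs = [x - y for x, y in zip(a, b)]
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(len(diffs)))

# Made-up heights in cm for 10 people, measured without and with shoes.
barefoot = [158, 162, 165, 168, 170, 174, 177, 180, 183, 188]
with_shoes = [161, 166, 168, 173, 174, 176, 180, 182, 186, 190]

if __name__ == "__main__":
    print("two-sample t:", round(two_sample_t(with_shoes, barefoot), 2))
    print("paired t:    ", round(paired_t(with_shoes, barefoot), 2))
```

With data like these, the same few centimetres of shoe effect are small next to the spread between people but large next to the spread of the within-person differences, which is why the paired statistic is much larger.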

Explore Gender/Shoe Interaction
Do shoes have the same effect on men and women?
p(shoes) = 1e-8; p(gender) = 0.04; p(person) = 2e-12; p(shoe*gender) = 7e-5
Wow! Shoes affect women's height more than men's! Also note that the p-values for the shoe effect are even smaller, because we explained more noise.

Breast Cancer Example
Example of Large Batch Effect
Example data: GEO experiment GSE848. Control (E2) plus drug treatment of breast cancer cells.
5 treatments x 3 time points x 2 replicates. Biological replicates were processed in 2 batches.

Replicates per treatment and time point:
Time     Control   Estrogen (E2)   E2 + ICI   E2 + Raloxifene   E2 + Tamoxifen
0 hr     2         -               -          -                 -
8 hr     -         2               2          2                 2
48 hr    -         2               2          2                 2

Fortunately, treatments were perfectly balanced across processing batches.

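As Seen Using PCA
[Figure: PCA plot of the samples]

As Seen Using Hierarchical Clustering
[Figure: hierarchical clustering of the samples]

What is Analysis of Variance?
Analysis (source: m-w.com). Etymology: New Latin, from Greek, from analyein, to break up. Definition: separation of a whole into its component parts.
[Figure: pie chart of variance components with slices of 17.49%, 1.15%, 17.40%, 58.36% and 1.64%; the surviving labels are Treatment and Time]
Analysis of Variance (ANOVA): a technique that partitions the variance in data into separate components or factors.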

Good News! Balanced Experimental Design
The treatments were perfectly balanced across the batches, so batch can be included as a blocking factor in the ANOVA, and the batch effect (noise) can be removed from the data.

In terms of p-values for this gene, the difference is dramatic. With a simple 2-way ANOVA, this gene was #228 on the gene list and would not pass multiple test correction for significance. With a 3-way ANOVA including batch, it was #2 on the gene list.

Factor           2-way ANOVA   3-way ANOVA
Treatment        0.00391497    3.43275E-07
Time             0.396031      0.00964938
Treatment*Time   0.100862      3.56752E-05

#2 Most Significant Gene
[Figure: expression of this gene by processing day (Monday, Tuesday); median A = 8.5, median B = 9.7; Tuesday vs. Monday shows a more than 2-fold difference]

ANOVA Partitions Variability
Total variance is partitioned into variability due to influencing factors; the rest is assumed to be due to random error (noise).
R² = 81% for the 2-way ANOVA; R² = 99% when batch is included.
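How adding batch as a blocking factor raises R² can be illustrated with a small sum-of-squares calculation. The sketch below assumes a complete, balanced layout with exactly one measurement per treatment/batch cell and an additive model; the data values and the function name `partition_variance` are invented for illustration and are not taken from GSE848.

```python
from collections import defaultdict

def partition_variance(cells):
    """Split total sum of squares into treatment, batch and error parts.

    cells maps (treatment, batch) -> one measurement; every treatment must
    appear in every batch (complete balanced layout, additive model).
    """
    values = list(cells.values())
    grand = sum(values) / len(values)

    by_treatment, by_batch = defaultdict(list), defaultdict(list)
    for (treatment, batch), y in cells.items():
        by_treatment[treatment].append(y)
        by_batch[batch].append(y)
    t_mean = {t: sum(v) / len(v) for t, v in by_treatment.items()}
    b_mean = {b: sum(v) / len(v) for b, v in by_batch.items()}

    ss_total = sum((y - grand) ** 2 for y in values)
    ss_treatment = sum(len(v) * (t_mean[t] - grand) ** 2 for t, v in by_treatment.items())
    ss_batch = sum(len(v) * (b_mean[b] - grand) ** 2 for b, v in by_batch.items())
    # Residual after removing both the treatment effect and the batch effect.
    ss_error = sum((y - t_mean[t] - b_mean[b] + grand) ** 2 for (t, b), y in cells.items())
    return {"total": ss_total, "treatment": ss_treatment, "batch": ss_batch, "error": ss_error}

# Made-up log2 expression values: a large day-to-day batch shift, a smaller treatment effect.
example = {("A", "Mon"): 8.0, ("B", "Mon"): 9.0, ("A", "Tue"): 10.0, ("B", "Tue"): 11.2}

if __name__ == "__main__":
    ss = partition_variance(example)
    print("R^2, treatment only:     ", ss["treatment"] / ss["total"])
    print("R^2, treatment and batch:", (ss["treatment"] + ss["batch"]) / ss["total"])
```

When batch is left out of the model, its sum of squares is counted as noise, which inflates the error term that the treatment effect is tested against.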

Batch Effect Remover
[Figure: data before batch removal and after batch removal]
For visualization purposes only!
Factors you would normally add for ANOVA.
How do we account for batch without the Partek Batch Remover?

Building Blocks of Experimental Design
No randomization.
Completely randomized: subjects are randomly assigned to treatment groups.
Randomized block: subjects are randomly assigned to treatment groups within similar blocks (e.g. gender, litters). This requires a priori knowledge of the differences between the blocks.

Simplest Design: Not Randomized
8 male rats: 4 treated, 4 control.
Stripe-coated rats are faster or more alert.

Completely Randomized
8 male rats: 4 treated, 4 control.

A Better Approach: Randomized Block Design
First divide into blocks, then randomly assign to treatment groups.

Randomized Block Design
8 male rats: 4 treated, 4 control.

Technical Blocks in Microarray Experiments
Litter is an example of a biological block.
Examples of technical/processing blocks: RNA isolation batch; hybridization batch; operator.
As well as (although less so): wash and stain batch; reagent and cocktail batches; chip lot.

In Summary
"Block what you can and randomize what you cannot." Box, Hunter, & Hunter (1978)
Blocking ensures that the differences in treatment cannot possibly be due to the blocking factor.
Blocking completely eliminates noise due to blocks.
Randomization gives approximate balance across other variables unaccounted for.
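The "first divide into blocks, then randomly assign" rule can be sketched as follows. The rat and litter names are hypothetical, and the function assumes each block size is a multiple of the number of treatments so that every block ends up balanced.

```python
import random

def randomized_block_assignment(blocks, treatments=("Treated", "Control"), seed=None):
    """Randomly assign treatments within each block, keeping every block balanced.

    blocks maps a block name (e.g. a litter) to the list of subjects in it.
    Returns a dict: subject -> (block, treatment).
    """
    rng = random.Random(seed)  # a seed makes the assignment reproducible
    assignment = {}
    for block, subjects in blocks.items():
        if len(subjects) % len(treatments) != 0:
            raise ValueError(f"block {block!r} cannot be split evenly across treatments")
        labels = list(treatments) * (len(subjects) // len(treatments))
        rng.shuffle(labels)  # randomization happens only inside the block
        for subject, label in zip(subjects, labels):
            assignment[subject] = (block, label)
    return assignment

# Hypothetical example: 8 male rats from 2 litters of 4.
litters = {
    "litter_1": ["rat_1", "rat_2", "rat_3", "rat_4"],
    "litter_2": ["rat_5", "rat_6", "rat_7", "rat_8"],
}

if __name__ == "__main__":
    for rat, (litter, treatment) in randomized_block_assignment(litters, seed=1).items():
        print(rat, litter, treatment)
```

Because each litter contributes equally to both groups, a litter effect cannot masquerade as a treatment effect, and litter can later be added to the ANOVA model as a blocking factor.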

Analysis of Variance
Also known as: ANOVA, ANCOVA, linear model, mixed linear model.
Invented in 1900, 1908, 1923.
Still remains the most commonly used statistical method to analyze clinical trials!

Simple ANOVA: Student's t-test
t and F statistics.
Fun fact: an equal-variance t-test is mathematically equivalent to a 1-way ANOVA with two groups (F = t²).
[Photos: Student/Gosset; Fisher]
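The equivalence between the equal-variance t-test and a 1-way ANOVA with two groups can be checked numerically: the F statistic equals the square of the t statistic. The sketch below uses only the Python standard library; the two groups are arbitrary made-up numbers and the function names are hypothetical.

```python
import math
import statistics

def pooled_t(a, b):
    """Equal-variance (pooled) two-sample t statistic."""
    na, nb = len(a), len(b)
    pooled_var = ((na - 1) * statistics.variance(a)
                  + (nb - 1) * statistics.variance(b)) / (na + nb - 2)
    return (statistics.mean(a) - statistics.mean(b)) / math.sqrt(pooled_var * (1 / na + 1 / nb))

def one_way_f(*groups):
    """1-way ANOVA F statistic: between-group mean square over within-group mean square."""
    all_values = [y for g in groups for y in g]
    grand = statistics.mean(all_values)
    ss_between = sum(len(g) * (statistics.mean(g) - grand) ** 2 for g in groups)
    ss_within = sum((y - statistics.mean(g)) ** 2 for g in groups for y in g)
    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    return (ss_between / df_between) / (ss_within / df_within)

group_a = [8.1, 8.4, 8.6, 8.9, 9.0]   # made-up values
group_b = [9.2, 9.5, 9.9, 10.1, 10.4]

if __name__ == "__main__":
    t = pooled_t(group_a, group_b)
    f = one_way_f(group_a, group_b)
    print("t^2 =", t ** 2, " F =", f)
```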

Assumptions of ANOVA
Data are normally distributed (bell shaped) within the different treatment groups: ensure data are log transformed.
Variance is equal within the different treatment groups: design balanced experiments.
Sample groups are independent: don't make the shoe mistake.
*Replicates are required to get a p-value.

Random vs Fixed Effects
If the experiment were to be performed again, would the same levels of the factor be used?
Yes: fixed effect (e.g. gender, dose, time, dye).
No: random effect (e.g. hyb batch, wash batch, litter, subject).
Why do I have to worry about this? In general, treating a random effect as a fixed effect will produce an overoptimistic p-value, leading to a false discovery.

What Factors Belong in the Model?
Obviously, the factors of interest to the researcher, e.g. strain, time, strain*time.
Any factor needed to account for dependence of samples (don't violate the assumption of independence!), e.g. donor.
Any additional blocking factors for noise reduction, e.g. batch.

Partek Expression Philosophy
Use PCA to aid in quality control and sample grouping.
Use ANOVA to detect significantly differentially expressed genes. Fold change is interesting for ranking, but not a great primary filtering metric.
Incorporate as much phenotypic and experimental design information into the ANOVA model as possible. Measure the experimental technical components.*
Make sense of gene lists through functional groups.

How NOT to Run/Ruin Your Next Experiment!
Samples are frequently organized by treatment groups. Samples are then processed in batches corresponding to treatment groups. But please do NOT process your control samples on Monday and then process your treated samples on Tuesday. You will confound these two variables: treatment and processing day can no longer be separated. ANOVA is powerful but not magical.

Summary: Experimental Design & Analysis
Understand how separating variables in your analysis is critical to your success.
Design balanced experiments.
Let p-values rank your data, but don't be a slave to FDR.

What kinds of assays are possible?
DNA-Seq: copy number; SNP; structural variants; whole genome sequencing; metagenomics; targeted/amplicon sequencing.
ChIP-Seq: transcription factor binding sites; methylation sites; histone modifications; RIP-Seq (RNA-binding proteins).
RNA-Seq: transcriptome; differential gene expression; alternative splicing; SNP detection; indel detection; novel exons/genes.
miRNA-Seq: identify regulatory (non-coding) RNAs.

NGS Analysis Phases
[Figure: primary, secondary and tertiary analysis. Labels: control software; data file (reads + quality), FASTQ; Bowtie/BioScope/BWA, etc.; reads aligned to genome, BAM. Modified from Strand Life Sciences.]

NGS Analytical Process
Sequencing (Illumina HiSeq, SOLiD, Roche 454, Ion Torrent, PacBio), then genome alignment, QC and exploratory analysis, powerful statistics, intuitive visualizations, integrated genomics, biological interpretation, publication.

Example aligned reads (read, chromosome, position, strand):
GAGGTTGCAGTTTG    chr1    243919543   R
ACTGCTCCGCCTCA    chr16   49094914    F
GAATAAAAAATCCA    chr13   55882620    F
CGTCCTTCACCCTCT   chr13   110085165   R
CCTTAAGGAAAGGA    chr18   72273046    F
CAGCTAGGGTTGCC    chr2    120786940   R
CTGCTGGTGCTGCG    chr10   73237323    F

Comprehensive Analysis of NGS Data
DNA-Seq; small RNA-Seq; RNA-Seq; Methylation-Seq; ChIP-Seq.

Read Types for NGS
Single-end reads; paired-end reads; junction reads; multiple aligned reads; strand-specific reads.

Paired End & Single End Reads
[Figure: single-end, paired-end and multiple aligned reads drawn against DNA space on chr2 and chr5]

Junction Reads
Because they are derived from transcripts, some RNA-Seq reads will read through splice junctions (single-end or paired-end). They will not align well to the genomic reference, since the two ends are many nucleotides apart (separated by the intronic region).
[Figure: a junction read spanning an intron in DNA space]

Next Gen File Formats
Unaligned: FASTA, FASTQ, SCARF, QSEQ, SRA, RAW, TXT, others.
Alignment tools: ELAND, BFAST, Bowtie, TMAP, BWA, TopHat, SOAP, etc.
Aligned: SAM, BAM, vendor-specific formats/color space.
Variant call file (VCF, BCF): SNPs, indels.
[Figure: example aligned reads with chromosome, position and strand]

What to expect?
Platforms: cluster, cloud, laptop.
File size depends on read length and read type: about 4 GB for a single lane (~100 million reads).
Bowtie with 8 cores: 20/25 minutes against a reference genome, read length = 33 bp (older data).
TopHat on the same file: 1 day.
Read length x number of reads x 8 = file size (FASTA; double for FASTQ).
A BAM file is ~3-4x smaller than the unaligned file.
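The file-size rule of thumb can be worked through for the lane described above. The slide does not state the unit of its result; the sketch below assumes that the factor of 8 means 8 bits (one byte) per base, which gives a figure close to the quoted 4 GB. The function name and the BAM ratio handling are illustrative assumptions, not part of any tool.

```python
def estimate_unaligned_bytes(read_length, n_reads, fmt="fasta"):
    """Rule-of-thumb size of an unaligned file in bytes.

    Assumption: read length x number of reads x 8 gives bits (one byte per base).
    FASTQ also stores one quality character per base, so the estimate doubles.
    Header lines and newlines are ignored.
    """
    bits = read_length * n_reads * 8
    if fmt == "fastq":
        bits *= 2
    elif fmt != "fasta":
        raise ValueError("fmt must be 'fasta' or 'fastq'")
    return bits // 8

if __name__ == "__main__":
    # One lane from the slide: ~100 million reads of 33 bp.
    fasta = estimate_unaligned_bytes(33, 100_000_000, "fasta")
    fastq = estimate_unaligned_bytes(33, 100_000_000, "fastq")
    print(f"FASTA: {fasta / 1e9:.1f} GB, FASTQ: {fastq / 1e9:.1f} GB")
    # BAM is quoted as roughly 3-4x smaller than the unaligned file.
    print(f"BAM (from FASTQ): {fastq / 4 / 1e9:.1f} to {fastq / 3 / 1e9:.1f} GB")
```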

FASTQ Format (Unaligned Reads)
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65
Line 1 begins with a '@' character and is followed by a sequence identifier and an optional description (like a FASTA title line).
Line 2 is the raw sequence letters (ACGT).
Line 3 begins with a '+' character and is optionally followed by the same sequence identifier (and any description) again.
Line 4 encodes the quality values for the sequence in Line 2, and must contain the same number of symbols as letters in the sequence.
Sanger format can encode a Phred quality score from 0 to 93 using ASCII 33 to 126.

What is Alignment?
A read comes off a sequencing machine: ATGGTCA
Goal: Determine where on the genome that read belongs.
Method: Match the sequence of the read to the sequence of a reference genome.
Reference genome: GGCATGGTCATTC
Read: ATGGTCA
Result: Genomic location of the read.

Align junction reads
[Figure: a gene/transcript with an exon junction]
DNA space (reference genome): GATGCACGGATTGTCAT
RNA space (read): ATGGTCA
1) Align to the genome: gapped alignment; time expensive; breaks up the read into pieces (25-mers).
2) Align to the transcriptome: lose genomic context!
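The FASTQ layout and the idea of alignment can both be made concrete with a short sketch. The parser assumes simple four-line records (no line-wrapped sequences) and Sanger encoding (Phred score = ASCII code minus 33). The "aligner" is only an exact substring search, shown to illustrate the goal; real aligners such as Bowtie or BWA use genome indexes and tolerate mismatches. Function names are hypothetical.

```python
def parse_fastq(lines):
    """Yield (identifier, sequence, phred_scores) from four-line FASTQ records."""
    it = iter(lines)
    for header in it:
        header = header.strip()
        if not header:
            continue  # skip blank lines between records
        sequence = next(it).strip()
        plus = next(it).strip()
        quality = next(it).strip()
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed FASTQ record")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        # Sanger encoding: ASCII 33 ('!') is Phred 0.
        yield header[1:], sequence, [ord(ch) - 33 for ch in quality]

def naive_align(read, reference):
    """Return the 0-based position of the first exact match, or -1 if there is none."""
    return reference.find(read)

record = [
    "@SEQ_ID",
    "GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATCCATTTGTTCAACTCACAGTTT",
    "+",
    "!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCCC65",
]

if __name__ == "__main__":
    for name, seq, scores in parse_fastq(record):
        print(name, len(seq), "bases; min/max Phred:", min(scores), max(scores))
    print("read position:", naive_align("ATGGTCA", "GGCATGGTCATTC"))
```

Run against the junction example, `naive_align("ATGGTCA", "GATGCACGGATTGTCAT")` finds no match, which is exactly why junction reads need gapped alignment or a transcriptome reference.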

SAM Format (Aligned reads)
Sequence Alignment/Map (SAM) format is TAB-delimited. BAM is binary SAM.
[Figure: annotated SAM example. Labels: header line (reference sequence); read id; bitwise flag; reference genome; position; quality score; CIGAR (M = match, I = insertion, D = deletion); reference name of mate; position of mate; sequence length; quality; optional fields]
Explain flags: http://picard.sourceforge.net/explain-flags.html

VCF Format (Variant Call Format)
The Variant Call Format (VCF) is a TAB-delimited format in which each data line consists of the following fields: chromosome, position, variant id, reference/alternative alleles, quality, information (read depth), event, sample id (optional), format (optional). In the VCF specification the fixed columns are named CHROM, POS, ID, REF, ALT, QUAL, FILTER and INFO, followed by an optional FORMAT column and one column per sample.
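Two of the SAM fields, the bitwise flag and the CIGAR string, are easier to read once decoded. The sketch below uses the flag bit values and CIGAR operation letters defined in the SAM specification; the function names and the human-readable bit descriptions are my own wording, not part of any library.

```python
import re

# Flag bits as defined in the SAM specification.
SAM_FLAG_BITS = [
    (0x1, "read paired"),
    (0x2, "read mapped in proper pair"),
    (0x4, "read unmapped"),
    (0x8, "mate unmapped"),
    (0x10, "read on reverse strand"),
    (0x20, "mate on reverse strand"),
    (0x40, "first in pair"),
    (0x80, "second in pair"),
    (0x100, "secondary alignment"),
    (0x200, "fails quality checks"),
    (0x400, "PCR or optical duplicate"),
    (0x800, "supplementary alignment"),
]

def explain_flag(flag):
    """Return the descriptions of all bits set in a SAM bitwise flag."""
    return [text for bit, text in SAM_FLAG_BITS if flag & bit]

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")

def parse_cigar(cigar):
    """Split a CIGAR string such as '8M2I4M' into (length, operation) pairs."""
    ops = [(int(n), op) for n, op in CIGAR_OP.findall(cigar)]
    # Reject strings with leftover characters the pattern did not consume.
    if "".join(f"{n}{op}" for n, op in ops) != cigar:
        raise ValueError(f"invalid CIGAR string: {cigar!r}")
    return ops

if __name__ == "__main__":
    print(explain_flag(99))
    print(parse_cigar("8M2I4M1D3M"))
```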

Partek Flow
Web-based application: cloud, desktop, server.
Browsers: Chrome, Firefox, Safari. Access from any terminal or smartphone.
Project centric. Protocols. Collaborate with others.
Current release 1.0 / 2.1 beta: alignment, QA/QC, GSA. Export results to PGS.
Coming soon: SGE,

Challenge: Data volume is a bottleneck
Help, I'm drowning in data! How do I handle all this data?

Solution: Schedule Tasks
Schedule and queue tasks. Emails you when tasks are complete. Keep your hardware running 24/7/365.

Challenge: The quality of the data will affect the alignment
How do I determine data quality? Do I have outliers? Can I move forward with my analysis? Do I need to trim/filter my reads?

Solution: Pre & Post Alignment QA/QC
Group and individual QA/QC for excluding outliers.
Quality score per read/position: look for a drop in quality scores.
Make intelligent decisions for trimming/filtering adaptors, barcodes and low-quality reads.

Challenge: Alignment
Different people and different parameters will result in different alignments.
Which aligner to use? Some aligners have more than 50 different options. How do I know what to set?
What options do I choose for RNA-Seq, ChIP-Seq, DNA-Seq, miRNA-Seq, MeDIP-Seq?
What options do I choose for the different read types? Junction reads? Paired-end reads? Multiple aligned reads?

Solution: Multiple aligners with recommended defaults
Vendor-specific default options.
Automatic download of reference genomes.
Assay-specific default options (RNA-Seq, ChIP-Seq, DNA-Seq).
Advanced options also available through the GUI (no command line).

Challenge: How do I keep track of my samples?
Which samples are tumor? Control? Age? Sex/gender? How am I ever going to keep track of this clinical information?

Solution: Advanced Sample Management
Manage files associated with a sample throughout the life of the project.
Keep track of the reference genome.
Controlled vocabulary: SNOMED list.
In-place editing of sample info.

Plug-in for Torrent Suite
Perform QA/QC within Torrent Suite and seamlessly upload data to Partek Flow for comprehensive data analysis and visualization.
Performs QA/QC within Torrent Suite.
Uploads data to Partek Flow.

Comprehensive Solution for RNA-Seq
Alignment; mapping; QC; statistics; visualization; integrated genomics; biological interpretation.

Acknowledgements

Partek Flow Demonstration