How Sequencing Experiments Fail

How Sequencing Experiments Fail v1.0
Simon Andrews
simon.andrews@babraham.ac.uk

Classes of Failure
Technical: Something went wrong with a machine
Tracking: Samples aren't what they're supposed to be
Library: Problems during sequencing library preparation
Contamination: Unexpected material in your libraries
Biological: Samples didn't behave the way you expected
Interpretation: Drawing the wrong conclusion from the data

Technical

Technical Failures
[Figure: three base calls, each showing the signal level for the four bases C, T, A, G]
Call = T, Confidence = High
Call = T, Confidence = Low
Call = T, Confidence = Low

Phred Scores
Phred = -10 log10(p)
p = Probability call is incorrect
10% error = Phred10
1% error = Phred20
0.1% error = Phred30
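
The relationship between a Phred score and the error probability can be checked with a few lines of Python. This is a minimal sketch; the function names are illustrative and do not come from any particular tool.

```python
import math


def phred_to_error_prob(q: float) -> float:
    """Probability that a base call is incorrect, given its Phred score."""
    return 10 ** (-q / 10)


def error_prob_to_phred(p: float) -> float:
    """Phred score for a given probability of an incorrect call."""
    return -10 * math.log10(p)


# Reproduce the three reference points listed on the slide.
for q in (10, 20, 30):
    print(f"Phred{q}: {phred_to_error_prob(q):.1%} error")
```

Each step of 10 in the score corresponds to a tenfold drop in the error probability.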

Incorrect Encoding
Quality characters (ASCII 33 to 104): !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefgh
Phred33 (Sanger): scores start at ASCII 33 (!)
Phred64 (Illumina): scores start at ASCII 64 (@)
[Figure: counts (log scale, 1 to 1000000) plotted against ASCII code (33 to 126)]
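
One way to catch an incorrect encoding is to look at the lowest quality character in a file. The sketch below is a heuristic under stated assumptions, not the method of any specific program: characters below ASCII 59 cannot occur in the older 64-offset encodings, so they indicate Phred33, while data with nothing below ASCII 64 is probably Phred64. The function names are hypothetical.

```python
def guess_phred_offset(quality_strings):
    """Guess the ASCII offset (33 or 64) from the lowest quality character.

    Returns None when the characters seen do not separate the two encodings.
    """
    lowest = min(ord(ch) for qual in quality_strings for ch in qual)
    if lowest < 59:
        # Too low for any 64-offset encoding, so assume Phred33.
        return 33
    if lowest >= 64:
        # No low characters at all: likely Phred64, although very high
        # quality Phred33 data would look the same.
        return 64
    return None


def decode_qualities(qual, offset):
    """Convert a quality string to a list of integer Phred scores."""
    return [ord(ch) - offset for ch in qual]


print(guess_phred_offset(["IIIIHHG#"]))                   # '#' is ASCII 35
print(decode_qualities("II#", guess_phred_offset(["II#"])))
```

Decoding Phred64 data as Phred33 inflates every score by 31, which is why a wrong guess can make poor data look excellent.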

Phred Scores

Positional Phred Scores

Biased Phred Scores
HindIII sites
GGATCCGTATGCGATGCTAGCGT
GGATCATATATATGCTAGCGTAT
GGATCTATATTGCGCGATACTGG
GGATCCCGTAGCTGCGATGCTGA
GGATCAAGGATAGCGCGTCTAGA
GGATCTATATAGTTGCCGTATCG
GGATCGGAGCGGGATCGGATGCG
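
The reads listed on this slide all begin with the same five bases. A per-position base count makes that visible. The sketch below assumes that the lack of sequence diversity in the first cycles is what the slide title refers to; the slide itself gives no explanation, and the function name is illustrative.

```python
from collections import Counter

# The seven reads shown on the slide.
reads = [
    "GGATCCGTATGCGATGCTAGCGT",
    "GGATCATATATATGCTAGCGTAT",
    "GGATCTATATTGCGCGATACTGG",
    "GGATCCCGTAGCTGCGATGCTGA",
    "GGATCAAGGATAGCGCGTCTAGA",
    "GGATCTATATAGTTGCCGTATCG",
    "GGATCGGAGCGGGATCGGATGCG",
]


def base_composition(reads):
    """Count each base at every position (truncates to the shortest read)."""
    return [Counter(column) for column in zip(*reads)]


for position, counts in enumerate(base_composition(reads), start=1):
    summary = " ".join(f"{base}={counts[base]}" for base in "ACGT")
    flag = "  <- single base" if len(counts) == 1 else ""
    print(f"{position:2d} {summary}{flag}")
```

Positions where one base accounts for every read are the ones to compare against the per-position quality plot.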

Tracking

Tracking - Barcodes

Tracking Exercise
You have some barcode statistics from a set of runs from the same group.
Red = Expected barcode
Grey = Unexpected barcode
Can you spot a problem within this data set, and say how many of the lanes it might affect?
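
Barcode statistics of the kind used in this exercise can be produced by counting the barcodes seen in a lane and comparing them with the sample sheet. This is a minimal sketch; the function name and the example barcodes are hypothetical, and it uses exact matching with no allowance for sequencing errors in the barcode.

```python
from collections import Counter


def barcode_summary(observed_barcodes, expected_barcodes):
    """Compare the barcodes seen in a lane with those that were expected.

    Returns (unexpected counts, expected barcodes never seen,
    fraction of reads carrying an unexpected barcode).
    """
    counts = Counter(observed_barcodes)
    total = sum(counts.values())
    expected = set(expected_barcodes)
    unexpected = {bc: n for bc, n in counts.items() if bc not in expected}
    missing = sorted(expected - set(counts))
    unexpected_fraction = sum(unexpected.values()) / total if total else 0.0
    return unexpected, missing, unexpected_fraction


# Hypothetical lane: one barcode was not expected, one expected barcode is absent.
observed = ["ACGT", "ACGT", "TTTT", "GGCC"]
expected = ["ACGT", "GGCC", "CCAA"]
print(barcode_summary(observed, expected))
```

A lane where an unexpected barcode is abundant and an expected one is missing is a candidate for a mislabelled or swapped sample.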

Tracking Swapped Samples
Swapped between users
    Different sample type
    Different species
    See later Contamination / Biology sections
Swapped within experiment
    Look for consistent biological signal
    Use other knowledge of the samples to validate
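
Looking for a consistent biological signal within an experiment can be approximated by checking which group each sample correlates with best. The sketch below is one possible approach, not the method used in the slides: it flags a sample when its mean Pearson correlation is higher with another group than with its own. The sample names follow the KO/WT naming used in this section, but all values are invented for illustration, and every profile is assumed to be non-constant.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant profiles."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def suspicious_samples(profiles, groups):
    """Return (sample, group) pairs where the sample resembles another group."""
    flagged = []
    for name, values in profiles.items():
        by_group = {}
        for other, other_values in profiles.items():
            if other != name:
                by_group.setdefault(groups[other], []).append(
                    pearson(values, other_values)
                )
        means = {g: sum(rs) / len(rs) for g, rs in by_group.items()}
        best_group = max(means, key=means.get)
        if best_group != groups[name]:
            flagged.append((name, best_group))
    return flagged


# Invented measurements: KO2 behaves like the WT samples.
profiles = {
    "KO1": [1, 9, 2, 8], "KO2": [9, 1, 8, 2], "KO3": [1, 8, 2, 9],
    "WT1": [9, 2, 8, 1], "WT2": [8, 1, 9, 2], "WT3": [8, 2, 9, 1],
}
groups = {name: name[:2] for name in profiles}
print(suspicious_samples(profiles, groups))
```

Comparing group means rather than single nearest neighbours keeps one swapped sample from also casting suspicion on the samples it now resembles.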

Tracking Sample groups

Tracking Sample swaps
[Figure: samples KO1, KO2, KO3, KO4, WT1, WT2, WT3, WT4]

Library

Library Problems
Material Lost
Overamplification
Duplication
Biases in selection
    Priming bias
    GC bias
    Methylation bias
    Size selection bias
Technical contamination
    Read through adapter
    Adapter dimers

Duplication
Over-sequencing of library complexity
Too little material or too much PCR
Can be difficult to assess
Why does duplication matter?
    Potentially biased
    Over-estimates measurement accuracy
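
A rough duplication level can be estimated by counting how many reads are exact copies of another read. This is a simplified sketch with an illustrative function name; it compares raw sequences only, whereas duplicates in mapped data are usually identified by alignment position.

```python
from collections import Counter


def duplication_rate(sequences):
    """Fraction of reads that are exact copies of an earlier read."""
    counts = Counter(sequences)
    total = sum(counts.values())
    return 1 - len(counts) / total if total else 0.0


# Four reads, two distinct sequences: half of the reads are duplicates.
print(duplication_rate(["AAA", "AAA", "AAA", "CCC"]))
```

A high value by itself does not separate PCR duplication from genuinely over-sequenced regions, which is part of why duplication can be difficult to assess.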

Duplication

Duplication - Repeats
[Figure: three tracks (Real, Mapped, Deduplicated), each showing regions R1, NR and R2, with two regions labelled Repeat]
Peak callers (MACS for example) deduplicate internally, so you don't have to consciously do this.
Only avoided by using uniquely mapped reads.

Deduplication

Priming Bias

Contamination

Contamination
Different kinds
    Technical contamination
        Adapter dimers
    Contamination with a species you might expect
        E. coli in a mouse sample
    Contamination with something unexpected
    Contamination with the wrong material
        DNA in an RNA prep
    Mixed samples

Mapping Efficiency
Know what to expect
    Data type (genomic / transcriptomic)
    How good / complete is the genome
Distinguish unique / multi-mapped reads
Understand the mapping process

Reads:
    Input: 5725730
    Mapped: 4703342 (82.1% of input)
    of these: 471516 (10.0%) have multiple alignments (471516 have >1)
82.1% overall read alignment rate.
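
The percentages in the alignment summary above can be recomputed from the raw counts, which also gives the uniquely mapped fraction that the summary does not print. This is a small sketch with hypothetical function and key names.

```python
def mapping_summary(input_reads, mapped, multi_mapped):
    """Percentages derived from the three counts in an alignment summary."""
    return {
        "mapped_pct_of_input": 100 * mapped / input_reads,
        "multi_pct_of_mapped": 100 * multi_mapped / mapped,
        "unique_pct_of_input": 100 * (mapped - multi_mapped) / input_reads,
    }


# Counts taken from the summary shown on the slide.
summary = mapping_summary(5725730, 4703342, 471516)
for name, value in summary.items():
    print(f"{name}: {value:.1f}%")
```

Note that the multi-mapped percentage is relative to mapped reads, not to the input, so the two headline figures cannot simply be subtracted.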

Species Screen

TAGC Plots
Assemble
Filter contigs
Plot %GC vs Coverage
Sample and blast
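
The filtering and %GC steps can be sketched as below. This assumes the contigs and their coverage values are already available from an assembly; the contig tuples, the length cut-off and the function names are all hypothetical, and the plotting and BLAST steps are left out.

```python
def gc_percent(sequence):
    """%GC of a sequence, ignoring any characters other than A, C, G, T."""
    seq = sequence.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return 100 * (seq.count("G") + seq.count("C")) / acgt


def tagc_points(contigs, min_length=0):
    """(name, %GC, coverage) for every contig that passes the length filter."""
    return [
        (name, gc_percent(seq), coverage)
        for name, seq, coverage in contigs
        if len(seq) >= min_length
    ]


# Invented contigs as (name, sequence, coverage).
contigs = [("contig1", "ATGCATGC", 30.0), ("contig2", "GGCCGGCC", 4.5), ("contig3", "AT", 2.0)]
for name, gc, coverage in tagc_points(contigs, min_length=4):
    print(name, gc, coverage)
```

On a %GC versus coverage plot, contigs from a contaminating organism tend to form a separate cluster, and those are the ones worth sampling for BLAST.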

Contamination

Internal Contamination

Biological

Samples Don't Behave
All samples come with a set of expectations
    Biological effect
    Sample source
    Rough biological behaviour
If these aren't met
    Samples may not be what you expect
    Statistical analyses may be invalid
    Larger biological picture may be missed

Exercise
You have been given a set of QC and visualisation results for a single-gene knockout in male Black 6 mice (same genotype as the reference). Look through the plots and see if anything would cause you concern regarding the behaviour of the samples.

Expected effects missing

Confounded effects

ChIP doesn't behave

Methylation doesn't behave

RNA-Seq doesn't behave

Multiple subgroups

Interpretation