Next generation sequencing (NGS) Bioinformatics Challenges and strategies. Urmi Trivedi Lead Bioinformatician

Next generation sequencing (NGS) Bioinformatics: Challenges and strategies. Urmi Trivedi, Lead Bioinformatician, urmi.trivedi@ed.ac.uk

Overview
Major bottlenecks: data volume, data complexity, data noise
Solutions
Data formats
Levels of NGS bioinformatics
Analytical strategies

Imbalance in the genome informatics ecosystem (Stein, Genome Biology 2010, 11:207, doi:10.1186/gb-2010-11-5-207)

Hierarchy of NGS Data Volume
Typical output from a single flowcell of a HiSeq run:
Individual features: 1 MB
Variation data: 4-5 GB
Alignment data: 600 GB
Sequence plus quality data: 1 TB
Intensities and raw data: 5 TB

Hierarchy of NGS Data Volume
Scalable storage
Maintenance: clear as you go
Backup: streamed replication of the original data, with copies stored at multiple locations
Deposit data in public data repositories such as ENA (European Nucleotide Archive)
Network: high-speed network
Analysis paralysis: high-performance computing, such as cluster computing or cloud computing

Large number of intermediate files: more than 5,000 intermediate files are produced after image processing, and they bear no discernible relationship to the experiments.

Multiplexing adds to the problem: a flowcell with only 3 barcodes in 2 lanes produces ~14,000 files.
LIMS (Laboratory Information Management System): tracking samples
Automatic pipelines for running programs: running downstream analysis

Data Noise: base-pair quality score, adapter contamination, uneven amplification; addressed by quality control
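One of the noise sources named above, adapter contamination, can be illustrated with a minimal sketch. It assumes exact matching only; the function name `trim_adapter`, the `min_overlap` parameter, and the adapter string are illustrative choices and do not come from the slides. Dedicated trimming tools also tolerate mismatches, which this sketch does not.

```python
def trim_adapter(sequence, quality, adapter, min_overlap=5):
    """Cut a read at the first exact adapter match.

    Also handles an adapter that runs off the 3' end of the read, as long as
    at least `min_overlap` adapter bases are present (assumed threshold).
    """
    pos = sequence.find(adapter)
    if pos == -1:
        # Look for a prefix of the adapter at the very end of the read,
        # trying the longest possible overlap first.
        longest = min(len(adapter), len(sequence)) - 1
        for overlap in range(longest, min_overlap - 1, -1):
            if sequence.endswith(adapter[:overlap]):
                pos = len(sequence) - overlap
                break
    if pos == -1:
        return sequence, quality  # no adapter found, read is unchanged
    # Sequence and quality string must stay the same length.
    return sequence[:pos], quality[:pos]


# Illustrative adapter sequence (an assumption for this example).
ADAPTER = "AGATCGGAAGAGC"

read = "ACGTACGT" + ADAPTER
print(trim_adapter(read, "I" * len(read), ADAPTER))
```

Trimming the quality string together with the sequence keeps the record valid for any downstream step that expects one quality character per base.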

Data Formats
SFF: a binary file containing information about flowgrams, sequences, and qualities
FASTQ: contains the sequence by cycle and the respective quality

Data Formats
FASTA: FASTA header and the sequence
>HWI-EAS222_2093MAAXX
GAAATATTAAGTCTTTCAAA
QUAL: FASTA header and Phred scores
>HWI-EAS222_2093MAAXX
40 40 40 40 40 40 40 40 40 40
FASTQ: sequence and ASCII-coded Phred qualities
@HWI-EAS222_205JYAAXX
GATTTCTTTGTCATTATTTA
+
IIIIIIIIIIIIIIIIIIIIIIIIIIIII
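The relationship between the ASCII-coded FASTQ qualities and the numeric Phred scores of a QUAL file can be sketched as follows. This is a minimal example that assumes four lines per record and a Phred+33 offset, under which `I` decodes to 40; some older Illumina pipelines used an offset of 64, so the offset is left as a parameter. The function names are illustrative, and the quality string in the example record is shortened to match the 20-base sequence.

```python
from io import StringIO


def parse_fastq(handle):
    """Yield (read_id, sequence, quality_string) from a 4-line-per-record FASTQ stream."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of input
        if not header.startswith("@"):
            raise ValueError(f"expected '@' header, got: {header!r}")
        sequence = handle.readline().rstrip("\n")
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().rstrip("\n")
        yield header[1:], sequence, quality


def phred_scores(quality, offset=33):
    """Decode an ASCII quality string into integer Phred scores (offset is an assumption)."""
    return [ord(char) - offset for char in quality]


example = (
    "@HWI-EAS222_205JYAAXX\n"
    "GATTTCTTTGTCATTATTTA\n"
    "+\n"
    "IIIIIIIIIIIIIIIIIIII\n"
)
records = list(parse_fastq(StringIO(example)))
read_id, sequence, quality = records[0]
print(read_id, phred_scores(quality)[:3])  # HWI-EAS222_205JYAAXX [40, 40, 40]
```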

Levels of NGS Bioinformatics
Production Bioinformatics: produce raw sequence reads and QC
Advanced Bioinformatics: map to a genome and generate raw genomic features (e.g. SNPs); assemble a genome de novo with existing tools
Bioinformatics Research: analyze the data; uncover the biological meaning

Production Bioinformatics
Vendor's pipeline, then generation of FASTQ or similar files, then QC scripts (tracked in LIMS)
FAIL: further investigation
PASS: offsite backup, then Advanced Bioinformatics
Example QC output:
>>Per sequence quality scores pass
#Quality Count
2 96325.0
3 4392.0
4 7924.0
5 7229.0
6 12586.0
7 20861.0
8 22431.0
9 26053.0
10 35403.0
11 40341.0
12 46845.0
13 56089.0
14 63524.0
15 67926.0
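A quality-versus-count table like the one above can be produced by tabulating the mean Phred score of each read. The sketch below is one possible way to do it, not the QC script used in the slides: the Phred+33 offset, the function names, and the pass threshold `min_modal_quality` are all assumptions chosen for illustration.

```python
from collections import Counter


def per_sequence_quality_histogram(quality_strings, offset=33):
    """Count reads by mean Phred quality, truncated to an integer."""
    histogram = Counter()
    for quality in quality_strings:
        if not quality:
            continue  # skip empty reads
        mean_quality = sum(ord(char) - offset for char in quality) / len(quality)
        histogram[int(mean_quality)] += 1
    return dict(sorted(histogram.items()))


def qc_verdict(histogram, min_modal_quality=20):
    """Return PASS when the most common mean quality reaches the assumed threshold."""
    if not histogram:
        return "FAIL"
    modal_quality = max(histogram, key=histogram.get)
    return "PASS" if modal_quality >= min_modal_quality else "FAIL"


qualities = ["IIII", "IIII", "5555", "####"]  # toy quality strings
histogram = per_sequence_quality_histogram(qualities)
print("#Quality Count")
for quality, count in histogram.items():
    print(quality, count)
print(qc_verdict(histogram))
```

A verdict like this is what would route a run either to backup and advanced analysis or to further investigation.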

Advanced Bioinformatics
Existing reference sequence: short read alignment
No reference sequence: de novo assembly
Analyses: variant calling, gene expression, siRNA/microRNA analysis, de novo transcriptome assembly, metagenomics, population genomics

Advanced Bioinformatics: Software/Tools
Open source tools: free to use; mostly Linux based; run on the command line; installation is complicated at times
Commercial software: tools for biologists; pretty interface and ease of use; CLCBio, Geneious, DNAStar, Partek

Short Read Alignment: Challenges
Speed: using tools like BLAST/BLAT would require 100 CPU hours
Memory
Read errors
Repetitive regions
Sequencer differences
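Several of these challenges show up even in a toy aligner. The sketch below indexes the reference once and then looks up each read by its first k bases, which is the general idea behind avoiding a full search per read. It uses a plain hash index of k-mers as a stand-in, not the index structure of any aligner named in these slides, and the function names and parameters are illustrative assumptions.

```python
from collections import defaultdict


def build_kmer_index(reference, k):
    """Map every k-mer in the reference to the list of positions where it starts."""
    index = defaultdict(list)
    for start in range(len(reference) - k + 1):
        index[reference[start:start + k]].append(start)
    return index


def align_read(read, reference, index, k, max_mismatches=1):
    """Seed with the first k bases, then verify the full read by counting mismatches."""
    hits = []
    for start in index.get(read[:k], []):
        window = reference[start:start + len(read)]
        if len(window) < len(read):
            continue  # read would run off the end of the reference
        mismatches = sum(a != b for a, b in zip(read, window))
        if mismatches <= max_mismatches:
            hits.append((start, mismatches))
    return hits


reference = "ACGTTGCAACGTAGCT"
index = build_kmer_index(reference, k=4)
print(align_read("TGCAACGT", reference, index, k=4))  # unique exact hit
print(align_read("ACGTAG", reference, index, k=4))    # seed occurs twice in the reference
```

The second read illustrates the repetitive-region problem: its seed matches two places, and with one mismatch allowed (for example a read error) both placements are reported.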

Short Read Alignment: Software
ELAND, MAQ, BWA, BOWTIE, TOPHAT, GSNAP, SOAP-2, Novoalign

Variant Calling: align reads to the reference genome, then identify SNPs where the aligned reads differ from the reference
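The idea of calling a SNP from aligned reads can be sketched with a simple pileup. This is a deliberately naive illustration, assuming ungapped alignments given as (start, sequence) pairs and two arbitrary thresholds, `min_depth` and `min_alt_fraction`. Real variant callers also weigh base and mapping qualities and handle indels, which this sketch ignores.

```python
from collections import Counter


def call_snps(reference, alignments, min_depth=3, min_alt_fraction=0.8):
    """Call SNPs from ungapped alignments.

    alignments: list of (start, read_sequence) with 0-based start positions.
    Returns a list of (position, reference_base, alternate_base, depth).
    """
    # Build the pileup: one base counter per reference position.
    pileup = [Counter() for _ in reference]
    for start, read in alignments:
        for offset, base in enumerate(read):
            position = start + offset
            if 0 <= position < len(reference):
                pileup[position][base] += 1

    snps = []
    for position, counts in enumerate(pileup):
        depth = sum(counts.values())
        if depth < min_depth:
            continue  # too few reads to make a call
        base, count = counts.most_common(1)[0]
        if base != reference[position] and count / depth >= min_alt_fraction:
            snps.append((position, reference[position], base, depth))
    return snps


reference = "ACGTACGTAC"
alignments = [(0, "ACGTTCG"), (2, "GTTCGTA"), (3, "TTCGTAC")]
print(call_snps(reference, alignments))  # [(4, 'A', 'T', 3)]
```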

Variant Calling: misalignment due to indels

Variant Calling: indel realignment (GATK; MSA)

Variant Calling: Workflow
Raw data
Alignment (SAM/BAM format): ELAND, MAQ, BWA, BOWTIE, SSAHA2, SOAP-2
Realignment to correct errors: GATK
Mark duplicates: PICARD, SAMTOOLS
SNPs/indel calling: SAMTOOLS, GATK, VarScan
Annotations
Validation, visualization and bioinformatics research: IGV, Savant, Tablet

Gene Expression Analysis: reads (cDNA fragments) from samples S1 and S2 are aligned; aligned read counts serve as a measure of gene expression
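Counting aligned reads per gene can be sketched as below. The example assumes a single chromosome, genes given as half-open (start, end) intervals, and each read represented only by its alignment start; the gene names, coordinates, and read positions are made up for illustration.

```python
def count_reads_per_gene(genes, read_positions):
    """Count reads whose alignment start falls inside each gene interval.

    genes: {gene_name: (start, end)} with half-open intervals [start, end).
    read_positions: list of 0-based alignment start positions.
    """
    counts = {name: 0 for name in genes}
    for position in read_positions:
        for name, (start, end) in genes.items():
            if start <= position < end:
                counts[name] += 1
    return counts


# Hypothetical annotation and alignments for two samples.
genes = {"geneA": (0, 100), "geneB": (200, 300)}
s1_reads = [5, 50, 210]
s2_reads = [220, 250, 260, 150]

print(count_reads_per_gene(genes, s1_reads))  # {'geneA': 2, 'geneB': 1}
print(count_reads_per_gene(genes, s2_reads))  # {'geneA': 0, 'geneB': 3}
```

The read at position 150 in S2 falls between the two genes and is not counted, which is why raw counts depend on the annotation as well as on the alignment.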

Gene Expression Analysis: Workflow
Raw data
Alignment (SAM/BAM format): TOPHAT, GSNAP, STAMPY, BWA, BOWTIE
Mark duplicates?: PICARD, SAMTOOLS
Raw counts
Normalization and differential gene expression: edgeR, DESeq
Validation and bioinformatics research
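Why normalization comes before any comparison of raw counts can be shown with a small sketch. It uses plain library-size scaling (counts per million) and a log2 ratio with a pseudocount, which is a much simpler stand-in for the statistical models in packages such as edgeR and DESeq. The function names, the pseudocount, and the sample counts are assumptions for illustration.

```python
import math


def counts_per_million(counts):
    """Scale raw counts by library size so samples of different depth are comparable."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sample has no aligned reads")
    return {gene: count * 1_000_000 / total for gene, count in counts.items()}


def log2_fold_change(cpm_a, cpm_b, pseudocount=1.0):
    """log2(B / A) per gene; the pseudocount (an assumption) avoids division by zero."""
    return {
        gene: math.log2((cpm_b[gene] + pseudocount) / (cpm_a[gene] + pseudocount))
        for gene in cpm_a
    }


# Hypothetical raw counts: S2 was sequenced twice as deeply as S1.
s1 = {"geneA": 100, "geneB": 100}
s2 = {"geneA": 200, "geneB": 200}

fold_changes = log2_fold_change(counts_per_million(s1), counts_per_million(s2))
print(fold_changes)
```

The raw counts double for both genes, yet after scaling the fold change is zero, because the difference comes entirely from sequencing depth.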

De novo Genome Assembly Software
Velvet, ABYSS, ALLPATHS-2, SOAPDenovo, SGA, EDENA, CLCbio, Newbler

De novo Genome Assembly Workflow
Short reads (typically 100 bp paired end)
Filter poor-quality data, sequence adapters, etc.
Assemble and generate contigs; QC
Long reads (e.g. 454, Sanger) and mate pairs (3-10 kb insert)
Generate scaffolds; QC
Data visualization: GMOD, GBrowse, Tablet
Annotation (gene prediction, etc.): MAKER, Augustus

Summary
Next-generation sequencing is still a very rapidly moving field
Plan for change: keep our infrastructure flexible, keep disk space expandable, keep software agile
NEVER proceed with the analysis without data QC
Choose the right tool for the right job

Acknowledgements: Professor Mark Blaxter, Dr. Karim Gharbi, Dr. Stephen Bridgett, Timothée Cézard, Gaganjot Kaur, Stuart Taylor, The Darwin Trust of Edinburgh