Next generation sequencing (NGS)




Vijayachitra Modhukur, BIIT, modhukur@ut.ee
Bioinformatics course, 11/13/12

Sequencing

Microarrays vs NGS
- Sequences do not need to be known in advance
- Highly quantitative
- Lower noise levels; no cross-hybridization
- Increased sensitivity to detect rare sequences in complex genomic samples
- Accurate single-nucleotide resolution permits discrimination between highly related sequences
- The lowered cost of NGS makes comprehensive mapping of multiple features possible
Paul J. Hurd et al.

Outline of NGS

Why sequencing?
- Genome architecture
- Disease diagnosis
- Variability studies
- Comparative genomics
- Gene regulation
- Drug design
- and many more


Different generations (computers and sequencing)

First generation: Sanger sequencing
http://www.youtube.com/watch?v=apn8lp4yxpo&feature=related

Application: Human Genome Project, 1990-2003

Human Genome Project key findings
1. There are approximately 23,000 genes in human beings, the same range as in mice and roundworms. Understanding how these genes express themselves will provide clues to how diseases are caused.
2. The human genome has significantly more segmental duplications (nearly identical, repeated sections of DNA) than other mammalian genomes. These sections may underlie the creation of new primate-specific genes.
3. At the time the draft sequence was published, fewer than 7% of protein families appeared to be vertebrate-specific.
http://en.wikipedia.org/wiki/human_genome_project/

Second generation sequencing

http://sciblogs.co.nz/code-for-life/2012/03/22/the-world-in-dna-sequencers/

Breakthrough NGS technology
ER Mardis. Nature 470, 198-203 (2011) doi:10.1038/nature09796

Leading NGS platforms
With 3730s, ~60 Mb per year. Specifications as of summer 2008.

              454          Solexa/Illumina     SOLiD (ABI)
Bp per run    400 Mb       2-3 Gb              3-6 Gb
Read length   250-400 bp   35-50 (70-100) bp   35-50 bp
Run time      10 hr        2.5 days            5 days
Download      20 min       27 hr (44 min)      ~1 day
Analysis      2-5 hr       2 days              2-3 days
Files         20-50 Gb     1 T                 1 T

Massive amount of sequenced data

Sequencing projects

Application

Human Genome

Human genome
http://www.mdpi.com/journal/genes/special_issues/nextgen-sequencing/

1000 Genomes Project

1000 Genomes Project
- Small inter-individual differences in regulatory regions are found in all human populations
- Association of genetic variation with disease
- Discovery of novel genetic variants such as SNPs and CNVs
- Improvement of the human reference sequence
Key results
- Each person carries 250 to 300 loss-of-function variants in annotated genes and 50 to 100 variants previously implicated in inherited disorders.

Analysis

From data to analysis: CPU- and memory-intensive

NGS pipeline

Alignment and assembly software (name: description)
BLAT: BLAST-Like Alignment Tool. Can handle one mismatch in the initial alignment step.
Bowtie: Uses a Burrows-Wheeler transform to create a permanent, reusable index of the genome; 1.3 GB memory footprint for the human genome. Aligns more than 25 million Illumina reads in 1 CPU hour.
BWA: Uses a Burrows-Wheeler transform to create an index of the genome. A bit slower than Bowtie, but allows indels in the alignment.
ELAND: Implemented by Illumina. Includes ungapped alignment with a finite read length.
GMAP and GSNAP: Robust, fast, short-read alignment. GMAP: singleton reads; GSNAP: paired reads. Useful for digital gene expression, SNP and indel genotyping.
MAQ: Ungapped alignment that takes into account quality scores for each base.
MOSAIK: Fast gapped aligner and reference-guided assembler. Aligns reads using a banded Smith-Waterman algorithm seeded by results from a k-mer hashing scheme. Supports reads ranging in size from very short to very long.
RazerS: No read length limit. Hamming or edit distance mapping with configurable error rates. Configurable and predictable sensitivity (runtime/sensitivity tradeoff). Supports paired-end read mapping.
SHRiMP: Indexes the reads instead of the reference genome. Uses masks to generate possible keys. Can map ABI SOLiD color space reads.
SLIDER: An application for the Illumina Sequence Analyzer output that uses the "probability" files instead of the sequence files as input for alignment to a reference sequence or a set of reference sequences.
SOAP: Robust with a small (1-3) number of gaps and mismatches. Speed improvement over BLAT; uses a 12-letter hash table. SOAP2 is much faster than the first version.
SOCS: For ABI SOLiD technologies. Significant increase in time to map reads with mismatches (or color errors). Uses an iterative version of the Rabin-Karp string search algorithm.
SSAHA: Fast for a small number of variants.
Taipan: De novo assembler for Illumina reads.
Based on http://en.wikipedia.org/wiki/list_of_sequence_alignment_software

Quality scores
- Each base from a sequencer comes with a quality score
- Quality scores encode base-calling error probabilities
- Phred quality score: Q = -10 log10(P), where P is the probability that the base call is wrong
- A higher quality score indicates a smaller probability of error
http://www.illumina.com/truseq/quality_101/quality_scores.ilmn
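The Phred relation can be checked numerically with a short sketch. The function names are illustrative, and the ASCII offset of 33 used to decode a quality string is an assumption (the Sanger / Illumina 1.8+ convention); older Illumina files used an offset of 64.

```python
import math


def phred_from_error(p):
    """Phred quality score Q = -10 * log10(P) for an error probability 0 < p <= 1."""
    return -10.0 * math.log10(p)


def error_from_phred(q):
    """Inverse relation: P = 10 ** (-Q / 10)."""
    return 10.0 ** (-q / 10.0)


def decode_quality_string(qual, offset=33):
    """Turn an ASCII-encoded quality string into integer Phred scores.

    offset=33 is an assumption (Phred+33 encoding), not something the slides state.
    """
    return [ord(ch) - offset for ch in qual]


# An error probability of 1 in 1000 corresponds to a score of about 30.
print(phred_from_error(0.001))
print(error_from_phred(20))
print(decode_quality_string("I5!"))
```

A score of 20 therefore means roughly one wrong call per 100 bases, and every additional 10 points lowers the error probability tenfold.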

Quality scores
http://www.illumina.com/truseq/quality_101/quality_scores.ilmn

File formats

fastq: raw data
http://en.wikipedia.org/wiki/fastq_format

fastq to fasta
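A minimal sketch of the conversion, assuming the common four-line fastq layout (header starting with `@`, sequence, separator starting with `+`, quality string). The function name and the example reads are made up for illustration; real files with wrapped sequences would need a proper parser.

```python
import io


def fastq_to_fasta(fastq_handle, fasta_handle):
    """Copy sequences from a 4-line-per-record fastq stream into fasta.

    Returns the number of records written. Quality strings are dropped,
    because fasta has no place to store them.
    """
    count = 0
    while True:
        header = fastq_handle.readline()
        if not header:
            break  # end of input
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        seq = fastq_handle.readline().rstrip("\r\n")
        plus = fastq_handle.readline().rstrip("\r\n")
        qual = fastq_handle.readline().rstrip("\r\n")
        if not header.startswith("@") or not plus.startswith("+"):
            raise ValueError("malformed fastq record near: " + header)
        if len(seq) != len(qual):
            raise ValueError("sequence/quality length mismatch in: " + header)
        fasta_handle.write(">" + header[1:] + "\n" + seq + "\n")
        count += 1
    return count


# Hypothetical two-read example
example = "@read1\nGATTACA\n+\nIIIIIII\n@read2\nACGT\n+\n!!!!\n"
out = io.StringIO()
print(fastq_to_fasta(io.StringIO(example), out))
print(out.getvalue())
```

The same function works on files opened with `open(path)`, since it only relies on `readline` and `write`.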

SAM/BAM format
Proliferation of alignment formats over the years: Cigar, psl, gff, xml, etc.
SAM (Sequence Alignment/Map) format
- Single unified format for storing read alignments to a reference genome
BAM (Binary Alignment/Map) format
- Binary equivalent of SAM
- Developed for fast processing/indexing
Advantages
- Can store alignments from most aligners
- Supports multiple sequencing technologies
- Supports indexing for quick retrieval/viewing
- Compact size (e.g. 112 Gbp Illumina = 116 Gbytes disk space)
- Reads can be grouped into logical groups, e.g. lanes, libraries, individuals/genotypes
- Supports second-best base call/quality for hard-to-call bases
- Possibility of storing raw sequencing data in BAM as a replacement for SRF and fastq
Thomas Keane, 9th European Conference on Computational Biology, 26th September, 2010

SAM format
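As a rough illustration of the layout, a SAM alignment line is tab-separated with eleven mandatory fields followed by optional tags. The sketch below splits one line into those fields; the field names follow the SAM specification, while the `SamRecord` type, the `parse_sam_line` helper and the example record are hypothetical.

```python
from typing import NamedTuple


class SamRecord(NamedTuple):
    qname: str   # read (query) name
    flag: int    # bitwise flag
    rname: str   # reference sequence name
    pos: int     # 1-based leftmost mapping position
    mapq: int    # mapping quality
    cigar: str   # CIGAR string
    rnext: str   # reference name of the mate/next read
    pnext: int   # position of the mate/next read
    tlen: int    # observed template length
    seq: str     # read sequence
    qual: str    # ASCII-encoded base qualities
    tags: tuple  # optional TAG:TYPE:VALUE fields, left unparsed


def parse_sam_line(line):
    """Parse one alignment line; header lines (starting with '@') must be skipped by the caller."""
    fields = line.rstrip("\r\n").split("\t")
    if len(fields) < 11:
        raise ValueError("expected at least 11 tab-separated fields")
    return SamRecord(
        fields[0], int(fields[1]), fields[2], int(fields[3]), int(fields[4]),
        fields[5], fields[6], int(fields[7]), int(fields[8]),
        fields[9], fields[10], tuple(fields[11:]),
    )


# Made-up record for illustration only
record = parse_sam_line("read1\t99\tchr1\t100\t60\t7M\t=\t250\t157\tGATTACA\tIIIIIII\tNM:i:0")
print(record.rname, record.pos, record.cigar)
```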

Each bit in SAM format
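Assuming this slide refers to the bitwise FLAG field of a SAM record, the sketch below decodes a flag value into the properties it encodes. The bit values come from the SAM specification; the short labels and the `decode_flag` helper are illustrative names.

```python
# (bit value, short label) pairs for the SAM FLAG field
SAM_FLAG_BITS = [
    (0x1, "paired"),            # read is part of a multi-segment template
    (0x2, "proper_pair"),       # each segment properly aligned
    (0x4, "unmapped"),          # read itself unmapped
    (0x8, "mate_unmapped"),     # next segment unmapped
    (0x10, "reverse"),          # read aligned to the reverse strand
    (0x20, "mate_reverse"),     # next segment on the reverse strand
    (0x40, "first_in_pair"),    # first segment in the template
    (0x80, "second_in_pair"),   # last segment in the template
    (0x100, "secondary"),       # secondary alignment
    (0x200, "qc_fail"),         # failed quality checks
    (0x400, "duplicate"),       # PCR or optical duplicate
    (0x800, "supplementary"),   # supplementary alignment
]


def decode_flag(flag):
    """Return the labels of all bits that are set in a SAM FLAG value."""
    return [label for bit, label in SAM_FLAG_BITS if flag & bit]


print(decode_flag(99))   # 99 = 1 + 2 + 32 + 64
print(decode_flag(147))  # 147 = 1 + 2 + 16 + 128
```

Flags 99 and 147 commonly appear together: they describe the two reads of a properly aligned pair, one on each strand.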

Sequence alignment
- Reference alignment
- De novo alignment

Spaced seed vs BWT

Burrows-Wheeler transform
Original: WBWBWB#
Compressed: WWW#BBB = 3W#3B

Burrows-Wheeler transform
Sorting the rotations brings identical characters together in the Output column of Table 4-1. In this example, the BWT algorithm transforms the string WBWBWB# into WWW#BBB.

Table 4-1 Rotating and Sorting Data
Rotate     Sort       Output
WBWBWB#    BWBWB#W    W
#WBWBWB    BWB#WBW    W
B#WBWBW    B#WBWBW    W
WB#WBWB    WBWBWB#    #
BWB#WBW    WBWB#WB    B
WBWB#WB    WB#WBWB    B
BWBWB#W    #WBWBWB    B
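The rotate-and-sort procedure of Table 4-1 can be reproduced with a naive sketch. One assumption is needed to match the table: the end marker `#` is sorted after the letters (B < W < #), whereas plain ASCII ordering would put it first. The helper names are illustrative, and this quadratic approach is only meant for tiny strings; real aligners build the transform from suffix arrays.

```python
from itertools import groupby


def _sort_key(s, eos):
    # Rank the end marker after every other character, as in Table 4-1.
    return [(ch == eos, ch) for ch in s]


def bwt(text, eos="#"):
    """Burrows-Wheeler transform: last column of the sorted rotations."""
    if text.count(eos) != 1 or not text.endswith(eos):
        raise ValueError("text must end with a single end marker")
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort(key=lambda r: _sort_key(r, eos))
    return "".join(r[-1] for r in rotations)


def inverse_bwt(last_column, eos="#"):
    """Rebuild the original text by repeatedly prepending the last column and re-sorting."""
    table = [""] * len(last_column)
    for _ in range(len(last_column)):
        table = sorted(
            (ch + row for ch, row in zip(last_column, table)),
            key=lambda r: _sort_key(r, eos),
        )
    return next(row for row in table if row.endswith(eos))


def run_length_encode(s):
    """Collapse runs of identical characters, e.g. 'WWW' becomes '3W'."""
    parts = []
    for ch, group in groupby(s):
        n = len(list(group))
        parts.append((str(n) if n > 1 else "") + ch)
    return "".join(parts)


transformed = bwt("WBWBWB#")
print(transformed)                     # WWW#BBB
print(run_length_encode(transformed))  # 3W#3B
print(inverse_bwt(transformed))        # WBWBWB#
```

The transform is reversible, which is why the grouped output can stand in for the original string while compressing much better.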

Sequence assembly: solving a jigsaw puzzle

Sequence assembly: repeating patterns

Greedy assemblers
Greedily join together the reads that are most similar to each other.
Examples: Phrap, Cap3, TIGR assembler
The first assembly programs followed a simple but effective strategy in which the assembler greedily joins together the reads that are most similar to each other. In the slide's example, the assembler joins, in order, reads 1 and 2 (overlap = 200 bp), then reads 3 and 4 (overlap = 150 bp), then reads 2 and 3 (overlap = 50 bp), thereby creating a single contig from the four reads provided in the input. One disadvantage of the simple greedy approach is that, because only local information is considered at each step, the assembler can be easily confused by complex repeats, leading to mis-assemblies.
2009 SIB LF, June 4, 2010
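The greedy strategy can be sketched on toy data. This is not how Phrap, Cap3 or the TIGR assembler are implemented; it assumes error-free reads, exact suffix-prefix overlaps and a single strand, and the reads, the `min_overlap` threshold and the function names are invented for illustration. The four reads are chosen so that the merge order mirrors the slide: reads 1 and 2 first, then 3 and 4, then 2 and 3.

```python
def overlap(a, b, min_len=1):
    """Length of the longest suffix of a that equals a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the largest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while len(contigs) > 1:
        best_len, best_i, best_j = 0, None, None
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_overlap)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            break  # nothing left to join; several contigs remain
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)
    return contigs


# Toy reads with overlaps of 6 (reads 1-2), 5 (reads 3-4) and 3 (reads 2-3)
reads = ["ATGGCGTGCA", "CGTGCAATCC", "TCCTGAGTCA", "AGTCATTCGA"]
print(greedy_assemble(reads))  # ['ATGGCGTGCAATCCTGAGTCATTCGA']
```

Because each step looks only at the current best overlap, a repeat longer than the true overlaps would be merged first, which is the mis-assembly risk described above.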

Overlap-layout-consensus
Overlap
- Based on all pairwise comparisons
- Construction of an overlap graph: nodes = reads (sequences), edges = connections between overlapping reads
Layout
- Look for paths in the overlap graph that are segments of the genome to assemble (contigs)
- Goal: find a Hamiltonian path, i.e. a path that contains all nodes exactly once
Consensus
- Following the Hamiltonian path, combine the overlapping sequences in the nodes into the sequence of the genome
- Where nucleotides differ: majority vote, taking base qualities into account
Programs using OLC: Arachne, Celera Assembler (CABOG), newbler, Minimus, Edena, CAP, PCAP
http://gepard.bioinformatik.uni-saarland.de/teaching/ws-2011-12/special-topic-lecture-bioinformatics-next
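The three stages can be sketched on a handful of toy reads. This is a simplification under stated assumptions, not the method of any listed assembler: overlaps are exact suffix-prefix matches, the Hamiltonian path is found by brute force over all orderings (feasible only for a few reads, since the problem is NP-hard in general), and the majority vote ignores base qualities. All names and reads are invented for illustration.

```python
from collections import Counter
from itertools import permutations


def overlap(a, b, min_len):
    """Longest suffix of a matching a prefix of b, or 0 if shorter than min_len."""
    for n in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def build_overlap_graph(reads, min_len=3):
    """Overlap stage: edges (i, j) -> overlap length for every ordered pair of reads."""
    graph = {}
    for i, a in enumerate(reads):
        for j, b in enumerate(reads):
            if i != j:
                n = overlap(a, b, min_len)
                if n:
                    graph[(i, j)] = n
    return graph


def hamiltonian_path(num_reads, graph):
    """Layout stage: first ordering that visits every read once along existing edges."""
    for order in permutations(range(num_reads)):
        if all((u, v) in graph for u, v in zip(order, order[1:])):
            return order
    return None


def consensus(reads, order, graph):
    """Consensus stage: place reads at their offsets and take a per-column majority vote."""
    columns = []
    offset = 0
    for k, idx in enumerate(order):
        if k > 0:
            prev = order[k - 1]
            offset += len(reads[prev]) - graph[(prev, idx)]
        for pos, base in enumerate(reads[idx]):
            while len(columns) <= offset + pos:
                columns.append(Counter())
            columns[offset + pos][base] += 1
    return "".join(col.most_common(1)[0][0] for col in columns)


reads = ["TCCTGAGTCA", "ATGGCGTGCA", "AGTCATTCGA", "CGTGCAATCC"]
graph = build_overlap_graph(reads)
order = hamiltonian_path(len(reads), graph)
print(order)
print(consensus(reads, order, graph))
```

Unlike the greedy approach, the layout step considers the whole graph before committing to an order, at the price of a much harder search problem.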

De Bruijn graph: Velvet

Online resources
- NCBI-SRA
- NCBI-GEO
- The European Nucleotide Archive (ENA)
- ArrayExpress

Visualization tools
Table 1 Tools for visualizing sequencing data

Name | Cost | OS | Description | URL

Stand-alone tools
ABySS-Explorer | Free | Win, Mac, Linux | Interactive assembly structure visualization tool | http://tinyurl.com/abyss-explorer/
CLC Genomics Workbench | $ | Win, Mac, Linux | Integrates NGS data visualization with analysis tools; user friendly | http://www.clcbio.com/
Consed* | Free | Mac, Linux | Widely used; assembly finishing package; NGS compatible | http://www.phrap.org/
DNASTAR Lasergene | $ | Win, Mac | Analysis suite with an assembly finishing package; NGS compatible | http://www.dnastar.com/
EagleView | Free | Win, Mac, Linux | Assembly viewer; compatible with single-end NGS | http://tinyurl.com/eagleview/
Gap | Free | Linux | Widely used; assembly finishing package; Gap5 is NGS compatible | http://staden.sourceforge.net/
Hawkeye | Free | Win, Mac, Linux (S) | Sanger sequencing assembly viewer | http://amos.sourceforge.net/hawkeye/
Integrative Genomics Viewer (IGV)* | Free | Win, Mac, Linux | Genome browser with alignment view support (Table 2); NGS compatible | http://www.broadinstitute.org/igv/
MapView | Free | Win, Linux | Read alignment viewer; custom file format for fast NGS data loading | http://evolution.sysu.edu.cn/mapview/
MaqView | Free | Mac, Linux | Read alignment viewer; fast NGS data loading from Maq alignment files | http://maq.sourceforge.net/
Orchid | Free | Linux (S) | Assembly viewer customized to display paired-end relationships | http://tinyurl.com/orchid-view/
Sequencher | $ | Win, Mac | Assembly finishing package | http://www.genecodes.com/
SAMtools tview | Free | Win, Mac, Linux | Simple and fast text alignment viewer; NGS compatible | http://samtools.sourceforge.net/

Web-based tools
LookSeq | Free | | Uses AJAX; y axis for insert size; user configures data resources; NGS compatible | http://lookseq.sourceforge.net/
NCBI Assembly Archive Viewer | Free | | Graphical interface to contig and trace data in NCBI's Assembly Archive | http://tinyurl.com/assmbrowser/

Free means the tool is free for academic use; $ means there is a cost. OS, operating system: Win, Microsoft Windows; Mac, Macintosh OS X. Tools running on Linux usually also run on other versions of Unix. (S) indicates that compilation from source is required. An assembly finishing package enables interactive sequence editing and/or integration with tools for automated assembly improvement. *Our recommendation.

Dr. Ece Gamsiz

Next lectures
- RNA sequencing: method, application, advantages over microarrays
- ChIP sequencing
- Epigenomics: DNA methylation, histone modification, ...