Next Generation Sequencing: Technology, Mapping, and Analysis




Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University gbenson@bu.edu http://tandem.bu.edu/

The Human Genome Project took 10 years and cost roughly $3,000,000,000 by 2001. But we had only one, maybe two, versions of the human genome, and therefore little data for a comprehensive, systematic study of human genetic diversity. My lab has data from 2 whole human genomes, stored on a hard drive. Each cost roughly $40,000 (in 2010). Today (2013), sequencing a genome costs $5,000 to $10,000.

Outline: Next Generation Sequencing Technologies; Algorithms for Mapping Reads; Detecting Structural Variants; Visualization Software

Why Sequence DNA? DNA is the molecule of genetic inheritance. Sequencing data provide a fundamental basis for understanding the biology of an organism. The data allow comprehensive comparisons of organisms on a genomic level to find regions of similarity, difference, and functional significance.

Why Sequence DNA? The data allow us to understand human variation on a molecular level, for example, the genetic differences between tumor and normal tissue. This will hopefully lead to more specific medical treatments (personalized medicine).

Current Experimental Methods That Use Sequencing
RNA-Seq: Measurement of gene expression in a tissue by counting the number of RNA fragments sequenced from each gene. Also used for alternative splicing detection.
ChIP-Seq (Chromatin Immunoprecipitation): Identification of protein binding sites on DNA by determining where DNA fragments bound to a specific protein map onto the genome.
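The RNA-Seq counting idea can be sketched in a few lines. This is only an illustration under simplifying assumptions: each read is reduced to a single mapped position, genes are plain half-open intervals, and the function name `count_reads_per_gene` and the gene names in the usage note are hypothetical. Real pipelines also handle splicing, strand, and reads that overlap several genes.

```python
def count_reads_per_gene(read_positions, genes):
    """Count mapped reads falling inside each gene interval.

    read_positions: iterable of (chromosome, position) tuples.
    genes: dict mapping gene name -> (chromosome, start, end),
           with start inclusive and end exclusive (an assumed convention).
    Returns a dict mapping gene name -> read count (zero counts included).
    """
    counts = {name: 0 for name in genes}
    for chrom, pos in read_positions:
        # Linear scan over genes: fine for a sketch, too slow for a genome.
        for name, (g_chrom, start, end) in genes.items():
            if chrom == g_chrom and start <= pos < end:
                counts[name] += 1
    return counts
```

For example, with genes `{"geneA": ("chr1", 100, 200)}`, reads at chr1:150 and chr1:199 are counted for geneA, while a read at chr1:200 is not, because the end coordinate is exclusive.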

Current Experimental Methods That Use Sequencing
Genome sequencing allows us to detect SNPs (single nucleotide polymorphisms) and structural variations among individuals, either within a population or from different populations.

Next Generation Sequencing Technologies
Current: Illumina Genome Analyzer, Roche 454, Applied Biosystems SOLiD
Future: Ion Torrent, Pacific Biosciences RS

Sanger Sequencing http://upload.wikimedia.org/wikipedia/en/6/60/dna_sequencing_gdna_libraries.jpg

Sanger vs NGS technology: $0.50 per kilobase (Sanger) vs $1.50 per megabase (NGS). http://www.nature.com/nbt/journal/v26/n10/fig_tab/nbt1486_f1.html Next-generation DNA sequencing by Jay Shendure and Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008), doi:10.1038/nbt1486

Emulsion and bridge amplification http://www.nature.com/nbt/journal/v26/n10/fig_tab/nbt1486_f2.html Next-generation DNA sequencing by Jay Shendure and Hanlee Ji, Nature Biotechnology 26, 1135-1145 (2008), doi:10.1038/nbt1486

Illumina sequencing technology http://seqanswers.com/forums/showthread.php?t=21

Video of Illumina Sequencing

SOLiD and 454 technologies http://www.nature.com/nrg/journal/v11/n1/fig_tab/nrg2626_f3.html Sequencing technologies - the next generation, by Michael L. Metzker, Nature Reviews Genetics 11, 31-46, doi:10.1038/nrg2626

Most Common Error Types: 454 vs Illumina [Comparison table with columns Roche/454 and Illumina/Solexa; the one surviving annotation reads: Lower overall error]

Next Generation Sequencing Data: These technologies generate millions to billions of short DNA reads sampled from a whole DNA genome, targeted genetic regions, or transcribed RNA. Length: 35 to 200 nt (Illumina), 200 to 300 nt (454).

Mapping Reads: the Problem
Given 100+ million reads from an experiment, for each read:
1. Find the genomic coordinates (chromosome and first base) where it has the best match in a reference genome, on either the forward or the reverse strand.
2. Best match means zero or a small number of differences with the reference.
3. Differences include mismatches and indels.
4. Determine if it has multiple matches or none at all.
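The mapping problem above can be illustrated with a toy seed-and-verify mapper. This is a sketch under stated assumptions, not the method of any particular mapping program: it uses a plain k-mer hash index, handles mismatches only (no indels), treats the reference as a single sequence, and all function names are hypothetical. With non-overlapping seeds of length k, it is only guaranteed to find hits that have fewer mismatches than the number of seeds in the read.

```python
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]

def build_index(reference, k):
    """Map every k-mer of the reference to the list of its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index

def hamming(a, b):
    """Number of mismatching positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def map_read(read, reference, index, k, max_mismatches=2):
    """Return the best hits as a sorted list of (mismatches, position, strand).

    position is the 0-based leftmost reference coordinate of the match.
    An empty list means unmapped; more than one entry means multiple
    equally good matches.
    """
    hits = set()
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        # Non-overlapping seeds; each seed hit proposes one candidate start.
        for offset in range(0, len(seq) - k + 1, k):
            for pos in index.get(seq[offset:offset + k], ()):
                start = pos - offset
                if start < 0 or start + len(seq) > len(reference):
                    continue
                mm = hamming(seq, reference[start:start + len(seq)])
                if mm <= max_mismatches:
                    hits.add((mm, start, strand))
    if not hits:
        return []
    best = min(h[0] for h in hits)
    return sorted(h for h in hits if h[0] == best)
```

The returned list covers all four requirements in miniature: the coordinate and strand of the best match, the number of differences, and whether the read has several best matches or none.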

Algorithms for Read Mapping

Structural Variants: Structural variants are any rearrangements of the genome relative to a reference. They include insertions/deletions, inversions, translocations, and tandem repeat variations. Many can be detected with paired-end or mate-pair reads.

Paired-Ends and Mate-Pairs http://www.nature.com/nature/journal/v456/n7218/fig_tab/nature07517_f1.html Accurate whole human genome sequencing using reversible terminator chemistry, David Bentley et al., Nature 456, 53-59, doi:10.1038/nature07517

Mate-pairs vs paired-ends Detection of large indels requires large insert sizes. Current Illumina technology allows for paired-end insert sizes very close to 250 bp, which, depending on coverage, allows for detection of small and medium-size indels only. Mate-pair libraries allow for generation of large inserts at the expense of more insert-size variability.

Normally Mapped Reads: [Figure: paired reads 1 and 2, taken from the two ends of a sequenced fragment (insert) of the subject (segments A B C), map to the reference (A B C).] Apparent insert size is in the normally expected range.

Deletion: [Figure: paired reads 1 and 2 from a sequenced fragment of the subject (A C) map to the reference (A B C) on either side of B.] Apparent insert size is longer than expected, indicating deletion of B.

Insertion: [Figure: paired reads 1 and 2 from a sequenced fragment of the subject (A B C) map to the reference (A C).] Apparent insert size is shorter than expected, indicating insertion of B.

Distribution of apparent insert lengths, used to determine which are unusually short (insertions) or unusually long (deletions).
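One way to turn the insert-length distribution into a decision rule is sketched below. The slides do not say which statistic or cutoff is used; the median plus or minus a multiple of the median absolute deviation (MAD), the default of 5 MADs, and the function name are all assumptions made for illustration.

```python
from statistics import median

def classify_insert_sizes(insert_sizes, n_mads=5.0):
    """Label each apparent insert size as 'normal', 'deletion', or 'insertion'.

    A robust center (median) and spread (MAD) are estimated from the data
    itself. Sizes far above the center suggest a deletion in the subject;
    sizes far below suggest an insertion. Thresholds are illustrative only.
    """
    sizes = list(insert_sizes)
    center = median(sizes)
    mad = median([abs(s - center) for s in sizes])
    spread = max(mad, 1.0)  # avoid a zero-width window on very uniform data
    low, high = center - n_mads * spread, center + n_mads * spread
    labels = []
    for s in sizes:
        if s > high:
            labels.append("deletion")   # reads map farther apart than expected
        elif s < low:
            labels.append("insertion")  # reads map closer than expected
        else:
            labels.append("normal")
    return labels
```

A single outlying pair is weak evidence; in practice a call would require several independent pairs supporting the same event, which depends on coverage.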

Singletons: A singleton is a read that maps while its mate does not. Possible causes: 1. Split read 2. Novel insertion

Singleton, Split Read: [Figure: subject A C, reference A B C; read 1 maps normally, while read 2 spans the junction between A and C.] Parts of read 2 map to two locations. It is split. Some mapping programs cannot detect the split mapping, so only one read is mapped.

Homozygous deletion: gap in the mapped reads. Heterozygous deletion: no gap. Bentley, et al, Nature 456, 53-59 (6 November 2008) doi:10.1038/nature07517

Inversion: [Figure sequence: in the reference A B C D E F G H, the segment B C D E F G is cut out, flipped end to end so that its 5 prime and 3 prime ends are exchanged, and reinserted, giving the subject A G F E D C B H. Reads in subject order 1 2 3 4 map to the reference in the order 1 3 2 4.]
Paired reads that span a breakpoint map in the same direction and are farther apart than expected.
A split read will generally go undetected.
An insert entirely contained in the inversion will look normal, although the positions are swapped.

Homozygous inversion: red marks pairs mapped in the same direction. Bentley, et al, Nature 456, 53-59 (6 November 2008) doi:10.1038/nature07517

Breakpoints of an Inversion No normally mapped reads span the breakpoints.
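The paired-read signatures described for deletions, insertions, and inversions can be combined into one small classifier. This is a sketch under assumptions: pairs are expected to map on opposite strands (forward/reverse), positions are 0-based leftmost coordinates on the same chromosome, and the `MappedPair` type, the function name, and the labels are hypothetical.

```python
from typing import NamedTuple

class MappedPair(NamedTuple):
    pos1: int      # leftmost reference coordinate of read 1
    strand1: str   # "+" or "-"
    pos2: int      # leftmost reference coordinate of read 2
    strand2: str   # "+" or "-"

def classify_pair(pair, min_insert, max_insert, read_len):
    """Classify one read pair by mapped orientation and apparent insert size."""
    if pair.strand1 == pair.strand2:
        # Both reads on the same strand: the signature of a pair that
        # spans an inversion breakpoint.
        return "possible_inversion"
    apparent = max(pair.pos1, pair.pos2) + read_len - min(pair.pos1, pair.pos2)
    if apparent > max_insert:
        return "possible_deletion"
    if apparent < min_insert:
        return "possible_insertion"
    return "normal"
```

Note that a pair lying entirely inside an inversion keeps opposite strands and a normal apparent insert size, so it is classified as normal here, which matches the observation that such inserts look normal.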

Tandem Repeat Variants or VNTRs (Variable Number of Tandem Repeats)

Tandem Repeat: tcgctggtcata cgt cgt cgt cgt cgt tacaaacgtcttccgt
The left flank sequence (tcgctggtcata) is followed by a tandem array of copies numbered 1 to 5 (cgt each) and then the right flank sequence (tacaaacgtcttccgt). A multiple alignment of copies 1 to 5 gives the consensus sequence.
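The flank / array / flank decomposition can be reproduced for the example sequence with a short sketch. It assumes the motif is already known and that copies are exact, which is a strong simplification: repeat-detection software such as TRF finds the motif itself and tolerates mismatches and indels between copies. The function name is hypothetical.

```python
import re

def find_tandem_array(sequence, motif, min_copies=2):
    """Locate the longest exact tandem array of `motif` in `sequence`.

    Returns (left_flank, copy_number, right_flank), or None if no run of
    at least `min_copies` consecutive copies exists.
    """
    pattern = re.compile("(?:%s){%d,}" % (re.escape(motif), min_copies))
    best = None
    for match in pattern.finditer(sequence):
        if best is None or len(match.group()) > len(best.group()):
            best = match
    if best is None:
        return None
    copies = len(best.group()) // len(motif)
    return sequence[:best.start()], copies, sequence[best.end():]
```

Applied to the sequence on the slide with motif `cgt`, this returns the left flank `tcgctggtcata`, a copy number of 5, and the right flank `tacaaacgtcttccgt`.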

Tandem Repeat Variants: Tandem repeat polymorphisms occur as differences in: copy number; individual copy motifs (SNPs/indels); order of motifs in the tandem array.

Why are Tandem Repeat Variants Important? They are associated with human disease: triplet repeat diseases (Fragile X mental retardation, myotonic dystrophy, Huntington's disease, Friedreich's ataxia), epilepsy, diabetes, ovarian cancer. They co-occur with transcription factor binding sites and so may be involved in gene regulation.

Why is detecting variants difficult? Read mapping in the presence of large indels (copy number difference) is computationally costly. Motif differences (indels and SNPs) and motif order differences are additional complications for both seed indexing and BWT/Suffix Array matching approaches. Mapping and indel detection are typically oblivious to sequence annotation.

Outline of Strategy
1. Detect repeats in the subject and in the human reference using TRF software.
2. Map read TRs to reference TRs using: indexing of reference patterns; fast bit-wise edit distance with threshold; profile alignment of TR arrays; flanking sequence alignment (bit-wise).
3. Compare copy number of reference TRs with those of mapped read TRs and identify variants.
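The "bit-wise edit distance with threshold" step can be illustrated with a bit-parallel computation in the style of Myers' bit-vector algorithm (in the formulation commonly attributed to Hyyrö). The slides do not say which bit-wise algorithm the authors use, so this is an assumed stand-in; the function name and the convention of returning `None` above the threshold are also assumptions.

```python
def bounded_edit_distance(pattern, text, max_dist):
    """Global edit distance between pattern and text, or None if > max_dist.

    Each column of the dynamic programming table is encoded as two bit
    vectors of vertical differences (pv: +1, mv: -1), so one text character
    is processed with a constant number of integer operations.
    """
    m, n = len(pattern), len(text)
    if abs(m - n) > max_dist:
        return None  # the length difference alone exceeds the threshold
    if m == 0:
        return n
    peq = {}  # per-character bit mask of positions in the pattern
    for i, ch in enumerate(pattern):
        peq[ch] = peq.get(ch, 0) | (1 << i)
    mask = (1 << m) - 1
    high = 1 << (m - 1)
    pv, mv, score = mask, 0, m
    for j, ch in enumerate(text):
        eq = peq.get(ch, 0)
        xv = eq | mv
        xh = ((((eq & pv) + pv) ^ pv) | eq) & mask
        ph = (mv | ~(xh | pv)) & mask   # horizontal +1 differences
        mh = pv & xh                    # horizontal -1 differences
        if ph & high:
            score += 1
        elif mh & high:
            score -= 1
        # The score can drop by at most 1 per remaining text character.
        if score - (n - j - 1) > max_dist:
            return None
        ph = ((ph << 1) | 1) & mask
        mh = (mh << 1) & mask
        pv = (mh | ~(xv | ph)) & mask
        mv = ph & xv
    return score if score <= max_dist else None
```

Python integers are unbounded, so the pattern can be any length here; in a language with fixed-width words, the same recurrences are applied to one machine word or to blocks of words.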

Subject data come from the Watson genome, sequenced with 454 technology: 74 million reads, avg. length 261 nt, avg. coverage ~6x.
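The coverage figure follows from the read count and read length. The sketch below uses the standard estimate (number of reads times mean read length, divided by genome size); the genome size of about 3.1 billion bases is an assumption, since the slides do not state the value used.

```python
def average_coverage(n_reads, mean_read_length, genome_size):
    """Average depth of coverage: total sequenced bases / genome size."""
    return n_reads * mean_read_length / genome_size

# Figures from the slide, with an assumed human genome size of 3.1e9 bases.
watson_coverage = average_coverage(74_000_000, 261, 3.1e9)
```

With these inputs the estimate is a little over 6, consistent with the stated average coverage of about 6x.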

Reference tandem repeats come from the Tandem Repeats Database (TRDB). https://tandem.bu.edu/cgi-bin/trdb/trdb.exe 230,671

VNTR: 1 copy shorter (heterozygous?)

VNTR: 2 copies shorter

VNTR: 1 copy longer

VNTR: 2 alleles observed

VNTR: 2 copies shorter

VNTR and SNP alleles

VNTR: internal motif duplicated and flanking SNPs

VNTR Results Watson Genome

Khoisan Genome: African hunter-gatherer culture, nominal 12.3x coverage

Acknowledgments: Yevgeniy Gelfand, Yozen Hernandez, Joshua Loving

Thank you!