Genome-scale technologies 2 / Algorithmic and statistical aspects of DNA sequencing

Similar documents
Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Introduction to next-generation sequencing data

Next Generation Sequencing

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Next Generation Sequencing: Technology, Mapping, and Analysis

Introduction to NGS data analysis

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

NGS data analysis. Bernardo J. Clavijo

July 7th 2009 DNA sequencing

DNA Sequencing & The Human Genome Project

An example of bioinformatics application on plant breeding projects in Rijk Zwaan

Genetic Analysis. Phenotype analysis: biological-biochemical analysis. Genotype analysis: molecular and physical analysis

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Next Generation Sequencing for DUMMIES

Genome-wide measurements of protein-dna interaction by chromatin immunoprecipitation

Lectures 1 and February 7, Genomics 2012: Repetitorium. Peter N Robinson. VL1: Next- Generation Sequencing. VL8 9: Variant Calling

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Next generation DNA sequencing technologies. theory & prac-ce

How Sequencing Experiments Fail

MiSeq: Imaging and Base Calling

Computational Genomics. Next generation sequencing (NGS)

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

G E N OM I C S S E RV I C ES

How many of you have checked out the web site on protein-dna interactions?

CCR Biology - Chapter 9 Practice Test - Summer 2012

Data Analysis for Ion Torrent Sequencing

Comparing Methods for Identifying Transcription Factor Target Genes

DNA Sequencing. Ben Langmead. Department of Computer Science

Big data in cancer research : DNA sequencing and personalised medicine

Overview of Next Generation Sequencing platform technologies

History of DNA Sequencing & Current Applications

How is genome sequencing done?

Illumina Sequencing Technology

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

NGS Technologies for Genomics and Transcriptomics

Lecture 13: DNA Technology. DNA Sequencing. DNA Sequencing Genetic Markers - RFLPs polymerase chain reaction (PCR) products of biotechnology

PreciseTM Whitepaper

RNAseq / ChipSeq / Methylseq and personalized genomics

Sanger Sequencing and Quality Assurance. Zbigniew Rudzki Department of Pathology University of Melbourne

Analysis of ChIP-seq data in Galaxy

Reading DNA Sequences:

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Molecular Computing Athabasca Hall Sept. 30, 2013

Replication Study Guide

Systematic discovery of regulatory motifs in human promoters and 30 UTRs by comparison of several mammals

Transcription and Translation of DNA

TruSeq Custom Amplicon v1.5

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Genetic diagnostics the gateway to personalized medicine

DNA Fingerprinting. Unless they are identical twins, individuals have unique DNA

Illumina TruSeq DNA Adapters De-Mystified James Schiemer

Analysis of gene expression data. Ulf Leser and Philippe Thomas

Basic Concepts of DNA, Proteins, Genes and Genomes

Scottish Qualifications Authority

Activity 7.21 Transcription factors

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Reduced Representation Bisulfite-Seq A Brief Guide to RRBS

Challenges associated with analysis and storage of NGS data

DNA Sequence Analysis

HENIPAVIRUS ANTIBODY ESCAPE SEQUENCING REPORT

Recombinant DNA & Genetic Engineering. Tools for Genetic Manipulation

Name: Date: Period: DNA Unit: DNA Webquest

Core Facility Genomics

Analysis of DNA methylation: bisulfite libraries and SOLiD sequencing

Assuring the Quality of Next-Generation Sequencing in Clinical Laboratory Practice. Supplementary Guidelines

Intro to Bioinformatics

Introduction to SAGEnhaft

14.3 Studying the Human Genome

Cloud Computing Solutions for Genomics Across Geographic, Institutional and Economic Barriers

Central Dogma. Lecture 10. Discussing DNA replication. DNA Replication. DNA mutation and repair. Transcription

Appendix 2 Molecular Biology Core Curriculum. Websites and Other Resources

Next Generation Sequencing

Advances in RainDance Sequence Enrichment Technology and Applications in Cancer Research. March 17, 2011 Rendez-Vous Séquençage

Concepts and methods in sequencing and genome assembly

Mitochondrial DNA Analysis

Recombinant DNA and Biotechnology

Sequencing the Human Genome

Biotechnology and Recombinant DNA (Chapter 9) Lecture Materials for Amy Warenda Czura, Ph.D. Suffolk County Community College

13.2 Ribosomes & Protein Synthesis

An Overview of DNA Sequencing

Introduction To Real Time Quantitative PCR (qpcr)

High Performance Compu2ng Facility

The Power of Next-Generation Sequencing in Your Hands On the Path towards Diagnostics

Just the Facts: A Basic Introduction to the Science Underlying NCBI Resources

Molecular Genetics. RNA, Transcription, & Protein Synthesis

Next generation sequencing (NGS)

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Molecular and Cell Biology Laboratory (BIOL-UA 223) Instructor: Ignatius Tan Phone: Office: 764 Brown

DNA Replication & Protein Synthesis. This isn t a baaaaaaaddd chapter!!!

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Next Generation Sequencing: Adjusting to Big Data. Daniel Nicorici, Dr.Tech. Statistikot Suomen Lääketeollisuudessa

Cloud BioLinux: Pre-configured and On-demand Bioinformatics Computing for the Genomics Community

Genome and DNA Sequence Databases. BME 110/BIOL 181 CompBio Tools Todd Lowe March 31, 2009

Chapter 6 DNA Replication

1.5 page 3 DNA Replication S. Preston 1

The NGS IT notes. George Magklaras PhD RHCE

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Bioinformatics Unit Department of Biological Services. Get to know us

Transcription:

Genome-scale technologies 2 / Algorithmic and statistical aspects of DNA sequencing Introduction to next generation sequencing and its applications Ewa Szczurek szczurek@mimuw.edu.pl Instytut Informatyki Uniwersytet Warszawski 1/30

Organisational remarks Organisation of this course In English 1.5 hrs lecture Up to 2 1.5 hrs lab Homework 15% of your grade Final assesment Data analysis project Successful completion of the project 85% of the nal grade Oral exam 2/30

Scope of this course Next generation sequencing (NGS) data Its analysis and applications 3/30

Short biological introduction 4/30

DNA sequence determines how cells work DNA sequence Gene expression Protein function. http://www.nature.com/scitable/topicpage/gene-expression-14121669 5/30

Short biological introduction http://www.nature.com/scitable/topicpage/gene-expression-14121669 6/30

Short biological introduction http://www.nature.com/scitable/topicpage/gene-expression-14121669 7/30

Sequencing methods 1. First generation sequencing: Sanger sequencing 2. Next (second) generation sequencing: 454, Solid, Illumina/Solexa 3. Third generation sequencing: IonTorrent, Single molecule sequencing The focus of this course: NGS 8/30

Two big names in sequencing history Frederick Sanger Craig Venter 9/30

Sequencing history: the race for throughput and money 1958 Sanger's 1st Nobel prize for the sequence of insulin (51 bp) 1980 Sanger's 2nd Nobel prize for the dideoxy method of sequencing DNA 1990 The Human Genome Project plan for 15 yrs and $3 billion 1998 Craig Venter (Celera Genomics) plan for 3 yrs and $300 million 2001 HGP and Celera genome drafts published (3 billion bp; 3GB) 2003 HGP publishes the nal sequence 2008 1000 Genomes project launched 2012 1092 Human genome sequences published 2014 Illumina HiSeq X Ten Sequencer $1,000 genome First individuals sequenced include Craig Venter, James Watson, Seong-Jin Kim, and Steve Jobs ($100K ). 10/30

The ever dropping cost of sequencing http://www.genome.gov/sequencingcosts/ 11/30

Applications of NGS de novo sequencing, e.g. genome assembly calling single nucleotide and structural variants in genomes personalized medicine, knowledge discovery, e.g. cancer genomics metagenomics ChIP-Seq - measuring expression control (TF binding) RNA-Seq - measuring gene expression DNASE-Seq - marking open chromatin regions CLIP- Seq, FAIRE-Seq, BiSulte-Seq.. and many more 12/30

Plan of the course (but don't feel attached to it) 6.X.15 01. NGS, quality control 13.X.15 02. Genome assembly 20.X.15 03. Genome assembly 27.X.15 04. Read mapping 3.XI.15 05. Read mapping 10.XI.15 06. Variant calling 17.XI.15 07. Variant calling 24.XI.15 08. Cancer genomics 01.XII.15 09. Metagenomics 08.XII.15 10. ChIP-Seq 15.XII.15 11. RNA-Seq 22.XII.15 12. RNA-Seq 12.I.16 13. DNASE-Seq 19.I.16 14. Hi-C 26.I.16 15. Project presentations 13/30

How does NGS work? 14/30

How does NGS work? A focus on Illumina (Solexa) sequencing. Preprocessing steps 1. Cleave the input sample into short fragments 2. Ligate the fragments to generic adaptors 3. Anneal the fragments to a slide using the adaptors 4. Amplify the reads with PCR many copies of each read 5. Separate into single strands for sequencing 15/30

How does NGS work? A focus on Illumina (Solexa) sequencing. Sequencing steps 1. Flood the slide with nucleotides and DNA polymerase. 2. Nucleotides: uorescently labelled, with colour base. have a terminator only one base added at a time. EBI: Next Generation Sequencing Practical Courses 16/30

How does NGS work? A focus on Illumina (Solexa) sequencing. Sequencing steps cd 1. Take an image of each slide. Each read location uorescent signal added base. 2. Remove the terminators, allowing the next base to be added 3. Remove the uorescent signal, preventing contamination of the next image. EBI: Next Generation Sequencing Practical Courses 17/30

How does NGS work? A focus on Illumina sequencing. EBI: Next Generation Sequencing Practical Courses 18/30

How does NGS work? We will watch a very simplistic video https://www.youtube.com/watch?v=-7gk1hxwcte To watch at home https://www.youtube.com/watch?v=jfcd8q6qstm 19/30

Parameters of sequencing technologies First generation sequencing: Sanger large read length (700 1000bp), large sequencing cost (500$/Mb), high accuracy (read error rate 0.001%), low throughput. 20/30

Parameters of sequencing technologies (Quail et al. BMC Genom. '12) 21/30

Key concepts: coverage Theoretical coverage c = L N G, where L is the read length, N is the number of reads, and G is the length of the genome. Sims et al. Nature Reviews Genetics, 2014 22/30

Key concepts: coverage Theoretical coverage c = L N G, where L is the read length, N is the number of reads, and G is the length of the genome. Empirical coverage The exact number of times that a base in the reference is covered by a high-quality aligned read from a given sequencing experiment. Both called simply coverage or depth and understood from the context. Sims et al. Nature Reviews Genetics, 2014 22/30

Key concepts: coverage Sims et al. Nature Reviews Genetics, 2014 23/30

Key concepts: single- and paired-end reads Single-read sequencing Sequencing DNA from only one end: simple, economical. Paired-read sequencing Sequencing both ends of a fragment: higher quality data. Facilitates detection of genomic rearrangements, repetitive sequence elements, gene fusions and novel transcripts. 24/30

The FastQ format @SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 GGGTGATGGCCGCTGCCGATGGCGTCAAATCCCACC +SRR001666.1 071112_SLXA-EAS1_s_7:5:1:817:345 length=36 IIIIIIIIIIIIIIIIIIIIIIIIIIIIII9IG9IC Consecutive lines: Identifyier Sequence Identifyier Quality scores 25/30

Quality report Phred score let P be the base-calling error probability, Q = 10 log 10 (P) P = 10 Q/10. For example Q = 10 P = 0.1 Q = 20 P = 0.01 Q = 30 P = 0.001 Q = 40 P = 0.0001 26/30

Phred scores reported by dierent platforms https://en.wikipedia.org/wiki/fastq_format 27/30

Quality control FastQC Tracks errors both from the sequencer and of the material http://www.bioinformatics.babraham.ac.uk/projects/ fastqc/ 28/30

Quality-based ltering Trimmomatic for Illumina data a variety of useful trimming tasks for illumina paired-end and single ended data. http://www.usadellab.org/cms/?page=trimmomatic/ ILLUMINACLIP: SLIDINGWINDOW: LEADING: TRAILING: CROP: HEADCROP: MINLEN: TOPHRED33: Cut adapter and other illumina-specic sequences from the read. Perform a sliding window trimming, cutting once the average quality within the window falls below a threshold. Cut bases o the start of a read, if below a threshold quality Cut bases o the end of a read, if below a threshold quality Cut the read to a specied length Cut the specied number of bases from the start of the read Drop the read if it is below a specied length Convert quality scores to Phred-33 TOPHRED64: Convert quality scores to Phred-64 29/30

Acknowledgements For input to these and the remaining slides of this course thanks to Norbert Dojer Jerzy Tiuryn Bartosz Wilczy«ski 30/30