DNA SEQUENCING. Shear DNA into millions of small fragments Read nucleotides at a time from the small fragments (Sanger method)

Size: px
Start display at page:

Download "DNA SEQUENCING. Shear DNA into millions of small fragments Read nucleotides at a time from the small fragments (Sanger method)"

Transcription

1 GENOME ASSEMBLY

2 DNA SEQUENCING Shear DNA into millions of small fragments Read nucleotides at a time from the small fragments (Sanger method)

3 WHOLE GENOME SHOTGUN SEQUENCING Start with many copies of genome. Bacterial genome length: 5 million. Fragment them and sequence reads at both ends. Read length: 35 to 1000 bp. Find overlapping reads. ACGTAGAATCGACCATG......AACATAGTTGACGTAGAATC Merge overlapping reads into contigs....aacatagttgacgtagaatcgaccatg... Contig Contig Contig Gap Gap Coverage at this location=2

4 DE NOVO GENOME ASSEMBLY Problem: given a collection of reads, i.e. short subsequences of the genomic sequence in the alphabet A, C, G, T, completely reconstruct the genome from which the reads are derived. Challenges: Repeats in the genome ACCCAGTTGACTGGGATCCTTTTTAAAGACTGGGATTTTAACGCG CAGTTGACTG ACTGGGATCC Sample reads GACTGGGATT Sequencing errors: substitutions, insertions, deletions, and others. TTTTTATAGA (substitution), CCTT TAAACG (deletion and insertion) Size of the data, e.g. 1.5 billion reads in 110GB FASTA/Q file.

5 CHALLENGES IN FRAGMENT ASSEMBLY Repeats: A major problem for fragment assembly > 50% of human genome are repeats: - over 1 million Alu repeats (about 300 bp) - about 200,000 LINE repeats (1000 bp and longer) Repeat Repeat Repeat Green and blue fragments are interchangeable when assembling repetitive DNA

6 DE NOVO GENOME ASSEMBLY CURRENT SOLUTIONS Overlap-layout-consensus (Celera, Newbler) Suitable for low coverage, long reads Highly parallelizable De Bruijn graph construction (ALLPATHS-LG, ABySS, Velvet, SOAPdenovo, EULER-SR, SPAdes, IDBA, and HyDA) Suitable for high coverage, short reads Fast but memory-intensive Sensitive to sequencing errors Mathematically elegant repeat classification

7 OVERLAP-LAYOUT-CONSENSUS Assemblers: ARACHNE, PHRAP, CAP, TIGR, CELERA Overlap: find potentially overlapping reads Layout: merge reads into contigs and contigs into supercontigs Consensus: derive the DNA sequence and correct read errors

8 OVERLAP Find the best match between the suffix of one read and the prefix of another Due to sequencing errors, need to use dynamic programming to find the optimal overlap alignment Apply a filtration method to filter out pairs of fragments that do not share a significantly long common substring

9 OVERLAPPING READS Sort all k-mers in reads (k ~ 24) Find pairs of reads sharing a k-mer Extend to full alignment throw away if not >95% similar TACA TAGATTACACAGATTACT GA TAGT TAGATTACACAGATTACTAGA

10 OVERLAPPING READS AND REPEATS A k-mer that appears N times, initiates N 2 comparisons For an Alu that appears 10 6 times comparisons too much Solution: Discard all k-mers that appear more than t Coverage, (t ~ 10)

11 FINDING OVERLAPPING READS Create local multiple alignments from the overlapping reads TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA

12 FINDING OVERLAPPING READS (CONT D) Correct errors using multiple alignment TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA TAG TTACACAGATTATTGA TAGATTACACAGATTACTGA TAGATTACACAGATTACTGA Score alignments C: 20 C: 35 T: 30 C: 35 C: 40 A: 15 A: 25 - A: 40 A: 25 Accept alignments with good scores C: 20 C: 35 C: 0 C: 35 C: 40 A: 15 A: 25 A: 0 A: 40 A: 25

13 Repeats are a major challenge LAYOUT Do two aligned fragments really overlap, or are they from two copies of a repeat? Solution: repeat masking hide the repeats!!! Masking results in high rate of misassembly (up to 20%) Misassembly means alot more work at the finishing step

14 MERGE READS INTO CONTIGS Merge reads up to potential repeat boundaries

15 MERGE READS INTO CONTIGS Merge reads up to potential repeat boundaries

16 REPEATS, ERRORS, AND CONTIG LENGTHS Repeats shorter than read length are OK Repeats with more base pair differences than sequencing error rate are OK To make a smaller portion of the genome appear repetitive, try to: Increase read length

17 DE BRUIJN GRAPH EXAMPLE SHRED READS INTO K-MERS (K = 3) Read 1 Read 2 G G A C T A A A G A C C A A A T G G A G A C A C T C T A T A A A A A G A C A C C C C A C A A A A A A A T GGA GAC ACT CTA TAA AAA GAC ACC CCA CAA AAA AAT P. Pevzner, J Biomol Struct Dyn (1989) 7:63 73 R. Idury, M. Waterman, J Comput Biol (1995) 2:

18 DE BRUIJN GRAPH EXAMPLE MERGE VERTICES LABELED BY IDENTICAL K-MERS Read 1: GGA GAC ACT CTA TAA AAA Read 2: GAC ACC CCA CAA AAA AAT Resulting Graph: GGA GAC ( 2x ) ACT CTA TAA AAA ( 2x ) AAT ACC CCA CAA

19 ANOTHER EXAMPLE CONSTRUCTING THE GRAPH (K = 4) TGAG ATGA GATG CGAT CCGA TCCG (9x) ( 5x ) ( 6x ) ( 7x ) ( 7x ) ATCC ( 7x ) GATC AGAT GCTC ( 2x ) CTCT TCTA ( 2x ) CTAG ( 2x ) AAGT ( 3x ) AGTC ( 7x ) GTCG ( 9x ) TCGA ( 10x ) CGAG CGAC GAGG ( 16x ) GACG AGGC ( 16x ) ACGC GGCT ( 11x ) GCTT CTTT TTTA TTAG ( 12x ) TAGA ( 16x ) AGAG GAGA ( 9x ) ( 12x ) AGAC ( 9x ) GACA ACAA ( 5x ) A branching vertex is caused by either a repeat in the original sequence or a sequencing error

20 ANOTHER EXAMPLE CONSTRUCTING THE GRAPH (K = 4) TGAG ATGA GATG CGAT CCGA TCCG (9x) ( 5x ) ( 6x ) ( 7x ) ( 7x ) ATCC ( 7x ) GATC AGAT GCTC ( 2x ) CTCT TCTA ( 2x ) CTAG ( 2x ) AAGT ( 3x ) AGTC ( 7x ) GTCG ( 9x ) TCGA ( 10x ) CGAG CGAC GAGG ( 16x ) GACG AGGC ( 16x ) ACGC GGCT ( 11x ) GCTT CTTT TTTA TTAG ( 12x ) TAGA ( 16x ) AGAG GAGA ( 9x ) ( 12x ) AGAC ( 9x ) Sequencing errors are normally detected by a coverage cutoff threshold GACA ACAA ( 5x ) A branching vertex is caused by either a repeat in the original sequence or a sequencing error

21 EXAMPLE AFTER CONDENSATION AGATCCGATGAG AAGTCGA CGAG GCTCTAG CGACGC GAGGCT GCTTTAG TAGA AGAG GAGACAA

22 EXAMPLE AFTER ERROR REMOVAL AGATCCGATGAG AAGTCGA CGAG GAGGCT GCTTTAG TAGA AGAG GAGACAA

23 EXAMPLE AFTER RECONDENSATION AGATCCGATGAG AAGTCGAG GAGGCTTTAGA AGAG GAGACAA Any non-branching path in this graph corresponds to a contig in the original sequence. Taking the risk of following arbitrary branching paths may create chimeric species Source: Serafim Batzoglou

24 RESOLVING REPEATS USING PAIRED READS Genome Read 1 Insert size: a design parameter Read 2

25 RESOLVING REPEATS EQUIVALENT TRANSFORMATIONS Genome: S1 REPEAT S2. S3 REPEAT S4 Matches the distance in the graph, Longer than repeat length S1 S3 REPEAT REPEAT S2 S4 S1 S2 REPEAT S3 S4 P. Pevzner and H. Tang, Bioinformatics (2001) Suppl1:S

26 RESOLVING REPEATS FAILURE Spans multiple paths S1 P1 S2 REPEAT1 REPEAT2 S3 P2 S4 Mate pair transformation (Velvet, ABySS, EULER-SR) Find a unique path between mates in the graph. When multiple paths match the distance between mate-pairs, mate pair transformation fails. To resolve a repeat, insert size must be larger than the repeat length and smaller than the length of potential conjugate paths (same length paths passing through the repeat).

Le Reads devo essere tra,ate per trasformare i da1 in informazioni

Le Reads devo essere tra,ate per trasformare i da1 in informazioni Le Reads devo essere tra,ate per trasformare i da1 in informazioni Assembly In bioinforma1cs, sequence assembly refers to aligning and merging fragments of a much longer DNA sequence in order to reconstruct

More information

Reading DNA Sequences:

Reading DNA Sequences: Reading DNA Sequences: 18-th Century Mathematics for 21-st Century Technology Michael Waterman University of Southern California Tsinghua University DNA Genetic information of an organism Double helix,

More information

De Novo Assembly Using Illumina Reads

De Novo Assembly Using Illumina Reads De Novo Assembly Using Illumina Reads High quality de novo sequence assembly using Illumina Genome Analyzer reads is possible today using publicly available short-read assemblers. Here we summarize the

More information

Next generation sequencing (NGS)

Next generation sequencing (NGS) Next generation sequencing (NGS) Vijayachitra Modhukur BIIT modhukur@ut.ee 1 Bioinformatics course 11/13/12 Sequencing 2 Bioinformatics course 11/13/12 Microarrays vs NGS Sequences do not need to be known

More information

SCIENCE CHINA Life Sciences. Sequence assembly using next generation sequencing data challenges and solutions

SCIENCE CHINA Life Sciences. Sequence assembly using next generation sequencing data challenges and solutions SCIENCE CHINA Life Sciences THEMATIC ISSUE: November 2014 Vol.57 No.11: 1 9 RESEARCH PAPER doi: 10.1007/s11427-014-4752-9 Sequence assembly using next generation sequencing data challenges and solutions

More information

Computational Genomics. Next generation sequencing (NGS)

Computational Genomics. Next generation sequencing (NGS) Computational Genomics Next generation sequencing (NGS) Sequencing technology defies Moore s law Nature Methods 2011 Log 10 (price) Sequencing the Human Genome 2001: Human Genome Project 2.7G$, 11 years

More information

Assembly of Large Genomes using Cloud Computing Michael Schatz. July 23, 2010 Illumina Sequencing Panel

Assembly of Large Genomes using Cloud Computing Michael Schatz. July 23, 2010 Illumina Sequencing Panel Assembly of Large Genomes using Cloud Computing Michael Schatz July 23, 2010 Illumina Sequencing Panel How to compute with 1000s of cores Michael Schatz July 23, 2010 Illumina Sequencing Panel Parallel

More information

Introduction to Bioinformatics 3. DNA editing and contig assembly

Introduction to Bioinformatics 3. DNA editing and contig assembly Introduction to Bioinformatics 3. DNA editing and contig assembly Benjamin F. Matthews United States Department of Agriculture Soybean Genomics and Improvement Laboratory Beltsville, MD 20708 matthewb@ba.ars.usda.gov

More information

Overview of Genome Assembly Techniques

Overview of Genome Assembly Techniques 6 Overview of Genome Assembly Techniques Sun Kim and Haixu Tang The most common laboratory mechanism for reading DNA sequences (e.g., gel electrophoresis) can determine sequence of up to approximately

More information

4.2.1. What is a contig? 4.2.2. What are the contig assembly programs?

4.2.1. What is a contig? 4.2.2. What are the contig assembly programs? Table of Contents 4.1. DNA Sequencing 4.1.1. Trace Viewer in GCG SeqLab Table. Box. Select the editor mode in the SeqLab main window. Import sequencer trace files from the File menu. Select the trace files

More information

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms Introduction Mate pair sequencing enables the generation of libraries with insert sizes in the range of several kilobases (Kb).

More information

GPU-Euler: Sequence Assembly using GPGPU

GPU-Euler: Sequence Assembly using GPGPU Department of Computer Science George Mason University Technical Reports 4400 University Drive MS#4A5 Fairfax, VA 22030-4444 USA http://cs.gmu.edu/ 703-993-1530 GPU-Euler: Sequence Assembly using GPGPU

More information

An Overview of DNA Sequencing

An Overview of DNA Sequencing An Overview of DNA Sequencing Prokaryotic DNA Plasmid http://en.wikipedia.org/wiki/image:prokaryote_cell_diagram.svg Eukaryotic DNA http://en.wikipedia.org/wiki/image:plant_cell_structure_svg.svg DNA Structure

More information

BIOINFORMATICS ORIGINAL PAPER

BIOINFORMATICS ORIGINAL PAPER BIOINFORMATICS ORIGINAL PAPER Vol. 27 no. 2 2011, pages 153 160 doi:10.1093/bioinformatics/btq646 Genome analysis Advance Access publication November 18, 2010 Scoring-and-unfolding trimmed tree assembler:

More information

UCHIME in practice Single-region sequencing Reference database mode

UCHIME in practice Single-region sequencing Reference database mode UCHIME in practice Single-region sequencing UCHIME is designed for experiments that perform community sequencing of a single region such as the 16S rrna gene or fungal ITS region. While UCHIME may prove

More information

Ariella Syma Sasson ALL RIGHTS RESERVED

Ariella Syma Sasson ALL RIGHTS RESERVED 2010 Ariella Syma Sasson ALL RIGHTS RESERVED FROM MILLIONS TO ONE: THEORETICAL AND CONCRETE APPROACHES TO DE NOVO ASSEMBLY USING SHORT READ DNA SEQUENCES by ARIELLA SYMA SASSON A Dissertation submitted

More information

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment Tutorial for Windows and Macintosh Preparing Your Data for NGS Alignment 2015 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) 1.734.769.7249

More information

Next Generation Sequencing: Technology, Mapping, and Analysis

Next Generation Sequencing: Technology, Mapping, and Analysis Next Generation Sequencing: Technology, Mapping, and Analysis Gary Benson Computer Science, Biology, Bioinformatics Boston University gbenson@bu.edu http://tandem.bu.edu/ The Human Genome Project took

More information

A Tutorial in Genetic Sequence Classification Tools and Techniques

A Tutorial in Genetic Sequence Classification Tools and Techniques A Tutorial in Genetic Sequence Classification Tools and Techniques Jake Drew Data Mining CSE 8331 Southern Methodist University jakemdrew@gmail.com www.jakemdrew.com Sequence Characters IUPAC nucleotide

More information

Commonly Used STR Markers

Commonly Used STR Markers Commonly Used STR Markers Repeats Satellites 100 to 1000 bases repeated Minisatellites VNTR variable number tandem repeat 10 to 100 bases repeated Microsatellites STR short tandem repeat 2 to 6 bases repeated

More information

Searching Nucleotide Databases

Searching Nucleotide Databases Searching Nucleotide Databases 1 When we search a nucleic acid databases, Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from the forward strand and 3 reading frames

More information

How long is long enough?

How long is long enough? How long is long enough? - Modeling to predict genome assembly performance - Hayan Lee@Schatz Lab Feb 26, 2014 Quantitative Biology Seminar 1 Outline Background Assembly history Recent sequencing technology

More information

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want 1 When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want to search other databases as well. There are very

More information

Biology in the Clouds

Biology in the Clouds Biology in the Clouds Michael Schatz July 30, 2012 Science Cloud Summer School Outline 1. Milestones in genomics 1. Quick primer on molecular biology 2. The evolution of sequencing 2. Hadoop Applications

More information

Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization Sequencing Techniques

Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization Sequencing Techniques International Journal of Electronics and Computer Science Engineering 2000 Available Online at www.ijecse.org ISSN- 2277-1956 Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization

More information

RESTRICTION DIGESTS Based on a handout originally available at

RESTRICTION DIGESTS Based on a handout originally available at RESTRICTION DIGESTS Based on a handout originally available at http://genome.wustl.edu/overview/rst_digest_handout_20050127/restrictiondigest_jan2005.html What is a restriction digests? Cloned DNA is cut

More information

ABySS-Explorer: Visualizing Genome Sequence Assemblies

ABySS-Explorer: Visualizing Genome Sequence Assemblies IEEE TRANSACTIONS ON VISUALIZATION AND COMPUTER GRAPHICS, VOL. 15, NO. 6, NOVEMBER/DECEMBER 2009 881 ABySS-Explorer: Visualizing Genome Sequence Assemblies Cydney B. Nielsen, Shaun D. Jackman, Inanç Birol,

More information

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe Go where the biology takes you. To published results faster With proven scalability To the forefront of discovery To limitless applications

More information

DNA Sequencing & The Human Genome Project

DNA Sequencing & The Human Genome Project DNA Sequencing & The Human Genome Project An Endeavor Revolutionizing Modern Biology Jutta Marzillier, Ph.D Lehigh University Biological Sciences November 13 th, 2013 Guess, who turned 60 earlier this

More information

GENEWIZ, Inc. DNA Sequencing Service Details for USC Norris Comprehensive Cancer Center DNA Core

GENEWIZ, Inc. DNA Sequencing Service Details for USC Norris Comprehensive Cancer Center DNA Core DNA Sequencing Services Pre-Mixed o Provide template and primer, mixed into the same tube* Pre-Defined o Provide template and primer in separate tubes* Custom o Full-service for samples with unknown concentration

More information

Gene Finding CMSC 423

Gene Finding CMSC 423 Gene Finding CMSC 423 Finding Signals in DNA We just have a long string of A, C, G, Ts. How can we find the signals encoded in it? Suppose you encountered a language you didn t know. How would you decipher

More information

Two scientific papers (1, 2) recently appeared reporting

Two scientific papers (1, 2) recently appeared reporting On the sequencing of the human genome Robert H. Waterston*, Eric S. Lander, and John E. Sulston *Genome Sequencing Center, Washington University, Saint Louis, MO 63108; Whitehead Institute Massachusetts

More information

Bioinformatic Approaches for Genome Finishing

Bioinformatic Approaches for Genome Finishing Bioinformatic Approaches for Genome Finishing Ph. D. Thesis submitted to the Faculty of Technology, Bielefeld University, Germany for the degree of Dr. rer. nat. by Peter Husemann July, 2011 Referees:

More information

Keeping up with DNA technologies

Keeping up with DNA technologies Keeping up with DNA technologies Mihai Pop Department of Computer Science Center for Bioinformatics and Computational Biology University of Maryland, College Park The evolution of DNA sequencing Since

More information

The prevailing method of determining

The prevailing method of determining C OMPUTATIONAL B IOLOGY WHOLE-GENOME DNA SEQUENCING Computation is integrally and powerfully involved with the DNA sequencing technology that promises to reveal the complete human DNA sequence in the next

More information

Applications of MapReduce to the Life Science

Applications of MapReduce to the Life Science Genetic Sequence Analysis in the Clouds: Applications of MapReduce to the Life Science Jimmy Lin, Michael Schatz, and Ben Langmead University of Maryland Wednesday, June 10, 2009 This work is licensed

More information

3.1 Solving Systems Using Tables and Graphs

3.1 Solving Systems Using Tables and Graphs Algebra 2 Chapter 3 3.1 Solve Systems Using Tables & Graphs 3.1 Solving Systems Using Tables and Graphs A solution to a system of linear equations is an that makes all of the equations. To solve a system

More information

454 Sequencing System Software Manual Version 2.6

454 Sequencing System Software Manual Version 2.6 454 Sequencing System Software Manual Version 2.6 Part C: May 2011 Instrument / Kit GS Junior / Junior GS FL+ / L+ GS FL+ / LR70 GS FL / LR70 For life science research only. Not for use in diagnostic procedures.

More information

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company

Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Chapter 8: Recombinant DNA 2002 by W. H. Freeman and Company Genetic engineering: humans Gene replacement therapy or gene therapy Many technical and ethical issues implications for gene pool for germ-line gene therapy what traits constitute disease rather than just

More information

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data The Illumina TopHat Alignment and Cufflinks Assembly and Differential Expression apps make RNA data analysis accessible to any user, regardless

More information

DNA Sequencing Data Compression. Michael Chung

DNA Sequencing Data Compression. Michael Chung DNA Sequencing Data Compression Michael Chung Problem DNA sequencing per dollar is increasing faster than storage capacity per dollar. Stein (2010) Data 3 billion base pairs in human genome Genomes are

More information

Cloud Computing and the DNA Data Race Michael Schatz. June 8, 2011 HPDC 11/3DAPAS/ECMLS

Cloud Computing and the DNA Data Race Michael Schatz. June 8, 2011 HPDC 11/3DAPAS/ECMLS Cloud Computing and the DNA Data Race Michael Schatz June 8, 2011 HPDC 11/3DAPAS/ECMLS Outline 1. Milestones in DNA Sequencing 2. Hadoop & Cloud Computing 3. Sequence Analysis in the Clouds 1. Sequence

More information

Deep Sequencing Data Analysis

Deep Sequencing Data Analysis Deep Sequencing Data Analysis Ross Whetten Professor Forestry & Environmental Resources Background Who am I, and why am I teaching this topic? I am not an expert in bioinformatics I started as a biologist

More information

Bioinformatics and its applications

Bioinformatics and its applications Bioinformatics and its applications Alla L Lapidus, Ph.D. SPbAU, SPbSU, St. Petersburg Term Bioinformatics Term Bioinformatics was invented by Paulien Hogeweg (Полина Хогевег) and Ben Hesper in 1970 as

More information

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Computational Challenges in Storage, Analysis and Interpretation of Next-Generation Sequencing Data Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center Next Generation Sequencing

More information

CCR Biology - Chapter 9 Practice Test - Summer 2012

CCR Biology - Chapter 9 Practice Test - Summer 2012 Name: Class: Date: CCR Biology - Chapter 9 Practice Test - Summer 2012 Multiple Choice Identify the choice that best completes the statement or answers the question. 1. Genetic engineering is possible

More information

Protein Protein Interaction Networks

Protein Protein Interaction Networks Functional Pattern Mining from Genome Scale Protein Protein Interaction Networks Young-Rae Cho, Ph.D. Assistant Professor Department of Computer Science Baylor University it My Definition of Bioinformatics

More information

How is genome sequencing done?

How is genome sequencing done? How is genome sequencing done? Using 454 Sequencing on the Genome Sequencer FLX System, DNA from a genome is converted into sequence data through four primary steps: Step One DNA sample preparation; Step

More information

Cloud Computing and the DNA Data Race Michael Schatz. March 28, 2011 Columbia University

Cloud Computing and the DNA Data Race Michael Schatz. March 28, 2011 Columbia University Cloud Computing and the DNA Data Race Michael Schatz March 28, 2011 Columbia University Outline 1. Milestones in DNA Sequencing 2. Hadoop & Cloud Computing 3. Sequence Analysis in the Clouds 1. Sequence

More information

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing

Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing KOO10 5/31/04 12:17 PM Page 131 10 Geospiza s Finch-Server: A Complete Data Management System for DNA Sequencing Sandra Porter, Joe Slagel, and Todd Smith Geospiza, Inc., Seattle, WA Introduction The increased

More information

Accurately and Efficiently

Accurately and Efficiently Scoring-and-Unfolding Trimmed Tree Assembler: Algorithms for Assembling Genome Sequences Accurately and Efficiently by Giuseppe Narzisi A dissertation submitted in partial fulfillment of the requirements

More information

Introduction to NGS data analysis

Introduction to NGS data analysis Introduction to NGS data analysis Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Sequencing Illumina platforms Characteristics: High

More information

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions

A greedy algorithm for the DNA sequencing by hybridization with positive and negative errors and information about repetitions BULLETIN OF THE POLISH ACADEMY OF SCIENCES TECHNICAL SCIENCES, Vol. 59, No. 1, 2011 DOI: 10.2478/v10175-011-0015-0 Varia A greedy algorithm for the DNA sequencing by hybridization with positive and negative

More information

High Performance Computing for DNA Sequence Alignment and Assembly

High Performance Computing for DNA Sequence Alignment and Assembly High Performance Computing for DNA Sequence Alignment and Assembly Michael C. Schatz May 18, 2010 Stone Ridge Technology Outline 1.! Sequence Analysis by Analogy 2.! DNA Sequencing and Genomics 3.! High

More information

PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA

PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA PAGANTEC: OPENMP PARALLEL ERROR CORRECTION FOR NEXT-GENERATION SEQUENCING DATA Markus Joppich, Tony Bolger, Dirk Schmidl Björn Usadel and Torsten Kuhlen Markus Joppich Lehr- und Forschungseinheit Bioinformatik

More information

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes 2.1 Introduction Large-scale insertional mutagenesis screening in

More information

MiSeq: Imaging and Base Calling

MiSeq: Imaging and Base Calling MiSeq: Imaging and Page Welcome Navigation Presenter Introduction MiSeq Sequencing Workflow Narration Welcome to MiSeq: Imaging and. This course takes 35 minutes to complete. Click Next to continue. Please

More information

Clone Manager. Getting Started

Clone Manager. Getting Started Clone Manager for Windows Professional Edition Volume 2 Alignment, Primer Operations Version 9.5 Getting Started Copyright 1994-2015 Scientific & Educational Software. All rights reserved. The software

More information

Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows

Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline Workflows Genes 2012, 3, 545-575; doi:10.3390/genes3030545 Article OPEN ACCESS genes ISSN 2073-4425 www.mdpi.com/journal/genes Next Generation Sequence Analysis and Computational Genomics Using Graphical Pipeline

More information

Visualization of Phylogenetic Trees and Metadata

Visualization of Phylogenetic Trees and Metadata Visualization of Phylogenetic Trees and Metadata November 27, 2015 Sample to Insight CLC bio, a QIAGEN Company Silkeborgvej 2 Prismet 8000 Aarhus C Denmark Telephone: +45 70 22 32 44 www.clcbio.com support-clcbio@qiagen.com

More information

(http://genomes.urv.es/caical) TUTORIAL. (July 2006)

(http://genomes.urv.es/caical) TUTORIAL. (July 2006) (http://genomes.urv.es/caical) TUTORIAL (July 2006) CAIcal manual 2 Table of contents Introduction... 3 Required inputs... 5 SECTION A Calculation of parameters... 8 SECTION B CAI calculation for FASTA

More information

Copy Number Variation: available tools

Copy Number Variation: available tools Copy Number Variation: available tools Jeroen F. J. Laros Leiden Genome Technology Center Department of Human Genetics Center for Human and Clinical Genetics Introduction A literature review of available

More information

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory

Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Lawrence Berkeley National Laboratory Title: Outline of the Assembly process: JAZZ, the JGI In-House Assembler Author: Shapiro, Harris Publication Date: 07-08-2005

More information

Picture Maze Generation by Successive Insertion of Path Segment

Picture Maze Generation by Successive Insertion of Path Segment 1 2 3 Picture Maze Generation by Successive Insertion of Path Segment 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32. ABSTRACT Tomio Kurokawa 1* 1 Aichi Institute of Technology,

More information

Working with Spreadsheets

Working with Spreadsheets osborne books Working with Spreadsheets UPDATE SUPPLEMENT 2015 The AAT has recently updated its Study and Assessment Guide for the Spreadsheet Software Unit with some minor additions and clarifications.

More information

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova New generation sequencing: current limits and future perspectives Giorgio Valle CRIBI Università di Padova Around 2004 the Race for the 1000$ Genome started A few questions... When? How? Why? Standard

More information

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA

Nebula A web-server for advanced ChIP-seq data analysis. Tutorial. by Valentina BOEVA Nebula A web-server for advanced ChIP-seq data analysis Tutorial by Valentina BOEVA Content Upload data to the history pp. 5-6 Check read number and sequencing quality pp. 7-9 Visualize.BAM files in UCSC

More information

Version 5.0 Release Notes

Version 5.0 Release Notes Version 5.0 Release Notes 2011 Gene Codes Corporation Gene Codes Corporation 775 Technology Drive, Ann Arbor, MI 48108 USA 1.800.497.4939 (USA) +1.734.769.7249 (elsewhere) +1.734.769.7074 (fax) www.genecodes.com

More information

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison RETRIEVING SEQUENCE INFORMATION Nucleotide sequence databases Database search Sequence alignment and comparison Biological sequence databases Originally just a storage place for sequences. Currently the

More information

Sequence Assembly. Karsten Scheibye-Alsing Steve Hoffmann Annett M. Frankel Peter Jensen Peter F. Stadler SFI WORKING PAPER: 2009-04-010

Sequence Assembly. Karsten Scheibye-Alsing Steve Hoffmann Annett M. Frankel Peter Jensen Peter F. Stadler SFI WORKING PAPER: 2009-04-010 Sequence Assembly Karsten Scheibye-Alsing Steve Hoffmann Annett M. Frankel Peter Jensen Peter F. Stadler SFI WORKING PAPER: 2009-04-010 SFI Working Papers contain accounts of scientific work of the author(s)

More information

Merging Labels, Letters, and Envelopes Word 2013

Merging Labels, Letters, and Envelopes Word 2013 Merging Labels, Letters, and Envelopes Word 2013 Merging... 1 Types of Merges... 1 The Merging Process... 2 Labels - A Page of the Same... 2 Labels - A Blank Page... 3 Creating Custom Labels... 3 Merged

More information

Human-Mouse Synteny in Functional Genomics Experiment

Human-Mouse Synteny in Functional Genomics Experiment Human-Mouse Synteny in Functional Genomics Experiment Ksenia Krasheninnikova University of the Russian Academy of Sciences, JetBrains krasheninnikova@gmail.com September 18, 2012 Ksenia Krasheninnikova

More information

GenBank: A Database of Genetic Sequence Data

GenBank: A Database of Genetic Sequence Data GenBank: A Database of Genetic Sequence Data Computer Science 105 Boston University David G. Sullivan, Ph.D. An Explosion of Scientific Data Scientists are generating ever increasing amounts of data. Relevant

More information

Personal Genome Sequencing with Complete Genomics Technology. Maido Remm

Personal Genome Sequencing with Complete Genomics Technology. Maido Remm Personal Genome Sequencing with Complete Genomics Technology Maido Remm 11 th Oct 2010 Three related papers 1. Describing the Complete Genomics technology Drmanac et al., Science 1 January 2010: Vol. 327.

More information

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

Focusing on results not data comprehensive data analysis for targeted next generation sequencing Focusing on results not data comprehensive data analysis for targeted next generation sequencing Daniel Swan, Jolyon Holdstock, Angela Matchan, Richard Stark, John Shovelton, Duarte Mohla and Simon Hughes

More information

CHALLENGES IN NEXT-GENERATION SEQUENCING

CHALLENGES IN NEXT-GENERATION SEQUENCING CHALLENGES IN NEXT-GENERATION SEQUENCING BASIC TENETS OF DATA AND HPC Gray s Laws of data engineering 1 : Scientific computing is very dataintensive, with no real limits. The solution is scale-out architecture

More information

pcas-guide System Validation in Genome Editing

pcas-guide System Validation in Genome Editing pcas-guide System Validation in Genome Editing Tagging HSP60 with HA tag genome editing The latest tool in genome editing CRISPR/Cas9 allows for specific genome disruption and replacement in a flexible

More information

Radius Maps and Notification Mailing Lists

Radius Maps and Notification Mailing Lists Radius Maps and Notification Mailing Lists To use the online map service for obtaining notification lists and location maps, start the mapping service in the browser (mapping.archuletacounty.org/map).

More information

BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data

BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data BIGDATA: Small: DA: DCM: Low-memory Streaming Prefilters for Biological Sequencing Data C. Titus Brown July 2012 Summary We will soon be able to exhaustively sequence the DNA and RNA of entire communities

More information

Metagenomic Assembly

Metagenomic Assembly Metagenomic Assembly Successes, Validation, And Challenges Matthew Scholz Los Alamos National Laboratory Genome Sciences Division Metagenomic Assembly is Complex Complexity dictates method of assembly

More information

GAST, A GENOMIC ALIGNMENT SEARCH TOOL

GAST, A GENOMIC ALIGNMENT SEARCH TOOL Kalle Karhu, Juho Mäkinen, Jussi Rautio, Jorma Tarhio Department of Computer Science and Engineering, Aalto University, Espoo, Finland {kalle.karhu, jorma.tarhio}@aalto.fi Hugh Salamon AbaSci, LLC, San

More information

Genome Sequencing. Phil McClean September, 2005

Genome Sequencing. Phil McClean September, 2005 Genome Sequencing Phil McClean September, 2005 The concept of genome sequencing is quite simple. Break your genome up into many different small fragments, clone those fragments into a cloning vector, isolate

More information

Graph Theory Problems and Solutions

Graph Theory Problems and Solutions raph Theory Problems and Solutions Tom Davis tomrdavis@earthlink.net http://www.geometer.org/mathcircles November, 005 Problems. Prove that the sum of the degrees of the vertices of any finite graph is

More information

On Covert Data Communication Channels Employing DNA Recombinant and Mutagenesis-based Steganographic Techniques

On Covert Data Communication Channels Employing DNA Recombinant and Mutagenesis-based Steganographic Techniques On Covert Data Communication Channels Employing DNA Recombinant and Mutagenesis-based Steganographic Techniques MAGDY SAEB 1, EMAN EL-ABD 2, MOHAMED E. EL-ZANATY 1 1. School of Engineering, Computer Department,

More information

Typing in the NGS era: The way forward!

Typing in the NGS era: The way forward! Typing in the NGS era: The way forward! Valeria Michelacci NGS course, June 2015 Typing from sequence data NGS-derived conventional Multi Locus Sequence Typing (University of Warwick, 7 housekeeping genes)

More information

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis 2012 4th International Conference on Bioinformatics and Biomedical Technology IPCBEE vol.29 (2012) (2012) IACSIT Press, Singapore Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance

More information

Analyzing A DNA Sequence Chromatogram

Analyzing A DNA Sequence Chromatogram LESSON 9 HANDOUT Analyzing A DNA Sequence Chromatogram Student Researcher Background: DNA Analysis and FinchTV DNA sequence data can be used to answer many types of questions. Because DNA sequences differ

More information

PROYECTO GENOMA HUMANO. Medicina Molecular 2011 Maestría en Biología Molecular Médica- UBA

PROYECTO GENOMA HUMANO. Medicina Molecular 2011 Maestría en Biología Molecular Médica- UBA PROYECTO GENOMA HUMANO Medicina Molecular 2011 Maestría en Biología Molecular Médica- UBA GENOMA HUMANO Medicina Molecular 2011 Maestría en Biología Molecular Médica- UBA Some landmarks on the way to 1940s

More information

Next Generation Sequencing for DUMMIES

Next Generation Sequencing for DUMMIES Next Generation Sequencing for DUMMIES Looking at a presentation without the explanation from the author is sometimes difficult to understand. This document contains extra information for some slides that

More information

The Chinese University of Hong Kong igem 2010 Bacterial based storage and encryp2on device

The Chinese University of Hong Kong igem 2010 Bacterial based storage and encryp2on device The Chinese University of Hong Kong igem 2010 Bacterial based storage and encryp2on device Bacterial based informaaon storage device Bancro4 s group (2001) Mount Sinai School of Meducube Yachie s group

More information

Overview of Eukaryotic Gene Prediction

Overview of Eukaryotic Gene Prediction Overview of Eukaryotic Gene Prediction CBB 231 / COMPSCI 261 W.H. Majoros What is DNA? Nucleus Chromosome Telomere Centromere Cell Telomere base pairs histones DNA (double helix) DNA is a Double Helix

More information

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations

Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations Activity IT S ALL RELATIVES The Role of DNA Evidence in Forensic Investigations SCENARIO You have responded, as a result of a call from the police to the Coroner s Office, to the scene of the death of

More information

Analysis of ChIP-seq data in Galaxy

Analysis of ChIP-seq data in Galaxy Analysis of ChIP-seq data in Galaxy November, 2012 Local copy: https://galaxy.wi.mit.edu/ Joint project between BaRC and IT Main site: http://main.g2.bx.psu.edu/ 1 Font Conventions Bold and blue refers

More information

DOI: 10.1126/science.287.5461.2196

DOI: 10.1126/science.287.5461.2196 A Whole-Genome Assembly of Drosophila Eugene W. Myers, et al. Science 287, 2196 (2000); DOI: 10.1126/science.287.5461.2196 This copy is for your personal, non-commercial use only. If you wish to distribute

More information

Publisher 2010 Cheat Sheet

Publisher 2010 Cheat Sheet April 20, 2012 Publisher 2010 Cheat Sheet Toolbar customize click on arrow and then check the ones you want a shortcut for File Tab (has new, open save, print, and shows recent documents, and has choices

More information

Automated correction of genome sequence errors

Automated correction of genome sequence errors Published online January 26, 2004 562±569 Nucleic Acids Research, 2004, Vol. 32, No. 2 DOI: 10.1093/nar/gkh216 Automated correction of genome sequence errors Pawel Gajer*, Michael Schatz and Steven L.

More information

Genome and DNA Sequence Databases. BME 110/BIOL 181 CompBio Tools Todd Lowe March 31, 2009

Genome and DNA Sequence Databases. BME 110/BIOL 181 CompBio Tools Todd Lowe March 31, 2009 Genome and DNA Sequence Databases BME 110/BIOL 181 CompBio Tools Todd Lowe March 31, 2009 Admin Reading: Chapters 1 & 2 Notes available in PDF format on-line (see class calendar page): http://www.soe.ucsc.edu/classes/bme110/spring09/bme110-calendar.html

More information

Pairwise Sequence Alignment

Pairwise Sequence Alignment Pairwise Sequence Alignment carolin.kosiol@vetmeduni.ac.at SS 2013 Outline Pairwise sequence alignment global - Needleman Wunsch Gotoh algorithm local - Smith Waterman algorithm BLAST - heuristics What

More information

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph

Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph Reference-free compression of high throughput sequencing data with a probabilistic de Bruijn graph Gaëtan Benoit, Claire Lemaitre, Dominique Lavenier, Erwan Drezen, Thibault Dayris, Raluca Uricaru, Guillaume

More information

Welcome to Pacific Biosciences' Introduction to SMRTbell Template Preparation.

Welcome to Pacific Biosciences' Introduction to SMRTbell Template Preparation. Introduction to SMRTbell Template Preparation 100 338 500 01 1. SMRTbell Template Preparation 1.1 Introduction to SMRTbell Template Preparation Welcome to Pacific Biosciences' Introduction to SMRTbell

More information