Keeping up with DNA technologies



Similar documents
CRAC: An integrated approach to analyse RNA-seq reads Additional File 3 Results on simulated RNA-seq data.

July 7th 2009 DNA sequencing

A Tutorial in Genetic Sequence Classification Tools and Techniques

Automated DNA sequencing 20/12/2009. Next Generation Sequencing

Introduction to NGS data analysis

Computational Genomics. Next generation sequencing (NGS)

Next generation sequencing (NGS)

NGS data analysis. Bernardo J. Clavijo

Introduction to Bioinformatics 3. DNA editing and contig assembly

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Guidelines for Establishment of Contract Areas Computer Science Department

New generation sequencing: current limits and future perspectives. Giorgio Valle CRIBI - Università di Padova

Next Generation Sequencing: Technology, Mapping, and Analysis

Reading DNA Sequences:

NECC History. Karl V. Steiner 2011 Annual NECC Meeting, Orono, Maine March 15, 2011

Analyze Human Genome Using Big Data

BLAST. Anders Gorm Pedersen & Rasmus Wernersson

Introduction to transcriptome analysis using High Throughput Sequencing technologies (HTS)

An Overview of DNA Sequencing

Version 5.0 Release Notes

Overview of Next Generation Sequencing platform technologies

What is a contig? What are the contig assembly programs?

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

GeneSifter: Next Generation Data Management and Analysis for Next Generation Sequencing

HPC-MAQ : A PARALLEL SHORT-READ REFERENCE ASSEMBLER

Searching Nucleotide Databases

Eoulsan Analyse du séquençage à haut débit dans le cloud et sur la grille

Efficient Parallel Execution of Sequence Similarity Analysis Via Dynamic Load Balancing

Deep Sequencing Data Analysis

Cloud-Based Big Data Analytics in Bioinformatics

Microbial Oceanomics using High-Throughput DNA Sequencing

Cloud-enabling Sequence Alignment with Hadoop MapReduce: A Performance Analysis

Large-scale DNA Sequence Analysis in the Cloud: A Stream-based Approach

The NGS IT notes. George Magklaras PhD RHCE

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Modified Genetic Algorithm for DNA Sequence Assembly by Shotgun and Hybridization Sequencing Techniques

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Chapter 2. imapper: A web server for the automated analysis and mapping of insertional mutagenesis sequence data against Ensembl genomes

An FPGA Acceleration of Short Read Human Genome Mapping

NGS Technologies for Genomics and Transcriptomics

Next Generation Sequencing for DUMMIES

Analysis of NGS Data

Sequence Formats and Sequence Database Searches. Gloria Rendon SC11 Education June, 2011

14.3 Studying the Human Genome

Next Generation Sequencing

Master's projects at ITMO University. Daniil Chivilikhin PhD ITMO University

Bioinformatics I, WS 09-10, D. Huson, January 27,

Introduction to next-generation sequencing data

Hadoopizer : a cloud environment for bioinformatics data analysis

How long is long enough?

De Novo Assembly Using Illumina Reads

DNA Sequencing & The Human Genome Project

High Performance Computing for DNA Sequence Alignment and Assembly

Algorithms in Computational Biology (236522) spring 2007 Lecture #1

Final Project Report

MAKING AN EVOLUTIONARY TREE

Next Generation Sequencing; Technologies, applications and data analysis

CD-HIT User s Guide. Last updated: April 5,

Bursting to a Hybrid Cloud for Services OFC 2015

Genomic Applications on Cray supercomputers: Next Generation Sequencing Workflow. Barry Bolding. Cray Inc Seattle, WA

Welcome to Pacific Biosciences' Introduction to SMRTbell Template Preparation.

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Data Analysis for Ion Torrent Sequencing

Nazneen Aziz, PhD. Director, Molecular Medicine Transformation Program Office

DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Bioinformatics: course introduction

Comparing Methods for Identifying Transcription Factor Target Genes

Next generation DNA sequencing technologies. theory & prac-ce

Genomics GENterprise

How-To: SNP and INDEL detection

Figure 1: Genome sizes of different organisms.

FlipFlop: Fast Lasso-based Isoform Prediction as a Flow Problem

Analysis of ChIP-seq data in Galaxy

History of DNA Sequencing & Current Applications

Lecture 4: Exact string searching algorithms. Exact string search algorithms. Definitions. Exact string searching or matching

Fast. Integrated Genome Browser & DAS. Easy. Flexible. Free. bioviz.org/igb

Databases and mapping BWA. Samtools

The Chinese University of Hong Kong igem 2010 Bacterial based storage and encryp2on device

Typing in the NGS era: The way forward!

Two scientific papers (1, 2) recently appeared reporting

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

2 The Human Genome Project

Next Generation Sequencing data Analysis at Genoscope. Jean-Marc Aury

Overview of Genome Assembly Techniques

Sharing Data from Large-scale Biological Research Projects: A System of Tripartite Responsibility

New solutions for Big Data Analysis and Visualization

2.3 Identify rrna sequences in DNA

Transcription:

Keeping up with DNA technologies Mihai Pop Department of Computer Science Center for Bioinformatics and Computational Biology University of Maryland, College Park

The evolution of DNA sequencing Since Technology Read length Throughput/run Throughput/hour cost/run 1977- Sanger sequencing > 1000bp 4hr 400-500 kbp 100 kbp $200 2005-454 pyrosequencing 250-400bp 4hr 100-500 Mbp 25-100 Mbp $13,000 2006- Illumina/Solexa 50-100bp 3 days 2-3 Gbp 2007- ABI SOLiD 35-50bp 3 days 6-20 Gbp 25-40 Mbp $3,000 75-250 Mbp est. $3-5,000 2008- Helicos single molecule 25-50 bp 8 days 10 Gbp ~50 Mbp est. 1Gbp/hour ~$18,000 TBA (2010) Pacific Biosciences single molecule 100-200 kbp???

From DNA to images Image from Helicos through http://www.laserfocusworld.com

What? X 1000 www.1000genomes.org Image from Tree of Life Web Project www.tolweb.org http://dels.nas.edu/metagenomics/ Sage/RNAseq Image from wikipedia

Not too long ago 1995 Haemophilus influenzae 2000 Drosophila melanogaster 2001 Homo sapiens

Assembly Given a collection of short DNA pieces from a genome Reconstruct the original genome sequence Related to Shortest Common Superstring given a set of strings S, find the shortest string that contains all the strings in S as a substring. NP-hard SHOTGUN SEQUENCING shearing original DNA (hopefully) sequencing assembly

Challenges Note: Research on alignment/assembly algorithms is quite old and the basic concepts are unchanged by technology Large amounts of data even data storage/transfer is a problem Special error characteristics most very short reads (25-35 bp instead of 1000s) 454 homopolymer stutter ACAAGAAATGT ACAAGA--TGT ACAAGAA-TGT Helicos high error rates, same DNA fragment sequenced multiple times Pacific Biosciences very long reads, possibly high error rate Special data ABI/SOLiD color space sequencing

Short-read assembly at the CBCB De novo assembly Minimus-SR (w/ Dan Sommer) all-pair alignments performed fast (~2 minutes for 8.6 million reads) key use bit-wise operations (e.g. XOR to test equality) very good results (better than most short read assemblers) Comparative assembly AMOScmp (w/ Adam Phillippy, Daniela Puiu, S. Salzberg) originally developed for Sanger data works for short reads too ABBA (Dan Sommer, Steven Salzberg) comparative assembly with a protein reference (resequencing of genes) Short read alignment Bowtie (w/ Ben Langmead and Steven Salzberg) uses Burrows-Wheeler index of a genome (only 1.1GB for human) up to 30-40-fold faster than other aligners

Optical mapping The better restriction mapping output: ordered list of sizes Restriction Fragment Size(kb) Standard Deviation (kb) -------------------------------------------------------------------- 0 13.802 0.146 1 16.607 0.150 2 11.719 0.119 Map-map alignment and contig-map alignment easier relatively simple dynamic programming

SOMA scaffolding with optical maps Dynamic programming O(m 2 n 2 ) S [i, j]=max 0 k i,0 l j C r i k l j s=k penalty for mismatched sites Bootstrapped computation of p-values (how likely is a match) Scheduling algorithm for near-optimal placement of contigs with multiple good matches. Approx. 80-90% of a genome can be placed in one scaffold i c s t=l j 2 t=l t j o t 2 chi-squared score of alignment btwn fragments 2 best /n best F-test to asses equally good matches 2 other /n other S [k 1, l 1] Nagarajan, Read, Pop. Bioinformatics 2008

Opportunities for collaborations We are driven by data and technology Let us know if you have interesting data-sets... interesting DNA technologies... useful computational technologies (in particular parallel/grid infrastructure and high performance storage) Examples of collaborations with UM School of Medicine study of diarrhea in 3rd world countries with USDA sequencing of cow and turkey with Google/IBM/NSF applications of Cloud Computing to bioinformatics with Opgen Inc. - new ways to analyze optical mapping data... and many more

CBCB Steven Salzberg Art Delcher Carl Kingsford Daniela Puiu Acknowledgments Optical mapping Tim Read Zheng Wang Dave Schwartz Opgen, Inc. Pop group Niranjan Nagarajan Mohammad Ghodsi Sergey Koren Ben Langmead Bo Liu Dan Sommer James White Funding NSF NIH Bill & Melinda Gates Foundation Henry Jackson Foundation OpenOffice.org