DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky




Overview
- Summary
- Research Paper 1
- Research Paper 2
- Research Paper 3
- Current Progress
- Software Designs to Come

Summary
Next generation sequencing (NGS) allows genetic information to be sequenced and then analyzed through rigorous computation. To obtain genetic data, a biological sample must first be chemically prepared. Once prepared, the sample can be run through one of the various NGS technologies to be sequenced.

Summary
The data generated by the sequencer is in the form of reads. Reads are strings of nucleotides, each a partial copy of the genetic material of interest. Reads can range from tens to thousands of nucleotides long. After a few quality control measures, the reads are ready to be analyzed. For analysis, the reads must be mapped and aligned to a reference of interest in order to compute results such as differential expression.

Summary
A common algorithm that has spawned many derivatives within the mapping/alignment program community is the Seed and Extend method.

Summary
The Seed and Extend (reference-hashed) method breaks the problem down into these steps:
1. Index the reference sequence via a hash.
2. Break reads into seeds, or smaller portions.
3. For each seed from a read:
   - Find the most unique seed within the reference.
   - Extend the seed outward to check for a more confident match.
   - Record the read's location with respect to the reference sequence.
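The steps above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the design from the slides: the names build_index and map_read, the non-overlapping seed layout, the mismatch-counting extension, and the max_mismatches threshold are all hypothetical choices made for the example.

```python
from collections import defaultdict


def build_index(reference, seed_len):
    """Step 1: hash every seed_len-long substring of the reference to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return index


def map_read(read, reference, index, seed_len, max_mismatches=2):
    """Return (reference_position, mismatches) for the best placement, or None."""
    # Step 2: break the read into non-overlapping seeds, remembering each offset.
    seeds = [(off, read[off:off + seed_len])
             for off in range(0, len(read) - seed_len + 1, seed_len)]
    # Step 3a: keep seeds found in the reference, rarest (most unique) first.
    seeds = [(off, seed) for off, seed in seeds if seed in index]
    seeds.sort(key=lambda item: len(index[item[1]]))

    best = None
    for off, seed in seeds:
        for hit in index[seed]:
            start = hit - off  # where the whole read would begin in the reference
            if start < 0 or start + len(read) > len(reference):
                continue
            # Step 3b: extend by comparing the full read with the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)  # Step 3c: record the location
        if best is not None and best[1] == 0:
            break  # an exact placement cannot be improved
    return best


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACCGGTATCGA"
    index = build_index(reference, seed_len=4)
    print(map_read("GCAAGGCTTAAC", reference, index, seed_len=4))
```

Trying the rarest seed first keeps the number of candidate positions small, which is one plausible reading of "most unique seed". This extension only counts substitutions; insertions and deletions would need a real alignment step.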


A hybrid short read mapping accelerator
Authors: Yupeng Chen, Bertil Schmidt & Douglas L. Maskell
Publication: BMC Bioinformatics
Date: February 2013
DOI: 10.1186/1471-2105-14-67
Abstract: A hybrid of parallel software and special hardware, specifically a field programmable gate array, is used to provide faster processing of mapping-based sequence assembly while maintaining accuracy.

A hybrid short read mapping accelerator
Problem caused by the ever-growing volume of short read sequence data:
- Fast methods do not accommodate much error.
- Approaches that do handle error well tend to be impractically slow.
Goal of this technique: to improve both the speed and the accuracy of short read alignment (SRA).
Most previous answers focus on software only.

A hybrid short read mapping accelerator
This approach:
- Indexes the genomic template once and saves the index for future use. The index is generated as a separate process before program execution.
- Uses a fixed seed length. This is required in order to always use the same genomic template index for a given species.

A hybrid short read mapping accelerator
This approach uses a hybrid of both software and a type of hardware known as field programmable gate arrays, or FPGAs.
FPGAs:
- Great potential for massively parallel computations.
- Require additional design work for implementation.
- Few attempts have been made to utilize an FPGA-based approach.

A hybrid short read mapping accelerator
Hardware: FPGA (one Virtex5 FPGA chip)
- Used for the seed generation and sequence alignment processes.
- Both tasks demand large amounts of computational resources.
- Data in memory is divided between the host PC and the FPGA.

A hybrid short read mapping accelerator
Software:
- Uses the seed-and-extend method commonly used in short read aligners.
- The two stages of the algorithm are each run in parallel:
  - Seed generation is done in parallel.
  - Seed extension is done in parallel.

A hybrid short read mapping accelerator
Seed extension process:
- Longest running time of any part of the algorithm.
- Well suited for FPGA parallelization.
- Primarily composed of repeated random access of a sizable lookup table.

A hybrid short read mapping accelerator
Division of tasks: CPU (less demanding tasks):
- Converting reads into a binary representation (2 bits per nucleotide).
- Sending the encoded reads to the short read alignment process on the FPGA.
- Sending commands to the process on the FPGA.
- Accepting the results and writing them to disk.

A hybrid short read mapping accelerator
Division of tasks: FPGA (highly computationally intensive and parallelizable tasks):
- Generation of seeds.
- Extension of seed matches.

A hybrid short read mapping accelerator
Results:
- Seed extension (previously the most time-consuming step) was sped up to the point where it was no longer the bottleneck of the short read alignment process.
- Seed generation is now the bottleneck.
- The authors cite future plans to parallelize the initial step of encoding the reads for further speedup.

A hybrid short read mapping accelerator
What can we use from this?
We are unlikely to be able to use FPGA hardware. However, we can use some of the software concepts from this approach in our solution.

A hybrid short read mapping accelerator
What can we use from this?
Software concepts: more than one portion of the short read alignment process can be made parallel:
- Initial encoding of reads.
- Generation of seeds.
- Extension of initial seed matches.

A hybrid short read mapping accelerator
What can we use from this?
Software concepts: to further reduce execution time, the genomic template can be indexed prior to program execution. This index need only be generated once for a given species, and can be reused many times.
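One way to reuse an index across runs is to save it to disk the first time it is built. The sketch below assumes a simple dictionary index and uses Python's pickle module; the function names, the file layout, and the seed_len check are hypothetical choices for illustration and are not taken from the paper.

```python
import os
import pickle


def build_index(reference, seed_len):
    """Map each seed_len-long substring of the reference to its start positions."""
    index = {}
    for pos in range(len(reference) - seed_len + 1):
        index.setdefault(reference[pos:pos + seed_len], []).append(pos)
    return index


def load_or_build_index(reference, seed_len, path):
    """Return (index, loaded_from_disk). Build and save the index only if needed."""
    if os.path.exists(path):
        with open(path, "rb") as handle:
            saved = pickle.load(handle)
        # A fixed seed length is what lets the same saved index be reused.
        if saved["seed_len"] == seed_len:
            return saved["index"], True
    index = build_index(reference, seed_len)
    with open(path, "wb") as handle:
        pickle.dump({"seed_len": seed_len, "index": index}, handle)
    return index, False


if __name__ == "__main__":
    import tempfile

    path = os.path.join(tempfile.mkdtemp(), "reference.idx")
    _, cached = load_or_build_index("ACGTACGT", 4, path)
    print("first run loaded from disk:", cached)
    _, cached = load_or_build_index("ACGTACGT", 4, path)
    print("second run loaded from disk:", cached)
```

This sketch does not detect a changed reference sequence; a saved index would have to be deleted by hand if the reference were replaced.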

Efficient storage of high throughput DNA sequencing data using reference-based compression
Authors: Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney
Publication: Genome Research
Date: January 2011
DOI: 10.1101/gr.114819.110
Abstract: Data storage costs have become an appreciable proportion of the total cost of creating and analyzing DNA sequence data, hence the need for efficient storage of high throughput DNA sequencing data using reference-based compression.

Efficient storage of high throughput DNA sequencing data using reference-based compression
Problem: There are many challenges in handling the next generation of sequence data, from the highly fragmented nature of the shorter reads generated by the new technologies to the storage, analysis, and computational requirements for such large data volumes. The main concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity.

Efficient storage of high throughput DNA sequencing data using reference-based compression
Addressing the issue: New sequences are aligned to a reference genome, and only the differences between each new sequence and the reference genome are encoded and stored. This takes less storage than keeping the full sequences.
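The idea of storing only differences can be illustrated with a small sketch. It assumes the read has already been aligned at a known position and differs from the reference only by substitutions; the record layout (position, length, list of differences) and the function names are hypothetical and are not the format used in the paper, which also deals with quality information and unaligned reads.

```python
def encode_read(read, reference, position):
    """Encode an aligned read as (position, length, [(offset, base), ...])."""
    window = reference[position:position + len(read)]
    if len(window) != len(read):
        raise ValueError("read does not fit in the reference at this position")
    # Keep only the bases that differ from the reference.
    diffs = [(offset, base)
             for offset, (base, ref_base) in enumerate(zip(read, window))
             if base != ref_base]
    return position, len(read), diffs


def decode_read(record, reference):
    """Rebuild the original read from its record and the reference."""
    position, length, diffs = record
    bases = list(reference[position:position + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)


if __name__ == "__main__":
    reference = "ACGTACGTAC"
    record = encode_read("GTTC", reference, 2)
    print(record)
    print(decode_read(record, reference))
```

A read that matches the reference exactly is reduced to a position and a length, which is where the storage saving comes from.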

Efficient storage of high throughput DNA sequencing data using reference-based compression
The efficiency of the compression method increases exponentially with read length: the longer the reads, the greater the compression efficiency. The magnitude of this efficiency gain can be controlled by changing the amount of quality information stored.

Efficient storage of high throughput DNA sequencing data using reference-based compression
Prior to 2005, the rate of increase in sequencing capacity was close to the rate of increase in disk storage capacity on a per-unit-cost basis.

Efficient storage of high throughput DNA sequencing data using reference-based compression
Given the potential memory demands of this project, this new concept of structuring our data may help us in the future. If we foresee this memory bottleneck within our program, we will incorporate this approach to read/reference storage and analysis.

Sense from sequence reads: methods for alignment and assembly
Authors: Paul Flicek, Ewan Birney
Publication: Nature Methods, Volume 6, No. 11s
Date: November 2009
DOI: 10.1038/nmeth.1376
Abstract: Discussion of the current algorithms behind mapping/alignment and assembly programs and the future directions of these algorithms.

Sense from sequence reads: methods for alignment and assembly
- General overview of the importance of mapping/alignment and assembly within the scientific community.
- The alignment/mapping portion is split into two major algorithmic types: Seed and Extend (hash-based) and Burrows-Wheeler Transform (BWT).
- Explains the basic structures of the above algorithms.
- Our main interest would be the Seed and Extend based methods.

Sense from sequence reads: methods for alignment and assembly
Two types of hash indexes: reference-based and read-based.
Reference-based hashing reads the reference into a hash in sections and matches reads against the hashed index.
- Pros: Fast look-up
- Cons: High memory footprint
Read-based hashing reads the reads into a hash as seeds, and the reference is then scanned to search the hash.
- Pros: Small memory requirement
- Cons: Increased processing time to scan the reference
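For contrast with reference hashing, the read-based variant can be sketched as follows. The sketch assumes exact matching only and hashes just the first seed_len bases of each read; the function names and this seeding choice are hypothetical simplifications for illustration.

```python
from collections import defaultdict


def build_read_index(reads, seed_len):
    """Hash the leading seed of each read to the ids of the reads that start with it."""
    index = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= seed_len:
            index[read[:seed_len]].append(read_id)
    return index


def scan_reference(reference, reads, index, seed_len):
    """Slide along the reference once and return {read_id: [exact match positions]}."""
    placements = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        # Look up the reference window in the hash built from the reads.
        for read_id in index.get(reference[pos:pos + seed_len], ()):
            # Confirm that the whole read matches at this position.
            if reference.startswith(reads[read_id], pos):
                placements[read_id].append(pos)
    return dict(placements)


if __name__ == "__main__":
    reference = "ACGTTGCAAGGCTTAACC"
    reads = ["TTGCAA", "GGCTTA", "AAAAAA"]
    index = build_read_index(reads, seed_len=4)
    print(scan_reference(reference, reads, index, seed_len=4))
```

Here the hash grows with the number of reads rather than with the reference, and the whole reference is walked once per batch of reads, which matches the memory and processing-time trade-off listed above.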

Sense from sequence reads: methods for alignment and assembly
We will be implementing the reference-hashing algorithm for our project. Given that most of our mapping will be exact mapping with the possibility of mutations, and that our reference and reads will not reach terabytes in size, hashing the reference is much easier to conceptualize. Intense memory usage will not be that large of an issue.

Current Progress
We have begun generating test sets and are still in the design phase of our algorithm. For our first test set, we've taken the Escherichia coli isolate BL26A plasmid plmo226, which is ~2000 base pairs long, and have split it into reads of 100 base pairs for simple testing. We have also generated another test set using the same sample but with 10x coverage.


Current Progress
We've created a simple algorithm that reads a sequence in and splits it into reads of a specified length. To simulate coverage, we also amplify the number of reads by generating a specified number of duplicates of each read.
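A read generator of this kind might look like the sketch below. It is not the team's program: the function names are hypothetical, and dropping a trailing fragment shorter than read_len is an assumption made here because the slide does not say how leftovers are handled.

```python
def split_into_reads(sequence, read_len):
    """Cut the sequence into consecutive, non-overlapping reads of read_len bases."""
    if read_len <= 0:
        raise ValueError("read_len must be positive")
    # Assumption: a trailing fragment shorter than read_len is dropped.
    return [sequence[start:start + read_len]
            for start in range(0, len(sequence) - read_len + 1, read_len)]


def simulate_coverage(sequence, read_len, coverage):
    """Repeat the full set of reads `coverage` times to mimic deeper coverage."""
    return split_into_reads(sequence, read_len) * coverage


if __name__ == "__main__":
    reads = simulate_coverage("ACGTACGTAC", read_len=4, coverage=3)
    print(len(reads), reads)
```

With a sequence of about 2000 base pairs, read_len=100 and coverage=10 would give roughly 200 reads under these assumptions.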

Software Designs to Come
Designs we plan to implement include:
- Nucleotide representation conversion from strings to binary: A = 00, C = 01, G = 10, T = 11.
- Seed/hash key representation in the form of bit sets using long variables. For example: ATCG = 00110110, or 54.
- Efficient hash table storage for locations with the same nucleotide string.
- Reference-hashing Seed and Extend algorithm.
And if possible:
- Seed masks for handling of mutations/insertions/deletions.
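The 2-bit encoding can be checked with a short sketch. The code table follows the slide (A = 00, C = 01, G = 10, T = 11); the function names are hypothetical, and Python integers stand in for the long variables mentioned above.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # position in this string equals the 2-bit code


def encode(seq):
    """Pack a nucleotide string into an integer, 2 bits per base, first base highest."""
    value = 0
    for base in seq:
        value = (value << 2) | CODES[base]
    return value


def decode(value, length):
    """Unpack `length` bases from an integer produced by encode."""
    bases = []
    for _ in range(length):
        bases.append(BASES[value & 0b11])  # lowest 2 bits hold the last base
        value >>= 2
    return "".join(reversed(bases))


if __name__ == "__main__":
    key = encode("ATCG")
    print(key, format(key, "08b"))  # 54 00110110
    print(decode(key, 4))           # ATCG
```

The length has to be stored alongside the key because leading A bases encode as zero bits. A 64-bit long would hold seeds of up to 32 bases at 2 bits each.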

Questions?