DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

1 DNA Mapping/Alignment Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

2 Overview: Summary; Research Paper 1; Research Paper 2; Research Paper 3; Current Progress; Software Designs to Come

3 Summary Next generation sequencing (NGS) allows genetic information to be sequenced and analyzed through rigorous computation. In order to obtain genetic data, a biological sample must be chemically prepared before it is sequenced. Once prepared, the sample can be run through one of the various NGS technologies to be sequenced.

4 Summary The data generated by the sequencer is in the form of reads. Reads are strings of nucleotides, each a partial copy of the genetic material of interest. Reads can range from tens to thousands of nucleotides long. After a few quality control measures, the reads are ready to be analyzed. For analysis, the reads must be mapped and aligned to a reference of interest in order to compute results such as differential expression.

5 Summary A common algorithm that has spawned many derivatives within the mapping/alignment program community is the Seed and Extend method.

6 Summary The Seed and Extend (reference-hashed) method breaks the problem down into these steps: (1) Index the reference sequence via a hash. (2) Break reads into seeds, or smaller portions. (3) For the seeds from each read: find the most unique seed within the reference, extend that seed outward to check for a more confident match, and record the read's location with respect to the reference sequence.
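The steps on this slide can be sketched in Python as follows. This is a minimal illustration, not the team's implementation: the function names `build_index` and `map_read`, the non-overlapping seeds, and the `max_mismatches` threshold are all assumptions made for the example, and the extension step only counts substitutions.

```python
from collections import defaultdict

def build_index(reference, k):
    """Hash every k-mer of the reference to the positions where it starts."""
    index = defaultdict(list)
    for pos in range(len(reference) - k + 1):
        index[reference[pos:pos + k]].append(pos)
    return index

def map_read(read, reference, index, k, max_mismatches=2):
    """Return (position, mismatches) for the best placement, or None."""
    # Break the read into non-overlapping seeds, remembering each offset.
    seeds = [(off, read[off:off + k]) for off in range(0, len(read) - k + 1, k)]
    # Keep seeds found in the reference and try the rarest (most unique) first.
    seeds = [s for s in seeds if s[1] in index]
    seeds.sort(key=lambda s: len(index[s[1]]))
    best = None
    tried = set()
    for off, seed in seeds:
        for hit in index[seed]:
            start = hit - off  # where the whole read would begin
            if start < 0 or start + len(read) > len(reference) or start in tried:
                continue
            tried.add(start)
            # Extend: compare the full read against the reference window.
            window = reference[start:start + len(read)]
            mismatches = sum(a != b for a, b in zip(read, window))
            if mismatches <= max_mismatches and (best is None or mismatches < best[1]):
                best = (start, mismatches)
    return best
```

For example, with `ref = "ACGTTGCATGCCGATAGCTAGGCTAACG"` and `idx = build_index(ref, 4)`, calling `map_read(ref[8:20], ref, idx, 4)` places the read back at position 8 with no mismatches.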

7 Summary

8 Summary

9 A hybrid short read mapping accelerator Authors: Yupeng Chen, Bertil Schmidt & Douglas L. Maskell Publication: BMC Bioinformatics Date: February 2013 Doi: / Abstract: A hybrid of parallel software and special hardware, specifically a field programmable gate array, is used to provide faster processing of mapping-based sequence assembly while maintaining accuracy.

10 A hybrid short read mapping accelerator Problem caused by the ever-growing volume of short read sequence data: fast methods do not accommodate much error, while approaches that handle error well tend to be impractically slow. Goal of this technique: to improve both the speed and the accuracy of short read alignment (SRA). Most previous solutions focus on software only.

11 A hybrid short read mapping accelerator This approach: Indexes the genomic template once, as a separate process before program execution, and saves the index for future use. Uses a fixed seed length, which is required in order to always use the same genomic template index for a given species.

12 A hybrid short read mapping accelerator This approach uses a hybrid of both software and a type of hardware known as field programmable gate arrays, or FPGAs. FPGAs: Great potential for massively parallel computations. Require additional design work for implementation. Few attempts have been made to utilize an FPGA based approach.

13 A hybrid short read mapping accelerator Hardware: FPGA (one Virtex5 FPGA chip), used for the seed generation and sequence alignment processes. Both tasks demand large amounts of computational resources. Data in memory is divided between the host PC and the FPGA.

14 A hybrid short read mapping accelerator Software: Uses the seed-and-extend method commonly used in SRAs (Short Read Aligners). The two stages of the algorithm, seed generation and seed extension, are each run in parallel.

15 A hybrid short read mapping accelerator Seed extension process: Longest running time of any part of the algorithm. Well suited for FPGA parallelization. Primarily composed of repeated random access of a sizable lookup table.

16 A hybrid short read mapping accelerator Division of tasks: CPU (less demanding tasks): Converting reads into a binary representation (2 bits per nucleotide). Sending the encoded reads to the short read alignment process on the FPGA. Sending commands to the process on the FPGA. Accepting the results and writing them to disk.

17 A hybrid short read mapping accelerator Division of tasks: FPGA (Highly computationally intensive and parallelizable tasks): Generation of seeds. Extension of seed matches.

18 A hybrid short read mapping accelerator Results: Seed extension (previously the most time-consuming step) was sped up to the point where it was no longer the bottleneck of the SRA process. Seed generation is now the bottleneck. The authors cite future plans to parallelize the initial step of encoding the reads for further speedup.

19 A hybrid short read mapping accelerator What can we use from this? We are unlikely to be able to use FPGA hardware. However, we can use some of the software concepts used in this approach in our solution.

20 A hybrid short read mapping accelerator What can we use from this? Software concepts: More than one portion of the SRA process can be made parallel: Initial encoding of reads. Generation of seeds. Extension of initial seed matches.

21 A hybrid short read mapping accelerator What can we use from this? Software concepts: To further reduce execution time: The genomic template can be indexed prior to program execution. This index need only be generated once for a given species, and can be re-used many times.
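A minimal sketch of the "index once, reuse many times" idea, assuming a plain Python dictionary as the index and `pickle` as the on-disk format. The names `build_index` and `load_or_build_index`, and the choice to store the seed length next to the index, are hypothetical and are not taken from the paper.

```python
import os
import pickle
from collections import defaultdict

def build_index(reference, seed_len):
    """Map each seed of length seed_len to its start positions."""
    index = defaultdict(list)
    for pos in range(len(reference) - seed_len + 1):
        index[reference[pos:pos + seed_len]].append(pos)
    return dict(index)

def load_or_build_index(reference, seed_len, path):
    """Reuse a saved index if one exists; otherwise build it and save it."""
    if os.path.exists(path):
        with open(path, "rb") as fh:
            saved = pickle.load(fh)
        # The fixed seed length must match, or the saved index is unusable.
        if saved["seed_len"] == seed_len:
            return saved["index"]
    index = build_index(reference, seed_len)
    with open(path, "wb") as fh:
        pickle.dump({"seed_len": seed_len, "index": index}, fh)
    return index
```

Storing the seed length alongside the index reflects why the fixed seed length matters: an index built for one seed length cannot answer lookups for another.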

22 Efficient storage of high throughput DNA sequencing data using reference-based compression Authors: Markus Hsi-Yang Fritz, Rasko Leinonen, Guy Cochrane, and Ewan Birney Publication: Genome Research Date: January 2011 Doi: /gr Abstract: Data storage costs have become an appreciable proportion of the total cost of creating and analyzing DNA sequence data, hence the need for reference-based compression of high throughput DNA sequencing data.

23 Efficient storage of high throughput DNA sequencing data using reference-based compression Problem: There are many challenges in handling next generation sequence data, from the highly fragmented nature of the shorter reads generated by the new technologies to the storage, analysis, and computational requirements of such large data volumes. The main concern is that the rate of increase in DNA sequencing is significantly outstripping the rate of increase in disk storage capacity.

24 Efficient storage of high throughput DNA sequencing data using reference-based compression Addressing the Issue: New sequences are aligned to a reference genome, and the differences between each new sequence and the reference genome are encoded. Only these differences are stored, which takes less storage than keeping the full sequences.
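The idea of storing only the differences can be illustrated with a small sketch. It assumes the read is already aligned at a known position and differs from the reference only by substitutions. The record layout `(pos, length, diffs)` and the function names are made up for this example and are not the paper's actual encoding.

```python
def encode_read(read, reference, pos):
    """Store an aligned read as (pos, length, [(offset, base), ...])."""
    # Only positions where the read disagrees with the reference are kept.
    diffs = [(i, b) for i, b in enumerate(read) if reference[pos + i] != b]
    return pos, len(read), diffs

def decode_read(record, reference):
    """Rebuild the original read from the reference plus the stored diffs."""
    pos, length, diffs = record
    bases = list(reference[pos:pos + length])
    for offset, base in diffs:
        bases[offset] = base
    return "".join(bases)
```

A read that matches the reference exactly is stored as just a position and a length, which fits the slide's point that only the differences need to be kept.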

25 Efficient storage of high throughput DNA sequencing data using reference-based compression The efficiency of the compression method increases exponentially with read length, i.e., the longer the reads, the more efficient the compression. The magnitude of this efficiency gain can be controlled by changing the amount of quality information stored.

26 Efficient storage of high throughput DNA sequencing data using reference-based compression Prior to 2005 the rate of increase in sequencing capacity was close to the rate of increase in disk storage capacity on a per unit cost basis.

27 Efficient storage of high throughput DNA sequencing data using reference-based compression Given the potential memory demands of this project, this concept of structuring our data may help us in the future. If we foresee a memory bottleneck within our program, we will incorporate this approach to read/reference storage and analysis.

28 Sense from sequence reads: methods for alignment and assembly Authors: Paul Flicek, Ewan Birney Publication: Nature Methods, Volume 6, No. 11s doi: /nmeth.1376 Date: November 2009 Abstract: A discussion of the current algorithms behind mapping/alignment and assembly programs and the future directions of these algorithms.

29 Sense from sequence reads: methods for alignment and assembly General overview of the importance of mapping/alignment and assembly within the scientific community. The alignment/mapping portion is split into two major algorithmic types: Seed and Extend (hash-based) and Burrows-Wheeler Transform (BWT). The paper explains the basic structures of these algorithms. Our main interest is the Seed and Extend based methods.

30 Sense from sequence reads: methods for alignment and assembly Two types of hash indexes: reference-based and read-based. Reference-based hashes read the reference into a hash in sections and match reads to the hashed index. Pros: Fast look-up. Cons: High memory footprint. Read-based hashes read the reads into a hash as seeds, and the reference is used to search the hash. Pros: Small memory requirement. Cons: Increased processing time to scan the reference.
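For contrast with reference hashing, here is a rough sketch of the read-based variant: the reads go into the hash and the reference is scanned once against it. It handles exact matches only, and the names `hash_reads` and `scan_reference`, as well as using the first `k` bases of each read as its seed, are assumptions for illustration.

```python
from collections import defaultdict

def hash_reads(reads, k):
    """Index each read by its first k bases (its seed)."""
    table = defaultdict(list)
    for read_id, read in enumerate(reads):
        if len(read) >= k:
            table[read[:k]].append(read_id)
    return table

def scan_reference(reference, reads, k):
    """Slide along the reference and report exact placements of reads."""
    table = hash_reads(reads, k)
    placements = {}
    for pos in range(len(reference) - k + 1):
        # Look up the reference k-mer among the read seeds.
        for read_id in table.get(reference[pos:pos + k], ()):
            if reference.startswith(reads[read_id], pos):
                placements.setdefault(read_id, []).append(pos)
    return placements
```

The table here grows with the number of reads rather than with the reference, but every reference position has to be visited, which is the trade-off listed on the slide.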

31 Sense from sequence reads: methods for alignment and assembly We will be implementing the reference-hashing algorithm for our project. Given that most of our mapping will be exact mapping with the possibility of mutations, and that our reference and reads will not reach the size of terabytes, hashing the reference is much easier to conceptualize. Intense memory usage will not be that large of an issue.

32 Current Progress Currently, we have begun generating test sets and are still within the design phase of our algorithm. For our first test set, we've taken the Escherichia coli isolate BL26A plasmid plmo226, which is ~2000 base pairs long, and split it into reads of 100 base pairs for simple testing. We have also generated another test set using the same sample but with 10x coverage.

33 Current Progress

34 Current Progress We've created a simple algorithm that reads a sequence in and splits it into reads of a specified length. We also multiply the number of reads by another specified amount, generating that many duplicates of each read to simulate coverage.
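A test-set generator of this kind might look like the sketch below. It assumes consecutive, non-overlapping reads and plain duplication for coverage. The function names are hypothetical, and the team's actual script may differ.

```python
def split_into_reads(sequence, read_len):
    """Cut the sequence into consecutive, non-overlapping reads."""
    # The final read is shorter when len(sequence) is not a multiple of read_len.
    return [sequence[i:i + read_len] for i in range(0, len(sequence), read_len)]

def simulate_coverage(sequence, read_len, coverage):
    """Duplicate every read `coverage` times to mimic sequencing depth."""
    reads = split_into_reads(sequence, read_len)
    return [read for read in reads for _ in range(coverage)]
```

Under these assumptions, a 2000 base pair sequence split into 100 base pair reads gives 20 reads, and 10x coverage gives 200.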

35 Software Designs to Come Designs we plan to implement include: Nucleotide representation conversion from strings to binary: A = 00, C = 01, G = 10, T = 11. Seed/hash key representation in the form of bit sets using long variables; for example, ATCG = 00110110, or 54. Efficient hash table storage for locations with the same nucleotide string. Reference-hashing Seed and Extend algorithm. And if possible: seed masks for handling mutations/insertions/deletions.
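The 2-bit encoding and integer hash keys can be sketched as follows, using the mapping from the slide (A = 00, C = 01, G = 10, T = 11). Python integers stand in for the planned long variables, and the names `encode_seed`, `decode_seed`, and `index_reference` are hypothetical.

```python
CODES = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BASES = "ACGT"  # position in this string equals the 2-bit code

def encode_seed(seed):
    """Pack a nucleotide string into an integer, 2 bits per base."""
    key = 0
    for base in seed:
        key = (key << 2) | CODES[base]
    return key

def decode_seed(key, length):
    """Unpack an integer key back into a nucleotide string of given length."""
    bases = []
    for _ in range(length):
        bases.append(BASES[key & 0b11])
        key >>= 2
    return "".join(reversed(bases))

def index_reference(reference, seed_len):
    """Hash table from integer seed key to every position sharing that seed."""
    table = {}
    for pos in range(len(reference) - seed_len + 1):
        key = encode_seed(reference[pos:pos + seed_len])
        table.setdefault(key, []).append(pos)
    return table
```

Here `encode_seed("ATCG")` gives 54, matching the slide's example. The seed length has to be tracked separately because leading A bases encode as zero bits, and a 64-bit long can hold at most 32 bases at 2 bits each.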

36 Questions?