DNA Sequencing Data Compression. Michael Chung



Similar documents
DNA Mapping/Alignment. Team: I Thought You GNU? Lars Olsen, Venkata Aditya Kovuri, Nick Merowsky

Arithmetic Coding: Introduction

LZ77. Example 2.10: Let T = badadadabaab and assume d max and l max are large. phrase b a d adadab aa b

Tutorial for Windows and Macintosh. Preparing Your Data for NGS Alignment

Accelerating Data-Intensive Genome Analysis in the Cloud

Next Generation Sequencing: Technology, Mapping, and Analysis

Removing Sequential Bottlenecks in Analysis of Next-Generation Sequencing Data

Towards Integrating the Detection of Genetic Variants into an In-Memory Database

Compression techniques

SRA File Formats Guide

Challenges associated with analysis and storage of NGS data

Information, Entropy, and Coding

On the Use of Compression Algorithms for Network Traffic Classification

A Tutorial in Genetic Sequence Classification Tools and Techniques

Analysis of Compression Algorithms for Program Data

Image Compression through DCT and Huffman Coding Technique

Version 5.0 Release Notes

How-To: SNP and INDEL detection

CHAPTER 2 LITERATURE REVIEW

Complete Genomics Sequencing

SeqScape Software Version 2.5 Comprehensive Analysis Solution for Resequencing Applications

Focusing on results not data comprehensive data analysis for targeted next generation sequencing

A Catalogue of the Steiner Triple Systems of Order 19

LifeScope Genomic Analysis Software 2.5

Wan Accelerators: Optimizing Network Traffic with Compression. Bartosz Agas, Marvin Germar & Christopher Tran

RETRIEVING SEQUENCE INFORMATION. Nucleotide sequence databases. Database search. Sequence alignment and comparison

Inverted Indexes: Trading Precision for Efficiency

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Analysis of NGS Data

Copy Number Variation: available tools

Next generation sequencing (NGS)

Data formats and file conversions

encoding compression encryption

Information Theory and Coding Prof. S. N. Merchant Department of Electrical Engineering Indian Institute of Technology, Bombay

SAP HANA Enabling Genome Analysis

Data Reduction: Deduplication and Compression. Danny Harnik IBM Haifa Research Labs

Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format

Binary Number System. 16. Binary Numbers. Base 10 digits: Base 2 digits: 0 1

17 July 2014 WEB-SERVER MANUAL. Contact: Michael Hackenberg

Big Data & Scripting Part II Streaming Algorithms

THE SECURITY AND PRIVACY ISSUES OF RFID SYSTEM

Dimensionality Reduction of DNA Sequences through Wavelet Transforms

UCLA Team Sequences Cell Line, Puts Open Source Software Framework into Production

BROADBAND AND HIGH SPEED NETWORKS

Today s topics. Digital Computers. More on binary. Binary Digits (Bits)

Extensible Sequence (XSQ) File Format Specification 1.0.1

Data Storage 3.1. Foundations of Computer Science Cengage Learning

Lempel-Ziv Coding Adaptive Dictionary Compression Algorithm

8/7/2012. Experimental Design & Intro to NGS Data Analysis. Examples. Agenda. Shoe Example. Breast Cancer Example. Rat Example (Experimental Design)

Streaming Lossless Data Compression Algorithm (SLDC)

Comparing Methods for Identifying Transcription Factor Target Genes

Secured Lossless Medical Image Compression Based On Adaptive Binary Optimization

Go where the biology takes you. Genome Analyzer IIx Genome Analyzer IIe

Class Notes CS Creating and Using a Huffman Code. Ref: Weiss, page 433

Succinct Data Structures: From Theory to Practice

Deep Sequencing Data Analysis

Issues in Data Storage and Data Management in Large- Scale Next-Gen Sequencing

Shouguo Gao Ph. D Department of Physics and Comprehensive Diabetes Center

Data Storage. Chapter 3. Objectives. 3-1 Data Types. Data Inside the Computer. After studying this chapter, students should be able to:

Global Alliance. Ewan Birney Associate Director EMBL-EBI

SMALL INDEX LARGE INDEX (SILT)

The string of digits in the binary number system represents the quantity

Data Processing of Nextera Mate Pair Reads on Illumina Sequencing Platforms

Molecular typing of VTEC: from PFGE to NGS-based phylogeny

HADOOP IN THE LIFE SCIENCES:

Video compression: Performance of available codec software

Genotyping by sequencing and data analysis. Ross Whetten North Carolina State University

Module 1. Sequence Formats and Retrieval. Charles Steward

Raima Database Manager Version 14.0 In-memory Database Engine

Using Illumina BaseSpace Apps to Analyze RNA Sequencing Data

Package PKI. July 28, 2015

When you install Mascot, it includes a copy of the Swiss-Prot protein database. However, it is almost certain that you and your colleagues will want

Integrating NLTK with the Hadoop Map Reduce Framework Human Language Technology Project

Modified Golomb-Rice Codes for Lossless Compression of Medical Images

Lossless Compression of Cloud-Cover Forecasts for Low-Overhead Distribution in Solar-Harvesting Sensor Networks

Practical Guideline for Whole Genome Sequencing

SDGen: Mimicking Datasets for Content Generation in Storage Benchmarks

Binary Numbering Systems

Assuring the Quality of Next-Generation Sequencing in Clinical Laboratory Practice. Supplementary Guidelines

Delivering the power of the world s most successful genomics platform

Acceleration for Personalized Medicine Big Data Applications

Compression of next-generation sequencing quality scores using memetic algorithm

Transcription:

DNA Sequencing Data Compression Michael Chung

Problem DNA sequencing per dollar is increasing faster than storage capacity per dollar. Stein (2010)

Data 3 billion base pairs in human genome Genomes are 99.9% similar 95% SNPs

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Data

Compression Tools Fixed Codes VINT Golomb / Golomb-Rice Elias Codes MOV Codes Variable Codes Huffman Codes

Variable Integers (VINT) Standard integer data type uses 4 bytes Example: Binary code for 5:

Variable Integers (VINT) Standard integer data type uses 4 bytes Example: Binary code for 5: Small integers don t need full 4 bytes Last bit of each byte used as a flag: 0: VINT continues to next byte 1: End of VINT

Golomb / Golomb-Rice Encode an integer with two bit strings Preamble Mantissa Choose parameter m to encode integer j:

Golomb / Golomb-Rice Example Encode 61 with m = 8:

Delta Encoding Store relative integers, not absolute integers Instead of 2, 4, 6, 8, 10 Store 2, 2, 2, 2, 2

Elias Codes Binary code for 5: Can we compress the leading zeroes?

Elias Codes To encode integer j:

Monotone Value Coding (MOV Coding) Encode j 1, j 2,, j k Encode j 1 with Elias Gamma encoding Preamble: (log j i log j i-1 ) 0-bits Mantissa: binary code for j i starting with most significant 1 bit

Monotone Value Coding (MOV Coding) 2: 0 10 3: 11 4: 0 100 5: 101 6: 110 7: 111 8: 0 1000

Huffman Encoding I will not pass notes in class. I will not pass notes in class. I will not pass notes in class. I will not pass notes in class. I will not pass notes in class.

Huffman Encoding Let 0 = I will not pass notes in class. 00000

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Huffman Encoding Huffman encoding assigns shorter tags to more frequently encountered data.

Papers 2008 Human genomes as email attachments Scott Christley, Yiming Lu, Chen Li, Xiaohui Xie 2010 Data structures and compression algorithms for high-throughput sequencing technologies Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi 2011 Efficient Storage of high throughput DNA sequencing data using reference-based compression Markus His-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney

Papers 2008 Human genomes as email attachments Scott Christley, Yiming Lu, Chen Li, Xiaohui Xie 2010 Data structures and compression algorithms for high-throughput sequencing technologies Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi 2011 Efficient Storage of high throughput DNA sequencing data using reference-based compression Markus His-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney

Summary

Summary Genomes are 99.9% similar Store only the differences

Assumptions

Method

Method

Techniques SNP Mapping

Techniques SNP Mapping NCBI dbsnp 11 million SNPs 99% are bi-allele http://www.ncbi.nlm.nih.gov/projects/snp/

Techniques VINT and Delta Positions Use for indels (insertions and deletions)

Techniques VINT and Delta Positions Use for indels (insertions and deletions)

Techniques VINT and Delta Positions Use for indels (insertions and deletions) Deletion Example

Techniques K-mer partitioning

Techniques K-mer partitioning

Techniques K-mer partitioning

Techniques K-mer partitioning

Results James Watson s Genome 3.5 M differences 3.3 M SNPs 220 K indels

Review Good general concepts used in later research Compare to reference genome Take advantage of: Common SNPs Position integer compression + delta encoding Frequency analysis based compression (Huffman)

Papers 2008 Human genomes as email attachments Scott Christley, Yiming Lu, Chen Li, Xiaohui Xie 2010 Data structures and compression algorithms for high-throughput sequencing technologies Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi 2011 Efficient Storage of high throughput DNA sequencing data using reference-based compression Markus His-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney

Summary Compress position and length integers with various compression methods Fixed code compression works well

Assumptions

Method Substitution triples

Method Substitution triples

Method Substitution triples

Method Substitution triples

Results Compressed Data Size (Bytes) Compression Method Dataset 1 Dataset 2 Dataset 3 GenCompress 56,166,419 36,099,480 390,541,330 gzip 41,378,624 95,688,992 618,818,824 bzip2 42,233,336 94,030,320 955,061,616 7zip 30,651,664 83,319,584 411,811,520

Results Relative Elias Gamma (REG) indexed encoding performs best Huffman coding can outperform when dataset is highly repetitive

Review Pros Efficient encoding of integers Can handle read-level compression Cons Storage space used to represent data identical to reference Does not handle quality scores Assumes aligned data

Papers 2008 Human genomes as email attachments Scott Christley, Yiming Lu, Chen Li, Xiaohui Xie 2010 Data structures and compression algorithms for high-throughput sequencing technologies Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi 2011 Efficient Storage of high throughput DNA sequencing data using reference-based compression Markus His-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney

Summary Lossy compression using quality budget

Summary Lossy compression using quality budget Include extra sequences in reference common unmapped reads bacterial / viral sequences

Summary Lossy compression using quality budget Include extra sequences in reference common unmapped reads bacterial / viral sequences Support paired-end reads

Summary Lossy compression using quality budget Include extra sequences in reference common unmapped reads bacterial / viral sequences Support paired-end reads 0.02 0.66 bits / base pair

Assumptions

Method Relative Golomb encoding for positions Elias Gamma encoding for deletion lengths Huffman encoding for read lengths, quality scores

Method Quality score budget

Method Quality score budget

Results Bits per base Compression Method NA12878 chrom20 Pseudomonas syringae pathovar syringae B728a Bzip2 1.09 0.99 BAM 17.48 7.02 Reference-based, 2% quality budget Bzip2, referencebased, 2% quality budget 0.59 0.47 0.56 0.39

Results 0.02 0.66 bits / base pair Compared to bzip2 (1 bit / base pair) 2 hours for compression

Review Pros Stores quality scores Handles some unaligned data Allows for controlled loss of precision Can handle read-level compression Cons Storage space used to represent data identical to reference Still cannot efficiently store all unaligned data

What is too expensive to compress? Image data? Quality scores? Unaligned reads?

Conclusions No single compression method is optimal for all data sets Choose assumptions about data carefully Must make decisions about which data to discard, depending on goals Need a centralized database of genomic sequence data Short Read Archive (SRA) Things to consider Storage Speed Accessibility Annotation and mass data organization Scalable, reliable, secure distributed system design Business

References 2008 Human genomes as email attachments Scott Christley, Yiming Lu, Chen Li, Xiaohui Xie 2010 Data structures and compression algorithms for high-throughput sequencing technologies Kenny Daily, Paul Rigor, Scott Christley, Xiaohui Xie, Pierre Baldi 2010 The case for cloud computing in genome informatics Lincoln D. Stein 2011 Efficient Storage of high throughput DNA sequencing data using reference-based compression Markus His-Yang Fritz, Rasko Leinonen, Guy Cochrane, Ewan Birney