Data Reduction: Deduplication and Compression Danny Harnik IBM Haifa Research Labs
Motivation
Reducing the amount of data is a desirable goal.
Data reduction: an attempt to compress the huge amounts of data at hand.
Is it possible? Information theoretically, technically.
Is it financially worth it? Storage is becoming cheaper all the time, while data reduction requires resources and time.
Compression and Deduplication
Compression: what is the most succinct representation of this file?
Deduplication: hasn't this file appeared before?
Different workloads give different results: some favor compression, some favor dedup, and sometimes the combination is best.
Compression
Compression
Zip runs an algorithm called DEFLATE, a combination of two techniques:
Lempel-Ziv [1977]
Huffman coding [1952]
We will show these two techniques, plus arithmetic encoding.
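For a quick, hands-on feel for DEFLATE, the sketch below uses Python's standard zlib module, which implements the LZ77-plus-Huffman pipeline described above; the sample text and compression level are arbitrary choices, not from the slides.

```python
import zlib

text = b"the quick brown fox jumps over the lazy dog " * 100

# zlib wraps DEFLATE (LZ77 matching + Huffman coding of the resulting symbols)
compressed = zlib.compress(text, level=9)
restored = zlib.decompress(compressed)

assert restored == text
print(f"original: {len(text)} bytes, compressed: {len(compressed)} bytes")
```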
LZ77 Compression
Go over a stream. At each point, search for the longest identical string that has already appeared in the past.
If none appeared, write the string as is.
If one appeared, save:
a pointer to the start of the string (how many bytes back)
the length of the current match.
Many variations: how far to search back? Typically 32KB.
LZ78 instead holds a dictionary table.
A good approximation of the entropy for some sources.
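A minimal, deliberately slow sketch of this idea (brute-force window search; the window and min_match parameters are illustrative; real DEFLATE uses hash chains and then Huffman-codes the token stream):

```python
def lz77_tokens(data: bytes, window: int = 32 * 1024, min_match: int = 3):
    """Emit LZ77 tokens: ('match', distance, length) or ('literal', byte)."""
    i = 0
    while i < len(data):
        best_len, best_dist = 0, 0
        # Brute-force scan of the search window for the longest earlier match.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < len(data) and data[j + length] == data[i + length]:
                length += 1
            if length > best_len:
                best_len, best_dist = length, i - j
        if best_len >= min_match:
            yield ("match", best_dist, best_len)   # copy best_len bytes from best_dist back
            i += best_len
        else:
            yield ("literal", data[i])             # nothing worth referencing: emit the byte
            i += 1

# "abcabcabcabcd" becomes three literals, one back-reference, and a final literal.
print(list(lz77_tokens(b"abcabcabcabcd")))
```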
Huffman Code
An information-theoretic approach to compression: a typical text of n characters (or bytes) is not uniformly distributed.
Use the skewed distribution to achieve a shorter representation: the most popular byte/character gets the shortest representation.
E.g., in a typical English text, use the shortest encoding for 'e' and the longest for 'q'.
Huffman code: a method of representing a text using nearly its Shannon entropy worth of bits.
Optimal when considering just single characters.
Huffman Code
[Figure: an example of a Huffman tree. Example taken from: http://sector0.dk/?p=29]
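The greedy construction behind such a tree can be sketched in a few lines; this is an illustrative toy that returns codewords as bit strings, not the canonical code format DEFLATE actually stores:

```python
import heapq
from collections import Counter

def huffman_codes(text: bytes) -> dict:
    """Greedy Huffman construction: repeatedly merge the two least-frequent
    subtrees (assumes at least two distinct symbols)."""
    freq = Counter(text)
    # Heap entries: (frequency, tie-breaker, {symbol: codeword-so-far})
    heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freq.items())]
    heapq.heapify(heap)
    tie = len(heap)
    while len(heap) > 1:
        f1, _, left = heapq.heappop(heap)
        f2, _, right = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in left.items()}          # left subtree gets a 0 prefix
        merged.update({s: "1" + c for s, c in right.items()})   # right subtree gets a 1 prefix
        heapq.heappush(heap, (f1 + f2, tie, merged))
        tie += 1
    return heap[0][2]

codes = huffman_codes(b"this is an example of a huffman tree")
for sym, code in sorted(codes.items(), key=lambda kv: len(kv[1])):
    print(chr(sym), code)    # frequent symbols (space, 'a', 'e') get short codewords
```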
Deduplication
Deduplication
Similar to Lempel-Ziv 78, but at a whole different scale.
The basic block is typically ~4KB, 8KB, 16KB, or a full file, rather than a byte or a string of bytes.
An ongoing process: a file must still be addressed after it is saved and closed.
Two main approaches:
Inline dedup: process data as it arrives.
Offline dedup: background process; first save the data, then dedup in spare time.
How to dedupe?
Fingerprint each block using a hash function. Common hashes used: SHA-1, SHA-256, others.
Store an index of all the hashes already in the system.
For a new block: compute its hash and look it up in the index table.
If new, add it to the index.
If the hash is known, store a pointer to the existing data.
If the hash is known, do you want to look at the actual data??
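To make the flow concrete, here is a toy in-memory sketch of that index (fixed 4KB blocks, SHA-1 fingerprints; the class and method names are illustrative, not from any product):

```python
import hashlib

BLOCK_SIZE = 4096          # fixed-size blocks; real systems also use variable sizes

class DedupStore:
    """Toy in-memory dedup store: one copy kept per unique fingerprint."""
    def __init__(self):
        self.index = {}    # fingerprint (hex SHA-1) -> block data
        self.files = {}    # file name -> list of fingerprints (the "pointers")

    def put(self, name: str, data: bytes) -> None:
        refs = []
        for off in range(0, len(data), BLOCK_SIZE):
            block = data[off:off + BLOCK_SIZE]
            fp = hashlib.sha1(block).hexdigest()
            if fp not in self.index:       # new block: store it and index it
                self.index[fp] = block
            refs.append(fp)                # known block: just keep the pointer
        self.files[name] = refs

    def get(self, name: str) -> bytes:
        return b"".join(self.index[fp] for fp in self.files[name])

store = DedupStore()
store.put("a.bin", b"x" * 8192)
store.put("b.bin", b"x" * 8192)            # identical data: no new blocks stored
assert store.get("b.bin") == b"x" * 8192
print("unique blocks stored:", len(store.index))   # 1
```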
Client-side deduplication
A method to save bandwidth as well as storage. Also known as source-based dedupe or WAN deduplication.
The client computes the hash and sends it to the server.
If the hash is new, the server requests the data from the client (upload data).
Otherwise (dedupe), skip the upload and add a new pointer to the data.
[Figure: a client uploading "Let it be.mp3"; only the hash 2fd4e1... is sent, the server finds it in its index, and the upload is skipped]
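The exchange can be sketched as "do you have this hash?" followed by an upload only when needed; everything below (names, whole-file granularity) is a simplification for illustration:

```python
import hashlib

class Server:
    """Toy server side of source-based dedup: an index of known fingerprints."""
    def __init__(self):
        self.blobs = {}                      # fingerprint -> data

    def have(self, fp: str) -> bool:         # step 1: client asks "do you have this?"
        return fp in self.blobs

    def upload(self, fp: str, data: bytes) -> None:
        self.blobs[fp] = data                # step 2: only unknown data is uploaded

def client_backup(server: Server, data: bytes) -> int:
    """Return the number of bytes actually sent over the 'wire'."""
    fp = hashlib.sha1(data).hexdigest()
    if server.have(fp):
        return 0                             # dedup hit: only the hash travels
    server.upload(fp, data)
    return len(data)

srv = Server()
song = b"fake mp3 bytes" * 1000
print(client_backup(srv, song))              # first client uploads everything
print(client_backup(srv, song))              # second client sends only the hash: 0
```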
Choice of hash function
In most deduplication systems this is done using a cryptographic hash, usually SHA-1, which has an output of 160 bits.
Probability of a collision, where n is the number of blocks and b is the number of bits in the hash:
p ≤ n(n-1)/2 · 1/2^b
The above is true for any random hash function. However, a malicious adversary may choose blocks especially to create a collision. This is why a cryptographic hash is used, even though it is typically more expensive than a random-like hash function.
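Plugging in illustrative numbers (a petabyte of unique 4KB blocks, not figures from the slides) shows how negligible this probability is:

```python
# Worked example: 1 PB of unique 4KB blocks, fingerprinted with SHA-1.
n = 2 ** 38          # 1 PB / 4 KB = 2^50 / 2^12 blocks (~2.7e11)
b = 160              # SHA-1 output bits

p = n * (n - 1) / 2 / 2 ** b
print(p)             # ~2.6e-26, far below hardware error rates
```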
Issues
Smaller blocks = better dedup.
But smaller blocks = more work: more fingerprints, more searches, more metadata.
Bottom line: the choice of block size depends on the workload, e.g. a file system with a 1KB page size.
Alignment issues
What if we insert 1 byte into an existing file? The data is almost identical, yet dedup will fail miserably.
Solution: variable block size.
Rabin-Karp fingerprinting: compute a rolling hash and cut a block whenever the hash equals 0 mod p.
Average block size = p.
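A sketch of such content-defined chunking follows; the window size, base, and modulus are arbitrary illustrative choices, and real chunkers add minimum/maximum block sizes and much faster rolling hashes:

```python
def chunk_boundaries(data: bytes, p: int = 4096, window: int = 48,
                     base: int = 257, mod: int = (1 << 61) - 1):
    """Cut wherever a polynomial rolling hash of the last `window` bytes
    is 0 mod p, giving content-defined blocks of expected size ~p."""
    boundaries = []
    h = 0
    top = pow(base, window - 1, mod)   # weight of the byte leaving the window
    last_cut = 0
    for i, byte in enumerate(data):
        if i >= window:
            h = (h - data[i - window] * top) % mod   # drop the oldest byte
        h = (h * base + byte) % mod                  # add the new byte
        if i - last_cut + 1 >= window and h % p == 0:
            boundaries.append(i + 1)                 # cut after this byte
            last_cut = i + 1
    if last_cut < len(data):
        boundaries.append(len(data))                 # final, forced boundary
    return boundaries

import random
random.seed(0)
a = bytes(random.getrandbits(8) for _ in range(200_000))
b = a[:100_000] + b"X" + a[100_000:]      # insert a single byte in the middle
print(chunk_boundaries(a)[-3:])            # last boundaries of a
print(chunk_boundaries(b)[-3:])            # same positions shifted by 1: later blocks realign
```

Because the boundaries depend only on the local content of the window, almost all blocks after the insertion point keep their content and still dedup against the original file.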
Existing data reduction solutions (A sample of solutions for storage systems)
Deduplication: some systems and applications
Content Addressable Storage (CAS), mainly for archiving: Venti (Lucent), Centera (EMC), JumboStore (HP), Hydrastor (NEC)
Backup / Virtual Tape Library (VTL) backup: Diligent (IBM), DataDomain (EMC), D2D (HP)
Backup with client-side dedup: cloud backup services such as Mozy (EMC), DropBox, ...; Avamar (EMC), Ocarina (Dell), NetBackup (Symantec), Tivoli Storage Manager (IBM)
Primary (mainly file systems), useful for VM images: NetApp Filer (2-to-1 ratio guarantee on some VMware usage), ZFS (Sun open-source file system), Dell (planned for next year)
Compression in storage systems
Real-time (inline): RTC (IBM), ZFS (Oracle), Nimble Storage
Offline: EMC Data Compression
Mix: Dell (planned for next year: dedupe inline, compression offline), NetApp (writes online, updates offline)
Backup
Dedup vs. Compression vs. both
[Chart: Compression and Deduplication for Various Data Types. Data reduction ratio (compressed size / original size, 0 to 1.2) for Compress (Gzip), DedupV (4K, var), DedupV+Compress, DedupF (4K, fix), DedupF+Compress, Compress+DedupV, and Compress+DedupF, across VM Images, Medical Images, Website Archive, Project Repository, DB2 TPC, and Laptop1 (29.9GB).]
Data taken from C. Constantinescu, J. Glider, D. Chambliss: Mixing Deduplication and Compression on Active Data Sets. DCC 2011.
Summary
Data reduction is a useful concept, but not for all cases.
Compression and deduplication: two similar concepts at the two ends of the same scale.
The large scale in dedupe creates new challenges.
Different challenges and use cases: no one solution fits all.