Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework




Lossless Data Compression Standard Applications and the MapReduce Web Computing Framework Sergio De Agostino Computer Science Department Sapienza University of Rome

Internet as a Distributed System Modern distributed systems may nowadays consist of hundreds of thousands of computers. Decentralization is the fundamental approach to reaching unprecedented scales, towards millions or even billions of elements, using the Internet as an extreme distributed system.

Web Computing Web Computing is a special kind of distributed computing and involves Internet-based collaboration of an arbitrary number of remotely located applications. MapReduce is the framework proposed by Google to realize distributed applications on the Internet, making distributed computing more accessible.

The MapReduce Framework MapReduce (MR) is a framework for processing parallelizable problems across huge datasets using a large number of computers (nodes). Nodes could be either on the same local network or spread across geographically and administratively more heterogeneous distributed systems.

The MR Programming Paradigm P = μ_1 ρ_1 … μ_R ρ_R is a sequence where μ_i is a mapper and ρ_i is a reducer for 1 ≤ i ≤ R. Mapper μ_i receives its input from reducer ρ_(i-1) for 1 < i ≤ R. For 1 ≤ i ≤ R, such input is a multiset U_(i-1) of (key, value) pairs (U_0 is given) and each pair (k, v) ∈ U_(i-1) is mapped by μ_i to a set of new (key, value) pairs.

The MR Programming Paradigm (2) The input to reducer ρ_i is U′_i, the union of the sets output by mapper μ_i. For each key k, ρ_i reduces the subset of pairs with k as key component to a new set of pairs with key still equal to k. U_i is the union of these new sets and is output by ρ_i as input to μ_(i+1).
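The round structure described above can be simulated sequentially in a few lines. This is an illustrative sketch only: the helper `mr_round` and the word-count mapper/reducer are examples chosen here, not part of the MR framework.

```python
from collections import defaultdict

def mr_round(pairs, mapper, reducer):
    """One MR round: map every (key, value) pair, group the emitted pairs
    by key (pairs with the same key go to the same node), reduce each group."""
    mapped = []
    for k, v in pairs:
        mapped.extend(mapper(k, v))          # mu_i: pair -> set of new pairs
    groups = defaultdict(list)
    for k, v in mapped:
        groups[k].append(v)
    reduced = []
    for k, vs in groups.items():
        reduced.extend(reducer(k, vs))       # rho_i: one key's subset -> new pairs
    return reduced

# Illustrative mapper/reducer pair: word counting, with U_0 a single pair.
def word_mapper(doc_id, text):
    return [(w, 1) for w in text.split()]

def count_reducer(word, ones):
    return [(word, sum(ones))]

U1 = mr_round([(0, "a rose is a rose")], word_mapper, count_reducer)
# dict(U1) == {"a": 2, "rose": 2, "is": 1}
```

Chaining R such rounds, with the output multiset of one round fed as input to the next, gives the sequence P = μ_1 ρ_1 … μ_R ρ_R.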

Distributed Implementation A key is associated with a processor, and each pair with a given key is processed by the same node. Several keys can be associated with the same node in order to lower the scale of the distributed-memory system of nodes involved in the Web computer network.

Input/Output Phases The sequence P = μ_1 ρ_1 … μ_R ρ_R does not include the I/O phases (μ_1 can be seen as a second step of the input phase). Extend P to μ_0 μ_1 ρ_1 … μ_R ρ_R μ_(R+1) ρ_(R+1): μ_0 μ_1 is the input phase and μ_(R+1) ρ_(R+1) is the output phase.

Input Phase There is a unique (key, value) pair input to μ_0, where the value component is the input data. μ_0 distributes the data, generating U_0. μ_1 distributes the data further or is the identity function.

Output Phase μ_(R+1) maps the multiset U_R to a multiset where all the pairs have the same key k. ρ_(R+1) reduces such a multiset to the unique pair (k, y), where y is the output data.

Complexity Requirements - R is polylogarithmic in the input size - sublinear number of nodes - sublinear work space for each node - each mapper runs in polynomial time - each reducer runs in polynomial time (Karloff, Suri and Vassilvitskii, A Model of Computation for MapReduce, SODA 2010)

Stronger Requirements R is constant (< 10). The total running time of the mappers and reducers, multiplied by the number of processors, must equal the sequential running time. The MR implementations of standard lossless data compression satisfy these stronger requirements with R < 3 and guarantee a linear speed-up.

Lossless Data Compression Standard lossless data compression applications are based on Lempel-Ziv methods (Zip compressors, Unix Compress and DOS Stuffit). The Zip family is based on sliding-window (SW) compression. Unix Compress and DOS Stuffit implement the Lempel, Ziv and Welch method (LZW compression).

SW Factorization and Compression SW factorization: string S = f_1 f_2 … f_k, where f_i is either the longest substring occurring previously in f_1 f_2 … f_i or an alphabet character. SW compression is obtained by encoding f_i with a pointer to a previous copy. Bounded-memory factorization: f_i is determined by the last w characters (w = 32K in the Zip compressor).
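A minimal sequential sketch of the bounded-memory SW factorization, assuming a naive quadratic search for the longest previous occurrence (real Zip compressors use hash chains and also encode the pointers, both omitted here):

```python
def sw_factorize(s, w):
    """Split s into f_1 ... f_k, where each f_i is the longest substring
    occurring in the previous w characters, or a single alphabet character."""
    factors = []
    pos = 0
    while pos < len(s):
        best_len = 0
        window_start = max(0, pos - w)
        # Naive search over every match start inside the window.
        for start in range(window_start, pos):
            length = 0
            # The copy may extend past pos (self-referencing match).
            while (pos + length < len(s)
                   and s[start + length] == s[pos + length]):
                length += 1
            best_len = max(best_len, length)
        if best_len == 0:
            best_len = 1          # no previous occurrence: emit one character
        factors.append(s[pos:pos + best_len])
        pos += best_len
    return factors

print(sw_factorize("abababca", w=100))  # ['a', 'b', 'abab', 'c', 'a']
```

Note how the factor "abab" copies from position 0 even though the copy overlaps the current position, the self-referencing behaviour of SW compression.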

LZW Factorization and Compression Finite alphabet A, dictionary D initially equal to A, S in A*. LZW factorization: S = f_1 f_2 … f_k, where f_i is the longest match with the concatenation of a previous factor and the next character. LZW compression is obtained by encoding f_i with a pointer to a copy of such a concatenation previously added to D.
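The LZW factorization can be sketched as follows (a sequential illustration only; the pointer encoding and the dictionary-size bound of the next slide are omitted):

```python
def lzw_factorize(s, alphabet):
    """LZW factorization: f_i is the longest match with the dictionary D,
    initially equal to the alphabet; after each factor, the factor
    concatenated with the next input character is added to D."""
    d = set(alphabet)
    factors = []
    pos = 0
    while pos < len(s):
        end = pos + 1              # single characters are always in D
        # D is prefix-closed, so the scan can stop at the first miss.
        while end < len(s) and s[pos:end + 1] in d:
            end += 1
        factor = s[pos:end]
        factors.append(factor)
        if end < len(s):
            d.add(factor + s[end])  # previous factor + next character
        pos = end
    return factors

print(lzw_factorize("abababab", "ab"))  # ['a', 'b', 'ab', 'aba', 'b']
```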

Bounded Memory Factorizations LZW-FREEZE: f_i is the longest match with one of the first d factors (d = 64K). LZW-RESTART: f_i is determined by the last (i mod d) factors. To improve LZW-RESTART, the dictionary of d factors is frozen and the restart operation happens when the compression ratio starts deteriorating.

Distributed Memory SW Factorization The SW factorization process is applied independently to different blocks of the input string (no communication cost). A block of 300KB guarantees robustness (about ten times the window length of a Zip compressor). With 1000 processors, 300 megabytes can be compressed in parallel.

The MapReduce Implementation Input X; MR sequence μ_0 μ_1 ρ_1 μ_2 ρ_2; R = 1.
μ_0 maps (0, X) to U_0 = {(1, X_1), …, (m, X_m)}.
μ_1 is the identity function (U′_0 = U_0).
ρ_1 reduces U′_0 to U_1 = {(1, Y_1), …, (m, Y_m)}, where Y_i is the SW coding of X_i for 1 ≤ i ≤ m.
μ_2 maps U_1 to U′_1 = {(0, Y_1), …, (0, Y_m)}.
ρ_2 reduces U′_1 to (0, Y), where Y = Y_1 … Y_m (Y is the output).
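The R = 1 sequence can be simulated sequentially. In this sketch zlib's DEFLATE, itself a sliding-window coder, stands in for the SW coding of each block; the function name and the default block size are illustrative only (the slides use 300KB blocks).

```python
import zlib

def sw_mapreduce(X, block_size=300 * 1024):
    """Sequential simulation of the R = 1 MR sequence: split the input,
    compress each block independently, concatenate the codings."""
    # mu_0: the single pair (0, X) becomes one (key, block) pair per block.
    U0 = [(i + 1, X[i * block_size:(i + 1) * block_size])
          for i in range((len(X) + block_size - 1) // block_size)]
    # mu_1: identity.
    # rho_1: each key's block is compressed independently (no communication).
    U1 = [(i, zlib.compress(Xi)) for i, Xi in U0]
    # mu_2 + rho_2: map every coding to key 0, reduce by concatenating in order.
    Y = b"".join(Yi for _, Yi in sorted(U1))
    return (0, Y)
```

Since the blocks are compressed independently, decompression can likewise proceed block by block, which is what makes the scheme communication-free.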

Distributed Memory LZW Factorization If LZW factorization is applied in parallel, blocks of 600KB are needed to guarantee robustness (300KB to learn the dictionary and 300KB to compress statically). Static compression is extremely scalable. According to the scale of the network, different approaches can be described. The MapReduce implementations require two iterations (R = 2).

LZW Static Optimal Factorization The dictionary of d factors built by the LZW algorithm is prefix-closed, that is, every prefix of a factor is in the dictionary. Since the dictionary is not suffix-closed, the greedy factorization of the input string after the dictionary has been frozen is not optimal.

Greedy vs Optimal: Example
string: a b a^n
dictionary: a, b, ab, b a^k for 0 < k < n+1
greedy factorization: ab, a, a, …, a (n+1 factors)
optimal factorization: a, b a^n (2 factors)

Semi-Greedy (Optimal) Factorization
x_(t+1), x_(t+2), … is the suffix of the input string after the dictionary has been frozen.
j := t; i := t
repeat forever
    for k := j+1 to i+1 compute h(k), the largest index such that x_k … x_h(k) is a match with the dictionary
    let k′ be such that h(k′) is maximum
    x_(j+1) … x_(k′−1) is a factor
    j := k′ − 1; i := h(k′)
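A runnable, 0-indexed sketch of the semi-greedy procedure over a frozen, prefix-closed dictionary (represented here simply as a Python set of strings; breaking ties in favour of the longer current factor is an implementation choice, not part of the slide's pseudocode):

```python
def longest_match_end(s, k, d):
    """h(k): exclusive end of the longest dictionary match starting at k.
    Assumes d is prefix-closed, so the scan can stop at the first miss."""
    end = k
    while end < len(s) and s[k:end + 1] in d:
        end += 1
    return end

def semi_greedy(s, d):
    """Cut the current factor at the position whose next match reaches
    furthest to the right; every candidate cut is a dictionary entry
    because d is prefix-closed."""
    factors, pos = [], 0
    while pos < len(s):
        i = max(longest_match_end(s, pos, d), pos + 1)  # greedy match end
        best_c, best_h = pos + 1, -1
        for c in range(pos + 1, i + 1):
            h = longest_match_end(s, c, d)
            if h >= best_h:        # on ties, prefer the longer current factor
                best_c, best_h = c, h
        factors.append(s[pos:best_c])
        pos = best_c
    return factors

def greedy(s, d):
    """Plain greedy parsing, for comparison."""
    factors, pos = [], 0
    while pos < len(s):
        end = max(longest_match_end(s, pos, d), pos + 1)
        factors.append(s[pos:end])
        pos = end
    return factors
```

On the a b a^n example of the previous slide, semi_greedy returns the two-factor parsing a, b a^n while greedy produces n+1 factors.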

Distributed Approximation Scheme Processors store blocks of length m(k+2), overlapping on 2m characters, where m is the maximum factor length. Each processor computes the boundary matches ending furthest to the right.

Distributed Approximation Scheme (2) Each processor applies the semi-greedy procedure from the first position of the match computed on the left boundary of its block to the position of the match on the right boundary. Stopping the parsing at the beginning of the match on the right boundary might create, for each block, one surplus phrase with respect to the optimal parsing of the string (this can happen when the last factor could be extended to cover the right boundary). Therefore, the approximation factor is (k+1)/k.

Distributed Approximation Scheme (3) If k ≥ 10, there is no relevant loss of compression effectiveness with respect to the optimal factorization. The approximation scheme is robust on large-scale and extreme distributed systems since the maximum factor length m is around 10. Greedy is comparable to optimal in practice. If k ≥ 100, there is no need to compute boundary matches.

The MapReduce Implementation MR sequence μ_0 μ_1 ρ_1 μ_2 ρ_2 μ_3 ρ_3; R = 2.
Input phase μ_0 μ_1 (input string X):
μ_0 maps (0, X) to U_0 = {(1, X_1), …, (m, X_m)}
- X_i = Y_i Z_i with |Y_i| = |Z_i| = 300K for 1 ≤ i ≤ m
- Z_i = B_i,1 … B_i,r with |B_i,j| = 1000 for 1 ≤ j ≤ r
μ_1 maps U_0 to U′_0 = {(i, Y_i), ((i,j), B_i,j)} for 1 ≤ i ≤ m, 1 ≤ j ≤ r

The MapReduce Implementation (2) MR sequence μ_0 μ_1 ρ_1 μ_2 ρ_2 μ_3 ρ_3. Computational phase ρ_1 μ_2 ρ_2, first step:
ρ_1 reduces U′_0 to U_1 = {(i, C_i), (i, D_i), ((i,j), B_i,j)} for 1 ≤ i ≤ m, 1 ≤ j ≤ r, where C_i is the LZW coding of Y_i and D_i is the dictionary learned during the coding process for 1 ≤ i ≤ m.

The MapReduce Implementation (3) MR sequence μ_0 μ_1 ρ_1 μ_2 ρ_2 μ_3 ρ_3. Computational phase ρ_1 μ_2 ρ_2, second step:
μ_2 maps U_1 to U′_1 = {(i, C_i), ((i,j), D_i), ((i,j), B_i,j)} for 1 ≤ i ≤ m, 1 ≤ j ≤ r.
ρ_2 reduces ((i,j), B_i,j) to ((i,j), C_i,j), with C_i,j the compressed form of B_i,j using D_i as a static dictionary, to produce U_2.

The MapReduce Implementation (4) MR sequence μ_0 μ_1 ρ_1 μ_2 ρ_2 μ_3 ρ_3. Output phase μ_3 ρ_3:
μ_3 maps U_2 to U′_2 = {(0, C_i), (0, C_i,j)} for 1 ≤ i ≤ m, 1 ≤ j ≤ r.
ρ_3 reduces U′_2 to (0, C), where C = C_1 C_1,1 … C_1,r … C_m C_m,1 … C_m,r is the output.

The Extended Star Network The central node distributes blocks of 600KB (the first half of 300KB to the adjacent processors and a sub-block of the second half to a leaf).

Robustness and Communication Low communication cost is required between the two phases of the computation. If the input phase broadcasts sub-blocks of size between 100 bytes and one kilobyte, the computation of boundary matches enhances robustness (extreme case, beyond large scale).

Beyond Large Scale Input phase μ_0 μ_1 (input string X):
μ_0 maps (0, X) to U_0 = {(1, X_1), …, (m, X_m)} with X_i = Y_i Z_i and Z_i = B_i,1 … B_i,r, where 100 ≤ |B_i,j| << 1000.
μ_1 maps U_0 to U′_0 = {(i, Y_i), ((i,j), B′_i,j)} for 1 ≤ i ≤ m, 1 ≤ j ≤ r, where B′_i,j extends B_i,j by overlapping B_i-1,j and B_i+1,j on m characters.

Compressing Large Size Files Scalability, robustness and communication are not an issue with SW and LZW compression of very large files (gigabytes or beyond). Bounding the dictionary of the LZW compressor with a relaxed version of the least-recently-used strategy has been proposed as the most efficient approach (De Agostino, ICIW 2013).

Future Work The main goal is the design of a robust and scalable approach to the lossless compression of files of arbitrary size. Such a goal can be achieved experimentally for a specific type of file by decreasing the dictionary size. A more ambitious project is the design of a new general procedure for distributed memory compression.