Report: Declarative Machine Learning on MapReduce (SystemML)



Jessica Falk, ETH-ID 11-947-512
May 28, 2014

1 Introduction

SystemML is a system for executing machine learning (ML) algorithms on Hadoop, a MapReduce environment. Driven by applications such as web search and web analytics, there is high demand for scalable implementations of ML algorithms that can process large datasets. Until now there have mostly been hand-tuned implementations, which are very difficult to integrate into a different environment. As MapReduce has grown more popular, interest has risen in combining ML algorithms with MapReduce to obtain scalable algorithms for processing large datasets. Hand-tuned implementations have two big drawbacks:

1. Each MapReduce job has to be hand-coded.

2. If the input or cluster size varies, the execution plan for the algorithm has to be hand-tuned again to achieve good performance.

Take, for example, matrix multiplication. There are multiple ways to implement it, such as replication-based matrix multiplication or cross-product-based matrix multiplication. As explained later in this report, which strategy is better depends on the matrices involved. Often the matrices change over time or depend on the input, which makes it very hard for the developer to always choose the best strategy. SystemML was developed to address such problems. It is a scalable declarative machine learning system that allows ML algorithms to be written in a higher-level language, thus freeing the developer from tasks like performance tuning and low-level implementation details.

This report briefly summarizes the paper SystemML: Declarative Machine Learning on MapReduce [3], first taking a quick look at MapReduce and Hadoop and then presenting SystemML. Lastly, some experimental results are shown to demonstrate the effect of the optimizations.

Figure 1: Matrix multiplication: (a) replication-based, (b) cross-product based

2 MapReduce

MapReduce is a programming model for processing large sets of data by distributing the work across clusters of machines. The processing consists of two phases [2]:

1. Map: The input consists of key-value pairs. These are partitioned and distributed among the worker nodes; each worker processes the input it received and returns its result as key-value pairs.

2. Reduce: The master node collects the results of the subproblems and combines them to obtain the result for the original problem.

Thus MapReduce can speed up the processing of large datasets immensely, provided the subproblems do not depend on each other, since it can then execute many tasks simultaneously.

3 Hadoop

Hadoop is an open-source framework inspired by Google's MapReduce and the Google File System. It enables the distributed processing of large data sets across clusters of computers [1]. It is designed to scale from single servers up to large numbers of machines, with each machine offering local computation and storage. Hadoop already implements the MapReduce model with the classes Map, Reduce and Combine, which can be configured to fit the given task.
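To make the two phases concrete, here is a minimal single-machine sketch of the MapReduce model (plain Python for illustration; the function names are our own, not Hadoop's API):

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_fn):
    # Apply the user's map function to every input key-value pair.
    out = []
    for key, value in records:
        out.extend(map_fn(key, value))
    return out

def reduce_phase(pairs, reduce_fn):
    # Group the intermediate pairs by key, then combine each group.
    pairs = sorted(pairs, key=itemgetter(0))
    return [reduce_fn(k, [v for _, v in group])
            for k, group in groupby(pairs, key=itemgetter(0))]

# Example: emit (word, 1) per word, then sum the counts per word.
lines = [(0, "map reduce map"), (1, "reduce")]
mapped = map_phase(lines, lambda _, line: [(w, 1) for w in line.split()])
counts = reduce_phase(mapped, lambda word, ones: (word, sum(ones)))
# counts == [("map", 2), ("reduce", 2)]
```

A real cluster would run many map and reduce workers concurrently and shuffle the grouped pairs over the network; the grouping-by-key step here plays the role of that shuffle.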

4 Implementation of SystemML

SystemML consists of four components:

4.1 Declarative Machine learning Language (DML)

Algorithms in SystemML are written in the Declarative Machine learning Language (DML). DML provides mathematical and linear-algebraic primitives on matrices as well as control constructs, which makes it easy to express algorithms such as PCA or PageRank. SystemML breaks a DML script into smaller units called statement blocks, which are evaluated sequentially. DML supports two data types: matrices and scalars. Scalars and matrix cells can be of type integer, double, string or logical. The following constructs are currently supported in DML:

1. Input/Output: ReadMM and WriteMM read and write matrices from and to files.

2. Control Structures: if, while and for statements can be used as control structures.

3. Assignment: Each assignment consists of an expression and the variable to which the result is assigned. Assignments can be used for both scalars and matrices.

4. Functions: Users can also define their own methods using the syntax function (arglist) body, where arglist is a set of formal input and output arguments and body is a group of valid DML statements.

To generate a parsed representation that can be used by the next component, the program is analysed in the following steps:

1. Type Assignment: The data type of each variable in the DML script is determined. Since ReadMM and WriteMM operate only on matrices, their arguments can immediately be assigned the matrix type. Furthermore, the right-hand side of an assignment determines the type of the variable on the left-hand side.

2. Statement Block Identification: A statement block consists only of consecutive assignment, ReadMM and WriteMM statements, as these operations can be optimized collectively. Whenever a control structure appears, the program is divided into separate statement blocks.

3. Live Variable Analysis: This achieves two things. First, each use of a variable is connected with the preceding write of that variable across all evaluation paths. Second, each statement block is analysed for the variables it requires from previous statement blocks and the variables it outputs for later ones. This enables optimizations such as eliminating dead assignments and constant folding.

4.2 High-level Operator Component (HOP)

Table I: Example hops [3]

HOP type         Notation                Semantics
Binary           b(op): X, Y             for each i,j: op(x_{i,j}, y_{i,j}), op is +, -, *, /, ...
Unary            u(op): X                for each i,j: op(x_{i,j}), op is log, sin, ...
AggregateUnary   au(aggop, dim): X       apply aggop along dim, where dim is row (row-wise), col (column-wise) or all (whole matrix)
AggregateBinary  ab(aggop, op): X, Y     for each i,j: aggop over op(x_{i,k}, y_{k,j}) for all k
Reorg            r(op): X                reorganize the elements of a matrix, e.g. transpose (op = t)
Data             data(op): X             read (op = r) or write (op = w) a matrix

The HOP component takes the parsed representation as input. Each statement block is analysed and the best execution plan is chosen. Every plan is represented as a directed acyclic graph (HOP-Dag) of basic operations over matrices and scalars, called hops; each statement block yields one HOP-Dag. What exactly does a hop represent? Each hop takes at least one input, performs an operation on it, and produces an output that serves as input for subsequent hops. See Table I for examples. Several optimizations, such as algebraic rewrites and the choice of physical representations for intermediate matrices, are applied in this step.

4.3 Low-level Operator Component (LOP)

Next, the HOP-Dags are translated into low-level physical plans (LOP-Dags). These are constructed by processing each HOP-Dag bottom-up; during this process each hop is translated into one or more lops. Lops are similar to hops (see Table II for examples); they represent the basic operations in a MapReduce environment. As stated in the MapReduce section, key-value pairs are always used as input and output, and these pairs are grouped together based on their keys.

Figure 2: Transforming a HOP-Dag into a LOP-Dag [3]. (The example divides two matrices cell-wise: the hop b(/) over inputs C and D becomes the lops data(r), group and binary(/), with keys (i,j) and values c_{i,j}, d_{i,j}.)
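To illustrate what a HOP-Dag computes, here is a minimal sketch (plain Python, illustrative only, not SystemML code) of the hops from Table I applied to the statement A = B * (C + D), where * is cell-wise as in DML:

```python
def b(op, x, y):
    # Binary hop b(op): apply op cell-wise to two matrices of equal shape.
    return [[op(u, v) for u, v in zip(rx, ry)] for rx, ry in zip(x, y)]

def au_sum(x):
    # AggregateUnary hop au(+, all): sum over the whole matrix.
    return sum(sum(row) for row in x)

B = [[1, 2], [3, 4]]
C = [[5, 6], [7, 8]]
D = [[1, 1], [1, 1]]

# HOP-Dag for A = B * (C + D): the b(+) hop feeds the b(*) hop.
tmp = b(lambda u, v: u + v, C, D)    # b(+): C + D
A = b(lambda u, v: u * v, B, tmp)    # b(*): B * (C + D)
total = au_sum(A)                    # au(+, all): aggregate all cells
# A == [[6, 14], [24, 36]], total == 80
```

In SystemML the same DAG is not evaluated directly like this; it is handed to the LOP component, which turns each hop into lops over key-value pairs.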

Figure 3: Piggybacking algorithm: (a) LOP-Dag for A = B * (C + D), (b) naive approach, (c) piggybacking

At the end, each group is passed on to the corresponding lop, which performs its operation. The LOP-Dag is then compiled into one or more MapReduce jobs. Assigning one lop per job is easy to implement, but performs poorly. Since SystemML aims to improve the performance of ML algorithms on MapReduce, the lops need to be grouped into jobs. Two properties of each lop are considered:

1. Location: Where does the lop need to execute? During the Map phase, the Reduce phase, or both?

2. Key Characteristics: Are the input or output keys required to be grouped? Does the lop generate different output keys?

Packing lops together this way keeps the number of data scans small; the technique is called piggybacking. This greedy algorithm works as follows: the nodes of the LOP-Dag are first sorted topologically and then partitioned into three lists according to their location: Map, MapAndReduce and Reduce. That way all lists are sorted and the operations of the different phases can be accessed quickly. The algorithm then iterates over the nodes and assigns each lop to its MapReduce job: first the lops of the Map phase, then the lops spanning the Map and Reduce phases, and lastly the lops of the Reduce phase. The runtime complexity is quadratic in the size of the LOP-Dag.

Now consider the statement A = B * (C + D), where A, B, C, D are matrices, and assume that SystemML chose the cross-product-based matrix multiplication. The resulting LOP-Dag is shown in figure 3 (a). A naive approach to constructing the MapReduce jobs is shown in figure 3 (b): each operation needs its own MapReduce job, i.e. one job for the addition and another two for the multiplication (as explained later, the cross-product-based matrix multiplication always needs two MapReduce jobs). Using the piggybacking algorithm, this can be reduced to a total of two MapReduce jobs, as shown in figure 3 (c): the first job computes the addition together with the first part of the matrix multiplication, and the second part of the matrix multiplication is done in the second job. By packing as many operations as possible into one MapReduce job, the piggybacking algorithm reduces the number of jobs needed. This can result in a huge speed-up, as fewer jobs have to be initialized.

Table II: Example lops [3]

LOP type    Description                                                Execution Location   Key Characteristics
data        input data source or output data sink, in key-value pairs  Map or Reduce        none
unary       operate on each value with an optional scalar              Map or Reduce        none
transform   transform each key                                         Map or Reduce        keys changed
group       group values by key                                        Map and Reduce       output keys grouped
binary      operate on two values with the same key                    Reduce               input keys grouped
aggregate   aggregate all values with the same key                     Reduce               input keys grouped
mmcj        cross-product computation in CPMM matrix multiplication    Map and Reduce       none
mmrj        RMM matrix multiplication                                  Map and Reduce       none

4.4 Runtime

The last component is the runtime component. We consider three main aspects of it:

1. The key-value representation of matrices.

2. The MR runtime that executes individual LOP-Dags over MapReduce.

3. The control module that manages the execution of all MapReduce jobs.

1) Matrices as Key-Value Pairs: SystemML divides matrices into blocks (submatrices). Each block is represented by a unique key-value pair: the key denotes the block id and the value contains all cell values of the block. Based on the local sparsity within each block, SystemML decides whether to store only the non-zero values (a sparse representation) or all values (a dense representation). That way the number of key-value pairs needed to represent the matrix can be reduced. Because of this use of local sparsity, the choice of algorithm for operating on blocks also matters: whenever a matrix operation takes two blocks as input, an algorithm has to be chosen. If both blocks are dense, an algorithm that operates on all values is preferable; if at least one block is sparse, an algorithm that operates only on the non-zero values is the better choice.

2) Generic MapReduce Job (G-MR): G-MR is the main execution engine in SystemML. It is instantiated by the piggybacking algorithm and executes its job in the MapReduce environment.

3) Control Module: It is responsible for managing the execution of all instantiated MapReduce jobs for the corresponding DML script. During execution, the runtime component applies several optimizations, such as dynamically deciding which execution plan to use for a lop based on various characteristics.

5 Matrix Multiplication Algorithms in SystemML

SystemML offers two execution plans for matrix multiplication: RMM and CPMM. In this section, this report takes a look at both execution plans. Assume we want to compute the product C of two matrices A and B. In blocked format this multiplication can be written as

C_{i,j} = Σ_k A_{i,k} B_{k,j}, with 0 <= i < M_b, 0 <= k < K_b, 0 <= j < N_b,

where A consists of M_b x K_b blocks and B of K_b x N_b blocks.

5.1 RMM

The replication-based matrix multiplication strategy needs exactly one MapReduce job. Each reducer computes at least one result block C_{i,j}. As each input block contributes to several result blocks of C, the input blocks have to be replicated and sent to every reducer that needs them.

5.2 CPMM with local aggregator

This strategy is based on the cross product. It can be represented as a LOP-Dag with three lops, mmcj, group and aggregate(+), spanning two MapReduce jobs. In the Map phase, the first MapReduce job reads the two input matrices A and B and groups the input blocks by the common key k. Each reducer then computes the cross products P^k_{i,j} = A_{i,k} B_{k,j}. The second MapReduce job reads the result of the first job and groups all P^k_{i,j} by the key (i,j). To optimize the performance, a local aggregator is used: if P^{k1}_{i,j} and P^{k2}_{i,j} are computed in the same reducer, these partial results are aggregated within the reducer instead of being output separately. This reduces the amount of data the second job has to read and shuffle. In the Reduce phase of the second job, the aggregate lop finally computes C_{i,j} = Σ_k P^k_{i,j}. Since the partial aggregate could be too large to fit into memory, SystemML implements a disk-based local aggregator backed by an in-memory buffer pool. As CPMM computes the partial results in a fixed order, LRU (least recently used) is the best choice of buffer-replacement policy for the disk-based local aggregator.

5.3 RMM vs CPMM

Both RMM and CPMM have their advantages and disadvantages, so let us compare their costs. RMM replicates each block of A N_b times and each block of B M_b times, so N_b|A| + M_b|B| data is shuffled in the job. The cost of RMM is therefore

cost(RMM) = shuffle(N_b|A| + M_b|B|) + IO_dfs(|A| + |B| + |C|),

where IO_dfs stands for the cost of the distributed file system IO. CPMM first reads the blocks of A and B and sends them to the Reduce phase, shuffling |A| + |B| data. For each k the cross product is computed and the local aggregator is applied within each reducer, so the output size of each reducer is bounded by |C|; with k reducers, the total intermediate result is bounded by k|C|. This data is read by the second job, shuffled again and forwarded to the reducers that produce the final result. This yields an upper bound on the cost of CPMM:

cost(CPMM) <= shuffle(|A| + |B| + k|C|) + IO_dfs(|A| + |B| + |C| + 2k|C|).

Comparing the two cost models shows that CPMM generally performs better if A and B are both very large, since RMM then has a huge shuffle overhead.
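The plan choice implied by these cost formulas can be sketched numerically (plain Python; treating shuffle and DFS IO bytes as equally weighted is a simplifying assumption for illustration, not SystemML's actual optimizer):

```python
def cost_rmm(A, B, C, Nb, Mb):
    # RMM: one job; each block of A is replicated Nb times, each of B Mb times.
    shuffle = Nb * A + Mb * B
    io = A + B + C
    return shuffle + io

def cost_cpmm(A, B, C, k):
    # CPMM upper bound: two jobs; each of the k reducers emits at most |C| data.
    shuffle = A + B + k * C
    io = A + B + C + 2 * k * C
    return shuffle + io

# |A|, |B|, |C| in GB. Both inputs large and heavily blocked: CPMM wins.
large = cost_rmm(A=100, B=100, C=10, Nb=50, Mb=50) > cost_cpmm(A=100, B=100, C=10, k=20)
# One input fits in a single block (Mb = Nb = 1): RMM wins.
small = cost_rmm(A=100, B=1, C=10, Nb=1, Mb=1) < cost_cpmm(A=100, B=1, C=10, k=20)
# large == True, small == True
```

The matrix sizes and reducer count here are made-up illustrations; the point is only that the replication factors N_b, M_b dominate RMM's cost for large blocked inputs, while the k|C| terms dominate CPMM's.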
RMM performs better than CPMM if A or B is small enough to fit into one block.

6 Experiments

Several experiments were performed to study the scalability and the performance of the optimizations in SystemML, using different data and Hadoop cluster sizes. GNMF, linear regression and PageRank were used as example ML algorithms.

6.1 Setup

Two different clusters were used to perform the experiments:

1. 40-core cluster: It uses 5 local machines as worker nodes, each with 8 cores and hyperthreading enabled. Each node runs 15 concurrent mappers and 10 concurrent reducers.

2. 100-core EC2 cluster: This cluster has 100 worker nodes. Each node runs 2 mappers and 1 reducer concurrently.

The data generator creates random matrices with uniformly distributed non-zero cells. All experiments use a block size of 1000 x 1000, except for the matrix blocking experiments. The local aggregator uses an in-memory buffer pool of 900 MB on the 40-core cluster and 500 MB on the 100-core EC2 cluster.

6.2 Scalability

For these experiments GNMF (Gaussian Non-Negative Matrix Factorization) was used as an example to demonstrate scalability on both clusters. The input V is a sparse matrix with a varying number of rows d and w = 100,000 columns. The goal is to compute dense matrices W of size d x t and H of size t x w with V ≈ WH and t = 10 (t is the number of topics). SystemML is compared against execution on a single machine and against the best known result:

1. Single machine: It executes quite efficiently for small inputs, but runs out of memory when the number of rows in V reaches 10 million. SystemML clearly outperforms it, as it continues to scale to very large inputs.

2. Best known result: The best known result [4] is based on a hand-coded MapReduce implementation of GNMF. It consists of 8 full MapReduce jobs and 2 map-only jobs, while the execution plan produced by SystemML consists of 10 full MapReduce jobs. The experiments show that the performance of SystemML relative to the hand-coded implementation improves significantly as the sparsity of the matrices increases. One reason is that SystemML uses only the block representation, while the hand-coded implementation uses cell, row-wise and column-wise representations. The other reason is that the hand-coded algorithm does not use a local aggregator for CPMM.

6.3 Optimizations

We now analyse four optimizations SystemML uses to achieve better performance.

1. RMM vs CPMM: Different test cases show that neither algorithm always outperforms the other. Therefore it is important for SystemML to choose the best algorithm for each individual case. If both matrices are large, CPMM is in general the better choice; if at least one of the matrices is small, SystemML should choose RMM. CPMM has the advantage of a higher degree of parallelism, which makes it better suited for large matrices.

2. Piggybacking: Without piggybacking, each lop would translate into a separate MapReduce job. Depending on whether piggybacking can pack several lops into one MapReduce job in which a single lop dominates the cost, the gain from this optimization can be significant or not.

3. Matrix Blocking: Especially for dense matrices, the block representation reduces storage space significantly compared to the cell representation. For sparse matrices, on the other hand, the space needed to store the blocks increases, as only a small fraction of the values per block is non-zero. See Table III for further details.

Table III: Comparison of different block sizes, with H being a sparse matrix [3]

Block Size      1000x1000  100x100  10x10  1x1
Execution time  117s       136s     3h     >5h
Size of V (GB)  1.5        1.9      4.8    3.0
Size of H (MB)  7.8        7.9      8.1    31.0

4. Local Aggregator for CPMM: The local aggregator reduces the size of the intermediate result, which leads to a slower increase of the running time compared to the algorithm without an aggregator.

7 Conclusion

In conclusion, this report has briefly presented SystemML, a system that greatly helps in developing large-scale ML algorithms. It first translates a machine learning algorithm into execution plans over MapReduce by evaluating different alternatives; the resulting lops are then translated into MapReduce jobs. The summarized experiments demonstrate the scalability of SystemML and the benefits of several optimization strategies such as blocking, piggybacking and local aggregation. All in all, the developers have achieved their goal of a declarative machine learning system that is optimized for linear algebraic operations and scales well. It is a very good system for processing largely independent sets of data. On the other hand, it does not scale well if the program contains many iterations or if the jobs are not independent.
In the first case, with many iterations, the data has to be reloaded and reprocessed in each iteration, and an extra job is even needed to evaluate the loop condition, which adds a huge overhead per iteration. The second case, dependent jobs, follows from the way MapReduce works: MapReduce improves performance greatly when jobs are independent, because it can then process all jobs concurrently. If the jobs depend on each other, performance is lost because MapReduce has to wait for a job to finish before starting the next one.

References

[1] Apache Hadoop. http://hadoop.apache.org/.

[2] MapReduce. http://en.wikipedia.org/wiki/MapReduce.

[3] SystemML: Declarative Machine Learning on MapReduce. http://people.cs.uchicago.edu/~vikass/systemml.pdf.

[4] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. In WWW, 2010.