Report: Declarative Machine Learning on MapReduce (SystemML)
Jessica Falk, ETH-ID 11-947-512
May 28, 2014

1 Introduction

SystemML is a system for executing machine learning (ML) algorithms on Hadoop, a MapReduce environment. Applications such as web search and web analytics have created a high demand for scalable implementations of ML algorithms that can process large datasets. Until now, such implementations have been hand-tuned, which makes them very difficult to port to a different environment. As MapReduce has become more popular, there has been great interest in combining ML algorithms with MapReduce to obtain scalable implementations for large datasets. This interest comes from the fact that hand-tuned implementations have two big drawbacks:

1. Each MapReduce job has to be hand-coded.
2. If the input and cluster sizes vary, the execution plan for the algorithm has to be hand-tuned again to achieve good performance.

Take matrix multiplication as an example. There are multiple ways to compute it, for instance the replication-based and the cross-product based matrix multiplication (Figure 1). As explained later in this report, which strategy is better depends on the matrices. Often the matrices change over time or depend on the input, which makes it very hard for the developer to always choose the best strategy.

SystemML has been developed to address such problems. It is a scalable, declarative machine learning system that lets the developer write ML algorithms in a higher-level language, thus freeing the developer from tasks like performance tuning and from low-level implementation details.

This report summarizes the paper SystemML: Declarative Machine Learning on MapReduce [3]. It first takes a quick look at MapReduce and Hadoop and then presents SystemML. Lastly, some experimental results are shown to demonstrate the effect of the optimizations on performance.
Figure 1: Matrix multiplication. (a) replication-based (b) cross-product based

2 MapReduce

MapReduce is a programming model for processing large datasets by distributing the work across a cluster of machines. It consists of two phases [2]:

1. Map: The input is given as key-value pairs, which are divided and distributed among the worker nodes. Each worker processes the input it received and returns its result as key-value pairs.
2. Reduce: The intermediate key-value pairs are grouped by key and passed to the reducers, which combine the results of the subproblems into the result for the original problem.

MapReduce can therefore speed up the processing of large datasets immensely when the subproblems do not depend on each other, as many of them can be processed simultaneously.

3 Hadoop

Hadoop is an open-source framework for the distributed processing of large datasets across clusters of computers [1]. It implements the MapReduce model on top of a distributed file system (HDFS) that is modelled on the Google File System. It is designed to scale from single servers to large numbers of machines, each of which offers local computation and storage. Hadoop exposes MapReduce through map, reduce and combine functions, which can be configured to fit the given task.
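The two phases described in Section 2 can be illustrated with a small single-process simulation. This is a minimal sketch, not Hadoop code: the function names `map_phase`, `shuffle` and `reduce_phase` as well as the word-count task are assumptions chosen purely for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every input key-value pair and collect its output pairs."""
    intermediate = []
    for key, value in records:
        intermediate.extend(mapper(key, value))
    return intermediate


def shuffle(pairs):
    """Group intermediate values by key, as the framework does between the phases."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Combine all values that share a key into one result per key."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical word-count job used only as an example task.
def count_mapper(_doc_id, text):
    return [(word, 1) for word in text.split()]


def count_reducer(_word, counts):
    return sum(counts)


records = [(0, "map reduce map"), (1, "reduce map")]
word_counts = reduce_phase(shuffle(map_phase(records, count_mapper)), count_reducer)
```

Because each call of `count_mapper` depends only on its own record, the map calls could run on different workers, which is the independence property the section refers to.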
4 Implementation of SystemML

SystemML consists of four components:

4.1 Declarative Machine learning Language (DML)

Algorithms in SystemML are written in the Declarative Machine learning Language (DML). DML provides mathematical and linear algebraic primitives on matrices as well as control constructs, which makes it easy to express algorithms such as PCA and PageRank. SystemML breaks a DML script into smaller units called statement blocks, which are evaluated sequentially.

DML supports two data types: matrices and scalars. Scalars and matrix cells can be of type integer, double, string or logical. The following constructs are currently supported in DML:

1. Input/Output: ReadMM and WriteMM are used to read and write matrices from and to files.
2. Control Structures: if, while and for statements can be used as control structures.
3. Assignment: Each assignment consists of an expression and the variable to which its result is assigned. Assignments can be used for both scalars and matrices.
4. User-Defined Functions: Users can define their own functions using the following syntax:
   function (arglist) body
   The arglist is a set of formal input and output arguments, and the body is a group of valid DML statements.

To generate a parsed representation that can be used by the next component, the program is analysed in the following steps:

1. Type Assignment: The data type of each variable in the DML script has to be determined. ReadMM and WriteMM are only used on matrices, so the corresponding variables can immediately be typed as matrices. Furthermore, the right-hand side of an assignment determines the type of the variable on the left-hand side.
2. Statement Block Identification: A statement block consists only of consecutive assignment, ReadMM and WriteMM statements, as those operations can be optimized collectively.
   If a control structure appears, the program has to be divided into several statement blocks.
3. Live Variable Analysis: This step achieves two things. First, it connects each use of a variable with the preceding write of that variable across all evaluation paths.
   Second, it determines for each statement block which variables are required from previous statement blocks and which variables are output by the current one. This allows optimizations such as eliminating dead assignments and constant folding.

4.2 High-level Operator Component (HOP)

The parsed representation is the input of the HOP component. Each statement block is analysed and the best execution plan is chosen. Every plan is then represented as a directed acyclic graph (HOP-Dag) of basic operations (called hops) over matrices and scalars, so each statement block has one HOP-Dag.

What exactly does a hop represent? Each hop takes at least one input, performs an operation on it and produces an output that is consumed by at least one following hop. Table I lists some hop types. Several optimizations are made in this step, such as algebraic rewrites and the choice of physical representations for intermediate matrices.

Table I: Example Hops [3]

| HOP Type | Notation | Semantics |
|---|---|---|
| Binary | b(op) : X, Y | $\forall x_{i,j}, y_{i,j}$: do op($x_{i,j}$, $y_{i,j}$), where op is +, -, *, /, ... |
| Unary | u(op) : X | $\forall x_{i,j}$: do op($x_{i,j}$), where op is log, sin, ... |
| AggregateUnary | au(aggop, dim) : X | do aggop on dim, where dim is row (row-wise), col (col-wise) or all (whole matrix) |
| AggregateBinary | ab(aggop, op) : X, Y | $\forall i, j$: do aggop over all $k$ of op($x_{i,k}$, $y_{k,j}$) |
| Reorg | r(op) : X | reorganize elements in a matrix, e.g. transpose (op = T) |
| Data | data(op) : X | read (op = r) or write (op = w) a matrix |

4.3 Low-level Operator Component (LOP)

The HOP-Dags are now translated into low-level physical plans (LOP-Dags). A LOP-Dag is constructed by processing the HOP-Dag bottom-up. During this process each hop is translated into at least one lop. Lops are similar to hops, as seen in Table II. They represent the basic operations in a MapReduce environment. As stated in the MapReduce section, key-value pairs are always used as input and output. To operate on those key-value pairs, they are grouped together based on their keys.

Figure 2: Transforming a HOP-Dag into a LOP-Dag [3]. The HOP-Dag data(r): C, data(r): D, b(/) becomes the LOP-Dag data C, data D, group, binary(/): the pairs ((i,j), c_{i,j}) and ((i,j), d_{i,j}) are grouped into ((i,j), {c_{i,j}, d_{i,j}}) and then reduced to ((i,j), c_{i,j}/d_{i,j}).
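The lop chain of Figure 2 (data, group, binary) for an element-wise division C / D can be mimicked on plain Python dictionaries. This is only a sketch under assumptions: the helper names `data_lop`, `group_lop` and `binary_lop` are hypothetical, matrices are stored cell-wise as `{(i, j): value}`, and everything runs in one process instead of a MapReduce job.

```python
from collections import defaultdict
import operator


def data_lop(matrix, tag):
    """Emit one key-value pair per cell: key (i, j), value (source tag, cell value)."""
    return [((i, j), (tag, value)) for (i, j), value in matrix.items()]


def group_lop(*streams):
    """Group the pairs of all input streams by their key (i, j)."""
    groups = defaultdict(dict)
    for stream in streams:
        for key, (tag, value) in stream:
            groups[key][tag] = value
    return groups


def binary_lop(groups, op, left, right):
    """Apply op to the two values that share a key."""
    return {key: op(values[left], values[right]) for key, values in groups.items()}


# Two tiny example matrices in cell representation (made-up values).
C = {(0, 0): 6.0, (0, 1): 9.0}
D = {(0, 0): 3.0, (0, 1): 2.0}

grouped = group_lop(data_lop(C, "C"), data_lop(D, "D"))
result = binary_lop(grouped, operator.truediv, "C", "D")
```

The grouping step is what makes the binary lop possible: only after both cells with key (i, j) have been brought together can the division be applied.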
At the end, each group is passed on to the corresponding lop, which performs its operation.

The LOP-Dag is compiled into at least one MapReduce job. Using one job per lop is easy to implement but performs poorly. Since SystemML aims to improve the performance of ML algorithms on MapReduce, it has to decide how to group lops together. Two properties of the lops have to be considered:

1. Location: Where does the lop have to be executed? During the Map phase, the Reduce phase or both phases?
2. Key Characteristics: Are the input or output keys required to be grouped? Does the lop generate different output keys?

Lops are packed into as few jobs as these properties allow, so that the number of data scans is kept small. This approach is called piggybacking. The greedy algorithm works as follows: the nodes of the LOP-Dag are first sorted in topological order and then partitioned into three lists: Map, Reduce and MapAndReduce. That way all lists are sorted and the operations of the different phases can be accessed quickly. The algorithm then iterates through the nodes to assign every lop to its MapReduce job. It first assigns the lops of the Map phase, then the lops of the Map and Reduce phase and lastly the lops of the Reduce phase. The runtime complexity is quadratic in the size of the LOP-Dag.

Figure 3: Piggybacking algorithm. (a) LOP-Dag (b) naive approach (c) piggybacking

Now let us look at the statement A = B * (C + D), where A, B, C and D are matrices. Let us assume that SystemML chose the cross-product based matrix multiplication. The LOP-Dag created for this case is shown in Figure 3 (a). A naive approach to constructing the MapReduce jobs is shown in Figure 3 (b): every operation gets its own MapReduce job. This means we get one job for the addition and another two for the multiplication.
As explained later, the cross-product based matrix multiplication always needs two MapReduce jobs. Using the piggybacking algorithm, the total can be reduced to two MapReduce jobs, as shown in Figure 3 (c). The first job computes the addition and the first part of the matrix multiplication; the second job computes the second part of the multiplication. By packing as many operations as possible into one MapReduce job, the piggybacking algorithm reduces the number of jobs needed. This can result in a considerable speed-up, as fewer jobs have to be initialized.

Table II: Example Lops [3]

| LOP type | Description | Execution Location | Key Characteristics |
|---|---|---|---|
| data | input data source or output data sink, in key-value pairs | Map or Reduce | none |
| unary | operate on each value with an optional scalar | Map or Reduce | none |
| transform | transform each key | Map or Reduce | keys changed |
| group | groups values by the key | Map and Reduce | output keys grouped |
| binary | operate on two values with the same key | Reduce | input keys grouped |
| aggregate | aggregate all values with the same key | Reduce | input keys grouped |
| mmcj | cross product computation in the CPMM matrix multiplication | Map and Reduce | none |
| mmrj | RMM matrix multiplication | Map and Reduce | none |

4.4 Runtime

The last component is the runtime component. In the following we consider its three main aspects:

1. The key-value representation of matrices
2. The MR runtime to execute individual LOP-Dags over MapReduce
3. The control module to manage the execution of all MapReduce jobs

1) Matrices as Key-Value Pairs: SystemML divides matrices into blocks (submatrices). Each block is represented as a key-value pair: the key denotes the block id and the value consists of all cell values of the block. The local sparsity within each block is used to decide whether to store only the non-zero values (a sparse representation) or all values (a dense representation).
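A minimal sketch of such a blocked key-value representation is given here. It rests on assumptions that are not taken from SystemML: the function name `to_blocks`, the key layout `(block_row, block_col)`, the representation of sparse blocks as dictionaries, and the sparsity threshold of 0.5 are all chosen for illustration.

```python
def to_blocks(matrix, block_size, sparse_threshold=0.5):
    """Split a matrix (list of rows) into key-value pairs, one per block.

    Key: (block_row, block_col). Value: ("sparse", {(i, j): v}) holding only the
    non-zero cells if the fraction of non-zero cells is below sparse_threshold,
    otherwise ("dense", list of rows) holding all cells.
    """
    blocks = {}
    n_rows, n_cols = len(matrix), len(matrix[0])
    for r in range(0, n_rows, block_size):
        for c in range(0, n_cols, block_size):
            block = [row[c:c + block_size] for row in matrix[r:r + block_size]]
            n_cells = sum(len(row) for row in block)
            nonzero = {
                (i, j): value
                for i, row in enumerate(block)
                for j, value in enumerate(row)
                if value != 0
            }
            key = (r // block_size, c // block_size)
            if len(nonzero) / n_cells < sparse_threshold:
                blocks[key] = ("sparse", nonzero)
            else:
                blocks[key] = ("dense", block)
    return blocks


# Example matrix with one dense block and three (almost) empty blocks.
M = [
    [1.0, 2.0, 0.0, 0.0],
    [3.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 5.0],
]
blocks = to_blocks(M, block_size=2)
```

In this example the 16 cells are represented by 4 key-value pairs instead of 16 cell-wise pairs, and only the upper-left block stores all of its values.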
Blocking thereby reduces the number of key-value pairs needed to represent the matrix. Because local sparsity is exploited, we also have to consider how to operate on those blocks. Whenever a matrix operation takes two blocks as input, an algorithm has to be chosen. If both blocks are dense, it is better to choose an algorithm that operates on all values. If, on the other hand, at least one block is sparse, it is better to choose an algorithm that operates only on non-zero values.

2) Generic MapReduce Job (G-MR): G-MR is the main execution engine in SystemML. It is instantiated by the piggybacking algorithm and executes its job in the MapReduce environment.

3) Control Module: The control module is responsible for managing the execution of all MapReduce jobs instantiated for the corresponding DML script.

During the execution of the runtime component, several optimizations are made, such as dynamically deciding which execution plans to use for lops based on several characteristics.

5 Matrix Multiplication Algorithms in SystemML

SystemML offers two execution plans for matrix multiplication: RMM and CPMM. This section takes a look at both. Let us assume we want to calculate the product of two matrices A and B, with the result being matrix C. In blocked format this multiplication can be written as

$$C_{i,j} = \sum_{k} A_{i,k} B_{k,j}, \quad i < M_b,\ k < K_b,\ j < N_b,$$

where A consists of $M_b \times K_b$ blocks and B of $K_b \times N_b$ blocks.

5.1 RMM

The replication-based matrix multiplication strategy needs exactly one MapReduce job. In this strategy, each reducer is responsible for computing at least one result block $C_{i,j}$. As each input block can contribute to several result blocks in C, the input blocks have to be replicated and given as input to every reducer that needs them.

5.2 CPMM with local aggregator

This strategy is an algorithm based on the cross product.
It can be represented as a LOP-Dag using three lops and two MapReduce jobs:

$$\text{mmcj} \rightarrow \text{group} \rightarrow \text{aggregate}(\Sigma)$$

In the Map phase, the first MapReduce job reads the two input matrices A and B and groups the input blocks by the common key k. Afterwards the reducer uses the cross product to compute $P^{k}_{i,j} = A_{i,k} B_{k,j}$. The second MapReduce job reads the result of the first MapReduce job and groups all $P^{k}_{i,j}$ by the key (i,j). To optimize the performance, a local aggregator is used: if $P^{k}_{i,j}$ and $P^{k'}_{i,j}$ are computed in the same reducer, those partial results can be aggregated within the reducer instead of being output separately. This reduces the amount of intermediate data passed on to the second job. In the Reduce phase of the second job the aggregate lop finally computes $C_{i,j} = \sum_{k} P^{k}_{i,j}$.

As the partial aggregate could be too large to fit into memory, SystemML implements a disk-based local aggregator that uses an in-memory buffer pool. As CPMM computes the result in a certain order, the best buffer replacement policy for the disk-based local aggregator is LRU (least recently used).

5.3 RMM vs CPMM

Having looked at RMM and CPMM, we see that both have their advantages and disadvantages. We now compare their performance.

RMM replicates each block of A $N_b$ times and each block of B $M_b$ times. Therefore $N_b|A| + M_b|B|$ data is shuffled in the job. The cost of RMM is now easy to state:

$$\mathrm{cost}(\mathrm{RMM}) = \mathrm{shuffle}(N_b|A| + M_b|B|) + IO_{dfs}(|A| + |B| + |C|),$$

where $IO_{dfs}$ stands for the cost of the distributed file system IO.

CPMM first reads the blocks of A and B and sends them on to the Reduce phase, which shuffles data of size $|A| + |B|$. Next, the cross product is calculated for each k and the local aggregator is applied within each reducer. Thus the result size of each reducer is bounded by $|C|$. If there are $r$ reducers, the total result size is bounded by $r|C|$. This data is used by the second job, which means that it is shuffled again and then forwarded to the reducers to produce the end result. With this knowledge we can give an upper bound for the cost of CPMM:

$$\mathrm{cost}(\mathrm{CPMM}) \le \mathrm{shuffle}(|A| + |B| + r|C|) + IO_{dfs}(|A| + |B| + |C| + 2r|C|)$$

In conclusion, the comparison of the two cost models shows that CPMM generally performs better if A and B are both very large, since RMM has a huge shuffle overhead.
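The two cost formulas can be compared numerically. The sketch below is an illustration only: it assumes that shuffle and DFS I/O costs are linear in the data volume, with hypothetical unit costs `shuffle_cost` and `io_cost`, and it measures all sizes in numbers of blocks. None of the example numbers come from the paper, and since the CPMM value is only an upper bound, the comparison is indicative rather than exact.

```python
def cost_rmm(size_a, size_b, size_c, m_b, n_b, shuffle_cost=1.0, io_cost=1.0):
    """cost(RMM) = shuffle(N_b*|A| + M_b*|B|) + IO_dfs(|A| + |B| + |C|)."""
    shuffled = n_b * size_a + m_b * size_b
    return shuffle_cost * shuffled + io_cost * (size_a + size_b + size_c)


def cost_cpmm_bound(size_a, size_b, size_c, reducers, shuffle_cost=1.0, io_cost=1.0):
    """Upper bound: shuffle(|A| + |B| + reducers*|C|)
    + IO_dfs(|A| + |B| + |C| + 2*reducers*|C|)."""
    shuffled = size_a + size_b + reducers * size_c
    io = size_a + size_b + size_c + 2 * reducers * size_c
    return shuffle_cost * shuffled + io_cost * io


# Case 1 (assumed sizes): A, B and C each consist of 100 x 100 blocks.
large_rmm = cost_rmm(10_000, 10_000, 10_000, m_b=100, n_b=100)
large_cpmm = cost_cpmm_bound(10_000, 10_000, 10_000, reducers=10)

# Case 2 (assumed sizes): B is a single block (K_b = N_b = 1), A is a column of
# 10,000 blocks. With only one key k, a single reducer receives data.
small_rmm = cost_rmm(10_000, 1, 10_000, m_b=10_000, n_b=1)
small_cpmm = cost_cpmm_bound(10_000, 1, 10_000, reducers=1)
```

With these assumed unit costs, `large_cpmm` is well below `large_rmm` because RMM shuffles every block 100 times, while in the second case `small_rmm` is below `small_cpmm` because nothing is replicated excessively and RMM avoids the second job.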
RMM performs better than CPMM if A or B is small enough to fit into one block.

6 Experiments

Several experiments were performed to study the scalability of SystemML and the effect of its optimizations, using different data sizes and Hadoop cluster sizes. GNMF, linear regression and PageRank were used as ML algorithms.

6.1 Setup

Two different clusters were used to perform the experiments:

1. 40-core cluster: It uses 5 local machines as worker nodes, each with 8 cores and hyperthreading enabled. Each node runs 15 concurrent mappers and 10 concurrent reducers.
2. 100-core EC2 cluster: This cluster has 100 worker nodes. Each node runs 2 mappers and 1 reducer concurrently.

The data generator creates random matrices with uniformly distributed non-zero cells. All matrices are stored in blocks of size 1000 x 1000, except in the matrix blocking experiments. The local aggregator is an in-memory buffer pool of size 900 MB on the 40-core cluster and 500 MB on the 100-core EC2 cluster.

6.2 Scalability

For these experiments GNMF (Gaussian Non-negative Matrix Factorization) was used as an example to demonstrate scalability on both clusters. The input V is a sparse matrix with a varying number d of rows and w = 100,000 columns. The goal is to compute dense matrices W of size d x t and H of size t x w with $V \approx WH$ and t = 10 (t is the number of topics). SystemML is compared against execution on a single machine and against the best known result:

1. Single machine: It executes quite efficiently for small sizes. If the number of rows in V increases to 10 million, it runs out of memory. SystemML clearly outperforms it, as it continues to scale for very large numbers of rows.
2. Best known result: The best known result [4] is based on a hand-coded MapReduce implementation of GNMF. It contains 8 full MapReduce jobs and 2 map-only jobs, while the execution plan of SystemML consists of 10 full MapReduce jobs. The experiments have shown that the performance of SystemML relative to the hand-coded implementation increases significantly as the sparsity of the matrices increases. One reason is that SystemML only uses the block representation, while the hand-coded implementation uses cell, row-wise and column-wise representations. The other reason is that the hand-coded algorithm does not use a local aggregator to perform CPMM.

6.3 Optimizations

We analyse four optimizations that SystemML uses to gain better performance.

1. RMM vs CPMM: Different test cases show that neither algorithm always outperforms the other.
   Therefore it is important for SystemML to choose the best algorithm for each individual case. If both matrices are large, CPMM is in general the better solution. If at least one of the matrices is small, SystemML should choose RMM. CPMM has the advantage of a higher degree of parallelism, which makes it better for large matrices.
2. Piggybacking: Without piggybacking, each lop would translate into a separate MapReduce job. How much the optimization gains depends on the lops that are packed into one MapReduce job, in particular on whether a single lop dominates the cost of that job.
3. Matrix Blocking: Especially for dense matrices, the block representation significantly reduces storage space compared to the cell representation. For sparse matrices, on the other hand, the space used to store the blocks increases, as there is only a small fraction of non-zero values per block. See Table III for further details.
4. Local Aggregator for CPMM: The local aggregator reduces the size of the intermediate result. As a consequence, the running time increases more slowly than for the algorithm without an aggregator.

Table III: Comparison of different block sizes, with V being a sparse and H a dense matrix [3]

| Block Size | 1000x1000 | 100x100 | 10x10 | 1x1 |
|---|---|---|---|---|
| Execution time | 117s | 136s | 3h | >5h |
| Size of V (GB) | 1.5 | 1.9 | 4.8 | 3.0 |
| Size of H (MB) | 7.8 | 7.9 | 8.1 | 31.0 |

7 Conclusion

This report has briefly presented SystemML, a system that greatly helps to develop large-scale ML algorithms. It first translates the machine learning algorithms into execution plans over MapReduce by evaluating different execution plans. At the end, the created lops are translated into MapReduce jobs. The experiments summarized here demonstrate the scalability of SystemML and the benefits of several optimization strategies such as blocking, piggybacking and local aggregation.

All in all, the developers have achieved their goal of a declarative machine learning system that is optimized for linear algebraic operations and scales well. It is a very good system for processing largely independent sets of data. On the other hand, it does not scale well if the program contains many iterations or if the jobs are not independent.
In the first case, data has to be reloaded and reprocessed for each iteration, and an extra job is even needed to evaluate the loop condition. This causes a large overhead per iteration. The second case follows from the way MapReduce works. MapReduce improves performance considerably if the jobs are independent, because it can then process all jobs concurrently. If the jobs depend on each other, performance is lost, as MapReduce has to wait for a job to finish before starting the next one.
References

[1] Apache Hadoop. http://hadoop.apache.org/e.
[2] MapReduce. http://en.wikipedia.org/wiki/mapreduce.
[3] SystemML: Declarative Machine Learning on MapReduce. http://people.cs.uchicago.edu/~vikass/systemml.pdf.
[4] C. Liu, H. chih Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. WWW, 2010.