Report: Declarative Machine Learning on MapReduce (SystemML)


Jessica Falk, ETHID
May 28

1 Introduction

SystemML is a system for executing machine learning (ML) algorithms in Hadoop, a MapReduce environment. Driven by web search, web analytics and similar applications, there has been high demand for scalable implementations of ML algorithms that process large datasets. Until now there have mostly been hand-tuned implementations, which are very difficult to integrate into a different environment. As MapReduce has grown more popular, interest has risen in combining ML algorithms with MapReduce to obtain scalable algorithms for processing large datasets. Hand-tuned implementations have two big drawbacks:

1. Each MapReduce job has to be hand-coded.
2. If the input and cluster sizes vary, the execution plan for the algorithm has to be hand-tuned to obtain better performance.

Take, for example, matrix multiplication. There are multiple ways to implement it, for instance the replication-based or the cross-product-based matrix multiplication. As explained later in this report, which strategy is better depends on the matrices. Often the matrices change over time or depend on the input, which makes it hard for the developer to always choose the best strategy.

SystemML was developed to address such problems. It is a scalable declarative machine learning system that allows ML algorithms to be written in a higher-level language, freeing the developer from tasks like performance tuning and low-level implementation details.

This report summarizes the paper SystemML: Declarative Machine Learning on MapReduce, first taking a quick look at MapReduce and Hadoop and then presenting SystemML. Lastly, some experimental results are shown to demonstrate the performance optimizations.
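Both multiplication strategies discussed below operate block-wise on matrices. As a point of reference, here is a minimal pure-Python sketch of blocked matrix multiplication (a hypothetical helper, with a block size bs that is assumed to divide all dimensions):

```python
def blocked_matmul(A, B, bs):
    """Multiply A (m x k) by B (k x n), given as lists of lists, in bs x bs blocks."""
    m, k, n = len(A), len(B), len(B[0])
    C = [[0.0] * n for _ in range(m)]
    for i in range(0, m, bs):          # block-row of the result
        for j in range(0, n, bs):      # block-column of the result
            for p in range(0, k, bs):  # accumulate A-block (i,p) times B-block (p,j)
                for ii in range(i, i + bs):
                    for jj in range(j, j + bs):
                        C[ii][jj] += sum(A[ii][pp] * B[pp][jj]
                                         for pp in range(p, p + bs))
    return C
```

Whatever the block size, the result matches an ordinary matrix product; the strategies in this report differ only in how these block-level products are distributed across MapReduce workers.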
Figure 1: matrix multiplication: (a) replication-based, (b) cross-product-based

2 MapReduce

MapReduce is a programming model used to process large sets of data by dividing the work across clusters of machines. The process consists of two phases [2]:

1. Map: The input consists of key-value pairs. These are divided and distributed among the worker nodes. Each worker processes the input it received and returns its result as key-value pairs.
2. Reduce: The master node collects the results of the subproblems and combines them to obtain the result for the original problem.

MapReduce can thus speed up the processing of large datasets immensely: if the subproblems don't depend on each other, many jobs can be performed simultaneously.

3 Hadoop

Hadoop is an open-source implementation inspired by the Google File System. It enables the distributed processing of large data sets across clusters of computers [1]. It is designed to scale from single servers up to large numbers of machines, each offering local computation and storage. Hadoop already implements the MapReduce model with the classes Map, Reduce and Combine, which can be configured to fit the given task.
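As an illustration, the two phases can be mimicked by a toy single-process sketch (illustrative only, not Hadoop's actual API); map_fn and reduce_fn are hypothetical stand-ins for the user-supplied functions:

```python
from collections import defaultdict

def map_reduce(records, map_fn, reduce_fn):
    # Map phase: each input key-value pair is turned into intermediate pairs.
    intermediate = defaultdict(list)
    for key, value in records:
        for k, v in map_fn(key, value):
            intermediate[k].append(v)   # "shuffle": group values by key
    # Reduce phase: each key's value list is combined into one final value.
    return {k: reduce_fn(k, vs) for k, vs in intermediate.items()}

# Word count as the classic example.
def map_fn(_, line):
    return [(word, 1) for word in line.split()]

def reduce_fn(word, counts):
    return sum(counts)
```

For instance, map_reduce([(1, "a b a"), (2, "b c")], map_fn, reduce_fn) yields the counts {"a": 2, "b": 2, "c": 1}; in a real cluster, the map and reduce calls run on different machines and the grouping step is a distributed shuffle.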
4 Implementation of SystemML

SystemML consists of four components:

4.1 Declarative Machine learning Language (DML)

Algorithms in SystemML are written in the Declarative Machine learning Language (DML). DML provides several mathematical and linear-algebraic primitives on matrices as well as control constructs, which makes it easy to express algorithms such as PCA, PageRank, etc. SystemML breaks a DML script into smaller subunits called statement blocks, which are evaluated sequentially.

DML supports two data types: matrices and scalars. Integer, double, string and logical values can be used for scalars and for the cells of a matrix. The following constructs are currently supported in DML:

1. Input/Output: ReadMM and WriteMM are used to read and write matrices from and to files.
2. Control Structures: if, while and for statements are available as control structures.
3. Assignment: Each assignment consists of an expression and the variable the result is assigned to. Assignments work for both scalars and matrices.
4. User-defined functions, using the syntax: function (arglist) body. The arglist is a set of formal input and output arguments, and the body is a group of valid DML statements.

To generate a parsed representation that can be used by the following component, the program is analysed in the following steps:

1. Type Assignment: The data type of each variable in the DML script has to be determined. Since ReadMM and WriteMM are only used on matrices, the matrix type can be assigned directly to the corresponding variables. Furthermore, the right-hand side of an assignment determines the type of the variable on the left-hand side.
2. Statement Block Identification: A statement block consists only of consecutive assignment, ReadMM and WriteMM statements, as those operations can be optimized collectively.
If a control structure appears, the program has to be divided into several statement blocks.
3. Live Variable Analysis: This achieves two things. First, each use of a variable is connected with the preceding write of that variable across all evaluation paths.
Second, each statement block is analysed for the variables it requires from previous statement blocks and the variables it outputs to following ones. This enables optimizations such as eliminating dead assignments and constant folding.

4.2 High-level Operator Component (HOP)

Table I: Example hops [3]

HOP type         Notation             Semantics
Binary           b(op): X, Y          for each x_ij, y_ij: do op(x_ij, y_ij); op is +, -, *, /, ...
Unary            u(op): X             for each x_ij: do op(x_ij); op is log, sin, ...
AggregateUnary   au(aggop, dim): X    do aggop on dim, where dim is row (row-wise), col (column-wise) or all (whole matrix)
AggregateBinary  ab(aggop, op): X, Y  for each i, j: do aggop(op(x_ik, y_kj) over all k)
Reorg            r(op): X             reorganize the elements of a matrix, e.g. transpose (op = T)
Data             data(op): X          read (op = r) or write (op = w) a matrix

The HOP component takes the parsed representation as input. Each statement block is analysed and the best execution plan is chosen. Every plan is represented as a directed acyclic graph (HOPDag) of basic operations (called hops) over matrices and scalars, so each statement block has one HOPDag. What exactly does a hop represent? Each hop takes at least one input, performs an operation on it, and produces an output that serves as input for at least one following hop. See Table I for some examples. Several optimizations, such as algebraic rewrites and the choice of physical representations for intermediate matrices, are made in this step.

4.3 Low-level Operator Component (LOP)

Next, the HOPDags are translated into low-level physical plans (LOPDags). These are constructed by processing the HOPDag bottom-up; during this process each hop is translated into at least one lop. Lops are similar to hops, as seen in Table II. They represent the basic operations in a MapReduce environment. As stated in the MapReduce section, key-value pairs are always used as input and output. To operate on these key-value pairs, they are grouped together based on their keys.

Figure 2: Transforming a HOPDag into a LOPDag [3]
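To make the bottom-up translation concrete, here is a small hypothetical sketch of hops as DAG nodes and a walk that emits lops. The translation table is a simplification invented for illustration, not SystemML's actual rule set:

```python
class Hop:
    """A node in a HOPDag: an operation over matrix/scalar inputs."""
    def __init__(self, op, inputs=()):
        self.op = op              # e.g. 'data(r)', 'b(+)', 'ab(+,*)'
        self.inputs = list(inputs)

def hop_to_lops(hop, lops=None):
    """Walk the HOPDag bottom-up; each hop yields one or more lops."""
    if lops is None:
        lops = []
    for child in hop.inputs:      # translate inputs before the hop itself
        hop_to_lops(child, lops)
    # Hypothetical translation table: a binary hop becomes group + binary lops,
    # an aggregate-binary hop becomes the mmcj + group + aggregate sequence.
    table = {'b': ['group', 'binary'],
             'ab': ['mmcj', 'group', 'aggregate'],
             'data': ['data']}
    lops.extend(table.get(hop.op.split('(')[0], [hop.op]))
    return lops

# HOPDag for A = B * (C + D): data hops feed b(+), whose result feeds ab(+,*).
C, D, B = Hop('data(r)'), Hop('data(r)'), Hop('data(r)')
plus = Hop('b(+)', [C, D])
mult = Hop('ab(+,*)', [B, plus])
```

Walking mult emits the lops of its inputs first, then its own, which matches the bottom-up construction described above.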
Figure 3: Piggybacking algorithm: (a) LOPDag, (b) naive approach, (c) piggybacking

At the end, each group is passed on to the corresponding lop to perform the correct operation. The LOPDag is compiled into at least one MapReduce job. Using one lop per job is easy to implement but performs poorly. Since SystemML wants to improve the performance of ML algorithms on MapReduce, the lops have to be grouped into jobs. Two properties of the lops have to be considered:

1. Location: Where does the lop need to be executed? During the Map phase, the Reduce phase, or both?
2. Key Characteristics: Are the input keys required to be grouped? Are the output keys required to be grouped? Does the lop generate different output keys?

Grouping lops this way keeps the number of data scans small; the technique is called piggybacking. The greedy algorithm works as follows: the nodes of the LOPDag are first sorted in topological order and then partitioned into one of three lists: Map, Reduce and MapAndReduce. That way all lists are sorted and the operations of the different phases can be accessed quickly. The algorithm then iterates through the nodes and assigns each lop to its MapReduce job: first the lops of the Map phase, then the lops of the MapAndReduce list, and lastly the lops of the Reduce phase. The runtime complexity is quadratic in the size of the LOPDag.

Now consider the statement A = B * (C + D), where A, B, C, D are matrices, and assume that SystemML chose the cross-product-based matrix multiplication. The resulting LOPDag is shown in figure 3 (a). A naive approach to constructing the MapReduce jobs is shown in figure 3 (b): each operation gets its own MapReduce job, i.e. one job for the addition and another two for the multiplication.
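A much-simplified sketch of this greedy packing (it only tracks execution locations and assumes at most one shuffle per job; the real algorithm also inspects the key characteristics):

```python
def piggyback(lops):
    """Pack topologically sorted lops into as few MapReduce jobs as possible.

    Each lop is (name, location), location in {'map', 'reduce', 'map_and_reduce'}.
    Greedy rule: a job fills its Map phase, then at most one map-and-reduce
    lop (e.g. group or mmcj), then its Reduce phase; otherwise a new job starts.
    """
    jobs, current, seen_mr = [], [], False
    for name, loc in lops:
        if loc == 'map' and not seen_mr:
            current.append(name)            # still room in the Map phase
        elif loc == 'map_and_reduce' and not seen_mr:
            current.append(name)
            seen_mr = True                  # this job now spans both phases
        elif loc == 'reduce' and seen_mr:
            current.append(name)            # piggyback onto the Reduce phase
        else:
            jobs.append(current)            # no room left: start a new job
            current, seen_mr = [name], loc == 'map_and_reduce'
    if current:
        jobs.append(current)
    return jobs
```

For a pipeline like read, transform, group, aggregate, write, this packs five single-lop jobs into two, which is the kind of saving the example above illustrates.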
As explained later, two MapReduce jobs are always needed to calculate the cross-product-based matrix multiplication. Using the piggybacking algorithm, the total can be reduced to two MapReduce jobs, as shown in figure 3 (c): the first job computes the addition and the first part of the matrix multiplication, and the second part of the multiplication is done in the second job. By packing as many operations as possible into one MapReduce job, the piggybacking algorithm reduces the number of jobs needed. This can result in a huge speed-up, as fewer jobs have to be initialized.

Table II: Example lops [3]

LOP type   Description                                        Execution Location  Key Characteristics
data       input data source or output data sink,             Map or Reduce       none
           in key-value pairs
unary      operate on each value, with an optional scalar     Map or Reduce       none
transform  transform each key                                 Map or Reduce       keys changed
group      group values by key                                Map and Reduce      output keys grouped
binary     operate on two values with the same key            Reduce              input keys grouped
aggregate  aggregate all values with the same key             Reduce              input keys grouped
mmcj       cross-product computation in CPMM                  Map and Reduce      none
           matrix multiplication
mmrj       RMM matrix multiplication                          Map and Reduce      none

4.4 Runtime

The last component is the runtime component. Three main aspects of it are considered in the following:

1. The key-value representation of matrices
2. The MR runtime to execute individual LOPDags over MapReduce
3. The control module to manage the execution of all MapReduce jobs

1) Matrices as Key-Value Pairs: SystemML divides matrices into blocks (submatrices). Each block has a unique key-value pair that identifies it: the key denotes the block id, and the value consists of all cell values of the block. The local sparsity within each block is used to decide whether it is better to store only the non-zero values (a sparse representation) or all values (a dense representation).
That way, the number of key-value pairs needed to represent the matrix can be reduced. Because of the local sparsity, the operations on these blocks also have to be considered. Whenever a matrix operation takes two blocks as input, an algorithm has to be chosen: if both blocks are dense, an algorithm that operates on all values is preferable; if at least one block is sparse, an algorithm that operates only on the non-zero values is the better choice.

2) Generic MapReduce Job (GMR): GMR is the main execution engine in SystemML. It is instantiated by the piggybacking algorithm and executes its job in the MapReduce environment.

3) Control Module: It is responsible for managing the execution of all instantiated MapReduce jobs for the corresponding DML script.

During the execution of the runtime component, several optimizations are made, such as dynamically deciding which execution plans to use for lops based on several characteristics.

5 Matrix Multiplication Algorithms in SystemML

SystemML offers two execution plans for matrix multiplication: RMM and CPMM. This section takes a look at both. Assume we want to calculate the product of two matrices A and B, with the result being matrix C. In blocked format, the multiplication can be written as

C_{i,j} = Σ_k A_{i,k} B_{k,j},  i < M_b, k < K_b, j < N_b,

where A consists of M_b x K_b blocks and B of K_b x N_b blocks.

5.1 RMM

The replication-based matrix multiplication strategy needs exactly one MapReduce job. In this strategy, each reducer computes at least one block C_{i,j} of the result. As each input block is used to calculate several result blocks of C, the input blocks have to be replicated and given as input to every reducer that needs them.

5.2 CPMM with local aggregator

This strategy is based on the cross product.
It can be represented as a LOPDag using three lops and two MapReduce jobs: mmcj, group, aggregate(+). In the Map phase, the first MapReduce job reads the two input matrices A and B and groups the input blocks by the common key k. The reducers then use the cross product to compute P^k_{i,j} = A_{i,k} B_{k,j}. The second MapReduce job reads the result of the first job and groups all P^k_{i,j} by the key (i,j). To optimize the performance, a local aggregator is used: if P^{k1}_{i,j} and P^{k2}_{i,j} are computed in the same reducer, these partial results can be aggregated within the reducer instead of being output separately. This reduces the amount of data passed to the second job. In the Reduce phase of the second job, the aggregate lop finally computes C_{i,j} = Σ_k P^k_{i,j}.

As the aggregation could be too large to fit into memory, SystemML implements a disk-based local aggregator using an in-memory buffer pool. Since CPMM computes the result in a particular order, LRU (least recently used) is the best choice of buffer replacement policy for the disk-based local aggregator.

5.3 RMM vs CPMM

After looking at RMM and CPMM, we see that both have their advantages and disadvantages. Now we will look at the differences in their performance. RMM replicates each block of A N_b times and each block of B M_b times, so N_b|A| + M_b|B| data is shuffled in the job. The cost of RMM is therefore easy to calculate:

cost(RMM) = shuffle(N_b|A| + M_b|B|) + IO_dfs(|A| + |B| + |C|),

where IO_dfs stands for the cost of the distributed file system IO.

CPMM first reads the blocks of A and B and sends them on to the Reduce phase, shuffling |A| + |B| data. Next, for each k the cross product is calculated and the local aggregator is applied within each reducer, so the result size per reducer is bounded by |C|. With k reducers, the total result size is bounded by k|C|. This data is used by the second job: it gets shuffled again and forwarded to the reducers that produce the end result. This yields an upper bound for the cost of CPMM:

cost(CPMM) <= shuffle(|A| + |B| + k|C|) + IO_dfs(|A| + |B| + |C| + 2k|C|)

Comparing the cost models shows that CPMM generally performs better if A and B are both very large, since RMM then has a huge shuffle overhead.
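Assuming, for illustration, unit costs per data unit for both shuffle and DFS I/O (a hypothetical simplification), the two formulas can be compared directly; choose_plan is a hypothetical helper mirroring SystemML's cost-based choice:

```python
def cost_rmm(A, B, C, Nb, Mb):
    # cost(RMM) = shuffle(Nb*|A| + Mb*|B|) + IOdfs(|A| + |B| + |C|)
    return (Nb * A + Mb * B) + (A + B + C)

def cost_cpmm(A, B, C, k):
    # cost(CPMM) <= shuffle(|A| + |B| + k*|C|) + IOdfs(|A| + |B| + |C| + 2k*|C|)
    return (A + B + k * C) + (A + B + C + 2 * k * C)

def choose_plan(A, B, C, Nb, Mb, k):
    """Pick the cheaper plan from the (upper-bound) cost estimates."""
    return 'RMM' if cost_rmm(A, B, C, Nb, Mb) <= cost_cpmm(A, B, C, k) else 'CPMM'
```

With two large, heavily blocked inputs (say |A| = |B| = |C| = 100 units, N_b = M_b = 50, k = 4 reducers) CPMM wins because RMM's replication dominates; if B fits in a single column of blocks (N_b = 1), RMM's shuffle term collapses and RMM wins.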
RMM performs better than CPMM if A or B is small enough to fit into one block.

6 Experiments

Several experiments were performed to study the scalability and the performance optimizations of SystemML using different data and Hadoop cluster sizes. GNMF, linear regression and PageRank were used as ML algorithms.

6.1 Setup

Two different clusters were used for the experiments:

40-core cluster: 5 local machines as worker nodes, each with 8 cores and hyper-threading enabled. Each node runs 15 concurrent mappers and 10 concurrent reducers.
100-core EC2 cluster: 100 worker nodes, each running 2 mappers and 1 reducer concurrently.

The data generator creates random matrices with uniformly distributed non-zero cells. All matrix blocks have a size of 1000 x 1000, except in the matrix blocking experiments. The local aggregator uses an in-memory buffer pool of size 900 MB on the 40-core cluster and 500 MB on the 100-core EC2 cluster.

6.2 Scalability

For these experiments, GNMF (Gaussian Non-negative Matrix Factorization) was used as an example to demonstrate scalability on both clusters. The input V is a sparse matrix with a varying number of rows d and w columns. The goal is to compute dense matrices W of size d x t and H of size t x w with V ≈ W H and t = 10 (t is the number of topics). SystemML is compared against the best known result and against execution on a single machine:

1. Single machine: It executes quite efficiently for small sizes, but runs out of memory once the number of rows in V increases to 10 million. SystemML clearly outperforms this, as it continues to scale to very large sizes.
2. Best known result: The best known result [4] is based on a hand-coded MapReduce implementation of GNMF. It contains 8 full MapReduce jobs and 2 map-only jobs, while the execution plan produced by SystemML consists of 10 full MapReduce jobs. The experiments show that the performance of SystemML relative to the hand-coded implementation improves significantly as the sparsity of the matrices increases. One reason is that SystemML only uses the block representation, while the hand-coded implementation uses cell, row-wise and column-wise representations. The other reason is that the hand-coded algorithm doesn't use a local aggregator for CPMM.

6.3 Optimizations

Four optimizations that SystemML uses to gain better performance are analysed:

1. RMM vs CPMM: Different test cases show that neither algorithm always outperforms the other.
Therefore, it is important for SystemML to choose the best algorithm for each individual case. If both matrices are large, CPMM is in general the better solution; if at least one of the matrices is small, SystemML should choose RMM. CPMM has the advantage of a higher degree of parallelism, making it better for large matrices.
2. Piggybacking: Without piggybacking, each lop would translate into a separate MapReduce job. Depending on whether piggybacking can pack several lops into one MapReduce job with one lop dominating the cost, the optimization can be significant or not.

3. Matrix Blocking: Especially for dense matrices, the block representation significantly reduces storage space compared to the cell representation. For sparse matrices, on the other hand, the space used to store the blocks increases, as there is only a small fraction of non-zero values per block. See Table III for further details.

Table III: Comparison of different block sizes, with H being a sparse matrix [3]

Block size       1000x1000   100x100   10x10   1x1
Execution time   117s        136s      3h      >5h

(The table also lists the size of V in GB and the size of H in MB for each block size.)

4. Local Aggregator for CPMM: The local aggregator reduces the size of the intermediate result. This leads to a slower increase of the running time compared to the algorithm without an aggregator.

7 Conclusion

This report has briefly presented SystemML, a system that greatly helps in developing large-scale ML algorithms. It translates machine learning algorithms into execution plans over MapReduce by evaluating different execution plans; at the end, the created lops are translated into MapReduce jobs. The experiments used to evaluate SystemML demonstrate its scalability and the benefits of several optimization strategies such as blocking, piggybacking and local aggregation.

All in all, the developers have achieved their goal of a declarative machine learning system that is optimized for linear-algebraic operations and scales well. It is a very good system for processing largely independent sets of data. On the other hand, it does not scale well if the program contains many iterations or if the jobs are not independent.
In the first case, with many iterations, data has to be reloaded and reprocessed for each iteration, and an extra job is even needed to evaluate the loop condition, causing a huge overhead per iteration. The second case, dependent jobs, follows from the way MapReduce is implemented: MapReduce improves performance heavily when jobs are independent, because it can then process all jobs concurrently. If the jobs depend on each other, performance is lost, as MapReduce has to wait for a job to finish before starting the next one.
References

[1] Apache Hadoop.
[2] MapReduce.
[3] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative Machine Learning on MapReduce. uchicago.edu/~vikass/systemml.pdf.
[4] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. WWW.
More informationApache Flink Nextgen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas
Apache Flink Nextgen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research
More informationBIG DATA What it is and how to use?
BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14
More informationDuke University http://www.cs.duke.edu/starfish
Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists
More informationZihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical
More informationMapReduce. Tushar B. Kute, http://tusharkute.com
MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity
More informationThe Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org
The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlinbased database research groups In the Apache
More informationBenchmark Hadoop and Mars: MapReduce on cluster versus on GPU
Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview
More informationDistributed R for Big Data
Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies largescale analysis. It extends R. R is a singlethreaded environment which limits its utility for
More informationTask Scheduling in Hadoop
Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed
More informationLambda Architecture. Near RealTime Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com
Lambda Architecture Near RealTime Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...
More informationIntroduction to Hadoop
Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples
More informationPaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012
PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random
More informationDistributed R for Big Data
Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial
More informationBig Data and Apache Hadoop s MapReduce
Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23
More informationJeffrey D. Ullman slides. MapReduce for data intensive computing
Jeffrey D. Ullman slides MapReduce for data intensive computing Singlenode architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very
More informationScalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011
Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis
More informationLog Mining Based on Hadoop s Map and Reduce Technique
Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com
More informationCSEE5430 Scalable Cloud Computing Lecture 2
CSEE5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.92015 1/36 Google MapReduce A scalable batch processing
More informationIntegrating Big Data into the Computing Curricula
Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big
More informationMapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example
MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design
More informationParallel Programming MapReduce. Needless to Say, We Need Machine Learning for Big Data
Case Study 2: Document Retrieval Parallel Programming MapReduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin
More informationRevoScaleR Speed and Scalability
EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution
More informationThe Performance Characteristics of MapReduce Applications on Scalable Clusters
The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have
More informationMonitis Project Proposals for AUA. September 2014, Yerevan, Armenia
Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop
More informationOverview on Graph Datastores and Graph Computing Systems.  Litao Deng (Cloud Computing Group) 06082012
Overview on Graph Datastores and Graph Computing Systems  Litao Deng (Cloud Computing Group) 06082012 Graph  Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships
More informationHow InMemory Data Grids Can Analyze FastChanging Data in Real Time
SCALEOUT SOFTWARE How InMemory Data Grids Can Analyze FastChanging Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wentyfirst
More informationIIT DBGroup Boris Glavic 7) Big Data Analytics
Outline IIT DBGroup CS520 Data Integration, Warehousing, and Provenance 7. Big Data Systems and Integration Boris Glavic http://www.cs.iit.edu/~glavic/ http://www.cs.iit.edu/~cs520/ http://www.cs.iit.edu/~dbgroup/
More informationMATEEC2: A Middleware for Processing Data with AWS
MATEEC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohiostate.edu David Chiu School of Engineering and Computer
More informationMapReduce: Algorithm Design Patterns
Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources
More informationUsing InMemory Computing to Simplify Big Data Analytics
SCALEOUT SOFTWARE Using InMemory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed
More informationParallel Processing of cluster by Map Reduce
Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model
More informationAnalysis and Modeling of MapReduce s Performance on Hadoop YARN
Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and
More informationThis exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.
Big Data Processing 20132014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,
More informationHiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group
HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is
More informationDATA MINING WITH HADOOP AND HIVE Introduction to Architecture
DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 20152025. Reproduction or usage prohibited without permission of
More informationHadoop Design and kmeans Clustering
Hadoop Design and kmeans Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise
More informationOpen source software framework designed for storage and processing of large scale data on clusters of commodity hardware
Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after
More informationManagement & Analysis of Big Data in Zenith Team
Management & Analysis of Big Data in Zenith Team Zenith Team, INRIA & LIRMM Outline Introduction to MapReduce Dealing with Data Skew in Big Data Processing Data Partitioning for MapReduce Frequent Sequence
More informationHadoop MapReduce and Spark. Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015
Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECASCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib
More informationApache MRQL (incubating): Advanced Query Processing for Complex, LargeScale Data Analysis
Apache MRQL (incubating): Advanced Query Processing for Complex, LargeScale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am
More informationLecture Data Warehouse Systems
Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world CStores
More informationIntroduction to Hadoop and MapReduce
Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Largescale Computation Traditional solutions for computing large quantities of data
More informationChallenges for Data Driven Systems
Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2
More informationhttp://www.wordle.net/
Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an opensource software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely
More informationMapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12
MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:
More informationA linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form
Section 1.3 Matrix Products A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form (scalar #1)(quantity #1) + (scalar #2)(quantity #2) +...
More informationBig Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
More informationSOLVING LINEAR SYSTEMS
SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis
More informationHadoop SNS. renren.com. Saturday, December 3, 11
Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December
More informationMapReduce Jeffrey Dean and Sanjay Ghemawat. Background context
MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Largescale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able
More informationMapReduce Job Processing
April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File
More informationmap/reduce connected components
1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains
More informationA bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18
A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is
More informationMapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy
MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous
More informationPerformance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications
Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introducon Open source MapReduce
More informationBig Data Analytics. Lucas Rego Drumond
Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany Big Data Analytics Big Data Analytics 1 / 33 Outline
More informationComputing Load Aware and LongView Load Balancing for Cluster Storage Systems
215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and LongView Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical
More information