Report: Declarative Machine Learning on MapReduce (SystemML)


Jessica Falk, ETH-ID
May 28

1 Introduction

SystemML is a system for executing machine learning (ML) algorithms in Hadoop, a MapReduce environment. Driven by web search, web analytics and similar applications, there has been high demand for scalable implementations of ML algorithms that process large datasets. Until now, there have mostly been hand-tuned implementations, which are very difficult to integrate into a different environment. As MapReduce has grown more popular, interest has risen in combining ML algorithms with MapReduce to obtain scalable algorithms for processing large datasets. This comes from the fact that hand-tuned implementations have two big drawbacks:

1. Each MapReduce job has to be hand-coded.
2. If the input and cluster sizes vary, the execution plan for the algorithm has to be hand-tuned to achieve good performance.

Take, for example, matrix multiplication. There are multiple ways to implement it: we can use replication-based matrix multiplication or cross-product-based matrix multiplication. As explained later in this report, which strategy is better depends on the matrices. Often the matrices change over time or depend on the input, which can make it very hard for the developer to always choose the best strategy. SystemML was developed to address such problems. It is a scalable, declarative machine learning system that allows ML algorithms to be written in a higher-level language, freeing the developer from tasks like performance tuning and low-level implementation details. This report briefly summarizes the paper "SystemML: Declarative Machine Learning on MapReduce" [3], first taking a quick look at MapReduce and Hadoop and then presenting SystemML. Lastly, some experimental results are shown to demonstrate the performance optimizations.

Figure 1: Matrix multiplication: (a) replication-based, (b) cross-product based

2 MapReduce

MapReduce is a programming model used to process large sets of data by dividing the work across clusters of machines. The process consists of two phases [2]:

1. Map: The input is given as key-value pairs, which are divided and distributed between the worker nodes. Each worker processes the input it received and returns the result as key-value pairs.
2. Reduce: The master node collects the results of the subproblems and combines them to obtain the result for the original problem.

Thus MapReduce can improve performance on large datasets immensely if the subproblems do not depend on each other, since many tasks can then run simultaneously.

3 Hadoop

Hadoop is an open-source implementation based on MapReduce and the Google File System [1]. It enables the distributed processing of large data sets across clusters of computers and is designed to scale from single servers to large numbers of machines, where each machine offers local computation and storage. Hadoop already implements the MapReduce model with the classes Map, Reduce and Combine, which can be configured to fit the given task.
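The two phases described above can be illustrated with a minimal single-process sketch. This is a toy simulation for intuition only, not Hadoop's actual API; the function names are made up for illustration:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(records, map_fn):
    """Apply the user-defined map function to every input key-value pair."""
    return [kv for record in records for kv in map_fn(*record)]

def reduce_phase(mapped, reduce_fn):
    """Group intermediate pairs by key, then reduce each group."""
    mapped.sort(key=itemgetter(0))  # the shuffle: bring equal keys together
    return [reduce_fn(key, [v for _, v in group])
            for key, group in groupby(mapped, key=itemgetter(0))]

# Word count, the classic MapReduce example.
def tokenize(doc_id, text):
    return [(word, 1) for word in text.split()]

def count(word, ones):
    return (word, sum(ones))

docs = [(0, "to be or not"), (1, "to be")]
result = reduce_phase(map_phase(docs, tokenize), count)
# result: [('be', 2), ('not', 1), ('or', 1), ('to', 2)]
```

Because each document is mapped independently and each key group is reduced independently, both phases parallelize trivially, which is exactly why MapReduce scales when subproblems do not depend on each other.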

4 Implementation of SystemML

SystemML consists of four components:

4.1 Declarative Machine Learning Language (DML)

Algorithms in SystemML are written in the Declarative Machine Learning Language (DML). DML provides several mathematical and linear-algebraic primitives on matrices as well as control constructs, which makes it easy to express algorithms such as PCA, PageRank, etc. SystemML breaks a DML script into smaller subunits called statement blocks, which are evaluated sequentially. DML supports two data types: matrices and scalars. Integer, double, string and logical values can be used for scalars and for the cells of a matrix. The following constructs are currently supported in DML:

1. Input/Output: ReadMM and WriteMM are used to read and write matrices from and to files.
2. Control Structures: if, while and for statements are available as control structures.
3. Assignment: Each assignment consists of an expression and the variable the result is assigned to. Assignments work for both scalars and matrices.
4. User-defined functions, using the syntax: function (arglist) body. The arglist is a set of formal input and output arguments, and the body is a group of valid DML statements.

To generate a parsed representation that can be used by the following component, the program is analysed in the following steps:

1. Type Assignment: The data type of each variable in the DML script is determined. Since ReadMM and WriteMM are only used on matrices, the corresponding variables can immediately be typed as matrices. Furthermore, the right-hand side of an assignment determines the type of the variable on the left-hand side.
2. Statement Block Identification: A statement block consists only of consecutive assignment, ReadMM and WriteMM statements, as those operations can be optimized collectively. Whenever a control structure appears, the program is divided into several statement blocks.
3. Live Variable Analysis: Two things are achieved here. First, each use of a variable is connected with the preceding write of that variable across all evaluation paths.

Second, each statement block is analysed for the variables it requires from previous statement blocks and the variables it outputs to following ones. This enables optimizations such as eliminating dead assignments and constant folding.

4.2 High-level Operator Component (HOP)

Table I: Example hops [3]

| HOP Type        | Notation            | Semantics                                                                             |
| Binary          | b(op): X, Y         | for all x_ij, y_ij: do op(x_ij, y_ij), where op is +, -, *, /, ...                    |
| Unary           | u(op): X            | for all x_ij: do op(x_ij), where op is log, sin, ...                                  |
| AggregateUnary  | au(aggop, dim): X   | do aggop on dim, where dim is row (row-wise), col (column-wise) or all (whole matrix) |
| AggregateBinary | ab(aggop, op): X, Y | for all i, j: do aggop(op(x_ik, y_kj) over all k)                                     |
| Reorg           | r(op): X            | reorganize elements in a matrix, e.g. transpose (op = T)                              |
| Data            | data(op): X         | read (op = r) or write (op = w) a matrix                                              |

The parsed representation is taken as input by the HOP component. Each statement block is analysed and the best execution plan is chosen. Every plan is represented as a directed acyclic graph (HOP-Dag) of basic operations (called hops) over matrices and scalars, so each statement block has one HOP-Dag. What exactly does a hop represent? Each hop takes at least one input, performs an operation on it and produces an output that serves as input for at least one following hop. See Table I for some examples. Several optimizations, such as algebraic rewrites and the choice of physical representations for intermediate matrices, are made in this step.

4.3 Low-level Operator Component (LOP)

Next, the HOP-Dags are translated into low-level physical plans (LOP-Dags). These are constructed by processing each HOP-Dag bottom-up, translating each hop into at least one lop. Lops are similar to hops, as seen in Table II: they represent the basic operations in a MapReduce environment. As stated in the MapReduce section, key-value pairs are always used as input and output. To use those key-value pairs, they are grouped together based on their keys.

Figure 2: Transforming a HOP-Dag into a LOP-Dag [3]
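Figure 2 shows a binary division hop over matrices C and D becoming data, group and binary lops. How such a group lop followed by a binary lop would operate on keyed cell values can be sketched as follows (a hypothetical illustration; the report does not show SystemML's actual runtime classes):

```python
from collections import defaultdict

# Cell-level key-value pairs for two 2x2 matrices C and D,
# keyed by their (i, j) index as in Figure 2.
C = {(0, 0): 8.0, (0, 1): 6.0, (1, 0): 4.0, (1, 1): 2.0}
D = {(0, 0): 2.0, (0, 1): 3.0, (1, 0): 4.0, (1, 1): 1.0}

def data_lop(matrix, tag):
    """data(r) lop: emit ((i, j), (tag, value)) pairs."""
    return [(key, (tag, value)) for key, value in matrix.items()]

def group_lop(pairs):
    """group lop: collect all values sharing the same (i, j) key."""
    groups = defaultdict(dict)
    for key, (tag, value) in pairs:
        groups[key][tag] = value
    return groups

def binary_div_lop(groups):
    """binary(/) lop: apply the operation to each grouped pair."""
    return {key: vals["C"] / vals["D"] for key, vals in groups.items()}

pairs = data_lop(C, "C") + data_lop(D, "D")
result = binary_div_lop(group_lop(pairs))
# result: {(0, 0): 4.0, (0, 1): 2.0, (1, 0): 1.0, (1, 1): 2.0}
```

The grouping step corresponds to MapReduce's shuffle: once all values with the same key sit together, the binary lop can run independently per key inside a reducer.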

Figure 3: Piggybacking algorithm [3]: (a) LOP-Dag, (b) naive approach, (c) piggybacking

At the end, each group is passed on to the corresponding lop to perform the correct operation. The LOP-Dag is compiled into at least one MapReduce job. Taking one lop per job is easy to implement, but results in poor performance. SystemML aims to improve the performance of ML algorithms on MapReduce, so we need to figure out how to group the lops together. Two properties of the lops are considered:

1. Location: Where does the lop need to execute: during the Map phase, the Reduce phase, or both?
2. Key Characteristics: Are the input or output keys required to be grouped? Does the lop generate different output keys?

This way the number of data scans is kept small. The grouping procedure, called piggybacking, is a greedy algorithm that works as follows. The nodes of the LOP-Dag are first sorted in topological order and then partitioned into one of three lists: Map, Reduce and MapAndReduce. That way all lists are sorted and the operations of the different phases can be accessed quickly. The algorithm then iterates through the nodes to assign all lops to their MapReduce jobs: it first assigns the lops of the Map phase, then the lops of the Map-and-Reduce phase, and lastly the lops of the Reduce phase. The runtime complexity is quadratic in the size of the LOP-Dag. Now consider the statement A = B * (C + D), where A, B, C, D are matrices, and assume that SystemML chose the cross-product-based matrix multiplication. The resulting LOP-Dag is shown in figure 3 (a). A naive approach to constructing the MapReduce jobs is shown in figure 3 (b): each operation needs a new MapReduce job, so we get one job for the addition and another two for the multiplication.
As explained later, we always need two MapReduce jobs to calculate the cross-product-based matrix multiplication. Using the piggybacking algorithm, we are able to reduce this to a total of two MapReduce jobs, as shown in figure 3 (c): the first job calculates the addition and the first part of the matrix multiplication, and the second part of the multiplication is done in the second job. By packing as many operations as possible into one MapReduce job, the piggybacking algorithm reduces the number of jobs needed. This can result in a huge performance speed-up, as fewer jobs have to be initialized.

Table II: Example lops [3]

| LOP type  | Description                                                 | Execution Location | Key Characteristics |
| data      | input data source or output data sink, in key-value pairs   | Map or Reduce      | none                |
| unary     | operate on each value with an optional scalar               | Map or Reduce      | none                |
| transform | transform each key                                          | Map or Reduce      | keys changed        |
| group     | groups values by the key                                    | Map and Reduce     | output keys grouped |
| binary    | operate on two values with the same key                     | Reduce             | input keys grouped  |
| aggregate | aggregate all values with the same key                      | Reduce             | input keys grouped  |
| mmcj      | cross product computation in the CPMM matrix multiplication | Map and Reduce     | none                |
| mmrj      | RMM matrix multiplication                                   | Map and Reduce     | none                |

4.4 Runtime

The last component is the runtime component. We consider three main aspects of it:

1. The key-value representation of matrices
2. The MR runtime that executes individual LOP-Dags over MapReduce
3. The control module that manages the execution of all MapReduce jobs

1) Matrices as Key-Value Pairs: SystemML divides matrices into blocks (submatrices). Each block forms a key-value pair that identifies it: the key denotes the block id and the value contains all cell values of the block. Based on the local sparsity within each block, SystemML decides whether it is better to store only the non-zero values (a sparse representation) or all values (a dense representation).
That way the number of key-value pairs needed to represent

the matrix can be reduced. Because of the local sparsity, we also have to consider how to operate on those blocks. Whenever a matrix operation takes two blocks as input, an algorithm has to be chosen: if both blocks are dense, it is better to choose an algorithm that operates on all values; if, on the other hand, at least one block is sparse, it is better to choose an algorithm that operates only on the non-zero values.

2) Generic MapReduce Job (G-MR): G-MR is the main execution engine in SystemML. It is instantiated by the piggybacking algorithm and executes its job in the MapReduce environment.

3) Control Module: It is responsible for managing the execution of all instantiated MapReduce jobs for the corresponding DML script. During execution, the runtime component makes several optimizations, such as dynamically deciding which execution plans to use for lops based on several characteristics.

5 Matrix Multiplication Algorithms in SystemML

SystemML offers two execution plans for matrix multiplication: RMM and CPMM. This section takes a look at both. Assume we want to calculate the product of two matrices A and B, with the result being matrix C. In blocked format this multiplication can be written as

C_{i,j} = sum_k (A_{i,k} * B_{k,j}), where 0 <= i < M_b, 0 <= k < K_b, 0 <= j < N_b,

and A consists of M_b x K_b blocks and B of K_b x N_b blocks.

5.1 RMM

The replication-based matrix multiplication strategy needs exactly one MapReduce job. In this strategy, each reducer computes at least one block C_{i,j} of the result. As each input block is used to calculate several result blocks of C, the input blocks have to be replicated and sent to every reducer that needs them.

5.2 CPMM with local aggregator

This strategy is based on the cross product.
It can be represented as a LOP-Dag using three lops, mmcj, group and aggregate(+), and requires two MapReduce jobs. In the Map phase, the first MapReduce job reads the two input matrices A and B and groups the input blocks by the common key k. The reducers then use the cross product to compute P^k_{i,j} = A_{i,k} * B_{k,j}. The second MapReduce job reads the result of the first job and groups all P^k_{i,j} by the key (i,j). To optimize the performance

a local aggregator is used: if P^k_{i,j} and P^{k'}_{i,j} are computed in the same reducer, they can be aggregated within that reducer instead of being output separately. This reduces the amount of data passed to the second job. In the Reduce phase of the second job, the aggregate lop finally computes C_{i,j} = sum_k P^k_{i,j}. As the aggregated result could be too large to fit into memory, SystemML implements a disk-based local aggregator that uses an in-memory buffer pool. Since CPMM computes the result in a particular order, LRU (least recently used) is the best choice of buffer replacement policy for the disk-based local aggregator.

5.3 RMM vs CPMM

After looking at RMM and CPMM, we see that both have their advantages and disadvantages. Let us now compare their performance. RMM replicates each block of A N_b times and each block of B M_b times, so N_b * |A| + M_b * |B| data is shuffled in the job. It is now easy to calculate the cost of RMM:

cost(RMM) = shuffle(N_b * |A| + M_b * |B|) + IO_dfs(|A| + |B| + |C|),

where IO_dfs stands for the cost of the distributed file system IO. CPMM first reads the blocks of A and B and sends them to the Reduce phase, shuffling |A| + |B| data. Next, for each k the cross product is calculated and the local aggregator is applied within each reducer, so the result size per reducer is bounded by |C|. With k reducers, the result size of the first job is bounded by k * |C|. This data is shuffled again in the second job and forwarded to the reducers that produce the final result. With this knowledge we can give an upper bound for the cost of CPMM:

cost(CPMM) <= shuffle(|A| + |B| + k * |C|) + IO_dfs(|A| + |B| + |C| + 2 * k * |C|)

Comparing the cost models of both algorithms shows that CPMM in general performs better if A and B are both very large, since RMM then has a huge shuffle overhead.
RMM performs better than CPMM if A or B is small enough to fit into one block.

6 Experiments

Several experiments were performed to study the scalability and the performance of the optimizations in SystemML, using different data and Hadoop cluster sizes. GNMF, linear regression and PageRank were used as ML algorithms.

6.1 Setup

Two different clusters were used for the experiments:

40-core cluster: 5 local machines as worker nodes, each with 8 cores and hyperthreading enabled. Each node runs 15 concurrent mappers and 10 concurrent reducers.

100-core EC2 cluster: 100 worker nodes, each running 2 mappers and 1 reducer concurrently.

The data generator creates random matrices with uniformly distributed non-zero cells. All matrices use a block size of 1000 x 1000, except in the matrix blocking experiments. The local aggregator is an in-memory buffer pool of size 900 MB on the 40-core cluster and 500 MB on the 100-core EC2 cluster.

6.2 Scalability

For these experiments, GNMF (Gaussian Non-negative Matrix Factorization) was used as an example to demonstrate scalability on both clusters. The input V is a sparse matrix with a varying number of rows d and w columns. The goal is to compute dense matrices W of size d x t and H of size t x w with V ≈ W H and t = 10 (t is the number of topics). SystemML is compared against a single-machine implementation and against the best known result:

1. Single machine: It executes quite efficiently for small sizes, but if the number of rows in V increases to 10 million, it runs out of memory. SystemML clearly outperforms this, as it continues to scale to very large sizes.
2. Best known result: The best known result [4] is based on a hand-coded MapReduce implementation of GNMF. It contains 8 full MapReduce jobs and 2 map-only jobs, while the execution plan produced by SystemML consists of 10 full MapReduce jobs. The experiments have shown that the advantage of SystemML over the hand-coded implementation increases significantly as the sparsity of the matrices increases. One reason is that SystemML only uses the block representation, while the hand-coded version uses cell, row-wise and column-wise representations. The other reason is that the hand-coded algorithm does not use a local aggregator for CPMM.

6.3 Optimizations

We analyse four optimizations SystemML uses to gain better performance.

1. RMM vs CPMM: Different test cases show that neither algorithm always outperforms the other.
Therefore it is important for SystemML to choose the best algorithm for each individual case. If both matrices are large, CPMM is in general the better choice, since it has a higher degree of parallelism; if at least one of the matrices is small, SystemML should choose RMM.
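This choice can be sketched directly from the cost models of Section 5.3. The sketch below is a simplified calculation in which shuffle and DFS costs are weighted equally, which the real optimizer need not assume; all sizes are illustrative:

```python
def cost_rmm(A, B, C, Nb, Mb):
    """cost(RMM) = shuffle(Nb*|A| + Mb*|B|) + IOdfs(|A| + |B| + |C|)."""
    return (Nb * A + Mb * B) + (A + B + C)

def cost_cpmm_bound(A, B, C, k):
    """Upper bound: shuffle(|A| + |B| + k*|C|) + IOdfs(|A| + |B| + |C| + 2k*|C|)."""
    return (A + B + k * C) + (A + B + C + 2 * k * C)

# Two large matrices, blocked into many blocks per dimension: RMM's
# replication factors Nb and Mb dominate, so CPMM's bound is cheaper.
large = dict(A=1000.0, B=1000.0, C=1000.0)
assert cost_cpmm_bound(**large, k=4) < cost_rmm(**large, Nb=100, Mb=100)

# One matrix small enough to fit into a single block (Nb = Mb = 1):
# RMM needs no replication and only one job, so it wins.
small = dict(A=1000.0, B=1.0, C=1000.0)
assert cost_rmm(**small, Nb=1, Mb=1) < cost_cpmm_bound(**small, k=4)
```

Evaluating both formulas per statement is cheap, which is why SystemML can afford to pick the multiplication plan dynamically for each matrix pair.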

2. Piggybacking: Without piggybacking, each lop would translate into a separate MapReduce job. The benefit depends on whether piggybacking can pack several lops into one MapReduce job and whether a single lop dominates the cost; accordingly, the optimization can be significant or not.

3. Matrix Blocking: Especially for dense matrices, the block representation significantly reduces storage space compared to the cell representation. For sparse matrices, on the other hand, the space used to store the blocks increases, as only a small fraction of the values per block is non-zero. See Table III for further details.

Table III: Comparison of different block sizes, with H being a sparse matrix [3]

| Block Size     | 1000x1000 | 100x100 | 10x10 | 1x1 |
| Execution time | 117s      | 136s    | 3h    | >5h |

4. Local Aggregator for CPMM: By using the local aggregator, the size of the intermediate result is reduced. This implies a slower increase of the running time compared to the algorithm without an aggregator.

7 Conclusion

This report has briefly presented SystemML, a system that greatly helps in developing large-scale ML algorithms. It translates machine learning algorithms into execution plans over MapReduce by evaluating different execution plans, and finally compiles the created lops into MapReduce jobs. The experiments used to evaluate SystemML demonstrate its scalability and the benefits of several optimization strategies such as blocking, piggybacking and local aggregation. All in all, the developers have achieved their goal of a declarative machine learning system that is optimized for linear algebraic operations and scales well. It is a very good system for processing largely independent sets of data. On the other hand, it does not scale well if the program contains many iterations or if the jobs are not independent.
In the first case of many iterations, data has to be reloaded and reprocessed for each iteration, and an extra job is even needed to evaluate the loop condition, causing a huge overhead per iteration. The second case, dependent jobs, follows from the way MapReduce works: MapReduce improves performance greatly when jobs are independent, because it can then process all jobs concurrently. If the jobs depend on each other, performance is lost, as MapReduce has to wait for one job to finish before starting the next.

References

[1] Apache Hadoop.
[2] MapReduce.
[3] A. Ghoting, R. Krishnamurthy, E. Pednault, B. Reinwald, V. Sindhwani, S. Tatikonda, Y. Tian, and S. Vaithyanathan. SystemML: Declarative Machine Learning on MapReduce. uchicago.edu/~vikass/systemml.pdf.
[4] C. Liu, H.-c. Yang, J. Fan, L.-W. He, and Y.-M. Wang. Distributed nonnegative matrix factorization for web-scale dyadic data analysis on MapReduce. WWW.


Lambda Architecture. Near Real-Time Big Data Analytics Using Hadoop. January 2015. Email: bdg@qburst.com Website: www.qburst.com Lambda Architecture Near Real-Time Big Data Analytics Using Hadoop January 2015 Contents Overview... 3 Lambda Architecture: A Quick Introduction... 4 Batch Layer... 4 Serving Layer... 4 Speed Layer...

More information

Zihang Yin Introduction R is commonly used as an open share statistical software platform that enables analysts to do complex statistical analysis with limited computing knowledge. Frequently these analytical

More information

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

More information

MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce. Tushar B. Kute, http://tusharkute.com MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

More information

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org

The Flink Big Data Analytics Platform. Marton Balassi, Gyula Fora {mbalassi, gyfora}@apache.org The Flink Big Data Analytics Platform Marton Balassi, Gyula Fora" {mbalassi, gyfora}@apache.org What is Apache Flink? Open Source Started in 2009 by the Berlin-based database research groups In the Apache

More information

Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

More information

Big Data Processing with Google s MapReduce. Alexandru Costan

Big Data Processing with Google s MapReduce. Alexandru Costan 1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

More information

Parameterizable benchmarking framework for designing a MapReduce performance model

Parameterizable benchmarking framework for designing a MapReduce performance model CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. (2014) Published online in Wiley Online Library (wileyonlinelibrary.com)..3229 SPECIAL ISSUE PAPER Parameterizable

More information

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU

Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Benchmark Hadoop and Mars: MapReduce on cluster versus on GPU Heshan Li, Shaopeng Wang The Johns Hopkins University 3400 N. Charles Street Baltimore, Maryland 21218 {heshanli, shaopeng}@cs.jhu.edu 1 Overview

More information

Introduction to Hadoop

Introduction to Hadoop Introduction to Hadoop Miles Osborne School of Informatics University of Edinburgh miles@inf.ed.ac.uk October 28, 2010 Miles Osborne Introduction to Hadoop 1 Background Hadoop Programming Model Examples

More information

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012

PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping. Version 1.0, Oct 2012 PaRFR : Parallel Random Forest Regression on Hadoop for Multivariate Quantitative Trait Loci Mapping Version 1.0, Oct 2012 This document describes PaRFR, a Java package that implements a parallel random

More information

MapReduce: Algorithm Design Patterns

MapReduce: Algorithm Design Patterns Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

More information

GraySort on Apache Spark by Databricks

GraySort on Apache Spark by Databricks GraySort on Apache Spark by Databricks Reynold Xin, Parviz Deyhim, Ali Ghodsi, Xiangrui Meng, Matei Zaharia Databricks Inc. Apache Spark Sorting in Spark Overview Sorting Within a Partition Range Partitioner

More information

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Juwei Shi, Yunjie Qiu, Umar Farooq Minhas, Limei Jiao, Chen Wang, Berthold Reinwald, and Fatma Özcan IBM Research China IBM Almaden

More information

Integrating Big Data into the Computing Curricula

Integrating Big Data into the Computing Curricula Integrating Big Data into the Computing Curricula Yasin Silva, Suzanne Dietrich, Jason Reed, Lisa Tsosie Arizona State University http://www.public.asu.edu/~ynsilva/ibigdata/ 1 Overview Motivation Big

More information

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011 Scalable Data Analysis in R Lee E. Edlefsen Chief Scientist UserR! 2011 1 Introduction Our ability to collect and store data has rapidly been outpacing our ability to analyze it We need scalable data analysis

More information

Map-Reduce for Machine Learning on Multicore

Map-Reduce for Machine Learning on Multicore Map-Reduce for Machine Learning on Multicore Chu, et al. Problem The world is going multicore New computers - dual core to 12+-core Shift to more concurrent programming paradigms and languages Erlang,

More information

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

More information

BIG DATA What it is and how to use?

BIG DATA What it is and how to use? BIG DATA What it is and how to use? Lauri Ilison, PhD Data Scientist 21.11.2014 Big Data definition? There is no clear definition for BIG DATA BIG DATA is more of a concept than precise term 1 21.11.14

More information

Big Data With Hadoop

Big Data With Hadoop With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

More information

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data

Parallel Programming Map-Reduce. Needless to Say, We Need Machine Learning for Big Data Case Study 2: Document Retrieval Parallel Programming Map-Reduce Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin January 31 st, 2013 Carlos Guestrin

More information

MATE-EC2: A Middleware for Processing Data with AWS

MATE-EC2: A Middleware for Processing Data with AWS MATE-EC2: A Middleware for Processing Data with AWS Tekin Bicer Department of Computer Science and Engineering Ohio State University bicer@cse.ohio-state.edu David Chiu School of Engineering and Computer

More information

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia

Monitis Project Proposals for AUA. September 2014, Yerevan, Armenia Monitis Project Proposals for AUA September 2014, Yerevan, Armenia Distributed Log Collecting and Analysing Platform Project Specifications Category: Big Data and NoSQL Software Requirements: Apache Hadoop

More information

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time

How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time SCALEOUT SOFTWARE How In-Memory Data Grids Can Analyze Fast-Changing Data in Real Time by Dr. William Bain and Dr. Mikhail Sobolev, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T wenty-first

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy HP Vertica Development Team Abstract Distributed R simplifies large-scale analysis. It extends R. R is a single-threaded environment which limits its utility for

More information

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc. 2015 The MathWorks, Inc. 1 Challenges of Big Data Any collection of data sets so large and complex that it becomes difficult

More information

Task Scheduling in Hadoop

Task Scheduling in Hadoop Task Scheduling in Hadoop Sagar Mamdapure Munira Ginwala Neha Papat SAE,Kondhwa SAE,Kondhwa SAE,Kondhwa Abstract Hadoop is widely used for storing large datasets and processing them efficiently under distributed

More information

Distributed R for Big Data

Distributed R for Big Data Distributed R for Big Data Indrajit Roy, HP Labs November 2013 Team: Shivara m Erik Kyungyon g Alvin Rob Vanish A Big Data story Once upon a time, a customer in distress had. 2+ billion rows of financial

More information

Using In-Memory Computing to Simplify Big Data Analytics

Using In-Memory Computing to Simplify Big Data Analytics SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

More information

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012 Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

More information

Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

More information

Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

More information

Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides. MapReduce for data intensive computing Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

More information

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

More information

CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

More information

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing. Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

More information

Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

More information

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

More information

map/reduce connected components

map/reduce connected components 1, map/reduce connected components find connected components with analogous algorithm: map edges randomly to partitions (k subgraphs of n nodes) for each partition remove edges, so that only tree remains

More information

Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

More information

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

More information

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Hadoop MapReduce and Spark Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015 Outline Hadoop Hadoop Import data on Hadoop Spark Spark features Scala MLlib MLlib

More information

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems

Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems 215 IEEE International Conference on Big Data (Big Data) Computing Load Aware and Long-View Load Balancing for Cluster Storage Systems Guoxin Liu and Haiying Shen and Haoyu Wang Department of Electrical

More information

Big Data Systems CS 5965/6965 FALL 2015

Big Data Systems CS 5965/6965 FALL 2015 Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html

More information

http://www.wordle.net/

http://www.wordle.net/ Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

More information

Challenges for Data Driven Systems

Challenges for Data Driven Systems Challenges for Data Driven Systems Eiko Yoneki University of Cambridge Computer Laboratory Quick History of Data Management 4000 B C Manual recording From tablets to papyrus to paper A. Payberah 2014 2

More information

Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

More information

The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

More information

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12 MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

More information

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form

A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form Section 1.3 Matrix Products A linear combination is a sum of scalars times quantities. Such expressions arise quite frequently and have the form (scalar #1)(quantity #1) + (scalar #2)(quantity #2) +...

More information

RevoScaleR Speed and Scalability

RevoScaleR Speed and Scalability EXECUTIVE WHITE PAPER RevoScaleR Speed and Scalability By Lee Edlefsen Ph.D., Chief Scientist, Revolution Analytics Abstract RevoScaleR, the Big Data predictive analytics library included with Revolution

More information

Big Table A Distributed Storage System For Data

Big Table A Distributed Storage System For Data Big Table A Distributed Storage System For Data OSDI 2006 Fay Chang, Jeffrey Dean, Sanjay Ghemawat et.al. Presented by Rahul Malviya Why BigTable? Lots of (semi-)structured data at Google - - URLs: Contents,

More information

SOLVING LINEAR SYSTEMS

SOLVING LINEAR SYSTEMS SOLVING LINEAR SYSTEMS Linear systems Ax = b occur widely in applied mathematics They occur as direct formulations of real world problems; but more often, they occur as a part of the numerical analysis

More information

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group

HiBench Introduction. Carson Wang (carson.wang@intel.com) Software & Services Group HiBench Introduction Carson Wang (carson.wang@intel.com) Agenda Background Workloads Configurations Benchmark Report Tuning Guide Background WHY Why we need big data benchmarking systems? WHAT What is

More information

Hadoop SNS. renren.com. Saturday, December 3, 11

Hadoop SNS. renren.com. Saturday, December 3, 11 Hadoop SNS renren.com Saturday, December 3, 11 2.2 190 40 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December 3, 11 Saturday, December

More information

Benchmarking Cassandra on Violin

Benchmarking Cassandra on Violin Technical White Paper Report Technical Report Benchmarking Cassandra on Violin Accelerating Cassandra Performance and Reducing Read Latency With Violin Memory Flash-based Storage Arrays Version 1.0 Abstract

More information

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat

ESS event: Big Data in Official Statistics. Antonino Virgillito, Istat ESS event: Big Data in Official Statistics Antonino Virgillito, Istat v erbi v is 1 About me Head of Unit Web and BI Technologies, IT Directorate of Istat Project manager and technical coordinator of Web

More information

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014 Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

More information

MapReduce Job Processing

MapReduce Job Processing April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

More information

Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

More information

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

More information

A Comparison of Join Algorithms for Log Processing in MapReduce

A Comparison of Join Algorithms for Log Processing in MapReduce A Comparison of Join Algorithms for Log Processing in MapReduce Spyros Blanas, Jignesh M. Patel Computer Sciences Department University of Wisconsin-Madison {sblanas,jignesh}@cs.wisc.edu Vuk Ercegovac,

More information

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy

MapReduce Online. Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears. Neeraj Ganapathy MapReduce Online Tyson Condie, Neil Conway, Peter Alvaro, Joseph Hellerstein, Khaled Elmeleegy, Russell Sears Neeraj Ganapathy Outline Hadoop Architecture Pipelined MapReduce Online Aggregation Continuous

More information

Lecture Data Warehouse Systems

Lecture Data Warehouse Systems Lecture Data Warehouse Systems Eva Zangerle SS 2013 PART C: Novel Approaches in DW NoSQL and MapReduce Stonebraker on Data Warehouses Star and snowflake schemas are a good idea in the DW world C-Stores

More information

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications

Performance Comparison of Intel Enterprise Edition for Lustre* software and HDFS for MapReduce Applications Performance Comparison of Intel Enterprise Edition for Lustre software and HDFS for MapReduce Applications Rekha Singhal, Gabriele Pacciucci and Mukesh Gangadhar 2 Hadoop Introduc-on Open source MapReduce

More information

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18

A bit about Hadoop. Luca Pireddu. March 9, 2012. CRS4Distributed Computing Group. luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 A bit about Hadoop Luca Pireddu CRS4Distributed Computing Group March 9, 2012 luca.pireddu@crs4.it (CRS4) Luca Pireddu March 9, 2012 1 / 18 Often seen problems Often seen problems Low parallelism I/O is

More information