# Distributed Response Time Analysis of GSPN Models with MapReduce

Save this PDF as:

Size: px
Start display at page:

## Transcription

2 number of tokens on each place of the model. The reachability set or state space of a Place-Transition net is the set of all possible markings that can be reached from a given initial marking. The reachability graph shows the connections between these markings. Generalised Stochastic Petri nets (see e.g. Figures 4 and 5) extend Place-Transition nets by incorporating timing information. A timed transition t i has an exponentially distributed firing rate λ i. Immediate transitions have priority over timed transitions and fire in zero time. Markings that enable timed transitions only are known as tangible, while markings that enable any immediate transition are called vanishing. The sojourn time in a tangible marking M i is exponentially distributed with parameter µ i = k en(mi ) λ k where en(m i ) is the set of transitions enabled by marking M i. The sojourn time in vanishing markings is zero. Formally, [2]: Definition 2.1 A Generalised Stochastic Petri net is an 8- tuple GSPN = (P,T,I,I +,M 0,T 1,T 2,W). P = {p 1,..., p P } is a finite and non-empty set of places T = {t 1,...,t T } is a finite and non-empty set of transitions. P T = /0. I,I + : P T N 0 are the backward and forward incidence functions, respectively. M 0 : P N 0 is the initial marking. T 1 T is the set of timed transitions. T 2 T is the set of immediate transitions; T 1 T 2 = /0 and T = T 1 T 2. W = ( w 1,...,w T ) is an array whose entry w i R + is either a rate of a negative exponential distribution specifying the firing delay, when transition t i is a timed transition, or a firing weight, when transition t i is an immediate transition. We further define p i j to be the probability that M j is the next marking entered after marking M i and, for tangible marking M i, q i j = µ i p i j, i.e. q i j is the instantaneous transition rate into marking M j from marking M i. 2.1 Response Time Analysis using Numerical Laplace Transform Inversion If we first consider a GSPN whose state space does not contain any vanishing states, the definition of the first passage time from a single source marking i to a non-empty set of target markings j is given by: T i j = inf{u > 0 : M(u) j,n(u) > 0,M(0) = i} where M(u) is the marking of the GSPN at time u and N(u) is the number of transitions which have fired by time u. When studying GSPNs whose state spaces include vanishing states we define the passage time as: T i j = inf{u > 0 : N(u) M i j } where M i j = min{m Z+ : X m j X 0 = i}; here X m is the state of the system after the mth transition firing [4]. To find this passage time we must convolve the state sojourn time densities for all paths from i to j j. This is best done in the Laplace domain as we can take advantage of the convolution property which states that the convolution of two functions is equal to the product of their Laplace transforms. We perform a first-step analysis to find the Laplace transform of the relevant density. This process can be thought of as first finding the probability density of moving from state i to its set of direct successor states k and then convolving it with the probability density of moving from k to the set of target states j. Vanishing markings have a sojourn time density of 0, with probability 1, which results in their Laplace transform equalling 1 for all values of s. If L i j (s) is the Laplace transform of the density function f i j (t) of the passage time variable T i j, then we can express L i j (s) as: ( ) ( ) qik L L i j (s) = s+µ k j (s)+ qik i ifi T s+µ i k/ j k j k/ j p ikl k j (s)+ k j p ik ifi V where T is the set of tangible markings and V is the set of vanishing markings. This system of linear equations can also be expressed in matrix vector form. If, for example, we wish to find the passage time from state i to the set of states j = {M 1,M 3 }, where T = {M 1,M 3,...,M n } and V = {M 2 }, then: s q 11 q 12 0 q 1n q p 2n p 21 + p 23 0 q 32 s q 33 q 3n q L = 31 (1) q n2 0 q nn q n1 + q n3 where L = (L 1 j (s),...,l n j(s)). If we wish to calculate the passage time from multiple source states, denoted by the vector i, the Laplace transform of the passage time density is given by: L i j (s) = α k L k j (s) k i where α k is the steady-state probability that the GSPN is in state k at the starting instant of the passage. α k is given by: { πk α k = / n i π n ifk i (2) 0 otherwise where π k is the kth element of the steady-state probability vector π of the GSPN s underlying embedded Markov Chain. Now that we have the Laplace transform of the passage time, we must invert it to get the density of interest in the real domain. To do this we can use Euler inversion [1]. This works by evaluating the Laplace transform f (s) at various s-values determined by the value(s) of t at which we wish to evaluate f(t). From these results it approximates the inverse Laplace transform of f (s), i.e. f(t). Formally: ( ( f(t) ea/2 A Re f ))+ ea/2 ( ( )) A+2kπi ( 1) k Re f (3) 2t 2t 2t 2t k=1

5 Figure 2. Overview of Response Time Analysis module cally or as a distributed MapReduce job on Hadoop. Finally, the results are displayed as a graph whose underlying data can be saved as CSV file. 4.2 Reachability Graph Generator The reachability graph genenerator used in the Response Time Analysis module is based on an existing one already implemented in PIPE2 by [7]. Its concept is to perform a breadth-first search of the states of the GSPN s underlying SMP. It starts with a single state and finds all the states that can be reached from it in a single transition. This process is then repeated for each of those successor states until all states have been explored. In order to detect cycles a record must be kept in memory of each state identified; this presents a significant problem when dealing with large state spaces. Storing an array representing the marking of each state s places would consume far too much memory. A better approach is to employ a probabilistic, dynamic hashing technique, as devised in [8]. Here, only a hash of the state s marking array is stored in one of many linked lists which are in turn stored in a hash table. By using a second hash function to determine which list to store each state in the risk of collisions is dramatically reduced. A full representation is also stored on disk as it is necessary when identifying start and target states. The new I/O classes introduced in Java J2SE 5 were used to dramatically improve performance when writing to disk. 4.3 Dynamic Start/Target State Identifier A passage time of interest can be specified by defining a set of start states and a set of target states. For example, a user might wish to calculate the passage time from any state where a buffer contains three items, to any state where it contains none. In Petri net modelling the buffer would correspond to a place while the items would be tokens. A convenient way for the user to be able to specify sets of start and target states is by giving conditions on the number of tokens on places. Finding the corresponding states is a non-trivial problem as the entire state space must be searched to identify such states. A very fast algorithm is required as state spaces can be huge. We accomplish this by allowing the user to enter a logical expression, whose terms compare the markings of places with constants or the markings of other places. This is then translated into a Java expression which is inserted into a template that is compiled and invoked at run-time to check whether each state matches the user s conditions. 4.4 Steady-State Solver The steady-state solver uses the Gauss-Seidel iterative method to find the steady-state distribution vector of a Markov chain represented by a Q (or P) matrix by solving the equation πq = 0 (or πp = π). To obtain standard linear system form Ax = b requires the transpose of the Q or P matrix, which we generate with an appropriate transpose function.

6 Figure 3. An overview of the MapReduce distributed linear equation solver used in the RTA module 4.5 Linear Solution and Numerical Laplace Transform Inversion The next step is to set up the linear system of Equation 1 of the form AL = b with the aim of solving to find the response time vector, L. Recall that each element of the vector L i = L i j (s) represents the Laplace transform of the response time distribution between an initial state i and a set of target states j sampled at a point s for 1 i n. If multiple start markings are identified, a vector α is calculated from the normalised steady-state probability vector and the quantity α L found. This gives us the Laplace transform of the response time density from a set of initial states to a set of target states. The solution process is driven by the time-range over which the user wishes to plot the probability density function of the response time. Each t-point of the final response time distribution requires 65 s-point function calls (in this implementation) of the Laplace transform of the response time density. Each s-point sample of the Laplace transform is given by a single solution of Equation 1. The precise set of s-values required are calculated from the Euler Laplace inversion algorithm as a function of the desired time range of the final plot. Thus a time range of 100 points may require as many as 6500 distinct solutions of Equation 1, provided by a standard Gauss-Seidel iterative method For models with large state spaces solving the sets of linear equations is too processor intensive to do locally. We therefore integrate the module with the Hadoop MapReduce framework. An overview of this process is shown in Figure 3. In order to store L(s), we set up a Hashmap indexed on the s-value of the Laplace transform. This has the advantage that any repeated s-values need only be calculated once. The list of s-values is then copied to a number of sequence files, a special file format containing key/value pairs which is used by Hadoop as an input to a MapReduce job. By additionally storing the quantity L(s)/s and inverting, we can easily retrieve the CDF of the passage time, for very little extra computation. Each sequence file corresponds to a Map and the s-values are split evenly between them. It was necessary to do this explicitly as Hadoop s automatic file splitting functionality is aimed at much larger data files. The set of A-matrices corresponding to a set of the required s-values are serialised and the resulting binary file is copied into the cluster s HDFS. When a node receives a Map it will run the Map function a number of times; once for each s- value in its associated sequence file. For the first Map function run on a node, the A-matrices are copied out of the HDFS to local storage and deserialised. Subsequent calls to the Map function (even as part of different Map s) then use this

7 local copy, thereby greatly reducing network traffic. Whilst the L(s) values are being calculated, a single Reduce is started. We use the Reduce simply to collect all the L(s) values from across the cluster and copy them to a single output sequence file. With the distributed job complete, the response time calculator copies the results into a HashMap indexed on s-values for fast access and runs the Euler algorithm. 5 NUMERICAL RESULTS All results presented in this section were produced by PIPE2 running in conjunction with the latest development version of Hadoop (0.13.1) on a cluster of 15 Sun Fire x4100 machines, each with two dual-core, 64-bit Opteron 275 processors and 8GB of RAM. The operating system is a 64-bit version of Mandrake Linux and nodes are connected by gigabit ethernet and an Infiniband interface managed by a Silverstorm 9024 switch with a throughput of 2.5Gbit/s. One of the nodes was designated the master machine and ran the Hadoop Namenode and JobTracker processes, as well as PIPE Validation Our validation process began with the Branching Erlang model, taken from [9] and shown in Figure 4, which consists of two branches with known response times. In particular, the upper branch has an Erlang(12, 2) distribution, while the lower has an Erlang(3, 1) distribution. There is an equal probability of either branch being taken, as the weights of the immediate transitions are identical. As Erlang distributions are trivial to calculate analytically we can therefore compare the results form our numerical Laplace transform inversion method with their true values. Figures 6 and 7 compare the results produced by PIPE2 and those calculated analytically for the cycle time density and its corresponding CDF function of the Branching Erlang model. Excellent agreement can be seen between the two. These results demonstrate the Response Time Analysis module s ability to handle cases where the set of source and target states overlap (i.e. to calculate cycle times), as well as bimodal density curves. To validate the module for larger models with multiple start and target states we used the Courier Protocol model, first presented in [10] and shown in Figure 5. It models the ISO Application, Session and Transport layers of the Courier sliding-window communication protocol. By increasing the number of tokens on p14, the sliding-window size, we can dramatically increase the state space of the model. We begin our validation with this set to one, which results in a state space of The module completed this exploration in less than 8 seconds on a single machine. Results for the passage time from the set of markings where M(p11) > 0 to those where M(p20) > 0 are shown in Figure 8, where Cluster No. Maps Total Total Time Size Per Node Cores Maps (seconds) Table 1. Laplace transform inversion times for the Courier Protocol (window size 1) on various cluster sizes source markings and target markings were identified. They closely match simulation results for this same model that were produced in [6]. It should be noted that our model uses a scaled set of rates that are equal to the original benchmarked rates divided by This is necessary as the range in magnitude of the original rates causes problems with the numerical methods used to invert the Laplace transform. The results presented here are the raw results from the PIPE2 module and so must be re-scaled to give the correct timings. In order to ascertain how the Response Time Analysis module performs with models with larger state spaces we again used the Courier Protocol model, increasing its window size to 3. This results in a state space of states (including vanishing states) with transitions between them. Again, analysing from markings where M(p11) > 0 to markings where M(p20) > 0, we find there are start markings and target markings. Results were produced for 50 t-points ranging from 1 to 99 in increments of 2, resulting in a work queue of over systems of linear equations, each of rank 2.2 million. State space exploration took 20 minutes, while the Laplace transform inversion took 8 hours 9 minutes on a single node. Generation times for the various other matrices totalled less than 20 seconds. 5.2 Processing Times Table 1 shows the time taken to perform the Laplace transform inversion for the Courier Protocol model (window size 1) for 50 t-points on various cluster sizes. The cluster size column refers to the number of compute nodes assigned to the Hadoop cluster. The second column indicates the number of Map s assigned to each node. Hadoop allows multiple Map s to be run concurrently on a single node which is of particular benefit with multicore machines as it allows full use to be made of all cores. Where only one Map was assigned to a node only one core was in use. This was scaled up to 8 and 15 machine clusters until all cores were in use at once. The third column shows the total number of cores being

8 used simultaneously. The optimum map granularity for each cluster size was found through experimentation and is listed in the fourth column. t1 (r7) t34 (r10) sender application p1 p2 p45 receiver p46 application t2 t33 (r1) courier1 p3 p4 p43 p44 courier4 t3 (r1) t32 p5 p42 sender session p6 t4 (r2) t31 (r2) receiver p41 session p8 p40 t5 t30 (r1) courier2 p10 p9 p38 p39 courier3 t6 (r1) t29 p11 p36 p37 t7 (r8) t28 (r9) sender transport p12 p13 p35 receiver transport t8 (q1) t9 (q2) t27 n p15 p16 p33 p34 p14 t12 (r3) t10 (r5) t11 (r5) m p17 p32 t25 (r5) t26 (r5) p20 p18 p19 p30 p31 t13 t14 t22 t23 t24 t15 p21 p22 p27 p28 p29 p23 t16 (r6) t17 (r6) network delay t18 (r4) p24 p25 network delay t19 (r3) t20 (r4) t21 (r4) p26 Figure 5. The Courier Protocol model PIPE2 results Analytical results 0.12 Figure 4. The Branching Erlang model It is clear from Table 1 that the distributed response time calculator offers excellent scalability. With small clusters there is an approximate halving of calculation time as the cluster size is doubled. As the cluster sizes (and hence the number of Map s) grow this improvement drops slightly to a factor of approximately 1.8. This is to be expected as there is some overhead in setting up Map s. When the number of cores used on each node is increased we again see a good reduction in processing times. However, we no longer see the calculation time halve as the available cores double. It is likely that this is due to contention for Probability density Time (seconds) Figure 6. Cycle time distribution from markings where M(p1) > 0 to markings where M(p1) > 0 in the Branching Erlang model

10 No. Map Tasks Calculation Time (s) Fastest Map (s) Slowest Map (s) Table 2. Laplace transform inversion times for the Courier Protocol (window size 1) for various granularities cellent reliability, good fault tolerance and much improved development time. Models of up to at least 2.2 million states were shown to be easily accommodated using in-core processing. Reimplementing the linear equation solving algorithms as diskbased, rather than in-core, would allow for much larger model sizes. Results produced by the Response Time Analysis module were validated for smaller models with analytically calculated results and for larger models with simulations. Excellent scalability was shown, with an almost linear improvement in calculation times with increased cluster sizes. REFERENCES [1] J. Abate and W. Whitt. The Fourier-series method for inverting transforms of probability distributions. Queueing Systems, 10(1):5 88, [2] F. Bause and P.S. Kritzinger. Stochastic Petri Nets: An Introduction to the Theory. Vieweg Verlag, Wiesbaden, Germany, 2nd edition, [7] T. Kimber, B. Kirby, T. Master, and M. Worthington. Petri nets group project final report. Technical report, Imperial College, London, United Kingdom, March [8] W.J. Knottenbelt. Generalised Markovian analysis of timed transition systems. Master s thesis, University of Cape Town, Cape Town, South Africa, July [9] W.J. Knottenbelt and P.G. Harrison. Passage time distributions in large Markov chains. In Proceedings of ACM SIGMETRICS, pages 77 85, Marina Del Rey, California, U.S.A., June [10] C.M. Woodside and Y. Li. Performance Petri net analysis of communication protocol software by delayequivalent aggregation. In Proceeding of the 4th International Workshop on Petri nets and Performance Models (PNPM 91), pages 64 73, Melbourne, Australia, December IEEE Computer Society Press. [3] P. Bonet, C.M. Llado, R. Puijaner, and W.J. Knottenbelt. PIPE v2.5: A Petri net tool for performance modelling. In Proc. 23rd Latin American Conference on Informatics (CLEI 2007), San Jose, Costa Rica, October [4] J.T. Bradley, N.J. Dingle, W.J. Knottenbelt, and H.J. Wilson. Hypergraph-based parallel computation of passage time desnsities in large semi-markov models. In Linear Algebra and its Applications: Volume 386, pages Elsevier, July [5] J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. In Proceedings of the OSDI 04: Sixth Symposium on Operating System Design and Implementation, San Francisco, California, U.S.A., December [6] N.J. Dingle, P.G. Harrison, and W.J. Knottenbelt. Response time densities in Generalised Stochastic Petri net models. In Proceedings of the 3rd ACM Workshop on Software and Performance (WOSP 2002), pages 46 54, Rome, Italy, 2002.

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

Hadoop Architecture Part 1 Node, Rack and Cluster: A node is simply a computer, typically non-enterprise, commodity hardware for nodes that contain data. Consider we have Node 1.Then we can add more nodes,

### Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components

Welcome to the unit of Hadoop Fundamentals on Hadoop architecture. I will begin with a terminology review and then cover the major components of Hadoop. We will see what types of nodes can exist in a Hadoop

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### Parallel Processing of cluster by Map Reduce

Parallel Processing of cluster by Map Reduce Abstract Madhavi Vaidya, Department of Computer Science Vivekanand College, Chembur, Mumbai vamadhavi04@yahoo.co.in MapReduce is a parallel programming model

### Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

### Keywords: Big Data, HDFS, Map Reduce, Hadoop

Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

### Weekly Report. Hadoop Introduction. submitted By Anurag Sharma. Department of Computer Science and Engineering. Indian Institute of Technology Bombay

Weekly Report Hadoop Introduction submitted By Anurag Sharma Department of Computer Science and Engineering Indian Institute of Technology Bombay Chapter 1 What is Hadoop? Apache Hadoop (High-availability

### Fault Tolerance in Hadoop for Work Migration

1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

### Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### marlabs driving digital agility WHITEPAPER Big Data and Hadoop

marlabs driving digital agility WHITEPAPER Big Data and Hadoop Abstract This paper explains the significance of Hadoop, an emerging yet rapidly growing technology. The prime goal of this paper is to unveil

### MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

### A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS

A STUDY ON HADOOP ARCHITECTURE FOR BIG DATA ANALYTICS Dr. Ananthi Sheshasayee 1, J V N Lakshmi 2 1 Head Department of Computer Science & Research, Quaid-E-Millath Govt College for Women, Chennai, (India)

### DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture Dr. Wlodek Zadrozny (Most slides come from Prof. Akella s class in 2014) 2015-2025. Reproduction or usage prohibited without permission of

Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

### Survey on Scheduling Algorithm in MapReduce Framework

Survey on Scheduling Algorithm in MapReduce Framework Pravin P. Nimbalkar 1, Devendra P.Gadekar 2 1,2 Department of Computer Engineering, JSPM s Imperial College of Engineering and Research, Pune, India

### http://www.wordle.net/

Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

### Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

### Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware Created by Doug Cutting and Mike Carafella in 2005. Cutting named the program after

### Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

### International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

1 What is Hadoop? Introduction to Hadoop We are living in an era where large volumes of data are available and the problem is to extract meaning from the data avalanche. The goal of the software tools

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA

Tutorial: Big Data Algorithms and Applications Under Hadoop KUNPENG ZHANG SIDDHARTHA BHATTACHARYYA http://kzhang6.people.uic.edu/tutorial/amcis2014.html August 7, 2014 Schedule I. Introduction to big data

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

### MAPREDUCE Programming Model

CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

### Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Introduction to Hadoop HDFS and Ecosystems ANSHUL MITTAL Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data Topics The goal of this presentation is to give

### From GWS to MapReduce: Google s Cloud Technology in the Early Days

Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

### MapReduce. Tushar B. Kute, http://tusharkute.com

MapReduce Tushar B. Kute, http://tusharkute.com What is MapReduce? MapReduce is a framework using which we can write applications to process huge amounts of data, in parallel, on large clusters of commodity

### MapReduce and Hadoop Distributed File System

MapReduce and Hadoop Distributed File System 1 B. RAMAMURTHY Contact: Dr. Bina Ramamurthy CSE Department University at Buffalo (SUNY) bina@buffalo.edu http://www.cse.buffalo.edu/faculty/bina Partially

### http://www.paper.edu.cn

5 10 15 20 25 30 35 A platform for massive railway information data storage # SHAN Xu 1, WANG Genying 1, LIU Lin 2** (1. Key Laboratory of Communication and Information Systems, Beijing Municipal Commission

### CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

### Mobile Cloud Computing for Data-Intensive Applications

Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, vct@andrew.cmu.edu Advisor: Professor Priya Narasimhan, priya@cs.cmu.edu Abstract The computational and storage

### Overview. Big Data in Apache Hadoop. - HDFS - MapReduce in Hadoop - YARN. https://hadoop.apache.org. Big Data Management and Analytics

Overview Big Data in Apache Hadoop - HDFS - MapReduce in Hadoop - YARN https://hadoop.apache.org 138 Apache Hadoop - Historical Background - 2003: Google publishes its cluster architecture & DFS (GFS)

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

MapReduce and Implementation Hadoop Parallel Data Processing Kai Shen A programming interface (two stage Map and Reduce) and system support such that: the interface is easy to program, and suitable for

### White Paper. Big Data and Hadoop. Abhishek S, Java COE. Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP

White Paper Big Data and Hadoop Abhishek S, Java COE www.marlabs.com Cloud Computing Mobile DW-BI-Analytics Microsoft Oracle ERP Java SAP ERP Table of contents Abstract.. 1 Introduction. 2 What is Big

### Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

### Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework

Recognization of Satellite Images of Large Scale Data Based On Map- Reduce Framework Vidya Dhondiba Jadhav, Harshada Jayant Nazirkar, Sneha Manik Idekar Dept. of Information Technology, JSPM s BSIOTR (W),

### Mobile Storage and Search Engine of Information Oriented to Food Cloud

Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

### Similarity Search in a Very Large Scale Using Hadoop and HBase

Similarity Search in a Very Large Scale Using Hadoop and HBase Stanislav Barton, Vlastislav Dohnal, Philippe Rigaux LAMSADE - Universite Paris Dauphine, France Internet Memory Foundation, Paris, France

### MapReduce (in the cloud)

MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

### Hadoop Design and k-means Clustering

Hadoop Design and k-means Clustering Kenneth Heafield Google Inc January 15, 2008 Example code from Hadoop 0.13.1 used under the Apache License Version 2.0 and modified for presentation. Except as otherwise

### PIPE2: A Tool for the Performance Evaluation of Generalised Stochastic Petri Nets

PIPE2: A Tool for the Performance Evaluation of Generalised Stochastic Petri Nets Nicholas J. Dingle William J. Knottenbelt Tamas Suto Department of Computing, Imperial College London, 180 Queen s Gate,

### HadoopRDF : A Scalable RDF Data Analysis System

HadoopRDF : A Scalable RDF Data Analysis System Yuan Tian 1, Jinhang DU 1, Haofen Wang 1, Yuan Ni 2, and Yong Yu 1 1 Shanghai Jiao Tong University, Shanghai, China {tian,dujh,whfcarter}@apex.sjtu.edu.cn

### Energy Efficient MapReduce

Energy Efficient MapReduce Motivation: Energy consumption is an important aspect of datacenters efficiency, the total power consumption in the united states has doubled from 2000 to 2005, representing

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Outline. High Performance Computing (HPC) Big Data meets HPC. Case Studies: Some facts about Big Data Technologies HPC and Big Data converging

Outline High Performance Computing (HPC) Towards exascale computing: a brief history Challenges in the exascale era Big Data meets HPC Some facts about Big Data Technologies HPC and Big Data converging

### LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT Samira Daneshyar 1 and Majid Razmjoo 2 1,2 School of Computer Science, Centre of Software Technology and Management (SOFTEM),

### MapReduce, Hadoop and Amazon AWS

MapReduce, Hadoop and Amazon AWS Yasser Ganjisaffar http://www.ics.uci.edu/~yganjisa February 2011 What is Hadoop? A software framework that supports data-intensive distributed applications. It enables

### MASSIVE DATA PROCESSING (THE GOOGLE WAY ) 27/04/2015. Fundamentals of Distributed Systems. Inside Google circa 2015

7/04/05 Fundamentals of Distributed Systems CC5- PROCESAMIENTO MASIVO DE DATOS OTOÑO 05 Lecture 4: DFS & MapReduce I Aidan Hogan aidhog@gmail.com Inside Google circa 997/98 MASSIVE DATA PROCESSING (THE

### Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

### Distributed File System. MCSN N. Tonellotto Complements of Distributed Enabling Platforms

Distributed File System 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributed File System Don t move data to workers move workers to the data! Store data on the local disks of nodes

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.

An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute

### MapReduce Job Processing

April 17, 2012 Background: Hadoop Distributed File System (HDFS) Hadoop requires a Distributed File System (DFS), we utilize the Hadoop Distributed File System (HDFS). Background: Hadoop Distributed File

### Performance and Energy Efficiency of. Hadoop deployment models

Performance and Energy Efficiency of Hadoop deployment models Contents Review: What is MapReduce Review: What is Hadoop Hadoop Deployment Models Metrics Experiment Results Summary MapReduce Introduced

### Lecture 32 Big Data. 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop

Lecture 32 Big Data 1. Big Data problem 2. Why the excitement about big data 3. What is MapReduce 4. What is Hadoop 5. Get started with Hadoop 1 2 Big Data Problems Data explosion Data from users on social

### Hadoop: A Framework for Data- Intensive Distributed Computing. CS561-Spring 2012 WPI, Mohamed Y. Eltabakh

1 Hadoop: A Framework for Data- Intensive Distributed Computing CS561-Spring 2012 WPI, Mohamed Y. Eltabakh 2 What is Hadoop? Hadoop is a software framework for distributed processing of large datasets

### BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

### INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK A REVIEW ON HIGH PERFORMANCE DATA STORAGE ARCHITECTURE OF BIGDATA USING HDFS MS.

The Hadoop Framework Nils Braden University of Applied Sciences Gießen-Friedberg Wiesenstraße 14 35390 Gießen nils.braden@mni.fh-giessen.de Abstract. The Hadoop Framework offers an approach to large-scale

IMPROVED FAIR SCHEDULING ALGORITHM FOR TASKTRACKER IN HADOOP MAP-REDUCE Mr. Santhosh S 1, Mr. Hemanth Kumar G 2 1 PG Scholor, 2 Asst. Professor, Dept. Of Computer Science & Engg, NMAMIT, (India) ABSTRACT

### R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5

Distributed data processing in heterogeneous cloud environments R.K.Uskenbayeva 1, А.А. Kuandykov 2, Zh.B.Kalpeyeva 3, D.K.Kozhamzharova 4, N.K.Mukhazhanov 5 1 uskenbaevar@gmail.com, 2 abu.kuandykov@gmail.com,

### A Survey of Cloud Computing

A Survey of Cloud Computing Guanfeng Nov 7, 2010 Abstract The principal service provided by cloud computing is that underlying infrastructure, which often consists of compute resources like storage, processors,

### SIMULATION AND MODELLING OF RAID 0 SYSTEM PERFORMANCE

SIMULATION AND MODELLING OF RAID 0 SYSTEM PERFORMANCE F. Wan N.J. Dingle W.J. Knottenbelt A.S. Lebrecht Department of Computing, Imperial College London, South Kensington Campus, London SW7 2AZ email:

Hadoop Scheduler w i t h Deadline Constraint Geetha J 1, N UdayBhaskar 2, P ChennaReddy 3,Neha Sniha 4 1,4 Department of Computer Science and Engineering, M S Ramaiah Institute of Technology, Bangalore,

### A computational model for MapReduce job flow

A computational model for MapReduce job flow Tommaso Di Noia, Marina Mongiello, Eugenio Di Sciascio Dipartimento di Ingegneria Elettrica e Dell informazione Politecnico di Bari Via E. Orabona, 4 70125

### Prepared By : Manoj Kumar Joshi & Vikas Sawhney

Prepared By : Manoj Kumar Joshi & Vikas Sawhney General Agenda Introduction to Hadoop Architecture Acknowledgement Thanks to all the authors who left their selfexplanatory images on the internet. Thanks

### MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

### International Journal of Advance Research in Computer Science and Management Studies

Volume 2, Issue 8, August 2014 ISSN: 2321 7782 (Online) International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online

### GraySort and MinuteSort at Yahoo on Hadoop 0.23

GraySort and at Yahoo on Hadoop.23 Thomas Graves Yahoo! May, 213 The Apache Hadoop[1] software library is an open source framework that allows for the distributed processing of large data sets across clusters

### Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13

Big Data Analytics with MapReduce VL Implementierung von Datenbanksystemen 05-Feb-13 Astrid Rheinländer Wissensmanagement in der Bioinformatik What is Big Data? collection of data sets so large and complex

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### Optimizing Java* and Apache Hadoop* for Intel Architecture

WHITE PAPER Intel Xeon Processors Optimizing and Apache Hadoop* Optimizing and Apache Hadoop* for Intel Architecture With the ability to analyze virtually unlimited amounts of unstructured and semistructured

### Associate Professor, Department of CSE, Shri Vishnu Engineering College for Women, Andhra Pradesh, India 2

Volume 6, Issue 3, March 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Special Issue

### Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

### Analysing Large Web Log Files in a Hadoop Distributed Cluster Environment

Analysing Large Files in a Hadoop Distributed Cluster Environment S Saravanan, B Uma Maheswari Department of Computer Science and Engineering, Amrita School of Engineering, Amrita Vishwa Vidyapeetham,

### Hadoop Fair Scheduler Design Document

Hadoop Fair Scheduler Design Document October 18, 2010 Contents 1 Introduction 2 2 Fair Scheduler Goals 2 3 Scheduler Features 2 3.1 Pools........................................ 2 3.2 Minimum Shares.................................

### IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 2, Issue 1, Feb-Mar, 2014 ISSN: 2320-8791 www.ijreat.

Design of Log Analyser Algorithm Using Hadoop Framework Banupriya P 1, Mohandas Ragupathi 2 PG Scholar, Department of Computer Science and Engineering, Hindustan University, Chennai Assistant Professor,

### High Performance Computing MapReduce & Hadoop. 17th Apr 2014

High Performance Computing MapReduce & Hadoop 17th Apr 2014 MapReduce Programming model for parallel processing vast amounts of data (TBs/PBs) distributed on commodity clusters Borrows from map() and reduce()

### Hadoop. Scalable Distributed Computing. Claire Jaja, Julian Chan October 8, 2013

Hadoop Scalable Distributed Computing Claire Jaja, Julian Chan October 8, 2013 What is Hadoop? A general-purpose storage and data-analysis platform Open source Apache software, implemented in Java Enables

### Windows Server Performance Monitoring

Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

### Introduction to Hadoop. New York Oracle User Group Vikas Sawhney

Introduction to Hadoop New York Oracle User Group Vikas Sawhney GENERAL AGENDA Driving Factors behind BIG-DATA NOSQL Database 2014 Database Landscape Hadoop Architecture Map/Reduce Hadoop Eco-system Hadoop

### Snapshots in Hadoop Distributed File System

Snapshots in Hadoop Distributed File System Sameer Agarwal UC Berkeley Dhruba Borthakur Facebook Inc. Ion Stoica UC Berkeley Abstract The ability to take snapshots is an essential functionality of any

### Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

### Benchmarking Hadoop & HBase on Violin

Technical White Paper Report Technical Report Benchmarking Hadoop & HBase on Violin Harnessing Big Data Analytics at the Speed of Memory Version 1.0 Abstract The purpose of benchmarking is to show advantages

### Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

### Internals of Hadoop Application Framework and Distributed File System

International Journal of Scientific and Research Publications, Volume 5, Issue 7, July 2015 1 Internals of Hadoop Application Framework and Distributed File System Saminath.V, Sangeetha.M.S Abstract- Hadoop

### Distributed File Systems

Distributed File Systems Paul Krzyzanowski Rutgers University October 28, 2012 1 Introduction The classic network file systems we examined, NFS, CIFS, AFS, Coda, were designed as client-server applications.

### Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input