# Efficient Parallel kNN Joins for Large Data in MapReduce


For the rest (with our partitioning constraint in mind), two simple and good choices are to ensure: (1) the same size for $R_i$'s blocks; or (2) the same size for $S_i$'s blocks.

In the first choice, each reduce task has a block $R_{i,j}$ that satisfies $|R_{i,j}| = |R_i|/n$; the size of $S_{i,j}$ may vary. The worst case happens when there is a block $S_{i,j}$ such that $|S_{i,j}| = |S_i|$, i.e., all points in $S_i$ were contained in the $j$th block $S_{i,j}$. In this case, the $j$th reducer does most of the work, and the other $(n-1)$ reducers have a very light load (since any other block of $S_i$ contains only $2k$ records, see (4)). The bottleneck of the Reduce phase is clearly the $j$th reduce task, with a cost of $O(|R_{i,j}| \log |S_{i,j}|) = O(\frac{|R_i|}{n} \log |S_i|)$.

In the second choice, each reduce task has a block $S_{i,j}$ that satisfies $|S_{i,j}| = |S_i|/n$; the size of $R_{i,j}$ may vary. The worst case happens when there is a block $R_{i,j}$ such that $|R_{i,j}| = |R_i|$, i.e., all points in $R_i$ were contained in the $j$th block $R_{i,j}$. In this case, the $j$th reducer does most of the work, and the other $(n-1)$ reducers have effectively no computation cost. The bottleneck of the Reduce phase is the $j$th reduce task, with a cost of $O(|R_{i,j}| \log |S_{i,j}|) = O(|R_i| \log \frac{|S_i|}{n}) = O(|R_i| \log |S_i| - |R_i| \log n)$. For massive datasets, $n \ll |S_i|$, which implies this cost is $O(|R_i| \log |S_i|)$. Hence, choice (1) is the natural pick for massive data.

Our first challenge is how to create equal-sized blocks $\{D_1, \ldots, D_n\}$ over a massive one-dimensional dataset $D$ in the Map phase, such that if $x \in D_i$ and $y \in D_j$ with $i < j$, then $x < y$. Given a value $n$, if we know the $\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}$ quantiles of $D$, clearly, using these quantiles guarantees $n$ equal-sized blocks. However, obtaining these quantiles exactly is expensive for massive datasets in MapReduce. Thus, we search for an approximation to estimate the $\phi$-quantile over $D$ for any $\phi \in (0,1)$ efficiently in MapReduce.

Let $D$ consist of $N$ distinct integers $\{d_1, d_2, \ldots, d_N\}$, such that $\forall i \in [1, N]$, $d_i \in \mathbb{Z}^+$, and $\forall i_1 \neq i_2$, $d_{i_1} \neq d_{i_2}$. For any element $x \in D$, we define the rank of $x$ in $D$ as $r(x) = \sum_{i=1}^{N} [d_i < x]$, where $[d_i < x] = 1$ if $d_i < x$ and $0$ otherwise. We construct a sample $\widehat{D} = \{s_1, \ldots, s_{|\widehat{D}|}\}$ of $D$ by uniformly and randomly selecting records with probability $p = \frac{1}{\varepsilon^2 N}$, for any $\varepsilon \in (0, 1)$; $\forall x \in \widehat{D}$, its rank $s(x)$ in $\widehat{D}$ is $s(x) = \sum_{i=1}^{|\widehat{D}|} [s_i < x]$, where $[s_i < x] = 1$ if $s_i < x$ and $0$ otherwise.

**Theorem 2** $\forall x \in \widehat{D}$, $\widehat{r}(x) = \frac{1}{p} s(x)$ is an unbiased estimator of $r(x)$ with standard deviation at most $\varepsilon N$, for any $\varepsilon \in (0, 1)$.

*Proof.* We define $N$ independent, identically distributed random variables $X_1, \ldots, X_N$, where $X_i = 1$ with probability $p$ and $0$ otherwise. $\forall x \in \widehat{D}$, $s(x)$ is given by how many $d_i$'s with $d_i < x$ have been sampled from $D$. There are $r(x)$ such $d_i$'s in $D$, each sampled with probability $p$. Suppose their indices are $\{\ell_1, \ldots, \ell_{r(x)}\}$. This implies:
$$s(x) = \sum_{i=1}^{|\widehat{D}|} [s_i < x] = \sum_{i=1}^{r(x)} X_{\ell_i}.$$
$X_i$ is a Bernoulli trial with $p = \frac{1}{\varepsilon^2 N}$ for $i \in [1, N]$. Thus, $s(x)$ follows a Binomial distribution $B(r(x), p)$:
$$E[\widehat{r}(x)] = E\!\left[\tfrac{1}{p} s(x)\right] = \tfrac{1}{p} E[s(x)] = r(x), \quad \text{and}$$
$$\mathrm{Var}[\widehat{r}(x)] = \mathrm{Var}\!\left(\tfrac{s(x)}{p}\right) = \tfrac{1}{p^2}\mathrm{Var}(s(x)) = \tfrac{1}{p^2}\, r(x)\, p (1-p) < \tfrac{N}{p} = (\varepsilon N)^2.$$

Now, given a required rank $r$, we estimate the element with rank $r$ in $D$ as follows. We return the sampled element $x \in \widehat{D}$ that has the closest estimated rank $\widehat{r}(x)$ to $r$, i.e.:
$$x = \arg\min_{x \in \widehat{D}} |\widehat{r}(x) - r|. \quad (5)$$

**Lemma 2** Any $x$ returned by Equation (5) satisfies: $\Pr[\,|r(x) - r| \geq \varepsilon N\,] \leq e^{-2/\varepsilon}$.

*Proof.* We bound the probability of the event that $r(x)$ is away from $r$ by more than $\varepsilon N$. When this happens, it means that no elements with ranks within $\varepsilon N$ of $r$ have been sampled. There are $2\lceil \varepsilon N \rceil$ such rank values. To simplify the presentation, we use $2\varepsilon N$ in our derivation. Since each rank corresponds to a unique element in $D$:
$$\Pr[\,|r(x) - r| > \varepsilon N\,] = (1-p)^{2\varepsilon N} = \left(1 - \tfrac{1}{\varepsilon^2 N}\right)^{2\varepsilon N} < e^{-\frac{1}{\varepsilon^2 N} \cdot 2\varepsilon N} = e^{-2/\varepsilon}.$$
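For concreteness, the following stand-alone Python sketch mirrors this sampling-based quantile estimation (the idea behind estimator1 in Algorithm 3 below). It is an in-memory illustration under our own names and data layout, not the MapReduce implementation.

```python
import random

def sample_quantile_boundaries(D, n, eps):
    """Estimate the 1/n, ..., (n-1)/n quantile elements of D from a uniform
    random sample, following Theorem 2 and Equation (5)."""
    N = len(D)
    p = 1.0 / (eps * eps * N)          # sampling probability p = 1/(eps^2 * N)
    sample = sorted(x for x in D if random.random() < p)
    if not sample:
        raise ValueError("empty sample; use a smaller eps")

    # Estimated rank of the element at index s in the sorted sample:
    # r_hat(x) = s(x)/p, where s(x) = number of sampled values smaller than x.
    est_ranks = [s / p for s in range(len(sample))]

    boundaries = []
    for j in range(1, n):
        target = j * N / n             # required rank of the j/n quantile
        # Return the sampled element whose estimated rank is closest to target.
        best = min(range(len(sample)), key=lambda i: abs(est_ranks[i] - target))
        boundaries.append(sample[best])
    return boundaries

# Example: partition boundaries for n = 4 blocks with eps = 0.05.
if __name__ == "__main__":
    D = random.sample(range(1, 10**6), 50000)
    print(sample_quantile_boundaries(D, n=4, eps=0.05))
```

Note that with $p = 1/(\varepsilon^2 N)$ the expected sample size is $1/\varepsilon^2$, independent of $N$, which is what makes the estimate cheap to compute centrally.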
Thus, we can use $\widehat{D}$ and (5) to obtain a set of estimated quantiles for the $\frac{1}{n}, \frac{2}{n}, \ldots, \frac{n-1}{n}$ quantiles of $D$. This means that for any $R_i$, we can partition $Z_{R_i}$ into $n$ roughly equal-sized blocks, for any $n$, with high probability.

However, there is another challenge. Consider the shifted copy $R_i$ and the estimated quantiles $\{z_{i,1}, \ldots, z_{i,n-1}\}$ from $Z_{R_i}$. In order to partition $S_i$ using these z-values, we need to ensure that, in the $j$th block of $S_i$, $z^-(z_{i,j-1}, k, S_i)$ and $z^+(z_{i,j}, k, S_i)$ are copied over (as illustrated in Figure 4). This implies that the partitioning boundaries of $S_{i,j}$ are no longer simply $z_{i,j-1}$ and $z_{i,j}$! Instead, they have been extended to the $k$th z-values to the left and right of $z_{i,j-1}$ and $z_{i,j}$, respectively. We denote these two boundary values for $S_{i,j}$ as $z_{i,j,k^-}$ and $z_{i,j,k^+}$. Clearly, $z_{i,j,k^-} = \min(z^-(z_{i,j-1}, k, S_i))$ and $z_{i,j,k^+} = \max(z^+(z_{i,j}, k, S_i))$; see Figure 4 for an illustration on the third block $S_{i,3}$ of $S_i$. Finding $z_{i,j,k^-}$ and $z_{i,j,k^+}$ exactly, given $z_{i,j-1}$ and $z_{i,j}$, is expensive in MapReduce: we would need to sort the z-values of $S_i$ to obtain $Z_{S_i}$ and communicate the entire $S_i$ to one reducer, and this would have to be done for every random shift. We again settle for approximations using random sampling.

Our problem generalizes to the following. Given a dataset $D$ of $N$ distinct integer values $\{d_1, \ldots, d_N\}$ and a value $z$, we want to find the $k$th closest value in $D$ that is smaller than $z$ (finding the $k$th closest value larger than $z$ is the same problem). If $D$ is already sorted and we have the entire $D$, this problem is easy: we simply find the $k$th value in $D$ to the left of $z$, which we denote as $z^-_k$. When this is not the case, we obtain a random sample $\widehat{D}$ of $D$ by selecting each element of $D$ into $\widehat{D}$ with a sampling probability $p$. We sort $\widehat{D}$ and return the $\lceil kp \rceil$th value in $\widehat{D}$ to the left of $z$, denoted as $\widehat{z}^-_k$, as our estimate of $z^-_k$ (see Figure 5). When searching for $z^-_k$ or $\widehat{z}^-_k$, only values $\leq z$ matter. Hence, we keep only the $d_i \leq z$ in $D$ and $\widehat{D}$ in the following analysis (as well as in Figure 5). For ease of discussion, we assume that $|D| \geq k$ and $|\widehat{D}| \geq \lceil kp \rceil$. The special case when this does not hold can easily be handled by returning the furthest element to the left of $z$ (i.e., $\min(D)$ or $\min(\widehat{D})$), and our theoretical analysis below for the general case extends easily. Note that both $z^-_k$ and $\widehat{z}^-_k$ are elements of $D$. We define $r(d_i)$ as the rank of $d_i$ in descending order, for any $d_i \in D$, and $s(d_i)$ as the rank of $d_i$ in descending order, for any $d_i \in \widehat{D}$. Clearly, by our construction, $r(z^-_k) = k$ and $s(\widehat{z}^-_k) = \lceil kp \rceil$.
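The construction of $\widehat{z}^-_k$ can be sketched in a few lines of Python; this is an in-memory illustration of the idea (what estimator2 in Algorithm 4 computes on z-values), with the names and the fallback handling as our own assumptions.

```python
import math
import random

def estimate_kth_left(D, z, k, p):
    """Estimate z_k^-, the k-th closest value in D that is smaller than z,
    from a Bernoulli(p) sample: return the ceil(k*p)-th sampled value to the
    left of z, falling back to the smallest sampled value if too few exist."""
    sample = sorted(x for x in D if random.random() < p)
    left = [x for x in sample if x <= z]      # only values <= z matter
    if not left:
        return None                           # nothing sampled to the left of z
    idx = min(math.ceil(k * p), len(left))    # ceil(kp)-th counting from z leftwards
    return left[-idx]

# Example: k = 50, sampling probability p = 1/(eps^2 * N) with eps = 0.05.
if __name__ == "__main__":
    N = 100000
    D = random.sample(range(1, 10**7), N)
    eps = 0.05
    print(estimate_kth_left(D, z=5 * 10**6, k=50, p=1.0 / (eps * eps * N)))
```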

*Figure 5: Estimate $z^-_k$ given any value $z$ (illustration of $D$, the sampled elements, and $\widehat{z}^-_k$; figure omitted).*

**Theorem 3** Let $X = r(\widehat{z}^-_k)$ be a random variable; by our construction, $X$ takes values in the range $[\lceil kp \rceil, r(d_1)]$. Then, for any $a \in [\lceil kp \rceil, r(d_1)]$,
$$\Pr[X = a] = p \binom{a-1}{\lceil kp \rceil - 1} p^{\lceil kp \rceil - 1} (1-p)^{a - \lceil kp \rceil}, \quad (6)$$
$$\text{and} \quad \Pr[\,|X - k| \geq \varepsilon N\,] \leq e^{-2/\varepsilon} \ \text{ if } p = 1/(\varepsilon^2 N). \quad (7)$$

*Proof.* When $X = a$, two events must have happened: $e_1$: the $a$th element $d_\ell$ (with $r(d_\ell) = a$) smaller than $z$ in $D$ was sampled into $\widehat{D}$; $e_2$: exactly $(\lceil kp \rceil - 1)$ elements were sampled into $\widehat{D}$ from the $(a-1)$ elements between $d_\ell$ and $z$ in $D$. $e_1$ and $e_2$ are independent, and $\Pr(e_1) = p$ and $\Pr(e_2) = \binom{a-1}{\lceil kp \rceil - 1} p^{\lceil kp \rceil - 1} (1-p)^{(a-1) - (\lceil kp \rceil - 1)}$, which leads to (6). Next, by Theorem 2, $\widehat{r}(x) = \frac{s(x)}{p}$ is an unbiased estimator of $r(x)$ for any $x \in \widehat{D}$ (i.e., $E(\widehat{r}(x)) = r(x)$). By our construction in Figure 5, $s(\widehat{z}^-_k) = \lceil kp \rceil$. Hence, $\widehat{r}(\widehat{z}^-_k) = \lceil kp \rceil / p \approx k$. This means that $\widehat{z}^-_k$ is always the element in $\widehat{D}$ that has the closest estimated rank value to the $k$th rank in $D$ (which corresponds exactly to $r(z^-_k)$). Since $X = r(\widehat{z}^-_k)$, when $p = 1/(\varepsilon^2 N)$, by Lemma 2, $\Pr[\,|X - k| \geq \varepsilon N\,] \leq e^{-2/\varepsilon}$.

Theorem 3 implies that the boundary values $z_{i,j,k^-}$ and $z_{i,j,k^+}$ for any $j$th block $S_{i,j}$ of $S_i$ can be estimated accurately using a random sample of $S_i$ with sampling probability $p = 1/(\varepsilon^2 |S|)$: we apply the results of Theorem 3 using $z_{i,j-1}$ (or $z_{i,j}$, by searching towards the right) as $z$ and $S_i$ as $D$. Theorem 3 ensures that our estimated boundary values for $S_{i,j}$ will not copy many more (or fewer) records than necessary ($k$ records immediately to the left and right of $z_{i,j-1}$ and $z_{i,j}$, respectively). Note that $z_{i,j-1}$ and $z_{i,j}$ are the two boundary values of $R_{i,j}$, which are themselves estimates (the estimated $\frac{j-1}{n}$ and $\frac{j}{n}$ quantiles of $Z_{R_i}$).

**The algorithm.** The complete algorithm is listed in Algorithm 2, with the two estimation procedures it uses given as Algorithms 3 and 4.

    Algorithm 2: (R, S, k, n, ε, α)
     1  get {v_2, ..., v_α}, where each v_i is a random vector in R^d; v_1 = 0
     2  let R_i = R + v_i and S_i = S + v_i, for i ∈ [1, α]
     3  for i = 1, ..., α do
     4      set p = 1/(ε²|R|), R̂_i = ∅, A = ∅
     5      for each element x ∈ R_i do
     6          sample x into R̂_i with probability p
     7      A = estimator1(R̂_i, ε, n, |R|)
     8      set p = 1/(ε²|S|), Ŝ_i = ∅
     9      for each point s ∈ S_i do
    10          sample s into Ŝ_i with probability p
    11      ẑ_{i,1,k−} = −∞ and ẑ_{i,n,k+} = +∞
    12      for j = 1, ..., n−1 do
    13          (ẑ_{i,j+1,k−}, ẑ_{i,j,k+}) = estimator2(Ŝ_i, k, p, A[j])
    14      for each point r ∈ R_i do
    15          find j ∈ [1, n] such that A[j−1] ≤ z_r < A[j]; insert r into block R_{i,j}
    16      for each point s ∈ S_i do
    17          for j = 1, ..., n do
    18              if ẑ_{i,j,k−} ≤ z_s ≤ ẑ_{i,j,k+} then
    19                  insert s into block S_{i,j}
    20  for any point r ∈ R do
    21      for i = 1, ..., α do
    22          find the block R_{i,j} containing r; find C_i(r) in S_{i,j}
    23          for any s ∈ C_i(r), update s = s − v_i
    24      let C(r) = ∪_{i=1..α} C_i(r); output (r, knn(r, C(r)))

    Algorithm 3: estimator1(R̂, ε, n, N)
     1  p = 1/(ε²N), A = ∅; sort the z-values of R̂ to obtain Z_R̂
     2  for each element x ∈ Z_R̂ do
     3      get x's rank s(x) in Z_R̂
     4      estimate x's rank r(x) in Z_R using r̂(x) = s(x)/p
     5  for i = 1, ..., n−1 do
     6      A[i] = the element x in Z_R̂ with r̂(x) closest to (i/n)·N
     7  A[0] = −∞ and A[n] = +∞; return A

    Algorithm 4: estimator2(Ŝ, k, p, z)
     1  sort the z-values of Ŝ to obtain Z_Ŝ
     2  return the ⌈kp⌉th value to the left and the ⌈kp⌉th value to the right of z in Z_Ŝ
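As an illustration of the block-assignment step (lines 14-19 of Algorithm 2), the following Python sketch assigns each $r$ to exactly one block using the quantile boundaries $A$, while an $s$ may be copied into every block whose extended range covers it. The container layout and names are ours, and the boundaries are assumed to be precomputed.

```python
import bisect

def assign_blocks(R_z, S_z, A, z_lo, z_hi):
    """Assign z-values of R_i and S_i to n blocks.

    A    : estimated quantile boundaries, A[0] = -inf, ..., A[n] = +inf
    z_lo : z_lo[j] = extended lower boundary of block j (index 0 unused)
    z_hi : z_hi[j] = extended upper boundary of block j (index 0 unused)
    """
    n = len(A) - 1
    R_blocks = [[] for _ in range(n + 1)]   # 1-based block ids, index 0 unused
    S_blocks = [[] for _ in range(n + 1)]

    for z_r in R_z:
        j = bisect.bisect_right(A, z_r)     # the j with A[j-1] <= z_r < A[j]
        R_blocks[j].append(z_r)

    for z_s in S_z:
        for j in range(1, n + 1):
            # an S point may fall into the extended range of several blocks
            if z_lo[j] <= z_s <= z_hi[j]:
                S_blocks[j].append(z_s)

    return R_blocks, S_blocks

# Tiny example with n = 3 blocks and illustrative extended boundaries.
if __name__ == "__main__":
    from math import inf
    A = [-inf, 100, 200, inf]
    z_lo = [None, -inf, 95, 193]
    z_hi = [None, 104, 208, inf]
    print(assign_blocks([10, 150, 250], [99, 102, 198, 205], A, z_lo, z_hi))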
## System Issues

We implement the algorithm in three rounds of MapReduce.

**Phase 1:** The first MapReduce phase corresponds to lines 1-3 of Algorithm 2. This phase constructs the shifted copies $R_i$ and $S_i$ of $R$ and $S$, and also determines the partitioning values for $R_i$ and $S_i$. To begin, the master node first generates $\{v_1, v_2, \ldots, v_\alpha\}$, saves them to a file in DFS, and adds the file to the distributed cache, which is communicated to all mappers during their initialization. Each mapper processes a split of $R$ or $S$, one record at a time. For each record $x$ and each vector $v_i$, a mapper first computes $v_i + x$, then computes the z-value $z_{v_i+x}$, and writes an entry $(rid, z_{v_i+x})$ to a single-chunk unreplicated file in DFS, identifiable by the source dataset ($R$ or $S$), the mapper's identifier, and the random-vector identifier $i$. Hadoop optimizes a write to a single-chunk unreplicated DFS file from a slave by writing the file to available local space; therefore, typically the only communication in this operation is a small amount of metadata to the master. While transforming each input record, each mapper also samples records from $R_i$ into $\widehat{R}_i$ as in lines 4-6 of Algorithm 2, and samples records from $S_i$ into $\widehat{S}_i$ as in lines 8-10 of Algorithm 2. Each sampled record $x$ from $\widehat{R}_i$ or $\widehat{S}_i$ is emitted as a $((z_x, i), (z_x, i, src))$ key-value pair, where $z_x$ is a byte array for $x$'s z-value, $i$ is a byte representing the shift for $i \in [1, \alpha]$, and $src$ is a byte indicating the source dataset ($R$ or $S$). We build a customized key comparator which sorts the emitted records of a partition (Hadoop's mapper output partition) in ascending order of $z_x$. We also have a customized partitioner.
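To make the per-record mapper logic concrete, here is a small, self-contained Python sketch (not the paper's Hadoop code): it shifts a point by each random vector, computes a z-value by bit interleaving (our assumed encoding), and emits composite $(z_x, i)$ keys for sampled records.

```python
import random

def z_value(point, bits=32):
    """Morton/z-order code of a tuple of non-negative integers by bit
    interleaving (illustrative; the paper's exact encoding may differ)."""
    code = 0
    d = len(point)
    for b in range(bits):
        for j, coord in enumerate(point):
            code |= ((coord >> b) & 1) << (b * d + j)
    return code

def map_record(rid, point, src, vectors, p_r, p_s):
    """Per-record mapper logic for Phase 1: for every shift vector v_i,
    compute the z-value of the shifted point; sampled records are emitted
    with a composite key (z-value, shift id) for sorting in the shuffle."""
    emitted = []
    for i, v in enumerate(vectors, start=1):
        shifted = tuple(x + w for x, w in zip(point, v))
        zx = z_value(shifted)
        # (rid, zx) would be written here to a per-(src, mapper, i) DFS file.
        p = p_r if src == "R" else p_s
        if random.random() < p:
            emitted.append(((zx, i), (zx, i, src)))   # key = (z-value, shift id)
    return emitted

# Example with alpha = 2 shift vectors in two dimensions.
if __name__ == "__main__":
    vectors = [(0, 0), (7, 3)]
    print(map_record("r42", (5, 9), "R", vectors, p_r=0.01, p_s=0.01))
```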
