# Iterative Computation of Connected Graph Components with MapReduce

Size: px
Start display at page:

## Transcription

1 atenbank Spektrum manuscript No. (will be inserted by the editor) terative omputation of onnected raph omponents with MapReduce Preprint, accepted for publication in atenbank-spektrum ars olb Ziad Sehili rhard Rahm Received: date / ccepted: date bstract The use of the MapReduce framework for iterative graph algorithms is challenging. To achieve high performance it is critical to limit the amount of intermediate results as well as the number of necessary iterations. We address these issues for the important problem of finding connected components in large graphs. We analyze an existing MapReduce algorithm, -MR, and present techniques to improve its performance including a memory-based connection of subgraphs in the map phase. Our evaluation with several large graph datasets shows that the improvements can substantially reduce the amount of generated data by up to a factor of 8.8 and runtime by up to factor of 3.5. eywords MapReduce adoop onnected raph omponents Transitive losure 1 ntroduction Many ig ata applications require the efficient processing of very large graphs, e.g., for social networks or bibliographic datasets. n enterprise applications, there are also numerous interconnected entities such as customers, products, employees and associated business activities like quotations and invoices that can be represented in large graphs for improved analysis [17]. inding connected graph components within such graphs is ars olb Ziad Sehili rhard Rahm nstitut für nformatik Universität eipzig P eipzig ermany -mail: ig. 1 xample graph with two connected components. a fundamental step to cluster related entities or to find new patterns among entities. connected component () of an undirected graph is a maximal subgraph in which any two vertices are interconnected by a path. igure 1 shows a graph consisting of two s. fficiently analyzing large graphs with millions of vertices and requires a parallel processing. Map- Reduce (MR) is a popular framework for parallel data processing in cluster environments providing scalability while largely hiding the complexity of a parallel system. We study the problem of efficiently computing the s of very large graphs with MR. inding connected vertices is an inherently iterative process so that a MRbased implementation results in the repeated execution of an MR program where the output of an iteration serves as input for the next iteration. lthough the MR framework might not be the best choice for iterative graph algorithms, the huge popularity of the freely available pache adoop implementation makes it important to find efficient MR-based implementations. We implemented our approaches as part of our existing MR-based entity resolution framework [14,13]. The computation of the transitive closure of a binary relation, expressed as a graph, is closely related to the problem. or entity resolution, it is a typical post-processing step to compute the transitive closure

2 2 ars olb et al. of matching entity pairs to find additional, indirectly matching entities. or an efficient MR-based computation of the transitive closure it is important to not explicitly determine all matching pairs which would result in an enormous amount of intermediate data to store and exchange between iterations. or instance, the left of igure 1 with nine entities results in a transitive closure consisting of 36 pairs (, ), (, ),..., (, ). nstead, we only need to determine the set or cluster of matching entities, i.e. {,,,,,,,, }, which all individual pairs can be easily derived from, if necessary. etermining these clusters is analogous to determining the s with the minimal number of, e.g. by choosing a /cluster representative, say, and having an edge between and any other element of the, resulting in the eight pairs (, ), (, ),...(, ) for our example. The algorithms we consider determine such star-like connected components. or the efficient use of MapReduce for iterative algorithms, it is of critical importance to keep the number of iterations as small as possible because each iteration implies a significant amount of task scheduling and /O overhead. Second, the amount of generated (intermediate) data should be minimized to reduce the /O overhead per iteration. urthermore, one may cache data needed in different iterations (similar to [4]). To this end, we propose and evaluate several enhancements over previous MR-based algorithms. Our specific contributions are as follows: We review existing approaches to determine the s of an undirected graph (Section 2) and present an efficient MR-based algorithm, -MR, (Section 3) as a basis for comparison. We propose several algorithmic extensions over -MR to reduce the number of iterations, the amount of intermediate data, and, ultimately, execution time (Section 4). major improvement is to connect early during the map phase, so that the remaining work for the reduce phase and further iterations is lowered. We perform a comprehensive evaluation for several large datasets to analyze the proposed techniques in comparison with -MR (Section 5). 2 Related work 2.1 Parallel transitive closure computation n early iterative algorithm, TPO, to compute the transitive closure of a binary relation was proposed in the context of parallel database systems [22]. t relies on a distributed computation of join and union operations. n improved version of TPO proposed in [6] uses a double hashing technique to reduce the amount of data repartitioning in each round. oth approaches are parallel implementations of sequential iterative algorithms [3] which terminate after d iterations where d is the depth of the graph. The Smart algorithm [11] improves these sequential approaches by limiting the number of required iterations to log d + 1. lthough not evaluated, a possible parallel MR implementation of the Smart algorithm was discussed in [1]. The proposed approach translates each iteration of the Smart algorithm into multiple MR jobs that must be executed sequentially. owever, because MR relies on materializing (intermediate) results, the proposed approach is not feasible for large graphs. 2.2 etection of onnected omponents inding the connected components of a graph is a well studied problem. Traditional approaches have a linear runtime complexity and traverse the graph using depth first (or breadth first) search to discover connected components [21]. or the efficient handling of large graphs, parallel algorithms with logarithmic time complexity were proposed [10,20,2] (see [9] for a comparison). Those approaches rely on a shared memory system and are not applicable for the MR programming model which relies on shared nothing clusters. The authors of [5] proposed an algorithm for distributed memory cluster environments in which nodes communicate with each other to access remote memory. owever, the MR framework is designed for independent parallel batch processing of disjoint data partitions. n MR algorithm to detect in graphs was proposed in [7]. The main drawback of this algorithm is that it needs three MapReduce jobs for each iteration. There are further approaches that strive to minimize the number of required iterations [12, 15]. ll approaches are clearly outperformed by the -MR algorithm proposed in [19]. or a graph with depth d, -MR requires d iterations in the worst case but needs in practice only a logarithmic number of iterations according to [19]. This algorithm will be described in more detail in the following section. very similar approach was independently proposed at the same time in [18]. The authors suggest four different algorithms of which the one with the best performance for large graphs corresponds to -MR. Recent distributed graph processing frameworks like oogle Pregel [16] rely on the ulk Synchronous Parallel (SP) paradigm which segments distributed computations into a sequence of supersteps consisting of a

4 Partition by hash(ey.first) mod 2 / roup by ey.first Partition by hash(ey.first) mod 2 / roup by ey.first Partition by hash(ey.first) mod 2 / roup by ey.first 4 ars olb et al. Map 1 : mit (inverted) Reduce 1 : lgorithm 1 Reduce 2 : lgorithm 1 Reduce 3 : lgorithm dges dges -, - -, - -, - -, - - M -, - - M dges dges -, - - M -, - M , dges dges ig. 2 xample dataflow (upper part) of the -MR algorithm for the example graph of igure 1. red indicates a local max state, whereas relevant merge states are indicated by a red M. Newly discovered are highlighted in boldface. The algorithm terminates after three iterations since no new backward (italic) are generated. The map phase of each iteration i > 1 emits each output edge of the previous iteration unchanged and is omitted to save space. The lower part of the figure shows the resulting graph after each iteration. Red vertices indicate the smallest vertices of the components. reen vertices are already assigned correctly whereas blue vertices still need to be reassigned. outputted for each value u adj(v) (see ines 4-6 and of lgorithm 1). or example in the first iteration of igure 2, the (sub)components : [,, ] and : [, ] are in the ocmaxstate (marked with an ). or the merge case, i.e. v > first, all u adj(v)\ {first} in the value list are assigned to vertex first. To this end, the reduce function emits a key-value pair (first, u) for each such vertex u (ine 15). dditionally, reduce outputs the inverse key-value pair (u, f irst) for each such u (ine 16) as well as a final pair (v, first) if v is not the largest vertex in adj(first) (ine 19). The latter two kinds of reduce output pairs represent so-called backward that are temporarily added to the graph. ackward serve as bridges to connect first with further neighbors of u and v (that might even be smaller than f irst) in the following iterations. n the example of igure 2, relevant merge states are marked with an M. The reduce input group : [, ] of the first iteration results in a newly discovered edge (, ) and backward (, ), (, ). roup : [,,,, ] also reaches the merge state and generates (amongst others) the backward edge (, ). n the second iteration, the first reduce task extends the component for vertex by adding the newly generated neighbors of. The second reduce task merges and (for group : [, ]) as well as and. Note, that might be detected multiple times, e.g. the edge (, ) is generated by both reduce tasks in the second iteration. Such duplicates are removed in the next iteration (by the first reduce task of the third iteration in the example) according to ine 11 of lgorithm 1. The output of iteration i serves as input for iteration i + 1. The map phase of the following iterations outputs each input edge unchanged (aside from the construction of composite keys), i.e., no reverse are generated as in the first iteration. The algorithm terminates when no further backward are generated. This can be determined by a driver program which repeatably executes the same MR job by analyzing the job counter values that are collected by the slave nodes and are aggregated by the master node at the end of the job execution. s iteration i > 1 only depends on the output of iteration i 1, the driver program can replace the input directory of iteration i 1 with the output directory of iteration i 1.

11 terative omputation of onnected raph omponents with MapReduce 11 6 Summary The computation of connected components () for large graphs requires a parallel processing, e.g. on the popular adoop platform. We proposed and evaluated three extensions over the previously proposed Map- Reduce-based implementation -MR to reduce both, the amount of intermediate data and the number of required iterations. The best results are achieved by -MR-Mem which connects sub-components in both the reduce and in the map phase of each iteration. The evaluation showed that -MR-Mem significantly outperforms -MR for all considered graph datasets, especially for larger graphs. urthermore, we proposed a strategy to early separate stable components from further processing. This approach introduces nearly no additional overhead but can significantly improve the performance of -MR-Mem for sparse graphs. References 1. frati,.n., orkar, V.R., arey, M.., Polyzotis, N., Ullman,..: Map-Reduce xtensions and Recursive Queries. n: Proc. of ntl. onference on xtending atabase Technology, pp. 1 8 (2011) 2. werbuch,., Shiloach, Y.: New onnectivity and MS lgorithms for Shuffle-xchange Network and PRM. Transactions on omputers 36(10), (1987) 3. ancilhon,., Maier,., Sagiv, Y., Ullman,..: Magic Sets and Other Strange Ways to mplement ogic Programs. n: Proc. of Symposium on Principles of atabase Systems, pp (1986) 4. u, Y., owe,., alazinska, M., rnst, M..: The aoop approach to large-scale iterative data analysis. V ournal 21(2), (2012) 5. us,., Tvrdík, P.: Parallel lgorithm for onnected omponents on istributed Memory Machines. n: Proc. of uropean PVM/MP Users roup Meeting, pp (2001) 6. heiney,.p., de Maindreville,.: Parallel Transitive losure lgorithm Using ash-ased lustering. n: Proc. of ntl. Workshop on atabase Machines, pp (1989) 7. ohen,.: raph Twiddling in a MapReduce World. omputing in Science and ngineering 11(4), (2009) 8. ean,., hemawat, S.: MapReduce: Simplified ata Processing on arge lusters. n: Proc. of Symposium on Operating System esign and mplementation, pp (2004) 9. reiner,.: omparison of Parallel lgorithms for onnected omponents. n: Proc. of Symposium on Parallelism in lgorithms and rchitectures, pp (1994) 10. irschberg,.s., handra,.., Sarwate,.V.: omputing onnected omponents on Parallel omputers. ommunications of the M 22(8), (1979) 11. oannidis, Y..: On the omputation of the Transitive losure of Relational Operators. n: Proc. of ntl. onference on Very arge atabases, pp (1986) 12. ang, U., Tsourakakis,.., aloutsos,.: PSUS: Peta-Scale raph Mining System. n: Proc. of ntl. onference on ata Mining, pp (2009) 13. olb,., Rahm,.: Parallel ntity Resolution with edoop. atenbank-spektrum 13(1), (2013) 14. olb,., Thor,., Rahm,.: edoop: fficient eduplication with adoop. Proceedings of the V ndowment 5(12), (2012) 15. attanzi, S., Moseley,., Suri, S., Vassilvitskii, S.: iltering: Method for Solving raph Problems in Map- Reduce. n: Proc. of Symposium on Parallelism in lgorithms and rchitectures, pp (2011) 16. Malewicz,., ustern, M.., ik,..., ehnert,.., orn,., eiser, N., zajkowski,.: Pregel: System for arge-scale raph Processing. n: Proc. of the ntl. onference on Management of ata, pp (2010) 17. Petermann,., unghanns, M., Mueller, R., Rahm,.: : nabling usiness ntelligence with ntegrated nstance raphs. n: Proc. of ntl. Workshop on raph ata Management (M) (2014) 18. Rastogi, V., Machanavajjhala,., hitnis,., Sarma,..: inding connected components in map-reduce in logarithmic rounds. n: Proc. of ntl. onference on ata ngineering, pp (2013) 19. Seidl, T., oden,., ries, S.: -MR - inding onnected omponents in uge raphs with MapReduce. n: Proc. of Machine earning and nowledge iscovery in atabases, pp (2012) 20. Shiloach, Y., Vishkin, U.: n O(log n) Parallel onnectivity lgorithm.. lgorithms 3(1), (1982) 21. Tarjan, R..: epth-irst Search and inear raph lgorithms. SM ournal on omputing 1(2), (1972) 22. Valduriez, P., hoshafian, S.: Parallel valuation of the Transitive losure of a atabase Relation. ntl. ournal of Parallel Programming 17(1), (1988)

### MapReduce Algorithms. Sergei Vassilvitskii. Saturday, August 25, 12

MapReduce Algorithms A Sense of Scale At web scales... Mail: Billions of messages per day Search: Billions of searches per day Social: Billions of relationships 2 A Sense of Scale At web scales... Mail:

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### http://www.wordle.net/

Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

### MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

MapReduce and Distributed Data Analysis Google Research 1 Dealing With Massive Data 2 2 Dealing With Massive Data Polynomial Memory Sublinear RAM Sketches External Memory Property Testing 3 3 Dealing With

### Developing MapReduce Programs

Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

### Overview on Graph Datastores and Graph Computing Systems. -- Litao Deng (Cloud Computing Group) 06-08-2012

Overview on Graph Datastores and Graph Computing Systems -- Litao Deng (Cloud Computing Group) 06-08-2012 Graph - Everywhere 1: Friendship Graph 2: Food Graph 3: Internet Graph Most of the relationships

### Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

### Infrastructure Components: Hub & Repeater. Network Infrastructure. Switch: Realization. Infrastructure Components: Switch

Network Infrastructure or building computer networks more complex than e.g. a short bus, some additional components are needed. They can be arranged hierarchically regarding their functionality: Repeater

### MapReduce. MapReduce and SQL Injections. CS 3200 Final Lecture. Introduction. MapReduce. Programming Model. Example

MapReduce MapReduce and SQL Injections CS 3200 Final Lecture Jeffrey Dean and Sanjay Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. OSDI'04: Sixth Symposium on Operating System Design

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction

### Learning-based Entity Resolution with MapReduce

Learning-based Entity Resolution with MapReduce Lars Kolb 1 Hanna Köpcke 2 Andreas Thor 1 Erhard Rahm 1,2 1 Database Group, 2 WDI Lab University of Leipzig {kolb,koepcke,thor,rahm}@informatik.uni-leipzig.de

### CSE 326, Data Structures. Sample Final Exam. Problem Max Points Score 1 14 (2x7) 2 18 (3x6) 3 4 4 7 5 9 6 16 7 8 8 4 9 8 10 4 Total 92.

Name: Email ID: CSE 326, Data Structures Section: Sample Final Exam Instructions: The exam is closed book, closed notes. Unless otherwise stated, N denotes the number of elements in the data structure

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

### Chapter 13: Query Processing. Basic Steps in Query Processing

Chapter 13: Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 13.1 Basic Steps in Query Processing 1. Parsing

### Asking Hard Graph Questions. Paul Burkhardt. February 3, 2014

Beyond Watson: Predictive Analytics and Big Data U.S. National Security Agency Research Directorate - R6 Technical Report February 3, 2014 300 years before Watson there was Euler! The first (Jeopardy!)

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### SCAN: A Structural Clustering Algorithm for Networks

SCAN: A Structural Clustering Algorithm for Networks Xiaowei Xu, Nurcan Yuruk, Zhidan Feng (University of Arkansas at Little Rock) Thomas A. J. Schweiger (Acxiom Corporation) Networks scaling: #edges connected

### Graph Algorithms. Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

Graph Algorithms Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text Introduction to Parallel Computing, Addison Wesley, 3. Topic Overview Definitions and Representation Minimum

### Part 2: Community Detection

Chapter 8: Graph Data Part 2: Community Detection Based on Leskovec, Rajaraman, Ullman 2014: Mining of Massive Datasets Big Data Management and Analytics Outline Community Detection - Social networks -

### Graph Processing and Social Networks

Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

### ProTrack: A Simple Provenance-tracking Filesystem

ProTrack: A Simple Provenance-tracking Filesystem Somak Das Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology das@mit.edu Abstract Provenance describes a file

### 12 Abstract Data Types

12 Abstract Data Types 12.1 Source: Foundations of Computer Science Cengage Learning Objectives After studying this chapter, the student should be able to: Define the concept of an abstract data type (ADT).

### From GWS to MapReduce: Google s Cloud Technology in the Early Days

Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

### Introduction to Parallel Programming and MapReduce

Introduction to Parallel Programming and MapReduce Audience and Pre-Requisites This tutorial covers the basics of parallel programming and the MapReduce programming model. The pre-requisites are significant

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

### Load Balancing for Entity Matching over Big Data using Sorted Neighborhood

San Jose State University SJSU ScholarWorks Master's Projects Master's Theses and Graduate Research Fall 2015 Load Balancing for Entity Matching over Big Data using Sorted Neighborhood Yogesh Wattamwar

### Cloud Computing at Google. Architecture

Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

### DATABASE DESIGN - 1DL400

DATABASE DESIGN - 1DL400 Spring 2015 A course on modern database systems!! http://www.it.uu.se/research/group/udbl/kurser/dbii_vt15/ Kjell Orsborn! Uppsala Database Laboratory! Department of Information

### Distributed Computing over Communication Networks: Maximal Independent Set

Distributed Computing over Communication Networks: Maximal Independent Set What is a MIS? MIS An independent set (IS) of an undirected graph is a subset U of nodes such that no two nodes in U are adjacent.

### FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

### InfiniteGraph: The Distributed Graph Database

A Performance and Distributed Performance Benchmark of InfiniteGraph and a Leading Open Source Graph Database Using Synthetic Data Objectivity, Inc. 640 West California Ave. Suite 240 Sunnyvale, CA 94086

### The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge,

The Union-Find Problem Kruskal s algorithm for finding an MST presented us with a problem in data-structure design. As we looked at each edge, cheapest first, we had to determine whether its two endpoints

### Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

### Apache Hama Design Document v0.6

Apache Hama Design Document v0.6 Introduction Hama Architecture BSPMaster GroomServer Zookeeper BSP Task Execution Job Submission Job and Task Scheduling Task Execution Lifecycle Synchronization Fault

### MapReduce and Hadoop. Aaron Birkland Cornell Center for Advanced Computing. January 2012

MapReduce and Hadoop Aaron Birkland Cornell Center for Advanced Computing January 2012 Motivation Simple programming model for Big Data Distributed, parallel but hides this Established success at petabyte

### An Efficient Clustering Algorithm for Market Basket Data Based on Small Large Ratios

An Efficient lustering Algorithm for Market Basket Data Based on Small Large Ratios hing-uang Yun and Kun-Ta huang and Ming-Syan hen Department of Electrical Engineering National Taiwan University Taipei,

### BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

### Systems and Algorithms for Big Data Analytics

Systems and Algorithms for Big Data Analytics YAN, Da Email: yanda@cse.cuhk.edu.hk My Research Graph Data Distributed Graph Processing Spatial Data Spatial Query Processing Uncertain Data Querying & Mining

### MapReduce: Algorithm Design Patterns

Designing Algorithms for MapReduce MapReduce: Algorithm Design Patterns Need to adapt to a restricted model of computation Goals Scalability: adding machines will make the algo run faster Efficiency: resources

### Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs

Approximated Distributed Minimum Vertex Cover Algorithms for Bounded Degree Graphs Yong Zhang 1.2, Francis Y.L. Chin 2, and Hing-Fung Ting 2 1 College of Mathematics and Computer Science, Hebei University,

### Performance Tuning for the Teradata Database

Performance Tuning for the Teradata Database Matthew W Froemsdorf Teradata Partner Engineering and Technical Consulting - i - Document Changes Rev. Date Section Comment 1.0 2010-10-26 All Initial document

### 2.3 Scheduling jobs on identical parallel machines

2.3 Scheduling jobs on identical parallel machines There are jobs to be processed, and there are identical machines (running in parallel) to which each job may be assigned Each job = 1,,, must be processed

### Keywords: Big Data, HDFS, Map Reduce, Hadoop

Volume 5, Issue 7, July 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Configuration Tuning

### Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### The LCA Problem Revisited

The LA Problem Revisited Michael A. Bender Martín Farach-olton SUNY Stony Brook Rutgers University May 16, 2000 Abstract We present a very simple algorithm for the Least ommon Ancestor problem. We thus

### Fast Sequential Summation Algorithms Using Augmented Data Structures

Fast Sequential Summation Algorithms Using Augmented Data Structures Vadim Stadnik vadim.stadnik@gmail.com Abstract This paper provides an introduction to the design of augmented data structures that offer

### Load Balancing for MapReduce-based Entity Resolution

Load Balancing for MapReduce-based Entity Resolution Lars Kolb, Andreas Thor, Erhard Rahm Database Group, University of Leipzig Leipzig, Germany {kolb,thor,rahm}@informatik.uni-leipzig.de Abstract The

### Big Data Mining Services and Knowledge Discovery Applications on Clouds

Big Data Mining Services and Knowledge Discovery Applications on Clouds Domenico Talia DIMES, Università della Calabria & DtoK Lab Italy talia@dimes.unical.it Data Availability or Data Deluge? Some decades

### Spark. Fast, Interactive, Language- Integrated Cluster Computing

Spark Fast, Interactive, Language- Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica UC

### A fast online spanner for roadmap construction

fast online spanner for roadmap construction Weifu Wang evin alkcom mit hakrabarti bstract This paper introduces a fast weighted streaming spanner algorithm (WSS) that trims edges from roadmaps generated

### Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

### Entropy based Graph Clustering: Application to Biological and Social Networks

Entropy based Graph Clustering: Application to Biological and Social Networks Edward C Kenley Young-Rae Cho Department of Computer Science Baylor University Complex Systems Definition Dynamically evolving

### Persistent Binary Search Trees

Persistent Binary Search Trees Datastructures, UvA. May 30, 2008 0440949, Andreas van Cranenburgh Abstract A persistent binary tree allows access to all previous versions of the tree. This paper presents

### MAPREDUCE Programming Model

CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

### Load Balancing for MapReduce-based Entity Resolution

Load Balancing for MapReduce-based Entity Resolution Lars Kolb, Andreas Thor, Erhard Rahm Database Group, University of Leipzig Leipzig, Germany {kolb,thor,rahm}@informatik.uni-leipzig.de arxiv:1108.1631v1

### A Fast Algorithm For Finding Hamilton Cycles

A Fast Algorithm For Finding Hamilton Cycles by Andrew Chalaturnyk A thesis presented to the University of Manitoba in partial fulfillment of the requirements for the degree of Masters of Science in Computer

### Any two nodes which are connected by an edge in a graph are called adjacent node.

. iscuss following. Graph graph G consist of a non empty set V called the set of nodes (points, vertices) of the graph, a set which is the set of edges and a mapping from the set of edges to a set of pairs

### MapReduce Approach to Collective Classification for Networks

MapReduce Approach to Collective Classification for Networks Wojciech Indyk 1, Tomasz Kajdanowicz 1, Przemyslaw Kazienko 1, and Slawomir Plamowski 1 Wroclaw University of Technology, Wroclaw, Poland Faculty

### 1 2-3 Trees: The Basics

CS10: Data Structures and Object-Oriented Design (Fall 2013) November 1, 2013: 2-3 Trees: Inserting and Deleting Scribes: CS 10 Teaching Team Lecture Summary In this class, we investigated 2-3 Trees in

### Duke University http://www.cs.duke.edu/starfish

Herodotos Herodotou, Harold Lim, Fei Dong, Shivnath Babu Duke University http://www.cs.duke.edu/starfish Practitioners of Big Data Analytics Google Yahoo! Facebook ebay Physicists Biologists Economists

### A Heuristic Approach to Service Restoration in MPLS Networks

euristic pproach to Service Restoration in MPS Networks Radim artoš and Mythilikanth Raman epartment of omputer Science, University of New ampshire urham, N 3, US. Phone: (3) -379, ax: (3) -393 -mail:

### Graph Database Proof of Concept Report

Objectivity, Inc. Graph Database Proof of Concept Report Managing The Internet of Things Table of Contents Executive Summary 3 Background 3 Proof of Concept 4 Dataset 4 Process 4 Query Catalog 4 Environment

### Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large quantities of data

### A Review on Load Balancing Algorithms in Cloud

A Review on Load Balancing Algorithms in Cloud Hareesh M J Dept. of CSE, RSET, Kochi hareeshmjoseph@ gmail.com John P Martin Dept. of CSE, RSET, Kochi johnpm12@gmail.com Yedhu Sastri Dept. of IT, RSET,

### PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers

PDQCollections*: A Data-Parallel Programming Model and Library for Associative Containers Maneesh Varshney Vishwa Goudar Christophe Beaumont Jan, 2013 * PDQ could stand for Pretty Darn Quick or Processes

### Graph Mining and Social Network Analysis

Graph Mining and Social Network Analysis Data Mining and Text Mining (UIC 583 @ Politecnico di Milano) References Jiawei Han and Micheline Kamber, "Data Mining: Concepts and Techniques", The Morgan Kaufmann

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### 5. A full binary tree with n leaves contains [A] n nodes. [B] log n 2 nodes. [C] 2n 1 nodes. [D] n 2 nodes.

1. The advantage of.. is that they solve the problem if sequential storage representation. But disadvantage in that is they are sequential lists. [A] Lists [B] Linked Lists [A] Trees [A] Queues 2. The

### Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

### Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

### SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG

SCALABLE GRAPH ANALYTICS WITH GRADOOP AND BIIIG MARTIN JUNGHANNS, ANDRE PETERMANN, ERHARD RAHM www.scads.de RESEARCH ON GRAPH ANALYTICS Graph Analytics on Hadoop (Gradoop) Distributed graph data management

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

### Using In-Memory Computing to Simplify Big Data Analytics

SCALEOUT SOFTWARE Using In-Memory Computing to Simplify Big Data Analytics by Dr. William Bain, ScaleOut Software, Inc. 2012 ScaleOut Software, Inc. 12/27/2012 T he big data revolution is upon us, fed

### Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

### Simple Solution for a Location Service. Naming vs. Locating Entities. Forwarding Pointers (2) Forwarding Pointers (1)

Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address

### MapReduce Jeffrey Dean and Sanjay Ghemawat. Background context

MapReduce Jeffrey Dean and Sanjay Ghemawat Background context BIG DATA!! o Large-scale services generate huge volumes of data: logs, crawls, user databases, web site content, etc. o Very useful to be able

### MapReduce and the New Software Stack

20 Chapter 2 MapReduce and the New Software Stack Modern data-mining applications, often called big-data analysis, require us to manage immense amounts of data quickly. In many of these applications, the

### SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION. Marc-Olivier Briat, Jean-Luc Monnot, Edith M.

SCALABILITY OF CONTEXTUAL GENERALIZATION PROCESSING USING PARTITIONING AND PARALLELIZATION Abstract Marc-Olivier Briat, Jean-Luc Monnot, Edith M. Punt Esri, Redlands, California, USA mbriat@esri.com, jmonnot@esri.com,

### 8. Query Processing. Query Processing & Optimization

ECS-165A WQ 11 136 8. Query Processing Goals: Understand the basic concepts underlying the steps in query processing and optimization and estimating query processing cost; apply query optimization techniques;

### The Sierra Clustered Database Engine, the technology at the heart of

A New Approach: Clustrix Sierra Database Engine The Sierra Clustered Database Engine, the technology at the heart of the Clustrix solution, is a shared-nothing environment that includes the Sierra Parallel

### Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### Distributed Apriori in Hadoop MapReduce Framework

Distributed Apriori in Hadoop MapReduce Framework By Shulei Zhao (sz2352) and Rongxin Du (rd2537) Individual Contribution: Shulei Zhao: Implements centralized Apriori algorithm and input preprocessing

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

### Survey on Load Rebalancing for Distributed File System in Cloud

Survey on Load Rebalancing for Distributed File System in Cloud Prof. Pranalini S. Ketkar Ankita Bhimrao Patkure IT Department, DCOER, PG Scholar, Computer Department DCOER, Pune University Pune university

### Data storage Tree indexes

Data storage Tree indexes Rasmus Pagh February 7 lecture 1 Access paths For many database queries and updates, only a small fraction of the data needs to be accessed. Extreme examples are looking or updating

### International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 ISSN 2278-7763

International Journal of Advancements in Research & Technology, Volume 3, Issue 2, February-2014 10 A Discussion on Testing Hadoop Applications Sevuga Perumal Chidambaram ABSTRACT The purpose of analysing

### Herodotos Herodotou Shivnath Babu. Duke University

Herodotos Herodotou Shivnath Babu Duke University Analysis in the Big Data Era Popular option Hadoop software stack Java / C++ / R / Python Pig Hive Jaql Oozie Elastic MapReduce Hadoop HBase MapReduce

### BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

### Information Processing, Big Data, and the Cloud

Information Processing, Big Data, and the Cloud James Horey Computational Sciences & Engineering Oak Ridge National Laboratory Fall Creek Falls 2010 Information Processing Systems Model Parameters Data-intensive

### Naming vs. Locating Entities

Naming vs. Locating Entities Till now: resources with fixed locations (hierarchical, caching,...) Problem: some entity may change its location frequently Simple solution: record aliases for the new address

### CS 2112 Spring 2014. 0 Instructions. Assignment 3 Data Structures and Web Filtering. 0.1 Grading. 0.2 Partners. 0.3 Restrictions

CS 2112 Spring 2014 Assignment 3 Data Structures and Web Filtering Due: March 4, 2014 11:59 PM Implementing spam blacklists and web filters requires matching candidate domain names and URLs very rapidly

### Simplified External memory Algorithms for Planar DAGs. July 2004

Simplified External Memory Algorithms for Planar DAGs Lars Arge Duke University Laura Toma Bowdoin College July 2004 Graph Problems Graph G = (V, E) with V vertices and E edges DAG: directed acyclic graph

### Load Balancing in MapReduce Based on Scalable Cardinality Estimates

Load Balancing in MapReduce Based on Scalable Cardinality Estimates Benjamin Gufler 1, Nikolaus Augsten #, Angelika Reiser 3, Alfons Kemper 4 Technische Universität München Boltzmannstraße 3, 85748 Garching