# Map-Based Graph Analysis on MapReduce

Size: px
Start display at page:

## Transcription

3 The MapReduce framework also allows developers to specify a function, called the combiner, to improve performance. It is similar to the reducer function but it runs directly on the output of the mapper. The combiner output becomes the input to the reducer. As it is an optimization, there is no guarantee on the number of times it will be called. When there is a large amount of shuffling of data between the map and the reduce phases, combiners can be used to aggregate the partial result at the map side to reduce the network traffic. IV. GRAPH ALGORITHMS A graph is defined as G(V, E), where V is a set of vertices and E is a set of directed edges. A directed edge from v i to v j is represented as the pair of nodes (v i, v j ), where v i V and v j V. Each vertex may have some information associated with it (such as, a node label, a page-rank value, the number of out-links, etc.) and, similarly, each edge may have some information associated with it (such as, edge label and relationship type). The focus of this paper is on iterative graph algorithms on directed graphs where partial results associated with nodes can be improved at each iteration. Such graph algorithms can be formulated as follows: repeat for all v n V do R n F n F n f({f m (v m, v n ) E}) end for until v i V : ρ(r i, F i ) < θ where F n and F m are the partial results at vertex n and m, respectively, f is a function to compute the partial result for each vertex of the graph, ρ is the function to compute the result improvement between iterations, and θ is the threshold determining the stopping condition. The algorithm given above repeats until the termination criterion is met. Page-Rank, a well-known algorithm for computing the importance of vertices in a graph based on its structure, can be captured using the above algorithm. It computes the pagerank P i for every vertex v i V belonging to the graph. P i is the probability of reaching the vertex v i through a random walk in the graph, which is computed based on the topology of the graph. The page-rank computation often includes a random periodic jump to any other vertex in the graph with a probability 1 d, where d is the dumping factor. The pagerank of a vertex v i V of the graph is calculated iteratively as follows: P i = 1 d P j + d (1) V {v m (v j, v m ) E} (v j,v i) E where v i, v j and v m are the vertices of the graph, P j is the page-rank of node v j from the previous iteration, and P i is the new page-rank of the vertex v i. The page-rank equation can be compared to the general iterative algorithm where calculating the page-rank of the vertex v i in a single iteration is the function f and the page-rank calculated for all the vertices can be seen as a partial result which will be used to calculate the page-rank of all the vertices in the next iteration. A. Basic Implementation V. EARLIER WORK Although it can apply to other graph algorithms too, we describe the earlier work on graph analysis based on MapReduce in terms of the page-rank algorithm. A graph in a MapReduce framework is typically represented as a set of directed edges, where each edge is represented as a key-value pair with the source vertex as the key and the destination vertex as the value. Each vertex p contains the identifier of the vertex p.id and its meta-data, which includes its current pagerank value p.pagerank and the number of outgoing edges p.numof OutLinks from the vertex. Algorithm 1 The Mapper for the Basic Implementation of Page-Rank 1: function MAP(Vertex f rom, Vertex to) 2: Emit (from.id, (from, to)) 3: p f rom.pagerank/f rom.numof OutLinks 4: Emit (to.id, p) 5: end function We first describe the basic approach of applying MapReduce to the graph algorithms described in Section IV. The mapper function given in Algorithm 1 applies to each key-value pair, with the source vertix serving as a key. It computes the pagerank contributions from the source vertex to the destination vertex and emits the destination vertex id as the key and its corresponding fraction of page-rank as the value. In addition to the page-rank contributions, the mapper regenerates the graph structure by emitting the source vertex id as the key and the whole edge (a pair) as the value. Algorithm 2 The Reducer for the Basic Implementation of Page-Rank 1: function REDUCE(ID m, List [p 1,..., p n ]) 2: s 0 3: M null 4: N [ ] 5: for all p [p 1,..., p n ] do 6: if IsPair(p) then 7: M p.from 8: insert p.to into N 9: else 10: s s + p 11: end if 12: end for 13: M.pageRank s 14: for all n N do 15: Emit (M, n) 16: end for 17: end function 26

5 Fig. 2. Page-Rank Computation using Parallel Merge Join loss of generality, we describe our graph analysis technique applied to the page-rank computation. Each mapper M i reads the same unchanged partition M i and combines it with the global page-rank table using a mergejoin between M i and the entire global table. Each mapper M i generates new page-rank values only for those nodes that are destinations of the edges in G i. Note that no other mapper has edges that have the same destination as those in G i due to the way the graph G is partitioned. Thus, each mapper M i is responsible for generating its own subset of the new pagerank values P i, which does not overlap with anyone else s. This is done by aggregating the incoming contributions from the current page-ranks stored in the global table. As we can see in Algorithm 4, each mapper uses a dictionary D to store its own partition P i of the new page-rank values. The average size of the dictionary D is V /m, where V is the number of graph nodes and m is the number of mappers. Our assumption is that D can fit in the mapper s memory. To analyze the graph, a merge-join is performed between G i and the global table during each iteration and the new page-rank values are aggregated in D, which contains P i. After G i is processed,d is flushed to DFS and becomes one of the new partitions of the global table. This algorithm is implemented in MapReduce as follows. An initial MapReduce job, which is a pre-processing step, is first called to partition the graph based on the destination vertex of each edge by using a user-defined Partitioner method (which uses uniform hashing on the destination vertex). At the same time, each partition G i is sorted by the source node of the edges by using user-defined Key Comparator and Grouping Comparator methods. Once all the partitions are formed, they are saved on the distributed file system. In addition, each reducer node of this MapReduce job generates one partition P i of the global table that contains the initial page-rank values and saves them in a sequence DFS file. All these files generated by the reducer nodes are saved in the same DFS directory, thus forming the initial global table. The mapper that evaluates each iteration step is shown in Algorithm 4. It evaluates a merge-join between the graph partition G i (the mapper input) and the global page-rank table, since G i is read sorted by the source of the edges and the page-rank table is read sorted by the node. During this join, the mapper aggregates the incoming page-rank contributions for those vertices that are destinations of G i edges. When the source vertex f rom of the input vertex has the same ID as the current vertex n read from the page-rank table, it adds a page-rank contribution from the source vertex f rom to the destination vertex to of the edge. These contributions are aggregated together and saved in the dictionary D. After the entire G i is processed, this dictionary will contain the updated page-ranks of the vertices belonging to that partition. These values are sorted and flushed out to the disk in a sequence file format to form a single partition P i of the new global page-rank table, which is used by next iteration. The mapper does not write anything to the disk other than the new pagerank values. It should be noted that, to decrease the disk write overhead, we set the data replication parameter of the DFS sequence file that contains the global page-rank table to one, so that there is always a single replica of global page-rank table in DFS. Algorithm 4 The Mapper for Map-Based Parallel Merge-Join for Computing Page-Rank 1: function INITIALIZE(()) 2: P.OpenPageRankFile() 3: Dictionary D empty 4: (n, rank) P.Read() 5: end function 6: function MAP(Vertex f rom, Vertex to) 7: if from.id = n then 8: D[to] + = rank/f rom.numof OutLinks 9: end if 10: if from.id n then 11: repeat 12: (n, rank) P.Read() 13: until from.id n 14: end if 15: end function 16: function CLOSE 17: P.WritePageRankFile(D) 18: end function VII. EXPERIMENTATION To evaluate the different approaches to implement the pagerank algorithm, we generate a synthetic graph. These graphs are generated by R-MAT algorithm [18] using the parameters a=0.57, b=0.19 and d=0.5. We used the MRQL system [19] to generate the synthetic large graph of size 7GB, 14GB and 28 GB. They contain 500 million, 1 billion and 2 billion edges respectively and all of them have 15 million vertices. The file generated is represented as a flat list of edges stored in the text format. 28

6 Fig. 3. Evaluation of Various Design Patterns on a Synthetic Graph The experimentations are performed on a cluster consisting of 1 frontend server and 17 worker nodes, connected through a Gigabit Ethernet switch. The cluster is managed by Rocks Cluster 6.3 running CentOS-5 Linux. For our experiments, we used Hadoop 1.0.4, distributed by Apache. The cluster frontend was used exclusively as a NameNode/JobTracker, while the rest 17 worker nodes were used as DataNodes/TaskTrackers. The frontend server and each of the worker nodes contain 3.2 Ghz Intel Xeon quad-core CPUs, 4 GB memory. As each worker machine has 4 cores, there are total of 68 cores available for map/reduce tasks. We evaluated the efficiency of the map-reduce optimizations by computing 5 iterations of the page-rank algorithm. We compared the time taken to perform the page-rank iterations using the basic, schimmy and our map-based implementations. We used the range-partitioning method to partition the graph for each of the methods. The computation time for 5 iterations of the page-rank algorithm for different graphs is given in Figure 3. Before starting the page-rank iterations, the synthetic graph is preprocessed to compute metadata information of the vertices, such as number of outgoing edges from the vertices, and to initialize the initial page-rank for each of the vertices. For the basic and schimmy implementations, there is only one preprocessing step, but for our map-based implementation there are two preprocessing steps: one to compute the metadata information and the other to initialize the first global table with initial page-rank values for every vertex of the graph. Our Map-Based approach improves the performance of graph-analysis over the basic approach as well as the Schimmy approach. As our appraoch separates the immutable graph topology from the graph analysis results, there is no shuffling and sorting phase and hence our approach improves the performance of the graph analysis. There has been an improvement of around 10% over the schimmy based approach. It should be noted that our approach is applicable to a general class of graph algorithms, as discussed in section IV. It should also be noted that the page-rank is computed for each vertex of the graph and hence, it is an exhaustive graph analysis. Graph analysis other than page-rank may be less exhaustive and hence can benefit more from our approach. VIII. CONCLUSION Graph analysis in a distributed frameworks, such as Map- Reduce, is a challenge. There has been approaches for the analysis of graph algorithms, but most of these approaches has a high communication cost because of the shuffling and sorting phases of the map-reduce. Our approach detaches the immutable graph topology from the analysis and as a result we get an improved performance as there is less communication cost. IX. ACKNOWLEDGEMENTS This work is supported in part by the National Science Foundation under the grant

7 REFERENCES [1] J. Dean and S. Ghemawat, Mapreduce: simplified data processing on large clusters, vol. 51, no. 1. New York, NY, USA: ACM, Jan. 2008, pp [Online]. Available: [2] S. Brin and L. Page, The anatomy of a large-scale hypertextual web search engine, in Proceedings of the seventh international conference on World Wide Web 7, ser. WWW7. Amsterdam, The Netherlands, The Netherlands: Elsevier Science Publishers B. V., 1998, pp [Online]. Available: [3] J. Ekanayake, H. Li, B. Zhang, T. Gunarathne, S.-H. Bae, J. Qiu, and G. Fox, Twister: a runtime for iterative mapreduce, in Proceedings of the 19th ACM International Symposium on High Performance Distributed Computing, ser. HPDC 10. New York, NY, USA: ACM, 2010, pp [Online]. Available: [4] Y. Bu, B. Howe, M. Balazinska, and M. D. Ernst, Haloop: efficient iterative data processing on large clusters, vol. 3, no VLDB Endowment, Sep. 2010, pp [Online]. Available: [5] L. G. Valiant, A bridging model for parallel computation, Commun. ACM, vol. 33, no. 8, pp , Aug [Online]. Available: [6] G. Malewicz, M. H. Austern, A. J. Bik, J. C. Dehnert, I. Horn, N. Leiser, and G. Czajkowski, Pregel: a system for large-scale graph processing, in Proceedings of the 2010 ACM SIGMOD International Conference on Management of data, ser. SIGMOD 10. New York, NY, USA: ACM, 2010, pp [Online]. Available: [7] S. Seo, E. J. Yoon, J. Kim, S. Jin, J.-S. Kim, and S. Maeng, Hama: An efficient matrix computation with the mapreduce framework, in Proceedings of the 2010 IEEE Second International Conference on Cloud Computing Technology and Science, ser. CLOUDCOM 10. Washington, DC, USA: IEEE Computer Society, 2010, pp [Online]. Available: [8] Giraph, [9] T. White, Hadoop: The Definitive Guide, 1st ed. O Reilly Media, Inc., [10] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, Interpreting the data: Parallel analysis with sawzall, Sci. Program., vol. 13, no. 4, pp , Oct [Online]. Available: id= [11] M. Isard, M. Budiu, Y. Yu, A. Birrell, and D. Fetterly, Dryad: distributed data-parallel programs from sequential building blocks, in Proceedings of the 2nd ACM SIGOPS/EuroSys European Conference on Computer Systems 2007, ser. EuroSys 07. New York, NY, USA: ACM, 2007, pp [Online]. Available: [12] U. Kang, C. Tsourakakis, A. P. Appel, C. Faloutsos, and J. Leskovec, Hadi: Fast diameter estimation and mining in massive graphs with hadoop, [13] U. Kang, C. E. Tsourakakis, and C. Faloutsos, Pegasus: A peta-scale graph mining system implementation and observations, in Proceedings of the 2009 Ninth IEEE International Conference on Data Mining, ser. ICDM 09. Washington, DC, USA: IEEE Computer Society, 2009, pp [Online]. Available: [14] J. Lin and M. Schatz, Design patterns for efficient graph algorithms in mapreduce, in Proceedings of the Eighth Workshop on Mining and Learning with Graphs, ser. MLG 10. New York, NY, USA: ACM, 2010, pp [Online]. Available: [15] S. S. S. Lattanzi, B. Moseley and S. Vassilvitskii, Filtering: a method for solving graph problems in mapreduce, in SPAA, 2011, pp [16] R. R. U. A. A. P. Bhatotia, A. Wieder and R. Pasquin, Incoop: Mapreduce for incremental computations, in SoCC, 2011, p. 7. [17] D. Peng and F. Dabek, Large-scale incremental processing using distributed transactions and notifications, in OSDI, 2010, pp [18] D. Chakrabarti, Y. Zhan, and C. Faloutsos, R-mat: A recursive model for graph mining, in In SDM, [19] Mrql, 30

### A Study on Workload Imbalance Issues in Data Intensive Distributed Computing

A Study on Workload Imbalance Issues in Data Intensive Distributed Computing Sven Groot 1, Kazuo Goda 1, and Masaru Kitsuregawa 1 University of Tokyo, 4-6-1 Komaba, Meguro-ku, Tokyo 153-8505, Japan Abstract.

### BSPCloud: A Hybrid Programming Library for Cloud Computing *

BSPCloud: A Hybrid Programming Library for Cloud Computing * Xiaodong Liu, Weiqin Tong and Yan Hou Department of Computer Engineering and Science Shanghai University, Shanghai, China liuxiaodongxht@qq.com,

### Software tools for Complex Networks Analysis. Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team

Software tools for Complex Networks Analysis Fabrice Huet, University of Nice Sophia- Antipolis SCALE (ex-oasis) Team MOTIVATION Why do we need tools? Source : nature.com Visualization Properties extraction

### Graph Processing and Social Networks

Graph Processing and Social Networks Presented by Shu Jiayu, Yang Ji Department of Computer Science and Engineering The Hong Kong University of Science and Technology 2015/4/20 1 Outline Background Graph

### Optimization and analysis of large scale data sorting algorithm based on Hadoop

Optimization and analysis of large scale sorting algorithm based on Hadoop Zhuo Wang, Longlong Tian, Dianjie Guo, Xiaoming Jiang Institute of Information Engineering, Chinese Academy of Sciences {wangzhuo,

### Evaluating partitioning of big graphs

Evaluating partitioning of big graphs Fredrik Hallberg, Joakim Candefors, Micke Soderqvist fhallb@kth.se, candef@kth.se, mickeso@kth.se Royal Institute of Technology, Stockholm, Sweden Abstract. Distributed

### Hadoop MapReduce using Cache for Big Data Processing

Hadoop MapReduce using Cache for Big Data Processing Janani.J Dept. of computer science and engineering Arasu Engineering College Kumbakonam Kalaivani.K Dept. of computer science and engineering Arasu

### CSE-E5430 Scalable Cloud Computing Lecture 2

CSE-E5430 Scalable Cloud Computing Lecture 2 Keijo Heljanko Department of Computer Science School of Science Aalto University keijo.heljanko@aalto.fi 14.9-2015 1/36 Google MapReduce A scalable batch processing

### Review on the Cloud Computing Programming Model

, pp.11-16 http://dx.doi.org/10.14257/ijast.2014.70.02 Review on the Cloud Computing Programming Model Chao Shen and Weiqin Tong School of Computer Engineering and Science Shanghai University, Shanghai

### Developing MapReduce Programs

Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2015/16 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

### Big Data Analytics. Lucas Rego Drumond

Big Data Analytics Lucas Rego Drumond Information Systems and Machine Learning Lab (ISMLL) Institute of Computer Science University of Hildesheim, Germany MapReduce II MapReduce II 1 / 33 Outline 1. Introduction

### LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD. Dr. Buğra Gedik, Ph.D.

LARGE-SCALE GRAPH PROCESSING IN THE BIG DATA WORLD Dr. Buğra Gedik, Ph.D. MOTIVATION Graph data is everywhere Relationships between people, systems, and the nature Interactions between people, systems,

### Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce

Storage and Retrieval of Large RDF Graph Using Hadoop and MapReduce Mohammad Farhan Husain, Pankil Doshi, Latifur Khan, and Bhavani Thuraisingham University of Texas at Dallas, Dallas TX 75080, USA Abstract.

### Large-Scale Data Sets Clustering Based on MapReduce and Hadoop

Journal of Computational Information Systems 7: 16 (2011) 5956-5963 Available at http://www.jofcis.com Large-Scale Data Sets Clustering Based on MapReduce and Hadoop Ping ZHOU, Jingsheng LEI, Wenjun YE

### 2 Wikipedia. Data Stream Processing for Large-Scale Bipartite Graph with Wikipedia Edit History

Wikipedia 2 1 1 1 1, 2 2 Wikipedia 4 48 207,329 1,847,166 22,034,825 2 2 3 30 32 Data Stream Processing for Large-Scale Bipartite Graph with Wikipedia Edit History Souhei Takeno, 1 Koji Ueno, 1 Masaru

### http://www.wordle.net/

Hadoop & MapReduce http://www.wordle.net/ http://www.wordle.net/ Hadoop is an open-source software framework (or platform) for Reliable + Scalable + Distributed Storage/Computational unit Failures completely

### Securing Health Care Information by Using Two-Tier Cipher Cloud Technology

Securing Health Care Information by Using Two-Tier Cipher Cloud Technology M.Divya 1 J.Gayathri 2 A.Gladwin 3 1 B.TECH Information Technology, Jeppiaar Engineering College, Chennai 2 B.TECH Information

With Saurabh Singh singh.903@osu.edu The Ohio State University February 11, 2016 Overview 1 2 3 Requirements Ecosystem Resilient Distributed Datasets (RDDs) Example Code vs Mapreduce 4 5 Source: [Tutorials

### Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique

Comparative analysis of mapreduce job by keeping data constant and varying cluster size technique Mahesh Maurya a, Sunita Mahajan b * a Research Scholar, JJT University, MPSTME, Mumbai, India,maheshkmaurya@yahoo.co.in

### Big Data Processing with Google s MapReduce. Alexandru Costan

1 Big Data Processing with Google s MapReduce Alexandru Costan Outline Motivation MapReduce programming model Examples MapReduce system architecture Limitations Extensions 2 Motivation Big Data @Google:

### Distributed computing: index building and use

Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

### IncMR: Incremental Data Processing based on MapReduce

IncMR: Incremental Data Processing based on MapReduce Cairong Yan, Department of Computer Science and Technology Donghua University, Shanghai, China cryan@dhu.edu.cn Xin Yang, Ze Yu, Min Li, Xiaolin Li

### Exploring the Efficiency of Big Data Processing with Hadoop MapReduce

Exploring the Efficiency of Big Data Processing with Hadoop MapReduce Brian Ye, Anders Ye School of Computer Science and Communication (CSC), Royal Institute of Technology KTH, Stockholm, Sweden Abstract.

### Classification On The Clouds Using MapReduce

Classification On The Clouds Using MapReduce Simão Martins Instituto Superior Técnico Lisbon, Portugal simao.martins@tecnico.ulisboa.pt Cláudia Antunes Instituto Superior Técnico Lisbon, Portugal claudia.antunes@tecnico.ulisboa.pt

### Incremental Map Reduce & Keyword-Aware for Mining Evolving Big Data

Incremental Map Reduce & Keyword-Aware for Mining Evolving Big Data Rahul More 1, Sheshdhari Shete 2, SnehalJagtap 3,Prajakta Shete 4 1 Student,Department of Computer Engineering,Genba Sopanrao Moze College

### MAPREDUCE Programming Model

CS 2510 COMPUTER OPERATING SYSTEMS Cloud Computing MAPREDUCE Dr. Taieb Znati Computer Science Department University of Pittsburgh MAPREDUCE Programming Model Scaling Data Intensive Application MapReduce

### FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez To cite this version: Miguel Liroz-Gistau, Reza Akbarinia, Patrick Valduriez. FP-Hadoop:

### Analysis and Modeling of MapReduce s Performance on Hadoop YARN

Analysis and Modeling of MapReduce s Performance on Hadoop YARN Qiuyi Tang Dept. of Mathematics and Computer Science Denison University tang_j3@denison.edu Dr. Thomas C. Bressoud Dept. of Mathematics and

### A Novel Optimized Mapreduce across Datacenter s Using Dache: a Data Aware Caching for BigData Applications

A Novel Optimized Mapreduce across Datacenter s Using Dache: a Data Aware Caching for BigData Applications M.Santhi 1, A.K. Punith Kumar 2 1 M.Tech., Scholar, Dept. of CSE, Siddhartha Educational Academy

### Big Data with Rough Set Using Map- Reduce

Big Data with Rough Set Using Map- Reduce Mr.G.Lenin 1, Mr. A. Raj Ganesh 2, Mr. S. Vanarasan 3 Assistant Professor, Department of CSE, Podhigai College of Engineering & Technology, Tirupattur, Tamilnadu,

### MapReduce for Data Intensive Scientific Analyses

MapReduce for Data Intensive Scientific Analyses Jaliya Ekanayake, Shrideep Pallickara, and Geoffrey Fox Department of Computer Science Indiana University Bloomington, USA {jekanaya, spallick, gcf}@indiana.edu

### Introduction to DISC and Hadoop

Introduction to DISC and Hadoop Alice E. Fischer April 24, 2009 Alice E. Fischer DISC... 1/20 1 2 History Hadoop provides a three-layer paradigm Alice E. Fischer DISC... 2/20 Parallel Computing Past and

### Big Data Begets Big Database Theory

Big Data Begets Big Database Theory Dan Suciu University of Washington 1 Motivation Industry analysts describe Big Data in terms of three V s: volume, velocity, variety. The data is too big to process

### From GWS to MapReduce: Google s Cloud Technology in the Early Days

Large-Scale Distributed Systems From GWS to MapReduce: Google s Cloud Technology in the Early Days Part II: MapReduce in a Datacenter COMP6511A Spring 2014 HKUST Lin Gu lingu@ieee.org MapReduce/Hadoop

### Lecture 5: GFS & HDFS! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl

Big Data Processing, 2014/15 Lecture 5: GFS & HDFS!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm Looking behind

### Evaluating HDFS I/O Performance on Virtualized Systems

Evaluating HDFS I/O Performance on Virtualized Systems Xin Tang xtang@cs.wisc.edu University of Wisconsin-Madison Department of Computer Sciences Abstract Hadoop as a Service (HaaS) has received increasing

### Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of

### GridBatch: Cloud Computing for Large-Scale Data-Intensive Batch Applications

GridBatch: Cloud Computing for Large-Scale Data-Intensive Batch Applications Huan Liu, Dan Orban Accenture Technology Labs {huan.liu, dan.orban}@accenture.com Abstract To be competitive, Enterprises are

### MapReduce (in the cloud)

MapReduce (in the cloud) How to painlessly process terabytes of data by Irina Gordei MapReduce Presentation Outline What is MapReduce? Example How it works MapReduce in the cloud Conclusion Demo Motivation:

### This exam contains 13 pages (including this cover page) and 18 questions. Check to see if any pages are missing.

Big Data Processing 2013-2014 Q2 April 7, 2014 (Resit) Lecturer: Claudia Hauff Time Limit: 180 Minutes Name: Answer the questions in the spaces provided on this exam. If you run out of room for an answer,

### Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis

Apache MRQL (incubating): Advanced Query Processing for Complex, Large-Scale Data Analysis Leonidas Fegaras University of Texas at Arlington http://mrql.incubator.apache.org/ 04/12/2015 Outline Who am

### Distributed Execution Engines

Distributed Execution Engines Abdul Saboor 1 Introduction Big Data Analytics - DIMA, Technical University Berlin, Einsteinufer17, 10587 Berlin, Germany. abdul.saboor@tu-berlin.de A distributed execution

### Resource Scalability for Efficient Parallel Processing in Cloud

Resource Scalability for Efficient Parallel Processing in Cloud ABSTRACT Govinda.K #1, Abirami.M #2, Divya Mercy Silva.J #3 #1 SCSE, VIT University #2 SITE, VIT University #3 SITE, VIT University In the

### Advances in Natural and Applied Sciences

AENSI Journals Advances in Natural and Applied Sciences ISSN:1995-0772 EISSN: 1998-1090 Journal home page: www.aensiweb.com/anas Clustering Algorithm Based On Hadoop for Big Data 1 Jayalatchumy D. and

### Lifetime Management of Cache Memory using Hadoop Snehal Deshmukh 1 Computer, PGMCOE, Wagholi, Pune, India

Volume 3, Issue 1, January 2015 International Journal of Advance Research in Computer Science and Management Studies Research Article / Survey Paper / Case Study Available online at: www.ijarcsms.com ISSN:

### Big Data Technology Map-Reduce Motivation: Indexing in Search Engines

Big Data Technology Map-Reduce Motivation: Indexing in Search Engines Edward Bortnikov & Ronny Lempel Yahoo Labs, Haifa Indexing in Search Engines Information Retrieval s two main stages: Indexing process

### Big Data and Apache Hadoop s MapReduce

Big Data and Apache Hadoop s MapReduce Michael Hahsler Computer Science and Engineering Southern Methodist University January 23, 2012 Michael Hahsler (SMU/CSE) Hadoop/MapReduce January 23, 2012 1 / 23

### Systems Engineering II. Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de

Systems Engineering II Pramod Bhatotia TU Dresden pramod.bhatotia@tu- dresden.de About me! Since May 2015 2015 2012 Research Group Leader cfaed, TU Dresden PhD Student MPI- SWS Research Intern Microsoft

### Research on Clustering Analysis of Big Data Yuan Yuanming 1, 2, a, Wu Chanle 1, 2

Advanced Engineering Forum Vols. 6-7 (2012) pp 82-87 Online: 2012-09-26 (2012) Trans Tech Publications, Switzerland doi:10.4028/www.scientific.net/aef.6-7.82 Research on Clustering Analysis of Big Data

### Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications

Enhancing Dataset Processing in Hadoop YARN Performance for Big Data Applications Ahmed Abdulhakim Al-Absi, Dae-Ki Kang and Myong-Jong Kim Abstract In Hadoop MapReduce distributed file system, as the input

### Efficient Query Evaluation on Distributed Graphs with Hadoop Environment

Efficient Query Evaluation on Distributed Graphs with Hadoop Environment Le-Duc Tung The Graduate University for Advanced Studies Japan tung@nii.ac.jp Quyet Nguyen-Van Hung Yen University of Technology

### Index Terms : Load rebalance, distributed file systems, clouds, movement cost, load imbalance, chunk.

Load Rebalancing for Distributed File Systems in Clouds. Smita Salunkhe, S. S. Sannakki Department of Computer Science and Engineering KLS Gogte Institute of Technology, Belgaum, Karnataka, India Affiliated

### International Journal of Engineering Research ISSN: 2348-4039 & Management Technology November-2015 Volume 2, Issue-6

International Journal of Engineering Research ISSN: 2348-4039 & Management Technology Email: editor@ijermt.org November-2015 Volume 2, Issue-6 www.ijermt.org Modeling Big Data Characteristics for Discovering

### Big Data and Scripting map/reduce in Hadoop

Big Data and Scripting map/reduce in Hadoop 1, 2, parts of a Hadoop map/reduce implementation core framework provides customization via indivudual map and reduce functions e.g. implementation in mongodb

### What is Analytic Infrastructure and Why Should You Care?

What is Analytic Infrastructure and Why Should You Care? Robert L Grossman University of Illinois at Chicago and Open Data Group grossman@uic.edu ABSTRACT We define analytic infrastructure to be the services,

### BIG DATA Giraph. Felipe Caicedo December-2012. Cloud Computing & Big Data. FIB-UPC Master MEI

BIG DATA Giraph Cloud Computing & Big Data Felipe Caicedo December-2012 FIB-UPC Master MEI Content What is Apache Giraph? Motivation Existing solutions Features How it works Components and responsibilities

### Apache Flink Next-gen data analysis. Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas

Apache Flink Next-gen data analysis Kostas Tzoumas ktzoumas@apache.org @kostas_tzoumas What is Flink Project undergoing incubation in the Apache Software Foundation Originating from the Stratosphere research

### Fault Tolerance in Hadoop for Work Migration

1 Fault Tolerance in Hadoop for Work Migration Shivaraman Janakiraman Indiana University Bloomington ABSTRACT Hadoop is a framework that runs applications on large clusters which are built on numerous

### SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE SYSTEM IN CLOUD

International Journal of Advances in Applied Science and Engineering (IJAEAS) ISSN (P): 2348-1811; ISSN (E): 2348-182X Vol-1, Iss.-3, JUNE 2014, 54-58 IIST SOLVING LOAD REBALANCING FOR DISTRIBUTED FILE

### Cloud Computing at Google. Architecture

Cloud Computing at Google Google File System Web Systems and Algorithms Google Chris Brooks Department of Computer Science University of San Francisco Google has developed a layered system to handle webscale

### Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques

Comparision of k-means and k-medoids Clustering Algorithms for Big Data Using MapReduce Techniques Subhashree K 1, Prakash P S 2 1 Student, Kongu Engineering College, Perundurai, Erode 2 Assistant Professor,

### Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce

2012 Third International Conference on Networking and Computing Dynamic processing slots scheduling for I/O intensive jobs of Hadoop MapReduce Shiori KURAZUMI, Tomoaki TSUMURA, Shoichi SAITO and Hiroshi

### Large data computing using Clustering algorithms based on Hadoop

Large data computing using Clustering algorithms based on Hadoop Samrudhi Tabhane, Prof. R.A.Fadnavis Dept. Of Information Technology,YCCE, email id: sstabhane@gmail.com Abstract The Hadoop Distributed

### Data-Intensive Computing with Map-Reduce and Hadoop

Data-Intensive Computing with Map-Reduce and Hadoop Shamil Humbetov Department of Computer Engineering Qafqaz University Baku, Azerbaijan humbetov@gmail.com Abstract Every day, we create 2.5 quintillion

### System G Data Store: Big, Rich Graph Data Analytics in the Cloud

System G Data Store: Big, Rich Graph Data Analytics in the Cloud Mustafa Canim and Yuan-Chi Chang IBM Thomas J. Watson Research Center P. O. Box 218, Yorktown Heights, New York, U.S.A. {mustafa, yuanchi}@us.ibm.com

### RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE

RECOMMENDATION SYSTEM USING BLOOM FILTER IN MAPREDUCE Reena Pagare and Anita Shinde Department of Computer Engineering, Pune University M. I. T. College Of Engineering Pune India ABSTRACT Many clients

### Implementing Graph Pattern Mining for Big Data in the Cloud

Implementing Graph Pattern Mining for Big Data in the Cloud Chandana Ojah M.Tech in Computer Science & Engineering Department of Computer Science & Engineering, PES College of Engineering, Mandya Ojah.chandana@gmail.com

### Jeffrey D. Ullman slides. MapReduce for data intensive computing

Jeffrey D. Ullman slides MapReduce for data intensive computing Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very

### Graph Mining on Big Data System. Presented by Hefu Chai, Rui Zhang, Jian Fang

Graph Mining on Big Data System Presented by Hefu Chai, Rui Zhang, Jian Fang Outline * Overview * Approaches & Environment * Results * Observations * Notes * Conclusion Overview * What we have done? *

### The Performance Characteristics of MapReduce Applications on Scalable Clusters

The Performance Characteristics of MapReduce Applications on Scalable Clusters Kenneth Wottrich Denison University Granville, OH 43023 wottri_k1@denison.edu ABSTRACT Many cluster owners and operators have

### CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop)

CSE 590: Special Topics Course ( Supercomputing ) Lecture 10 ( MapReduce& Hadoop) Rezaul A. Chowdhury Department of Computer Science SUNY Stony Brook Spring 2016 MapReduce MapReduce is a programming model

### BigData. An Overview of Several Approaches. David Mera 16/12/2013. Masaryk University Brno, Czech Republic

BigData An Overview of Several Approaches David Mera Masaryk University Brno, Czech Republic 16/12/2013 Table of Contents 1 Introduction 2 Terminology 3 Approaches focused on batch data processing MapReduce-Hadoop

### A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs In a Workflow Application

2012 International Conference on Information and Computer Applications (ICICA 2012) IPCSIT vol. 24 (2012) (2012) IACSIT Press, Singapore A Locality Enhanced Scheduling Method for Multiple MapReduce Jobs

### Parallel Databases. Parallel Architectures. Parallelism Terminology 1/4/2015. Increase performance by performing operations in parallel

Parallel Databases Increase performance by performing operations in parallel Parallel Architectures Shared memory Shared disk Shared nothing closely coupled loosely coupled Parallelism Terminology Speedup:

### Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture

Analysis and Optimization of Massive Data Processing on High Performance Computing Architecture He Huang, Shanshan Li, Xiaodong Yi, Feng Zhang, Xiangke Liao and Pan Dong School of Computer Science National

### Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds

ABSTRACT Scheduling Data Intensive Workloads through Virtualization on MapReduce based Clouds 1 B.Thirumala Rao, 2 L.S.S.Reddy Department of Computer Science and Engineering, Lakireddy Bali Reddy College

### Log Mining Based on Hadoop s Map and Reduce Technique

Log Mining Based on Hadoop s Map and Reduce Technique ABSTRACT: Anuja Pandit Department of Computer Science, anujapandit25@gmail.com Amruta Deshpande Department of Computer Science, amrutadeshpande1991@gmail.com

### Comparison of Different Implementation of Inverted Indexes in Hadoop

Comparison of Different Implementation of Inverted Indexes in Hadoop Hediyeh Baban, S. Kami Makki, and Stefan Andrei Department of Computer Science Lamar University Beaumont, Texas (hbaban, kami.makki,

### MapReduce Design of K-Means Clustering Algorithm

MapReduce Design of K-Means Clustering Algorithm Prajesh P Anchalia Department of CSE, Email: prajeshanchalia@yahoo.in Anjan K Koundinya Department of CSE, Email: annjank@gmail.com Srinath N K Department

### Big Graph Processing: Some Background

Big Graph Processing: Some Background Bo Wu Colorado School of Mines Part of slides from: Paul Burkhardt (National Security Agency) and Carlos Guestrin (Washington University) Mines CSCI-580, Bo Wu Graphs

### 16.1 MAPREDUCE. For personal use only, not for distribution. 333

For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

### CloudClustering: Toward an iterative data processing pattern on the cloud

CloudClustering: Toward an iterative data processing pattern on the cloud Ankur Dave University of California, Berkeley Berkeley, California, USA ankurd@eecs.berkeley.edu Wei Lu, Jared Jackson, Roger Barga

### An Approach to Implement Map Reduce with NoSQL Databases

www.ijecs.in International Journal Of Engineering And Computer Science ISSN: 2319-7242 Volume 4 Issue 8 Aug 2015, Page No. 13635-13639 An Approach to Implement Map Reduce with NoSQL Databases Ashutosh

### CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES

CLOUDDMSS: CLOUD-BASED DISTRIBUTED MULTIMEDIA STREAMING SERVICE SYSTEM FOR HETEROGENEOUS DEVICES 1 MYOUNGJIN KIM, 2 CUI YUN, 3 SEUNGHO HAN, 4 HANKU LEE 1,2,3,4 Department of Internet & Multimedia Engineering,

### Programming Abstractions and Security for Cloud Computing Systems

Programming Abstractions and Security for Cloud Computing Systems By Sapna Bedi A MASTERS S ESSAY SUBMITTED IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF MASTER OF SCIENCE in The Faculty

### Scalable Cloud Computing Solutions for Next Generation Sequencing Data

Scalable Cloud Computing Solutions for Next Generation Sequencing Data Matti Niemenmaa 1, Aleksi Kallio 2, André Schumacher 1, Petri Klemelä 2, Eija Korpelainen 2, and Keijo Heljanko 1 1 Department of

Introduction to Hadoop 1 What is Hadoop? the big data revolution extracting value from data cloud computing 2 Understanding MapReduce the word count problem more examples MCS 572 Lecture 24 Introduction

### Performance Model for Parallel Matrix Multiplication with Dryad: Dataflow Graph Runtime

Performance Model for Parallel Matrix Multiplication with Dryad: Dataflow Graph Runtime Hui Li, Geoffrey Fox, Judy Qiu School of Informatics and Computing, Pervasive Technology Institute Indiana University

### Report: Declarative Machine Learning on MapReduce (SystemML)

Report: Declarative Machine Learning on MapReduce (SystemML) Jessica Falk ETH-ID 11-947-512 May 28, 2014 1 Introduction SystemML is a system used to execute machine learning (ML) algorithms in HaDoop,

### Mining Large Datasets: Case of Mining Graph Data in the Cloud

Mining Large Datasets: Case of Mining Graph Data in the Cloud Sabeur Aridhi PhD in Computer Science with Laurent d Orazio, Mondher Maddouri and Engelbert Mephu Nguifo 16/05/2014 Sabeur Aridhi Mining Large

### Mobile Storage and Search Engine of Information Oriented to Food Cloud

Advance Journal of Food Science and Technology 5(10): 1331-1336, 2013 ISSN: 2042-4868; e-issn: 2042-4876 Maxwell Scientific Organization, 2013 Submitted: May 29, 2013 Accepted: July 04, 2013 Published:

### MapReduce With Columnar Storage

SEMINAR: COLUMNAR DATABASES 1 MapReduce With Columnar Storage Peitsa Lähteenmäki Abstract The MapReduce programming paradigm has achieved more popularity over the last few years as an option to distributed

### MRGIS: A MapReduce-Enabled High Performance Workflow System for GIS

MRGIS: A MapReduce-Enabled High Performance Workflow System for GIS Qichang Chen, Liqiang Wang Department of Computer Science University of Wyoming {qchen2, wang}@cs.uwyo.edu Zongbo Shang WyGISC and Department

### A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems

A Hybrid Scheduling Approach for Scalable Heterogeneous Hadoop Systems Aysan Rasooli Department of Computing and Software McMaster University Hamilton, Canada Email: rasooa@mcmaster.ca Douglas G. Down

### Chapter 7. Using Hadoop Cluster and MapReduce

Chapter 7 Using Hadoop Cluster and MapReduce Modeling and Prototyping of RMS for QoS Oriented Grid Page 152 7. Using Hadoop Cluster and MapReduce for Big Data Problems The size of the databases used in

DryadLINQ for Scientific Analyses Jaliya Ekanayake a, Thilina Gunarathne a, Geoffrey Fox a,b a School of Informatics and Computing, b Pervasive Technology Institute Indiana University Bloomington {jekanaya,

### Analysis of MapReduce Algorithms

Analysis of MapReduce Algorithms Harini Padmanaban Computer Science Department San Jose State University San Jose, CA 95192 408-924-1000 harini.gomadam@gmail.com ABSTRACT MapReduce is a programming model

### Big Data and Scripting Systems beyond Hadoop

Big Data and Scripting Systems beyond Hadoop 1, 2, ZooKeeper distributed coordination service many problems are shared among distributed systems ZooKeeper provides an implementation that solves these avoid