CGMgraph/CGMlib: Implementing and Testing CGM Graph Algorithms on PC Clusters

Albert Chan and Frank Dehne
School of Computer Science, Carleton University, Ottawa, Canada
http://www.scs.carleton.ca/~achan and http://www.dehne.net

Abstract. In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PC clusters based on CGM algorithms. CGMgraph implements parallel methods for various graph problems. Our implementations of deterministic list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection are, to our knowledge, the first efficient implementations for PC clusters. Our library also includes CGMlib, a library of basic CGM tools such as sorting, prefix sum, one-to-all broadcast, all-to-one gather, h-relation, all-to-all broadcast, array balancing, and CGM partitioning. Both libraries are available for download at http://cgm.dehne.net.

1 Introduction

Parallel graph algorithms have been extensively studied in the literature (see e.g. [15] for a survey). However, most of the published parallel graph algorithms have traditionally been designed for the theoretical PRAM model. Following some previous experimental results for the MasPar and Cray [14, 6, 18, 17, 11], parallel graph algorithms for models closer to PC clusters, such as the BSP [19] and CGM [7, 8], have been presented in [12, 1-4, 9, 10]. In this paper, we present CGMgraph, the first integrated library of CGM methods for various graph problems including list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection. Our library also includes a library CGMlib of basic CGM tools that are necessary for parallel graph methods as well as many other CGM algorithms: sorting, prefix sum, one-to-all broadcast, all-to-one gather, h-relation, all-to-all broadcast, array balancing, and CGM partitioning. In comparison with [12], CGMgraph implements both a randomized and a deterministic list ranking method.
Our experimental results for randomized list ranking are similar to the ones reported in [12]. Our implementations of deterministic list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection are, to our knowledge, the first efficient implementations for PC clusters. Both libraries are available for download at http://cgm.dehne.net. We demonstrate the performance of our methods on two different cluster architectures: a gigabit-connected high performance PC cluster and a network of workstations. Our experiments show that our library provides good parallel speedup and scalability on both platforms. The communication overhead is, in most cases, small and does not grow significantly with

an increasing number of processors. This is a very important feature of CGM algorithms which makes them very efficient in practice.

2 Library Overview and Experimental Setup

Figure 1 illustrates the general use of CGMlib and CGMgraph as well as the class hierarchy of the main classes. Note that all classes in CGMgraph, except EulerNode, are independent. Both libraries require an underlying communication library such as MPI or PVM. CGMlib provides a class Comm which interfaces with the underlying communication library. It provides an interface for all communication operations used by CGMlib and CGMgraph and thereby hides the details of the communication library from the user.

The performance of our library was evaluated on two parallel platforms: THOG and ULTRA. The THOG cluster consists of p = 64 nodes, each with two Xeon processors. The nodes are of two different generations, with processors at 1.7 or 2.0 GHz, 1.0 or 1.5 GB RAM, and 60 GB disk storage. The nodes are interconnected via a Cisco 6509 switch using Gigabit Ethernet. The operating system is Linux Red Hat 7.1 together with LAM-MPI version 6.5.6. The ULTRA platform is a network of workstations consisting of p = 10 Sun Sparc Ultra 10 workstations. The processor speed is 440 MHz. Each processor has 256 MB RAM. The nodes are interconnected via 100 Mb switched Fast Ethernet. The operating system is Sun OS 5.7 and LAM-MPI version 6.3.2.

All times reported in the remainder of this paper are wall clock times in seconds. Each data point in the diagrams represents the average of three experiments (on different random test data of the same size) for CGMgraph and ten experiments for CGMlib. The input data sets for our tests consisted of randomly created test data. For inputs consisting of lists or graphs, we generated random lists or graphs as follows. For random linked lists, we first created an arbitrary linked list and then permuted it over the processors via random permutations. For random graphs, we created a set of nodes and then added random edges.
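The random linked list construction described above can be sketched sequentially as follows. This is only an illustration of the data shape used in the experiments; the function name and the successor-array representation are our own, not part of CGMlib:

```cpp
#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Build a random linked list of n nodes as a successor array:
// succ[i] is the index of the node that follows node i in the list,
// and the last node points to itself (end-of-list marker).
std::vector<int> makeRandomList(int n, unsigned seed) {
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    std::mt19937 gen(seed);
    std::shuffle(order.begin(), order.end(), gen);  // random list order
    std::vector<int> succ(n);
    for (int i = 0; i + 1 < n; ++i) succ[order[i]] = order[i + 1];
    succ[order[n - 1]] = order[n - 1];              // tail points to itself
    return succ;
}
```

In the parallel setting the entries of this array would then be distributed over the processors, so that a node and its successor typically reside on different processors.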
Unfortunately, different test data sizes had to be chosen for the different platforms because of the smaller memory capacity of ULTRA.

3 CGMlib: Basic Infrastructure and Utilities

3.1 CGM Communication Operations

The basic library, called CGMlib, provides basic functionality for CGM communication. An interface, Comm, defines the basic communication operations such as:

oneToAllBCast(int source, CommObjectList &data): Broadcast the list data from processor number source to all processors.

allToOneGather(int target, CommObjectList &data): Gather the lists data from all processors to processor number target.

hRelation(CommObjectList &data, int *ns): Perform an h-relation on the lists data, using the integer array ns to indicate, for each processor, which list objects are to be sent to which processor.

allToAllBCast(CommObjectList &data): Every processor broadcasts its list data to all other processors.

arrayBalancing(CommObjectList &data, int expectedN = -1): Shift the list elements between the lists data such that every processor contains the same number of elements.

partitionCGM(int groupId): Partition the CGM into groups indicated by groupId. All subsequent communication operations, such as the ones listed above, operate within the respective processor's group only.

unpartitionCGM(): Undo the previous partition operation.

All communication operations in CGMlib send and receive lists of type CommObjectList. A CommObjectList is a list of CommObject elements. The CommObject interface defines the operations which every object that is to be sent or received has to support.

3.2 CGM Utilities

Parallel prefix sum: calculatePrefixSum(CommObjectList &result, CommObjectList &data).

Parallel sorting: sort(CommObjectList &data), using the deterministic parallel sample sort methods in [5] and [16].

Request system for exchanging data requests between processors: CGMlib provides methods sendRequests(...) and sendResponses(...) for routing the requests from their senders to their destinations and returning the responses to the senders, respectively.

Other CGM utilities: a class CGMTimers (with six timers measuring computation time, communication time, and total time, both in wall clock time and CPU ticks) and other utilities, including a parallel random number generator.

3.3 Performance Evaluation

Figure 2 shows the performance of our prefix sum implementation, and Figure 3 shows the performance of our parallel sort implementation. For our prefix sum implementation, we observe that all experiments show a close to zero communication time, except for some noise on THOG.
The prefix sum method communicates only very few data items. The total wall clock time and computation time curves in all four diagrams are similar to 1/p. For our parallel sort implementation, we observe a small fixed communication time, essentially independent of p. This is easily explained by the fact that the parallel sort uses a fixed number of h-relation operations, independent of p. Most of the total wall clock time is spent on local computation, which consists mainly of local sorts of n/p data items. Therefore, the curves for the local computation and the total parallel wall clock time are similar to 1/p.
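The reason the prefix sum communicates so few data items is visible in the structure of the algorithm: only one partial sum per processor crosses the network, after which every processor finishes locally. The following sequential simulation sketches that pattern; the function name and block-per-processor representation are hypothetical, not the CGMlib interface:

```cpp
#include <vector>

// Sequential simulation of the CGM prefix-sum pattern. Each of the p
// "processors" holds one block of the input; only the p block totals
// are exchanged (hence the near-zero communication time observed),
// then each block completes its prefix sums locally with an offset.
std::vector<long> cgmPrefixSum(const std::vector<std::vector<long>>& blocks) {
    const int p = static_cast<int>(blocks.size());
    // Phase 1: local sums -- one value per processor is "communicated".
    std::vector<long> localSum(p, 0);
    for (int i = 0; i < p; ++i)
        for (long x : blocks[i]) localSum[i] += x;
    // Phase 2: exclusive prefix over the p totals gives each block's offset.
    std::vector<long> offset(p, 0);
    for (int i = 1; i < p; ++i) offset[i] = offset[i - 1] + localSum[i - 1];
    // Phase 3: local prefix sums, shifted by the block offset.
    std::vector<long> out;
    for (int i = 0; i < p; ++i) {
        long run = offset[i];
        for (long x : blocks[i]) { run += x; out.push_back(run); }
    }
    return out;  // inclusive prefix sums of the concatenated input
}
```

The local phases cost O(n/p), which explains why the measured curves track 1/p so closely.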

4 CGMgraph: Parallel Graph Algorithms Utilizing the CGM Model

CGMgraph provides a list ranking method rankTheList(ObjList<Node> &nodes, ...) which implements a randomized method as well as a deterministic method [15, 9]. The input to the list ranking method is a linear linked list where the pointer is stored as the index of the next node.

CGMgraph also provides a method getEulerTour(ObjList<Vertex> &r, ...) for Euler tour traversal of a forest [9]. The forest is represented by a list of vertices, a list of edges, and a list of roots. The input to the Euler tour method is stored as follows: r is a set of vertices that represents the roots of the trees, v is the input set of vertices, e is the input set of edges, and eulerNodes is the output data of the method.

We implemented the connected component method described in [9]. The method also immediately provides a spanning forest of the given graph. CGMgraph provides a method findConnectedComponents(Graph &g, Comm *comm) for connected component computation and a method findSpanningForest(Graph &g, ...) for the calculation of the spanning forest of a graph. The input to the above two methods is a graph represented as a list of vertices and a list of edges.

We also implemented the bipartite graph detection algorithm described in [3]. CGMgraph provides a method isBipartiteGraph(Graph &g, Comm *comm) for detecting whether a graph is bipartite. As in the case of connected components and spanning forest, the input is a graph represented as a list of vertices and a list of edges.

4.1 Performance Evaluation

In the following, we present the results of our experiments. For each operation, we measured the performance on THOG with n = 1,000,000 and on ULTRA with n = 100,000. Figure 4 shows the performance of the deterministic list ranking algorithm. Again, we observe that for both machines, the communication time is a small, essentially fixed, portion of the total time.
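The classical building block behind parallel list ranking is pointer jumping: in each round, every node simultaneously adds its successor's rank to its own and jumps over it. The CGM algorithms in the paper restructure this so that only O(log p) communication rounds are needed; the textbook form below is our own sequential illustration, not the library's implementation:

```cpp
#include <vector>

// Sequential sketch of pointer jumping for list ranking.
// succ[i] is the next node in the list; the tail points to itself.
// After O(log n) rounds, rank[i] is the number of links from i to the tail.
std::vector<int> rankByPointerJumping(std::vector<int> succ) {
    const int n = static_cast<int>(succ.size());
    std::vector<int> rank(n, 0);
    for (int i = 0; i < n; ++i)
        if (succ[i] != i) rank[i] = 1;      // one hop unless already the tail
    while (true) {                          // O(log n) rounds
        bool done = true;
        std::vector<int> nsucc(n), nrank(n);
        for (int i = 0; i < n; ++i) {       // all nodes "jump" simultaneously
            nrank[i] = rank[i] + rank[succ[i]];
            nsucc[i] = succ[succ[i]];
            if (nsucc[i] != succ[i]) done = false;
        }
        succ.swap(nsucc);
        rank.swap(nrank);
        if (done) break;
    }
    return rank;
}
```

In a distributed implementation every jump that crosses a processor boundary costs communication, which is exactly why the CGM methods first shrink the list so that few such rounds remain.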
The deterministic list ranking requires between c log p and 2c log p h-relation operations. With log p in the range [1, 5], we expect between c and 10c h-relation operations. Since the deterministic algorithm is more involved and incurs larger constants, c may be around 10, which would imply a range of [10, 100] for the number of h-relation operations. We measured usually around 20 h-relation operations. The number is fairly stable, independent of p, which shows again that log p has little influence on the measured communication time. The small increases for p = 4, 8, 16 are due to the fact that the number of h-relation operations grows with log p, which gets incremented by 1 when p reaches a power of 2. In summary, since the communication time is only a small fixed value and the computation time is dominating and similar to 1/p, the entire measured wall clock time is similar to 1/p.

Figure 5 shows the performance of the Euler tour algorithm on THOG and ULTRA. Our implementation uses the deterministic list ranking method for the Euler tour computation. Not surprisingly, the performance is essentially

the same as for deterministic list ranking. Due to the fact that all tree edges need to be duplicated, the data size increases by a factor of three (original plus two copies). This is the reason why we could execute the Euler tour method on THOG for n = 1,000,000 only with p ≥ 10.

Figure 6 shows the performance of the connected components algorithm on THOG and ULTRA, and Figure 7 shows the performance of the spanning forest algorithm on THOG and ULTRA. The only difference between the two methods is that the spanning forest algorithm has to create the spanning forests after the connected components have been identified. Therefore, the times shown in Figures 6 and 7 are very similar. Again, we observe that for both machines, the communication time is a small, essentially fixed, portion of the total time. The connected component method uses deterministic list ranking. It requires c log p h-relation operations with log p in the range [1, 5]. The communication time observed is fairly stable, independent of p, which shows that the log p factor has little influence on the measured communication time. The entire measured wall clock time is dominated by the computation time and similar to 1/p.

Figure 8 shows the performance of the bipartite graph detection algorithm on THOG and ULTRA. The results mirror the fact that the algorithm is essentially a combination of Euler tour traversal and spanning forest computation. The curves are similar to the former, but the amount of communication time is now larger, representing the sum of the two. This effect is particularly strong on ULTRA, which has the weakest network. Here, the log p factor in the number of communication rounds actually leads to a steadily increasing communication time which, for p = 9, starts to dominate the computation time. However, for THOG and CGM1, the effect is much smaller. For these machines, the communication time is still essentially fixed over the entire range of values of p.
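For intuition, the decision computed by the bipartite detection method is equivalent to the classical 2-coloring check below. This sequential BFS version is only an illustration of what the method decides; the CGM algorithm of [3] obtains the same answer via spanning forest and Euler tour computations in O(log p) communication rounds:

```cpp
#include <queue>
#include <utility>
#include <vector>

// Sequential 2-coloring check: a graph is bipartite iff its vertices can
// be colored with two colors so that no edge joins two equal colors,
// i.e. iff the graph contains no odd cycle.
bool isBipartite(int n, const std::vector<std::pair<int, int>>& edges) {
    std::vector<std::vector<int>> adj(n);
    for (const auto& e : edges) {
        adj[e.first].push_back(e.second);
        adj[e.second].push_back(e.first);
    }
    std::vector<int> color(n, -1);
    for (int s = 0; s < n; ++s) {            // the graph may be disconnected
        if (color[s] != -1) continue;
        color[s] = 0;
        std::queue<int> q;
        q.push(s);
        while (!q.empty()) {
            int u = q.front(); q.pop();
            for (int v : adj[u]) {
                if (color[v] == -1) { color[v] = 1 - color[u]; q.push(v); }
                else if (color[v] == color[u]) return false;  // odd cycle
            }
        }
    }
    return true;
}
```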
The computation time is similar to 1/p and determines the shape of the curves for the entire wall clock time. The computation and communication times become equal for larger p, but only because of the decrease in computation time.

5 Future Work

Both CGMlib and CGMgraph are currently in beta state. Despite extensive work on performance tuning, there are still many possibilities for fine-tuning the code in order to obtain further improved performance. Of course, adding more parallel graph algorithm implementations to CGMgraph is an important task for the near future. Other possible extensions include porting CGMlib and CGMgraph to other communication libraries, e.g. PVM and OpenMP. We also plan to integrate CGMlib and CGMgraph with other libraries, in particular the LEDA library [13].

References

1. P. Bose, A. Chan, F. Dehne, and M. Latzel. Coarse Grained Parallel Maximum Matching in Convex Bipartite Graphs. In 13th International Parallel Processing Symposium (IPPS'99), pages 125-129, 1999.

2. E. Caceres, A. Chan, F. Dehne, and G. Prencipe. Coarse Grained Parallel Algorithms for Detecting Convex Bipartite Graphs. In 26th Workshop on Graph-Theoretic Concepts in Computer Science (WG 2000), volume 1928 of Lecture Notes in Computer Science, pages 83-94. Springer, 2000.
3. E. Caceres, A. Chan, F. Dehne, and G. Prencipe. Coarse Grained Parallel Algorithms for Detecting Convex Bipartite Graphs. In 26th Workshop on Graph-Theoretic Concepts in Computer Science (WG 2000), volume 1928 of Lecture Notes in Computer Science, pages 83-94. Springer, 2000.
4. E. Caceres, A. Chan, F. Dehne, and S. W. Song. Coarse Grained Parallel Graph Planarity Testing. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000). CSREA Press, 2000.
5. A. Chan and F. Dehne. A Note on Coarse Grained Parallel Integer Sorting. Parallel Processing Letters, 9(4):533-538, 1999.
6. S. Dascal and U. Vishkin. Experiments with List Ranking on Explicit Multi-Threaded (XMT) Instruction Parallelism. In 3rd Workshop on Algorithms Engineering (WAE-99), volume 1668 of Lecture Notes in Computer Science, page 43 ff., 1999.
7. F. Dehne. Guest Editor's Introduction, Special Issue on Coarse Grained Parallel Algorithms. Algorithmica, 24(3/4):173-176, 1999.
8. F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable Parallel Geometric Algorithms for Coarse Grained Multicomputers. In ACM Symposium on Computational Geometry, pages 298-307, 1993.
9. F. Dehne, A. Ferreira, E. Caceres, S. W. Song, and A. Roncato. Efficient Parallel Graph Algorithms for Coarse Grained Multicomputers and BSP. Algorithmica, 33(2):183-200, 2002.
10. F. Dehne and S. W. Song. Randomized Parallel List Ranking for Distributed Memory Multiprocessors. In Asian Computer Science Conference (ASIAN'96), volume 1179 of Lecture Notes in Computer Science, pages 1-10. Springer, 1996.
11. T. Hsu, V. Ramachandran, and N. Dean. Parallel Implementation of Algorithms for Finding Connected Components in Graphs.
In AMS/DIMACS Parallel Implementation Challenge Workshop III, 1997.
12. Isabelle Guérin Lassous, Jens Gustedt, and Michel Morvan. Feasibility, Portability, Predictability and Efficiency: Four Ambitious Goals for the Design and Implementation of Parallel Coarse Grained Graph Algorithms. Technical Report RR-3885, INRIA, http://www.inria.fr/rrrt/rr-3885.html.
13. LEDA library. http://www.algorithmic-solutions.com/.
14. Margaret Reid-Miller. List Ranking and List Scan on the Cray C-90. In ACM Symposium on Parallel Algorithms and Architectures, pages 104-113, 1994.
15. J. Reif, editor. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers, 1993.
16. H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14:361-372, 1992.
17. Jop F. Sibeyn. List Ranking on Meshes. Acta Informatica, 35(7):543-566, 1998.
18. Jop F. Sibeyn, Frank Guillaume, and Tillmann Seidel. Practical Parallel List Ranking. Journal of Parallel and Distributed Computing, 56(2):156-180, 1999.
19. L. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8), 1990.

[Fig. 1. Overview of CGMlib and CGMgraph: class hierarchy showing Comm/MPIComm, CommObject (SimpleCommObject, BasicCommObject), CommObjectList, ObjList, Request/Response, Vertex, Edge, Node, EulerNode, Sortor (HeapSortor, IntegerSortor), ListRanker, EulerTourer, ConnectedComponents, and BPGD, with the CGM graph algorithms and other CGM algorithms layered over MPI/PVM and the physical network.]

[Fig. 2. Performance of our prefix sum implementation on THOG and ULTRA.]

[Fig. 3. Performance of our parallel sort implementation on THOG and ULTRA.]

[Fig. 4. Performance of the deterministic list ranking algorithm on THOG and ULTRA.]

[Fig. 5. Performance of the Euler tour algorithm on THOG and ULTRA.]

[Fig. 6. Performance of the connected components algorithm on THOG and ULTRA.]

[Fig. 7. Performance of the spanning forest algorithm on THOG and ULTRA.]

[Fig. 8. Performance of the bipartite graph detection algorithm on THOG and ULTRA.]