CGMgraph/CGMlib: Implementing and Testing CGM Graph Algorithms on PC Clusters
Albert Chan and Frank Dehne
School of Computer Science, Carleton University, Ottawa, Canada

Abstract. In this paper, we present CGMgraph, the first integrated library of parallel graph methods for PC clusters based on CGM algorithms. CGMgraph implements parallel methods for various graph problems. Our implementations of deterministic list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection are, to our knowledge, the first efficient implementations for PC clusters. Our library also includes CGMlib, a library of basic CGM tools such as sorting, prefix sum, one-to-all broadcast, all-to-one gather, h-relation, all-to-all broadcast, array balancing, and CGM partitioning. Both libraries are available for download.

1 Introduction

Parallel graph algorithms have been extensively studied in the literature (see e.g. [15] for a survey). However, most of the published parallel graph algorithms have traditionally been designed for the theoretical PRAM model. Following some previous experimental results for the MASPAR and Cray [14, 6, 18, 17, 11], parallel graph algorithms for models closer to PC clusters, such as BSP [19] and CGM [7, 8], have been presented in [12, 10, 4, 9, 1]. In this paper, we present CGMgraph, the first integrated library of CGM methods for various graph problems including list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection. Our library also includes a library CGMlib of basic CGM tools that are necessary for parallel graph methods as well as many other CGM algorithms: sorting, prefix sum, one-to-all broadcast, all-to-one gather, h-relation, all-to-all broadcast, array balancing, and CGM partitioning. In comparison with [12], CGMgraph implements both a randomized and a deterministic list ranking method.
Our experimental results for randomized list ranking are similar to the ones reported in [12]. Our implementations of deterministic list ranking, Euler tour, connected components, spanning forest, and bipartite graph detection are, to our knowledge, the first efficient implementations for PC clusters. Both libraries are available for download. We demonstrate the performance of our methods on two different cluster architectures: a Gigabit-connected high performance PC cluster and a network of workstations. Our experiments show that our library provides good parallel speedup and scalability on both platforms. The communication overhead is, in most cases, small and does not grow significantly with
an increasing number of processors. This is a very important feature of CGM algorithms which makes them very efficient in practice.

2 Library Overview and Experimental Setup

Figure 1 illustrates the general use of CGMlib and CGMgraph as well as the class hierarchy of the main classes. Note that all classes in CGMgraph, except EulerNode, are independent. Both libraries require an underlying communication library such as MPI or PVM. CGMlib provides a class Comm which interfaces with the underlying communication library. It provides an interface for all communication operations used by CGMlib and CGMgraph and thereby hides the details of the communication library from the user. The performance of our library was evaluated on two parallel platforms: THOG and ULTRA. The THOG cluster consists of p = 64 nodes, each with two Xeon processors. The nodes are of two different generations, with processors at 1.7 or 2.0 GHz, 1.0 or 1.5 GB RAM, and 60 GB disk storage. The nodes are interconnected via a Cisco 6509 switch using Gigabit Ethernet. The operating system is Linux Red Hat 7.1 together with LAM-MPI. The ULTRA platform is a network of workstations consisting of p = 10 Sun Sparc Ultra 10 machines. The processor speed is 440 MHz. Each processor has 256 MB RAM. The nodes are interconnected via 100 Mb switched Fast Ethernet. The operating system is Sun OS 5.7 with LAM-MPI. All times reported in the remainder of this paper are wall clock times in seconds. Each data point in the diagrams represents the average of three experiments (on different random test data of the same size) for CGMgraph and ten experiments for CGMlib. The input data sets for our tests consisted of randomly created test data. For inputs consisting of lists or graphs, we generated random lists or graphs as follows. For random linked lists, we first created an arbitrary linked list and then permuted it over the processors via random permutations. For random graphs, we created a set of nodes and then added random edges.
Unfortunately, different test data sizes had to be chosen for the different platforms because of the smaller memory capacity of ULTRA.

3 CGMlib: Basic Infrastructure and Utilities

3.1 CGM Communication Operations

The basic library, called CGMlib, provides basic functionality for CGM communication. An interface, Comm, defines the basic communication operations such as:

- onetoallbcast(int source, CommObjectList &data): Broadcast the list data from processor number source to all processors.
- alltoonegather(int target, CommObjectList &data): Gather the lists data from all processors to processor number target.
- hrelation(CommObjectList &data, int *ns): Perform an h-relation on the lists data, using the integer array ns to indicate for each processor which list objects are to be sent to which processor.
- alltoallbcast(CommObjectList &data): Every processor broadcasts its list data to all other processors.
- arraybalancing(CommObjectList &data, int expectedn=-1): Shift the list elements between the lists data such that every processor contains the same number of elements.
- partitioncgm(int groupid): Partition the CGM into groups indicated by groupid. All subsequent communication operations, such as the ones listed above, operate within the respective processor's group only.
- unpartitioncgm(): Undo the previous partition operation.

All communication operations in CGMlib send and receive lists of type CommObjectList. A CommObjectList is a list of CommObject elements. The CommObject interface defines the operations which every object that is to be sent or received has to support.

3.2 CGM Utilities

- Parallel prefix sum: calculateprefixsum(CommObjectList &result, CommObjectList &data).
- Parallel sorting: sort(CommObjectList &data), using the deterministic parallel sample sort methods in [5] and [16].
- Request system for exchanging data requests between processors: CGMlib provides methods sendrequests(...) and sendresponses(...) for routing requests from their senders to their destinations and returning the responses to the senders, respectively.
- Other CGM utilities: a class CGMTimers (with six timers measuring computation time, communication time, and total time, both in wall clock time and CPU ticks) and other utilities, including a parallel random number generator.

3.3 Performance Evaluation

Figure 2 shows the performance of our prefix sum implementation, and Figure 3 shows the performance of our parallel sort implementation. For our prefix sum implementation, we observe that all experiments show a close to zero communication time, except for some noise on THOG.
The prefix sum method communicates only very few data items. The total wall clock time and computation time curves in all four diagrams are similar to 1/p. For our parallel sort implementation, we observe a small fixed communication time, essentially independent of p. This is easily explained by the fact that the parallel sort uses a fixed number of h-relation operations, independent of p. Most of the total wall clock time is spent on local computation, which consists mainly of local sorts of n/p data. Therefore, the curves for the local computation and the total parallel wall clock time are similar to 1/p.
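The reason the prefix sum communicates so little can be seen in its three-phase structure, sketched here as a plain sequential Python simulation of p processors (this is illustrative only, not the C++ CGMlib API; the function name is hypothetical):

```python
# Sketch of the CGM prefix-sum pattern: each of p simulated "processors"
# holds one block of n/p items. Only phase 2 would require communication,
# and it exchanges just p partial sums, independent of n.
def cgm_prefix_sum(blocks):
    # Phase 1: purely local prefix sums, one per simulated processor.
    local_scans = []
    for block in blocks:
        scan, acc = [], 0
        for x in block:
            acc += x
            scan.append(acc)
        local_scans.append(scan)
    # Phase 2: the only "communication" -- an exclusive scan of the p block totals.
    offsets, acc = [], 0
    for scan in local_scans:
        offsets.append(acc)
        acc += scan[-1] if scan else 0
    # Phase 3: each processor adds its offset locally.
    return [[v + off for v in scan] for scan, off in zip(local_scans, offsets)]

blocks = [[1, 2], [3, 4], [5, 6]]
print(cgm_prefix_sum(blocks))  # [[1, 3], [6, 10], [15, 21]]
```

Since only p values cross processor boundaries, the communication time is essentially constant while the local work shrinks as n/p, matching the 1/p curves observed above.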
4 CGMgraph: Parallel Graph Algorithms Utilizing the CGM Model

CGMgraph provides a list ranking method rankthelist(ObjList<Node> &nodes,...) which implements a randomized method as well as a deterministic method [15, 9]. The input to the list ranking method is a linear linked list where the pointer is stored as the index of the next node. CGMgraph also provides a method geteulertour(ObjList<Vertex> &r,...) for Euler tour traversal of a forest [9]. The forest is represented by a list of vertices, a list of edges, and a list of roots. The input to the Euler tour method is stored as follows: r is a set of vertices that represents the roots of the trees, v is the input set of vertices, e is the input set of edges, and eulernodes is the output data of the method. We implemented the connected component method described in [9]. The method also immediately provides a spanning forest of the given graph. CGMgraph provides a method findconnectedcomponents(Graph &g, Comm *comm) for connected component computation and a method findspanningforest(Graph &g,...) for the calculation of the spanning forest of a graph. The input to these two methods is a graph represented as a list of vertices and a list of edges. We also implemented the bipartite graph detection algorithm described in [3]. CGMgraph provides a method isbipartitegraph(Graph &g, Comm *comm) for detecting whether a graph is bipartite. As in the case of connected components and spanning forest, the input is a graph represented as a list of vertices and a list of edges.

4.1 Performance Evaluation

In the following, we present the results of our experiments. For each operation, we measured the performance on THOG with n = 1,000,000 and on ULTRA with n = 100,000. Figure 4 shows the performance of the deterministic list ranking algorithm. Again, we observe that for both machines, the communication time is a small, essentially fixed, portion of the total time.
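The core idea behind list ranking can be illustrated with sequential pointer jumping (a plain Python sketch, not the library's rankthelist interface; in the CGM methods of [15, 9], each jumping round is realized as one h-relation across the processors):

```python
def rank_list(succ):
    # succ[i] is the index of node i's successor; -1 marks the list tail.
    n = len(succ)
    rank = [0 if s == -1 else 1 for s in succ]  # distance contributed so far
    nxt = succ[:]
    # Pointer jumping: each round doubles the distance spanned by nxt[i],
    # so O(log n) rounds suffice.
    changed = True
    while changed:
        changed = False
        new_rank, new_nxt = rank[:], nxt[:]
        for i in range(n):
            if nxt[i] != -1:
                new_rank[i] = rank[i] + rank[nxt[i]]
                new_nxt[i] = nxt[nxt[i]]
                changed = True
        rank, nxt = new_rank, new_nxt
    return rank  # rank[i] = number of links from node i to the tail

# List 2 -> 0 -> 1 (tail): succ[2] = 0, succ[0] = 1, succ[1] = -1
print(rank_list([1, -1, 0]))  # [1, 0, 2]
```

The CGM algorithms reduce the naive O(log n) rounds to O(log p) communication rounds (the c log p h-relations discussed below) by first shrinking the list across processors, which is what keeps the measured communication time small and nearly flat in p.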
The deterministic list ranking requires between c log p and 2c log p h-relation operations. With log p in the range [1, 5], we expect between c and 10c h-relation operations. Since the deterministic algorithm is more involved and incurs larger constants, c may be around 10, which would imply a range of [10, 100] for the number of h-relation operations. We measured usually around 20 h-relation operations. This number is fairly stable, independent of p, which shows again that log p has little influence on the measured communication time. The small increases for p = 4, 8, 16 are due to the fact that the number of h-relation operations grows with log p, which gets incremented by 1 when p reaches a power of 2. In summary, since the communication time is only a small fixed value and the computation time is dominating and similar to 1/p, the entire measured wall clock time is similar to 1/p. Figure 5 shows the performance of the Euler tour algorithm on THOG and ULTRA. Our implementation uses the deterministic list ranking method for the Euler tour computation. Not surprisingly, the performance is essentially the same as for deterministic list ranking. Due to the fact that all tree edges need to be duplicated, the data size increases by a factor of three (original plus two copies). This is the reason why we could execute the Euler tour method on THOG for n = 1,000,000 only with p ≥ 10. Figure 6 shows the performance of the connected components algorithm on THOG and ULTRA. Figure 7 shows the performance of the spanning forest algorithm on THOG and ULTRA. The only difference between the two methods is that the spanning forest algorithm has to create the spanning forests after the connected components have been identified. Therefore, the times shown in Figures 6 and 7 are very similar. Again, we observe that for both machines, the communication time is a small, essentially fixed, portion of the total time. The connected component method uses deterministic list ranking. It requires c log p h-relation operations with log p in the range [1, 5]. The communication time observed is fairly stable, independent of p, which shows that the log p factor has little influence on the measured communication time. The entire measured wall clock time is dominated by the computation time and similar to 1/p. Figure 8 shows the performance of the bipartite graph detection algorithm on THOG and ULTRA. The results mirror the fact that the algorithm is essentially a combination of Euler tour traversal and spanning forest computation. The curves are similar to the former, but the amount of communication time is now larger, representing the sum of the two. This effect is particularly strong on ULTRA, which has the weakest network. Here, the log p in the number of communication rounds actually leads to a steadily increasing communication time which, for p = 9, starts to dominate the computation time. However, for THOG and CGM1, the effect is much smaller. For these machines, the communication time is still essentially fixed over the entire range of values of p.
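Semantically, bipartite detection amounts to 2-coloring a spanning forest and checking that every edge joins opposite colors. A sequential Python sketch of this check (illustrative only, with a hypothetical edge-list input; the CGM algorithm of [3] instead obtains the coloring in parallel from the Euler tour and spanning forest primitives described above):

```python
from collections import deque

def is_bipartite(num_vertices, edges):
    # Build an adjacency list from the undirected edge list.
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
        adj[v].append(u)
    color = [-1] * num_vertices
    # 2-color each connected component via BFS over an implicit spanning tree.
    for s in range(num_vertices):
        if color[s] != -1:
            continue
        color[s] = 0
        q = deque([s])
        while q:
            u = q.popleft()
            for v in adj[u]:
                if color[v] == -1:
                    color[v] = 1 - color[u]
                    q.append(v)
                elif color[v] == color[u]:
                    return False  # an edge within one color class: odd cycle
    return True

print(is_bipartite(4, [(0, 1), (1, 2), (2, 3), (3, 0)]))  # True (4-cycle)
print(is_bipartite(3, [(0, 1), (1, 2), (2, 0)]))          # False (triangle)
```

The sequential check runs in O(n + m) time; the parallel gain in CGMgraph comes from computing the forest and the coloring with the CGM operations, at the cost of the combined communication rounds observed in Figure 8.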
The computation time is similar to 1/p and determines the shape of the curves for the entire wall clock time. The computation and communication times become equal for larger p, but only because of the decrease in computation time.

5 Future Work

Both CGMlib and CGMgraph are currently in beta state. Despite extensive work on performance tuning, there are still many possibilities for fine-tuning the code in order to obtain further improved performance. Of course, adding more parallel graph algorithm implementations to CGMgraph is an important task for the near future. Other possible extensions include porting CGMlib and CGMgraph to other communication libraries, e.g. PVM and OpenMP. We also plan to integrate CGMlib and CGMgraph with other libraries, in particular the LEDA library [13].

References

1. P. Bose, A. Chan, F. Dehne, and M. Latzel. Coarse Grained Parallel Maximum Matching in Convex Bipartite Graphs. In 13th International Parallel Processing Symposium (IPPS 99), 1999.
2. E. Caceres, A. Chan, F. Dehne, and G. Prencipe. Coarse Grained Parallel Algorithms for Detecting Convex Bipartite Graphs. In 26th Workshop on Graph-Theoretic Concepts in Computer Science (WG 2000), volume 1928 of Lecture Notes in Computer Science. Springer, 2000.
3. E. Caceres, A. Chan, F. Dehne, and G. Prencipe. Coarse Grained Parallel Algorithms for Detecting Convex Bipartite Graphs. In 26th Workshop on Graph-Theoretic Concepts in Computer Science (WG 2000), volume 1928 of Lecture Notes in Computer Science, pages 83-94. Springer, 2000.
4. E. Caceres, A. Chan, F. Dehne, and S. W. Song. Coarse Grained Parallel Graph Planarity Testing. In International Conference on Parallel and Distributed Processing Techniques and Applications (PDPTA 2000). CSREA Press, 2000.
5. A. Chan and F. Dehne. A Note on Coarse Grained Parallel Integer Sorting. Parallel Processing Letters, 9(4), 1999.
6. S. Dascal and U. Vishkin. Experiments with List Ranking for Explicit Multi-Threaded (XMT) Instruction Parallelism. In 3rd Workshop on Algorithms Engineering (WAE-99), volume 1668 of Lecture Notes in Computer Science, 1999.
7. F. Dehne. Guest Editor's Introduction, Special Issue on Coarse Grained Parallel Algorithms. Algorithmica, 24(3/4), 1999.
8. F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable Parallel Geometric Algorithms for Coarse Grained Multicomputers. In ACM Symposium on Computational Geometry, 1993.
9. F. Dehne, A. Ferreira, E. Caceres, S. W. Song, and A. Roncato. Efficient Parallel Graph Algorithms for Coarse Grained Multicomputers and BSP. Algorithmica, 33(2):183-200, 2002.
10. F. Dehne and S. W. Song. Randomized Parallel List Ranking for Distributed Memory Multiprocessors. In Asian Computer Science Conference (ASIAN 96), volume 1179 of Lecture Notes in Computer Science. Springer, 1996.
11. T. Hsu, V. Ramachandran, and N. Dean. Parallel Implementation of Algorithms for Finding Connected Components in Graphs. In AMS/DIMACS Parallel Implementation Challenge Workshop III, 1994.
12. I. Guerin Lassous, J. Gustedt, and M. Morvan.
Feasibility, Portability, Predictability and Efficiency: Four Ambitious Goals for the Design and Implementation of Parallel Coarse Grained Graph Algorithms. Technical Report RR-3885, INRIA, 2000.
13. LEDA library.
14. M. Reid-Miller. List Ranking and List Scan on the Cray C-90. In ACM Symposium on Parallel Algorithms and Architectures, 1994.
15. J. Reif, editor. Synthesis of Parallel Algorithms. Morgan Kaufmann Publishers, 1993.
16. H. Shi and J. Schaeffer. Parallel Sorting by Regular Sampling. Journal of Parallel and Distributed Computing, 14:361-372, 1992.
17. J. F. Sibeyn. List Ranking on Meshes. Acta Informatica, 35(7), 1998.
18. J. F. Sibeyn, F. Guillaume, and T. Seidel. Practical Parallel List Ranking. Journal of Parallel and Distributed Computing, 56(2):156-180, 1999.
19. L. Valiant. A Bridging Model for Parallel Computation. Communications of the ACM, 33(8), 1990.
Fig. 1. Overview of CGMlib and CGMgraph (class hierarchy of both libraries)
Fig. 2. Performance of our prefix sum implementation on THOG and ULTRA
Fig. 3. Performance of our parallel sort implementation on THOG and ULTRA
Fig. 4. Performance of the deterministic list ranking algorithm on THOG and ULTRA
Fig. 5. Performance of the Euler tour algorithm on THOG and ULTRA
Fig. 6. Performance of the connected components algorithm on THOG and ULTRA
Fig. 7. Performance of the spanning forest algorithm on THOG and ULTRA
Fig. 8. Performance of the bipartite graph detection algorithm on THOG and ULTRA
Performance Evaluations of Graph Database using CUDA and OpenMP Compatible Libraries Shin Morishima 1 and Hiroki Matsutani 1,2,3 1Keio University, 3 14 1 Hiyoshi, Kohoku ku, Yokohama, Japan 2National Institute
A Comparison of Distributed Systems: ChorusOS and Amoeba
A Comparison of Distributed Systems: ChorusOS and Amoeba Angelo Bertolli Prepared for MSIT 610 on October 27, 2004 University of Maryland University College Adelphi, Maryland United States of America Abstract.
Microsoft Windows Apple Mac OS X
Products Snow License Manager Snow Inventory Server, IDP, IDR Client for Windows Client for OSX Client for Linux Client for Unix Oracle Scanner External Data Provider Snow Distribution Date 2014-02-12
Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.
Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance
Improved LS-DYNA Performance on Sun Servers
8 th International LS-DYNA Users Conference Computing / Code Tech (2) Improved LS-DYNA Performance on Sun Servers Youn-Seo Roh, Ph.D. And Henry H. Fong Sun Microsystems, Inc. Abstract Current Sun platforms
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery
A Serial Partitioning Approach to Scaling Graph-Based Knowledge Discovery Runu Rathi, Diane J. Cook, Lawrence B. Holder Department of Computer Science and Engineering The University of Texas at Arlington
Understanding the Benefits of IBM SPSS Statistics Server
IBM SPSS Statistics Server Understanding the Benefits of IBM SPSS Statistics Server Contents: 1 Introduction 2 Performance 101: Understanding the drivers of better performance 3 Why performance is faster
Business white paper. HP Process Automation. Version 7.0. Server performance
Business white paper HP Process Automation Version 7.0 Server performance Table of contents 3 Summary of results 4 Benchmark profile 5 Benchmark environmant 6 Performance metrics 6 Process throughput 6
Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems
Using VMware VMotion with Oracle Database and EMC CLARiiON Storage Systems Applied Technology Abstract By migrating VMware virtual machines from one physical environment to another, VMware VMotion can
High Performance Computing with Linux Clusters
High Performance Computing with Linux Clusters Panduranga Rao MV Mtech, MISTE Lecturer, Dept of IS and Engg., JNNCE, Shimoga, Karnataka INDIA email: [email protected] URL: http://www.raomvp.bravepages.com/
Efficiency Considerations of PERL and Python in Distributed Processing
Efficiency Considerations of PERL and Python in Distributed Processing Roger Eggen (presenter) Computer and Information Sciences University of North Florida Jacksonville, FL 32224 [email protected] 904.620.1326
Virtual Machines. www.viplavkambli.com
1 Virtual Machines A virtual machine (VM) is a "completely isolated guest operating system installation within a normal host operating system". Modern virtual machines are implemented with either software
Cloud Storage. Parallels. Performance Benchmark Results. White Paper. www.parallels.com
Parallels Cloud Storage White Paper Performance Benchmark Results www.parallels.com Table of Contents Executive Summary... 3 Architecture Overview... 3 Key Features... 4 No Special Hardware Requirements...
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS
A GPU COMPUTING PLATFORM (SAGA) AND A CFD CODE ON GPU FOR AEROSPACE APPLICATIONS SUDHAKARAN.G APCF, AERO, VSSC, ISRO 914712564742 [email protected] THOMAS.C.BABU APCF, AERO, VSSC, ISRO 914712565833
Resource Allocation Schemes for Gang Scheduling
Resource Allocation Schemes for Gang Scheduling B. B. Zhou School of Computing and Mathematics Deakin University Geelong, VIC 327, Australia D. Walsh R. P. Brent Department of Computer Science Australian
Distributed File System Performance. Milind Saraph / Rich Sudlow Office of Information Technologies University of Notre Dame
Distributed File System Performance Milind Saraph / Rich Sudlow Office of Information Technologies University of Notre Dame Questions to answer: Why can t you locate an AFS file server in my lab to improve
Analysis of Computer Algorithms. Algorithm. Algorithm, Data Structure, Program
Analysis of Computer Algorithms Hiroaki Kobayashi Input Algorithm Output 12/13/02 Algorithm Theory 1 Algorithm, Data Structure, Program Algorithm Well-defined, a finite step-by-step computational procedure
Virtualised MikroTik
Virtualised MikroTik MikroTik in a Virtualised Hardware Environment Speaker: Tom Smyth CTO Wireless Connect Ltd. Event: MUM Krackow Feb 2008 http://wirelessconnect.eu/ Copyright 2008 1 Objectives Understand
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan
Reducer Load Balancing and Lazy Initialization in Map Reduce Environment S.Mohanapriya, P.Natesan Abstract Big Data is revolutionizing 21st-century with increasingly huge amounts of data to store and be
A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures
11 th International LS-DYNA Users Conference Computing Technology A Study on the Scalability of Hybrid LS-DYNA on Multicore Architectures Yih-Yih Lin Hewlett-Packard Company Abstract In this paper, the
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging
Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays. Red Hat Performance Engineering
Removing Performance Bottlenecks in Databases with Red Hat Enterprise Linux and Violin Memory Flash Storage Arrays Red Hat Performance Engineering Version 1.0 August 2013 1801 Varsity Drive Raleigh NC
Linux clustering. Morris Law, IT Coordinator, Science Faculty, Hong Kong Baptist University
Linux clustering Morris Law, IT Coordinator, Science Faculty, Hong Kong Baptist University PII 4-node clusters started in 1999 PIII 16 node cluster purchased in 2001. Plan for grid For test base HKBU -
Control 2004, University of Bath, UK, September 2004
Control, University of Bath, UK, September ID- IMPACT OF DEPENDENCY AND LOAD BALANCING IN MULTITHREADING REAL-TIME CONTROL ALGORITHMS M A Hossain and M O Tokhi Department of Computing, The University of
Scaling 10Gb/s Clustering at Wire-Speed
Scaling 10Gb/s Clustering at Wire-Speed InfiniBand offers cost-effective wire-speed scaling with deterministic performance Mellanox Technologies Inc. 2900 Stender Way, Santa Clara, CA 95054 Tel: 408-970-3400
msuite5 & mdesign Installation Prerequisites
CommonTime Limited msuite5 & mdesign Installation Prerequisites Administration considerations prior to installing msuite5 and mdesign. 7/7/2011 Version 2.4 Overview... 1 msuite version... 1 SQL credentials...
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce.
An Experimental Approach Towards Big Data for Analyzing Memory Utilization on a Hadoop cluster using HDFS and MapReduce. Amrit Pal Stdt, Dept of Computer Engineering and Application, National Institute
Mobile Cloud Computing for Data-Intensive Applications
Mobile Cloud Computing for Data-Intensive Applications Senior Thesis Final Report Vincent Teo, [email protected] Advisor: Professor Priya Narasimhan, [email protected] Abstract The computational and storage
AirWave 7.7. Server Sizing Guide
AirWave 7.7 Server Sizing Guide Copyright 2013 Aruba Networks, Inc. Aruba Networks trademarks include, Aruba Networks, Aruba Wireless Networks, the registered Aruba the Mobile Edge Company logo, Aruba
CS 575 Parallel Processing
CS 575 Parallel Processing Lecture one: Introduction Wim Bohm Colorado State University Except as otherwise noted, the content of this presentation is licensed under the Creative Commons Attribution 2.5
Performance Monitoring of Parallel Scientific Applications
Performance Monitoring of Parallel Scientific Applications Abstract. David Skinner National Energy Research Scientific Computing Center Lawrence Berkeley National Laboratory This paper introduces an infrastructure
Big Data Systems CS 5965/6965 FALL 2015
Big Data Systems CS 5965/6965 FALL 2015 Today General course overview Expectations from this course Q&A Introduction to Big Data Assignment #1 General Course Information Course Web Page http://www.cs.utah.edu/~hari/teaching/fall2015.html
The Green Index: A Metric for Evaluating System-Wide Energy Efficiency in HPC Systems
202 IEEE 202 26th IEEE International 26th International Parallel Parallel and Distributed and Distributed Processing Processing Symposium Symposium Workshops Workshops & PhD Forum The Green Index: A Metric
159.735. Final Report. Cluster Scheduling. Submitted by: Priti Lohani 04244354
159.735 Final Report Cluster Scheduling Submitted by: Priti Lohani 04244354 1 Table of contents: 159.735... 1 Final Report... 1 Cluster Scheduling... 1 Table of contents:... 2 1. Introduction:... 3 1.1
A Flexible Cluster Infrastructure for Systems Research and Software Development
Award Number: CNS-551555 Title: CRI: Acquisition of an InfiniBand Cluster with SMP Nodes Institution: Florida State University PIs: Xin Yuan, Robert van Engelen, Kartik Gopalan A Flexible Cluster Infrastructure
