UTS: An Unbalanced Tree Search Benchmark
1 UTS: An Unbalanced Tree Search Benchmark LCPC
2 Coauthors Stephen Olivier, UNC; Jun Huan, UNC/Kansas; Jinze Liu, UNC; Jan Prins, UNC; James Dinan, OSU; P. Sadayappan, OSU; Chau-Wen Tseng, UMD. Also, thanks to the Ohio Supercomputer Center, the European Center for Parallelism of Barcelona, and William Pugh.
3-4 Motivating the Problem Suppose we have a large search space and a tree-structured search. How could we do this in parallel? Now suppose the tree is highly unbalanced.
5 Meet UTS An unbalanced computation requiring continuous dynamic load balancing for high performance. Exposes trade-offs between communication costs and granularity of load balance. Fundamentally disadvantages systems with coarse-grained communication networks. Challenges the expressibility and performance portability of parallel programming languages.
6 Benchmark Characteristics Implicit generation: tree nodes are generated on the fly, mimicking systematic search; well-suited for parallel exploration. Imbalance: large variation in subtree sizes. Parameterization: can create trees of different depths and sizes.
7 Outline Tree Generation Implementation Performance Analysis Conclusions and Future Work 7
8 Generation Strategy Each node consists of a 20-byte descriptor. Implicit child creation uses a cryptographic hash: (Parent Descriptor, Child Index) → Hash → Child Node Descriptor. The number of children is determined by sampling a probability distribution using the descriptor.
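The hashing scheme above can be sketched as follows. This is a minimal illustration, not the benchmark's actual code: SHA-1 is assumed here only because its digest is exactly 20 bytes, matching the descriptor size, and the uniform-sampling details and root seed are simplified placeholders.

```python
import hashlib
import struct

def child_descriptor(parent_descriptor: bytes, child_index: int) -> bytes:
    """Derive a child's 20-byte descriptor from (parent descriptor, child index).
    SHA-1 is assumed because its digest is exactly 20 bytes."""
    return hashlib.sha1(parent_descriptor + struct.pack(">i", child_index)).digest()

def num_children_binomial(descriptor: bytes, q: float, m: int) -> int:
    """Binomial-tree variant: m children with probability q, otherwise 0.
    The first 4 descriptor bytes are treated as a uniform draw in [0, 1)."""
    u = struct.unpack(">I", descriptor[:4])[0] / 2**32
    return m if u < q else 0

# Hypothetical root seed; varying it generates a different tree.
root = hashlib.sha1(b"seed:42").digest()
kids = [child_descriptor(root, i) for i in range(4)]
assert all(len(d) == 20 for d in kids)
```

Because the descriptor fully determines each subtree, any thread that obtains a node's descriptor can regenerate that subtree deterministically, which is what makes parallel exploration possible without communicating tree data.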
9-11 Generating Binomial Trees In a binomial tree, each node has either 0 or m children: a node has m children with probability q and 0 children with probability 1 - q. When qm < 1, the expected tree size is 1/(1 - qm). As qm approaches 1, subtree size variation increases (Imbalance!). The expected size of the subtree rooted at each node is independent of its location in the tree, but how large each subtree actually is cannot be known until it has been explored.
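As a quick sanity check on the expected-size formula, the branching process can be simulated directly. This is an illustrative sketch, not part of the benchmark; the parameters q = 0.1, m = 4 are chosen arbitrarily so that qm = 0.4 and the expected size is 1/(1 - 0.4) ≈ 1.67.

```python
import random

def binomial_tree_size(q: float, m: int, rng: random.Random, cap: int = 10**6) -> int:
    """Generate one binomial tree (each node: m children with probability q,
    else none) and return its node count; `cap` guards against runaway trees."""
    size, frontier = 0, 1  # one node (the root) waiting to be expanded
    while frontier and size < cap:
        frontier -= 1
        size += 1
        if rng.random() < q:
            frontier += m
    return size

rng = random.Random(1)
sizes = [binomial_tree_size(0.1, 4, rng) for _ in range(20000)]
mean = sum(sizes) / len(sizes)  # should be near 1 / (1 - 0.4) = 1.67
```

The spread of `sizes` also illustrates the imbalance: most trees are a single node, while a few grow much larger, and the variation widens as qm approaches 1.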
12-13 Generating Geometric Trees Use a geometric distribution to determine the number of children, with the expected value specified as a parameter. A depth limit is imposed, without which trees could grow indefinitely. The long tail of the geometric distribution can cause some nodes (those with a lucky sampled value from the tail) to have many more children than others (Imbalance!).
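The geometric variant can be sketched similarly. Illustrative only: the parameterization P(k) = p(1 - p)^k with p = 1/(b + 1), which gives mean child count b, is an assumption, as is the detail of forcing nodes at the depth limit to be leaves.

```python
import random

def geometric_children(rng: random.Random, b: float) -> int:
    """Sample a child count with mean b: P(k) = p * (1 - p)**k, p = 1 / (b + 1)."""
    p = 1.0 / (b + 1.0)
    k = 0
    while rng.random() >= p:
        k += 1
    return k

def geometric_tree_size(b: float, depth_limit: int, rng: random.Random) -> int:
    """Generate a depth-limited geometric tree and return its node count.
    Nodes at the depth limit are forced to be leaves."""
    size = 0
    stack = [0]  # depths of nodes awaiting expansion, starting with the root
    while stack:
        depth = stack.pop()
        size += 1
        if depth < depth_limit:
            stack.extend([depth + 1] * geometric_children(rng, b))
    return size
```

With expected child count above 1, the depth limit is what bounds the tree; the occasional large sample from the geometric tail is the source of the imbalance the slide describes.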
14 Implementation The goal is to explore the entire tree. Each thread performs a depth-first traversal, starting from a given node, using a stack. When a thread runs out of work, it must be able to acquire more nodes. In work sharing, busy threads with surplus work are responsible for offloading it to idle threads; in work stealing, idle threads are responsible for seeking out surplus work at other threads. Either way, busy threads are always making progress on the goal.
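The per-thread traversal is an ordinary depth-first search over implicitly generated nodes. A minimal single-threaded sketch, with a hypothetical `children` callback standing in for on-the-fly node generation:

```python
def explore(root, children):
    """Depth-first traversal with an explicit stack, counting nodes visited.
    `children(node)` returns the node's children, generated on the fly."""
    count = 0
    stack = [root]
    while stack:
        node = stack.pop()
        count += 1
        stack.extend(children(node))
    return count

# Tiny fixed tree for illustration: node -> list of children.
tree = {0: [1, 2], 1: [3], 2: [], 3: []}
assert explore(0, lambda n: tree[n]) == 4
```

The explicit stack is what makes load balancing possible: surplus nodes sitting on the stack can be handed off or stolen without disturbing the traversal in progress.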
15-17 Implementation Details How to express this in OpenMP & UPC? Shared variables express program and thread state. Concurrency control is applied only when necessary, since it is expensive: local stack operations, the most frequent operations, are performed without locking, while steal operations on the shared portion of the stack require locking. (Fig. 2. A thread's steal stack: the region near the stack top is for local access only, with no locking; the region near the stack bottom allows shared access, and steals there are performed under a lock.)
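The split stack can be modeled roughly as below. This is a simplified sketch: the real implementation maintains an explicit boundary between local and shared regions and moves nodes between them in chunks, and plain Python lists are not a faithful stand-in for its memory layout or lock-free owner operations.

```python
import threading

class StealStack:
    """Sketch of a split steal stack: the owner pushes/pops at the top
    (no locking in the real design); thieves take chunks from the bottom
    under a lock."""

    def __init__(self, chunk_size: int = 4):
        self.items = []
        self.chunk_size = chunk_size
        self.lock = threading.Lock()  # guards only the shared (bottom) region

    def push(self, item):
        # Owner only: operates at the stack top.
        self.items.append(item)

    def pop(self):
        # Owner only: most frequent operation, unlocked in the real design.
        return self.items.pop() if self.items else None

    def steal(self):
        """Thief: remove one chunk of nodes from the bottom, under the lock."""
        with self.lock:
            if len(self.items) < 2 * self.chunk_size:
                return []  # leave the owner enough local work
            chunk = self.items[:self.chunk_size]
            del self.items[:self.chunk_size]
            return chunk
```

Moving nodes only in chunks of `chunk_size` amortizes the locking cost over many nodes, which is exactly the granularity trade-off the later slides measure.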
18-20 Sample Trees Table 1. Sample trees, with parameters and their resulting depths and sizes in millions of nodes. All have nearly the same number of nodes but different depths. r is the seed for the root node, and can be varied to generate different trees in the same parameter class. Steal operations must be serialized through a lock; to amortize overheads, nodes can only move in chunks of size k between regions, or between shared areas in different stacks.

Tree  Type       b0  d  q  m  r  Depth  MNodes
T1    Geometric
T2    Geometric
T3    Binomial
21 Sequential Exploration Rates Table 2. Sequential performance for all trees, given in thousands of nodes per second (knodes/sec). Note the variation with tree type (T1 and T2 are geometric trees; T3 is a binomial tree).

System                    Processor            Compiler   T1  T2  T3
Cray X1                   Vector (800 MHz)     Cray
SGI Origin 2000           MIPS (300 MHz)       GCC
SGI Origin 2000           MIPS (300 MHz)       MIPSpro
Sun SunFire 6800          Sparc9 (750 MHz)     Sun C
P4 Xeon Cluster (OSC)     P4 Xeon (2.4 GHz)    Intel
Mac Powerbook             PPC G4 (1.33 GHz)    GCC
SGI Altix 3800            Itanium2 (1.6 GHz)   GCC
SGI Altix 3800            Itanium2 (1.6 GHz)   Intel
Dell Blade Cluster (UNC)  P4 Xeon (3.6 GHz)    Intel

As a result, we replaced the array of per-thread data with an array of pointers to dynamically allocated structures. During parallel execution, each thread dynamically allocates memory for its own structure, with affinity to the thread performing the allocation. This adjustment eliminated false sharing, but was unnecessary for UPC. The facilities for explicit distribution of data in OpenMP are weaker than those of UPC.
22-24 Performance on Shared Memory (SGI Origin 2000) Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3. Slide annotations: UPC slower; T1 vs. T2.
25 Performance on Distributed Memory Fig. 4. Performance on the Dell blade cluster (left); parallel efficiency versus the Origin 2000 (right)
26-29 Granularity and Scaling Geometric tree T1 on the SGI Origin 2000. Chunk size governs the amount of locking and communication: when the chunk size is too small, stealing overhead dominates; when it is too large, load imbalance dominates; the sweet spot lies in between.
30-32 Comparing Trees & Systems While Varying Granularity T1 (Geometric) vs. T3 (Binomial); Origin performance and cluster performance, in Mnodes/sec, as chunk size varies. Note that chunk size is plotted on a linear scale for the Origin but on a log scale for the cluster. Slide annotation: sensitive to chunk size.
33-34 Comparing Parallel Efficiency Parallel efficiency on the Dell cluster. Efficiency is shown at the optimal chunk size for each machine: near-ideal scaling on the Origin for both UPC and OpenMP, but poor scaling on the cluster.
35-37 Comparing Thread Behavior Fig. 5. PARAVER time charts for T3 on 4 vs. 16 processors of OSC's P4 cluster. Fig. 6. PARAVER time chart for T3 exploration on 16 processors of the Origin 2000. Clocks are synchronized at an OpenMP or UPC barrier; this gives millisecond precision on the clusters we tested, but more sophisticated methods could likely do better. The records are examined after the run using the PARAVER (PARallel Visualization and Events Representation) tool [12]. PARAVER displays a series of horizontal bars, one for each thread, with a time axis below; the color of each bar at each time value corresponds to that thread's state. In addition, yellow vertical lines are drawn between the bars representing the thief and victim threads at the time of each steal. Figure 5 shows PARAVER time charts for two runs on OSC's P4 Xeon cluster at chunk size 64. The bars on the top trace (4 threads) are mostly dark blue, indicating that the threads are at work. The second trace (16 threads) shows a considerable amount of white, showing that much of the time threads are searching for work. There is also noticeable idle time (shown in light blue) in the termination phase. The behavior of the Origin with 16 threads, shown in Fig. 6, is much better, with all threads working almost all of the time. Legend: dark blue is work time, white is time spent searching for work, light blue is idle time. Slide annotations: work stealing granularity; trouble finding work; termination detection.
38-41 Expressing Locality: OpenMP vs. UPC on Shared Memory OpenMP lacks a portable explicit locality specifier, which results in false sharing on some systems. The problem was resolved using dynamic allocation. Results shown are for the Origin 2000, using OpenMP and UPC, before and after the fix. Slide annotation: no problems for UPC either way.
42-43 Larger Problem (Tree) Size Altix processors used. Both trees are of the same type (binomial). Fig. 11. Speedup on the SGI Altix. Slide annotation: 3X.
44 Comparing Performance on Various Machines T3 using UPC on 16 processors. If it scaled better, the Blade cluster would outperform the Altix 3800. Fig. 10. Performance for T3 on 16 processors.
45 Issues Addressed Measure the ability of parallel systems to perform continuous dynamic load balancing Understand the trade-offs between communication costs and granularity of load balance On shared memory and on distributed memory Compare parallel programming languages Expressibility and ease of programming Portability and performance portability 45
46 Conclusions UPC versus OpenMP on shared memory: similar in ease of programming; UPC has a superior facility for specifying locality; both exhibit near-ideal scaling and comparable absolute performance. UPC implementation on distributed memory: frequent access of shared variables for coordinating the work stealing and termination detection leads to poor performance and scaling. Several factors impact performance: data locality, granularity of load balance, and problem size.
47 Ongoing and Future Work MPI versions of UTS, with both work sharing and work stealing methods. Further tuning of the UPC version, especially to improve distributed memory performance: reducing or replacing many shared variable accesses loses the advantage of UPC abstractions for programmability, trades off against a possible decrease in shared memory performance, and fails to provide performance portability. Performance testing on other systems.
Work-efficient Techniques for the Parallel Execution of Sparse Grid-based Computations TR91-042 Jan F. Prins The University of North Carolina at Chapel Hill Department of Computer Science CB#3175, Sitterson
More informationDistributed communication-aware load balancing with TreeMatch in Charm++
Distributed communication-aware load balancing with TreeMatch in Charm++ The 9th Scheduling for Large Scale Systems Workshop, Lyon, France Emmanuel Jeannot Guillaume Mercier Francois Tessier In collaboration
More information18-742 Lecture 4. Parallel Programming II. Homework & Reading. Page 1. Projects handout On Friday Form teams, groups of two
age 1 18-742 Lecture 4 arallel rogramming II Spring 2005 rof. Babak Falsafi http://www.ece.cmu.edu/~ece742 write X Memory send X Memory read X Memory Slides developed in part by rofs. Adve, Falsafi, Hill,
More informationIntroduction to Cloud Computing
Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic
More informationA Case Study - Scaling Legacy Code on Next Generation Platforms
Available online at www.sciencedirect.com ScienceDirect Procedia Engineering 00 (2015) 000 000 www.elsevier.com/locate/procedia 24th International Meshing Roundtable (IMR24) A Case Study - Scaling Legacy
More informationThe Gurobi Optimizer
The Gurobi Optimizer Gurobi History Gurobi Optimization, founded July 2008 Zonghao Gu, Ed Rothberg, Bob Bixby Started code development March 2008 Gurobi Version 1.0 released May 2009 History of rapid,
More informationThe SpiceC Parallel Programming System of Computer Systems
UNIVERSITY OF CALIFORNIA RIVERSIDE The SpiceC Parallel Programming System A Dissertation submitted in partial satisfaction of the requirements for the degree of Doctor of Philosophy in Computer Science
More informationSubgraph Patterns: Network Motifs and Graphlets. Pedro Ribeiro
Subgraph Patterns: Network Motifs and Graphlets Pedro Ribeiro Analyzing Complex Networks We have been talking about extracting information from networks Some possible tasks: General Patterns Ex: scale-free,
More informationOptimizing Performance. Training Division New Delhi
Optimizing Performance Training Division New Delhi Performance tuning : Goals Minimize the response time for each query Maximize the throughput of the entire database server by minimizing network traffic,
More informationLarge Scale Simulation on Clusters using COMSOL 4.2
Large Scale Simulation on Clusters using COMSOL 4.2 Darrell W. Pepper 1 Xiuling Wang 2 Steven Senator 3 Joseph Lombardo 4 David Carrington 5 with David Kan and Ed Fontes 6 1 DVP-USAFA-UNLV, 2 Purdue-Calumet,
More informationApplication Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability
Application Performance Monitoring: Trade-Off between Overhead Reduction and Maintainability Jan Waller, Florian Fittkau, and Wilhelm Hasselbring 2014-11-27 Waller, Fittkau, Hasselbring Application Performance
More informationParallel Computing for Option Pricing Based on the Backward Stochastic Differential Equation
Parallel Computing for Option Pricing Based on the Backward Stochastic Differential Equation Ying Peng, Bin Gong, Hui Liu, and Yanxin Zhang School of Computer Science and Technology, Shandong University,
More informationJBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers
JBoss Seam Performance and Scalability on Dell PowerEdge 1855 Blade Servers Dave Jaffe, PhD, Dell Inc. Michael Yuan, PhD, JBoss / RedHat June 14th, 2006 JBoss Inc. 2006 About us Dave Jaffe Works for Dell
More informationMizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/35 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
More informationScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs
ScaAnalyzer: A Tool to Identify Memory Scalability Bottlenecks in Parallel Programs ABSTRACT Xu Liu Department of Computer Science College of William and Mary Williamsburg, VA 23185 xl10@cs.wm.edu It is
More informationHP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief
Technical white paper HP ProLiant BL660c Gen9 and Microsoft SQL Server 2014 technical brief Scale-up your Microsoft SQL Server environment to new heights Table of contents Executive summary... 2 Introduction...
More informationPerformance of the JMA NWP models on the PC cluster TSUBAME.
Performance of the JMA NWP models on the PC cluster TSUBAME. K.Takenouchi 1), S.Yokoi 1), T.Hara 1) *, T.Aoki 2), C.Muroi 1), K.Aranami 1), K.Iwamura 1), Y.Aikawa 1) 1) Japan Meteorological Agency (JMA)
More informationCYCLIC: A LOCALITY-PRESERVING LOAD-BALANCING ALGORITHM FOR PDES ON SHARED MEMORY MULTIPROCESSORS
Computing and Informatics, Vol. 31, 2012, 1255 1278 CYCLIC: A LOCALITY-PRESERVING LOAD-BALANCING ALGORITHM FOR PDES ON SHARED MEMORY MULTIPROCESSORS Antonio García-Dopico, Antonio Pérez Santiago Rodríguez,
More information64-Bit versus 32-Bit CPUs in Scientific Computing
64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples
More informationJava's garbage-collected heap
Sponsored by: This story appeared on JavaWorld at http://www.javaworld.com/javaworld/jw-08-1996/jw-08-gc.html Java's garbage-collected heap An introduction to the garbage-collected heap of the Java
More informationMulti-Threading Performance on Commodity Multi-Core Processors
Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction
More informationLoad balancing. Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008
Load balancing Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA: Parallel Numerical Algorithms [L.26] Thursday, April 17, 2008 1 Today s sources CS 194/267 at UCB (Yelick/Demmel) Intro
More information11.1 inspectit. 11.1. inspectit
11.1. inspectit Figure 11.1. Overview on the inspectit components [Siegl and Bouillet 2011] 11.1 inspectit The inspectit monitoring tool (website: http://www.inspectit.eu/) has been developed by NovaTec.
More informationParallel Computing using MATLAB Distributed Compute Server ZORRO HPC
Parallel Computing using MATLAB Distributed Compute Server ZORRO HPC Goals of the session Overview of parallel MATLAB Why parallel MATLAB? Multiprocessing in MATLAB Parallel MATLAB using the Parallel Computing
More informationLoad Balancing Non-Uniform Parallel Computations
Load Balancing Non-Uniform Parallel Computations Xinghui Zhao School of Engineering and Computer Science, Washington State University 14204 NE Salmon Creek Ave., Vancouver, WA 98686 x.zhao@wsu.edu Nadeem
More informationCOSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters
COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network
More informationOpenMP and Performance
Dirk Schmidl IT Center, RWTH Aachen University Member of the HPC Group schmidl@itc.rwth-aachen.de IT Center der RWTH Aachen University Tuning Cycle Performance Tuning aims to improve the runtime of an
More informationParFUM: A Parallel Framework for Unstructured Meshes. Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008
ParFUM: A Parallel Framework for Unstructured Meshes Aaron Becker, Isaac Dooley, Terry Wilmarth, Sayantan Chakravorty Charm++ Workshop 2008 What is ParFUM? A framework for writing parallel finite element
More informationFast Two-Point Correlations of Extremely Large Data Sets
Fast Two-Point Correlations of Extremely Large Data Sets Joshua Dolence 1 and Robert J. Brunner 1,2 1 Department of Astronomy, University of Illinois at Urbana-Champaign, 1002 W Green St, Urbana, IL 61801
More informationA Locality Approach to Architecture-aware Task-scheduling in OpenMP
A Locality Approach to Architecture-aware Task-scheduling in OpenMP Ananya Muddukrishna ananya@kth.se Mats Brorsson matsbror@kth.se Vladimir Vlassov vladv@kth.se ABSTRACT Multicore and other parallel computer
More informationLoad Balancing using Potential Functions for Hierarchical Topologies
Acta Technica Jaurinensis Vol. 4. No. 4. 2012 Load Balancing using Potential Functions for Hierarchical Topologies Molnárka Győző, Varjasi Norbert Széchenyi István University Dept. of Machine Design and
More informationReducing Memory in Software-Based Thread-Level Speculation for JavaScript Virtual Machine Execution of Web Applications
Reducing Memory in Software-Based Thread-Level Speculation for JavaScript Virtual Machine Execution of Web Applications Abstract Thread-Level Speculation has been used to take advantage of multicore processors
More informationINTEL PARALLEL STUDIO XE EVALUATION GUIDE
Introduction This guide will illustrate how you use Intel Parallel Studio XE to find the hotspots (areas that are taking a lot of time) in your application and then recompiling those parts to improve overall
More informationParallel Computing. Benson Muite. benson.muite@ut.ee http://math.ut.ee/ benson. https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage
Parallel Computing Benson Muite benson.muite@ut.ee http://math.ut.ee/ benson https://courses.cs.ut.ee/2014/paralleel/fall/main/homepage 3 November 2014 Hadoop, Review Hadoop Hadoop History Hadoop Framework
More informationDebugging with TotalView
Tim Cramer 17.03.2015 IT Center der RWTH Aachen University Why to use a Debugger? If your program goes haywire, you may... ( wand (... buy a magic... read the source code again and again and...... enrich
More informationParallel Algorithm for Dense Matrix Multiplication
Parallel Algorithm for Dense Matrix Multiplication CSE633 Parallel Algorithms Fall 2012 Ortega, Patricia Outline Problem definition Assumptions Implementation Test Results Future work Conclusions Problem
More informationParallel Compression and Decompression of DNA Sequence Reads in FASTQ Format
, pp.91-100 http://dx.doi.org/10.14257/ijhit.2014.7.4.09 Parallel Compression and Decompression of DNA Sequence Reads in FASTQ Format Jingjing Zheng 1,* and Ting Wang 1, 2 1,* Parallel Software and Computational
More informationExascale Challenges and General Purpose Processors. Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation
Exascale Challenges and General Purpose Processors Avinash Sodani, Ph.D. Chief Architect, Knights Landing Processor Intel Corporation Jun-93 Aug-94 Oct-95 Dec-96 Feb-98 Apr-99 Jun-00 Aug-01 Oct-02 Dec-03
More informationLinux tools for debugging and profiling MPI codes
Competence in High Performance Computing Linux tools for debugging and profiling MPI codes Werner Krotz-Vogel, Pallas GmbH MRCCS September 02000 Pallas GmbH Hermülheimer Straße 10 D-50321
More informationWITH the availability of large data sets in application
IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, VOL. 16, NO. 10, OCTOBER 2004 1 Shared Memory Parallelization of Data Mining Algorithms: Techniques, Programming Interface, and Performance Ruoming
More informationA Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture
A Robust Dynamic Load-balancing Scheme for Data Parallel Application on Message Passing Architecture Yangsuk Kee Department of Computer Engineering Seoul National University Seoul, 151-742, Korea Soonhoi
More information32 K. Stride 8 32 128 512 2 K 8 K 32 K 128 K 512 K 2 M
Evaluating a Real Machine Performance Isolation using Microbenchmarks Choosing Workloads Evaluating a Fixed-size Machine Varying Machine Size Metrics All these issues, plus more, relevant to evaluating
More information