UTS: An Unbalanced Tree Search Benchmark
1 UTS: An Unbalanced Tree Search Benchmark LCPC
2 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks to Ohio Supercomputer Center, European Center for Parallelism in Barcelona, and William Pugh 2
3 Motivating the Problem Suppose we have a large search space and a tree-structured search. How could we do this in parallel? 3
4 Motivating the Problem Suppose we have a large search space and a tree-structured search. How could we do this in parallel? Now suppose the tree is highly unbalanced. 4
5 Meet UTS Unbalanced computation requiring continuous dynamic load balancing for high performance Exposes trade-offs between communication costs and granularity of load balance Fundamentally disadvantages systems with coarse-grained communication networks Challenges the expressibility and performance portability of parallel programming languages 5
6 Benchmark Characteristics Implicit Generation Tree nodes generated on the fly Mimics systematic search Well-suited for parallel exploration Imbalance Large variation in subtree sizes Parameterization Can create trees of different depths and sizes 6
7 Outline Tree Generation Implementation Performance Analysis Conclusions and Future Work 7
8 Generation Strategy Each node consists of a 20-byte descriptor Implicit child creation using a cryptographic hash: (Parent Descriptor, Child Index) → Hash → Child Node Descriptor Number of children determined by sampling a probability distribution using the descriptor 8
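The hashing step on this slide can be sketched in a few lines of C. This is only an illustration, assuming OpenSSL's SHA1() stands in for the benchmark's 20-byte cryptographic hash and using made-up type and function names; the actual UTS code bundles its own SHA-1-based generator.

  #include <stdint.h>
  #include <string.h>
  #include <openssl/sha.h>     /* SHA1(): 20-byte digest */

  #define DESC_LEN 20          /* each node is a 20-byte descriptor */

  typedef struct { unsigned char desc[DESC_LEN]; } Node;

  /* Child i's descriptor = hash(parent descriptor, child index). */
  void gen_child(const Node *parent, int child_index, Node *child)
  {
      unsigned char buf[DESC_LEN + sizeof(uint32_t)];
      uint32_t idx = (uint32_t)child_index;

      memcpy(buf, parent->desc, DESC_LEN);
      memcpy(buf + DESC_LEN, &idx, sizeof idx);
      SHA1(buf, sizeof buf, child->desc);
  }

Because the hash is deterministic, any thread that holds a parent descriptor can regenerate that parent's children on the fly, which is what makes the tree implicit and reproducible across runs and machines.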
9 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree How large a subtree at each?... 9
10 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree How large a subtree at each?... Don't know until we've explored them... 10
11 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree Now we know 11
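As a rough sketch of the binomial rule just described (not the benchmark's exact code), the child count can be decided by treating part of the node's descriptor as a uniform random value and comparing it with q:

  #include <stdint.h>

  #define DESC_LEN 20

  /* With probability q the node has m children, otherwise it has none.
     Assumes the first 4 descriptor bytes behave as a uniform random value;
     the real UTS derives its random stream from the SHA-1 state. */
  int num_children_binomial(const unsigned char desc[DESC_LEN], double q, int m)
  {
      uint32_t r = ((uint32_t)desc[0] << 24) | ((uint32_t)desc[1] << 16) |
                   ((uint32_t)desc[2] << 8)  |  (uint32_t)desc[3];
      double u = r / 4294967296.0;   /* scale to [0, 1) by dividing by 2^32 */

      return (u < q) ? m : 0;
  }

Since every node draws from the same distribution, the expected size of the subtree below any node is the same everywhere in the tree, yet realized subtree sizes vary enormously as qm approaches 1 — exactly the imbalance the benchmark is after.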
12 Generating Geometric Trees Use geometric distribution to determine the number of children Expected value specified as a parameter Depth limitation without which trees could grow indefinitely Long tail of geometric distribution can cause some nodes to have many more children than others (Imbalance!) 12
13 Generating Geometric Trees Use geometric distribution to determine the number of children Expected value specified as a parameter Depth limitation without which trees could grow indefinitely Long tail of geometric distribution can cause some nodes to have many more children than others (Imbalance!) Lucky sampled value from tail 13
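A comparable sketch for the geometric tree (again illustrative, not the benchmark's exact code): draw the child count from a geometric distribution with mean b by inverting its CDF, and cut the tree off at the depth limit. The real UTS additionally shapes the expected branching factor as a function of depth.

  #include <math.h>

  /* u is uniform in (0,1); b is the expected branching factor. */
  int num_children_geometric(double u, double b, int depth, int max_depth)
  {
      if (depth >= max_depth)
          return 0;                      /* depth limit keeps the tree finite */

      /* Geometric distribution on {0,1,2,...} with mean b:
         P(n) = p * (1-p)^n where p = 1/(1+b); invert the CDF with u. */
      double p = 1.0 / (1.0 + b);
      return (int)floor(log(u) / log(1.0 - p));
  }

The long tail of this distribution is what occasionally hands one node far more children than its siblings.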
14 Implementation Goal is to explore the entire tree Each thread performs a depth-first traversal, starting from a given node, using a stack When a thread runs out of work, it must be able to acquire more nodes Work sharing - busy threads with surplus work are responsible for offloading it to idle threads Work stealing - idle threads are responsible for seeking out surplus work at other threads, so busy threads are always making progress on the goal 14
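The traversal each thread performs can be pictured with a small sketch. Stack, stack_push/stack_pop, num_children, gen_child, and acquire_work are illustrative names, not the actual UTS interfaces; acquire_work stands in for whichever load-balancing mechanism (sharing or stealing) supplies new nodes.

  #include <stdbool.h>

  typedef struct { unsigned char desc[20]; } Node;   /* 20-byte descriptor */
  typedef struct Stack Stack;                        /* opaque; illustrative */

  bool stack_empty(Stack *s);
  Node stack_pop(Stack *s);
  void stack_push(Stack *s, const Node *n);
  int  num_children(const Node *n);                  /* sample the tree's distribution */
  void gen_child(const Node *parent, int i, Node *child);
  bool acquire_work(Stack *s);                       /* steal or receive a chunk; false = done */

  void explore(Stack *stk, long *nodes_seen)
  {
      for (;;) {
          while (!stack_empty(stk)) {
              Node n = stack_pop(stk);
              (*nodes_seen)++;

              int kids = num_children(&n);
              for (int i = 0; i < kids; i++) {
                  Node child;
                  gen_child(&n, i, &child);          /* hash(parent, child index) */
                  stack_push(stk, &child);
              }
          }
          /* Local stack is empty: obtain more nodes or terminate. */
          if (!acquire_work(stk))
              break;
      }
  }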
15 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: the stack top is local access only; the stack bottom allows shared access for steals 15
16 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: no locking on the local-access-only top portion; the stack bottom allows shared access for steals 16
17 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: no locking on the local-access-only top portion; locking on the shared-access bottom portion used for steals 17
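The split stack in Fig. 2 might look roughly like the following. This is a sketch under assumptions, not the UTS source: an OpenMP lock guards only the shared (bottom) region, and nodes move between regions in chunks of a fixed size so that locking is amortized.

  #include <omp.h>
  #include <string.h>

  #define CHUNK_K   64                  /* chunk size k */
  #define MAX_NODES (1 << 20)

  typedef struct { unsigned char desc[20]; } Node;

  typedef struct {
      Node nodes[MAX_NODES];
      int  shared_top;   /* [0, shared_top): shared region, lock-protected   */
      int  local_top;    /* [shared_top, local_top): owner-only, no locking  */
      omp_lock_t lock;   /* assume omp_init_lock() was called at startup     */
  } StealStack;

  /* Owner exposes its bottom CHUNK_K local nodes for stealing. */
  void release_chunk(StealStack *s)
  {
      if (s->local_top - s->shared_top < 2 * CHUNK_K)
          return;                        /* keep some work for ourselves */
      omp_set_lock(&s->lock);
      s->shared_top += CHUNK_K;          /* boundary moves up; nodes stay in place */
      omp_unset_lock(&s->lock);
  }

  /* Thief copies a chunk of k nodes out of the victim's shared region. */
  int steal_chunk(StealStack *victim, Node out[CHUNK_K])
  {
      int got = 0;
      omp_set_lock(&victim->lock);
      if (victim->shared_top >= CHUNK_K) {
          victim->shared_top -= CHUNK_K;
          memcpy(out, &victim->nodes[victim->shared_top], CHUNK_K * sizeof(Node));
          got = CHUNK_K;
      }
      omp_unset_lock(&victim->lock);
      return got;
  }

The owner's ordinary pushes and pops touch only indices at or above shared_top, so the common case involves no lock at all; only the chunk-granularity release and steal operations pay for synchronization.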
18 Sample Trees Table 1. Sample trees, with parameters and their resulting depths and sizes in millions of nodes. All have nearly the same number of nodes but different depths. r is the seed for the root node, and can be varied to generate different trees in the same parameter class. From the accompanying paper text: steal operations must be serialized through a lock; to amortize overheads, nodes can only move in chunks of size k between stack regions or between shared areas in different stacks. Tree / Type / b0 / d / q / m / r / Depth / MNodes: T1 Geometric; T2 Geometric; T3 Binomial 18
19 Sample Trees Table 1 (as on slide 18): T1 Geometric, T2 Geometric, T3 Binomial; all have nearly the same number of nodes but different depths, and r can be varied to generate different trees with the same other parameters 19
20 Sample Trees Table 1 (as on slide 18): T1 Geometric, T2 Geometric, T3 Binomial; all have nearly the same number of nodes but different depths, and r can be varied to generate different trees with the same other parameters 20
21 Sequential Exploration Rates Table 2. Sequential performance for all trees (thousands of nodes per second). Note the variation with tree type (T1 and T2 are geometric trees; T3 is a binomial tree). System / Processor / Compiler: Cray X1 / Vector (800 MHz) / Cray; SGI Origin 2000 / MIPS (300 MHz) / GCC; SGI Origin 2000 / MIPS (300 MHz) / MIPSpro; Sun SunFire 6800 / Sparc9 (750 MHz) / Sun C; P4 Xeon Cluster (OSC) / P4 Xeon (2.4 GHz) / Intel; Mac Powerbook / PPC G4 (1.33 GHz) / GCC; SGI Altix 3800 / Itanium2 (1.6 GHz) / GCC; SGI Altix 3800 / Itanium2 (1.6 GHz) / Intel; Dell Blade Cluster (UNC) / P4 Xeon (3.6 GHz) / Intel. From the accompanying paper text: as a result, we replaced the array of per-thread data with an array of pointers to dynamically allocated structures. During parallel execution, each thread dynamically allocates memory for its own structure with affinity to the thread performing the allocation. This adjustment eliminated false sharing, but was unnecessary for UPC. The facilities for explicit distribution of data in OpenMP are weaker than those of UPC. 21
22 Performance on Shared Memory (SGI Origin 2000) Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 22
23 Performance on Shared Memory (SGI Origin 2000) UPC Slower Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 23
24 Performance on Shared Memory (SGI Origin 2000) T1 vs T2 Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 24
25 Performance on Distributed Memory Fig. 4. Performance on the Dell blade cluster (left); parallel efficiency on the Dell cluster versus the Origin 2000 (right) 25
26 Granularity and Scaling Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 26
27 Granularity and Scaling Stealing overhead dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 27
28 Granularity and Scaling Stealing overhead dominates Load imbalance dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 28
29 Granularity and Scaling Stealing overhead dominates Sweet Spot Load imbalance dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 29
30 Comparing Trees & Systems While Varying Granularity [Plots: Origin performance and cluster performance (Mnodes/sec) vs. chunk size, for T1 (geometric) and T3 (binomial); chunk size shown on a linear scale in one column and a log scale in the other] 30
31 Comparing Trees & Systems While Varying Granularity [Same plots as slide 30; note that one chunk-size axis is a log scale!] 31
32 Comparing Trees & Systems While Varying Granularity [Same plots as slide 30] Sensitive to chunk size 32
33 Comparing Parallel Efficiency Fig. 4 (right): parallel efficiency on the Dell cluster versus the Origin 2000 Efficiency at optimal chunk size for each machine Near-ideal scaling on the Origin with UPC and OpenMP Poor scaling on the cluster 33
34 Comparing Parallel Efficiency Fig. 4 (right): parallel efficiency on the Dell cluster versus the Origin 2000 Efficiency at optimal chunk size for each machine Near-ideal scaling on the Origin with UPC and OpenMP Poor scaling on the cluster 34
35 Comparing Thread Behavior Cluster (4 & 16 threads): Fig. 5. PARAVER time charts for T3 on 4 vs. 16 processors of OSC's P4 cluster. Origin 2000 (16 threads): Fig. 6. PARAVER time chart for T3 exploration on 16 processors of the Origin 2000. Dark blue is work time, light blue is time searching for work, white is idle time. Work Stealing Granularity. From the accompanying paper text: ... an OpenMP or UPC barrier. This gives millisecond precision on the clusters we tested, but more sophisticated methods could likely do better. The records are examined after the run using the PARAVER (PARallel Visualization and Events Representation) tool [12]. PARAVER displays a series of horizontal bars, one for each thread, above a time axis; the color of each bar at each time value corresponds to that thread's state. In addition, yellow vertical lines are drawn between the bars representing the thief and victim threads at the time of each steal. Figure 5 shows PARAVER time charts for two runs on OSC's P4 Xeon cluster at chunk size 64. The bars on the top trace (4 threads) are mostly dark blue, indicating that the threads are at work. The second trace (16 threads) shows a considerable amount of white, showing that much of the time threads are searching for work. There is also noticeable idle time (shown in light blue) in the termination phase. The behavior of the Origin 2000 with 16 threads, shown in Fig. 6, is much better, with all threads working almost all of the time. 35
36 Comparing Thread Behavior (same charts and legend as slide 35) Trouble finding work; Work Stealing Granularity 36
37 Comparing Thread Behavior (same charts and legend as slide 35) Termination Detection; Work Stealing Granularity 37
38 Expressing Locality: OpenMP vs. UPC on Shared Memory OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 38
39 Expressing Locality: OpenMP vs. UPC on Shared Memory !!! OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 39
40 Expressing Locality: OpenMP vs. UPC on Shared Memory After Fix !!! OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 40
41 Expressing Locality: OpenMP vs. UPC on Shared Memory No problems for UPC either way OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 41
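The dynamic-allocation fix mentioned on these slides (and in the paper text quoted on slide 21) can be sketched as follows; the StealStack fields and names here are illustrative, not the actual UTS declarations. The point is that each thread allocates and first-touches its own structure inside the parallel region, so the memory lands near that thread and neighboring threads no longer share cache lines.

  #include <stdlib.h>
  #include <omp.h>

  typedef struct {
      long nodes_explored;           /* per-thread counters, stack state, ... */
  } StealStack;

  /* Before: a single static array, StealStack stacks[MAX_THREADS];
     neighboring elements can fall on the same cache line or page,
     causing false sharing between threads.                          */

  StealStack **stacks;               /* after: one pointer per thread */

  void init_thread_state(void)
  {
      int nthreads = omp_get_max_threads();
      stacks = malloc(nthreads * sizeof(StealStack *));

      #pragma omp parallel
      {
          int me = omp_get_thread_num();
          /* Allocation and first touch happen on the owning thread,
             giving the structure affinity to that thread's memory.  */
          stacks[me] = calloc(1, sizeof(StealStack));
      }
  }

UPC did not need this workaround: its shared data carry an explicit affinity specification, which is consistent with the slides' point that UPC has a superior facility for expressing locality.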
42 Larger Problem (Tree) Size Altix processors used Both trees are of the same type (binomial) Fig. 11. Speedup on the SGI Altix 42
43 Larger Problem (Tree) Size Altix processors used 3X Both trees are of the same type (binomial) Fig. 11. Speedup on the SGI Altix 43
44 Comparing Performance on Various Machines T3 using UPC on 16 processors If it scaled better, the blade cluster would outperform the Altix 3800 Fig. 10. Performance for T3 on 16 processors 44
45 Issues Addressed Measure the ability of parallel systems to perform continuous dynamic load balancing Understand the trade-offs between communication costs and granularity of load balance On shared memory and on distributed memory Compare parallel programming languages Expressibility and ease of programming Portability and performance portability 45
46 Conclusions UPC versus OpenMP on shared memory Similar in ease of programming UPC has superior facility for specifying locality Both exhibit near-ideal scaling Comparable absolute performance UPC implementation on distributed memory Frequent access of shared variables for coordinating the work stealing and termination detection leads to poor performance and scaling Several factors impact performance Data locality Granularity of load balance Problem size 46
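As context for the shared-variable cost named in the conclusions, here is a deliberately simplified counter-based idle-thread scheme, not the actual UTS termination algorithm (which must also guard against work still in flight); try_steal and nthreads are illustrative.

  #include <stdbool.h>

  extern int  nthreads;              /* set to the team size at startup */
  extern bool try_steal(void);       /* hypothetical: grab a chunk from some victim */

  static int num_idle = 0;           /* shared: how many threads are out of work */

  /* Called by a thread whose local stack is empty.
     Returns true if it obtained more work, false if the run is over. */
  bool wait_for_work(void)
  {
      #pragma omp atomic
      num_idle++;

      for (;;) {
          int idle_now;
          #pragma omp atomic read
          idle_now = num_idle;

          if (idle_now == nthreads)
              return false;          /* every thread idle: declare termination */

          if (try_steal()) {
              #pragma omp atomic
              num_idle--;            /* leaving the idle set, back to real work */
              return true;
          }
      }
  }

Every pass through this loop reads and possibly writes shared state, which is cheap on a shared-memory machine but becomes repeated remote communication under UPC on a cluster — one concrete source of the poor distributed-memory scaling reported above.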
47 Ongoing and Future Work MPI versions of UTS Work sharing and work stealing methods Further tuning of UPC version, especially to improve distributed memory performance Reduce/replace many shared variable accesses Lose advantage of UPC abstractions for programmability Trade-off with possible decrease in shared memory performance Fails to provide performance portability Performance testing on other systems 47
