UTS: An Unbalanced Tree Search Benchmark


1. UTS: An Unbalanced Tree Search Benchmark (LCPC)

2. Coauthors
- Stephen Olivier (UNC), Jun Huan (UNC/Kansas), Jinze Liu (UNC), Jan Prins (UNC), James Dinan (OSU), P. Sadayappan (OSU), Chau-Wen Tseng (UMD)
- Thanks also to the Ohio Supercomputer Center, the European Center for Parallelism of Barcelona, and William Pugh

3-4. Motivating the Problem
- Suppose we have a large search space and a tree-structured search. How could we do this in parallel?
- Now suppose the tree is highly unbalanced.

5. Meet UTS
- Unbalanced computation requiring continuous dynamic load balancing for high performance
- Exposes trade-offs between communication costs and granularity of load balancing
- Fundamentally disadvantages systems with coarse-grained communication networks
- Challenges the expressibility and performance portability of parallel programming languages

6. Benchmark Characteristics
- Implicit generation: tree nodes are generated on the fly; mimics systematic search; well-suited for parallel exploration
- Imbalance: large variation in subtree sizes
- Parameterization: can create trees of different depths and sizes

7. Outline
- Tree Generation
- Implementation
- Performance Analysis
- Conclusions and Future Work

8. Generation Strategy
- Each node consists of a 20B descriptor
- Implicit child creation using a cryptographic hash: (parent descriptor, child index) -> hash -> child node descriptor
- The number of children is determined by sampling a probability distribution using the descriptor
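As a minimal sketch of this scheme, assuming OpenSSL's SHA-1 as the cryptographic hash (the type and function names here are illustrative, not the benchmark's actual API), child generation might look like:

```c
/* Sketch of implicit child generation: child descriptor =
 * hash(parent descriptor, child index). Assumes OpenSSL's SHA-1,
 * whose 20-byte digest matches the 20B node descriptor. */
#include <openssl/sha.h>
#include <string.h>

#define DESC_LEN 20

typedef struct { unsigned char desc[DESC_LEN]; } Node;

void spawn_child(const Node *parent, int index, Node *child) {
    unsigned char buf[DESC_LEN + sizeof(int)];
    memcpy(buf, parent->desc, DESC_LEN);
    memcpy(buf + DESC_LEN, &index, sizeof(int));
    SHA1(buf, sizeof buf, child->desc);  /* deterministic: the same tree
                                            is regenerated on every run */
}
```

Because generation is deterministic in the root seed, any processor can expand any node independently, which is what makes the tree well-suited for parallel exploration.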

9-11. Generating Binomial Trees
- In a binomial tree, each node has 0 or m children
- A node has m children with probability q, and 0 children with probability 1 - q
- When qm < 1, the expected tree size is 1/(1 - qm)
- As qm approaches 1, subtree size variation increases (imbalance!)
- The expected size of the subtree rooted at each node is independent of its location in the tree
- How large a subtree at each node? We don't know until we've explored it
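A sketch of the binomial sampling rule, reusing the Node type from the previous sketch and assuming the first four descriptor bytes act as a uniform pseudo-random value (the real benchmark derives its random values from the descriptor differently):

```c
#include <string.h>

/* m children with probability q, 0 children otherwise. */
int num_children_binomial(const Node *node, double q, int m) {
    unsigned int r;
    memcpy(&r, node->desc, sizeof r);    /* reuse hashed descriptor bytes */
    double u = r / (double)0xFFFFFFFFu;  /* uniform value in [0, 1] */
    return (u < q) ? m : 0;
}
```

The expected-size formula follows directly: each node expects qm children, so the expected total size E satisfies E = 1 + qm * E, giving E = 1/(1 - qm) whenever qm < 1.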

12-13. Generating Geometric Trees
- Use a geometric distribution to determine the number of children; the expected value is specified as a parameter
- A depth limit is imposed, without which trees could grow indefinitely
- The long tail of the geometric distribution can cause some nodes to have many more children than others (imbalance!), e.g. a lucky value sampled from the tail
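A sketch of geometric sampling by inversion, where u is assumed to be a uniform draw in [0, 1) derived from the node's descriptor and b0 is the expected number of children; the parameter names are illustrative:

```c
#include <math.h>

/* Geometric child count with mean b0: P(k children) = p * (1-p)^k
 * with p = 1/(1+b0), so the mean (1-p)/p equals b0.
 * The hard depth limit d keeps the tree finite. */
int num_children_geometric(double u, double b0, int depth, int d) {
    if (depth >= d)
        return 0;                        /* cut off at the depth limit */
    double p = 1.0 / (1.0 + b0);
    return (int)floor(log(1.0 - u) / log(1.0 - p));  /* inverse CDF */
}
```

The long tail arises from the inverse CDF: values of u very close to 1 map to large child counts, producing the occasional heavily-branching node.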

14. Implementation
- The goal is to explore the entire tree
- Each thread performs a depth-first traversal, using a stack, starting from a given node
- When a thread runs out of work, it must be able to acquire more nodes:
  - Work sharing: busy threads with surplus work are responsible for offloading it to idle threads
  - Work stealing: idle threads are responsible for seeking out surplus work at other threads, so busy threads always make progress on the goal
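The per-thread loop might look like the following sketch; the helpers declared at the top stand in for the benchmark's internals and are assumptions for exposition, not its actual API:

```c
/* Assumed helpers standing in for the benchmark's internals. */
typedef struct StealStack StealStack;
int  has_local_work(StealStack *s);
int  steal_work(StealStack *s);      /* nonzero if a steal succeeded */
void pop(StealStack *s, Node *out);
void push(StealStack *s, const Node *n);
int  num_children(const Node *n);    /* binomial or geometric sampling */

/* Depth-first traversal with work stealing: run until the local stack
 * is empty and no other thread has surplus work to steal. */
void explore(StealStack *s) {
    Node n, child;
    while (has_local_work(s) || steal_work(s)) {
        pop(s, &n);
        int c = num_children(&n);
        for (int i = 0; i < c; i++) {
            spawn_child(&n, i, &child);  /* hash-based generation */
            push(s, &child);
        }
    }
}
```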

15-17. Implementation Details
- How to express this in OpenMP and UPC?
- Shared variables express program and thread state
- Concurrency control is applied only when necessary, due to its high expense:
  - Local stack operations, the most frequent operations, are performed without locking (no locking here)
  - Steal operations are performed on the shared portion of the stack, with locking (locking here)
- Fig. 2. A thread's steal stack: the region near the stack top is local access only; the region near the stack bottom permits shared access for steals
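A sketch of the split steal stack in OpenMP, with illustrative field names: only operations that touch the shared region take the lock, while the owner's pushes and pops near the stack top remain lock-free.

```c
#include <omp.h>

#define MAX_STACK (1 << 20)

typedef struct StealStack {
    Node nodes[MAX_STACK];
    int top;          /* written only by the owner: no locking needed */
    int split;        /* entries below split are stealable (shared) */
    omp_lock_t lock;  /* guards the shared region and the boundary */
} StealStack;

/* Owner releases a chunk of k local nodes into the shared region. */
void release(StealStack *s, int k) {
    omp_set_lock(&s->lock);
    if (s->top - s->split >= k)
        s->split += k;               /* nodes become visible to thieves */
    omp_unset_lock(&s->lock);
}

/* Thief takes a chunk of k nodes from the victim's shared region. */
int steal(StealStack *victim, Node *out, int k) {
    int got = 0;
    omp_set_lock(&victim->lock);
    if (victim->split >= k) {
        victim->split -= k;
        for (int i = 0; i < k; i++)
            out[i] = victim->nodes[victim->split + i];
        got = k;
    }
    omp_unset_lock(&victim->lock);
    return got;
}
```

Moving nodes only in chunks of k amortizes the locking cost over many tree nodes; k is the granularity knob examined in the scaling results below.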

18-20. Sample Trees
- Table 1. Sample trees, with parameters and their resulting depths and sizes in millions of nodes. All have nearly the same number of nodes but different depths. (Parameters: b0, d, q, m, r; results: Depth, MNodes. T1: Geometric; T2: Geometric; T3: Binomial.)
- r is the seed for the root node, and can be varied to generate different trees in the same parameter class
- Operations on the shared stack must be serialized through a lock. To amortize overheads, nodes can only move in chunks of size k between regions of a stack or between shared areas in different stacks

21. Sequential Exploration Rates
- Table 2. Sequential performance for all trees (thousands of nodes per second); columns T1, T2, T3 give the rates for each system:
  - Cray X1: Vector (800 MHz), Cray compiler
  - SGI Origin 2000: MIPS (300 MHz), GCC
  - SGI Origin 2000: MIPS (300 MHz), MIPSpro
  - Sun SunFire 6800: Sparc9 (750 MHz), Sun C
  - P4 Xeon Cluster (OSC): P4 Xeon (2.4 GHz), Intel
  - Mac Powerbook: PPC G4 (1.33 GHz), GCC
  - SGI Altix 3800: Itanium2 (1.6 GHz), GCC
  - SGI Altix 3800: Itanium2 (1.6 GHz), Intel
  - Dell Blade Cluster (UNC): P4 Xeon (3.6 GHz), Intel
- Performance is given in knodes/sec; note the variation with tree type (T1 and T2 are geometric trees; T3 is a binomial tree)
- To eliminate false sharing, we replaced the array of per-thread data with an array of pointers to dynamically allocated structures. During parallel execution, each thread dynamically allocates memory for its own structure, with affinity to the thread performing the allocation. This adjustment eliminated false sharing, but was unnecessary for UPC. The facilities for explicit distribution of data in OpenMP are weaker than those of UPC.
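A sketch of that adjustment; ThreadState and MAX_THREADS are hypothetical names. Allocating inside the parallel region gives each structure affinity to its owning thread on first-touch systems and keeps per-thread data off shared cache lines:

```c
#include <omp.h>
#include <stdlib.h>

typedef struct {
    long nodes_visited;   /* per-thread counters, stack state, etc. */
} ThreadState;

#define MAX_THREADS 128
ThreadState *state[MAX_THREADS];  /* array of pointers, not of structs */

void init_thread_state(void) {
    #pragma omp parallel
    {
        int id = omp_get_thread_num();
        /* Each thread allocates its own structure: separate allocations
         * avoid false sharing, and first-touch placement puts the memory
         * near the allocating thread. */
        state[id] = calloc(1, sizeof(ThreadState));
    }
}
```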

22-24. Performance on Shared Memory (SGI Origin 2000)
- UPC is slower than OpenMP
- Compare T1 vs. T2
- Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3.

25. Performance on Distributed Memory
- Fig. 4. Performance on the Dell blade cluster (left); parallel efficiency on the Dell cluster versus the Origin 2000 (right)

26-29. Granularity and Scaling
- Geometric tree T1 on the SGI Origin 2000
- Chunk size governs the amount of locking and communication
- At one extreme, stealing overhead dominates; at the other, load imbalance dominates; the sweet spot lies in between

30-32. Comparing Trees & Systems While Varying Granularity
- T1 (Geometric) vs. T3 (Binomial)
- Origin performance (Mnodes/sec) vs. cluster performance (Mnodes/sec)
- Chunk size plotted on a linear scale for the Origin, but on a log scale for the cluster
- Note the sensitivity to chunk size

33-34. Comparing Parallel Efficiency
- Efficiency at the optimal chunk size for each machine
- Near-ideal scaling on the Origin for both UPC and OpenMP
- Poor scaling on the cluster
- Fig. 4 (right). Parallel efficiency on the Dell cluster

35-37. Comparing Thread Behavior
- Fig. 5. PARAVER time charts for T3 on 4 vs. 16 processors of OSC's P4 cluster
- Fig. 6. PARAVER time chart for T3 exploration on 16 processors of the Origin 2000
- Legend: dark blue is work time; light blue is time searching for work; white is idle time
- Cluster (4 & 16 threads): trouble finding work; Origin 2000 (16 threads): almost all work
- Themes: work stealing granularity, termination detection

Event records are taken at an OpenMP or UPC barrier. This gives millisecond precision on the clusters we tested, but more sophisticated methods could likely do better. The records are examined after the run using the PARAVER (PARallel Visualization and Events Representation) tool [12]. PARAVER displays a series of horizontal bars, one for each thread, above a time axis; the color of each bar at each time value corresponds to that thread's state. In addition, yellow vertical lines are drawn between the bars representing the thief and victim threads at the time of each steal. Figure 5 shows PARAVER time charts for two runs on OSC's P4 Xeon cluster at chunk size 64. The bars on the top trace (4 threads) are mostly dark blue, indicating that the threads are at work. The second trace (16 threads) shows a considerable amount of white, showing that much of the time threads are searching for work. There is also noticeable idle time (shown in light blue) in the termination phase. The behavior of the Origin with 16 threads, shown in Fig. 6, is much better, with all threads working almost all of the time.
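For illustration, here is one common termination-detection idea, sketched in OpenMP; this is an assumption for exposition, not necessarily the benchmark's actual scheme:

```c
#include <omp.h>

typedef struct StealStack StealStack;
int try_steal_any(StealStack *s);  /* assumed helper: attempt one steal */

int idle_count = 0;  /* shared count of idle threads, updated atomically */

/* Called by a thread whose stack is empty: returns 1 when a steal
 * succeeds, 0 when all threads are idle and the run can terminate. */
int seek_work(StealStack *s, int nthreads) {
    #pragma omp atomic
    idle_count++;
    while (1) {
        if (try_steal_any(s)) {
            #pragma omp atomic
            idle_count--;
            return 1;            /* found work: back to exploring */
        }
        int idle;
        #pragma omp atomic read
        idle = idle_count;
        if (idle == nthreads)
            return 0;            /* everyone is idle: terminate */
    }
}
```

The idle-phase white and light-blue regions in the traces correspond to time spent in a loop like this one, which is why termination detection shows up as a distinct cost at scale.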

38-41. Expressing Locality: OpenMP vs. UPC on Shared Memory
- OpenMP lacks a portable explicit locality specifier
- This results in false sharing on some systems
- The problem was resolved using dynamic allocation (sketched after Table 2 above); after the fix, OpenMP performance recovers
- No problems for UPC either way
- Results shown for the Origin 2000 (Fig. 3)

42-43. Larger Problem (Tree) Size
- Run on the SGI Altix
- Both trees are the same type (binomial); the larger shows a 3X difference
- Fig. 11. Speedup on the SGI Altix

44. Comparing Performance on Various Machines
- T3 using UPC on 16 processors
- If it scaled better, the Blade cluster would outperform the Altix 3800
- Fig. 10. Performance for T3 on 16 processors; Fig. 11. Speedup on the Altix

45. Issues Addressed
- Measure the ability of parallel systems to perform continuous dynamic load balancing
- Understand the trade-offs between communication costs and granularity of load balancing, on shared memory and on distributed memory
- Compare parallel programming languages: expressibility and ease of programming; portability and performance portability

46. Conclusions
- UPC versus OpenMP on shared memory:
  - Similar in ease of programming
  - UPC has a superior facility for specifying locality
  - Both exhibit near-ideal scaling, with comparable absolute performance
- UPC implementation on distributed memory:
  - Frequent access to shared variables for coordinating the work stealing and termination detection leads to poor performance and scaling
- Several factors impact performance: data locality, granularity of load balancing, problem size

47. Ongoing and Future Work
- MPI versions of UTS: work sharing and work stealing methods
- Further tuning of the UPC version, especially to improve distributed memory performance:
  - Reducing or replacing many shared variable accesses loses the advantage of UPC abstractions for programmability
  - Trade-off with a possible decrease in shared memory performance
  - Fails to provide performance portability
- Performance testing on other systems
