UTS: An Unbalanced Tree Search Benchmark
1 UTS: An Unbalanced Tree Search Benchmark LCPC
2 Coauthors Stephen Olivier, UNC Jun Huan, UNC/Kansas Jinze Liu, UNC Jan Prins, UNC James Dinan, OSU P. Sadayappan, OSU Chau-Wen Tseng, UMD Also, thanks to Ohio Supercomputer Center, European Center for Parallelism in Barcelona, and William Pugh 2
3 Motivating the Problem Suppose we have a large search space and a tree-structured search. How could we do this in parallel? 3
4 Motivating the Problem Suppose we have a large search space and a tree-structured search. How could we do this in parallel? Now suppose the tree is highly unbalanced. 4
5 Meet UTS Unbalanced computation requiring continuous dynamic load balancing for high performance Exposes trade-offs between communication costs and granularity of load balance Fundamentally disadvantages systems with coarse-grained communication networks Challenges the expressibility and performance portability of parallel programming languages 5
6 Benchmark Characteristics Implicit Generation Tree nodes generated on the fly Mimics systematic search Well-suited for parallel exploration Imbalance Large variation in subtree sizes Parameterization Can create trees of different depths and sizes 6
7 Outline Tree Generation Implementation Performance Analysis Conclusions and Future Work 7
8 Generation Strategy Each node consists of a 20-byte descriptor Implicit child creation using a cryptographic hash: (Parent Descriptor, Child Index) → Hash → Child Node Descriptor Number of children determined by sampling a probability distribution using the descriptor 8
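The hashing step on this slide can be sketched in a few lines of C. This is only an illustration, assuming OpenSSL's SHA1() stands in for the benchmark's 20-byte cryptographic hash and using made-up type and function names; the actual UTS code bundles its own SHA-1-based generator.

  #include <stdint.h>
  #include <string.h>
  #include <openssl/sha.h>     /* SHA1(): 20-byte digest */

  #define DESC_LEN 20          /* each node is a 20-byte descriptor */

  typedef struct { unsigned char desc[DESC_LEN]; } Node;

  /* Child i's descriptor = hash(parent descriptor, child index). */
  void gen_child(const Node *parent, int child_index, Node *child)
  {
      unsigned char buf[DESC_LEN + sizeof(uint32_t)];
      uint32_t idx = (uint32_t)child_index;

      memcpy(buf, parent->desc, DESC_LEN);
      memcpy(buf + DESC_LEN, &idx, sizeof idx);
      SHA1(buf, sizeof buf, child->desc);
  }

Because the hash is deterministic, any thread that holds a parent descriptor can regenerate that parent's children on the fly, which is what makes the tree implicit and reproducible across runs and machines.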
9 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree How large a subtree at each?... 9
10 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree How large a subtree at each?... Don't know until we've explored them... 10
11 Generating Binomial Trees In a binomial tree, each node has 0 or m children A node has m children with probability q and 0 children with probability 1 − q When qm < 1, the expected tree size is 1/(1 − qm) As qm approaches 1, subtree size variation increases (Imbalance!) Expected subtree size rooted at each node is independent of its location in the tree Now we know 11
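As a rough sketch of the binomial rule just described (not the benchmark's exact code), the child count can be decided by treating part of the node's descriptor as a uniform random value and comparing it with q:

  #include <stdint.h>

  #define DESC_LEN 20

  /* With probability q the node has m children, otherwise it has none.
     Assumes the first 4 descriptor bytes behave as a uniform random value;
     the real UTS derives its random stream from the SHA-1 state. */
  int num_children_binomial(const unsigned char desc[DESC_LEN], double q, int m)
  {
      uint32_t r = ((uint32_t)desc[0] << 24) | ((uint32_t)desc[1] << 16) |
                   ((uint32_t)desc[2] << 8)  |  (uint32_t)desc[3];
      double u = r / 4294967296.0;   /* scale to [0, 1) by dividing by 2^32 */

      return (u < q) ? m : 0;
  }

Since every node draws from the same distribution, the expected size of the subtree below any node is the same everywhere in the tree, yet realized subtree sizes vary enormously as qm approaches 1 — exactly the imbalance the benchmark is after.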
12 Generating Geometric Trees Use geometric distribution to determine the number of children Expected value specified as a parameter Depth limitation without which trees could grow indefinitely Long tail of geometric distribution can cause some nodes to have many more children than others (Imbalance!) 12
13 Generating Geometric Trees Use geometric distribution to determine the number of children Expected value specified as a parameter Depth limitation without which trees could grow indefinitely Long tail of geometric distribution can cause some nodes to have many more children than others (Imbalance!) Lucky sampled value from tail 13
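A comparable sketch for the geometric tree (again illustrative, not the benchmark's exact code): draw the child count from a geometric distribution with mean b by inverting its CDF, and cut the tree off at the depth limit. The real UTS additionally shapes the expected branching factor as a function of depth.

  #include <math.h>

  /* u is uniform in (0,1); b is the expected branching factor. */
  int num_children_geometric(double u, double b, int depth, int max_depth)
  {
      if (depth >= max_depth)
          return 0;                      /* depth limit keeps the tree finite */

      /* Geometric distribution on {0,1,2,...} with mean b:
         P(n) = p * (1-p)^n where p = 1/(1+b); invert the CDF with u. */
      double p = 1.0 / (1.0 + b);
      return (int)floor(log(u) / log(1.0 - p));
  }

The long tail of this distribution is what occasionally hands one node far more children than its siblings.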
14 Implementation Goal is to explore the entire tree Each thread performs a depth-first traversal, starting from a given node, using a stack When a thread runs out of work, it must be able to acquire more nodes Work sharing - busy threads with surplus work are responsible for offloading it to idle threads Work stealing - idle threads are responsible for seeking out surplus work at other threads, so busy threads are always making progress on the goal 14
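The traversal each thread performs can be pictured with a small sketch. Stack, stack_push/stack_pop, num_children, gen_child, and acquire_work are illustrative names, not the actual UTS interfaces; acquire_work stands in for whichever load-balancing mechanism (sharing or stealing) supplies new nodes.

  #include <stdbool.h>

  typedef struct { unsigned char desc[20]; } Node;   /* 20-byte descriptor */
  typedef struct Stack Stack;                        /* opaque; illustrative */

  bool stack_empty(Stack *s);
  Node stack_pop(Stack *s);
  void stack_push(Stack *s, const Node *n);
  int  num_children(const Node *n);                  /* sample the tree's distribution */
  void gen_child(const Node *parent, int i, Node *child);
  bool acquire_work(Stack *s);                       /* steal or receive a chunk; false = done */

  void explore(Stack *stk, long *nodes_seen)
  {
      for (;;) {
          while (!stack_empty(stk)) {
              Node n = stack_pop(stk);
              (*nodes_seen)++;

              int kids = num_children(&n);
              for (int i = 0; i < kids; i++) {
                  Node child;
                  gen_child(&n, i, &child);          /* hash(parent, child index) */
                  stack_push(stk, &child);
              }
          }
          /* Local stack is empty: obtain more nodes or terminate. */
          if (!acquire_work(stk))
              break;
      }
  }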
15 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: the stack top is local access only; the stack bottom allows shared access for steals 15
16 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: no locking on the local-access-only top portion; the stack bottom allows shared access for steals 16
17 Implementation Details How to express in OpenMP & UPC? Shared variables to express program and thread state Concurrency control only when necessary, due to its high expense Local stack operations performed without locking (the most frequent operations) Steal operations performed on shared stack with locking Fig. 2. A thread's steal stack: no locking on the local-access-only top portion; locking on the shared-access bottom portion used for steals 17
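The split stack in Fig. 2 might look roughly like the following. This is a sketch under assumptions, not the UTS source: an OpenMP lock guards only the shared (bottom) region, and nodes move between regions in chunks of a fixed size so that locking is amortized.

  #include <omp.h>
  #include <string.h>

  #define CHUNK_K   64                  /* chunk size k */
  #define MAX_NODES (1 << 20)

  typedef struct { unsigned char desc[20]; } Node;

  typedef struct {
      Node nodes[MAX_NODES];
      int  shared_top;   /* [0, shared_top): shared region, lock-protected   */
      int  local_top;    /* [shared_top, local_top): owner-only, no locking  */
      omp_lock_t lock;   /* assume omp_init_lock() was called at startup     */
  } StealStack;

  /* Owner exposes its bottom CHUNK_K local nodes for stealing. */
  void release_chunk(StealStack *s)
  {
      if (s->local_top - s->shared_top < 2 * CHUNK_K)
          return;                        /* keep some work for ourselves */
      omp_set_lock(&s->lock);
      s->shared_top += CHUNK_K;          /* boundary moves up; nodes stay in place */
      omp_unset_lock(&s->lock);
  }

  /* Thief copies a chunk of k nodes out of the victim's shared region. */
  int steal_chunk(StealStack *victim, Node out[CHUNK_K])
  {
      int got = 0;
      omp_set_lock(&victim->lock);
      if (victim->shared_top >= CHUNK_K) {
          victim->shared_top -= CHUNK_K;
          memcpy(out, &victim->nodes[victim->shared_top], CHUNK_K * sizeof(Node));
          got = CHUNK_K;
      }
      omp_unset_lock(&victim->lock);
      return got;
  }

The owner's ordinary pushes and pops touch only indices at or above shared_top, so the common case involves no lock at all; only the chunk-granularity release and steal operations pay for synchronization.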
18 Sample Trees Table 1. Sample trees, with parameters and their resulting depths and sizes in millions of nodes. All have nearly the same number of nodes but different depths. r is the seed for the root node, and can be varied to generate different trees in the same parameter class. From the accompanying paper text: steal operations must be serialized through a lock; to amortize overheads, nodes can only move in chunks of size k between stack regions or between shared areas in different stacks. Tree / Type / b0 / d / q / m / r / Depth / MNodes: T1 Geometric; T2 Geometric; T3 Binomial 18
19 Sample Trees Table 1 (as on slide 18): T1 Geometric, T2 Geometric, T3 Binomial; all have nearly the same number of nodes but different depths, and r can be varied to generate different trees with the same other parameters 19
20 Sample Trees Table 1 (as on slide 18): T1 Geometric, T2 Geometric, T3 Binomial; all have nearly the same number of nodes but different depths, and r can be varied to generate different trees with the same other parameters 20
21 Sequential Exploration Rates Table 2. Sequential performance for all trees (thousands of nodes per second). Note the variation with tree type (T1 and T2 are geometric trees; T3 is a binomial tree). System / Processor / Compiler: Cray X1 / Vector (800 MHz) / Cray; SGI Origin 2000 / MIPS (300 MHz) / GCC; SGI Origin 2000 / MIPS (300 MHz) / MIPSpro; Sun SunFire 6800 / Sparc9 (750 MHz) / Sun C; P4 Xeon Cluster (OSC) / P4 Xeon (2.4 GHz) / Intel; Mac Powerbook / PPC G4 (1.33 GHz) / GCC; SGI Altix 3800 / Itanium2 (1.6 GHz) / GCC; SGI Altix 3800 / Itanium2 (1.6 GHz) / Intel; Dell Blade Cluster (UNC) / P4 Xeon (3.6 GHz) / Intel. From the accompanying paper text: as a result, we replaced the array of per-thread data with an array of pointers to dynamically allocated structures. During parallel execution, each thread dynamically allocates memory for its own structure with affinity to the thread performing the allocation. This adjustment eliminated false sharing, but was unnecessary for UPC. The facilities for explicit distribution of data in OpenMP are weaker than those of UPC. 21
22 Performance on Shared Memory (SGI Origin 2000) Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 22
23 Performance on Shared Memory (SGI Origin 2000) UPC Slower Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 23
24 Performance on Shared Memory (SGI Origin 2000) T1 vs T2 Fig. 3. Parallel performance on the Origin 2000 using OpenMP and UPC. On the left, results for geometric trees T1 and T2; on the right, results for the binomial tree T3 24
25 Performance on Distributed Memory Fig. 4. Performance on the Dell blade cluster (left); parallel efficiency on the Dell cluster versus the Origin 2000 (right) 25
26 Granularity and Scaling Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 26
27 Granularity and Scaling Stealing overhead dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 27
28 Granularity and Scaling Stealing overhead dominates Load imbalance dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 28
29 Granularity and Scaling Stealing overhead dominates Sweet Spot Load imbalance dominates Geometric tree T1 on the SGI Origin 2000 Chunk size governs the amount of locking and communication 29
30 Comparing Trees & Systems While Varying Granularity [Plots: Origin performance and cluster performance (Mnodes/sec) vs. chunk size, for T1 (geometric) and T3 (binomial); chunk size shown on a linear scale in one column and a log scale in the other] 30
31 Comparing Trees & Systems While Varying Granularity [Same plots as slide 30; note that one chunk-size axis is a log scale!] 31
32 Comparing Trees & Systems While Varying Granularity [Same plots as slide 30] Sensitive to chunk size 32
33 Comparing Parallel Efficiency Fig. 4 (right): parallel efficiency on the Dell cluster versus the Origin 2000 Efficiency at optimal chunk size for each machine Near-ideal scaling on the Origin with UPC and OpenMP Poor scaling on the cluster 33
34 Comparing Parallel Efficiency Fig. 4 (right): parallel efficiency on the Dell cluster versus the Origin 2000 Efficiency at optimal chunk size for each machine Near-ideal scaling on the Origin with UPC and OpenMP Poor scaling on the cluster 34
35 Comparing Thread Behavior Cluster (4 & 16 threads): Fig. 5. PARAVER time charts for T3 on 4 vs. 16 processors of OSC's P4 cluster. Origin 2000 (16 threads): Fig. 6. PARAVER time chart for T3 exploration on 16 processors of the Origin 2000. Dark blue is work time, light blue is time searching for work, white is idle time. Work Stealing Granularity. From the accompanying paper text: ... an OpenMP or UPC barrier. This gives millisecond precision on the clusters we tested, but more sophisticated methods could likely do better. The records are examined after the run using the PARAVER (PARallel Visualization and Events Representation) tool [12]. PARAVER displays a series of horizontal bars, one for each thread, above a time axis; the color of each bar at each time value corresponds to that thread's state. In addition, yellow vertical lines are drawn between the bars representing the thief and victim threads at the time of each steal. Figure 5 shows PARAVER time charts for two runs on OSC's P4 Xeon cluster at chunk size 64. The bars on the top trace (4 threads) are mostly dark blue, indicating that the threads are at work. The second trace (16 threads) shows a considerable amount of white, showing that much of the time threads are searching for work. There is also noticeable idle time (shown in light blue) in the termination phase. The behavior of the Origin 2000 with 16 threads, shown in Fig. 6, is much better, with all threads working almost all of the time. 35
36 Comparing Thread Behavior (same charts and legend as slide 35) Trouble finding work; Work Stealing Granularity 36
37 Comparing Thread Behavior (same charts and legend as slide 35) Termination Detection; Work Stealing Granularity 37
38 Expressing Locality: OpenMP vs. UPC on Shared Memory OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 38
39 Expressing Locality: OpenMP vs. UPC on Shared Memory !!! OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 39
40 Expressing Locality: OpenMP vs. UPC on Shared Memory After Fix !!! OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 40
41 Expressing Locality: OpenMP vs. UPC on Shared Memory No problems for UPC either way OpenMP lacks a portable explicit locality specifier Results in false sharing on some systems Problem resolved using dynamic allocation Results shown for the Origin 2000 41
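The dynamic-allocation fix mentioned on these slides (and in the paper text quoted on slide 21) can be sketched as follows; the StealStack fields and names here are illustrative, not the actual UTS declarations. The point is that each thread allocates and first-touches its own structure inside the parallel region, so the memory lands near that thread and neighboring threads no longer share cache lines.

  #include <stdlib.h>
  #include <omp.h>

  typedef struct {
      long nodes_explored;           /* per-thread counters, stack state, ... */
  } StealStack;

  /* Before: a single static array, StealStack stacks[MAX_THREADS];
     neighboring elements can fall on the same cache line or page,
     causing false sharing between threads.                          */

  StealStack **stacks;               /* after: one pointer per thread */

  void init_thread_state(void)
  {
      int nthreads = omp_get_max_threads();
      stacks = malloc(nthreads * sizeof(StealStack *));

      #pragma omp parallel
      {
          int me = omp_get_thread_num();
          /* Allocation and first touch happen on the owning thread,
             giving the structure affinity to that thread's memory.  */
          stacks[me] = calloc(1, sizeof(StealStack));
      }
  }

UPC did not need this workaround: its shared data carry an explicit affinity specification, which is consistent with the slides' point that UPC has a superior facility for expressing locality.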
42 Larger Problem (Tree) Size Altix processors used Both trees are of the same type (binomial) Fig. 11. Speedup on the SGI Altix 42
43 Larger Problem (Tree) Size Altix processors used 3X Both trees are of the same type (binomial) Fig. 11. Speedup on the SGI Altix 43
44 Comparing Performance on Various Machines T3 using UPC on 16 processors If it scaled better, the blade cluster would outperform the Altix 3800 Fig. 10. Performance for T3 on 16 processors 44
45 Issues Addressed Measure the ability of parallel systems to perform continuous dynamic load balancing Understand the trade-offs between communication costs and granularity of load balance On shared memory and on distributed memory Compare parallel programming languages Expressibility and ease of programming Portability and performance portability 45
46 Conclusions UPC versus OpenMP on shared memory Similar in ease of programming UPC has superior facility for specifying locality Both exhibit near-ideal scaling Comparable absolute performance UPC implementation on distributed memory Frequent access of shared variables for coordinating the work stealing and termination detection leads to poor performance and scaling Several factors impact performance Data locality Granularity of load balance Problem size 46
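As context for the shared-variable cost named in the conclusions, here is a deliberately simplified counter-based idle-thread scheme, not the actual UTS termination algorithm (which must also guard against work still in flight); try_steal and nthreads are illustrative.

  #include <stdbool.h>

  extern int  nthreads;              /* set to the team size at startup */
  extern bool try_steal(void);       /* hypothetical: grab a chunk from some victim */

  static int num_idle = 0;           /* shared: how many threads are out of work */

  /* Called by a thread whose local stack is empty.
     Returns true if it obtained more work, false if the run is over. */
  bool wait_for_work(void)
  {
      #pragma omp atomic
      num_idle++;

      for (;;) {
          int idle_now;
          #pragma omp atomic read
          idle_now = num_idle;

          if (idle_now == nthreads)
              return false;          /* every thread idle: declare termination */

          if (try_steal()) {
              #pragma omp atomic
              num_idle--;            /* leaving the idle set, back to real work */
              return true;
          }
      }
  }

Every pass through this loop reads and possibly writes shared state, which is cheap on a shared-memory machine but becomes repeated remote communication under UPC on a cluster — one concrete source of the poor distributed-memory scaling reported above.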
47 Ongoing and Future Work MPI versions of UTS Work sharing and work stealing methods Further tuning of UPC version, especially to improve distributed memory performance Reduce/replace many shared variable accesses Lose advantage of UPC abstractions for programmability Trade-off with possible decrease in shared memory performance Fails to provide performance portability Performance testing on other systems 47
