Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources


1 Performance Analysis of Thread Mappings with a Holistic View of the Hardware Resources
Wei Wang, Tanima Dey, Jason Mars, Lingjia Tang, Jack Davidson, Mary Lou Soffa
Department of Computer Science, University of Virginia
ISPASS 2012
This research is supported in part by NSF grant number CCF

2 Motivation
Chip-multiprocessors offer a large number of cores and ample resources
The number of simultaneously executing applications is increasing
Careful resource management is critical
Thread mapping is a powerful technique for resource management
ISPASS 2012 Wang et al., University of Virginia 2

3 Challenges for Thread Mapping
Multiple resources are affected
Threads demonstrate various run-time characteristics
Multi-threaded workloads are emerging

4 Goal of this Research
Analyze why a particular thread mapping is better than another mapping:
  What are the resources that cause the performance differences?
  What are the thread characteristics that cause the resource utilization differences?
  What is the relative importance of various resources?

5 Contributions
In-depth performance analyses of various thread mappings using multi-threaded applications on real hardware
Identify the key hardware resources
Determine the impact on key resource utilization
Introduce a new metric, L2MP, to analyze the performance of the combined memory resources
Provide a ranking of the resources

6 Outline
Motivation
Challenges
Contributions
Overview: resource, metric, mappings
Analysis: prefetchers, processor cores
Key findings for thread mapping
Conclusion

7 Overview
A comprehensive analysis considering various factors:
  Application's performance
  Application's characteristics
  Hardware resources shared by applications
  Utilization of the resources

8 Resources and Metrics
Resources
  Memory resources: L1 I/D, I/D TLB, L2, prefetchers, memory interconnect
  Processor resources: memory disambiguation units, branch predictors, processor core
Metrics
  Cache misses, mis-predictions, memory latency (with hardware performance counters (HPCs))
  Processor utilization (from OS)
  Execution cycles and execution time

9 Thread Characteristics of Multithreaded Applications
Single thread characteristics
  Cache demand
  Memory bandwidth demand
  I/O frequency
  Prefetcher effectiveness
  Prefetcher excessiveness
Multiple thread characteristics (sibling threads)
  Data and instruction sharing
  Frequency of synchronization

10 Four Thread Mappings
Mapping   Core 0 (LLC0)   Core 1 (LLC0)   Core 2 (LLC1)   Core 3 (LLC1)
OSMap     Any thread      Any thread      Any thread      Any thread
IsoMap    a1, a1          a1, a1          a2, a2          a2, a2
IntMap    a1, a1          a2, a2          a1, a1          a2, a2
SprMap    a1, a2          a1, a2          a1, a2          a1, a2
[Diagram: threads of App 1 and App 2 placed on Cores 0-3; each core has its own L1 cache and TLB; each pair of cores shares an L2 cache and hardware prefetchers; both pairs connect to the off-chip memory interconnect]
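
The three fixed mappings in the table can be written down as data and inspected programmatically. The sketch below is illustrative only: the names `MAPPINGS`, `LLC_OF_CORE`, `threads_per_llc`, and `apps_sharing_core` are hypothetical, and it assumes that Cores 0 and 1 sit behind LLC0 and Cores 2 and 3 behind LLC1, as the setup diagram suggests. OSMap is left out because the OS may place any thread on any core.

```python
from collections import Counter

# Assumption: cores 0-1 share LLC0 and cores 2-3 share LLC1.
LLC_OF_CORE = {0: 0, 1: 0, 2: 1, 3: 1}

# Threads placed on each core, copied from the mapping table (a1 = App 1, a2 = App 2).
MAPPINGS = {
    "IsoMap": {0: ["a1", "a1"], 1: ["a1", "a1"], 2: ["a2", "a2"], 3: ["a2", "a2"]},
    "IntMap": {0: ["a1", "a1"], 1: ["a2", "a2"], 2: ["a1", "a1"], 3: ["a2", "a2"]},
    "SprMap": {0: ["a1", "a2"], 1: ["a1", "a2"], 2: ["a1", "a2"], 3: ["a1", "a2"]},
}


def threads_per_llc(mapping):
    """Count how many threads of each application sit behind each LLC."""
    counts = {0: Counter(), 1: Counter()}
    for core, threads in mapping.items():
        counts[LLC_OF_CORE[core]].update(threads)
    return counts


def apps_sharing_core(mapping):
    """Return True if any core runs threads from more than one application."""
    return any(len(set(threads)) > 1 for threads in mapping.values())


if __name__ == "__main__":
    for name, mapping in MAPPINGS.items():
        print(name, dict(threads_per_llc(mapping)), apps_sharing_core(mapping))
```

Read this way, IsoMap isolates each application behind its own LLC, IntMap mixes the two applications within each LLC but not within a core, and SprMap mixes them on every core.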

11 Experimental Setup
Platform & Workloads
  Intel Core 2 Q9550 processor
  PARSEC benchmark suite
  All possible pairs (36) using the 9 benchmarks
  4 worker threads per benchmark
[Diagram: Cores 0 and 1, each with L1 and TLB, share an L2 cache and hardware prefetchers; Cores 2 and 3 do the same; both L2 caches connect to the memory controller and memory]
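
The count of 36 workload pairs follows from choosing 2 of the 9 benchmarks without repetition. A minimal sketch that enumerates them is shown here; the benchmark names are the nine listed on the "Findings for Multiple Resources" slide, while the variable names are hypothetical.

```python
from itertools import combinations

# The nine PARSEC benchmarks named later in the deck.
BENCHMARKS = [
    "streamcluster", "canneal", "facesim", "fluidanimate",
    "swaptions", "blackscholes", "vips", "x264", "bodytrack",
]
THREADS_PER_BENCHMARK = 4  # worker threads per benchmark, from the slide

# Every unordered pair of distinct benchmarks: C(9, 2) pairs.
PAIRS = list(combinations(BENCHMARKS, 2))

# Each co-run places two benchmarks on the machine at once.
THREADS_PER_RUN = 2 * THREADS_PER_BENCHMARK

if __name__ == "__main__":
    print(len(PAIRS), "pairs,", THREADS_PER_RUN, "worker threads per co-run")
```

With 8 worker threads on 4 cores, each core hosts two threads, which matches the two entries per core in the mapping table.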

12 Key Resources
A key resource is identified when:
  Utilization of the resource varies considerably
  Utilization variation results in a difference in the application's performance
Identification technique
  Direct approach: use HPCs
  Indirect approach: use the application's performance in different mappings
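
One way to make the two criteria concrete is to measure a resource's metric and the execution cycles under each mapping, then check both the spread of the metric and how closely it tracks performance. The sketch below is an assumption-laden illustration, not the authors' procedure: the thresholds `min_spread` and `min_corr`, the use of relative standard deviation and Pearson correlation, and all function names are hypothetical.

```python
from statistics import mean, pstdev


def relative_spread(values):
    """Population standard deviation divided by the mean (0.0 if the mean is 0)."""
    m = mean(values)
    return pstdev(values) / m if m else 0.0


def pearson(xs, ys):
    """Pearson correlation coefficient; returns 0.0 when either series is constant."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def is_key_resource(metric_by_mapping, cycles_by_mapping,
                    min_spread=0.10, min_corr=0.8):
    """Hypothetical test: the metric varies considerably across mappings
    AND that variation goes along with a change in execution cycles."""
    names = sorted(metric_by_mapping)
    metric = [metric_by_mapping[n] for n in names]
    cycles = [cycles_by_mapping[n] for n in names]
    varies = relative_spread(metric) >= min_spread
    tracks_performance = abs(pearson(metric, cycles)) >= min_corr
    return varies and tracks_performance
```

The per-mapping metric values would come from HPC readings in the direct approach; any numbers fed to this function in an example are placeholders rather than measured results.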

13 Key Resources
More important resources
  Memory resources: L1D-cache, L2-cache, hardware prefetchers, memory interconnect
  Processor resources: branch predictor, processor core
Less important resources
  L1I-cache, I/D TLB, memory disambiguation unit

14 Analysis: Hardware Prefetchers
Experimental results: streamcluster (with blackscholes)

15 Key Findings for Hardware Prefetchers
Case 1: Threads that share a large amount of data
Sharing the same cache improves performance

16 Key Findings for Hardware Prefetchers
Case 2: Threads that have low or no data sharing but high prefetcher excessiveness
Sharing the same prefetchers improves performance

17 Key Findings for Hardware Prefetchers
Case 3: Threads that have low data sharing and low prefetcher excessiveness
Fewer cache misses and prefetch operations improve performance
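
The three prefetcher cases can be summarized as a small decision rule. The sketch below assumes the two thread characteristics have already been classified as high or low; how that classification is made is not specified on the slides, and the function name and boolean inputs are hypothetical.

```python
def prefetcher_guidance(data_sharing_high: bool, prefetcher_excess_high: bool) -> str:
    """Map the three cases from the slides to their stated finding."""
    if data_sharing_high:
        # Case 1: threads share a large amount of data.
        return "share the same cache"
    if prefetcher_excess_high:
        # Case 2: low or no data sharing, high prefetcher excessiveness.
        return "share the same prefetchers"
    # Case 3: low data sharing and low prefetcher excessiveness.
    return "reduce cache misses and prefetch operations"


if __name__ == "__main__":
    for sharing in (True, False):
        for excess in (True, False):
            print(sharing, excess, "->", prefetcher_guidance(sharing, excess))
```

Checking data sharing first reflects that Case 1 is stated without any condition on prefetcher excessiveness.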

18 Analysis: Processor Cores
Processor utilization

19 Analysis: Processor Cores
Performance impact

20 Key Findings for Processor Cores
Case 1: Sibling threads have frequent synchronization

21 Key Findings for Processor Cores
Case 2: Sibling threads have frequent I/O operations

22 Managing Multiple Resources: Example
L2 caches, prefetchers, and memory bandwidth are closely related resources
A single metric to evaluate their aggregated performance impact
L2MP: L2-cache-misses-memory-latency-product
L2MP = L2_cache_misses × Memory_latency
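
A short worked example shows how the product combines the two measurements. The numbers below are made-up placeholders, not results from the study; the latency unit (cycles) and the reading that a lower product is preferable, since both factors are costs, are assumptions.

```python
def l2mp(l2_cache_misses: int, memory_latency: float) -> float:
    """L2MP = L2_cache_misses x Memory_latency, as defined on the slide."""
    return l2_cache_misses * memory_latency


# Hypothetical (L2 misses, average memory latency in cycles) per mapping.
SAMPLES = {
    "IsoMap": (1_000_000, 200.0),
    "IntMap": (900_000, 250.0),
    "SprMap": (1_100_000, 190.0),
}

SCORES = {name: l2mp(misses, latency) for name, (misses, latency) in SAMPLES.items()}

# Assumption: lower L2MP is better, so rank in ascending order.
RANKED = sorted(SCORES, key=SCORES.get)

if __name__ == "__main__":
    for name in RANKED:
        print(f"{name}: L2MP = {SCORES[name]:.3e}")
```

In this made-up data, IntMap has the fewest misses and SprMap the lowest latency, yet IsoMap has the lowest product, which is the kind of trade-off a single combined metric is meant to capture.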

23 L2MP
L2MP is a good indicator of performance

24 Managing Multiple Resources
Thread mapping algorithms
  Consider all the key resources together
  Improve the utilization of the resources that provide the maximum benefit
  Consider the co-running application's characteristics

25 Findings for Multiple Resources
For memory-intensive applications (streamcluster, canneal, facesim, fluidanimate)
  Minimize the L2MP metric
For I/O- or CPU-intensive applications (swaptions, blackscholes, vips, x264, bodytrack)
  Maximize processor utilization
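
The two findings suggest a per-class objective when choosing among candidate mappings. The sketch below is one hypothetical way to express that: the function `pick_mapping`, the measurement keys `l2mp` and `cpu_util`, and the shape of the input are assumptions, and L2MP is treated as a cost (misses times latency) so the lowest value is selected.

```python
MEMORY_INTENSIVE = {"streamcluster", "canneal", "facesim", "fluidanimate"}
IO_OR_CPU_INTENSIVE = {"swaptions", "blackscholes", "vips", "x264", "bodytrack"}


def pick_mapping(benchmark, measurements):
    """Choose a mapping name from per-mapping measurements.

    measurements: mapping name -> {"l2mp": float, "cpu_util": float}
    (hypothetical structure; values would come from HPCs and the OS).
    """
    if benchmark in MEMORY_INTENSIVE:
        # Assumption: L2MP is a cost, so prefer the smallest value.
        return min(measurements, key=lambda m: measurements[m]["l2mp"])
    if benchmark in IO_OR_CPU_INTENSIVE:
        # Prefer the mapping with the highest processor utilization.
        return max(measurements, key=lambda m: measurements[m]["cpu_util"])
    raise ValueError(f"unclassified benchmark: {benchmark}")
```

A real mapping algorithm would also need to weigh the co-running application, since a mapping that suits one application of a pair may not suit the other.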

26 Conclusion
Identified six key resources
Analyzed how to map threads with particular characteristics to improve resource utilization
Introduced a new metric, L2MP, for managing key memory resources
Determined the relative importance of the key resources

27 Related Work
Shared-cache-aware thread mapping
  Jiang et al., PACT 2008
  Chandra et al., HPCA 2005
  Xie et al., CMP-MSI 2008
  Knauerhase et al., IEEE Micro 2008
Cache-, prefetcher-, and FSB-aware thread mapping
  Zhuravlev et al., ASPLOS 2010

28 Thank you & Questions?
