Multicore Processor and GPU. Jia Rao Assistant Professor in CS


1 Multicore Processor and GPU Jia Rao Assistant Professor in CS

2 Moore's Law
The number of transistors on integrated circuits doubles approximately every two years.
CPU performance doubles every two years.

3 The End of Moore's Law
CPU performance doubles every two years?
More transistors no longer means proportionally more performance.

4 Multicore Processors
If wider data paths, wider registers, bigger caches, deeper pipelines, and intelligent branch prediction can NOT double performance, what to do with the doubled transistors?
Put more cores on a chip: a multicore processor.

5 Multiprocessor Memory Types
Shared memory
- there is one (large) common shared memory for all processors
Distributed memory
- each processor has its own (small) local memory, and its content is not replicated anywhere else
Adapted from slides of pfenning@cmu

6 Multicore Processors Are a Special Kind of Multiprocessor
All processors are on the same chip (CMP)
MIMD (Multiple Instructions Multiple Data)
- Different cores execute different threads, operating on different parts of memory
Shared memory multiprocessor
- All cores share the same memory

7 Multicore Architecture
[Figure: two-socket Intel Nehalem NUMA machine. Labels: Memory node-0, Memory node-1, Processor-0, Processor-1, cross-socket interconnect.]

8 The Cache Coherence Problem
Replicate contents of memory in local caches
Processors can have different values for the same location
Reading at a shared address should return the last value written by any processor
[Figure: four processors, each with its own cache, connected through an interconnect to memory and I/O.]
Adapted from slides of

9 Coherency Mechanisms
Directory-based
- The data being shared is placed in a common directory that maintains the coherence between caches. The directory acts as a filter through which a processor must ask permission to load an entry from primary memory into its cache. When an entry is changed, the directory either updates or invalidates the other caches holding that entry.
Snooping
- The individual caches monitor the address lines for accesses to memory locations that they have cached. In a write-invalidate protocol, when a cache observes a write to a location it holds a copy of, its cache controller invalidates its own copy of the snooped memory location.

10 The MESI Protocol
All coherence-related activity is broadcast to all processors
Every cache line is in one of four states
- Modified: the cache line is present only in the current cache and is dirty (it has been modified from the value in memory)
- Exclusive: the cache line is present only in the current cache, and is clean
- Shared: the cache line may be stored in other caches, and is clean
- Invalid: the cache line is invalid

11 The MESI Protocol (cont.)
Processor events
- PrRd: read
- PrWr: write
Bus transactions
- BusRd: read request from the bus without intent to modify
- BusRdX: read request from the bus with intent to modify
- BusWB: write line out to memory
Accessing a cache line in the I state causes a cache miss
A write can only be performed if the cache line is in the E or M state. If it is in the S state, the processor first broadcasts a request for ownership (RFO) to invalidate other copies
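The state and event names on the two MESI slides can be read as a small state machine. The following is only a sketch under assumptions that are not in the slides: the function names, the `BUS_NONE` value, the `others_have_copy` flag, and the choice to model the RFO from the S state as a `BUS_RDX` transaction are all illustrative.

```c
#include <stdbool.h>

/* The four MESI states from the slides. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_state;

/* Bus transactions named on the slide; BUS_NONE (hypothetical) means
 * "no bus transaction is needed". */
typedef enum { BUS_NONE, BUS_RD, BUS_RDX, BUS_WB } bus_op;

/* Local read (PrRd). A line in I misses and issues BusRd; it is loaded
 * as Shared if another cache holds it, otherwise as Exclusive. */
mesi_state on_pr_rd(mesi_state s, bool others_have_copy, bus_op *bus) {
    if (s == INVALID) {
        *bus = BUS_RD;
        return others_have_copy ? SHARED : EXCLUSIVE;
    }
    *bus = BUS_NONE;            /* hit in S, E, or M */
    return s;
}

/* Local write (PrWr). Writes are only allowed in E or M, so a line in
 * I or S first requests ownership (assumed here to be BusRdX). */
mesi_state on_pr_wr(mesi_state s, bus_op *bus) {
    *bus = (s == INVALID || s == SHARED) ? BUS_RDX : BUS_NONE;
    return MODIFIED;
}

/* Reaction of a cache that snoops another processor's transaction.
 * A Modified line is written back before it is shared or invalidated. */
mesi_state on_snoop(mesi_state s, bus_op seen, bus_op *reply) {
    bool is_read = (seen == BUS_RD || seen == BUS_RDX);
    *reply = (s == MODIFIED && is_read) ? BUS_WB : BUS_NONE;
    if (seen == BUS_RDX) return INVALID;
    if (seen == BUS_RD && s != INVALID) return SHARED;
    return s;
}
```

For instance, a write to a Shared line goes through `on_pr_wr`, which reports `BUS_RDX`; every other cache that sees that transaction in `on_snoop` drops its copy to Invalid.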

12 Case Study: Intel Nehalem
Cache hierarchy of Intel Core i7
64 byte cache line size
L3 (per chip, shared, one bank per core, ring interconnect)
- 8MB - 12MB, inclusive
- 16-way set associative
- 32B / clock per bank
L2 (private per core)
- 256KB
- 8-way set associative, write back
- 32B / clock, 12 cycle latency
- 16 outstanding misses
L1 data cache (private per core)
- 32KB
- 8-way set associative, write back
- 2 x 16B loads + 1 x 16B store per clock
- 4-6 cycle latency
- 10 outstanding misses
Review, key terms: cache line, write back, inclusion
Adapted from slides of Fatahalian@cmu

13 Case Study: Intel Nehalem (cont.)
RdData request after LLC Miss to Local Home (Clean Rsp)
Excerpt from the guide visible on the slide:
- UNC_ADDR_OPCODE_MATCH.REMOTE.RSPFWDS: Remote L3 CACHE in F or S, load A
- UNC_ADDR_OPCODE_MATCH.REMOTE.RSPIWB: Hitm in remote L3 CACHE, load D
- These opcode uses can be seen from the dual socket QPI communications diagrams below. These predefined opcode match encodings can be used to monitor HITM accesses in particular and serve as the only event that allows profiling the requesting code on the basis of the HITM transfers its requests generate.
[Figure: QPI message flow between Socket 1 and Socket 2 (cores, LLC, GQ, QHL, IMC, and QPI in each uncore). Labels: DRd; Cache Lookup; Cache Miss; Sending Req to Local Home (socket 2 owns this address); RdData; Broadcast snoops to all other caching agents; SnpData; Send Snoop to LLC; RspI; Speculative mem Rd; Data; DataC_E_CMP; Fill complete to Socket2; Allocate in E state (I -> E).]
RdData after LLC miss and load from local memory
From Levinthal's perf. analysis guide

14 Case Study: Intel Nehalem (cont.)
RdData request after LLC Miss to Remote Home (Clean Rsp)
[Figure: QPI message flow between Socket 1 and Socket 2 (cores, LLC, GQ, QHL, IMC, and QPI in each uncore). Numbered labels: (1) DRd; (2) Cache Lookup; (3) Cache Miss; (4) RdData; (5) RdData; (6) SnpData and RdData; (7) Cache Lookup and Speculative mem Rd; (8) Clean Rsp; (9) Data and RspI; (10), (11), (12) DataC_E_cmp; (13) Allocate in E state (I -> E). Annotations: Sending Req to Remote Home (socket 1 owns this address); Send Snoop to LLC; Send Request to CHL; RspI indicates clean snoop; Send complete and Data to Socket2 to allocate in E state.]
RdData after LLC miss and load from remote memory
From Levinthal's perf. analysis guide

15 Case Study: Intel Nehalem (cont.)
RdData request after LLC Miss to Local Home (Hitm Response)
[Figure: QPI message flow between Socket 1 and Socket 2 (cores, LLC, GQ, QHL, IMC, and QPI in each uncore). Labels: DRd; Cache Lookup; Cache Miss; Sending Req to Local Home (socket 2 owns this address); RdData; Broadcast snoops to all other caching agents; SnpData; Send Snoop to LLC; Hitm Rsp (M -> I, Data); RspIWb; WbIData; Speculative mem Rd; WB; Data; DataC_E_Cmp; Send complete to Socket2; Allocate in E state (I -> E).]
Data written back to Remote Home. RspIWb is an NDR response: a hint to the home that wb data follows shortly, which is WbIData.
RdData after LLC miss, invalidate remote modified copy, and load from local memory
From Levinthal's perf. analysis guide

16 Performance Implications
Scalability issues
- significant coherence traffic when scaling to high core counts
- storage cost for tracking sharers
- increased latency of cache misses
False sharing
- two threads write to different variables residing on the same cache line, incurring significant amounts of coherence traffic

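17 False Sharing
May lead to false sharing:
    // allocate per-thread data
    long mydata[num_threads];
Padded so that each thread's data occupies its own 64-byte cache line:
    // allocate per-thread data, padded to a full cache line
    struct PerThreadData { long mydata; char padding[64 - sizeof(long)]; };
    struct PerThreadData mydata[num_threads];
[Figure: without padding, mydata[0], mydata[1], and mydata[2] sit on the same cache line, so access to mydata causes coherence traffic; with padding, mydata[0] and its padding fill one cache line.]
Adapted from slides of Fatahalian@cmu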
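Whether two per-thread slots collide on a cache line is simple address arithmetic. The sketch below assumes a 64-byte line (the Nehalem figure quoted earlier in the deck) and an array that starts on a line boundary; `CACHE_LINE`, `padded_counter`, `line_of`, and `same_line` are illustrative names, not part of the slides.

```c
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size in bytes */

/* Padded per-thread slot: one slot fills exactly one cache line. */
struct padded_counter {
    long value;
    char padding[CACHE_LINE - sizeof(long)];
};

/* Cache line index of a byte offset, measured from a line-aligned base. */
static size_t line_of(size_t offset) {
    return offset / CACHE_LINE;
}

/* Returns 1 if elements i and j of an array with the given element size
 * fall on the same cache line (assuming the array is line-aligned). */
int same_line(size_t elem_size, size_t i, size_t j) {
    return line_of(i * elem_size) == line_of(j * elem_size);
}
```

With plain `long` elements, `same_line(sizeof(long), 0, 1)` reports a shared line, which is the false-sharing case; with `sizeof(struct padded_counter)` as the element size, neighbouring slots never share a line.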

18 Shared Resource Contention
Contention could happen in different shared resources
- LLC, memory controller, hardware prefetcher, cross-socket interconnect
Contention is really harmful, leading to degraded and unpredictable performance
[Figure from "Contention-Aware Scheduling on Multicore Systems"]
Fig. 1. The performance degradation relative to running solo for two different schedules. CPU2006 applications on an Intel Xeon X3565 quad-core processor (two cores share an LL

19 Simultaneous Multithreading (SMT)
A technique complementary to multi-core
Problem addressed: the processor pipeline can get stalled
- Waiting for the result of a long floating point (or integer) operation
- Waiting for data to arrive from memory
- Other execution units wait unused
Solution: having two or more hardware threads per core
[Figure: processor block diagram. Labels: L2 Cache and Control, Bus, BTB, L1 D-Cache, D-TLB, Schedulers, Uop queues, Rename/Alloc, Trace Cache, Decoder, BTB and I-TLB, ucode ROM, Integer, Floating Point. Source: Intel]
Adapted from slides of pfenning@cmu

20 Intel's Hyperthreading
Replicated: register state, return stack buffer, large page ITLB
Partitioned: load buffer, store buffer, reorder buffer, small page ITLB
Dynamically shared: reservation station, caches, data TLB, 2nd level TLB
Unaware: execution units
[Figure: processor block diagram (Source: Intel) annotated with Thread-1: floating point and Thread-2: integer op.]
Adapted from slides of pfenning@cmu

21 Multiprocessor Scheduling
Per-CPU scheduler
- each processor has its own ready queue and runs pick_next_task() on it
Work migration to achieve load balancing
- Push migration: kick a task over to another processor's ready queue
- Pull migration: steal a task from another processor's ready queue
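Push and pull migration can be illustrated with a toy model in which each per-CPU ready queue is reduced to a task count. This is a sketch only: the `runqueue` type, the function names, and the "difference of at least two tasks" threshold are assumptions made for illustration, not how any particular kernel balances load.

```c
/* Hypothetical per-CPU ready queue: only the number of ready tasks. */
typedef struct { int nr_ready; } runqueue;

/* Index of the queue with the most ready tasks. */
static int busiest(const runqueue rq[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rq[i].nr_ready > rq[best].nr_ready) best = i;
    return best;
}

/* Index of the queue with the fewest ready tasks. */
static int idlest(const runqueue rq[], int n) {
    int best = 0;
    for (int i = 1; i < n; i++)
        if (rq[i].nr_ready < rq[best].nr_ready) best = i;
    return best;
}

/* Pull migration: an idle CPU steals one task from the busiest queue.
 * Returns the source CPU, or -1 if nothing was moved. */
int pull_migrate(runqueue rq[], int n, int self) {
    if (rq[self].nr_ready > 0) return -1;       /* not idle */
    int src = busiest(rq, n);
    if (src == self || rq[src].nr_ready < 2) return -1;
    rq[src].nr_ready--;
    rq[self].nr_ready++;
    return src;
}

/* Push migration: an overloaded CPU kicks one task to the idlest queue.
 * Returns the destination CPU, or -1 if the load is already balanced. */
int push_migrate(runqueue rq[], int n, int self) {
    int dst = idlest(rq, n);
    if (rq[self].nr_ready - rq[dst].nr_ready < 2) return -1;
    rq[self].nr_ready--;
    rq[dst].nr_ready++;
    return dst;
}
```

The two functions differ only in who initiates the move: the idle processor in the pull case, the loaded one in the push case.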

22 Limitations of Multicore Processors
The move from single-core to multicore is primarily due to
- Memory wall
- ILP wall
- Power wall
Multicore is still not performing well
- Lack of OS and application support for parallelization
- Limited scalability due to cache coherence and inter-processor synchronization
- Still hard to grow to high core counts due to the power wall
- Not all workloads require a deep pipeline or a branch predictor, so resources are wasted

23 GPU
Recap: Multicore uses MIMD architectures
GPU uses SIMD (single instruction multiple data) architectures to exploit data parallelism for
- matrix-oriented scientific computing
- media-oriented image/sound processing
SIMD is more energy efficient than MIMD
- only needs to fetch one instruction per data operation

24 GPU vs. CPU
GPU is designed for data parallel processing rather than data caching and flow control

25 GPU: Heterogeneous Computing
Heterogeneous execution model
- CPU is the host, GPU is the device
Develop a C-like programming language
- CUDA and OpenCL
Unify all forms of GPU parallelism as threads
Programming model is Single Instruction Multiple Thread (SIMT)

26 Threads and Blocks
A thread is associated with each data element
Threads are organized into blocks
Blocks are organized into a grid
GPU hardware handles thread management, not applications or OS

27 An Example
A = B * C
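The thread, block, and grid hierarchy can be mimicked on a CPU with two nested loops. The sketch below is plain C that emulates the mapping rather than actual GPU code, and it assumes that A = B * C means an element-wise product of two vectors; `blocks_needed`, `global_index`, and `vec_mul_grid` are illustrative names.

```c
/* Number of blocks needed to cover n elements with block_dim threads
 * per block (rounded up). */
int blocks_needed(int n, int block_dim) {
    return (n + block_dim - 1) / block_dim;
}

/* Element handled by a thread: the same formula a CUDA kernel would
 * write as blockIdx.x * blockDim.x + threadIdx.x. */
int global_index(int block_idx, int block_dim, int thread_idx) {
    return block_idx * block_dim + thread_idx;
}

/* Emulates a one-dimensional grid: one (block, thread) pair per element
 * of a[i] = b[i] * c[i]. On a GPU the two loops disappear, because the
 * hardware creates and schedules the threads. */
void vec_mul_grid(float *a, const float *b, const float *c,
                  int n, int block_dim) {
    int grid_dim = blocks_needed(n, block_dim);
    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int i = global_index(block, block_dim, thread);
            if (i < n)              /* last block may be partly empty */
                a[i] = b[i] * c[i];
        }
    }
}
```

The bounds check matters whenever n is not a multiple of the block size: with 5 elements and 4 threads per block, the second block has three threads with nothing to do.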

28 GPU Architecture
[Figure: floor plan of NVIDIA Fermi GTX 480. Labels: multithreaded SIMD processor, thread block scheduler.]

29 SIMD Multithreaded Processor
Processes 16 elements at a time
One lane per element

30 Conditional Branching
GPU branch hardware uses internal masks to handle different execution paths
    for (i = 0; i < 64; i = i + 1)
        if (x[i] != 0)
            x[i] = x[i] - y[i];
        else
            x[i] = x[i] + y[i];
[Figure: lanes 0 to 5 shown for each path. Blue: mask=1, x[i] != 0. Red: mask=0, x[i] == 0.]
Branch divergence leads to idle execution units
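Mask-based branching can be emulated in C by evaluating the condition for a whole group of lanes and then running each path only for the lanes whose mask selects it. This is a sketch under assumptions: the group width of 16 follows the "16 elements at a time" slide, and `masked_branch` with its idle-slot counter is an illustrative construction, not a description of the real hardware.

```c
#define LANES 16   /* assumed SIMD width */

/* Applies the slide's if/else to x[0..n-1] one group of lanes at a time.
 * Returns the number of lane slots left idle because lanes in the same
 * group took different paths. */
int masked_branch(float *x, const float *y, int n) {
    int idle = 0;
    for (int base = 0; base < n; base += LANES) {
        int width = (n - base < LANES) ? (n - base) : LANES;
        int mask[LANES];
        int ones = 0;

        /* Evaluate the condition once per lane, before any update. */
        for (int lane = 0; lane < width; lane++) {
            mask[lane] = (x[base + lane] != 0.0f);
            ones += mask[lane];
        }

        /* Pass 1: "then" path, only lanes with mask = 1 do work. */
        if (ones > 0) {
            for (int lane = 0; lane < width; lane++)
                if (mask[lane]) x[base + lane] -= y[base + lane];
            idle += width - ones;
        }

        /* Pass 2: "else" path, only lanes with mask = 0 do work. */
        if (ones < width) {
            for (int lane = 0; lane < width; lane++)
                if (!mask[lane]) x[base + lane] += y[base + lane];
            idle += ones;
        }
    }
    return idle;
}
```

When every lane in a group agrees, only one pass runs and no slot is wasted; when the lanes disagree, both passes run and each pass leaves the other path's lanes idle.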

31 Coalesced Memory Access
[Figure: an original matrix holding elements 0 through b, and its storage in memory.]
non-coalesced
- thread 0: 0, 1, 2
- thread 1: 3, 4, 5
- thread 2: 6, 7, 8
- thread 3: 9, a, b
coalesced
- thread 0: 0, 4, 8
- thread 1: 1, 5, 9
- thread 2: 2, 6, a
- thread 3: 3, 7, b
Maximize memory bandwidth
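The two access patterns on this slide differ only in how a thread id and a step number are turned into an element index. The sketch below reproduces both mappings for 4 threads and 3 elements per thread; the function names are illustrative and not taken from the slides.

```c
/* Non-coalesced (blocked) mapping: each thread walks its own contiguous
 * chunk, so thread 1 touches elements 3, 4, 5. */
int blocked_index(int tid, int step, int per_thread) {
    return tid * per_thread + step;
}

/* Coalesced (interleaved) mapping: at every step, neighbouring threads
 * touch neighbouring elements, so thread 1 touches elements 1, 5, 9. */
int interleaved_index(int tid, int step, int nthreads) {
    return step * nthreads + tid;
}

/* Returns 1 if, at the given step, threads 0..nthreads-1 access
 * consecutive elements, which is what lets the hardware merge their
 * requests into one wide memory access. */
int step_is_contiguous(int step, int nthreads, int per_thread,
                       int interleaved) {
    for (int tid = 1; tid < nthreads; tid++) {
        int prev = interleaved
            ? interleaved_index(tid - 1, step, nthreads)
            : blocked_index(tid - 1, step, per_thread);
        int cur = interleaved
            ? interleaved_index(tid, step, nthreads)
            : blocked_index(tid, step, per_thread);
        if (cur != prev + 1) return 0;
    }
    return 1;
}
```

In the blocked layout the four threads read elements 0, 3, 6, 9 in the first step, a stride of three; in the interleaved layout they read 0, 1, 2, 3.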

32 [Excerpt from the first page of Zhang-ASPLOS11. Visible author and affiliation text: ... an Jiang, Ziyu Guo, Kai Tian, Xipeng Shen; Computer Science Department, ... William and Mary, Williamsburg, VA, USA]
(a) Irregular memory reference: ... = A[P[tid]]; with P[ ] = { 0, 5, 1, 7, 4, 3, 6, 2 }
(b) Irregular control flow: if (B[tid]) {...}
Figure 1. Examples of dynamic irregularities (warp size=4; segment size=4). Graph (a) shows that inferior mappings between threads and data locations cause more memory transactions than necessary; graph (b) shows that inferior mappings between threads and data values cause threads in the same warp to diverge on the condition.
Zhang-ASPLOS11
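The cost of the irregular reference in part (a) can be estimated by counting, per warp, how many distinct memory segments the threads touch. The sketch below uses the warp size and segment size stated in the caption; treating one distinct segment per warp as one memory transaction is a simplifying assumption, and `count_transactions` is an illustrative name.

```c
/* Counts memory transactions for accesses idx[0..n-1], where thread tid
 * reads element idx[tid]. Threads are grouped into warps of warp_size,
 * and each distinct segment of seg_size elements touched by a warp is
 * assumed to cost one transaction. */
int count_transactions(const int *idx, int n, int warp_size, int seg_size) {
    int total = 0;
    for (int base = 0; base < n; base += warp_size) {
        int end = (base + warp_size < n) ? base + warp_size : n;
        for (int i = base; i < end; i++) {
            int seen = 0;
            /* Has an earlier thread of this warp hit the same segment? */
            for (int j = base; j < i; j++) {
                if (idx[j] / seg_size == idx[i] / seg_size) {
                    seen = 1;
                    break;
                }
            }
            if (!seen) total++;
        }
    }
    return total;
}
```

For the P[ ] shown in the figure, each of the two warps touches both segments, giving four transactions in total, while a mapping in which thread tid reads element tid needs only two.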