CPU Architecture Overview. Varun Sampath CIS 565 Spring 2012


1 CPU Architecture Overview
Varun Sampath, CIS 565, Spring 2012

2 Objectives
Performance tricks of a modern CPU:
  Pipelining
  Branch prediction
  Superscalar
  Out-of-order (OoO) execution
  Memory hierarchy
  Vector operations
  SMT
  Multicore

3 What is a CPU anyways?
Execute instructions
Now so much more:
  Interface to main memory (DRAM)
  I/O functionality
Composed of transistors

4 Instructions
Examples: arithmetic, memory, control flow
  add r3,r4 -> r4
  load [r4] -> r7
  jz end
Given a compiled program, minimize:
  (cycles / instruction) × (seconds / cycle)
The two terms: CPI (cycles per instruction) and clock period
Reducing one term may increase the other
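
The product on this slide can be checked with a few lines of C. This is a minimal sketch, not something from the lecture: the function and parameter names are assumptions, and the clock is given as a frequency in Hz, so dividing by it is the same as multiplying by the clock period.

```c
#include <assert.h>

/* Execution time = instruction count x CPI x clock period.
 * Names are illustrative; the slide only gives the two ratios. */
double execution_time_seconds(double instructions, double cpi, double clock_hz)
{
    assert(clock_hz > 0.0);
    /* clock period (seconds per cycle) is 1 / clock_hz */
    return instructions * cpi / clock_hz;
}
```

For example, one billion instructions at a CPI of 2 on a 1 GHz clock take about two seconds. Halving the clock period helps only if the CPI does not rise by the same factor.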

5 Desktop Programs
Lightly threaded
Lots of branches
Lots of memory accesses
                       vim      ls
Conditional branches   13.6%    12.5%
Memory accesses        45.7%    45.7%
Vector instructions     1.1%     0.2%
Profiled with psrun on ENIAC

6 [Image. Source: intel.com]

7 What is a Transistor?
Approximation: a voltage-controlled switch
Typical channel lengths (for 2012): 22-32nm
[Diagram: transistor with the channel labeled. Image: Penn ESE370]

8 Moore's Law
"The complexity for minimum component costs has increased at a rate of roughly a factor of two per year"
Self-fulfilling prophecy
What do we do with our transistor budget?
Source: intel.com

9 Intel Core i7 3960X (Codename Sandy Bridge-E)
2.27B transistors, total size 435 mm²
Source:

10 A Simple CPU Core
[Diagram: datapath with PC (+4), I$, register file (s1, s2, d), D$, and control. Image: Penn CIS501]

11 A Simple CPU Core
[Diagram: datapath with PC (+4), I$, register file (s1, s2, d), D$, and control. Image: Penn CIS501]
Stages: Fetch, Decode, Execute, Memory, Writeback

12 Pipelining
[Diagram: datapath with PC (+4), Insn Mem, register file (s1, s2, d), Data Mem. Image: Penn CIS501]
T_singlecycle = T_insn-mem + T_regfile + T_ALU + T_data-mem + T_regfile

13 Pipelining
Capitalize on instruction-level parallelism (ILP)
+ Significantly reduced clock period
- Slight latency & area increase (pipeline latches)
? Dependent instructions
? Branches
Alleged pipeline lengths:
  Core 2: 14 stages
  Pentium 4 (Prescott): > 20 stages
  Sandy Bridge: in between
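
The clock-period trade-off can be sketched numerically. The stage latencies and the latch overhead below are made-up illustrative values, not measurements from the lecture. The sketch assumes the pipelined clock must cover the slowest stage plus one latch delay.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical latencies (ns) for insn-mem, regfile read, ALU,
 * data-mem, regfile write. Illustrative values only. */
static const double EXAMPLE_STAGES_NS[5] = {2.0, 1.0, 2.0, 2.0, 1.0};

/* Single-cycle design: one clock must cover every stage back to back. */
double single_cycle_period(const double *stage_ns, size_t n)
{
    double total = 0.0;
    for (size_t i = 0; i < n; i++)
        total += stage_ns[i];
    return total;
}

/* Pipelined design: the clock covers only the slowest stage,
 * plus the delay of the latch that separates stages. */
double pipelined_period(const double *stage_ns, size_t n, double latch_ns)
{
    assert(n > 0);
    double slowest = stage_ns[0];
    for (size_t i = 1; i < n; i++)
        if (stage_ns[i] > slowest)
            slowest = stage_ns[i];
    return slowest + latch_ns;
}
```

With these numbers the single-cycle period is 8 ns and the pipelined period with a 0.5 ns latch is 2.5 ns. A single instruction now takes 5 x 2.5 = 12.5 ns end to end, which is the "slight latency increase" on the slide.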

14 Branches
[Diagram: pipelined datapath (register file, data memory; F/D, D/X, X/M, M/W latches) with a "jeq loop" instruction in flight, the instruction slots fetched behind it marked "??????", and a nop. Image: Penn CIS501]

15 Branch Prediction
Guess what instruction comes next
Based on branch history
Example: two-level predictor with global history
  Maintain history table of all outcomes for M successive branches
  Compare with past N results (history register)
  Sandy Bridge employs 32-bit history register
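
A two-level predictor with global history can be sketched in C as follows. This is a toy model under stated assumptions, not Sandy Bridge's design: the 8-bit history length, the 2-bit saturating counters, and every name in the code are choices made for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define HISTORY_BITS 8u                    /* assumed; real designs use more */
#define TABLE_SIZE   (1u << HISTORY_BITS)

typedef struct {
    uint32_t history;               /* last HISTORY_BITS outcomes, 1 = taken */
    uint8_t  counters[TABLE_SIZE];  /* 2-bit saturating counters, 0..3 */
} predictor_t;

void predictor_init(predictor_t *p)
{
    p->history = 0;
    for (uint32_t i = 0; i < TABLE_SIZE; i++)
        p->counters[i] = 1;         /* start weakly not-taken */
}

/* Level 1 (history register) selects a counter; level 2 (counter) predicts. */
bool predictor_predict(const predictor_t *p)
{
    return p->counters[p->history] >= 2;
}

void predictor_update(predictor_t *p, bool taken)
{
    uint8_t *c = &p->counters[p->history];
    if (taken)  { if (*c < 3) (*c)++; }
    else        { if (*c > 0) (*c)--; }
    /* Shift the outcome into the global history register. */
    p->history = ((p->history << 1) | (taken ? 1u : 0u)) & (TABLE_SIZE - 1u);
}

/* Hypothetical driver: replay a loop-closing branch that is taken
 * (trip_count - 1) times and then falls through, and count the
 * mispredictions seen after a warm-up period. */
unsigned mispredictions_after_warmup(unsigned trip_count,
                                     unsigned warmup_loops,
                                     unsigned scored_loops)
{
    assert(trip_count >= 1);
    predictor_t p;
    predictor_init(&p);
    unsigned wrong = 0;
    for (unsigned loop = 0; loop < warmup_loops + scored_loops; loop++) {
        for (unsigned i = 0; i < trip_count; i++) {
            bool taken = (i + 1 < trip_count);
            if (loop >= warmup_loops && predictor_predict(&p) != taken)
                wrong++;
            predictor_update(&p, taken);
        }
    }
    return wrong;
}
```

A short loop whose whole pattern fits in the history register is predicted perfectly once warmed up. A loop longer than the history keeps mispredicting its exit, because the history looks identical just before the exit and in the middle of the loop.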

16 Branch Prediction
+ Modern predictors > 90% accuracy
  o Raise performance and energy efficiency (why?)
- Area increase
- Potential fetch stage latency increase

17 Another option: Predication
Replace branches with conditional instructions
  ; if (r1==0) r3=r2
  cmoveq r1, r2 -> r3
+ Avoids branch predictor
  o Avoids area penalty, misprediction penalty
- Avoids branch predictor
  o Introduces unnecessary nop if predictable branch
GPUs also use predication
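
The same idea can be shown at the C level. This is only a sketch with assumed function names. Whether a compiler emits a conditional move for either form depends on the compiler and target, so the second function makes the branch-free selection explicit with a mask.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form of: if (r1 == 0) r3 = r2; */
int32_t select_branch(int32_t r1, int32_t r2, int32_t r3)
{
    if (r1 == 0)
        r3 = r2;
    return r3;
}

/* Branch-free form: turn the condition into an all-ones or all-zeros
 * mask and blend the two candidate values with it. */
int32_t select_branchless(int32_t r1, int32_t r2, int32_t r3)
{
    uint32_t mask = (uint32_t)0 - (uint32_t)(r1 == 0);
    assert(mask == 0u || mask == 0xFFFFFFFFu);
    return (int32_t)(((uint32_t)r2 & mask) | ((uint32_t)r3 & ~mask));
}
```

Both functions return the same value for every input. The branch-free version always does the work of both paths, which is the cost noted on the slide when the branch would have been easy to predict.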

18 Improving IPC
IPC (instructions/cycle) bottlenecked at 1 instruction / clock
Superscalar: increase pipeline width
Image: Penn CIS371

19 Superscalar
+ Peak IPC now at N (for N-way superscalar)
  o Branching and scheduling impede this
  o Need some more tricks to get closer to peak (next)
- Area increase
  o Doubling execution resources
  o Bypass network grows at N²
  o Need more register & memory bandwidth

20 Superscalar in Sandy Bridge
[Image: David Kanter, RWT]

21 Scheduling
Consider instructions:
  xor r1,r2 -> r3
  add r3,r4 -> r4
  sub r5,r2 -> r3
  addi r3,1 -> r1
xor and add are dependent (Read-After-Write, RAW)
sub and addi are dependent (RAW)
xor and sub are not truly dependent: they only write the same register (Write-After-Write, WAW)
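
The dependence kinds named above can be checked mechanically. The sketch below is an illustration with assumed type and constant names. It compares one pair of instructions in isolation and ignores any writer in between. It also reports Write-After-Read (WAR), which the slide does not list.

```c
#include <assert.h>

typedef struct { int src1, src2, dst; } insn_t;  /* register numbers; -1 = none */

enum { DEP_NONE = 0, DEP_RAW = 1, DEP_WAR = 2, DEP_WAW = 4 };

/* The four instructions from the slide (addi's immediate is not a register). */
static const insn_t XOR_INSN  = {1, 2, 3};   /* xor  r1,r2 -> r3 */
static const insn_t ADD_INSN  = {3, 4, 4};   /* add  r3,r4 -> r4 */
static const insn_t SUB_INSN  = {5, 2, 3};   /* sub  r5,r2 -> r3 */
static const insn_t ADDI_INSN = {3, -1, 1};  /* addi r3,1  -> r1 */

static int same_reg(int a, int b)
{
    return a >= 0 && a == b;
}

/* Returns a bitmask of DEP_* flags for `later` relative to `earlier`. */
int classify_dependence(insn_t earlier, insn_t later)
{
    assert(earlier.dst >= 0 && later.dst >= 0);  /* every insn here writes a register */
    int deps = DEP_NONE;
    if (same_reg(later.src1, earlier.dst) || same_reg(later.src2, earlier.dst))
        deps |= DEP_RAW;   /* later reads what earlier wrote */
    if (same_reg(earlier.src1, later.dst) || same_reg(earlier.src2, later.dst))
        deps |= DEP_WAR;   /* later overwrites what earlier read */
    if (same_reg(earlier.dst, later.dst))
        deps |= DEP_WAW;   /* both write the same register */
    return deps;
}
```

Only RAW carries a value from one instruction to the next. WAR and WAW arise from reusing a register name.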

22 Register Renaming
How about this instead:
  xor p1,p2 -> p6
  add p6,p4 -> p7
  sub p5,p2 -> p8
  addi p8,1 -> p9
xor and sub can now execute in parallel
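
A rename stage that produces this numbering can be sketched as below. The table sizes and all names are assumptions made for illustration. The sketch allocates a fresh physical register for every destination and never frees one, which real hardware does at commit.

```c
#include <assert.h>

#define NUM_ARCH_REGS 6    /* r0..r5; assumed so the first free register is p6 */
#define NUM_PHYS_REGS 32   /* assumed */

typedef struct { int src1, src2, dst; } insn_t;  /* -1 = no register */

typedef struct {
    int map[NUM_ARCH_REGS];  /* architectural -> current physical register */
    int next_free;           /* next never-used physical register */
} rename_table_t;

void rename_init(rename_table_t *t)
{
    for (int r = 0; r < NUM_ARCH_REGS; r++)
        t->map[r] = r;               /* r_i starts out living in p_i */
    t->next_free = NUM_ARCH_REGS;
}

insn_t rename_insn(rename_table_t *t, insn_t in)
{
    assert(in.dst >= 0 && in.dst < NUM_ARCH_REGS);
    assert(in.src1 < NUM_ARCH_REGS && in.src2 < NUM_ARCH_REGS);
    assert(t->next_free < NUM_PHYS_REGS);

    insn_t out;
    /* Sources read the most recent mapping, preserving RAW dependences. */
    out.src1 = (in.src1 >= 0) ? t->map[in.src1] : -1;
    out.src2 = (in.src2 >= 0) ? t->map[in.src2] : -1;
    /* The destination gets a fresh register, removing WAW and WAR hazards. */
    out.dst = t->next_free++;
    t->map[in.dst] = out.dst;
    return out;
}
```

Feeding the four instructions from the Scheduling slide through `rename_insn` in program order yields destinations p6, p7, p8, and p9, with `addi` reading p8.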

23 Out-of-Order Execution
Reordering instructions to maximize throughput
Fetch -> Decode -> Rename -> Dispatch -> Issue -> Register-Read -> Execute -> Memory -> Writeback -> Commit
Reorder Buffer (ROB)
  Keeps track of status for in-flight instructions
Physical Register File (PRF)
Issue Queue/Scheduler
  Chooses next instruction(s) to execute

24 OoO in Sandy Bridge
[Image: David Kanter, RWT]

25 Out-of-Order Execution
+ Brings IPC much closer to ideal
- Area increase
- Energy increase
Modern Desktop/Mobile In-order CPUs
  Intel Atom
  ARM Cortex-A8 (Apple A4, TI OMAP 3)
  Qualcomm Scorpion
Modern Desktop/Mobile OoO CPUs
  Intel Pentium Pro and onwards
  ARM Cortex-A9 (Apple A5, NV Tegra 2/3, TI OMAP 4)
  Qualcomm Krait

26 Memory Hierarchy
Memory: the larger it gets, the slower it gets
Rough numbers:
                      Latency    Bandwidth    Size
SRAM (L1, L2, L3)     1-2ns      200GBps      1-20MB
DRAM (memory)         70ns       20GBps       1-20GB
Flash (disk)          70-90µs    200MBps      GB
HDD (disk)            10ms       1-150MBps    GB
SRAM & DRAM latency, and DRAM bandwidth for Sandy Bridge, from Lostcircuits
Flash and HDD latencies from AnandTech
Flash and HDD bandwidth from AnandTech Bench
SRAM bandwidth guesstimated.

27 Caching
Keep data you need close
Exploit:
  Temporal locality
    Chunk just used likely to be used again soon
  Spatial locality
    Next chunk to use is likely close to previous
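
Spatial locality shows up in how a program walks an array. The sketch below, with assumed function names, sums the same row-major matrix in two orders. Both return the same result. The first touches adjacent addresses on consecutive accesses. The second jumps by a whole row each time, so on large matrices it tends to use each fetched cache line less well.

```c
#include <assert.h>
#include <stddef.h>

/* Row-major walk: consecutive accesses are adjacent in memory. */
long sum_row_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t r = 0; r < rows; r++)
        for (size_t c = 0; c < cols; c++)
            sum += a[r * cols + c];
    return sum;
}

/* Column-major walk over the same row-major data:
 * consecutive accesses are `cols` elements apart. */
long sum_col_major(const int *a, size_t rows, size_t cols)
{
    assert(a != NULL);
    long sum = 0;
    for (size_t c = 0; c < cols; c++)
        for (size_t r = 0; r < rows; r++)
            sum += a[r * cols + c];
    return sum;
}
```

Any speed difference depends on the matrix size and the machine's cache sizes, so it is worth timing rather than assuming.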

28 Cache Hierarchy
Hardware-managed
  L1 Instruction/Data caches (I$, D$)
  L2 unified cache
  L3 unified cache
Software-managed
  Main memory
  Disk
[Diagram: pyramid of I$/D$, L2, L3, Main Memory, Disk; levels get larger toward the bottom and faster toward the top (not to scale)]

29 Intel Core i7 3960X
15MB L3 (25% of die)
4-channel memory controller, 51.2GB/s total
Source:

30 Some Memory Hierarchy Design Choices
Banking
  Avoid multi-porting
Coherency
Memory Controller
  Multiple channels for bandwidth

31 Parallelism in the CPU
Covered Instruction-Level (ILP) extraction
  Superscalar
  Out-of-order
Data-Level Parallelism (DLP)
  Vectors
Thread-Level Parallelism (TLP)
  Simultaneous Multithreading (SMT)
  Multicore

32 Vectors Motivation
for (int i = 0; i < N; i++)
    A[i] = B[i] + C[i];

33 CPU Data-level Parallelism
Single Instruction Multiple Data (SIMD)
  Let's make the execution unit (ALU) really wide
  Let's make the registers really wide too
for (int i = 0; i < N; i += 4) {
    // in parallel
    A[i]   = B[i]   + C[i];
    A[i+1] = B[i+1] + C[i+1];
    A[i+2] = B[i+2] + C[i+2];
    A[i+3] = B[i+3] + C[i+3];
}
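
On x86 the "in parallel" body can be written with SSE intrinsics. The sketch below is one possible version, not the lecture's code. The function name is an assumption. It uses unaligned loads and stores so the arrays need no special alignment, handles a length that is not a multiple of 4 with a scalar tail, and falls back to the scalar loop entirely when the compiler does not define `__SSE__`.

```c
#include <assert.h>
#include <stddef.h>
#if defined(__SSE__)
#include <xmmintrin.h>   /* SSE: __m128, _mm_loadu_ps, _mm_add_ps, _mm_storeu_ps */
#endif

/* a[i] = b[i] + c[i] for i in [0, n) */
void vector_add(float *a, const float *b, const float *c, size_t n)
{
    assert(n == 0 || (a != NULL && b != NULL && c != NULL));
    size_t i = 0;
#if defined(__SSE__)
    /* Four floats per iteration in one 128-bit register. */
    for (; i + 4 <= n; i += 4) {
        __m128 vb = _mm_loadu_ps(b + i);
        __m128 vc = _mm_loadu_ps(c + i);
        _mm_storeu_ps(a + i, _mm_add_ps(vb, vc));
    }
#endif
    /* Scalar tail (or the whole loop when SSE is unavailable). */
    for (; i < n; i++)
        a[i] = b[i] + c[i];
}
```

Modern compilers often vectorize the plain loop themselves at higher optimization levels, so the intrinsics mainly make the 4-wide operation visible.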

34 Vector Operations in x86
SSE2
  4-wide packed float and packed integer instructions
  Intel Pentium 4 onwards
  AMD Athlon 64 onwards
AVX
  8-wide packed float instructions (8-wide packed integer instructions arrived later, with AVX2)
  Intel Sandy Bridge
  AMD Bulldozer

35 Thread-Level Parallelism
Thread Composition
  Instruction streams
  Private PC, registers, stack
  Shared globals, heap
Created and destroyed by programmer
Scheduled by programmer or by OS

36 Simultaneous Multithreading
Instructions can be issued from multiple threads
Requires partitioning of ROB, other buffers
+ Minimal hardware duplication
+ More scheduling freedom for OoO
- Cache and execution resource contention can reduce single-threaded performance

37 Multicore
Replicate full pipeline
Sandy Bridge-E: 6 cores
+ Full cores, no resource sharing other than last-level cache
+ Easier way to take advantage of Moore's Law
- Utilization

38 Locks, Coherence, and Consistency
Problem: multiple threads reading/writing to same data
  A solution: Locks
    Implement with test-and-set, load-link/store-conditional instructions
Problem: Who has the correct data?
  A solution: cache coherency protocol
Problem: What is the correct data?
  A solution: memory consistency model
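
A lock built on test-and-set can be sketched with C11 atomics. The `spinlock_t` type and function names are assumptions for illustration. `atomic_flag_test_and_set_explicit` and `atomic_flag_clear_explicit` are the standard `<stdatomic.h>` operations. How they map to hardware instructions is up to the compiler and target.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

typedef struct { atomic_flag held; } spinlock_t;

#define SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Test-and-set returns the previous value of the flag:
 * false means the lock was free and now belongs to the caller. */
bool spinlock_try_lock(spinlock_t *l)
{
    assert(l != NULL);
    return !atomic_flag_test_and_set_explicit(&l->held, memory_order_acquire);
}

/* Spin until the test-and-set succeeds. */
void spinlock_lock(spinlock_t *l)
{
    while (!spinlock_try_lock(l)) {
        /* busy-wait */
    }
}

/* Release ordering makes writes from the critical section visible
 * to the next thread that acquires the lock. */
void spinlock_unlock(spinlock_t *l)
{
    assert(l != NULL);
    atomic_flag_clear_explicit(&l->held, memory_order_release);
}
```

A busy-waiting lock like this suits very short critical sections. Under contention every spinning core keeps requesting the same cache line, which ties back to the coherence problem on this slide.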

39 Conclusions
CPU optimized for sequential programming
  Pipelines, branch prediction, superscalar, OoO
  Reduce execution time with high clock speeds and high utilization
Slow memory is a constant problem
Parallelism
  Sandy Bridge-E great for 6-12 active threads
  How about 12,000?

40 References
Milo Martin, Penn CIS501 Fall
David Kanter, Intel's Sandy Bridge Microarchitecture. 9/25/10. D=RWT
Agner Fog, The microarchitecture of Intel, AMD and VIA CPUs. 6/8/ e.pdf

41 Bibliography
Classic Jon Stokes articles introducing basic CPU architecture, pipelining (1, 2), and Moore's Law
CMOV discussion on Mozilla mailing list
Herb Sutter, The Free Lunch Is Over: A Fundamental Turn Toward Concurrency in Software.
