EN2910A: Advanced Computer Architecture Topic 01: Introduction to Quantitative Analysis

EN2910A: Advanced Computer Architecture. Topic 01: Introduction to Quantitative Analysis. Prof. Sherief Reda, School of Engineering, Brown University.

Topic 01: Introduction to Quantitative Analysis
1. Trends in computing systems
2. Quantifying performance
3. Quantifying power
4. Role of simulators

Moore's law
Started with a 10 um feature size in 1970; we are now at a 22 nm feature size. The number of transistors per unit area doubles every 2-3 years. More transistors per die → more capable cores and/or more cores, GPUs, and accelerating functional units, as in SoCs. Transistors also switch faster with each new technology generation.

Trends in single-processor performance [Patterson]
1. Process technology → more devices per unit area → build more powerful cores (more ILP) and/or add more cores (TLP)
2. Process technology → faster transistors → higher clock rate
3. Deeper pipelines → higher clock rate
4. Better circuit design techniques and better CAD tools

Evolution of clock rate [Dubois]
Dynamic power is proportional to f·V^2. Increasing the frequency with less-than-ideal voltage scaling leads to increased power density.

Power wall [Patterson]
Economical heat-removal mechanisms (e.g., air and liquid cooling) limit the maximum amount of power consumption.

Lack-of-parallelism wall
Programming parallel applications is hard. Sometimes applications do not scale; synchronization can become a bottleneck. Example: speedup of PARSEC benchmarks on an 8-core Xeon server [Weaver 09]. Circumventing the parallelism wall: incorporate heterogeneous functional units on die (e.g., GPUs, accelerators).

Memory wall
Unlike processor performance, DRAM performance improved by only about 7% a year → a memory wall. DRAM density, however, followed Moore's law. Memory wall = memory_cycle / processor_cycle. In 1990 the ratio was about 4 (25 MHz processor, 150 ns memory). It grew to about 200 by 2002 but has tapered off since then [Dubois]. Although still a big problem, the memory latency wall stopped growing around 2002. With the advent of multicore microarchitectures, the memory problem has shifted from latency to bandwidth.
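As a quick check of the 1990 data point above, a minimal sketch (the 25 MHz and 150 ns figures come from the slide; the helper name is ours):

```python
def memory_wall_ratio(clock_hz, mem_latency_s):
    """Memory wall = memory access time / processor cycle time."""
    cycle_time_s = 1.0 / clock_hz
    return mem_latency_s / cycle_time_s

# 1990: 25 MHz processor, 150 ns DRAM access -> ratio of about 4
print(memory_wall_ratio(25e6, 150e-9))  # ~3.75
```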

Summary of current and future challenges to computing
Memory wall: an increasing number of cores requires increased memory bandwidth; otherwise, starvation and stalling occur.
Parallelism wall: some applications lack enough ILP or TLP → not much benefit from aggressive superscalar or many-core designs.
Power wall: limits on heat removal impose a limit on power density and on the frequency of operation.
Power and parallelism walls → dark silicon (only a small portion of the chip can be operational at any moment in time).

Performance metrics
1. Execution time, latency, response time: the time to complete a task.
2. Throughput: the number of tasks (e.g., instructions, queries, frames rendered) completed per unit time.
Is throughput = 1 / (average response time)? Only if there is NO overlap between tasks. Otherwise, throughput > 1 / (average response time).
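A minimal numeric sketch of the point above; the task count, per-task latency, and degree of overlap are hypothetical values chosen only for illustration:

```python
# 4 tasks, each with 10 s of latency.
latency_s = 10.0
tasks = 4

# No overlap: tasks run back to back.
no_overlap_time = tasks * latency_s
print(tasks / no_overlap_time)      # 0.1 tasks/s == 1 / latency

# With overlap: two tasks in flight at a time (e.g., a 2-wide pipeline),
# so the 4 tasks finish in 20 s even though each still sees 10 s of latency.
overlap_time = (tasks / 2) * latency_s
print(tasks / overlap_time)         # 0.2 tasks/s > 1 / latency
```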

Which benchmarks?
1. Real programs (e.g., MPEG encoding)
2. Synthetic benchmarks (e.g., measuring I/O storage bandwidth)
3. Kernels
4. Toy benchmarks (e.g., quicksort)
5. Benchmark suites:
SPEC (Standard Performance Evaluation Corporation): SPEC CPU integer, SPEC CPU floating point, SPECpower_ssj for transactional workloads, SPECviewperf for GPU performance
PARSEC for multi-threaded applications
Rodinia for GPGPU performance
NAS Parallel Benchmarks (NPB) for clusters (e.g., FFT)
HPC Challenge benchmarks (HPCC) for clusters (e.g., linear solvers)

Examples of benchmark results
Runtime and average processor power for SPEC CPU2006 benchmarks on an AMD Phenom II X4 965 Black Edition at 3.4 GHz with 4 GB of DRAM, running Linux 2.6.10.8.

Reporting performance for a set of programs
Arithmetic mean of execution times: (T_1 + T_2 + ... + T_N) / N, or a weighted mean w_1·T_1 + w_2·T_2 + ... + w_N·T_N. The problem is that the programs with the longest execution times can dominate the result.
Or report speedups (why?). The speedup measures the advantage of a machine over a reference machine R on program i: S_i = T_R,i / T_i.
Arithmetic mean of speedups: (S_1 + S_2 + ... + S_N) / N
Geometric mean of speedups: (S_1 · S_2 · ... · S_N)^(1/N)
Harmonic mean of speedups: N / (1/S_1 + 1/S_2 + ... + 1/S_N)
What is the advantage of each?

Example: which is better, machine 1 or machine 2?

Execution times and arithmetic means:
              Program A   Program B   Arithmetic mean   Ratio of means (ref 1)   Ratio of means (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8                     10
Machine 2     1 sec       200 sec     100.5 sec         50.2                     5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

Means of speedups:
                                       Program A   Program B   Arithmetic   Harmonic   Geometric
Speedup wrt Reference 1   Machine 1    10          100         55           18.2       31.6
                          Machine 2    100         50          75           66.7       70.7
Speedup wrt Reference 2   Machine 1    10          10          10           10         10
                          Machine 2    100         5           52.5         9.5        22.4
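A small script, assuming the execution times above, that recomputes the speedup means in the second table (the function and variable names are ours):

```python
from math import prod

def arithmetic(xs): return sum(xs) / len(xs)
def harmonic(xs):   return len(xs) / sum(1 / x for x in xs)
def geometric(xs):  return prod(xs) ** (1 / len(xs))

times = {"Machine 1": [10, 100], "Machine 2": [1, 200]}
refs  = {"Reference 1": [100, 10000], "Reference 2": [100, 1000]}

for ref_name, ref in refs.items():
    for m_name, t in times.items():
        speedups = [r / x for r, x in zip(ref, t)]
        print(ref_name, m_name, speedups,
              round(arithmetic(speedups), 1),
              round(harmonic(speedups), 1),
              round(geometric(speedups), 1))
```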

When to use the harmonic mean?
Consider a processor that executes the first 10 billion instructions at a rate of 1 BIPS (billion instructions per second) and the second 10 billion instructions at a rate of 2 BIPS. What is the average instruction rate?
Average BIPS = (1 + 2) / 2 = 1.5 → WRONG
Average BIPS = (10 + 10) / (10/1 + 10/2) = 20/15 = 1.33
Harmonic mean of rates = n / (1/rate_1 + 1/rate_2 + ... + 1/rate_n)
Use the harmonic mean if forced to start and end with rates (e.g., when reporting CPI, miss rates, or branch misprediction rates).
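A minimal check of the example above (the phase sizes and rates come from the slide; the helper name is ours):

```python
def average_rate(work_per_phase, rates):
    """Total work divided by total time; equals the harmonic mean of the
    rates when every phase performs the same amount of work."""
    total_work = sum(work_per_phase)
    total_time = sum(w / r for w, r in zip(work_per_phase, rates))
    return total_work / total_time

# 10 billion instructions at 1 BIPS, then 10 billion at 2 BIPS
print(average_rate([10, 10], [1, 2]))   # 1.33..., not 1.5
```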

Performance metrics for clusters
Supercomputers: execution time; FLOPS (FLOP/s), either the theoretical peak or measured with a standard benchmark (e.g., LINPACK is used for the Top-500 supercomputer ranking).
Warehouse scale: latency is an important metric because it is seen by users. A Bing study showed that users use search less as response time increases. Service Level Objectives (SLOs) / Service Level Agreements (SLAs), e.g., 99% of requests must complete in under 100 ms.

Amdahl's law [Dubois]
An enhancement E accelerates a fraction F of a task by a factor of S; the remaining fraction 1-F is unaffected, so the enhanced execution time is (1-F) + F/S of the original.
speedup = T_exe(without E) / T_exe(with E) = 1 / ((1-F) + F/S)
The achievable speedup is limited by the fraction of execution time that cannot be enhanced → law of diminishing returns. For example, with F = 0.5 the speedup can never exceed 2, no matter how large S is.
Amdahl's law → optimize the common case.
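A short sketch of the formula above (the function name and example values are ours):

```python
def amdahl_speedup(F, S):
    """Overall speedup when a fraction F of the work is sped up by a factor S."""
    return 1.0 / ((1.0 - F) + F / S)

# Diminishing returns: with F = 0.5 the speedup approaches, but never reaches, 2.
for S in (2, 10, 100, 1e6):
    print(S, round(amdahl_speedup(0.5, S), 3))
```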

Physical reasons for power consumption [Dubois]
If the transistor input voltage is above the threshold voltage V_t, the transistor is ON (approximately a short circuit); otherwise it is OFF (approximately an open circuit). Dynamic power is consumed when transistors switch state. Static (leakage) power is consumed even when there is no switching (historically negligible, but growing in significance with nanoscale CMOS).

1. Static (leakage) power
P_static = V · I_sub, where the subthreshold leakage current I_sub grows roughly as e^(-K·V_t / T) for a technology-dependent constant K.
When the input voltage is below V_t, the transistor should be off; however, some electrons still get through because of the reduced threshold voltages (V_t) of recent technologies → static or leakage power consumption. The leakage current depends exponentially on V_t as well as on the operating temperature T. As V_t decreases, static power increases exponentially → the switch to 3D transistors was mainly motivated by the need to control leakage power. Noise also limits how far V_t can be reduced.
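A toy illustration of the exponential V_t dependence, using the relation above with an arbitrary, hypothetical constant K (not a calibrated device model):

```python
from math import exp

K = 10000.0   # hypothetical constant (kelvin per volt), chosen only for illustration
T = 350.0     # operating temperature in kelvin

def relative_leakage(vt):
    """Leakage relative to a 0.4 V threshold, using I_sub ~ exp(-K * Vt / T)."""
    return exp(-K * vt / T) / exp(-K * 0.4 / T)

for vt in (0.4, 0.3, 0.2):
    print(vt, round(relative_leakage(vt), 1))   # leakage grows rapidly as Vt shrinks
```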

2. Dynamic power
P_dynamic = α·f·C·V^2, where α is the fraction of clock cycles in which a gate switches.
At a given design and technology node, higher frequency demands higher voltage. If chip size grows, total power grows. Non-ideal scaling of V_t → non-ideal scaling of V → non-ideal scaling of power. Power dissipation leads to heat generation → when heat is not removed adequately, it causes thermal hot spots → problems for reliability and leakage power. [Reda et al., TComp 11]
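A quick numeric sketch of the P = α·f·C·V^2 relation; the parameter values and the 15% scaling factors are hypothetical, chosen only to show the quadratic dependence on voltage:

```python
def dynamic_power(alpha, f_hz, c_farads, v_volts):
    """Dynamic switching power: alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2

base = dynamic_power(0.1, 3.0e9, 1.0e-9, 1.1)
scaled = dynamic_power(0.1, 3.0e9 * 0.85, 1.0e-9, 1.1 * 0.85)

# Lowering f and V together by 15% cuts dynamic power by roughly 39%.
print(round(scaled / base, 3))   # ~0.614
```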

Reasons for the power wall
Scaling rules from one technology node to the next (e.g., 45 nm → 32 nm): area per core scales by 1/2; capacitance per core scales by 1/sqrt(2).
Let N be the number of cores at 45 nm and p the power per core, so total power = N·p.
Assuming the same frequency and the same total chip area at 32 nm, the chip now holds 2N cores, and total power = 2·N·p / (sqrt(2)·S), where S = (old voltage / new voltage)^2.
With a 45 nm voltage of 1.1 V and a 32 nm voltage of 1.0 V, S = 1.21 < sqrt(2), so total power in the same area, and hence power density, increases.
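A small script, assuming the idealized scaling rules above (same chip area and frequency, capacitance per core shrinking by 1/sqrt(2), core count doubling), that evaluates the power ratio for the 1.1 V → 1.0 V example:

```python
from math import sqrt

def power_ratio(v_old, v_new):
    """Total chip power at the new node relative to the old node, assuming
    2x cores, same area and frequency, and per-core capacitance scaled by 1/sqrt(2)."""
    S = (v_old / v_new) ** 2
    return 2 / (sqrt(2) * S)   # equivalently sqrt(2) / S

print(round(power_ratio(1.1, 1.0), 2))   # ~1.17: power density goes up
```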

Combined metrics for performance and power
Energy: the cost paid by the user, E = integral of p(t) dt from start to finish.
Energy per instruction (EPI, joules per instruction)
Energy-delay product (EDP)
MIPS/W and FLOPS/W
For clusters: Power Usage Effectiveness (PUE) = total facility power / IT equipment power [datacenterexperts.com]
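A minimal sketch of the energy integral above, numerically integrating a sampled power trace with the trapezoidal rule (the sample values are hypothetical):

```python
def energy_joules(times_s, power_w):
    """Approximate E = integral of p(t) dt using the trapezoidal rule."""
    return sum((t1 - t0) * (p0 + p1) / 2
               for t0, t1, p0, p1 in zip(times_s, times_s[1:], power_w, power_w[1:]))

# Hypothetical power samples (watts) taken once per second over a 4 s run.
times = [0, 1, 2, 3, 4]
power = [80, 95, 110, 100, 85]
print(energy_joules(times, power))   # energy in joules for the run
```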

Role of simulators
Computing systems are increasingly complex → we need simulators to evaluate new ideas and explore the design space. Simulation infrastructure should consider architectural issues (e.g., performance) as well as complex, intertwined physical phenomena (e.g., power, thermal, and reliability). Simulation is getting harder due to the need to simulate multi-threaded workloads on multi-core targets → the simulator itself can be single-threaded or multi-threaded.
Simulator taxonomy:
1. User-level versus full-system
2. Functional versus cycle-accurate
3. Trace-driven versus execution-driven

1. User-level vs. full-system simulators [Dubois]
User-level simulators: focus on simulating the microarchitecture, leaving out system components; system calls are treated as a black box.
Full-system simulators: model an entire computing system, including CPU, I/O, disks, and network.

2. Functional vs. cycle-accurate simulators [Dubois]
This classification is orthogonal to user-level vs. full-system.
Functionally accurate: the function of each instruction is executed without any microarchitectural detail. Fast but not timing-accurate.
Cycle-accurate: captures the details of all microarchitectural blocks and keeps track of timing. Accurate but slow.
Functional-first simulators combine the two: a functional front end executes instructions and feeds a timing model.

3. Trace-driven vs. execution-driven simulators
Trace-driven simulation: the benchmark is first executed on an ISA-compatible processor; each executed instruction is logged into a trace file; architectural state can be logged before and after OS calls and interrupts; the final trace is then fed into a cycle-accurate simulator.
Execution-driven simulation: there is no trace file; the benchmark is fed directly to the simulator, so all timing and functional aspects of the machine must be reproduced faithfully.

Summary
1. Trends: power wall, memory wall, parallelism wall. Frequency increases → power wall → multi-core → parallelism wall → fusion of heterogeneous units.
2. Quantifying performance: response time and throughput; computing means (arithmetic, geometric, harmonic).
3. Quantifying power: static and dynamic power, and the origins of the power wall.
4. Role of simulators: user-level vs. full-system, functional vs. cycle-accurate.