EN2910A: Advanced Computer Architecture Topic 01: Introduction to Quantitative Analysis

EN2910A: Advanced Computer Architecture. Topic 01: Introduction to Quantitative Analysis. Prof. Sherief Reda, School of Engineering, Brown University.

Topic 01: Introduction to Quantitative Analysis
1. Trends in computing systems
2. Quantifying performance
3. Quantifying power
4. Role of simulators

Moore's law
Started with a 10 um feature size in 1970; we are now at a 22 nm feature size. The number of transistors per unit area doubles every 2-3 years. More transistors per die → more capable cores and/or more cores, GPUs, and accelerating functional units, as in SoCs. Transistors also switch faster with each new technology generation.

Trends in single-processor performance [Patterson]
1. Process technology → more devices per unit area → build more powerful cores (more ILP) and/or add more cores (TLP)
2. Process technology → faster transistors → higher clock rate
3. Deeper pipelines → higher clock rate
4. Better circuit design techniques and better CAD tools

Evolution of clock rate [Dubois]
Dynamic power is proportional to f·V^2. Increasing the frequency with less-than-ideal voltage scaling leads to increased power density.

Power wall [Patterson]
Economical heat-removal mechanisms (e.g., air and liquid cooling) limit the maximum amount of power consumption.

Lack-of-parallelism wall
Programming parallel applications is hard. Sometimes applications do not scale; synchronization can become a bottleneck. Example: speedup of PARSEC benchmarks on an 8-core Xeon server [Weaver 09]. Circumventing the parallelism wall: incorporate heterogeneous functional units on die (e.g., GPUs, accelerators).

Memory wall
Unlike processor performance, DRAM performance improved by only about 7% a year → a memory wall. DRAM density, however, followed Moore's law. Memory wall = memory_cycle / processor_cycle. In 1990 the ratio was about 4 (25 MHz processor, 150 ns memory). It grew to about 200 by 2002 but has tapered off since then [Dubois]. Although still a big problem, the memory latency wall stopped growing around 2002. With the advent of multicore microarchitectures, the memory problem has shifted from latency to bandwidth.
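As a quick check of the 1990 data point above, a minimal sketch (the 25 MHz and 150 ns figures come from the slide; the helper name is ours):

```python
def memory_wall_ratio(clock_hz, mem_latency_s):
    """Memory wall = memory access time / processor cycle time."""
    cycle_time_s = 1.0 / clock_hz
    return mem_latency_s / cycle_time_s

# 1990: 25 MHz processor, 150 ns DRAM access -> ratio of about 4
print(memory_wall_ratio(25e6, 150e-9))  # ~3.75
```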

Summary of current and future challenges to computing
Memory wall: an increasing number of cores requires increased memory bandwidth; otherwise, starvation and stalling occur.
Parallelism wall: some applications lack enough ILP or TLP → not much benefit from aggressive superscalar or many-core designs.
Power wall: limits on heat removal impose a limit on power density and on the frequency of operation.
Power and parallelism walls → dark silicon (only a small portion of the chip can be operational at any moment in time).

Performance metrics
1. Execution time, latency, response time: the time to complete a task.
2. Throughput: the number of tasks (e.g., instructions, queries, frames rendered) completed per unit time.
Is throughput = 1 / (average response time)? Only if there is NO overlap between tasks. Otherwise, throughput > 1 / (average response time).
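A minimal numeric sketch of the point above; the task count, per-task latency, and degree of overlap are hypothetical values chosen only for illustration:

```python
# 4 tasks, each with 10 s of latency.
latency_s = 10.0
tasks = 4

# No overlap: tasks run back to back.
no_overlap_time = tasks * latency_s
print(tasks / no_overlap_time)      # 0.1 tasks/s == 1 / latency

# With overlap: two tasks in flight at a time (e.g., a 2-wide pipeline),
# so the 4 tasks finish in 20 s even though each still sees 10 s of latency.
overlap_time = (tasks / 2) * latency_s
print(tasks / overlap_time)         # 0.2 tasks/s > 1 / latency
```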

Which benchmarks?
1. Real programs (e.g., MPEG encoding)
2. Synthetic benchmarks (e.g., measuring I/O storage bandwidth)
3. Kernels
4. Toy benchmarks (e.g., quicksort)
5. Benchmark suites:
SPEC (Standard Performance Evaluation Corporation): SPEC CPU integer, SPEC CPU floating point, SPECpower_ssj for transactional workloads, SPECviewperf for GPU performance
PARSEC for multi-threaded applications
Rodinia for GPGPU performance
NAS Parallel Benchmarks (NPB) for clusters (e.g., FFT)
HPC Challenge benchmarks (HPCC) for clusters (e.g., linear solvers)

Examples of benchmark results
Runtime and average processor power for SPEC CPU2006 benchmarks on an AMD Phenom II X4 965 Black Edition at 3.4 GHz with 4 GB of DRAM, running Linux 2.6.10.8.

Reporting performance for a set of programs
Arithmetic mean of execution times: (T_1 + T_2 + ... + T_N) / N, or a weighted mean w_1·T_1 + w_2·T_2 + ... + w_N·T_N. The problem is that the programs with the longest execution times can dominate the result.
Or report speedups (why?). The speedup measures the advantage of a machine over a reference machine R on program i: S_i = T_R,i / T_i.
Arithmetic mean of speedups: (S_1 + S_2 + ... + S_N) / N
Geometric mean of speedups: (S_1 · S_2 · ... · S_N)^(1/N)
Harmonic mean of speedups: N / (1/S_1 + 1/S_2 + ... + 1/S_N)
What is the advantage of each?

Example: which is better, machine 1 or machine 2?

Execution times and arithmetic means:
              Program A   Program B   Arithmetic mean   Ratio of means (ref 1)   Ratio of means (ref 2)
Machine 1     10 sec      100 sec     55 sec            91.8                     10
Machine 2     1 sec       200 sec     100.5 sec         50.2                     5.5
Reference 1   100 sec     10000 sec   5050 sec
Reference 2   100 sec     1000 sec    550 sec

Means of speedups:
                                       Program A   Program B   Arithmetic   Harmonic   Geometric
Speedup wrt Reference 1   Machine 1    10          100         55           18.2       31.6
                          Machine 2    100         50          75           66.7       70.7
Speedup wrt Reference 2   Machine 1    10          10          10           10         10
                          Machine 2    100         5           52.5         9.5        22.4
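A small script, assuming the execution times above, that recomputes the speedup means in the second table (the function and variable names are ours):

```python
from math import prod

def arithmetic(xs): return sum(xs) / len(xs)
def harmonic(xs):   return len(xs) / sum(1 / x for x in xs)
def geometric(xs):  return prod(xs) ** (1 / len(xs))

times = {"Machine 1": [10, 100], "Machine 2": [1, 200]}
refs  = {"Reference 1": [100, 10000], "Reference 2": [100, 1000]}

for ref_name, ref in refs.items():
    for m_name, t in times.items():
        speedups = [r / x for r, x in zip(ref, t)]
        print(ref_name, m_name, speedups,
              round(arithmetic(speedups), 1),
              round(harmonic(speedups), 1),
              round(geometric(speedups), 1))
```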

When to use the harmonic mean?
Consider a processor that executes the first 10 billion instructions at a rate of 1 BIPS (billion instructions per second) and the second 10 billion instructions at a rate of 2 BIPS. What is the average instruction rate?
Average BIPS = (1 + 2) / 2 = 1.5 → WRONG
Average BIPS = (10 + 10) / (10/1 + 10/2) = 20/15 = 1.33
Harmonic mean of rates = n / (1/rate_1 + 1/rate_2 + ... + 1/rate_n)
Use the harmonic mean if forced to start and end with rates (e.g., when reporting CPI, miss rates, or branch misprediction rates).
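A minimal check of the example above (the phase sizes and rates come from the slide; the helper name is ours):

```python
def average_rate(work_per_phase, rates):
    """Total work divided by total time; equals the harmonic mean of the
    rates when every phase performs the same amount of work."""
    total_work = sum(work_per_phase)
    total_time = sum(w / r for w, r in zip(work_per_phase, rates))
    return total_work / total_time

# 10 billion instructions at 1 BIPS, then 10 billion at 2 BIPS
print(average_rate([10, 10], [1, 2]))   # 1.33..., not 1.5
```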

Performance metrics for clusters
Supercomputers: execution time; FLOPS (FLOP/s), either the theoretical peak or measured with a standard benchmark (e.g., LINPACK is used for the Top-500 supercomputer ranking).
Warehouse scale: latency is an important metric because it is seen by users. A Bing study showed that users use search less as response time increases. Service Level Objectives (SLOs) / Service Level Agreements (SLAs), e.g., 99% of requests must complete in under 100 ms.

Amdahl's law [Dubois]
An enhancement E accelerates a fraction F of a task by a factor of S; the remaining fraction 1-F is unaffected, so the enhanced execution time is (1-F) + F/S of the original.
speedup = T_exe(without E) / T_exe(with E) = 1 / ((1-F) + F/S)
The achievable speedup is limited by the fraction of execution time that cannot be enhanced → law of diminishing returns. For example, with F = 0.5 the speedup can never exceed 2, no matter how large S is.
Amdahl's law → optimize the common case.
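A short sketch of the formula above (the function name and example values are ours):

```python
def amdahl_speedup(F, S):
    """Overall speedup when a fraction F of the work is sped up by a factor S."""
    return 1.0 / ((1.0 - F) + F / S)

# Diminishing returns: with F = 0.5 the speedup approaches, but never reaches, 2.
for S in (2, 10, 100, 1e6):
    print(S, round(amdahl_speedup(0.5, S), 3))
```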

Physical reasons for power consumption [Dubois]
If the transistor input voltage is above the threshold voltage V_t, the transistor is ON (approximately a short circuit); otherwise it is OFF (approximately an open circuit). Dynamic power is consumed when transistors switch state. Static (leakage) power is consumed even when there is no switching (historically negligible, but growing in significance with nanoscale CMOS).

1. Static (leakage) power
P_static = V · I_sub, where the subthreshold leakage current I_sub grows roughly as e^(-K·V_t / T) for a technology-dependent constant K.
When the input voltage is below V_t, the transistor should be off; however, some electrons still get through because of the reduced threshold voltages (V_t) of recent technologies → static or leakage power consumption. The leakage current depends exponentially on V_t as well as on the operating temperature T. As V_t decreases, static power increases exponentially → the switch to 3D transistors was mainly motivated by the need to control leakage power. Noise also limits how far V_t can be reduced.
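A toy illustration of the exponential V_t dependence, using the relation above with an arbitrary, hypothetical constant K (not a calibrated device model):

```python
from math import exp

K = 10000.0   # hypothetical constant (kelvin per volt), chosen only for illustration
T = 350.0     # operating temperature in kelvin

def relative_leakage(vt):
    """Leakage relative to a 0.4 V threshold, using I_sub ~ exp(-K * Vt / T)."""
    return exp(-K * vt / T) / exp(-K * 0.4 / T)

for vt in (0.4, 0.3, 0.2):
    print(vt, round(relative_leakage(vt), 1))   # leakage grows rapidly as Vt shrinks
```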

2. Dynamic power
P_dynamic = α·f·C·V^2, where α is the fraction of clock cycles in which a gate switches.
At a given design and technology node, higher frequency demands higher voltage. If chip size grows, total power grows. Non-ideal scaling of V_t → non-ideal scaling of V → non-ideal scaling of power. Power dissipation leads to heat generation → when heat is not removed adequately, it causes thermal hot spots → problems for reliability and leakage power. [Reda et al., TComp 11]
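A quick numeric sketch of the P = α·f·C·V^2 relation; the parameter values and the 15% scaling factors are hypothetical, chosen only to show the quadratic dependence on voltage:

```python
def dynamic_power(alpha, f_hz, c_farads, v_volts):
    """Dynamic switching power: alpha * f * C * V^2."""
    return alpha * f_hz * c_farads * v_volts ** 2

base = dynamic_power(0.1, 3.0e9, 1.0e-9, 1.1)
scaled = dynamic_power(0.1, 3.0e9 * 0.85, 1.0e-9, 1.1 * 0.85)

# Lowering f and V together by 15% cuts dynamic power by roughly 39%.
print(round(scaled / base, 3))   # ~0.614
```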

Reasons for the power wall
Scaling rules from one technology node to the next (e.g., 45 nm → 32 nm): area per core scales by 1/2; capacitance per core scales by 1/sqrt(2).
Let N be the number of cores at 45 nm and p the power per core, so total power = N·p.
Assuming the same frequency and the same total chip area at 32 nm, the chip now holds 2N cores, and total power = 2·N·p / (sqrt(2)·S), where S = (old voltage / new voltage)^2.
With a 45 nm voltage of 1.1 V and a 32 nm voltage of 1.0 V, S = 1.21 < sqrt(2), so total power in the same area, and hence power density, increases.
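A small script, assuming the idealized scaling rules above (same chip area and frequency, capacitance per core shrinking by 1/sqrt(2), core count doubling), that evaluates the power ratio for the 1.1 V → 1.0 V example:

```python
from math import sqrt

def power_ratio(v_old, v_new):
    """Total chip power at the new node relative to the old node, assuming
    2x cores, same area and frequency, and per-core capacitance scaled by 1/sqrt(2)."""
    S = (v_old / v_new) ** 2
    return 2 / (sqrt(2) * S)   # equivalently sqrt(2) / S

print(round(power_ratio(1.1, 1.0), 2))   # ~1.17: power density goes up
```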

Combined metrics for performance and power
Energy: the cost paid by the user, E = integral of p(t) dt from start to finish.
Energy per instruction (EPI, joules per instruction)
Energy-delay product (EDP)
MIPS/W and FLOPS/W
For clusters: Power Usage Effectiveness (PUE) = total facility power / IT equipment power [datacenterexperts.com]
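A minimal sketch of the energy integral above, numerically integrating a sampled power trace with the trapezoidal rule (the sample values are hypothetical):

```python
def energy_joules(times_s, power_w):
    """Approximate E = integral of p(t) dt using the trapezoidal rule."""
    return sum((t1 - t0) * (p0 + p1) / 2
               for t0, t1, p0, p1 in zip(times_s, times_s[1:], power_w, power_w[1:]))

# Hypothetical power samples (watts) taken once per second over a 4 s run.
times = [0, 1, 2, 3, 4]
power = [80, 95, 110, 100, 85]
print(energy_joules(times, power))   # energy in joules for the run
```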

Role of simulators
Computing systems are increasingly complex → we need simulators to evaluate new ideas and explore the design space. Simulation infrastructure should consider architectural issues (e.g., performance) as well as complex, intertwined physical phenomena (e.g., power, thermal, and reliability). Simulation is getting harder due to the need to simulate multi-threaded workloads on multi-core targets → the simulator itself can be single-threaded or multi-threaded.
Simulator taxonomy:
1. User-level versus full-system
2. Functional versus cycle-accurate
3. Trace-driven versus execution-driven

1. User-level vs. full-system simulators [Dubois]
User-level simulators: focus on simulating the microarchitecture, leaving out system components; system calls are treated as a black box.
Full-system simulators: model an entire computing system, including CPU, I/O, disks, and network.

2. Functional vs. cycle-accurate simulators [Dubois]
This classification is orthogonal to user-level vs. full-system.
Functionally accurate: the function of each instruction is executed without any microarchitectural detail. Fast but not timing-accurate.
Cycle-accurate: captures the details of all microarchitectural blocks and keeps track of timing. Accurate but slow.
Functional-first simulators combine the two: a functional front end executes instructions and feeds a timing model.

3. Trace-driven vs. execution-driven simulators
Trace-driven simulation: the benchmark is first executed on an ISA-compatible processor; each executed instruction is logged into a trace file; architectural state can be logged before and after OS calls and interrupts; the final trace is then fed into a cycle-accurate simulator.
Execution-driven simulation: there is no trace file; the benchmark is fed directly to the simulator, so all timing and functional aspects of the machine must be reproduced faithfully.

Summary
1. Trends: power wall, memory wall, parallelism wall. Frequency increases → power wall → multi-core → parallelism wall → fusion of heterogeneous units.
2. Quantifying performance: response time and throughput; computing means (arithmetic, geometric, harmonic).
3. Quantifying power: static and dynamic power, and the origins of the power wall.
4. Role of simulators: user-level vs. full-system, functional vs. cycle-accurate.