# Multicore and Parallel Processing

Save this PDF as:

Size: px
Start display at page:

## Transcription

1 Multicore and Parallel Processing Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter , 7.1 6

2 Administrivia FlameWar Games Night Next Friday, April 27 th 5pm in Upson B17 Please come, eat, drink and have fun No Lab4 or Lab Section next week! 2

3 Administrivia PA3: FlameWar is due next Monday, April 23 rd The goal is to have fun with it Recitations today will talk about it HW6 Due next Tuesday, April 24 th Prelim3 next Thursday, April 26 th 3

4 xkcd/619 4

5 Pitfall: Amdahl s Law Execution time after improvement = affected execution time amount of improvement + execution time unaffected 5

6 Pitfall: Amdahl s Law Improving an aspect of a computer and expecting a proportional improvement in overall performance Example: multiply accounts for 80s out of 100s How much improvement do we need in the multiply performance to get 5 overall improvement? 6

7 Scaling Example Workload: sum of 10 scalars, and matrix sum Speed up from 10 to 100 processors? Single processor: Time = ( ) t add 10 processors 100 processors Assumes load can be balanced across processors 7

8 Scaling Example What if matrix size is ? Single processor: Time = ( ) t add 10 processors 100 processors Assuming load balanced 8

9 Goals for Today How to improve System Performance? Instruction Level Parallelism (ILP) Multicore Increase clock frequency vs multicore Beware of Amdahls Law Next time: Concurrency, programming, and synchronization 9

10 Problem Statement Q: How to improve system performance? Increase CPU clock rate? But I/O speeds are limited Disk, Memory, Networks, etc. Recall: Amdahl s Law Solution: Parallelism 10

11 Instruction Level Parallelism (ILP) Pipelining: execute multiple instructions in parallel Q: How to get more instruction level parallelism? A: Deeper pipeline E.g. 250MHz 1 stage; 500Mhz 2 stage; 1GHz 4 stage; 4GHz 16 stage Pipeline depth limited by max clock speed (less work per stage shorter clock cycle) min unit of work dependencies, hazards / forwarding logic 11

12 Instruction Level Parallelism (ILP) Pipelining: execute multiple instructions in parallel Q: How to get more instruction level parallelism? A: Multiple issue pipeline Start multiple instructions per clock cycle in duplicate stages 12

13 Static Multiple Issue Static Multiple Issue a.k.a. Very Long Instruction Word (VLIW) Compiler groups instructions to be issued together Packages them into issue slots Q: How does HW detect and resolve hazards? A: It doesn t. Simple HW, assumes compiler avoids hazards Example: Static Dual Issue 32 bit MIPS Instructions come in pairs (64 bit aligned) One ALU/branch instruction (or nop) One load/store instruction (or nop) 13

14 MIPS with Static Dual Issue Two issue packets One ALU/branch instruction One load/store instruction 64 bit aligned ALU/branch, then load/store Pad an unused instruction with nop Address Instruction type Pipeline Stages n ALU/branch IF ID EX MEM WB n + 4 Load/store IF ID EX MEM WB n + 8 ALU/branch IF ID EX MEM WB n + 12 Load/store IF ID EX MEM WB n + 16 ALU/branch IF ID EX MEM WB n + 20 Load/store IF ID EX MEM WB 14

15 Scheduling Example Schedule this for dual issue MIPS Loop: lw \$t0, 0(\$s1) # \$t0=array element addu \$t0, \$t0, \$s2 # add scalar in \$s2 sw \$t0, 0(\$s1) # store result addi \$s1, \$s1, 4 # decrement pointer bne \$s1, \$zero, Loop # branch \$s1!=0 ALU/branch Load/store cycle Loop: nop lw \$t0, 0(\$s1) 1 addi \$s1, \$s1, 4 nop 2 addu \$t0, \$t0, \$s2 nop 3 bne \$s1, \$zero, Loop sw \$t0, 4(\$s1) 4 IPC = 5/4 = 1.25 (c.f. peak IPC = 2) 15

16 Scheduling Example Compiler scheduling for dual issue MIPS Loop: lw \$t0, 0(\$s1) # \$t0 = A[i] lw \$t1, 4(\$s1) # \$t1 = A[i+1] addu \$t0, \$t0, \$s2 # add \$s2 addu \$t1, \$t1, \$s2 # add \$s2 sw \$t0, 0(\$s1) # store A[i] sw \$t1, 4(\$s1) # store A[i+1] addi \$s1, \$s1, +8 # increment pointer bne \$s1, \$s3, TOP # continue if \$s1!=end ALU/branch slot Load/store slot cycle Loop: nop lw \$t0, 0(\$s1) 1 nop lw \$t1, 4(\$s1) 2 addu \$t0, \$t0, \$s2 nop 3 addu \$t1, \$t1, \$s2 sw \$t0, 0(\$s1) 4 addi \$s1, \$s1, +8 sw \$t1, 4(\$s1) 5 bne \$s1, \$s3, TOP nop 6 16

17 Scheduling Example Compiler scheduling for dual issue MIPS Loop: lw \$t0, 0(\$s1) # \$t0 = A[i] lw \$t1, 4(\$s1) # \$t1 = A[i+1] addu \$t0, \$t0, \$s2 # add \$s2 addu \$t1, \$t1, \$s2 # add \$s2 sw \$t0, 0(\$s1) # store A[i] sw \$t1, 4(\$s1) # store A[i+1] addi \$s1, \$s1, +8 # increment pointer bne \$s1, \$s3, TOP # continue if \$s1!=end ALU/branch slot Load/store slot cycle Loop: nop lw \$t0, 0(\$s1) 1 addi \$s1, \$s1, +8 lw \$t1, 4(\$s1) 2 addu \$t0, \$t0, \$s2 nop 3 addu \$t1, \$t1, \$s2 sw \$t0, 8(\$s1) 4 bne \$s1, \$s3, Loop sw \$t1, 4(\$s1) 5 17

18 Limits of Static Scheduling Compiler scheduling for dual issue MIPS lw \$t0, 0(\$s1) addi \$t0, \$t0, +1 sw \$t0, 0(\$s1) lw \$t0, 0(\$s2) addi \$t0, \$t0, +1 sw \$t0, 0(\$s2) # load A # increment A # store A # load B # increment B # store B ALU/branch slot Load/store slot cycle nop lw \$t0, 0(\$s1) 1 nop nop 2 addi \$t0, \$t0, +1 nop 3 nop sw \$t0, 0(\$s1) 4 nop lw \$t0, 0(\$s2) 5 nop nop 6 addi \$t0, \$t0, +1 nop 7 nop sw \$t0, 0(\$s2) 8 18

19 Dynamic Multiple Issue Dynamic Multiple Issue a.k.a. SuperScalar Processor (c.f. Intel) CPU examines instruction stream and chooses multiple instructions to issue each cycle Compiler can help by reordering instructions. but CPU is responsible for resolving hazards Even better: Speculation/Out of order Execution Execute instructions as early as possible Aggressive register renaming Guess results of branches, loads, etc. Roll back if guesses were wrong Don t commit results until all previous insts. are retired 19

20 Does Multiple Issue Work? Q: Does multiple issue / ILP work? A: Kind of but not as much as we d like Limiting factors? Programs dependencies Hard to detect dependencies be conservative e.g. Pointer Aliasing: A[0] += 1; B[0] *= 2; Hard to expose parallelism Can only issue a few instructions ahead of PC Structural limits Memory delays and limited bandwidth Hard to keep pipelines full 20

21 Power Efficiency Q: Does multiple issue / ILP cost much? A: Yes. Dynamic issue and speculation requires power CPU Year Clock Rate Pipeline Stages Issue width Out-of-order/ Speculation Cores Power i MHz 5 1 No 1 5W Pentium MHz 5 2 No 1 10W Pentium Pro MHz 10 3 Yes 1 29W P4 Willamette MHz 22 3 Yes 1 75W UltraSparc III MHz 14 4 No 1 90W P4 Prescott MHz 31 3 Yes 1 103W Core MHz 14 4 Yes 2 75W UltraSparc T MHz 6 1 No 8 70W Multiple simpler cores may be better? 21

22 Moore s Law Dual core Itanium 2 K10 Itanium 2 K8 P4 Atom Pentium

23 Moore s law Why Multicore? A law about transistors Smaller means more transistors per die And smaller means faster too But: Power consumption growing too 23

24 Power Limits Surface of Sun Rocket Nozzle Nuclear Reactor Hot Plate Xeon 24

25 Power Wall Power = capacitance * voltage 2 * frequency In practice: Power ~ voltage 3 Reducing voltage helps (a lot)... so does reducing clock speed Better cooling helps The power wall We can t reduce voltage further We can t remove more heat 25

26 Why Multicore? Performance Power 1.2x 1.7x Single Core Overclocked +20% Performance Power 1.0x 1.0x Single Core Performance Power 0.8x 0.51x 1.02x 1.6x Dual Core Single Core Underclocked 20% 26

27 Inside the Processor AMD Barcelona Quad Core: 4 processor cores 27

28 Inside the Processor Intel Nehalem Hex Core 28

29 Programs: Num. Pipelines: Pipeline Width: Hyperthreading Multi Core vs. Multi Issue vs. HT Hyperthreads (Intel) Illusion of multiple cores on a single core Easy to keep HT pipelines full + share functional units 29

30 Example: All of the above 30

31 Parallel Programming Q: So lets just all use multicore from now on! A: Software must be written as parallel program Multicore difficulties Partitioning work Coordination & synchronization Communications overhead Balancing load over cores How do you write parallel programs?... without knowing exact underlying architecture? 31

32 Work Partitioning Partition work so all cores have something to do 32

33 Load Balancing Load Balancing Need to partition so all cores are actually working 33

34 Amdahl s Law If tasks have a serial part and a parallel part Example: step 1: divide input data into n pieces step 2: do work on each piece step 3: combine all results Recall: Amdahl s Law As number of cores increases time to execute parallel part? goes to zero time to execute serial part? Remains the same Serial part eventually dominates 34

35 Amdahl s Law 35

36 Parallel Programming Q: So lets just all use multicore from now on! A: Software must be written as parallel program Multicore difficulties Partitioning work Coordination & synchronization Communications overhead Balancing load over cores How do you write parallel programs?... without knowing exact underlying architecture? 36

### Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

### Advanced d Processor Architecture. Computer Systems Laboratory Sungkyunkwan University

Advanced d Processor Architecture Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Modern Microprocessors More than just GHz CPU Clock Speed SPECint2000

### This Unit: Code Scheduling. CIS 371 Computer Organization and Design. Pipelining Review. Readings. ! Pipelining and superscalar review

This Unit: Code Scheduling CIS 371 Computer Organization and Design App App App System software Mem CPU I/O! Pipelining and superscalar review! Code scheduling! To reduce pipeline stalls! To increase ILP

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

### CISC 360. Computer Architecture. Seth Morecraft Course Web Site:

CISC 360 Computer Architecture Seth Morecraft (morecraf@udel.edu) Course Web Site: http://www.eecis.udel.edu/~morecraf/cisc360 Static & Dynamic Scheduling Scheduling: act of finding independent instructions

### INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

### Computer Architecture

Wider instruction pipelines 2016. május 13. Budapest Gábor Horváth associate professor BUTE Dept. Of Networked Systems and Services ghorvath@hit.bme.hu How to make the CPU faster Option 1: Make the pipeline

### VLIW Processors. VLIW Processors

1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

### Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

### Multicore Architectures

Multicore Architectures Week 1, Lecture 2 Multicore Landscape Intel Dual and quad-core Pentium family. 80-core demonstration last year. AMD Dual, triple (?!), and quad-core Opteron family. IBM Dual and

Advanced Pipelining Techniques 1. Dynamic Scheduling 2. Loop Unrolling 3. Software Pipelining 4. Dynamic Branch Prediction Units 5. Register Renaming 6. Superscalar Processors 7. VLIW (Very Large Instruction

### Module 2: "Parallel Computer Architecture: Today and Tomorrow" Lecture 4: "Shared Memory Multiprocessors" The Lecture Contains: Technology trends

The Lecture Contains: Technology trends Architectural trends Exploiting TLP: NOW Supercomputers Exploiting TLP: Shared memory Shared memory MPs Bus-based MPs Scaling: DSMs On-chip TLP Economics Summary

### Performance evaluation

Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography

### response (or execution) time -- the time between the start and the finish of a task throughput -- total amount of work done in a given time

Chapter 4: Assessing and Understanding Performance 1. Define response (execution) time. response (or execution) time -- the time between the start and the finish of a task 2. Define throughput. throughput

### Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, September 3rd, 2015 Prof. C. Silvano EX1A ( 2 points) EX1B ( 2

### Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

### on an system with an infinite number of processors. Calculate the speedup of

1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

### Introduction to Cloud Computing

Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

### Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

### Spring 2011 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim MIPS Architecture MIPS (Microprocessor without interlocked pipeline stages) MIPS Computer Systems Inc. Developed from Stanford MIPS architecture usages 1990 s R2000, R3000,

### Overview. CPU Manufacturers. Current Intel and AMD Offerings

Central Processor Units (CPUs) Overview... 1 CPU Manufacturers... 1 Current Intel and AMD Offerings... 1 Evolution of Intel Processors... 3 S-Spec Code... 5 Basic Components of a CPU... 6 The CPU Die and

### A Brief Review of Processor Architecture. Why are Modern Processors so Complicated? Basic Structure

A Brief Review of Processor Architecture Why are Modern Processors so Complicated? Basic Structure CPU PC IR Regs ALU Memory Fetch PC -> Mem addr [addr] > IR PC ++ Decode Select regs Execute Perform op

### The Motherboard Chapter #5

The Motherboard Chapter #5 Amy Hissom Key Terms Advanced Transfer Cache (ATC) A type of L2 cache contained within the Pentium processor housing that is embedded on the same core processor die as the CPU

### Basic Computer Organization

SE 292 (3:0) High Performance Computing L2: Basic Computer Organization R. Govindarajan govind@serc Basic Computer Organization Main parts of a computer system: Processor: Executes programs Main memory:

### How to improve (decrease) CPI

How to improve (decrease) CPI Recall: CPI = Ideal CPI + CPI contributed by stalls Ideal CPI =1 for single issue machine even with multiple execution units Ideal CPI will be less than 1 if we have several

### CS 159 Two Lecture Introduction. Parallel Processing: A Hardware Solution & A Software Challenge

CS 159 Two Lecture Introduction Parallel Processing: A Hardware Solution & A Software Challenge We re on the Road to Parallel Processing Outline Hardware Solution (Day 1) Software Challenge (Day 2) Opportunities

WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

### Intel microprocessor history. Intel x86 Architecture. Early Intel microprocessors. The IBM-AT

Intel x86 Architecture Intel microprocessor history Computer Organization and Assembly Languages g Yung-Yu Chuang with slides by Kip Irvine Early Intel microprocessors Intel 8080 (1972) 64K addressable

### Generations of the computer. processors.

. Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

### Pipelining Review and Its Limitations

Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic

### CS352H: Computer Systems Architecture

CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions

### SOC architecture and design

SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

### ! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):

### Parallel Algorithm Engineering

Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

### Introduction. Released November 20, 2000 NetBurst Architecture 42 million transistors Consumes 55 Watts of power at 1.5 GHz 3.

Intel Pentium 4 Introduction Released November 20, 2000 NetBurst Architecture 42 million transistors Consumes 55 Watts of power at 1.5 GHz 3.2 GB/s system buss NetBurst Retrieved from: http://www.ecs.umass.edu/ece/koren/ece568/papers/pentium4.pd

### A LOOK AT INTEL S NEW NEHALEM ARCHITECTURE: THE BLOOMFIELD AND LYNNFIELD FAMILIES AND THE NEW TURBO BOOST TECHNOLOGY

A LOOK AT INTEL S NEW NEHALEM ARCHITECTURE: THE BLOOMFIELD AND LYNNFIELD FAMILIES AND THE NEW TURBO BOOST TECHNOLOGY Asist.univ. Dragos-Paul Pop froind@gmail.com Romanian-American University Faculty of

### CS 152 Computer Architecture and Engineering. Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming

CS 152 Computer Architecture and Engineering Lecture 10 - Complex Pipelines, Out-of-Order Issue, Register Renaming Krste Asanovic Electrical Engineering and Computer Sciences University of California at

### Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached

### Intel Processor Pricing Effective 4/20/2008 1Ku Tray Units

Desktop Intel Core 2 Extreme Desktop Processor Intel Core 2 Extreme Processor QX9775 (12M Cache, 3.20 GHz, 1600 MHz FSB) Intel Core 2 Extreme Processor QX9770 (12M Cache, 3.20 GHz, 1600 MHz FSB) Intel

### CS4617 Computer Architecture

1/39 CS4617 Computer Architecture Lecture 20: Pipelining Reference: Appendix C, Hennessy & Patterson Reference: Hamacher et al. Dr J Vaughan November 2013 2/39 5-stage pipeline Clock cycle Instr No 1 2

### Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

### Design Cycle for Microprocessors

Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

### Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction

### Legal Notices and Important Information

Legal Notices and Important Information Intel processor numbers are not a measure of performance. Processor numbers differentiate features within each processor family, not across different processor families.

### Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

### Caches (Writing) P & H Chapter 5.2-3, 5.5. Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University

Caches (Writing) P & H Chapter 5.2-3, 5.5 Hakim Weatherspoon CS 3410, Spring 2013 Computer Science Cornell University Goals for Today: caches Writing to the Cache Write-through vs Write-back Cache Parameter

### Low Power AMD Athlon 64 and AMD Opteron Processors

Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD

### Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

### Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

### TDTS 08 Advanced Computer Architecture

TDTS 08 Advanced Computer Architecture [Datorarkitektur] www.ida.liu.se/~tdts08 Zebo Peng Embedded Systems Laboratory (ESLAB) Dept. of Computer and Information Science (IDA) Linköping University Contact

### Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

### CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012

1 CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012 1. [ 12 Points ] Datapath & Pipelining. (a) Consider a simple in-order five-stage pipeline with a two-cycle

### Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static

### Assignment 2 Solutions Instruction Set Architecture, Performance, Spim, and Other ISAs

Assignment 2 Solutions Instruction Set Architecture, Performance, Spim, and Other ISAs Alice Liang Apr 18, 2013 Unless otherwise noted, the following problems are from the Patterson & Hennessy textbook

### Scaling in a Hypervisor Environment

Scaling in a Hypervisor Environment Richard McDougall Chief Performance Architect VMware VMware ESX Hypervisor Architecture Guest Monitor Guest TCP/IP Monitor (BT, HW, PV) File System CPU is controlled

### Instruction Set Design

Instruction Set Design Instruction Set Architecture: to what purpose? ISA provides the level of abstraction between the software and the hardware One of the most important abstraction in CS It s narrow,

### CPU Performance Equation

CPU Performance Equation C T I T ime for task = C T I =Average # Cycles per instruction =Time per cycle =Instructions per task Pipelining e.g. 3-5 pipeline steps (ARM, SA, R3000) Attempt to get C down

### Dual Core Architecture: The Itanium 2 (9000 series) Intel Processor

Dual Core Architecture: The Itanium 2 (9000 series) Intel Processor COE 305: Microcomputer System Design [071] Mohd Adnan Khan(246812) Noor Bilal Mohiuddin(237873) Faisal Arafsha(232083) DATE: 27 th November

### Building Blocks. CPUs, Memory and Accelerators

Building Blocks CPUs, Memory and Accelerators Outline Computer layout CPU and Memory What does performance depend on? Limits to performance Silicon-level parallelism Single Instruction Multiple Data (SIMD/Vector)

### Lecture 3: Evaluating Computer Architectures. Software & Hardware: The Virtuous Cycle?

Lecture 3: Evaluating Computer Architectures Announcements - Reminder: Homework 1 due Thursday 2/2 Last Time technology back ground Computer elements Circuits and timing Virtuous cycle of the past and

### This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

### CSE 30321 Computer Architecture I Fall 2009 Final Exam December 18, 2009

CSE 30321 Computer Architecture I Fall 2009 Final Exam December 18, 2009 Test Guidelines: 1. Place your name on EACH page of the test in the space provided. 2. every question in the space provided. If

### The University of Nottingham

The University of Nottingham School of Computer Science A Level 1 Module, Autumn Semester 2007-2008 Computer Systems Architecture (G51CSA) Time Allowed: TWO Hours Candidates must NOT start writing their

### Minimum Hardware and OS Specifications

Hardware and OS Specifications File Stream Document Management Software System Requirements for v4.2 NB: please read through carefully, as it contains 4 separate specifications for a Workstation PC, a

### Multi-Core Processors: New Way to Achieve High System Performance

Multi-Core Processors: New Way to Achieve High System Performance Pawe Gepner EMEA Regional Architecture Specialist Intel Corporation pawel.gepner@intel.com Micha F. Kowalik Market Analyst Intel Corporation

### Intel Atom Processor. Michelle McDaniel and Jonathan Dorn

Intel Atom Processor Michelle McDaniel and Jonathan Dorn Introduction Completely new microarchitecture with very little in common with other Intel PC processors Designed with 3 primary goals: Dramatically

### Virtual Memory 2. Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University. P & H Chapter 5.4

Virtual Memory 2 Hakim Weatherspoon CS 3410, Spring 2012 Computer Science Cornell University P & H Chapter 5.4 Virtual Memory Address Translation Goals for Today Pages, page tables, and memory mgmt unit

### Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

### Computer Architecture TDTS10

why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

### Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

Software Pipelining for (i=1, i

### CSE 141 Midterm Exam

CSE 141 Midterm Exam 2011 Winter Professor Steven Swanson 1. Please write your name at the top of each page 2. This is a close book, closed notes exam. No outside material may be used. 3. You may use a

### Components of the System Unit

Components of the System Unit The System Unit A case that contains the electronic components of the computer used to process data. The System Unit The case of the system unit, or chassis, is made of metal

### Architecture and Programming of x86 Processors

Brno University of Technology Architecture and Programming of x86 Processors Microprocessor Techniques and Embedded Systems Lecture 12 Dr. Tomas Fryza December 2012 Contents A little bit of one-core Intel

### A Lab Course on Computer Architecture

A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

### Analysis (III) Low Power Design. Kai Huang

Analysis (III) Low Power Design Kai Huang Chinese new year: 1.3 billion urban exodus 1/28/2014 Kai.Huang@tum 2 The interactive map, which is updated hourly The thicker, brighter lines are the busiest routes.

### Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

### Quiz for Chapter 2 Instructions: Language of the Computer3.10

Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [5 points] Prior to the early 1980s, machines

### Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com

CSCI-GA.3033-012 Graphics Processing Units (GPUs): Architecture and Programming Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Modern GPU

### Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

### Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

### 2a - Dynamic Instruction Level Parallelism (ILP)

2a - Dynamic Instruction Level Parallelism (ILP) The P6 microarchitecture (Pentium III) The NetBurst microarchitecture (Pentium 4) The Intel Core Microarchitecture (Core e Core 2) The Nehalem/Westmere

### Module: Software Instruction Scheduling Part I

Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section

### CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

### Machine Architecture. ITNP23: Foundations of Information Technology Una Benlic 4B69

Machine Architecture ITNP23: Foundations of Information Technology Una Benlic 4B69 ube@cs.stir.ac.uk Basic Machine Architecture In these lectures we aim to: Understand the basic architecture of a simple

### COMP303 Computer Architecture Lecture 16. Virtual Memory

COMP303 Computer Architecture Lecture 6 Virtual Memory What is virtual memory? Virtual Address Space Physical Address Space Disk storage Virtual memory => treat main memory as a cache for the disk Terminology:

### Apple MD770 Mac Pro 3.2 GHz Quad-Core Intel Xeon

Apple MD770 Mac Pro 3.2 GHz Quad-Core Intel Xeon 14570468_4766.jpg Mac Pro, Intel Xeon W3565 3.2GHz, 6GB 1066MHz DDR3, 1TB SATA II 7200 rpm, 18x DVD-/+RW/+DL, Radeon HD 5770 1GB, Gigabit Ethernet, Wi-Fi

### Computer Architectures

Computer Architectures 2. Instruction Set Architectures 2015. február 12. Budapest Gábor Horváth associate professor BUTE Dept. of Networked Systems and Services ghorvath@hit.bme.hu 2 Instruction set architectures

### Lecture 3: Single processor architecture and memory

Lecture 3: Single processor architecture and memory David Bindel 30 Jan 2014 Logistics Raised enrollment from 75 to 94 last Friday. Current enrollment is 90; C4 and CMS should be current? HW 0 (getting

### 64-Bit versus 32-Bit CPUs in Scientific Computing

64-Bit versus 32-Bit CPUs in Scientific Computing Axel Kohlmeyer Lehrstuhl für Theoretische Chemie Ruhr-Universität Bochum March 2004 1/25 Outline 64-Bit and 32-Bit CPU Examples

### Topic 2 Introduction to ISAs for Embedded Systems. The slides used for this lecture were contributed in part By Tulika

Topic 2 Introduction to ISAs for Embedded Systems The slides used for this lecture were contributed in part By Tulika Mitra(tulika@comp.nus.edu.sg). What is an ISA? 2 Definition from Amdahl, Blaaw, and

### Heterogeneity-Conscious Parallel Query Execution: Getting a better mileage while driving faster!

Heterogeneity-Conscious Parallel Query Execution: Getting a better mileage while driving faster Tobias Mühlbauer, Wolf Rödiger, Robert Seilbeck, Alfons Kemper, Thomas Neumann Technische Universität München

Memory - DDR1, DDR2, and DDR3 Brought to you by http://www.rmroberts.com please visit our site! DDR1 Double Data Rate-SDRAM, or simply DDR1, was designed to replace SDRAM. DDR1 was originally referred

### LSN 2 Computer Processors

LSN 2 Computer Processors Department of Engineering Technology LSN 2 Computer Processors Microprocessors Design Instruction set Processor organization Processor performance Bandwidth Clock speed LSN 2

### Model Answers HW1 Chapter #2

Model Answers HW1 Chapter #2 2.2. a. On the IAS, what would the machine code instruction look like to load the contents of memory address 2? b. How many trips to memory does the CPU need to make to complete

### Why Intel is designing multi-core processors. Geoff Lowney Intel Fellow, Microprocessor Architecture and Planning July, 2006

Why Intel is designing multi-core processors Geoff Lowney Intel Fellow, Microprocessor Architecture and Planning Moore s Law at Intel 970-2005 0 Feature Size (um).e+09 Transistor Count 0.79x.E+07 2.0x

### Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

### Chapter 7. Multicores, Multiprocessors, and Clusters

Chapter 7 Multicores, Multiprocessors, and Clusters Introduction Goal: connecting multiple computers to get higher performance Multiprocessors Scalability, availability, power efficiency Job-level (process-level)