Week 1 out-of-class notes, discussions and sample problems




Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types of processors.

Handheld/mobile devices: we need powerful processors that are energy efficient (due to battery restrictions) and produce little heat (due to the lack of a fan), yet offer real-time performance and graphics processing. The ARM family of processors is the most common, based on the Acorn RISC Machine, first introduced in the early 80s. In the late 80s, Apple began working with Acorn on new ARM processor cores, and it is these that most current ARM processors are based on. The ARM family is generally characterized by the following features:
- Load/store instruction set
- 16 32-bit registers, some of which are reserved for the OS
- Fixed-length 32-bit instructions
- Single-clock-cycle execution for most instructions
- Conditional execution rather than branch prediction
- Condition codes set only if specified
- Indexed addressing modes

The early ARM processors used a 3-stage pipeline, later expanded to as many as 13 stages. Branch prediction was added to later versions to improve over conditional execution (we will talk about conditional execution later in the semester). Later versions also implemented the Thumb instruction set, which consists of 16-bit instructions, allowing two instructions to be fetched in a single 32-bit access and possibly executed together. Thumb instructions can be placed inside of ordinary code. The idea behind ARM is a scaled-back ISA so that the processors can squeeze a good deal of parallelism out of the code. Since most handheld devices run only one or a few apps at a time, there is less need for large memories, the fastest clock speeds or the power found in larger computers. This keeps power consumption and heat production down, and keeps the cost down.
Desktop/laptop computers: for these devices, we need to manage the tradeoff between price and performance. Obviously users want better performance but are only willing to spend between $300 and $2500 on a desktop/laptop unit. The largest requirements are to support modest multitasking (e.g., up to 10 processes at a time), graphics and other forms of multimedia, Internet communication and common forms of productivity software, as well as the luxury of running a complex operating system that handles user duties with little interaction. Memory requirements are somewhat lofty because users will multitask and because the Windows and Mac operating systems are large. This requires not only 4 GB-8 GB of RAM but also as many as 3 levels of cache, organized so that cache performance does not negatively impact the processor. Additionally, modern processors give off a good amount of heat, so cooling fans must be available. The most common PC processors today are the latest generations of Intel Pentium, Xeon, Celeron and now Core processors. AMD is currently one of the few competitors in the PC market, offering the FX, Phenom II and Athlon II processors.

Servers: introduced in the 1980s as file servers, servers are now more generically titled and range in usage from simple file servers (often found in LANs, or used as web or database servers on the Internet) to servicing distributed processing for ATM machines, airline reservations and on-line services (e.g., the Amazon web site, the Google search engine). In the latter case, the authors refer to this as a cluster or warehouse-scale computer. Cloud computing also fits in this category. Higher-end servers reach supercomputer status. Costs for a server range from $5K through $10M, and up to $200M for a cluster. The most important aspect of this class of computer is throughput: the number of services handled per unit of time.
Throughput is impacted as much by memory capacity and telecommunications as by processor capability. Scalability is another important feature, primarily impacted by how easy it is to add memory and hard disk space to the computer(s). The server/cluster end has largely replaced the mainframe computers of old. There is a wide range of processors used by servers, but the more significant performance increase comes from multiprocessing rather than from the improvements made to a single processor that we see in the PC market.

Embedded computers: at the other extreme from clusters is the embedded computer, a processor embedded in another device (e.g., microwave oven, car engine). These devices are often 8-bit or 16-bit processors with minimal storage and modest power requirements. They often cost less than $5 and seldom cost more than $100.

What we want to cover in this class is the common set of processor improvements that are now standard in most processors, no matter which platform they are intended for. The primary tool for processor improvement is parallelism. There are many forms of parallelism, which the authors divide between data-level and task-level. We implement these using instruction-level parallelism through pipelining and speculative execution, vector-level parallelism using an SIMD-style architecture, thread-level parallelism, and request-level parallelism (not covered in this course). The two main efforts to achieve parallelism are through the processor and through the cache. In the processor, we use pipelining and multiple functional units so that one or more instructions can be issued each clock cycle. Due to the complexity of modern processors, instructions may finish execution out of order, and therefore we need additional hardware to re-order the instructions upon completion. In the cache, we want to ensure as few cache misses as possible so that neither the instruction issue stage of the pipeline nor an instruction waiting on memory stalls. So, the principle of locality of reference is applied. We will visit many other cache improvements later in the semester. Above all, we focus on the common case.
As we saw in class, Amdahl's Law shows us that no matter what level of speedup we might achieve through some improvement, it is the common case that will win out. Consider, for instance, an improvement that can be used 80% of the time and increases performance by 50% versus an improvement that can be used 25% of the time and increases performance by a factor of 10.

Improvement 1: 1 / (1 - .8 + .8 / 1.5) = 1.36 (36% speedup)
Improvement 2: 1 / (1 - .25 + .25 / 10) = 1.29 (29% speedup)

Later in the semester, we will look at the initial x86 pipeline and see how the CISC features of x86 complicated the pipeline to the point of poor performance. We cover the MIPS instruction set because it is a model instruction set to aim for; that is, it was designed specifically to promote an efficient pipeline. MIPS was originally developed in the early 80s. Because of this, it lacks some features that we now want present to help support further parallelism. For instance, there are no vector processing instructions in MIPS. We will briefly visit this later in the semester. Neither are there graphics processing instructions (we will not examine these although they are in the textbook). As covered in class, the typical MIPS processor uses a 5-stage fetch-execute cycle. Next week, in the out-of-class notes, you will compare it to the MIPS R4000, which uses an 8-stage fetch-execute cycle. We wrap up the out-of-class portion of the notes by looking at several example problems. Also visit the discussion board.

1. It seems that a quad-core processor should speed up a computer by a factor of 4, but it doesn't. Use Amdahl's Law to compute the percentage of program execution that should be distributed across the cores to achieve an overall speedup of 3. Of 2. Of 1.5. Of 1.25.

Answer: We want to solve for x in y = 1 / (1 - x + x / 4), where y is 3, 2, 1.5 and 1.25. This involves a little algebra, but we wind up with x = 4 / 3 * (1 - 1 / y). For y = 3, x = .889. For y = 2, x = .667. For y = 1.5, x = .444. For y = 1.25, x = .267.
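These speedups can be checked with a short script. This is a minimal sketch, not course material; the helper name amdahl and the rounding are our own choices:

```python
# Amdahl's Law: overall speedup when an improvement with local speedup s
# applies to fraction f of the execution.
def amdahl(f, s):
    return 1.0 / ((1.0 - f) + f / s)

# Improvement 1: usable 80% of the time, 1.5x faster locally.
print(round(amdahl(0.80, 1.5), 2))   # 1.36

# Improvement 2: usable 25% of the time, 10x faster locally.
print(round(amdahl(0.25, 10), 2))    # 1.29

# Problem 1: solve y = 1 / (1 - x + x/4) for x, i.e. x = (4/3) * (1 - 1/y).
for y in (3, 2, 1.5, 1.25):
    print(y, round(4.0 / 3.0 * (1.0 - 1.0 / y), 3))
```

Note that the same amdahl function covers both cases: a quad core is just an "improvement" with local speedup 4 applied to the parallelizable fraction.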
So to achieve a speedup of 1.25, all four cores must be in use about 26.7% of the time, but to achieve a speedup of 3, all four cores must be in use 88.9% of the time.

2. Let's compare a CISC machine versus a RISC machine on a benchmark. Assume the following characteristics of the two machines.
CISC: CPI of 4 for load/store, 3 for ALU/branch and 10 for call/return; CPU clock rate of 2.75 GHz.
RISC: CPI of 1.4 (the machine is pipelined and the ideal CPI is 1.0, but overhead and stalls make it 1.4); CPU clock rate of 2 GHz.
Since the CISC machine has more complex instructions, the IC for the CISC machine is 40% smaller than the IC for the RISC machine. The benchmark has a breakdown of 38% loads, 10% stores, 35% ALU operations, 3% calls, 3% returns and 11% branches. Which machine will run the benchmark in less time, and by how much?

Answer: use CPU time = IC * CPI * Clock cycle time.
RISC: IC_RISC * CPI_RISC * Clock cycle time_RISC = IC_RISC * 1.4 * 1 / 2 GHz = 0.7 ns * IC_RISC
CISC: IC_CISC * CPI_CISC * Clock cycle time_CISC = IC_RISC * 0.6 * (4 * .38 + 4 * .10 + 3 * .35 + 10 * .03 + 10 * .03 + 3 * .11) * 1 / 2.75 GHz = IC_RISC * 0.6 * 3.9 / 2.75 GHz = 0.851 ns * IC_RISC
Since the CISC machine has the higher CPU time, the RISC machine is faster by 0.851 / 0.7 = 1.216, or about 22%.

3. The MIPS instruction set passes parameters through memory, thus slowing down function calls. An alternate architecture, Berkeley RISC, uses register windows. Register windows place the local variables of a function into a set of registers. Those being passed as parameters to another function are placed into another set of registers which overlap the registers available to the called function; thus, the window consists of overlapping registers. See the figure below. Let's assume that using

register windows causes memory accesses to be replaced by register operations, so rather than accruing the CPI of a load or store for each parameter, each parameter accrues the CPI of an ALU operation. Assume we have the following CPI breakdown: loads/stores: 4, ALU and unconditional branches: 2, conditional branches: 3, procedure calls and returns: 15. Architects are trying to decide whether to use additional registers in a CPU for register windows or just for more registers in the register file. If we go with ordinary registers, the number of loads and stores is reduced by 40% and 30% respectively (because we can put more into registers). If we go with register windows, the procedure call/return CPI is greatly reduced; let's assume the CPI of a procedure call reduces to 4.5 and that of a return reduces to 3. Which should we use for a benchmark of 40% loads, 13% stores, 31% ALU, 8% conditional branches, 2% unconditional branches, 3% procedure calls and 3% returns?

Answer: CPU Time = IC * CPI * Clock Cycle Time. The last value will not change between the two approaches. If we use register windows, CPI reduces, and if we add more registers, IC reduces because of fewer loads and stores.
CPI_original = .40 * 4 + .13 * 4 + .31 * 2 + .08 * 3 + .02 * 2 + .03 * 15 + .03 * 15 = 3.92
CPI_regwindows = .40 * 4 + .13 * 4 + .31 * 2 + .08 * 3 + .02 * 2 + .03 * 4.5 + .03 * 3 = 3.245
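The two CPI figures can be verified with a quick weighted-average tally. A minimal sketch (the dictionary keys are our own shorthand, not from the problem):

```python
# Weighted-average CPI for the register-window problem.
# CPI values and instruction mix come from the problem statement above.
cpi = {"load": 4, "store": 4, "alu": 2, "cbr": 3, "ubr": 2, "call": 15, "ret": 15}
mix = {"load": 0.40, "store": 0.13, "alu": 0.31, "cbr": 0.08, "ubr": 0.02,
       "call": 0.03, "ret": 0.03}

cpi_original = sum(mix[k] * cpi[k] for k in mix)
print(round(cpi_original, 3))     # 3.92

# With register windows, call/return CPIs drop to 4.5 and 3.
cpi_win = dict(cpi, call=4.5, ret=3)
cpi_regwindows = sum(mix[k] * cpi_win[k] for k in mix)
print(round(cpi_regwindows, 3))   # 3.245
```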
We have to figure out the new breakdown of instructions if we have fewer loads and stores:
.40 * .40 = .16, so 16% fewer loads
.13 * .30 = .039, so 3.9% fewer stores
So there will be .16 + .039 = .199 fewer instructions, and we now recompute the breakdown of instructions given an IC of 1.00 - .199 = .801:
Loads = (.40 - .16) / .801 = .300
Stores = (.13 - .039) / .801 = .114
ALU = .31 / .801 = .387
Conditional branches = .08 / .801 = .100
Unconditional branches = .02 / .801 = .025
Procedure calls = .03 / .801 = .037
Returns = .03 / .801 = .037
New CPI = .300 * 4 + .114 * 4 + .387 * 2 + .100 * 3 + .025 * 2 + .037 * 15 + .037 * 15 = 3.89
IC_registers = .801 * IC_original
CPU Time_register windows = IC_original * 3.245 * Clock cycle time = 3.245 * IC_original * clock cycle time
CPU Time_new registers = IC_original * .801 * 3.89 * Clock cycle time = 3.116 * IC_original * clock cycle time
The version using the extra registers as ordinary registers is faster, so the speedup of using them as ordinary registers instead of as register windows is 3.245 / 3.116 = 1.041, or a little over 4%.

4. In the 1980s and 1990s, architects debated whether the RISC or CISC approach was better. The list below denotes some of the differences in philosophy between the two forms of architecture. For each of the following, explain how it would improve CPU time in terms of which factor in our CPU time formula would be decreased: IC, CPI, Clock Cycle Time, or some combination. NOTE: some of these may increase other factors, but you do not need to discuss what increases, only what decreases.
a. In RISC, there is a great number of registers available, less so in a CISC machine

b. In CISC, there can be complex addressing modes such as indirect addressing to obtain the datum pointed to by a pointer
c. In RISC, a pipeline is used to perform each part of the fetch-execute cycle as an independent stage
d. In CISC, variable-sized instruction lengths are common so that multiple memory operands can be accessed by the same instruction

Answers:
a. With more registers, there is less need for loads and stores, so IC decreases. However, since CISC machines often have memory-register operations (such as add x, y, z), the actual impact is most felt in CPI: the add instruction in a RISC machine will have a low CPI since its operands must be in registers, whereas the CISC add instruction will have a much higher CPI if it involves accessing memory one or more times per instruction.
b. The complex addressing modes allow memory accesses in single operations, whereas in a RISC architecture without complex addressing modes, something like indirect addressing takes multiple operations; therefore this feature lowers IC.
c. Since all operations are pipelined, their CPI is reduced to approximately 1; therefore the pipeline lowers CPI.
d. The variable-sized instruction length allows instructions to carry out multiple tasks, and therefore fewer instructions are needed, lowering IC.

5. Let's see what might happen if we add a register-memory ALU mode to MIPS. We could replace the two instructions
LW R1, 0(R2)
DADDU R3, R3, R1
with
DADDU R3, 0(R2)
So that the new instruction fits in the 32-bit instruction format, we restrict it to be a two-operand instruction where the first operand is both a source and a destination register. Assume that to accommodate the memory fetch as part of this instruction, we increase the clock cycle time by 15%. Using the gcc benchmark (see figure A.27, p. A-41), what percentage of loads would have to be eliminated so that this new mode can execute gcc in the same amount of time?
Answer: We want CPU time_old = CPU time_new, where CPU time = IC * CPI * Clock Cycle Time. We will assume that CPI will not change, and we know Clock Cycle Time_new is 15% longer than Clock Cycle Time_old. So, to balance out, we need IC_new * 1.15 = IC_old, or IC_new = IC_old / 1.15 = .87 * IC_old; that is, we must eliminate about 13% of the instructions. Each eliminated load removes one instruction, and since loads make up 25.1% of the total, we have to eliminate 13% / 25.1% = 52% of the loads. (If we approximate by simply requiring IC_new to be 15% smaller, we get the rougher answer of 15% / 25.1% = 60% of the loads.)

6. The autoincrement and autodecrement modes are common in CISC computers. These modes are used when accessing an array, automatically incrementing or decrementing the register storing the address. The change occurs after the access for the increment, and before the access for the decrement. Let's see what happens in some standard array code with the new mode:
for(i=0;i<1000;i++) a[i]=b[i]+c[i];
Assume that R1, R2, and R3 store the starting addresses of arrays a, b and c respectively, and that they are all int arrays. If we introduce an autoincrement load like LWI Rx, 0(Ry) in place of the LW instruction of MIPS, how will it impact the performance? Below are the two sets of code, without and with the autoincrement instructions. The CPI for our machine is as follows: 5 for loads/stores, 2

for ALU operations and 3 for branches. The autoincrement load/store also has a CPI of 5, but requires that we lengthen the clock cycle by 25%. Is the new mode worth pursuing?

Original code:
     DADD R4, R0, R0      // R4 is the loop variable i
     DADDI R5, R0, #1000  // R5 = 1000
top: DSUB R6, R5, R4
     BEQZ R6, out         // exit for loop after 1000 iterations
     LW R7, 0(R2)         // R7 = b[i]
     LW R8, 0(R3)         // R8 = c[i]
     DADD R9, R7, R8      // R9 = b[i] + c[i]
     SW R9, 0(R1)         // a[i] = R9
     DADDI R1, R1, #4
     DADDI R2, R2, #4
     DADDI R3, R3, #4
     DADDI R4, R4, #1
     J top
out: ...

Autoincrement code:
     DADD R4, R0, R0      // R4 is the loop variable i
     DADDI R5, R0, #1000  // R5 = 1000
top: DSUB R6, R5, R4
     BEQZ R6, out         // exit for loop after 1000 iterations
     LWI R7, 0(R2)        // R7 = b[i], then R2 += 4
     LWI R8, 0(R3)        // R8 = c[i], then R3 += 4
     DADD R9, R7, R8      // R9 = b[i] + c[i]
     SWI R9, 0(R1)        // a[i] = R9, then R1 += 4
     DADDI R4, R4, #1
     J top
out: ...

Answer: We compare the two CPU times, where CPU Time = IC * CPI * Clock Cycle Time. The original machine has a shorter clock cycle time, while the newer machine has a reduced IC * CPI because we can remove three of the DADDI instructions.
CPU Time_original = IC * CPI * Clock Cycle Time_original
CPU Time_new = IC * CPI * Clock Cycle Time_new
We compute IC * CPI as follows. The original code has 2 ALU operations outside of the loop plus a loop of 6 ALU, 2 branch and 3 load/store instructions per iteration. This gives us a total IC * CPI = 2 * 2 + 1000 * (6 * 2 + 2 * 3 + 3 * 5) = 33,004 clock cycles. The new code has 2 ALU operations outside of the loop plus a loop of 3 ALU, 2 branch and 3 autoincrement load/store instructions per iteration. This gives us a total IC * CPI = 2 * 2 + 1000 * (3 * 2 + 2 * 3 + 3 * 5) = 27,004 clock cycles.
Clock Cycle Time_new = Clock Cycle Time_old * 1.25
CPU Time_old = 33,004 * Clock Cycle Time_old

CPU Time_new = 27,004 * Clock Cycle Time_new = 27,004 * Clock Cycle Time_old * 1.25
Speedup = CPU Time_old / CPU Time_new = 33,004 / (27,004 * 1.25) = 0.978, so we see a slowdown, not a speedup.

7. As an alternative to #6, let's assume that the clock speed does not change, but that the CPI for LWI and SWI is 6. Is the change worth it?

Answer: Here, clock cycle time does not change, so we only have to compare IC * CPI for the two machines. The old machine's IC * CPI does not change. The new machine has IC * CPI = 2 * 2 + 1000 * (3 * 2 + 2 * 3 + 3 * 6) = 30,004. Since this is a reduction, the new mode would be worth it in this case. The speedup is 33,004 / 30,004 = 1.10.
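The cycle counts in problems 6 and 7 can be tallied with a short script. This is a minimal sketch; the helper total_cycles and its parameter names are our own, not from the course:

```python
# IC * CPI tallies for the array-copy loop of problems 6 and 7.
def total_cycles(alu, branch, ldst, cpi_ldst, cpi_alu=2, cpi_br=3, iters=1000):
    # 2 setup ALU ops before the loop, then per-iteration instruction costs.
    return 2 * cpi_alu + iters * (alu * cpi_alu + branch * cpi_br + ldst * cpi_ldst)

original = total_cycles(alu=6, branch=2, ldst=3, cpi_ldst=5)
autoinc6 = total_cycles(alu=3, branch=2, ldst=3, cpi_ldst=5)
print(original, autoinc6)          # 33004 27004

# Problem 6: the autoincrement version pays a 25% longer clock cycle.
print(round(original / (autoinc6 * 1.25), 3))   # 0.978, a slowdown

# Problem 7: same clock, but LWI/SWI have a CPI of 6.
autoinc7 = total_cycles(alu=3, branch=2, ldst=3, cpi_ldst=6)
print(round(original / autoinc7, 2))            # about 1.10, a speedup
```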