CIS371 Computer Organization and Design Final Exam Solutions Prof. Martin Wednesday, May 2nd, 2012
1. [ 12 Points ] Datapath & Pipelining.

(a) Consider a simple in-order five-stage pipeline with a two-cycle branch misprediction penalty and a single-cycle load-use delay penalty. For a specific program, 30% of the instructions are loads, 20% are branches, and the remaining 50% of instructions are simple single-cycle ALU operations. Half of the load instructions are followed immediately by a dependent instruction, and 75% of branches are predicted correctly. What is the average CPI of this program on this processor?

Answer: [3 Points] The CPI would be 1, plus an extra cycle for loads followed by a dependent instruction (15% of instructions), plus an extra two cycles for mis-predicted branches (25% of 20%, or 5% of instructions). Thus, the CPI is 1 + (1 × 0.15) + (2 × 0.05) = 1.25.

(b) The scalar pipeline in the lab assignment had no structural hazards. What two aspects of the design were responsible for avoiding such structural hazards?

Answer: [2 points] (1) Enough resources. Replicating the resource so that both instructions can proceed in parallel. The pipeline had separate memory ports for instruction fetch and data accesses. (2) Pipeline organization. The 5-stage pipeline from the lab assignment avoids structural hazards on the register file write port by delaying all register writes until the W stage (ALU instructions could write the register file in M, but this could cause a structural hazard with a preceding load instruction trying to write the register file in W).

(c) The superscalar pipeline from the lab did have a structural hazard. What was the cause of the hazard and how was it handled in the design?

Answer: [2 points] There was only one load/store port, so if the pipeline attempted two memory operations in the same cycle, the second of the two was forced to stall to avoid the structural hazard.

(d) The maximum speedup achievable by pipelining a single-cycle datapath into five stages is 5x. Give three distinct reasons why this ideal speedup is generally not achieved in practice:

Answer: [3 points] Any three of: (1) Data hazards (load-to-use data dependencies) will hurt CPI. (2) Control hazards (branch mis-predictions) will hurt CPI. (3) The stages won't be divided exactly evenly, preventing a 5x clock improvement. (4) The pipeline register overhead will also prevent a 5x improvement in clock.

(e) The maximum speedup achievable by converting a scalar single-cycle datapath into a two-issue superscalar single-cycle processor is 2x. Give two distinct reasons why this ideal speedup is generally not achieved in practice:

Answer: [2 points] (1) Adding the additional control logic and register file ports could hurt the clock frequency. (2) Not every pair of instructions would be independent, so sometimes the pipeline would only execute one instruction (rather than the peak throughput of two instructions per cycle).
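The CPI arithmetic in question 1(a) can be checked with a short script. This is only a sketch: the helper name `average_cpi` and the variable names are illustrative and not part of the exam; the percentages and penalties are the ones given in the question.

```python
def average_cpi(base_cpi, events):
    """Return base CPI plus the sum of (fraction of instructions * penalty cycles)."""
    return base_cpi + sum(fraction * penalty for fraction, penalty in events)

# 30% loads, half of them followed immediately by a dependent instruction.
load_use_fraction = 0.30 * 0.50
# 20% branches, 25% of them mis-predicted.
mispredict_fraction = 0.20 * 0.25

cpi = average_cpi(1.0, [
    (load_use_fraction, 1),    # single-cycle load-use penalty
    (mispredict_fraction, 2),  # two-cycle misprediction penalty
])
print(round(cpi, 2))
```

The same helper covers any stall model in which each event adds a fixed number of cycles to a fixed fraction of instructions.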
2. [ 8 Points ] Branch Prediction Conflicts and Tagged Predictors. You're a microprocessor designer, and your simulations of an important workload indicate that branch instructions at the following two 32-bit addresses (in binary) are executed frequently:

Address A:
Address B:

Answer: Note: This question had by far the lowest score on the exam (average of less than 2 points out of 8). This question was obviously much more difficult than I anticipated, perhaps because the predictor in the lab assignment was a BTB-style predictor versus the branch-direction predictor discussed in this question.

(a) The above two instructions are likely to conflict (hash to the same entry) in a branch predictor. For a simple bimodal predictor of two-bit saturating counters, how many entries must the predictor have to prevent these two branches from interfering (conflicting) with each other? How many total bytes must the predictor be?

Answer: These two addresses have the lower 15 bits in common. Thus, the index would need to be at least 16 bits to map these two addresses to different entries in the branch predictor. That would require a 64K-entry predictor, which at two bits per entry is 16KB.

(b) You have a brilliant idea: Why not create a set-associative branch direction predictor? Your simulations indicate that a predictor with just 2048 entries (512 bytes) would be sufficient if it wasn't for these two trouble branches. Consider a two-way set-associative predictor that uses a straightforward tagging strategy (one tag for each two-bit counter, instructions have 32-bit addresses). How large (in KBs) is a two-way set-associative 2048-entry tagged predictor?

Answer: The 2048-entry predictor has 1024 sets, each with two ways, so the lower 10 bits are used to index into the table. Thus, the tags are each 32 - 10 = 22 bits. Adding the two-bit counter, each entry is 24 bits or 3 bytes. 2K entries at 3 bytes each is 6KB. (Actually, unless random replacement was used, an LRU bit is also needed, which would increase the size of each entry to 25 bits, for 6.25KB.)

(c) You have another brilliant idea: Because this is just a predictor, it can be wrong, so it doesn't actually need the full tags, just enough of a tag to avoid this particular conflict. With this new insight, (1) how large should the tags be and (2) what is the total size in KBs of this predictor?

Answer: To avoid the conflict, we only need to distinguish the lower 16 bits of the address. 10 bits are already covered by indexing into the 2K-entry predictor (1024 sets). Thus, only 6 tag bits are needed. Six bits plus the 2-bit saturating counter is a total of 8 bits (1 byte) per entry. 2K entries at 1 byte is 2KB. (As above, an LRU bit would add a bit to each entry (9 bits) for a total of 2.25KB.)

(d) How might a predictor capture both the conflict-mitigating benefits of a tagged set-associative predictor and most of the area efficiency of a tag-less predictor?

Answer: Some sort of hybrid predictor. Perhaps something similar in concept to how a small victim buffer can mitigate conflicts in direct-mapped caches: form a hybrid of a small set-associative predictor and a large tag-less predictor. Access the predictors in parallel; if the tag matches, use the prediction from the set-associative predictor. Otherwise, use the prediction from the large tag-less predictor. To make the most efficient use of the small predictor, insert a new entry (new tag) into it only when a branch is mis-predicted (if the large table is getting a branch correct, there is no need to put that branch in the small predictor).
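The predictor sizes in question 2 all follow from entries times bits per entry. The sketch below reproduces them; the function `predictor_bytes` and its parameter names are illustrative assumptions, and one LRU bit per entry is used only because that is how the answer above counts it.

```python
KB = 1024

def predictor_bytes(entries, counter_bits=2, tag_bits=0, lru_bits=0):
    """Total storage in bytes for a table of saturating counters with optional tag/LRU bits."""
    return entries * (counter_bits + tag_bits + lru_bits) / 8

# (a) Untagged bimodal predictor indexed by 16 address bits.
bimodal_kb = predictor_bytes(2 ** 16) / KB

# (b) 2048 entries, two ways -> 1024 sets -> 10 index bits, full tags from a 32-bit address.
index_bits = (2048 // 2).bit_length() - 1
full_tag_bits = 32 - index_bits
full_tag_kb = predictor_bytes(2048, tag_bits=full_tag_bits) / KB
full_tag_lru_kb = predictor_bytes(2048, tag_bits=full_tag_bits, lru_bits=1) / KB

# (c) Partial tags: only the low 16 address bits need to be distinguished.
partial_tag_bits = 16 - index_bits
partial_tag_kb = predictor_bytes(2048, tag_bits=partial_tag_bits) / KB
partial_tag_lru_kb = predictor_bytes(2048, tag_bits=partial_tag_bits, lru_bits=1) / KB

print(bimodal_kb, full_tag_kb, full_tag_lru_kb, partial_tag_kb, partial_tag_lru_kb)
```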
3. [ 10 Points ] Caching. Consider two different cache configurations for an 8-bit processor. Both caches have two 16-byte blocks (for a total capacity of 32 bytes), but one is direct-mapped and the other is two-way set associative and uses the least-recently used (LRU) replacement algorithm. All caches begin empty.

(a) For the direct-mapped cache, how many bits are in the tag, index, and offset?

Answer: [2 Points] tag: 3, index: 1, offset: 4

(b) For the two-way set associative cache, how many bits are in the tag, index, and offset?

Answer: [2 Points] tag: 4, index: 0, offset: 4

(c) Give a short sequence (4 or fewer) of addresses in which the set-associative cache has a better hit rate than the direct-mapped cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence: Answer: [3 Points] , ,
Miss rate on direct-mapped: Answer: 100%
Miss rate on set-associative: Answer: 66%

(d) Give a short sequence (4 or fewer) of addresses in which the direct-mapped cache has a better hit rate than the set-associative cache. (Give the addresses as an 8-digit binary number.) Also give the miss rate for each.

Sequence: Answer: [3 Points] , , ,
Miss rate on direct-mapped: Answer: 75%
Miss rate on set-associative: Answer: 100%
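The behavior asked about in question 3 can be explored with a tiny LRU cache model. This is a minimal sketch, not the exam's solution: the function `miss_count` is hypothetical, and the two address sequences below are illustrative choices that happen to reproduce the stated miss rates; they are not the sequences from the original answer key, which did not survive in this transcription.

```python
from collections import OrderedDict

def miss_count(addresses, num_sets, ways, block_bytes=16):
    """Count misses for a cache with LRU replacement within each set."""
    sets = [OrderedDict() for _ in range(num_sets)]
    misses = 0
    for addr in addresses:
        block = addr // block_bytes           # strip the 4 offset bits
        index, tag = block % num_sets, block // num_sets
        lines = sets[index]
        if tag in lines:
            lines.move_to_end(tag)            # hit: mark as most recently used
        else:
            misses += 1
            if len(lines) == ways:
                lines.popitem(last=False)     # evict the least recently used line
            lines[tag] = True
    return misses

# Assumed sequence for (c): two blocks that collide in the direct-mapped cache.
seq_c = [0b00000000, 0b00100000, 0b00000000]
# Assumed sequence for (d): three distinct blocks, then a return to the first.
seq_d = [0b00000000, 0b00010000, 0b00110000, 0b00000000]

for seq in (seq_c, seq_d):
    direct_mapped = miss_count(seq, num_sets=2, ways=1)
    set_associative = miss_count(seq, num_sets=1, ways=2)
    print(direct_mapped, set_associative, "of", len(seq))
```

In the second sequence the direct-mapped cache wins because the block that LRU evicts from the two-way cache is exactly the one that is reused next, while the direct-mapped cache keeps it in its own set.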
4. [ 11 Points ] Memory Hierarchy Calculations.

(a) Consider a simple core with a single level of cache that achieves 1 CPI when every memory operation hits in the cache. The miss penalty to the main memory is 150 cycles. For a specific workload, the miss rate is 1 miss per every 50 instructions (or 0.02 misses per instruction). What is the CPI of this chip for this workload?

Answer: [2 points] 1 + (0.02 × 150) = 4 CPI.

(b) Consider adding a second-level cache in between the first-level cache and memory. For this workload, a miss in the second-level cache occurs only once every 250 instructions (0.004 misses per instruction) and has the same latency as accessing memory as above (150 cycles). The first-level cache is unchanged. Calculate under what conditions adding this cache will at least double the performance (halve the CPI).

Answer: [3 points] As the previous CPI was 4, we need to target a CPI of 2. Thus, we can only spend 1 CPI in the memory system. The unknown parameter is the access latency L of the L2 cache: (0.02 × L) + (0.004 × 150) = 1. Solving for L gives us L = 20. Thus, the access latency of the second-level cache must be 20 cycles or faster.

Now let's revisit this design in the context of energy. Assume that when all memory operations hit in the cache, the core consumes 1 nano-joule per instruction (njpi). At 1 CPI at 1 GHz, that is equal to expending 1 Watt of power. That is, njpi * IPC * (clock frequency in GHz) = Watts. Hint: apply what you have learned about calculating CPIs.

(c) Of course, the memory hierarchy also consumes energy. Using the same miss rates (1 miss per every 50 instructions), if a main memory access consumes 100 nJ, calculate the nano-joules per instruction of the single-level cache hierarchy.

Answer: [1 point] The calculation is largely the same as doing the CPI calculation: 1 + (0.02 × 100) = 3 njpi.

(d) Perform a similar calculation (to part (c)) for the two-level hierarchy, assuming the energy to access the second-level cache is 30 nJ and it incurs one miss every 250 instructions.

Answer: [1 point] This calculation is also largely the same as doing a CPI calculation: 1 + (0.02 × 30) + (0.004 × 100) = 2 njpi.

(e) Did adding the second level of cache increase or decrease the energy per instruction?

Answer: [1 point] It decreased the energy per instruction.

(f) Now consider energy per second, also known as power and measured in Watts. Assuming that adding the second level of cache does halve the CPI, does it increase or decrease the power dissipation of the chip? Why?

Answer: [1 point] It increased the energy per second (power). Using the formula above at 1 GHz, the single-level design dissipates 3 × (1/4) = 0.75 Watts, whereas the two-level design dissipates 2 × (1/2) = 1 Watt.

(g) Based on these two metrics, would you expect battery life of a mobile device using this chip to increase or decrease? Explain.

Answer: [2 points] If the user is doing the same amount of computation, the battery life of the device is expected to improve (increase) because of the improved energy efficiency. However, because the chip got 2x faster but only 1.5x more energy efficient, the overall power actually went up. Thus, if the user increases the amount of computation they do (which is reasonable, as the computation is faster), then the battery life will actually get worse (decrease).
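The CPI, energy, and power figures in question 4 all use the same "base plus rate times cost" pattern, which the sketch below makes explicit. The helper `per_instruction_cost` and the variable names are illustrative assumptions; the miss rates, latencies, and energies are the ones given in the question, and the 1 GHz clock is the one used in the question's own example.

```python
def per_instruction_cost(base, events):
    """Base cost per instruction plus the sum of (events per instruction * cost per event)."""
    return base + sum(rate * cost for rate, cost in events)

L1_MISS_RATE = 0.02   # misses per instruction (1 in 50)
L2_MISS_RATE = 0.004  # misses per instruction (1 in 250)

# (a) Single-level hierarchy: every L1 miss pays the 150-cycle memory latency.
cpi_single = per_instruction_cost(1, [(L1_MISS_RATE, 150)])

# (b) Largest L2 latency that leaves only 1 CPI for the memory system.
max_l2_latency = (1 - L2_MISS_RATE * 150) / L1_MISS_RATE
cpi_two_level = per_instruction_cost(1, [(L1_MISS_RATE, max_l2_latency), (L2_MISS_RATE, 150)])

# (c), (d) Energy per instruction in nanojoules.
njpi_single = per_instruction_cost(1, [(L1_MISS_RATE, 100)])
njpi_two_level = per_instruction_cost(1, [(L1_MISS_RATE, 30), (L2_MISS_RATE, 100)])

# (f) Watts = njpi * IPC * clock frequency in GHz, with IPC = 1 / CPI.
CLOCK_GHZ = 1.0
watts_single = njpi_single / cpi_single * CLOCK_GHZ
watts_two_level = njpi_two_level / cpi_two_level * CLOCK_GHZ

print(round(cpi_single, 2), round(max_l2_latency, 2), round(cpi_two_level, 2))
print(round(njpi_single, 2), round(njpi_two_level, 2))
print(round(watts_single, 2), round(watts_two_level, 2))
```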
5. [ 11 Points ] Scheduling. When the code below is executed, assume the loop iterates thousands of times, and thus you should ignore any startup or initialization effects.

    LOOP: MUL V <- X * X
          ADD Y <- Y + 1
          STORE 1 -> [Y+V]
          BRANCH-IF (V == 0), EXIT
          LOAD X <- [X]
          BRANCH LOOP

LOAD and MUL have a latency of more than one cycle; all other operations have a latency of one cycle. The pipeline is fully bypassed, has no structural hazards, and all execution units are fully pipelined. The hardware's branch direction and branch target prediction are both perfect. The pipeline is non-superscalar, so it can execute at most one instruction per cycle. Initially, consider an in-order pipeline.

(a) Assuming the latency of MUL and LOAD are both two cycles (one cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [2 points] As there are six instructions per iteration, it should take at least six cycles plus any stalls. As there is an independent instruction after both the MUL and LOAD, all of the stall cycles are hidden. Thus, 6 x 1000 = 6000 cycles.

(b) Assuming the latency of MUL and LOAD are both increased to four cycles (three cycle use penalty), approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [3 points] With a three-cycle use penalty, the pipeline must add two stall cycles after the MUL. In addition, there is a cross-loop dependency between the LOAD and the use of X, adding two more stall cycles. Thus, 6 + 2 + 2 = 10 cycles per iteration, or 10,000 cycles for 1000 iterations.

(c) Static scheduling does not improve the performance of this loop. Why not? Give two specific limitations that combine to prevent static scheduling from being effective in this example.

Answer: [4 points] Here we were looking for two specific limitations. First, the LOAD cannot be moved before the BRANCH-IF, as the LOAD might access invalid memory: when V is zero, X would be zero, so the LOAD would be a NULL-reference, which would raise a page fault and cause the operating system to kill the program. This is the scheduling scope problem discussed in class. Second, the LOAD cannot be moved before the STORE, as the store might write to the same location. This is unlikely, but the compiler needs to generate correct code for all cases. This is the memory aliasing problem discussed in class. The answer "not enough registers" is not a correct answer for this example, as moving up the LOAD does not increase the number of registers needed.

(d) Now consider a dynamically scheduled (out-of-order) pipeline that supports a large number of in-flight instructions (100+). The pipeline is still non-superscalar, so it can execute a maximum of one instruction per cycle. Assuming the same four-cycle latency (three cycle use penalty) for MUL and LOAD instructions, approximately how many cycles would it take to execute this loop 1000 times (to the nearest 1000 cycles)? Why?

Answer: [2 points] 6 cycles per iteration, or 6000 cycles. Why? Dynamic scheduling can move the LOAD up past the BRANCH and STORE instructions, fully hiding its execution latency. This also puts another instruction between the MUL and its use. If only a single iteration was considered, this would take 7 cycles per iteration. However, dynamically scheduled processors can execute instructions across iterations. The first few iterations will incur this execution stall, but the fetch engine will continue fetching and renaming one instruction per cycle, increasing the number of instructions in the out-of-order scheduling window. Eventually, there will be enough instructions in the window to find independent instructions. Thus, the remaining stall will be filled in, resulting in 6 cycles per iteration (the maximum sustainable for a scalar pipeline). In fact, a superscalar pipeline could sustain one iteration every four cycles, because the critical path of the computation is only the LOAD feeding the LOAD in the next iteration.
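The in-order results in parts (a) and (b) can be reproduced with a simplified issue model. This is a sketch under stated assumptions: one instruction issues per cycle in program order, an instruction waits until all of its source registers are ready, bypassing is perfect, and branches cost nothing extra. The register dependences in `LOOP` are read off the listing above; the function name and data layout are illustrative, and the model does not attempt the out-of-order case in part (d).

```python
# Each entry: (opcode, destination register or None, source registers).
LOOP = [
    ("MUL",       "V",  ["X"]),
    ("ADD",       "Y",  ["Y"]),
    ("STORE",     None, ["Y", "V"]),
    ("BRANCH-IF", None, ["V"]),
    ("LOAD",      "X",  ["X"]),
    ("BRANCH",    None, []),
]

def cycles_per_iteration(long_latency, iterations=50):
    """Steady-state cycles per loop iteration on an in-order, single-issue pipeline."""
    ready = {}          # register -> first cycle in which its value can be consumed
    next_slot = 0       # earliest cycle in which the next instruction may issue
    iteration_starts = []
    for _ in range(iterations):
        for position, (opcode, dest, sources) in enumerate(LOOP):
            # Stall until the issue slot is free and every source operand is ready.
            issue = max([next_slot] + [ready.get(reg, 0) for reg in sources])
            if position == 0:
                iteration_starts.append(issue)
            if dest is not None:
                latency = long_latency if opcode in ("MUL", "LOAD") else 1
                ready[dest] = issue + latency
            next_slot = issue + 1
    # Distance between the last two iteration starts ignores startup effects.
    return iteration_starts[-1] - iteration_starts[-2]

print(cycles_per_iteration(2))  # two-cycle MUL and LOAD
print(cycles_per_iteration(4))  # four-cycle MUL and LOAD
```

With four-cycle latencies the model stalls twice before the STORE (waiting for V) and twice before the next MUL (waiting for X), which are the two pairs of stall cycles described in the answer to part (b).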
6. [ 10 Points ] Thread-Level Parallelism and Multicore.

(a) What is the primary disadvantage of coarse-grained locking?

Answer: It limits parallelism because of lock contention.

(b) What are two disadvantages of employing fine-grained locking?

Answer: [2 points] (1) More difficult programming and (2) each lock acquire and release adds runtime overhead.

(c) What new mechanism has recently been proposed to help ameliorate the locking granularity problem?

Answer: Transactional memory.

(d) What is the primary advantage of the MSI protocol over the simpler VI cache coherence protocol?

Answer: MSI allows multiple processors to share read access to a block, reducing the number of coherence-induced cache misses.

(e) What is the primary advantage of the MESI protocol over the simpler MSI cache coherence protocol?

Answer: MESI gives a processor a clean read/write copy of a block if no other processors are sharing the block. This avoids an upgrade miss on a subsequent write.

(f) What is false sharing? How can software help avoid false sharing?

Answer: [2 points] False sharing is when two processors are accessing different parts of the same cache block (and at least one of the processors is writing the block). Software can help by placing the data accessed by different processors in different cache blocks.

(g) What is a memory fence (also known as memory barrier)? Where are they typically used?

Answer: [2 points] Memory fences are instructions used to ensure ordering among instructions in systems with relaxed memory consistency models. They are used when writing synchronization operations, such as after a lock acquire and before a lock release.
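The false-sharing answer in part (f) comes down to which cache block each address falls in. The sketch below illustrates that test and the padding remedy; the function names are hypothetical and the 64-byte block size is an assumption chosen for illustration, not a value from the exam.

```python
def cache_block(address, block_bytes=64):
    """Block number that a byte address falls in."""
    return address // block_bytes

def may_false_share(addr_a, addr_b, block_bytes=64):
    """True if two distinct addresses live in the same cache block."""
    return addr_a != addr_b and cache_block(addr_a, block_bytes) == cache_block(addr_b, block_bytes)

def padded_offsets(count, item_bytes, block_bytes=64):
    """Offsets that start each item on its own cache block (item size rounded up to whole blocks)."""
    stride = -(-item_bytes // block_bytes) * block_bytes  # ceiling division, then scale
    return [i * stride for i in range(count)]

# Two 8-byte per-thread counters packed back to back share one block.
print(may_false_share(0, 8))

# After padding, each counter starts in a different block.
first, second = padded_offsets(2, item_bytes=8)
print(first, second, may_false_share(first, second))
```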
7. [ 16 Points ] Parallelism At All Levels. Throughout the semester we've explored parallelism at many different levels (from very fine-grained parallelism within an ALU to multiple cores on a chip). Give an example use of parallelism at five different levels of granularity. These should be big picture major approaches for using parallelism to extract significant performance (for example, each could provide a 4x or more performance improvement). For each of the examples, give: (i) the name or term, (ii) the specific reasons for (or benefits of) employing parallelism, and (iii) the primary disadvantage or challenge of exploiting parallelism at that level of granularity.

Answer:

(a) Arithmetic (carry-select addition) - reduces the latency of arithmetic operations at the small cost of some additional logic.

(b) Pipelining (multiple phases of instructions) - greatly increases the clock frequency (and thus throughput) of the datapath. The disadvantages are that (1) it increases the complexity of the design (for example, to handle stalls) and (2) it has diminishing benefits because of branch mispredictions and other pipeline stalls.

(c) Superscalar (multiple independent instructions in the same cycle) - up to a point, can multiply the performance of a processor. The advantage is that it can be done transparently to the programmer, but it does require a sophisticated hardware implementation (and it can hurt clock frequency and energy).

(d) Multiprocessors (multiple pipelines all executing concurrently) - adding more processor cores adds many more transistors, but has the potential to linearly increase performance. Its main disadvantages are (1) that it requires programmers to write parallel software and (2) few programs have enough inherent parallelism to achieve linear speedup on a large number of cores.

(e) Vectors/SIMD (one instruction that performs the same arithmetic operation on multiple data elements in parallel) - Vectors can increase performance, but they require programmer-inserted annotations.

As an aside, all of these forms of parallelism meet with diminishing returns beyond some point. That is why all of these forms of parallelism are employed, as exploiting any one of them alone isn't as good as using them all to a moderate degree.

Give a specific concrete instance of a system that exploits all of these forms of parallelism:

Answer: The XBox 360
Score Distribution
[Histogram: number of students versus score]
Mean: 46 points (59%)
Median: 46 points (59%)
High: 70 points (90%)
More informationReal-Time Monitoring Framework for Parallel Processes
International Journal of scientific research and management (IJSRM) Volume 3 Issue 6 Pages 3134-3138 2015 \ Website: www.ijsrm.in ISSN (e): 2321-3418 Real-Time Monitoring Framework for Parallel Processes
More informationPipelining Review and Its Limitations
Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic
More informationIBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr. 682197 Teruzzi Roberto matr. 682552 IBM CELL. Politecnico di Milano Como Campus