ECE 562 Advanced Computer Architecture Chapter 1-2 Sampling Questions. Based on our study, explain how to reduce each type of stalls?

Size: px
Start display at page:

Download "ECE 562 Advanced Computer Architecture Chapter 1-2 Sampling Questions. Based on our study, explain how to reduce each type of stalls?"

Transcription

1 ECE 562 Advanced Computer Architecture Chapter 1-2 Sampling Questions 1. Pipeline CPI =Ideal pipeline CPI+ Structural Stalls + RAW Stalls + WAR Stalls + WAW stalls + Control Stalls Based on our study, explain how to reduce each type of stalls? 2. Your boss is trying to decide between a single-processor system and a dual-processor system. The Table below gives the performance on two sets of benchmarks a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memorycentric computations, and 70% of its time on processor-centric computations. How much speedup do you anticipate getting if you suggest your boss to move from using a Pentium to an Athlon 64 X on a CPU-intensive application suite? 3. Consider the following code segment. Identify data dependencies by marking with arrows and labeling with names (RAW, WAR, WAW). i: R4 R0 + R2 j: R8 R0 * R4 k: R4 R4 - R2 Simulate the execution of the code using the basic 5-stage pipeline (F, D, E, M, & W) with a single memory port and without forwarding nor cycle splitting. Add/subtract takes 2 cycles, and Multiply takes 3 cycles. The first instruction has done for you. Extend the table as needed.

2 Cycle #: i: R4 R0 + R2 F D E E M W j: R8 R0 * R4 k: R4 R4 - R2 Total number of cycles to complete the code:? 5. Imagine that your company is trying to decide between a single-processor system and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks a memory benchmark and a processor benchmark. You know that your application will spend 30% of its time on memory-centric computations, and 70% of its time on processor-centric computations. You are using a dual-core Athlon processor, and you are choosing between two ways to implement the same algorithm. The first is to create a large lookup table to store 4K words of data. When you need the result, you look up the answer. The second method would be to calculate the result in a very tight loop. What are the advantages and disadvantages of each implementation? Thus, at what situation (when there are 88.89% memory operations and 11.1% processor operations), the performance of Pentium equals Pentium D 820 when there are 88.89% memory operations and 11.1% processor operations. 6. Think about what latency numbers really mean they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a producer followed by a consumer ) will execute correctly. But not all instruction pairs have a Producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data dependences and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint: An instruction with latency +2 needs 2 <stall> cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency implies 1 stall cycle; latency 1 + N has N extra stall cycles.) Loop: LD F2, 0(Rx) Memory LD +3 I0: MULTD F2, F0, F2 Memory SD +1 I1: DIVD F8, F2, F0 Integer ADD, SUB +0 I2: LD F4, 0(Ry) Branches +1 I3: ADDD F4, F0, F4 ADDD +2 I4: ADDD F10, F8, F2 MULTD +4 I5: SD F4, 0(Ry) DIVD +10 I6: ADDI Rx, Rx, #8 I7: ADDI Ry, Ry, #8 Latencies beyond Single Cycle I8: SUB R20, R4, Rx I9: BNZ R20, Loop

3 Sample questions in the textbook: Chapter 1: Case Study 1: Chip Fabrication Cost Fabrication Cost Fabrication Yield Defect Tolerance through Redundancy There are many factors involved in the price of a computer chip. New, smaller technology gives a boost in performance and a drop in required chip area. In the smaller technology, one can either keep the small area or place more hardware on the chip in order to get more functionality. In this case study, we explore how different design decisions involving fabrication technology, area, and redundancy affect the cost of chips. 1.1 [10/10/Discussion] <1.5, 1.5> Figure 1.22 gives the relevant chip statistics that influence the cost of several current chips. In the next few exercises, you will be exploring the trade-offs involved between the AMD Opteron, a single-chip processor, and the Sun Niagara, an 8-core chip. a. [10] <1.5> What is the yield for the AMD Opteron? b. [10] <1.5> What is the yield for an 8-core Sun Niagara processor? c. [Discussion] <1.4, 1.6> Why does the Sun Niagara have a worse yield than the AMD Opteron, even though they have the same defect rate? 1.3 [20/20/10/10/20] <1.7> Your colleague at Sun suggests that, since the yield is so poor, it might make sense to sell two sets of chips, one with 8 working processors and one with 6 working processors. We will solve this exercise by viewing the yield as a probability of no defects occurring in a certain area given the defect rate. For the Niagara, calculate probabilities based on each Niagara core separately (this may not be entirely accurate, since the yield equation is based on empirical evidence rather than a mathematical calculation relating the probabilities of finding errors in different portions of the chip). a. [20] <1.7> Using the yield equation for the defect rate above, what is the probability that a defect will occur on a single Niagara core (assuming the chip is divided evenly between the cores) in an 8-core chip? b. [20] <1.7> What is the probability that a defect will occur on one or two cores (but not more than that)? c. [10] <1.7> What is the probability that a defect will occur on none of the cores? d. [10] <1.7> Given your answers to parts (b) and (c), what is the number of 6-core chips you will sell for every 8-core chip? e. [20] <1.7> If you sell your 8-core chips for $150 each, the 6-core chips for $100 each, the cost per die sold is $80, your research and development budget was $200 million, and testing itself costs $1.50 per chip, how many processors would you need to sell in order to recoup costs? Case Study 2: Power Consumption in Computer Systems Amdahl s Law Redundancy MTTF Power Consumption Power consumption in modern systems is dependent on a variety of factors, including the chip clock frequency, efficiency, the disk drive speed, disk drive utilization, and DRAM. The following exercises explore the impact on power that different design decisions and/or use scenarios have.

4 1.4 [20/10/20] <1.6> Figure 1.23 presents the power consumption of several computer system components. In this exercise, we will explore how the hard drive affects power consumption for the system. a. [20] <1.6> Assuming the maximum load for each component, and a power supply efficiency of 70%, what wattage must the server s power supply deliver to a system with a Sun Niagara 8-core chip, 2 GB 184-pin Kingston DRAM, and two 7200 rpm hard drives? b. [10] <1.6> How much power will the 7200 rpm disk drive consume if it is idle rougly 40% of the time? c. [20] <1.6> Assume that rpm is the only factor in how long a disk is not idle (which is an oversimplification of disk performance). In other words, assume that for the same set of requests, a 5400 rpm disk will require twice as much time to read data as a 10,800 rpm disk. What percentage of the time would the 5400 rpm disk drive be idle to perform the same transactions as in part (b)? 1.6 [10/10/Discussion] <1.2, 1.9> Figure 1.24 gives a comparison of power and performance for several benchmarks comparing two servers: Sun Fire T2000 (which uses Niagara) and IBM x346 (using Intel Xeon processors). a. [10] <1.9> Calculate the performance/power ratio for each processor on each benchmark. b. [10] <1.9> If power is your main concern, which would you choose? c. [Discussion] <1.2> For the database benchmarks, the cheaper the system, the lower cost per database operation the system is. This is counterintuitive: larger systems have more throughput, so one might think that buying a larger system would be a larger absolute cost, but lower per operation cost. Since this is true, why do any larger server farms buy expensive servers? (Hint: Look at exercise 1.4 for some reasons.) Case Study 3: The Cost of Reliability (and Failure) in Web Servers TPCC Reliability of Web Servers MTTF This set of exercises deals with the cost of not having reliable Web servers. The data is in two sets: one gives various statistics for Gap.com, which was down for maintenance for two weeks in 2005 [AP 2005]. The other is for Amazon.com, which was not down, but has better statistics on high-load sales days. The exercises combine the two data sets and require estimating the economic cost to the shutdown. 1.9 [10/10] <1.8> The main reliability measure is MTTF. We will now look at different systems and how design decisions affect their reliability. Refer to Figure 1.25 for company statistics. a. [10] <1.8> We have a single processor with an FIT of 100. What is the MTTF for this system? b. [10] <1.8> If it takes 1 day to get the system running again, what is the availability of the system?

5 1.10 [20] <1.8> Imagine that the government, to cut costs, is going to build a supercomputer out of the cheap processor system in Exercise 1.9 rather than a specialpurpose reliable system. What is the MTTF for a system with 1000 processors? Assume that if one fails, they all fail. Case Study 4: Performance Arithmetic Mean Geometric Mean Parallelism Amdahl s Law Weighted Averages In this set of exercises, you are to make sense of Figure 1.26, which presents the performance of selected processors and a fictional one (Processor X), as reported by For each system, two benchmarks were run. One benchmark exercised the memory hierarchy, giving an indication of the speed of the memory for that system. The other benchmark, Dhrystone, is a CPU-intensive benchmark that does not exercise the memory system. Both benchmarks are displayed in order to distill the effects that different design decisions have on memory and CPU performance [10/10/20] <1.9> Imagine that your company is trying to decide between a single-processor system and a dual-processor system. Figure 1.26 gives the performance on two sets of benchmarks a memory benchmark and a processor benchmark. You know that your application will spend 40% of its time on memory-centric computations, and 60% of its time on processor-centric computations. a. [10] <1.9> Calculate the weighted execution time of the benchmarks. b. [10] <1.9> How much speedup do you anticipate getting if you move from using a Pentium to an Athlon 64 X on a CPU-intensive application suite? c. [20] <1.9> At what ratio of memory to processor computation would the performance of the Pentium be equal to the Pentium D 820? 1.14 [10/10/20/20] <1.10> Your company has just bought a new dual Pentium processor, and you have been tasked with optimizing your software for this processor. You will run two applications on this dual Pentium, but the resource requirements are not equal. The first application needs 80% of the resources, and the other only 20% of the resources. a. [10] <1.10> Given that 40% of the first application is parallelizable, how much speedup would you achieve with that application if run in isolation? b. [10] <1.10> Given that 99% of the second application is parallelizable, how much speedup would this application observe if run in isolation? c. [20] <1.10> Given that 40% of the first application is parallelizable, how much overall system speedup would you observe if you parallelized it?

6 d. [20] <1.10> Given that 99% of the second application is parallelizable, how much overall system speedup would you get? Chapter 2: Case Study 1: Exploring the Impact of Microarchitectural Techniques Basic Instruction Scheduling, Reordering, Dispatch Multiple Issue and Hazards Register Renaming Out-of-Order and Speculative Execution Where to Spend Out-of-Order Resources You are tasked with designing a new processor microarchitecture, and you are trying to figure out how best to allocate your hardware resources. Which of the hardware and software techniques you learned in Chapter 2 should you apply? You have a list of latencies for the functional units and for memory, as well as some representative code. Your boss has been somewhat vague about the performance requirements of your new design, but you know from experience that, all else being equal, faster is usually better. Start with the basics. Figure 2.35 provides a sequence of instructions and list of latencies. 2.1 [10] <1.8, 2.1, 2.2> What would be the baseline performance (in cycles, per loop iteration) of the code sequence in Figure 2.35 if no new instruction execution could be initiated until the previous instruction execution had completed? Ignore front-end fetch and decode. Assume for now that execution does not stall for lack of the next instruction, but only one instruction/cycle can be issued. Assume the branch is taken, and that there is a 1 cycle branch delay slot. 2.2 [10] <1.8, 2.1, 2.2> Think about what latency numbers really mean they indicate the number of cycles a given function requires to produce its output, nothing more. If the overall pipeline stalls for the latency cycles of each functional unit, then you are at least guaranteed that any pair of back-to-back instructions (a producer followed by a consumer ) will execute correctly. But not all instruction pairs have a producer/consumer relationship. Sometimes two adjacent instructions have nothing to do with each other. How many cycles would the loop body in the code sequence in Figure 2.35 require if the pipeline detected true data dependences and only stalled on those, rather than blindly stalling everything just because one functional unit is busy? Show the code with <stall> inserted where necessary to accommodate stated latencies. (Hint: An instruction with latency +2 needs 2 <stall> cycles to be inserted into the code sequence. Think of it this way: a 1-cycle instruction has latency 1 + 0, meaning zero extra wait states. So latency implies 1 stall cycle; latency 1 + N has N extra stall cycles.)

7 2.3 [15] <2.6, 2.7> Consider a multiple-issue design. Suppose you have two execution pipelines, each capable of beginning execution of one instruction per cycle, and enough fetch/decode bandwidth in the front end so that it will not stall your execution. Assume results can be immediately forwarded from one execution unit to another, or to itself. Further assume that the only reason an execution pipeline would stall is to observe a true data dependence. Now how many cycles does the loop require? 2.4 [10] <2.6, 2.7> In the multiple-issue design of Exercise 2.3, you may have recognized some subtle issues. Even though the two pipelines have the exact same instruction repertoire, they are not identical nor interchangeable, because there is an implicit ordering between them that must reflect the ordering of the instructions in the original program. If instruction N + 1 begins execution in Execution Pipe 1 at the same time that instruction N begins in Pipe 0, and N + 1 happens to require a shorter execution latency than N, then N + 1 will complete before N (even though program ordering would have implied otherwise). Recite at least two reasons why that could be hazardous and will require special considerations in the microarchitecture. Give an example of two instructions from the code in Figure 2.35 that demonstrate this hazard. 2.5 [20] <2.7> Reorder the instructions to improve performance of the code in Figure Assume the two-pipe machine in Exercise 2.3, and that the out-of-order completion issues of Exercise 2.4 have been dealt with successfully. Just worry about observing true data dependences and functional unit latencies for now. How many cycles does your reordered code take? 2.6 [10/10] <2.1, 2.2> Every cycle that does not initiate a new operation in a pipe is a lost opportunity, in the sense that your hardware is not living up to its potential. a. [10] <2.1, 2.2> In your reordered code from Exercise 2.5, what fraction of all cycles, counting both pipes, were wasted (did not initiate a new op)? b. [10] <2.1, 2.2> Loop unrolling is one standard compiler technique for finding more parallelism in code, in order to minimize the lost opportunities for performance. c. Hand-unroll two iterations of the loop in your reordered code from Exercise 2.5. What speedup did you obtain? (For this exercise, just color the N + 1 iteration s instructions green to distinguish them from the Nth iteration s; if you were actually unrolling the loop you would have to reassign registers to prevent collisions between the iterations.) 2.8 [20] <2.4> Exercise 2.7 explored simple register renaming: when the hardware register renamer sees a source register, it substitutes the destination T register of the last instruction to have targeted that source register. When the rename table sees a destination register, it substitutes the next available T for it. But superscalar designs need to handle multiple instructions per clock cycle at every stage in the machine, including the register renaming. A simple scalar processor would therefore look up both src register mappings for each instruction, and allocate a new destination mapping per clock cycle. Superscalar processors must be able to do that as well, but they must also ensure that any dest-to-src relationships between the two concurrent instructions are handled correctly. Consider the sample code sequence in Figure Assume that we would like to simultaneously rename the first two instructions. Further assume that the next two available T registers to be used are known at the beginning of the clock cycle in which these two instructions are being renamed. Conceptually, what we want is for the first instruction to do its rename table lookups, and then update the table per its destination s T register. Then the second instruction would do exactly the same thing, and any interinstruction dependency would thereby be handled correctly. But there s not enough time to write that T register designation into the renaming table and then look it up again for the second instruction, all in the same clock cycle. That register substitution must instead be done live (in parallel with the register rename table update). Figure 2.39 shows a circuit diagram, using multiplexers and comparators, that will accomplish the necessary on-the-fly register renaming. Your task is to show the cycle-by-cycle state of the rename table for every instruction of the code. Assume the table starts out with every entry equal to its index (T0 = 0; T1 = 1,...) [10/10/10] <2.3> Assume a five-stage single-pipeline microarchitecture (fetch, decode, execute, memory, write back) and the code in Figure All ops are 1 cycle except LW and SW, which are 1 + 2

8 cycles, and branches, which are cycles. There is no forwarding. Show the phases of each instruction per clock cycle for one iteration of the loop. a. [10] <2.3> How many clock cycles per loop iteration are lost to branch overhead? b. [10] <2.3> Assume a static branch predictor, capable of recognizing a backwards branch in the decode stage. Now how many clock cycles are wasted on branch overhead? c. [10] <2.3> Assume a dynamic branch predictor. How many cycles are lost on a correct prediction? 2.12 [20/20/20/10/20] <2.4, 2.7, 2.10> Let s consider what dynamic scheduling might achieve here. Assume a microarchitecture as shown in Figure Assume that the ALUs can do all arithmetic ops (MULTD, DIVD, ADDD, ADDI, SUB) and branches, and that the Reservation Station (RS) can dispatch at most one operation to each functional unit per cycle (one op to each ALU plus one memory op to the LD/ST unit). a. [15] <2.4> Suppose all of the instructions from the sequence in Figure 2.35 are present in the RS, with no renaming having been done. Highlight any instructions in the code where register renaming would improve performance. Hint: Look for RAW and WAW hazards. Assume the same functional unit latencies as in Figure b. [20] <2.4> Suppose the register-renamed version of the code from part (a) is resident in the RS in clock cycle N, with latencies as given in Figure Show how the RS should dispatch these instructions out-oforder, clock by clock, to obtain optimal performance on this code. (Assume the same RS restrictions as in part (a). Also assume that results must be written into the RS before they re available for use; i.e., no bypassing.) How many clock cycles does the code sequence take? Case Study 2: Modeling a Branch Predictor Concept illustrated by this case study Modeling a Branch Predictor

9 Besides studying microarchitecture techniques, to really understand computer architecture you must also program computers. Getting your hands dirty by directly modeling various microarchitectural ideas is better yet. Write a C or Java program to model a 2,1 branch predictor. Your program will read a series of lines from a file named history.txt (available on the companion CD see Figure Figure 2.43). Each line of that file has three data items, separated by tabs. The first datum on each line is the address of the branch instruction in hex. The second datum is the branch target address in hex. The third datum is a 1 or a 0; 1 indicates a taken branch, and 0 indicates not taken. The total number of branches your model will consider is, of course, equal to the number of lines in the file. Assume a directmapped BTB, and don t worry about instruction lengths or alignment (i.e., if your BTB has four entries, then branch instructions at 0x0, 0x1, 0x2, and 0x3 will reside in those four entries, but a branch instruction at 0x4 will overwrite BTB[0]). For each line in the input file, your model will read the pair of data values, adjust the various tables per the branch predictor being modeled, and collect key performance statistics. The final output of your program will look like that shown in Figure Make the number of BTB entries in your model a command-line option [20/10/10/10/10/10/10] <2.3> Write a model of a simple four-state branch target buffer with 64 entries. a. [20] <2.3> What is the overall hit rate in the BTB (the fraction of times a branch was looked up in the BTB and found present)? b. [10] <2.3> What is the overall branch misprediction rate on a cold start (the fraction of times a branch was correctly predicted taken or not taken, regardless of whether that prediction belonged to the branch being predicted)? c. [10] <2.3> Find the most common branch. What was its contribution to the overall number of correct predictions? (Hint: Count the number of times that branch occurs in the history.txt file, then track how each instance of that branch fares within the BTB model.) d. [10] <2.3> How many capacity misses did your branch predictor suffer? e. [10] <2.3> What is the effect of a cold start versus a warm start? To find out, run the same input data set once to initialize the history table, and then again to collect the new set of statistics. f. [10] <2.3> Cold-start the BTB 4 more times, with BTB sizes 16, 32, and 64. Graph the resulting five misprediction rates. Also graph the five hit rates. g. [10] Submit the well-written, commented source code for your branch target buffer model.

Computer Architecture-I

Computer Architecture-I Computer Architecture-I 1. Die Yield is given by the formula, Assignment 1 Solution Die Yield = Wafer Yield x (1 + (Defects per unit area x Die Area)/a) -a Let us assume a wafer yield of 100% and a 4 for

More information

More on Pipelining and Pipelines in Real Machines CS 333 Fall 2006 Main Ideas Data Hazards RAW WAR WAW More pipeline stall reduction techniques Branch prediction» static» dynamic bimodal branch prediction

More information

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Course on: Advanced Computer Architectures INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER Prof. Cristina Silvano Politecnico di Milano cristina.silvano@polimi.it Prof. Silvano, Politecnico di Milano

More information

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches: Multiple-Issue Processors Pipelining can achieve CPI close to 1 Mechanisms for handling hazards Static or dynamic scheduling Static or dynamic branch handling Increase in transistor counts (Moore s Law):

More information

on an system with an infinite number of processors. Calculate the speedup of

on an system with an infinite number of processors. Calculate the speedup of 1. Amdahl s law Three enhancements with the following speedups are proposed for a new architecture: Speedup1 = 30 Speedup2 = 20 Speedup3 = 10 Only one enhancement is usable at a time. a) If enhancements

More information

VLIW Processors. VLIW Processors

VLIW Processors. VLIW Processors 1 VLIW Processors VLIW ( very long instruction word ) processors instructions are scheduled by the compiler a fixed number of operations are formatted as one big instruction (called a bundle) usually LIW

More information

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Quiz for Chapter 1 Computer Abstractions and Technology 3.10 Date: 3.10 Not all questions are of equal difficulty. Please review the entire quiz first and then budget your time carefully. Name: Course: Solutions in Red 1. [15 points] Consider two different implementations,

More information

Pipelining Review and Its Limitations

Pipelining Review and Its Limitations Pipelining Review and Its Limitations Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 16, 2010 Moscow Institute of Physics and Technology Agenda Review Instruction set architecture Basic

More information

Week 1 out-of-class notes, discussions and sample problems

Week 1 out-of-class notes, discussions and sample problems Week 1 out-of-class notes, discussions and sample problems Although we will primarily concentrate on RISC processors as found in some desktop/laptop computers, here we take a look at the varying types

More information

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern:

Q. Consider a dynamic instruction execution (an execution trace, in other words) that consists of repeats of code in this pattern: Pipelining HW Q. Can a MIPS SW instruction executing in a simple 5-stage pipelined implementation have a data dependency hazard of any type resulting in a nop bubble? If so, show an example; if not, prove

More information

Low Power AMD Athlon 64 and AMD Opteron Processors

Low Power AMD Athlon 64 and AMD Opteron Processors Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD

More information

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

More information

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? Inside the CPU how does the CPU work? what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored? some short, boring programs to illustrate the

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

WAR: Write After Read

WAR: Write After Read WAR: Write After Read write-after-read (WAR) = artificial (name) dependence add R1, R2, R3 sub R2, R4, R1 or R1, R6, R3 problem: add could use wrong value for R2 can t happen in vanilla pipeline (reads

More information

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1 Pipeline Hazards Structure hazard Data hazard Pipeline hazard: the major hurdle A hazard is a condition that prevents an instruction in the pipe from executing its next scheduled pipe stage Taxonomy of

More information

Computer Organization and Components

Computer Organization and Components Computer Organization and Components IS5, fall 25 Lecture : Pipelined Processors ssociate Professor, KTH Royal Institute of Technology ssistant Research ngineer, University of California, Berkeley Slides

More information

PROBLEMS #20,R0,R1 #$3A,R2,R4

PROBLEMS #20,R0,R1 #$3A,R2,R4 506 CHAPTER 8 PIPELINING (Corrisponde al cap. 11 - Introduzione al pipelining) PROBLEMS 8.1 Consider the following sequence of instructions Mul And #20,R0,R1 #3,R2,R3 #$3A,R2,R4 R0,R2,R5 In all instructions,

More information

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100)

EE282 Computer Architecture and Organization Midterm Exam February 13, 2001. (Total Time = 120 minutes, Total Points = 100) EE282 Computer Architecture and Organization Midterm Exam February 13, 2001 (Total Time = 120 minutes, Total Points = 100) Name: (please print) Wolfe - Solution In recognition of and in the spirit of the

More information

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations

Lecture: Pipelining Extensions. Topics: control hazards, multi-cycle instructions, pipelining equations Lecture: Pipelining Extensions Topics: control hazards, multi-cycle instructions, pipelining equations 1 Problem 6 Show the instruction occupying each stage in each cycle (with bypassing) if I1 is R1+R2

More information

Intel Pentium 4 Processor on 90nm Technology

Intel Pentium 4 Processor on 90nm Technology Intel Pentium 4 Processor on 90nm Technology Ronak Singhal August 24, 2004 Hot Chips 16 1 1 Agenda Netburst Microarchitecture Review Microarchitecture Features Hyper-Threading Technology SSE3 Intel Extended

More information

A Lab Course on Computer Architecture

A Lab Course on Computer Architecture A Lab Course on Computer Architecture Pedro López José Duato Depto. de Informática de Sistemas y Computadores Facultad de Informática Universidad Politécnica de Valencia Camino de Vera s/n, 46071 - Valencia,

More information

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes basic pipeline: single, in-order issue first extension: multiple issue (superscalar) second extension: scheduling instructions for more ILP option #1: dynamic scheduling (by the hardware) option #2: static

More information

Virtuoso and Database Scalability

Virtuoso and Database Scalability Virtuoso and Database Scalability By Orri Erling Table of Contents Abstract Metrics Results Transaction Throughput Initializing 40 warehouses Serial Read Test Conditions Analysis Working Set Effect of

More information

CS 6290 I/O and Storage. Milos Prvulovic

CS 6290 I/O and Storage. Milos Prvulovic CS 6290 I/O and Storage Milos Prvulovic Storage Systems I/O performance (bandwidth, latency) Bandwidth improving, but not as fast as CPU Latency improving very slowly Consequently, by Amdahl s Law: fraction

More information

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs.

Unit 4: Performance & Benchmarking. Performance Metrics. This Unit. CIS 501: Computer Architecture. Performance: Latency vs. This Unit CIS 501: Computer Architecture Unit 4: Performance & Benchmarking Metrics Latency and throughput Speedup Averaging CPU Performance Performance Pitfalls Slides'developed'by'Milo'Mar0n'&'Amir'Roth'at'the'University'of'Pennsylvania'

More information

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007

Multi-core architectures. Jernej Barbic 15-213, Spring 2007 May 3, 2007 Multi-core architectures Jernej Barbic 15-213, Spring 2007 May 3, 2007 1 Single-core computer 2 Single-core CPU chip the single core 3 Multi-core architectures This lecture is about a new trend in computer

More information

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of CS:APP Chapter 4 Computer Architecture Wrap-Up William J. Taffe Plymouth State University using the slides of Randal E. Bryant Carnegie Mellon University Overview Wrap-Up of PIPE Design Performance analysis

More information

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000. ILP Execution EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May 2000 Lecture #11: Wednesday, 3 May 2000 Lecturer: Ben Serebrin Scribe: Dean Liu ILP Execution

More information

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip. Lecture 11: Multi-Core and GPU Multi-core computers Multithreading GPUs General Purpose GPUs Zebo Peng, IDA, LiTH 1 Multi-Core System Integration of multiple processor cores on a single chip. To provide

More information

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27

Bindel, Spring 2010 Applications of Parallel Computers (CS 5220) Week 1: Wednesday, Jan 27 Logistics Week 1: Wednesday, Jan 27 Because of overcrowding, we will be changing to a new room on Monday (Snee 1120). Accounts on the class cluster (crocus.csuglab.cornell.edu) will be available next week.

More information

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX Overview CISC Developments Over Twenty Years Classic CISC design: Digital VAX VAXÕs RISC successor: PRISM/Alpha IntelÕs ubiquitous 80x86 architecture Ð 8086 through the Pentium Pro (P6) RJS 2/3/97 Philosophy

More information

Introduction to Cloud Computing

Introduction to Cloud Computing Introduction to Cloud Computing Parallel Processing I 15 319, spring 2010 7 th Lecture, Feb 2 nd Majd F. Sakr Lecture Motivation Concurrency and why? Different flavors of parallel computing Get the basic

More information

Generations of the computer. processors.

Generations of the computer. processors. . Piotr Gwizdała 1 Contents 1 st Generation 2 nd Generation 3 rd Generation 4 th Generation 5 th Generation 6 th Generation 7 th Generation 8 th Generation Dual Core generation Improves and actualizations

More information

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends

! Metrics! Latency and throughput. ! Reporting performance! Benchmarking and averaging. ! CPU performance equation & performance trends This Unit CIS 501 Computer Architecture! Metrics! Latency and throughput! Reporting performance! Benchmarking and averaging Unit 2: Performance! CPU performance equation & performance trends CIS 501 (Martin/Roth):

More information

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy

The Quest for Speed - Memory. Cache Memory. A Solution: Memory Hierarchy. Memory Hierarchy The Quest for Speed - Memory Cache Memory CSE 4, Spring 25 Computer Systems http://www.cs.washington.edu/4 If all memory accesses (IF/lw/sw) accessed main memory, programs would run 20 times slower And

More information

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers CS 4 Introduction to Compilers ndrew Myers Cornell University dministration Prelim tomorrow evening No class Wednesday P due in days Optional reading: Muchnick 7 Lecture : Instruction scheduling pr 0 Modern

More information

COMPUTER HARDWARE. Input- Output and Communication Memory Systems

COMPUTER HARDWARE. Input- Output and Communication Memory Systems COMPUTER HARDWARE Input- Output and Communication Memory Systems Computer I/O I/O devices commonly found in Computer systems Keyboards Displays Printers Magnetic Drives Compact disk read only memory (CD-ROM)

More information

Real-Time Monitoring Framework for Parallel Processes

Real-Time Monitoring Framework for Parallel Processes International Journal of scientific research and management (IJSRM) Volume 3 Issue 6 Pages 3134-3138 2015 \ Website: www.ijsrm.in ISSN (e): 2321-3418 Real-Time Monitoring Framework for Parallel Processes

More information

Design Cycle for Microprocessors

Design Cycle for Microprocessors Cycle for Microprocessors Raúl Martínez Intel Barcelona Research Center Cursos de Verano 2010 UCLM Intel Corporation, 2010 Agenda Introduction plan Architecture Microarchitecture Logic Silicon ramp Types

More information

COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59

COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59 COMP 303 MIPS Processor Design Project 4: MIPS Processor Due Date: 11 December 2009 23:59 Overview: In the first projects for COMP 303, you will design and implement a subset of the MIPS32 architecture

More information

EC 362 Problem Set #2

EC 362 Problem Set #2 EC 362 Problem Set #2 1) Using Single Precision IEEE 754, what is FF28 0000? 2) Suppose the fraction enhanced of a processor is 40% and the speedup of the enhancement was tenfold. What is the overall speedup?

More information

The Microarchitecture of Superscalar Processors

The Microarchitecture of Superscalar Processors The Microarchitecture of Superscalar Processors James E. Smith Department of Electrical and Computer Engineering 1415 Johnson Drive Madison, WI 53706 ph: (608)-265-5737 fax:(608)-262-1267 email: jes@ece.wisc.edu

More information

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2 Lecture Handout Computer Architecture Lecture No. 2 Reading Material Vincent P. Heuring&Harry F. Jordan Chapter 2,Chapter3 Computer Systems Design and Architecture 2.1, 2.2, 3.2 Summary 1) A taxonomy of

More information

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings

This Unit: Multithreading (MT) CIS 501 Computer Architecture. Performance And Utilization. Readings This Unit: Multithreading (MT) CIS 501 Computer Architecture Unit 10: Hardware Multithreading Application OS Compiler Firmware CU I/O Memory Digital Circuits Gates & Transistors Why multithreading (MT)?

More information

Chapter 5 Instructor's Manual

Chapter 5 Instructor's Manual The Essentials of Computer Organization and Architecture Linda Null and Julia Lobur Jones and Bartlett Publishers, 2003 Chapter 5 Instructor's Manual Chapter Objectives Chapter 5, A Closer Look at Instruction

More information

Giving credit where credit is due

Giving credit where credit is due CSCE 230J Computer Organization Processor Architecture VI: Wrap-Up Dr. Steve Goddard goddard@cse.unl.edu http://cse.unl.edu/~goddard/courses/csce230j Giving credit where credit is due ost of slides for

More information

Computer Architecture TDTS10

Computer Architecture TDTS10 why parallelism? Performance gain from increasing clock frequency is no longer an option. Outline Computer Architecture TDTS10 Superscalar Processors Very Long Instruction Word Processors Parallel computers

More information

Energy-Efficient, High-Performance Heterogeneous Core Design

Energy-Efficient, High-Performance Heterogeneous Core Design Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,

More information

SPARC64 VIIIfx: CPU for the K computer

SPARC64 VIIIfx: CPU for the K computer SPARC64 VIIIfx: CPU for the K computer Toshio Yoshida Mikio Hondo Ryuji Kan Go Sugizaki SPARC64 VIIIfx, which was developed as a processor for the K computer, uses Fujitsu Semiconductor Ltd. s 45-nm CMOS

More information

Course on Advanced Computer Architectures

Course on Advanced Computer Architectures Course on Advanced Computer Architectures Surname (Cognome) Name (Nome) POLIMI ID Number Signature (Firma) SOLUTION Politecnico di Milano, September 3rd, 2015 Prof. C. Silvano EX1A ( 2 points) EX1B ( 2

More information

Performance evaluation

Performance evaluation Performance evaluation Arquitecturas Avanzadas de Computadores - 2547021 Departamento de Ingeniería Electrónica y de Telecomunicaciones Facultad de Ingeniería 2015-1 Bibliography and evaluation Bibliography

More information

POWER8 Performance Analysis

POWER8 Performance Analysis POWER8 Performance Analysis Satish Kumar Sadasivam Senior Performance Engineer, Master Inventor IBM Systems and Technology Labs satsadas@in.ibm.com #OpenPOWERSummit Join the conversation at #OpenPOWERSummit

More information

Figure 1: Graphical example of a mergesort 1.

Figure 1: Graphical example of a mergesort 1. CSE 30321 Computer Architecture I Fall 2011 Lab 02: Procedure Calls in MIPS Assembly Programming and Performance Total Points: 100 points due to its complexity, this lab will weight more heavily in your

More information

OC By Arsene Fansi T. POLIMI 2008 1

OC By Arsene Fansi T. POLIMI 2008 1 IBM POWER 6 MICROPROCESSOR OC By Arsene Fansi T. POLIMI 2008 1 WHAT S IBM POWER 6 MICROPOCESSOR The IBM POWER6 microprocessor powers the new IBM i-series* and p-series* systems. It s based on IBM POWER5

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

(Refer Slide Time: 00:01:16 min)

(Refer Slide Time: 00:01:16 min) Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control

More information

Computer organization

Computer organization Computer organization Computer design an application of digital logic design procedures Computer = processing unit + memory system Processing unit = control + datapath Control = finite state machine inputs

More information

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats Execution Cycle Pipelining CSE 410, Spring 2005 Computer Systems http://www.cs.washington.edu/410 1. Instruction Fetch 2. Instruction Decode 3. Execute 4. Memory 5. Write Back IF and ID Stages 1. Instruction

More information

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World

Chapter 4 System Unit Components. Discovering Computers 2012. Your Interactive Guide to the Digital World Chapter 4 System Unit Components Discovering Computers 2012 Your Interactive Guide to the Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook

More information

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory

Computer Organization and Architecture. Characteristics of Memory Systems. Chapter 4 Cache Memory. Location CPU Registers and control unit memory Computer Organization and Architecture Chapter 4 Cache Memory Characteristics of Memory Systems Note: Appendix 4A will not be covered in class, but the material is interesting reading and may be used in

More information

Introduction to Microprocessors

Introduction to Microprocessors Introduction to Microprocessors Yuri Baida yuri.baida@gmail.com yuriy.v.baida@intel.com October 2, 2010 Moscow Institute of Physics and Technology Agenda Background and History What is a microprocessor?

More information

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s)

Addressing The problem. When & Where do we encounter Data? The concept of addressing data' in computations. The implications for our machine design(s) Addressing The problem Objectives:- When & Where do we encounter Data? The concept of addressing data' in computations The implications for our machine design(s) Introducing the stack-machine concept Slide

More information

Discovering Computers 2011. Living in a Digital World

Discovering Computers 2011. Living in a Digital World Discovering Computers 2011 Living in a Digital World Objectives Overview Differentiate among various styles of system units on desktop computers, notebook computers, and mobile devices Identify chips,

More information

IA-64 Application Developer s Architecture Guide

IA-64 Application Developer s Architecture Guide IA-64 Application Developer s Architecture Guide The IA-64 architecture was designed to overcome the performance limitations of today s architectures and provide maximum headroom for the future. To achieve

More information

İSTANBUL AYDIN UNIVERSITY

İSTANBUL AYDIN UNIVERSITY İSTANBUL AYDIN UNIVERSITY FACULTY OF ENGİNEERİNG SOFTWARE ENGINEERING THE PROJECT OF THE INSTRUCTION SET COMPUTER ORGANIZATION GÖZDE ARAS B1205.090015 Instructor: Prof. Dr. HASAN HÜSEYİN BALIK DECEMBER

More information

CS352H: Computer Systems Architecture

CS352H: Computer Systems Architecture CS352H: Computer Systems Architecture Topic 9: MIPS Pipeline - Hazards October 1, 2009 University of Texas at Austin CS352H - Computer Systems Architecture Fall 2009 Don Fussell Data Hazards in ALU Instructions

More information

Module: Software Instruction Scheduling Part I

Module: Software Instruction Scheduling Part I Module: Software Instruction Scheduling Part I Sudhakar Yalamanchili, Georgia Institute of Technology Reading for this Module Loop Unrolling and Instruction Scheduling Section 2.2 Dependence Analysis Section

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi

EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Celerra Unified Storage Platforms Using iscsi EMC Business Continuity for Microsoft SQL Server Enabled by SQL DB Mirroring Applied Technology Abstract Microsoft SQL Server includes a powerful capability to protect active databases by using either

More information

Thread level parallelism

Thread level parallelism Thread level parallelism ILP is used in straight line code or loops Cache miss (off-chip cache and main memory) is unlikely to be hidden using ILP. Thread level parallelism is used instead. Thread: process

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

CISC, RISC, and DSP Microprocessors

CISC, RISC, and DSP Microprocessors CISC, RISC, and DSP Microprocessors Douglas L. Jones ECE 497 Spring 2000 4/6/00 CISC, RISC, and DSP D.L. Jones 1 Outline Microprocessors circa 1984 RISC vs. CISC Microprocessors circa 1999 Perspective:

More information

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM 1 The ARM architecture processors popular in Mobile phone systems 2 ARM Features ARM has 32-bit architecture but supports 16 bit

More information

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Chapter 2 Basic Structure of Computers Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan Outline Functional Units Basic Operational Concepts Bus Structures Software

More information

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1

Input / Ouput devices. I/O Chapter 8. Goals & Constraints. Measures of Performance. Anatomy of a Disk Drive. Introduction - 8.1 Introduction - 8.1 I/O Chapter 8 Disk Storage and Dependability 8.2 Buses and other connectors 8.4 I/O performance measures 8.6 Input / Ouput devices keyboard, mouse, printer, game controllers, hard drive,

More information

Configuring Memory on the HP Business Desktop dx5150

Configuring Memory on the HP Business Desktop dx5150 Configuring Memory on the HP Business Desktop dx5150 Abstract... 2 Glossary of Terms... 2 Introduction... 2 Main Memory Configuration... 3 Single-channel vs. Dual-channel... 3 Memory Type and Speed...

More information

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately.

Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Historically, Huge Performance Gains came from Huge Clock Frequency Increases Unfortunately. Hardware Solution Evolution of Computer Architectures Micro-Scopic View Clock Rate Limits Have Been Reached

More information

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup

Chapter 12: Multiprocessor Architectures. Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Chapter 12: Multiprocessor Architectures Lesson 01: Performance characteristics of Multiprocessor Architectures and Speedup Objective Be familiar with basic multiprocessor architectures and be able to

More information

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA

CUDA Optimization with NVIDIA Tools. Julien Demouth, NVIDIA CUDA Optimization with NVIDIA Tools Julien Demouth, NVIDIA What Will You Learn? An iterative method to optimize your GPU code A way to conduct that method with Nvidia Tools 2 What Does the Application

More information

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA AGENDA INTRO TO BEAGLEBONE BLACK HARDWARE & SPECS CORTEX-A8 ARMV7 PROCESSOR PROS & CONS VS RASPBERRY PI WHEN TO USE BEAGLEBONE BLACK Single

More information

Windows Server Performance Monitoring

Windows Server Performance Monitoring Spot server problems before they are noticed The system s really slow today! How often have you heard that? Finding the solution isn t so easy. The obvious questions to ask are why is it running slowly

More information

Oracle Database Scalability in VMware ESX VMware ESX 3.5

Oracle Database Scalability in VMware ESX VMware ESX 3.5 Performance Study Oracle Database Scalability in VMware ESX VMware ESX 3.5 Database applications running on individual physical servers represent a large consolidation opportunity. However enterprises

More information

The Central Processing Unit:

The Central Processing Unit: The Central Processing Unit: What Goes on Inside the Computer Chapter 4 Objectives Identify the components of the central processing unit and how they work together and interact with memory Describe how

More information

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic

Pipeline Hazards. Arvind Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by Arvind and Krste Asanovic 1 Pipeline Hazards Computer Science and Artificial Intelligence Laboratory M.I.T. Based on the material prepared by and Krste Asanovic 6.823 L6-2 Technology Assumptions A small amount of very fast memory

More information

Multithreading Lin Gao cs9244 report, 2006

Multithreading Lin Gao cs9244 report, 2006 Multithreading Lin Gao cs9244 report, 2006 2 Contents 1 Introduction 5 2 Multithreading Technology 7 2.1 Fine-grained multithreading (FGMT)............. 8 2.2 Coarse-grained multithreading (CGMT)............

More information

Networking Topology For Your System

Networking Topology For Your System This chapter describes the different networking topologies supported for this product, including the advantages and disadvantages of each. Select the one that best meets your needs and your network deployment.

More information

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui Hardware-Aware Analysis and Optimization of Stable Fluids Presentation Date: Sep 15 th 2009 Chrissie C. Cui Outline Introduction Highlights Flop and Bandwidth Analysis Mehrstellen Schemes Advection Caching

More information

How To Improve Performance On A Single Chip Computer

How To Improve Performance On A Single Chip Computer : Redundant Arrays of Inexpensive Disks this discussion is based on the paper:» A Case for Redundant Arrays of Inexpensive Disks (),» David A Patterson, Garth Gibson, and Randy H Katz,» In Proceedings

More information

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield.

Technical Report. Complexity-effective superscalar embedded processors using instruction-level distributed processing. Ian Caulfield. Technical Report UCAM-CL-TR-707 ISSN 1476-2986 Number 707 Computer Laboratory Complexity-effective superscalar embedded processors using instruction-level distributed processing Ian Caulfield December

More information

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to: 55 Topic 3 Computer Performance Contents 3.1 Introduction...................................... 56 3.2 Measuring performance............................... 56 3.2.1 Clock Speed.................................

More information

- An Essential Building Block for Stable and Reliable Compute Clusters

- An Essential Building Block for Stable and Reliable Compute Clusters Ferdinand Geier ParTec Cluster Competence Center GmbH, V. 1.4, March 2005 Cluster Middleware - An Essential Building Block for Stable and Reliable Compute Clusters Contents: Compute Clusters a Real Alternative

More information

Price/performance Modern Memory Hierarchy

Price/performance Modern Memory Hierarchy Lecture 21: Storage Administration Take QUIZ 15 over P&H 6.1-4, 6.8-9 before 11:59pm today Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Last Time Exam discussion

More information

HyperThreading Support in VMware ESX Server 2.1

HyperThreading Support in VMware ESX Server 2.1 HyperThreading Support in VMware ESX Server 2.1 Summary VMware ESX Server 2.1 now fully supports Intel s new Hyper-Threading Technology (HT). This paper explains the changes that an administrator can expect

More information

Intel 8086 architecture

Intel 8086 architecture Intel 8086 architecture Today we ll take a look at Intel s 8086, which is one of the oldest and yet most prevalent processor architectures around. We ll make many comparisons between the MIPS and 8086

More information

GIVE YOUR ORACLE DBAs THE BACKUPS THEY REALLY WANT

GIVE YOUR ORACLE DBAs THE BACKUPS THEY REALLY WANT Why Data Domain Series GIVE YOUR ORACLE DBAs THE BACKUPS THEY REALLY WANT Why you should take the time to read this paper Speed up backups (Up to 58.7 TB/hr, Data Domain systems are about 1.5 times faster

More information

Solutions. Solution 4.1. 4.1.1 The values of the signals are as follows:

Solutions. Solution 4.1. 4.1.1 The values of the signals are as follows: 4 Solutions Solution 4.1 4.1.1 The values of the signals are as follows: RegWrite MemRead ALUMux MemWrite ALUOp RegMux Branch a. 1 0 0 (Reg) 0 Add 1 (ALU) 0 b. 1 1 1 (Imm) 0 Add 1 (Mem) 0 ALUMux is the

More information

Configuring a U170 Shared Computing Environment

Configuring a U170 Shared Computing Environment Configuring a U170 Shared Computing Environment NComputing Inc. March 09, 2010 Overview NComputing's desktop virtualization technology enables significantly lower computing costs by letting multiple users

More information