Midterm I SOLUTIONS, March 21st, 2012. CS252 Graduate Computer Architecture


University of California, Berkeley
College of Engineering, Computer Science Division (EECS)
Spring 2012, John Kubiatowicz

Midterm I SOLUTIONS
March 21st, 2012
CS252 Graduate Computer Architecture

Your Name: ____________________
SID Number: ____________________

[Grading grid: Problem | Possible | Score, with Total = 100 points]


Question #1: Short Answer [25pts]

Problem 1a[3pts]: What hardware structure can be used to support branch prediction, data prediction, and precise exceptions in an out-of-order processor? Explain what this structure is (including what information it holds) and how it is used with implicit register renaming to recover from a bad prediction or exception.

The Reorder Buffer is used to recover from branch misprediction, data misprediction, and to restore the processor to a precise state. The Reorder Buffer holds pending instructions in program order; in addition to the instruction itself, it holds the result of the instruction until the instruction is committed, as well as any exception results. With implicit register renaming (i.e. the original Tomasulo scheme), instructions are committed in program order by writing their results to the register file. As a result, we can recover from a bad prediction or exception by simply throwing out the contents of the Reorder Buffer and refetching the earliest instruction that didn't complete (which was at the head of the Reorder Buffer when we flushed the buffer). A sketch of the state such a buffer entry might hold appears after Problem 1d below.

Problem 1b[3pts]: What is explicit register renaming? What is involved with implementing explicit register renaming in a 2-way superscalar processor (don't forget the free list!)?

Explicit register renaming is the process of translating programmer-visible (logical) register names to physical register names. Implementing register renaming for a 2-way superscalar processor requires the ability to translate four source registers and two destination registers for two instructions simultaneously; the translation table must have four read ports and two write ports. Two details: we must be able to detect whether the destination register of the first instruction is the same as either of the source registers of the second instruction (and do an appropriate replacement). Further, we must have a free-list mechanism that can allocate two free registers at a time; we might use the reorder buffer to determine which physical registers are no longer in use and can be placed on the free list (up to two at a time).

Problem 1c[2pts]: Suppose that we start with a basic Tomasulo architecture that includes branch prediction. What changes are required to execute 4 instructions per cycle?

We need to be able to issue 4 instructions at a time to the reservation stations. We need to have 4 result buses to allow up to four instruction completions per cycle. We need 8 read ports and 4 write ports on the register file. Finally, we need enough parallel execution resources to keep up to 4 things going at once.

Problem 1d[2pts]: What could prevent the above architecture (in 1c) from sustaining 4 instructions per cycle? What could you do to improve utilization of the pipeline?

There are many possible problems (we took any reasonable ones). Structural hazards can stall the pipeline (specifically, insufficient queue slots in the reservation stations). Long-latency loads and stores could fill up buffers. Insufficient branch prediction could prevent issuing 4 instructions per cycle. Insufficient instruction-level parallelism can cause problems. We could improve the situation with simultaneous multithreading.
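Returning to Problem 1a: a minimal C sketch of the state a Reorder Buffer entry might hold. The field names and widths are illustrative assumptions, not part of the exam.

    #include <stdbool.h>
    #include <stdint.h>

    /* Illustrative Reorder Buffer entry (Problem 1a). Entries are
     * allocated at issue and retired from the head in program order;
     * all field names and widths are assumptions for this sketch. */
    typedef struct {
        uint32_t pc;          /* instruction address, for restart at a flush */
        uint8_t  dest_reg;    /* architectural destination register          */
        uint64_t result;      /* value held until commit                     */
        bool     ready;       /* result has been produced                    */
        bool     exception;   /* exception recorded, handled at commit       */
        uint8_t  exc_code;    /* which exception occurred                    */
    } ROBEntry;

    /* On a mispredicted branch or an exception, the whole buffer is
     * discarded and fetch restarts at the pc of the head entry. */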

Problem 1e[3pts]: Name three reasons that industry leaders (e.g. Intel) decided around 2002 to stop trying to improve individual processor performance and start producing multicore processors instead.

There were a number of reasons; we took any reasonable ones. Some examples: (1) the amount of ILP that could be automatically extracted by hardware had run out of steam; (2) power consumption had hit a high point and would have had to go higher to continue performance improvements; (3) designers found it impossible to continue increasing clock rates.

Problem 1f[2pts]: Most branches in a program are highly biased, i.e. they can be predicted by a simple one-level predictor. What can the compiler do to improve the number of branches that are in this category?

The compiler can perform a node-splitting operation in which branches that can be reached along multiple paths are replicated multiple times, once for each different path through the code. The resulting branches are typically more highly biased than they were before replication.

Problem 1g[2pts]: When is it better to handle events via interrupts rather than polling? How about the reverse? Be specific.

Interrupts work well as a notification mechanism when events are either unpredictable or infrequent. Interrupts can also be useful to guarantee processing of events (from a security standpoint), since interrupts often involve handing control to the kernel. Polling works well when events are regular or predictable, or when they can be delayed for a long time without affecting functionality.

Problem 1h[3pts]: Why is it advantageous to have a prime number of banks of DRAM in a vector processor? Can you say how to map a binary address to a prime number of banks without the expense of implementing a general modulus operation in hardware?

A prime number of banks provides high performance for a larger number of possible strides than a non-prime number of banks (i.e. all strides that do not have that particular prime number as a factor). If the prime number is of the form 2^m - 1 (i.e. a Mersenne prime), then the bank ID can be extracted from a binary address by exploiting the identity 2^x mod (2^m - 1) = 2^(x mod m). This means that we can divide the address into m-bit chunks which we add together. We take the result and do the same thing again until there are no more than m bits left (special case: if the result equals 2^m - 1, treat it as 0).
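A minimal C sketch of this chunk-folding trick (the function name and parameters are illustrative; the exam describes only the idea):

    #include <stdint.h>

    /* Compute addr mod (2^m - 1) without a hardware divider, by summing
     * m-bit chunks of the address (Problem 1h). For example, m = 3 maps
     * an address onto 7 banks. */
    static uint32_t bank_id(uint32_t addr, unsigned m) {
        const uint32_t mask = (1u << m) - 1;   /* 2^m - 1 */
        while (addr > mask) {                  /* fold until <= m bits remain */
            uint32_t sum = 0;
            while (addr) {
                sum += addr & mask;            /* add the next m-bit chunk */
                addr >>= m;
            }
            addr = sum;
        }
        return (addr == mask) ? 0 : addr;      /* special case: 2^m - 1 acts as 0 */
    }

For example, bank_id(42, 3) folds 42 into (42 & 7) + (42 >> 3) = 2 + 5 = 7, which is then treated as 0, matching 42 mod 7 = 0.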

Problem 1i[3pts]: Explain how exception conditions could occur out-of-order (in time) with a 5-stage, in-order pipeline. A diagram is probably the easiest way to illustrate. How can such a pipeline produce a precise exception point?

Exceptions can occur out-of-order in time because exceptions can occur in different stages. Example: illegal-instruction faults occur in the D stage, overflow errors occur in the E stage, and memory faults occur in the M stage. As a result, instructions that are later in the flow of the program can have exceptions that occur earlier in time, for instance:

    I1:  F  D  E  M  W
    I2:     F  D  E  M  W

In the above example, D of the second instruction occurs earlier in time than M of the first instruction, even though the first instruction is the precise exception point. The five-stage pipeline reorders exceptions simply by waiting to handle an exception until a well-defined point in the pipeline, such as the end of the memory stage. Thus, whenever an exception occurs, the corresponding stage simply sets an exception field in the pipeline rather than stopping the pipeline; the memory stage will look at this field to decide which instruction should be the precise exception point.

Problem 1j[2pts]: Suppose your virtual memory system has 4KB pages. Further, suppose you have a 64KB first-level cache. Explain how you could fully overlap the TLB lookup and cache access. Use a diagram to help with your explanation.

The simple answer to this question is to make sure that the bits examined by the TLB and the bits used by the cache index are different. Since pages are 4K in size, this means that only the lower 12 bits should be examined by the cache. To handle 64K of first-level cache, we must have 64K/4K = 16-way associativity. [Figure: the address splits into a 20-bit page number, a 7-bit cache index, and a 5-bit offset; the page number goes to the TLB for an associative lookup while the index reads a set of the 64K, 16-way cache; the 16 tags are compared against the frame number (FN) from the TLB to produce hit/miss and drive the output mux that selects the data.] Note that we still have to do the TLB lookup and way choice in series (thus, perhaps, "fully overlap" is slightly misleading).
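A small C sketch of the address split, assuming 32-bit virtual addresses and 32-byte cache lines (the line size is an assumption; the exam fixes only the 4KB pages and the 64KB, 16-way cache):

    #include <stdint.h>

    /* Split a 32-bit virtual address under the Problem 1j parameters.
     * A 32-byte line (assumed) gives a 5-bit offset and a 7-bit set
     * index, so the cache is indexed entirely by untranslated bits. */
    #define OFFSET_BITS 5                 /* 32-byte line (assumed)           */
    #define INDEX_BITS  7                 /* 128 sets = 64KB/(16 ways * 32B)  */
    #define PAGE_BITS   12                /* 4KB page                         */

    static inline uint32_t line_offset(uint32_t va) {
        return va & ((1u << OFFSET_BITS) - 1);
    }
    static inline uint32_t set_index(uint32_t va) {
        return (va >> OFFSET_BITS) & ((1u << INDEX_BITS) - 1);
    }
    static inline uint32_t page_number(uint32_t va) {
        return va >> PAGE_BITS;           /* translated by the TLB */
    }
    /* Since set_index() and line_offset() use only bits [11:0], the cache
     * read can start in parallel with the TLB lookup; the TLB's frame
     * number is needed only for the 16-way tag comparison afterwards. */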


Question #2: In-Order Superscalar Processor [25pts]

Consider a dual-issue, in-order pipeline with one fetch stage, one decode stage, multiple execution stages (which include memory access), and a single write-back stage. Assume that the execution stages are organized into two parallel execution pipelines (call them "even" and "odd") that support all possible simultaneous combinations of two instructions. Instructions wait in the decode stage until all of their dependencies have been satisfied. Further, since this is an in-order pipeline, new instructions will be forced to wait behind stalled instructions.

On each cycle, the decode stage takes zero, one, or two ready instructions from the fetch stage, gathers operands from the register file or the forwarding network, then dispatches them to the execution stages. If fewer than 2 instructions are dispatched on a particular cycle, NOPs are sent to the execution stages. When two instructions are dispatched, the even pipeline receives the earlier instruction. When only one instruction is dispatched, it is placed in the even pipeline.

Assume that each execution pipeline consists of a single linear sequence of stages in which later stages serve as no-ops for shorter operations (or: every instruction takes the same number of stages to execute, but results of shorter operations are available for forwarding sooner). All operations are fully pipelined and results are forwarded as soon as they are complete. Assume the following execution latencies: addf (2 cycles), multf (3 cycles), divf (4 cycles), integer ops (1 cycle). Assume that memory instructions take 4 cycles of execution: one for address calculation done by the integer execution stage, two unbreakable cycles for the actual cache access, and one cycle to check the cache. Finally, assume that branch conditions are computed by the integer execution units.

Problem 2a[2pts]: Suppose that this pipeline must be backward-compatible with an earlier, in-order, 5-stage pipeline that has a single branch-delay slot. Assuming that there is only a single fetch stage, will there be any bubbles in the pipeline from branches? Explain.

Yes. Even if we can process branch instructions in the decode stage, we will end up with a two-instruction bubble in the pipeline, at least part of the time: suppose that a branch and its delay slot get fetched in the same cycle. Then, when these two instructions are in the decode stage, two other instructions are being fetched; these next two instructions must be discarded if the branch is taken. If, on the other hand, the branch is fetched as an odd instruction, there will be one instruction in the fetch stage that will have to be discarded if the branch is taken. Note that more complex branch conditions are likely not computed until after the first execution stage, leading to more bubbles.

Problem 2b[2pts]: Suppose that the fetch stage takes 2 cycles. Will this change your answer from (2a)? Explain. If you added branch prediction, what would the pipeline have to do when the prediction was wrong?

Yes, this will make the situation in 2a even worse, potentially forcing the discarding of 2 additional instructions under some circumstances (or more, if the branch is computed in an execute stage).
Because the pipeline is strictly in-order, all that we have to do on a branch misprediction is (1) identify the instructions after the branch that are already in the pipeline (2 instructions in F1; at least 1 in F2, possibly 2 if the branch condition is computed in decode; 2 more if the branch is computed in E) and mark them for flushing (i.e. inhibit writeback of their results), and (2) start fetching from the correct branch target.

Problem 2c[10pts]: Below is a start at a simple diagram for the pipelines of this processor.
1) Finish the diagram. Stages are boxes with letters inside: use F1 and F2 for the fetch stages, D for a decode stage, EX1 through EX4 for the execution stages of each pipeline, and W for a writeback stage. Memory instructions take 1 cycle (EX1) to compute the address, two cycles to fetch (or write) data, and 1 cycle to check tags (TagC). Clearly label which is the even pipeline. Include arrows for forward information flow if this is not obvious.
2) Next, describe what is being performed in each of the 4 stages (including partial results).
3) Show all forwarding paths (as arrows). Your pipeline should never stall unless a value is not ready. Assume, for the moment, that there are never any cache misses. Label each bypass arrow with the types of instructions that will forward their results along that path (i.e. use M for multf, D for divf, A for addf, I for integer operations, and Ld for load results). [Hint: think carefully about inputs to store instructions!]

[Figure: two identical pipelines, EVEN and ODD, each organized as

    F1 -> F2 -> D -> EX1 -> EX2 -> EX3 -> EX4 -> W

with memory operations using EX1 for address calculation, MEM1 = EX2, MEM2 = EX3, and TagC = EX4. Bypass arcs run from the end of each stage in both pipelines back to the start of EX1 in either pipeline, labeled: after EX1: I; after EX2: I, A; after EX3: I, A, M, Ld; after EX4: I, A, M, Ld, D.]

STAGES:
F1/F2: first and second cycles of fetch.
D: decode stage; stall until ready to dispatch; fetch values from registers.
EX1: integer operations; compute memory address; first stage of addf, multf, and divf.
EX2: first cycle of memory operation (MEM1); second stage of addf, multf, divf.
EX3: second cycle of memory operation (MEM2, load results ready); third stage of multf, divf.
EX4: tag-check cycle of memory operation (TagC); fourth stage of divf.
W: writeback (write data to the register file).

Note that the arcs feeding into the end of the EX1 stages are for optimizing stores. It is important to note that there are no such arcs for the result of a load, since we cannot start a store until we have checked the tag (thus the normal feed of Ld results to the end of the D stage would be used for stores). If you take the "no cache misses" constraint literally, then we would feed from MEM2 to EX1.

Problem 2d[2pts]: Could this particular pipeline benefit from explicit register renaming? Why or why not? Be very explicit in your answer.

No. Explicit register renaming is not really necessary here, because the pipeline (1) is in order and commits in order at the W stage, and (2) uses bypassing. Consequently, there are no WAR and WAW hazards to worry about.

Problem 2e[2pts]: Note that we assume that a load is not completed until the end of EX3 and that a store must have its value by the beginning of EX2. Consider the following common sequence for a memory copy:

    loop:  ld   r1, 0(r2)
           st   r1, 0(r3)
           add  r2, r2, #4
           subi r4, r4, #1
           add  r3, r3, #4
           bne  r4, r0, loop
           nop

Why can't the load and store be dispatched in the same cycle? What is the minimum number of instructions that must be placed between them to avoid stalling? Explain.

They cannot be dispatched in the same cycle, since data from the load is not available until the MEM2 stage, and is not available to start something that cannot be aborted until after the TagC stage. The store instruction needs its data by the beginning of the MEM1 stage. Thus, we must feed from the end of the TagC stage to the beginning of the MEM1 stage; at the earliest, there must be at least 2 cycles of instructions between the load and the store (which means at least 4 instructions between them, and it could need to be 6 instructions if the load is in the even pipeline and the store is in the odd pipeline).

Problem 2f[2pts]: Assume that the following multiply-accumulate operation is extremely common for some important applications:

    multf f2, f1, f2
    addf  f3, f3, f2

How could you modify the pipeline from (2c) so that the above two operations could always be dispatched together in the same cycle? Explain with a figure. Would there be any negative consequences to this organization?

Assuming that multiplies take 3 cycles and adds take 2, you could arrange to process floating-point adds in cycles 4 and 5 of the odd pipeline. [Figure: both pipelines are extended to five stages EX1-EX5 (with MEM1/MEM2/TagC unchanged); the multf executes in EX1-EX3 of the even pipeline, and its result is forwarded to the odd pipeline, where the dependent addf starts in EX4 and completes in EX5.] Possible consequences: more bypassing logic, and more stalling (results of an add in the odd pipeline are not available until EX5).

Problem 2g[2pts]: Assume that the above pipeline can experience cache misses. What actions must happen if a cache miss is discovered when a load is traversing through the TagC stage? Are any instructions other than that particular load affected? Explain how to resolve any issues.

First, it is important to note that the actual load has the wrong value! Further, since we forward for maximum performance, we may have forwarded this wrong value from the MEM2 stage to one or two instructions that are now in the EX1 stage. The easiest thing to do when we discover a cache miss is to (1) start filling the cache, (2) flush all instructions in stages earlier than W (including the instructions in EX4/TagC), then (3) restart fetching from those instructions (including the original load) after the cache miss has completed. This simple solution need only replace the cache line in the cache; we do not have to figure out which instructions have gotten the wrong values, since we restart them all.

We could selectively do better than this by doing the following on a cache miss: we could (1) stall all instructions in the pipeline, which means that we do not update any of the latches; then (2) when the data comes back, we not only update the cache but also forward the particular part of the cache line we originally loaded to the EX1 latches (if necessary) and the TagC latches (just to have the right data for the writeback stage). We then let the pipeline recompute the failed cycle. Note that, by stalling the pipeline, we effectively recompute anything that might have used a wrong value and continue as if nothing incorrect ever happened. It is important to note that values are written back in order, in the write stage! Those of you who mentioned a ROB or register renaming were not taking the in-order nature of this pipeline into account (and didn't get credit for your answer!). We flush an instruction by turning it into a NOP.

Problem 2h[3pts]: Notice that the TagC stage is after the two memory data stages. Assuming that the above pipeline can experience cache misses on store instructions, how can you avoid overwriting data incorrectly during such a cache miss? Explain with a diagram any extra hardware that you may need to make stores work correctly. Be explicit. Do you need to change any of the arcs from (2c)?

The trick here is to wait until after we check the tag (in TagC) before writing to the cache. We can do this by splitting the tag lookup and the data storage of a store operation: we look up the tag for the current instruction, but do the store-back for a previous store instruction that has already been checked. Notice that there can be up to three stores per pipeline in process. Also, during loads we need to check for matching pending stores from each pipeline. [Figure: one half of the hardware (it must be matched against the other pipeline as well): three store-buffer entries, each holding a store address and store data, sit alongside the EX1/MEM1/MEM2/TagC stages, with a mux selecting which checked entry writes into the cache.]
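A C-level sketch of the per-pipeline store-buffer state this scheme implies (the entry layout and names are assumptions for illustration):

    #include <stdbool.h>
    #include <stdint.h>

    /* Hypothetical store-buffer entries for Problem 2h: one per store in
     * flight (up to three per pipeline, tracking MEM1, MEM2, and TagC).
     * A store writes the data cache only after its tag check passes;
     * loads must compare their address against these entries before
     * trusting cache data. */
    typedef struct {
        bool     valid;
        bool     tag_checked;   /* set once TagC confirms a hit           */
        uint32_t addr;          /* store address, compared against loads  */
        uint64_t data;          /* store data, forwarded on a load match  */
    } StoreBufEntry;

    StoreBufEntry store_buf[2][3];   /* two pipelines x three in flight */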

EXTRA CREDIT: Problem 2i[5pts]: Briefly describe the logic that would be required in the decode stage of this pipeline. In five (5) sentences or less (and possibly a small figure), describe a mechanism that would permit the decode stage to decide which of two instructions presented to it could be dispatched.

Problem #3: Software Scheduling [30pts]

For this problem, assume that we have a fully pipelined, single-issue, in-order processor with the following numbers of execution cycles:
1. Floating-point multiply: 4 cycles
2. Floating-point square root: 11 cycles
3. Floating-point add: 2 cycles
4. Integer operations: 1 cycle

Assume that there is one branch delay slot, that there is no delay between integer operations and dependent branch instructions, and that memory loads and stores require 2 memory cycles (plus the address computation). All functional units are fully pipelined and bypassed.

Problem 3a[3pts]: Compute the following latencies between instructions to avoid stalls (i.e. how many unrelated instructions must be inserted between two instructions of the following types to avoid stalls)? The first one is given:

    Between a ldf and addf:    2 insts      Between an addf and sqrtf:        1 inst
    Between an addf and stf:   0 insts      Between an sqrtf and stf:         9 insts
    Between a ldf and multf:   2 insts      Between a multf and addf:         3 insts
    Between a sqrtf and addf: 10 insts      Between an integer op and branch: 0 insts

The following code takes an array of 2D vectors, V[] (split into 1D arrays of coordinates, X[] and Y[]), and computes the sum of the norms, namely the sum of |V[i]|. It also stores the individual norms into array D[]. Let r1 point at array X, r2 at array Y, and r3 at array D. Let r4 hold the length of the arrays. Assume that F9 = 0 before the start of execution.

    cnorm:  ldf   F3,0(r1)    ; load x[i]
            (stall 2 cycles)
            multf F4,F3,F3    ; x[i]^2
            ldf   F5,0(r2)    ; load y[i]
            (stall 2 cycles)
            multf F6,F5,F5    ; y[i]^2
            (stall 3 cycles)
            addf  F7,F4,F6    ; x[i]^2 + y[i]^2
            (stall 1 cycle)
            sqrtf F8,F7       ; sqrt(x[i]^2 + y[i]^2)
            (stall 10 cycles)
            addf  F9,F9,F8    ; accumulate sum
            stf   0(r3),F8    ; d[i] = sqrt(x[i]^2 + y[i]^2)
            addi  r1,r1,#4
            addi  r2,r2,#4
            addi  r3,r3,#4
            subi  r4,r4,#1
            bnez  r4,cnorm
            nop

Problem 3b[2pts]: How many cycles does this loop take per iteration? Indicate stalls in the above code by labeling each of them with a number of cycles of stall.

Total cycles = 14 instructions + 18 stalls = 32 cycles / iteration
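For reference, a plain-C rendering of what the cnorm loop computes (semantics only; the array and register names follow the problem statement):

    #include <math.h>

    /* Semantics of the cnorm loop: d[i] = sqrt(x[i]^2 + y[i]^2), and the
     * running sum of the norms (held in F9) is returned. */
    float cnorm(const float *x, const float *y, float *d, int n) {
        float sum = 0.0f;                 /* F9 = 0 before execution */
        for (int i = 0; i < n; i++) {
            float norm = sqrtf(x[i] * x[i] + y[i] * y[i]);
            d[i] = norm;                  /* stf 0(r3),F8  */
            sum += norm;                  /* addf F9,F9,F8 */
        }
        return sum;
    }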

Problem 3c[4pts]: Reschedule this code to run with as few cycles per iteration as possible. Do not unroll it or software pipeline it. How many cycles do you get per iteration of the loop now?

    cnorm:  ldf   F3,0(r1)
            ldf   F5,0(r2)
            (stall 1 cycle)
            multf F4,F3,F3
            multf F6,F5,F5
            addi  r1,r1,#4
            (stall 2 cycles)
            addf  F7,F4,F6
            (stall 1 cycle)
            sqrtf F8,F7
            addi  r2,r2,#4
            addi  r3,r3,#4
            subi  r4,r4,#1
            (stall 6 cycles)
            stf   -4(r3),F8
            bnez  r4,cnorm
            addf  F9,F9,F8    ; delay slot

Total cycles = 13 instructions + 10 stalls = 23 cycles/iteration

Problem 3d[6pts]: Unroll the loop once and schedule it to run with as few cycles as possible. Ignore startup code. What is the average number of cycles per iteration of the original loop? Hint: to make this easier, use the tens digit in naming registers for the second iteration of the loop, i.e. F3 -> F13.

    cnorm:  ldf   F3,0(r1)
            ldf   F5,0(r2)
            ldf   F13,4(r1)
            multf F4,F3,F3
            ldf   F15,4(r2)
            multf F6,F5,F5
            multf F14,F13,F13
            multf F16,F15,F15
            (stall 1 cycle)
            addf  F7,F4,F6
            (stall 1 cycle)
            addf  F17,F14,F16
            sqrtf F8,F7
            sqrtf F18,F17
            addi  r1,r1,#8
            addi  r2,r2,#8
            addi  r3,r3,#8
            subi  r4,r4,#2
            (stall 4 cycles)
            stf   -8(r3),F8
            stf   -4(r3),F18
            addf  F9,F9,F8
            bnez  r4,cnorm
            addf  F9,F9,F18   ; delay slot

Total cycles = 21 instructions + 6 stalls = 27 cycles, i.e. 13.5 cycles/iteration

Problem 3e[5pts]: Software pipeline this loop to avoid stalls. Use as few instructions as possible. Your code should have no more than one copy of the original instructions. What is the average number of cycles per iteration? Ignore startup and exit code.

This code can be organized as a software pipeline of 5 stages:

    Stage 1: ldf, ldf
    Stage 2: multf, multf
    Stage 3: addf
    Stage 4: sqrtf
    Stage 5: stf, addf

    cnorm:  stf   0(r3),F8    ; r1+16, r2+16, r3+16, r4+4
            addf  F9,F9,F8    ;
            sqrtf F8,F7       ; r1+12, r2+12, r3+12, r4+3
            addf  F7,F4,F6    ; r1+8,  r2+8,  r3+8,  r4+2
            multf F4,F3,F3    ; r1+4,  r2+4,  r3+4,  r4+1
            multf F6,F5,F5    ;
            ldf   F3,-16(r1)  ; r1+0,  r2+0,  r3+0,  r4+0
            ldf   F5,-16(r2)  ;
            addi  r1,r1,#4
            addi  r2,r2,#4
            subi  r4,r4,#1
            bnez  r4,cnorm
            addi  r3,r3,#4    ; delay slot

Total cycles = 13 instructions + 0 stalls = 13 cycles/iteration

Problem 3f[3pts]: Assuming that we are allowed to perform loads off the end of the arrays at r1 and r2, show how we can add a small amount of startup code and a very simple type of exit code to make your software-pipelined loop a complete loop for any number of iterations > 0. HINT: Assume that you clean up the first few iterations at the end, but be careful about F9!

STARTUP CODE:
    cnorm_ent:  movfi F9, 0f     ; init F9 = 0
                movfi F8, 0f     ; init F8 = 0
                movfi F7, 0f     ; init F7 = 0
                movfi F4, 0f     ; init F4, F6 = 0
                movfi F6, 0f     ;
                movfi F3, 0f     ; init F3, F5 = 0
                movfi F5, 0f     ;
                mov   r11, r1    ; save r1, r2, r3, r4
                mov   r12, r2
                mov   r13, r3
                mov   r14, r4

EXIT CODE:
    filtersc:   mov  r1, r11     ; restore address of x[]
                mov  r2, r12     ; restore address of y[]
                mov  r3, r13     ; restore address of d[]
                mov  r4, r14     ; restore iteration count
                slti r5, r4, #5  ; fewer than 5 iterations?
                bnez r5, finloop ; yes: use the iteration count
                nop
                addi r4, r0, #4  ; no: peg the fixup count at 4
    finloop:    <code from 3a or 3c here>   ; fix up to 4 iterations

Problem 3g[5pts]: Suppose that we have a vector pipeline in addition to everything else. Assuming that the hardware vector size is at least as big as the number of loop iterations (i.e. the value in r4 at entrance), produce a vectorized version of the loop. If you are not sure about the name of a particular instruction, use a name that seems reasonable, but make sure to put comments in your code to make it obvious what you mean. Assume that there are no native vector reduction operations. Your code should take the same four arguments (r1, r2, r3, and r4) and produce the same result (F9).

    cnorm:  MOVL  r4           ; set vector length
            LVS   V3,r1,4      ; load x[] into V3
            MULTV V4,V3,V3     ; square of x[] -> V4
            LVS   V5,r2,4      ; load y[] into V5
            MULTV V6,V5,V5     ; square of y[] -> V6
            ADDV  V7,V4,V6     ; sum of squares -> V7
            SQRTV V8,V7        ; square roots -> V8
            SV    V8,r3,4      ; result -> d[]

    ; The following reduction ignores the non-associativity of rounding and
    ; assumes that non-initialized elements of V8 = 0 (may need to work for this):
    reduce:    movi  r5, MAXLEN   ; assume a power of 2 > 1
               MOVL  r5           ; set vector length
    red_loop:  VEXTHALF V9,V8     ; extract top half of V8 into V9
               ADDV  V8,V8,V9     ; bottom half holds partial sums
               srl   r5,r5,#1     ; divide length by 2
               slti  r6,r5,#2
               beqz  r6,red_loop  ; still > 1?
               MOVL  r5           ; delay slot: set vector length
               SVEXT F9,V8,#0     ; move element 0 of V8 into F9

Note that we did the reduction assuming that floating-point addition is associative (which it isn't, because of rounding). Note that, unless you are worried about the reproducibility of two different versions of this code, it is not clear that the different answer you get from this code is any more correct than the answer you would get from the original code. Always understand your problem statement! We also used two vector-extract instructions: one extracts the top half of a vector register (relative to the vector length) into the bottom half of a different register, and the other extracts an individual vector element into a scalar register. Note that we also accepted a serial loop that scanned through d[] to compute the result (since this would technically be correct if we cared about the ordering of the adds in the reduction).

Problem 3h[2pts]: Describe generally (in a couple of sentences) what you would have to do to your code in 3g if the size of a hardware vector is < the number of iterations of the loop.

You would have to strip-mine the code, namely divide it into multiple chunks no larger than the hardware vector size. The code would look roughly like this (see also the C sketch below):

    cnorm_sm:   modi r5,r4,MAXLEN   ; compute remainder
    cnorm_loop: MOVL r5             ; set vector length
                ; <compute one vector-length's worth of work>
                ; <update F9 by adding the partial sum of these items>
                ; <update array starting points (r1, r2, r3)>
                sub  r4,r4,r5
                bnez r4,cnorm_loop
                movi r5,MAXLEN      ; delay slot: later chunks are full length
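A C-level sketch of the same strip-mining pattern (MAXLEN and the vec_cnorm helper are stand-ins for the hardware vector length and the vector sequence from 3g):

    /* Strip-mined cnorm (Problem 3h). MAXLEN stands in for the hardware
     * vector length; vec_cnorm() is a hypothetical helper that runs the
     * 3g vector sequence on one chunk and returns its partial sum. */
    #define MAXLEN 64                        /* assumed hardware vector length */

    extern float vec_cnorm(const float *x, const float *y, float *d, int vl);

    float cnorm_sm(const float *x, const float *y, float *d, int n) {
        float sum = 0.0f;
        int vl = n % MAXLEN;                 /* first (possibly short) chunk */
        if (vl == 0) vl = MAXLEN;
        for (int i = 0; i < n; ) {
            sum += vec_cnorm(x + i, y + i, d + i, vl);
            i += vl;
            vl = MAXLEN;                     /* remaining chunks are full length */
        }
        return sum;
    }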

Problem 4: Paper Potpourri [20pts]

Problem 4a[3pts]: One of the papers that we read showed that the delay of the bypass network scales quadratically with issue width. Can you give an intuition as to why this is true? Assuming that you want to double the number of instructions issued per cycle without slowing your clock cycle down by a factor of 4, what could you do?

The intuition behind this result is that the length of the bypass network varies at least linearly with the number of units that need to bypass their values. Since the delay of wires varies quadratically with length (assuming that they are not repeated), this gives us a quadratic factor. In fact, you could imagine adding repeaters to give some relief to this scaling factor, but the width of the muxes also increases with issue width, leading to super-linear delay. You could mitigate this delay increase by clustering functional units into smaller groups.

Problem 4b[2pts]: Higher-dimensional networks (e.g. hypercubes) can route messages with fewer hops than lower-dimensional networks. Nonetheless, the exploration paper that we read on k-ary n-cubes (Bill Dally) showed that high-dimensional networks were not necessarily the lowest-latency networks. Explain what assumptions lead to this conclusion and reflect on when such assumptions are valid.

The important assumption here was that wiring was limited by physical constraints, i.e. the cross-sectional area of the wires across the bisection must be the same regardless of the degree of the network. Thus, when optimizing for overall latency, lower-dimensional networks might perform better because they support wider (higher-bandwidth) connections.

Problem 4c[2pts]: The "Future of Wires" paper ultimately concluded that multicore was an important future architectural innovation (as opposed to more complex uniprocessor cores). Can you give two arguments that lead to that conclusion?

One of the primary arguments was that reasonable assumptions about the constitution of wires and clock rates (measured in units of FO4) would lead one to conclude that the number of clock cycles to cross a typical chip was increasing. Thus, putting the long-distance wires into a generalized network that connects small processors (i.e. multicore) is the easiest way to handle multiple cycles of communication latency, rather than trying to build a large multi-issue processor. A second argument involved the observation that CAD tools generate errors in routing proportional to the number of transistors on the chip; each of these errors requires painstaking correction by hand. Multicore is the best way to handle Moore's Law growth in the number of transistors per chip (i.e. exponential): simply produce and debug a processor macro, then replicate it across the chip (with a regular network).

Problem 4d[3pts]: Sketch out the following branch predictors: gshare, PAp, Tournament.

[Figures: gshare XORs the global branch history register (GBHR) with the branch address to index a single global pattern history table (GPHT). PAp uses the branch address to select a per-address branch history register (PABHR), which in turn indexes a per-address pattern history table (PAPHT). The Tournament predictor runs two component predictor types in parallel, with a chooser table driving a mux that selects between their predictions.]
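As one concrete example, a minimal gshare sketch in C; the 4K-entry table and 12-bit history are arbitrary illustrative choices:

    #include <stdbool.h>
    #include <stdint.h>

    /* Minimal gshare predictor (Problem 4d): XOR the global branch
     * history with the branch PC to index a single table of 2-bit
     * saturating counters. */
    #define GSHARE_BITS 12
    #define GSHARE_SIZE (1u << GSHARE_BITS)

    static uint8_t  pht[GSHARE_SIZE];     /* 2-bit counters (0..3), start at 0 */
    static uint32_t ghr;                  /* global branch history register    */

    static bool gshare_predict(uint32_t pc) {
        uint32_t idx = (pc ^ ghr) & (GSHARE_SIZE - 1);
        return pht[idx] >= 2;             /* predict taken if counter is 2 or 3 */
    }

    static void gshare_update(uint32_t pc, bool taken) {
        uint32_t idx = (pc ^ ghr) & (GSHARE_SIZE - 1);
        if (taken  && pht[idx] < 3) pht[idx]++;   /* saturate upward   */
        if (!taken && pht[idx] > 0) pht[idx]--;   /* saturate downward */
        ghr = ((ghr << 1) | (taken ? 1u : 0u)) & (GSHARE_SIZE - 1);
    }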

Problem 4e[2pts]: What is a simple technique using virtual channels that will permit arbitrary adaptation (around faults and congestion) while still guaranteeing deadlock freedom?

Divide the virtual channels into two categories: adaptive and deadlock-free. The deadlock-free network can use any technique to be deadlock-free, such as routing in dimension order (requiring only 1 virtual channel per physical channel). Then, to route a message, use virtual channels in the adaptive category any way you want (routing around faults or congestion); if you get stuck for a bit, transition to the deadlock-free network. Most of the time, you will never have to leave the adaptive network. The most important constraint here is that once a message starts routing on the deadlock-free network, it must never transition back to the adaptive network.

Problem 4f[3pts]: What was the basic idea behind Trace Scheduling for VLIW, and why is it necessary to get good performance from a VLIW? Explain why a VLIW might need to perform many simultaneous condition checks when implementing Trace Scheduling.

Trace Scheduling looks at traces of program execution to identify common paths through the program (across multiple branches, i.e. basic blocks); once these paths have been identified, it compresses all of the instructions in a trace together. The resulting superblock of instructions is scheduled as if the branches always take the path identified in the trace. Checks are placed at the end of the block to catch violations of the predicted branch directions; further, special fixup code is generated to correct the state of the program if branch directions are violated. Trace Scheduling is necessary for a VLIW since we need to generate a large block of instructions (without branches) in order to find enough parallelism to fill up the slots in VLIW instructions. Without Trace Scheduling, the instructions in the program couldn't cross branch boundaries, a serious constraint when branches might occur every 5 instructions. As described above, we might need to check many branch conditions at the end of our scheduled block to see if we need to execute fixup code.

Problem 4g[3pts]: What is Simultaneous Multithreading? What hardware enhancements would be required to transform a superscalar, out-of-order processor into one that performs simultaneous multithreading? Be explicit.

Simultaneous Multithreading is a technique that allows instructions from multiple threads to exist in the pipeline at the same time; the term is applied to superscalar, out-of-order pipelines. Hardware enhancements for Simultaneous Multithreading include (1) multiple PCs and branch-prediction hardware, (2) fetch logic to choose instructions from multiple threads, (3) additional renaming resources to support more than one thread, including more translation-table space and additional physical registers, (4) multiple commit logic to handle commits from multiple threads, and (5) possibly additional TLB space to allow each thread to operate in a different address space.

Problem 4h[2pts]: What is coarse-grained multithreading (as implemented by the Sparcle processor)? Name at least 2 ways in which the Alewife multiprocessor utilized coarse-grained multithreading.

Coarse-grained multithreading is a form of multithreading that switches from one thread to another infrequently, say at events such as cache misses or synchronization misses, rather than switching every instruction.
Another definition of coarse-grained multithreading is that instructions from different threads never coexist in the pipeline at the same time. With coarse-grained multithreading, the overhead of switching from one thread to another can be multiple cycles (in Alewife it was 14 cycles). Alewife used coarse-grained multithreading in a number of ways, including switching: (1) on cache misses to global memory, (2) during synchronization misses (such as with fine-grained synchronization), and (3) to handle the threads generated by incoming messages.


More information

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification

Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Introduction Overview Motivating Examples Interleaving Model Semantics of Correctness Testing, Debugging, and Verification Advanced Topics in Software Engineering 1 Concurrent Programs Characterized by

More information

OPENSPARC T1 OVERVIEW

OPENSPARC T1 OVERVIEW Chapter Four OPENSPARC T1 OVERVIEW Denis Sheahan Distinguished Engineer Niagara Architecture Group Sun Microsystems Creative Commons 3.0United United States License Creative CommonsAttribution-Share Attribution-Share

More information

Computer Systems Structure Main Memory Organization

Computer Systems Structure Main Memory Organization Computer Systems Structure Main Memory Organization Peripherals Computer Central Processing Unit Main Memory Computer Systems Interconnection Communication lines Input Output Ward 1 Ward 2 Storage/Memory

More information

(Refer Slide Time: 00:01:16 min)

(Refer Slide Time: 00:01:16 min) Digital Computer Organization Prof. P. K. Biswas Department of Electronic & Electrical Communication Engineering Indian Institute of Technology, Kharagpur Lecture No. # 04 CPU Design: Tirning & Control

More information

Energy-Efficient, High-Performance Heterogeneous Core Design

Energy-Efficient, High-Performance Heterogeneous Core Design Energy-Efficient, High-Performance Heterogeneous Core Design Raj Parihar Core Design Session, MICRO - 2012 Advanced Computer Architecture Lab, UofR, Rochester April 18, 2013 Raj Parihar Energy-Efficient,

More information

The Classical Architecture. Storage 1 / 36

The Classical Architecture. Storage 1 / 36 1 / 36 The Problem Application Data? Filesystem Logical Drive Physical Drive 2 / 36 Requirements There are different classes of requirements: Data Independence application is shielded from physical storage

More information

FLIX: Fast Relief for Performance-Hungry Embedded Applications

FLIX: Fast Relief for Performance-Hungry Embedded Applications FLIX: Fast Relief for Performance-Hungry Embedded Applications Tensilica Inc. February 25 25 Tensilica, Inc. 25 Tensilica, Inc. ii Contents FLIX: Fast Relief for Performance-Hungry Embedded Applications...

More information

CSC 2405: Computer Systems II

CSC 2405: Computer Systems II CSC 2405: Computer Systems II Spring 2013 (TR 8:30-9:45 in G86) Mirela Damian http://www.csc.villanova.edu/~mdamian/csc2405/ Introductions Mirela Damian Room 167A in the Mendel Science Building [email protected]

More information

Spacecraft Computer Systems. Colonel John E. Keesee

Spacecraft Computer Systems. Colonel John E. Keesee Spacecraft Computer Systems Colonel John E. Keesee Overview Spacecraft data processing requires microcomputers and interfaces that are functionally similar to desktop systems However, space systems require:

More information

Whitepaper: performance of SqlBulkCopy

Whitepaper: performance of SqlBulkCopy We SOLVE COMPLEX PROBLEMS of DATA MODELING and DEVELOP TOOLS and solutions to let business perform best through data analysis Whitepaper: performance of SqlBulkCopy This whitepaper provides an analysis

More information

EE361: Digital Computer Organization Course Syllabus

EE361: Digital Computer Organization Course Syllabus EE361: Digital Computer Organization Course Syllabus Dr. Mohammad H. Awedh Spring 2014 Course Objectives Simply, a computer is a set of components (Processor, Memory and Storage, Input/Output Devices)

More information

We r e going to play Final (exam) Jeopardy! "Answers:" "Questions:" - 1 -

We r e going to play Final (exam) Jeopardy! Answers: Questions: - 1 - . (0 pts) We re going to play Final (exam) Jeopardy! Associate the following answers with the appropriate question. (You are given the "answers": Pick the "question" that goes best with each "answer".)

More information

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005

Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005 Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005 Compute Cluster Server Lab 3: Debugging the parallel MPI programs in Microsoft Visual Studio 2005... 1

More information

An Introduction To Simple Scheduling (Primarily targeted at Arduino Platform)

An Introduction To Simple Scheduling (Primarily targeted at Arduino Platform) An Introduction To Simple Scheduling (Primarily targeted at Arduino Platform) I'm late I'm late For a very important date. No time to say "Hello, Goodbye". I'm late, I'm late, I'm late. (White Rabbit in

More information

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune Introduction to RISC Processor ni logic Pvt. Ltd., Pune AGENDA What is RISC & its History What is meant by RISC Architecture of MIPS-R4000 Processor Difference Between RISC and CISC Pros and Cons of RISC

More information

Serial Communications

Serial Communications Serial Communications 1 Serial Communication Introduction Serial communication buses Asynchronous and synchronous communication UART block diagram UART clock requirements Programming the UARTs Operation

More information

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek

Instruction Set Architecture. or How to talk to computers if you aren t in Star Trek Instruction Set Architecture or How to talk to computers if you aren t in Star Trek The Instruction Set Architecture Application Compiler Instr. Set Proc. Operating System I/O system Instruction Set Architecture

More information

16.1 MAPREDUCE. For personal use only, not for distribution. 333

16.1 MAPREDUCE. For personal use only, not for distribution. 333 For personal use only, not for distribution. 333 16.1 MAPREDUCE Initially designed by the Google labs and used internally by Google, the MAPREDUCE distributed programming model is now promoted by several

More information

Software Pipelining - Modulo Scheduling

Software Pipelining - Modulo Scheduling EECS 583 Class 12 Software Pipelining - Modulo Scheduling University of Michigan October 15, 2014 Announcements + Reading Material HW 2 Due this Thursday Today s class reading» Iterative Modulo Scheduling:

More information

Technical Note. Micron NAND Flash Controller via Xilinx Spartan -3 FPGA. Overview. TN-29-06: NAND Flash Controller on Spartan-3 Overview

Technical Note. Micron NAND Flash Controller via Xilinx Spartan -3 FPGA. Overview. TN-29-06: NAND Flash Controller on Spartan-3 Overview Technical Note TN-29-06: NAND Flash Controller on Spartan-3 Overview Micron NAND Flash Controller via Xilinx Spartan -3 FPGA Overview As mobile product capabilities continue to expand, so does the demand

More information

Lecture 17: Virtual Memory II. Goals of virtual memory

Lecture 17: Virtual Memory II. Goals of virtual memory Lecture 17: Virtual Memory II Last Lecture: Introduction to virtual memory Today Review and continue virtual memory discussion Lecture 17 1 Goals of virtual memory Make it appear as if each process has:

More information

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus

What is a bus? A Bus is: Advantages of Buses. Disadvantage of Buses. Master versus Slave. The General Organization of a Bus Datorteknik F1 bild 1 What is a bus? Slow vehicle that many people ride together well, true... A bunch of wires... A is: a shared communication link a single set of wires used to connect multiple subsystems

More information

2

2 1 2 3 4 5 For Description of these Features see http://download.intel.com/products/processor/corei7/prod_brief.pdf The following Features Greatly affect Performance Monitoring The New Performance Monitoring

More information

Operating Systems. Virtual Memory

Operating Systems. Virtual Memory Operating Systems Virtual Memory Virtual Memory Topics. Memory Hierarchy. Why Virtual Memory. Virtual Memory Issues. Virtual Memory Solutions. Locality of Reference. Virtual Memory with Segmentation. Page

More information

Scalability and Classifications

Scalability and Classifications Scalability and Classifications 1 Types of Parallel Computers MIMD and SIMD classifications shared and distributed memory multicomputers distributed shared memory computers 2 Network Topologies static

More information

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV)

Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Interconnection Networks Programmierung Paralleler und Verteilter Systeme (PPV) Sommer 2015 Frank Feinbube, M.Sc., Felix Eberhardt, M.Sc., Prof. Dr. Andreas Polze Interconnection Networks 2 SIMD systems

More information

Low Power AMD Athlon 64 and AMD Opteron Processors

Low Power AMD Athlon 64 and AMD Opteron Processors Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD

More information

Interconnection Network Design

Interconnection Network Design Interconnection Network Design Vida Vukašinović 1 Introduction Parallel computer networks are interesting topic, but they are also difficult to understand in an overall sense. The topological structure

More information