COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING
2013/2014, 1st Semester
Sample Exam, January 2014
Duration: 2h00

- No extra material allowed. This includes notes, scratch paper, calculator, etc.
- Give your answers in the available space after each question. You can use either Portuguese or English.
- Be sure to write your name and number on all pages; non-identified pages will not be graded!
- Justify all your answers.
- Don't hurry, you should have plenty of time to finish this test. Skip questions that you are less comfortable with and come back to them later on.

IST ID:                    Name:

I. (1.5 + 1 + 0.5 + 1 + 1.5 + 0.5 + 1.5 = 7.5 points)

1. Consider two different implementations of the same instruction set architecture. There are four classes of instructions: A, B, C, and D. The clock rate and CPI of each implementation are given in the following table.

Implementation   Clock Rate   CPI Class A   CPI Class B   CPI Class C   CPI Class D
I1               2.5 GHz      2             1.5           2             1
I2               3 GHz        1             2             1             1

a) Consider a program executing 10^6 instructions divided into classes as follows: 10% class A, 20% class B, 50% class C, and 20% class D. Determine which implementation is faster.
b) What is the global CPI for each implementation?

c) How much time does each implementation require to execute the program?

d) If, for implementation I1, the number of Class A instructions can be reduced by half at the expense of 10% more Class B instructions, what is the resulting speedup?
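The quantities asked for in questions 1 a) to c) follow from two standard relations: the global CPI is the mix-weighted average of the per-class CPIs, and the execution time is instruction count × CPI / clock rate. The Python sketch below is only a way to check hand calculations; the function and variable names are illustrative choices, not part of the exam.

```python
def global_cpi(mix, cpi_by_class):
    """Weighted average CPI: sum over classes of (fraction * class CPI)."""
    return sum(mix[c] * cpi_by_class[c] for c in mix)


def execution_time(instruction_count, cpi, clock_hz):
    """Execution time in seconds: instructions * cycles/instruction / (cycles/second)."""
    return instruction_count * cpi / clock_hz


# Instruction mix and implementation data copied from the exam table.
MIX = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}
I1 = {"clock_hz": 2.5e9, "cpi": {"A": 2, "B": 1.5, "C": 2, "D": 1}}
I2 = {"clock_hz": 3.0e9, "cpi": {"A": 1, "B": 2, "C": 1, "D": 1}}
INSTRUCTIONS = 10**6

cpi_i1 = global_cpi(MIX, I1["cpi"])
cpi_i2 = global_cpi(MIX, I2["cpi"])
time_i1 = execution_time(INSTRUCTIONS, cpi_i1, I1["clock_hz"])
time_i2 = execution_time(INSTRUCTIONS, cpi_i2, I2["clock_hz"])
```

Question d) can be checked with the same two functions once the modified instruction counts are fixed; how the "10% more Class B instructions" is counted is left to the reader's interpretation of the statement.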
2. Consider the MIPS processor pipeline that was presented in this course, with the five pipeline stages F, D, X, M, and W. Consider also that:

- forwarding mechanisms were implemented to automatically resolve data hazards without stalls, whenever possible;
- no branch prediction mechanism is implemented;
- the branch address is computed in the D stage;
- independent data and program memories exist.

The following code segment was executed in this processor:

               addi $t0, $zero, 0
               lw   $t3, 0($s1)
    for_loop:  addi $t1, $t0, -16
               beq  $t1, $0, loop_done
               lw   $t2, 8($s1)
               add  $t3, $t3, $t2
               sw   $t3, 100($s1)
               addi $t0, $t0, 4
               j    for_loop
    loop_done:

a) Represent the execution of the first two iterations of the program loop, by representing, for each instruction, the several executed stages of the pipeline: F, D, X, M, and W. Do not forget to represent every stall that may occur.

(Answer grid: one row per instruction, one column per clock cycle, cycles 1 to 30.)
b) What is the global CPI for this program?

c) Perform a full loop unrolling of the program. Estimate the speedup that is achieved by this operation.
II. (1.5 + 1.5 + 1 + 1 + 1 + 1 = 7 points)

1. Consider a memory system for a 32-bit processor with separate caches for code and data. Assume that the processor always makes accesses to 32-bit words, and that the address space is 2^32 words. The data cache has the following characteristics:

- 64 KB capacity;
- 2-way set associative;
- 2-word blocks;
- write-back, write-allocate;
- LRU replacement policy.

The data bus between the caches and memory is 64 bits wide, thus allowing the cache block to be filled in a single memory access.

The following program, which counts the number of asymmetric positions in a matrix (a[i,j] ≠ a[j,i]), is executed on this system.

    register int i, j, sym;   /* 32-bit integers in registers */
    int a[1024][1024];
    ...
    sym = 0;
    for (i = 0; i < 1024; i = i + 1)
        for (j = i; j < 1024; j = j + 1)
            if (a[i][j] != a[j][i])
                sym = sym + 1;

Assume that the variables are allocated sequentially in memory starting at address 0, where the matrix elements are ordered by rows (a[0,0], a[0,1], ..., a[1,0], ...).

a) Determine the hit rate in the data cache for this program (ignore the startup misses).
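An analytical answer to question a) can be cross-checked by replaying the loop's address trace through a small cache model. The following Python sketch is one possible model, under stated assumptions: word addressing, the matrix starting at word address 0 (the scalar variables live in registers), only the two reads a[i][j] and a[j][i] per iteration, and every miss counted, including startup misses that the question asks to ignore. The names `simulate_hit_rate` and `matrix_trace` are hypothetical helpers introduced here.

```python
from collections import OrderedDict


def simulate_hit_rate(addresses, num_sets, ways, block_words):
    """Hit rate of a set-associative LRU cache over a stream of word addresses."""
    sets = [OrderedDict() for _ in range(num_sets)]
    hits = 0
    total = 0
    for addr in addresses:
        block = addr // block_words      # block number in memory
        index = block % num_sets         # set the block maps to
        tag = block // num_sets          # remaining high-order bits
        lines = sets[index]
        total += 1
        if tag in lines:
            hits += 1
            lines.move_to_end(tag)       # mark as most recently used
        else:
            if len(lines) >= ways:
                lines.popitem(last=False)  # evict the least recently used line
            lines[tag] = True
    return hits / total if total else 0.0


def matrix_trace(n):
    """Word addresses read by the loop nest for an n x n row-major matrix at address 0."""
    for i in range(n):
        for j in range(i, n):
            yield i * n + j   # a[i][j]
            yield j * n + i   # a[j][i]
```

With 4-byte words, a 64 KB cache holds 16384 words, that is 8192 two-word blocks or 4096 two-way sets, so the exam configuration corresponds to `simulate_hit_rate(matrix_trace(1024), num_sets=4096, ways=2, block_words=2)`. The run replays roughly one million accesses, so it may take a few seconds.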
b) Compute the average memory access time for this program. Assume that the cache hit time is 1T and that the miss penalty is 10T, where T = 10 ns is the clock period. (If and only if you did not solve the previous question, assume that the hit rate in the data cache is 67%.)

c) Under the same conditions as the previous question, determine the occupation rate of the bus between the cache and main memory.
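For question b), a common textbook formulation is AMAT = hit time + miss rate × miss penalty, where the hit time is paid on every access and the miss penalty is added on top. The Python sketch below applies that formula to the fallback hit rate of 67% given in the question; the formula choice and the names used are assumptions of this sketch, not a prescribed solution.

```python
def amat(hit_time, miss_rate, miss_penalty):
    """Average memory access time, assuming the hit time is paid on every access."""
    return hit_time + miss_rate * miss_penalty


T_NS = 10                    # clock period from the question, in nanoseconds
FALLBACK_HIT_RATE = 0.67     # only to be used if question a) was not solved

amat_ns = amat(hit_time=1 * T_NS,
               miss_rate=1 - FALLBACK_HIT_RATE,
               miss_penalty=10 * T_NS)
```

Replacing `FALLBACK_HIT_RATE` with the hit rate obtained in question a) gives the corresponding value for the actual program.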
2. The memory architecture of a machine X is summarized in the following table:

Virtual Address   Page Size    PTE Size
54 bits           16 K bytes   4 bytes

a) Assume that there are 8 bits reserved for the operating system functions (protection, replacement, valid, modified, etc.) other than those required by the hardware translation algorithm. Derive the largest physical memory size (in bytes) allowed by this PTE format. Make sure you consider all the fields required by the translation algorithm.

b) How large (in bytes) is the page table?

c) Assuming that only one application exists in the system and the maximum physical memory is devoted to the process, how much physical space (in bytes) is there for the application's data and code?
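The bit-field arithmetic behind questions a) and b) can be sketched as follows. This is a minimal Python sketch under two assumptions that the exam does not state explicitly: every PTE bit not reserved for the operating system holds the physical page number, and the page table is a flat, single-level table with one PTE per virtual page. The function name `pte_layout` is a hypothetical helper.

```python
def pte_layout(virtual_bits, page_bytes, pte_bytes, os_bits):
    """Derive address field widths and sizes from the table parameters."""
    offset_bits = page_bytes.bit_length() - 1      # log2 of a power-of-two page size
    vpn_bits = virtual_bits - offset_bits          # virtual page number width
    ppn_bits = pte_bytes * 8 - os_bits             # assumed: all remaining PTE bits are the PPN
    return {
        "offset_bits": offset_bits,
        "vpn_bits": vpn_bits,
        "ppn_bits": ppn_bits,
        "physical_bytes": 1 << (ppn_bits + offset_bits),
        "page_table_bytes": (1 << vpn_bits) * pte_bytes,  # assumed: flat single-level table
    }


machine_x = pte_layout(virtual_bits=54, page_bytes=16 * 1024, pte_bytes=4, os_bits=8)
```

Comparing `machine_x["page_table_bytes"]` with `machine_x["physical_bytes"]` is a useful starting point for question c).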
III. (1.5 + 1.5 = 3 points)

Consider that a server farm is being designed to have 100 TBytes of non-volatile memory, using solid-state hard drives (SHD) with 250 GBytes each.

a) State how many SHD are needed if redundancy is assured by: i) RAID 1, ii) RAID 3, and iii) RAID 5. Justify your answer.

b) Which RAID storage technology would you choose to achieve a lower disk access time: RAID 0 or RAID 2? Justify your answer.
IV. (2.5 points)

Consider a system with two multiprocessors with the following configurations:

- Machine A: a NUMA machine with two processors, each with a local memory of 512 MB, with a local memory access latency of 20 cycles per word and a remote memory access latency of 60 cycles per word.
- Machine B: a UMA machine with two processors, with a shared memory of 1 GB with an access latency of 40 cycles per word.

Suppose an application has two threads running on the two processors, each of which needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many more cycles should the UMA memory latency be worsened for a partitioning on the NUMA machine to enable a faster run than the UMA machine? Assume that the memory operations dominate the execution time.
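One way to explore this question is to model the memory time of each thread as a function of how the array is split between the two local memories. The Python sketch below does this under simplifying assumptions that go beyond the statement: both threads run concurrently, each reads every word of the array exactly once, there is no contention, and the run time is that of the slower thread. The function names are illustrative.

```python
def numa_cycles(words_in_node0, total_words=4096, local=20, remote=60):
    """Memory cycles of the slower thread when node 0 holds words_in_node0 words."""
    words_in_node1 = total_words - words_in_node0
    thread0 = words_in_node0 * local + words_in_node1 * remote
    thread1 = words_in_node1 * local + words_in_node0 * remote
    return max(thread0, thread1)


def uma_cycles(total_words=4096, latency=40):
    """Memory cycles of either thread on the UMA machine."""
    return total_words * latency


# Try every possible split and keep the one with the smallest NUMA time.
best_split = min(range(4096 + 1), key=numa_cycles)
numa_best = numa_cycles(best_split)
uma_time = uma_cycles()
```

Comparing `numa_best` with `uma_time`, and then re-evaluating `uma_cycles` with a larger `latency`, mirrors the two parts of the question.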