3A3 Computer Architecture
Engineering Science 3rd year, 3A3 Lectures
Prof David Murray
david.murray@eng.ox.ac.uk
www.robots.ox.ac.uk/~dwm/courses/3co
Michaelmas 2000
6. Stacks, Subroutines, and Memory Hierarchies
In this lecture...

We first continue looking at the support supplied to the high-level programmer by the macro-level. In particular we look at
  - the stack area of memory
  - passing parameters to subroutines.
We end by looking at ways that memory hierarchies can be built using smaller and faster memories along with larger and slower ones.
Stacks and Subroutines

Looking back at the Bog Standard Architecture, there is a register called SP, the stack pointer. This register holds the address of the next free location in an area of memory reserved by a program as a temporary storage area. (Some architectures instead hold the address of the last occupied location.)

[Figure: BSA datapath, showing PC with Inc(PC), SP, MAR, Memory, MBR, IR with IR(opcode) and IR(address), CU with status and control lines, ALU and AC; the SP feeds the MAR.]
The stack

The stack pointer uses memory as a last-in, first-out (LIFO) buffer. Usually the stack is placed at the top of memory, as far away from the program as possible.

In the figure, the stack currently contains 5 items and grows downwards into free memory. The stack pointer points to the next free location.

[Figure: memory map, with the program and fixed-size data at the bottom, data allocated during execution above, free memory, and the stack at the top (locations 507-511 occupied); the stack grows down, and SP=506 points to the next free location.]
Push

PUSH (PHA) and PULL (PLA) manipulate the stack.

PHA:  MAR <- SP;  MBR <- AC;  M<MAR> <- MBR;  SP <- SP - 1

The accumulator gets pushed onto the stack at the address pointed to by the stack pointer. The stack pointer is then decremented.

[Figure: before the push, SP=506, AC=327, and locations 507-511 hold 1234, 567, 456, 345, 234; after the push, SP=505, AC=327, and 327 now sits at location 506.]
Pull

PLA:  SP <- SP + 1;  MAR <- SP;  MBR <- M<MAR>;  AC <- MBR

The stack pointer is incremented and the content pointed to is transferred to the accumulator.

[Figure: before the pull, SP=506 and AC=327; after the pull, SP=507 and AC=1234, the value that was at location 507.]
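The two transfer sequences above can be sketched in Python. This is a sketch only: memory is modelled as a dictionary, and the register names are kept from the slides.

```python
# Minimal sketch of PHA and PLA on a downward-growing stack.
MEM = {}     # "memory", sparse for simplicity
SP = 506     # stack pointer: next free location
AC = 327     # accumulator

def pha():
    """PHA: MAR <- SP; MBR <- AC; M<MAR> <- MBR; SP <- SP - 1"""
    global SP
    MEM[SP] = AC
    SP -= 1

def pla():
    """PLA: SP <- SP + 1; MAR <- SP; MBR <- M<MAR>; AC <- MBR"""
    global SP, AC
    SP += 1
    AC = MEM[SP]

pha()                  # push 327 at address 506
print(SP, MEM[506])    # -> 505 327
pla()                  # pull it straight back
print(SP, AC)          # -> 506 327
```

Note that the pull does not erase location 506; the value simply becomes stale, exactly as in the later discussion of dropping parameters.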
Using the stack during a subroutine

One of the most useful constructs in a high-level language is the subroutine. This allows the programmer to modularize code into small chunks which do a specific task and which may be reused. When we come to compile into assembler, what happens to the subroutines?

  main() {
     ...
     xcomp = 4;
     ycomp = 2;
     mod = modulus(xcomp, ycomp);
     ...
  }

  modulus(a, b) {
     msq = a*a + b*b;
     m = sqrt(msq);
     return(m);
  }

  sqrt(x) { ... blah ... }
Macro support for subroutines...

You now know that instructions in the different routines will end up in different parts of the program. To call a subroutine, we obviously need to
  - jump to the subroutine's instructions
  - transfer the necessary data to the subroutine
  - arrange for the result to be transferred back to the calling routine.

Is this enough? When you consider transferring back to the calling routine, the answer has to be no! How do we know where to jump back to? What if the subroutine has messed up the registers? There is an obvious need to
  - store the status quo before jumping
  - restore it after.
Calling by value or by reference

There are two ways of using formal parameters to a subroutine.

1. By value. Passing by value means that a copy of the value of the parameter is transferred. This means that even if the subroutine changes that value, when the subroutine returns, the value in the calling routine is unchanged.

2. By reference. In this case the address of the parameter is passed, so that the subroutine works with the original data. If it changes the value, that change is seen by the calling routine on return from the subroutine.

This discussion continues assuming calling by value.
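The distinction can be sketched against an explicit "memory" array; the addresses and values here are invented for illustration.

```python
# Call-by-value vs call-by-reference, with variables living in a model memory.
MEM = [0] * 16
MEM[3] = 42            # the caller's variable lives at address 3

def sub_by_value(v):
    # receives a copy of the value; the change is invisible to the caller
    v = v + 1
    return v

def sub_by_reference(addr):
    # receives the address, so it works on the original data
    MEM[addr] = MEM[addr] + 1

sub_by_value(MEM[3])
print(MEM[3])          # -> 42: unchanged in the calling routine
sub_by_reference(3)
print(MEM[3])          # -> 43: the caller sees the change
```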
Example with no parameters...

  \\ Calling Routine
        JSR MYSUB    \\ JUMP TO SUBROUTINE
        ADD 103
        ...          \\ Rest of Calling Routine

  MYSUB LDA 592      \\ SUBROUTINE STARTS
        ...
        RTS          \\ RETURN FROM SUBROUTINE

The subroutine starts at the labelled address, so JSR is very like JMP. The difference: in its execute phase, JSR pushes the current value of the PC onto the stack, and then loads the operand into the PC. The RTS command ends the subroutine by pulling the stored program counter from the stack. Because the stack is a LIFO buffer, subroutines can be nested to any level until memory runs out.
JSR and RTS

The RTL for the JSR and RTS opcodes is:

JSR:  M<SP> <- PC;  PC <- IR(address);  SP <- SP - 1
RTS:  SP <- SP + 1;  PC <- M<SP>

Check Notes!!
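The RTL above, combined with the LIFO stack, is what makes nesting work. A sketch follows, with invented instruction addresses, and simplified so that JSR pushes the current PC value as the slides state.

```python
# Sketch of JSR/RTS: return addresses on the stack allow nested calls.
MEM, SP, PC = {}, 511, 100

def jsr(target):
    global SP, PC
    MEM[SP] = PC       # M<SP> <- PC  (push the return address)
    SP -= 1            # SP   <- SP - 1
    PC = target        # PC   <- IR(address)

def rts():
    global SP, PC
    SP += 1            # SP <- SP + 1
    PC = MEM[SP]       # PC <- M<SP>  (pull the return address)

jsr(200)     # main calls MYSUB (at invented address 200)
jsr(300)     # MYSUB calls a nested routine at 300
print(PC)    # -> 300
rts()
print(PC)    # -> 200: back in MYSUB
rts()
print(PC)    # -> 100: back in main
```

Because the last return address pushed is the first pulled, each RTS lands back in the right routine.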
Example with parameters

Now we must worry about how to transfer the parameters to the subroutine, and the results back from the subroutine. We'll consider two ways:
  - to pass parameters in a reserved area of memory (your pleasure in 3A3E)
  - to pass parameters on the stack.
In practice the choice is made by your particular compiler.
Using the stack to pass parameters

During the calling routine...
  - The calling routine pushes the parameters onto the stack in order,
  - then uses JSR to push the return PC value onto the stack.

During the subroutine itself...
The figure shows a very correct, but slow, way of using the parameters from the stack.

[Figure: the stack holds Param3, Param2, Param1 and the return PC; the subroutine pulls the return PC into a temporary register, pulls each parameter out to working storage, then pushes the return PC back onto the stack.]
Much more efficient...

During the subroutine, access the parameters by indexed addressing relative to the stack pointer SP. That is, the subroutine would access parameter 1, for example, by writing LDA SP,2.

When the subroutine RTS's, it will pull the return PC off the stack, but the parameters will be left on. However, the calling routine knows how many parameters there were. Rather than pulling them off, it can simply increase the stack pointer by the number of parameters (ie, by three in our example). There is no need to erase or reset the memory contents, because subsequent pushes onto the stack will overwrite the stale contents.
Example of indexed addressing relative to SP

Set up by the calling routine: the stack holds Param3, Param2, Param1, then the return PC, with SP pointing at the next free location.

During the subroutine:
  - Use M<SP+1+1> as param 1
  - Use M<SP+1+2> as param 2, etc.
  - This can be achieved using indexed addressing: LDA SP+1,1 for param 1.

Just after return to the calling routine: the return PC has been pulled, leaving Param3, Param2, Param1 on the stack. Then SP <- SP+3, and the params have dropped out of the stack.
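The whole protocol can be sketched end to end; the addresses and parameter values are invented, and push follows the PHA RTL.

```python
# Parameters on the stack: push, access SP-relative, then drop with SP <- SP+3.
MEM, SP = {}, 511

def push(v):
    global SP
    MEM[SP] = v
    SP -= 1

# caller: push the parameters (Param1 pushed last), then JSR pushes return PC
push(30)      # Param3
push(20)      # Param2
push(10)      # Param1
push(4660)    # return PC (invented value)

# subroutine: indexed addressing relative to SP
print(MEM[SP + 1 + 1])   # M<SP+1+1> is param 1 -> 10
print(MEM[SP + 1 + 2])   # M<SP+1+2> is param 2 -> 20

# RTS pulls the return PC; the caller then simply drops the parameters
SP += 1      # RTS's pull of the return PC
SP += 3      # caller: SP <- SP + 3; params left behind as stale contents
print(SP)    # -> 511: the stack is back where it started
```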
For the rest of the lecture...

First, we look at arranging memory to give the illusion that one has a larger amount of fast SRAM than is actually the case. Second, we look at arranging disk and main memory to give the illusion that we have more main memory than is actually the case.
Cache memory

In Lecture 4 we noted that large main memories are often made using relatively inexpensive but relatively slow Dynamic RAM (DRAM). However, in addition, there is often a cache of fast Static RAM (SRAM). The cache is small, say < 1% of main memory size. For example, in a machine with a 64-128 MB main memory the cache might be only 512 KB. Nonetheless, it has a very significant effect on the performance of memory accesses.
Cache memory

The cache memory works by copying parts of the main memory to itself. The cache controller intercepts an address requested by the cpu, determines whether it is in the cache or not, and
  - declares a HIT, allowing the data to be recovered from cache, or
  - declares a MISS, requiring it to be fetched from main memory.

[Figure: the cpu talks to the cache controller; a cache HIT is serviced from the cache, a cache MISS from main memory.]
Cache memory

We need to look at:

1. The method used by the controller to determine if an address is in the cache.

Obviously a small cache memory will only have an appreciable effect on memory performance if the locations that are most frequently accessed are in the cache. So we also need to consider:

2. The method used to decide what should be in the cache.

We will look at a directly mapped cache.
The directly mapped cache

The cache is divided up into a number of memory blocks, each of which contains a number of words. The main memory is many times the size of the cache, and so we partition main memory into cache-sized chunks called sets.

A memory address is therefore divided up into three parts:
  - the LSBs, which indicate which word in a block;
  - the middle bits, which define which block in a set; and
  - the MSBs, which define which set.

[Figure: the address (A23..A0) split into Set | Block | Word fields; the cache holds blocks 0..n, and main memory is partitioned into sets 0, 1, 2, etc, each containing blocks 0..n.]
Directly mapped cache...

Note that the controller need not worry about the LSBs (ie the words): a block is the least significant unit of memory in the cache. Each block (0, 1, 2, ...) in the cache is required to come from a block with the same number in the main memory. Ie, cache block 0 must hold a main memory block 0, but it may come from one set or another.

We see that
1. the cache controller need only know which memory set the block came from: once it knows the set, it knows the set and the block;
2. the cache is not completely flexible.

[Figure: as before, the cache's blocks 0..n alongside main memory's sets 0, 1, 2, etc, each containing blocks 0..n.]
Hit or Miss? Need a Look-Up Table

To resolve which set a block in the cache corresponds to, the controller maintains a look-up table called a cache tag ram:
  - The address into the tag ram is just the block address.
  - The contents are the set number S*.

To determine hit or miss, the controller takes the set and block parts of the address from the cpu, S and B. It addresses the tag ram with B and recovers the contents S*. If S* = S we have a HIT, otherwise a MISS.

How do you build the comparator?

[Figure: the address bus split into Set | Block | Word; the Block field addresses the tag ram, whose output S* is compared with S to give Hit or Miss.]
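The lookup can be sketched as follows. The field widths (4 word bits, 3 block bits) are invented for the sketch, and the tag ram is updated on a miss, as described two slides on.

```python
# Sketch of a direct-mapped cache lookup with a tag ram.
WORD_BITS, BLOCK_BITS = 4, 3

def split(addr):
    """Split an address into (set, block, word) fields."""
    word  = addr & ((1 << WORD_BITS) - 1)
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)
    s     = addr >> (WORD_BITS + BLOCK_BITS)
    return s, block, word

tag_ram = [None] * (1 << BLOCK_BITS)   # block address -> cached set number S*

def access(addr):
    s, block, _ = split(addr)
    if tag_ram[block] == s:   # comparator: does S* equal S?
        return "HIT"
    tag_ram[block] = s        # miss: fetch block, update the tag ram
    return "MISS"

print(access(0x1234))   # -> MISS (cold cache)
print(access(0x1238))   # -> HIT  (same set and block, different word)
print(access(0x2234))   # -> MISS (same block number, but a different set)
```

The last two accesses show both the strength and the restriction of direct mapping: word changes within a block are free hits, but two blocks with the same block number cannot be resident together.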
The comparator

[Figure: the comparator compares S* with S bit by bit (S0 with S0*, S1 with S1*, etc); the bit-equality outputs are combined to give HIT, and its complement gives MISS.]
Hit or Miss?

If a hit:
  - the required word is recovered from cache.

If a miss:
  - the required block is recovered from main memory;
  - the required word is forwarded to the cpu;
  - the block is placed in the cache at block B;
  - the tag ram contents at address B are changed to S.
Updating the cache

At first sight both the addressing scheme and the updating scheme seem strange! Is not the fact that we cannot have block B from set S_a and block B from set S_b together in the cache at the same time a major restriction?

Answers:
  - YES! One could imagine accessing data in such a way that the cache failed all the time.
  - NO, because computation usually occurs on instructions/data clustered in memory.

Why?
  - Compilers cluster the code.
  - Compilers cluster allocated data (particularly arrays).
  - Run-time memory allocation clusters data (particularly arrays).
Hit-Rates and Speed-Ups

Suppose the access times to main and cache memories are t_m and t_c, and suppose the hit ratio (ie the probability of a hit) is H. Then the average access time is

  t_ave = H t_c + (1 - H) t_m = H (t_c - t_m) + t_m

so that, with k = t_c / t_m,

  t_ave / t_m = H (k - 1) + 1

The speed-up with the cache is then

  s = t_m / t_ave = 1 / (H (k - 1) + 1)
Hit-Rates and Speed-Ups

The speed-up is  s = 1 / (H(k - 1) + 1).

[Figure: dependence of speed-up on hit rate H. Top: for a cache with k = 0.1. Bottom: for a cache with (unreasonably) k = 0.01, ie the curve 1/(-0.99H + 1), where the speed-up climbs towards 100 as H approaches 1.]
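The formula is easy to tabulate as a quick check of the curves, here with k = 0.1:

```python
# Speed-up s = 1 / (H(k-1) + 1), with k = t_c / t_m.
def speed_up(H, k):
    return 1.0 / (H * (k - 1.0) + 1.0)

for H in (0.0, 0.5, 0.9, 0.99):
    print(H, round(speed_up(H, k=0.1), 2))
# H = 0.0  -> 1.0   (no hits: no benefit from the cache)
# H = 0.5  -> 1.82
# H = 0.9  -> 5.26
# H = 0.99 -> 9.17  (approaching the limit 1/k = 10)
```

Note how the speed-up only approaches its limit of 1/k as H gets very close to 1, which is why high hit rates matter so much.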
Extending the idea downwards

Nowadays, processors often have two levels of cache. The Pentium III has a 512 KB cache off the cpu and a 32 KB cache on the cpu.

Non-Blocking Level 1 Cache. The Pentium III processor includes two separate 16 KB level 1 (L1) caches, one for instructions and one for data. The L1 cache provides fast access to recently used data, increasing the overall performance of the system.

Non-Blocking Level 2 Cache. Certain versions of the Pentium III processor include a discrete, off-die level 2 (L2) cache. This L2 cache consists of a 512 KB unified, non-blocking cache that improves performance over cache-on-motherboard solutions by reducing the average memory access time and by providing fast access to recently used instructions and data. Performance is also enhanced over cache-on-motherboard implementations through a dedicated 64-bit cache bus.
Extending the idea upwards

But we can also think of extending the idea upwards. Main memory becomes the cache for a larger, slower memory. Where?
Virtual memory

Main memory holds the current working set of a much larger memory on hard disk. Just as the tag ram monkeyed around with addresses between main memory and cache, now we have to monkey with those between main memory and disk.

Distinguish two address spaces: a logical or virtual address space and a physical address space.
  - The cpu deals in logical addresses using the full width of the address bus. Eg, a 32-bit address bus allows logical addressing of 4 GB.
  - The physical address refers to an actual address in main memory, which may be only 64 MB in size.
But the user's program/data appears to have access to the logical address space.
Virtual memory...

  - Memory is divided up into pages (rather than blocks as in the cache).
  - Each page of physical memory can map onto any page of logical memory: a more flexible approach than the directly mapped cache.
  - A page table is maintained which describes the mapping between logical and physical addresses.

When the cpu requests a logical address, the relevant page is determined, and the page table is read to see whether that page is in physical memory.
Paging in Virtual memory...

HIT! The page table returns the physical address.

MISS! The page table indicates where the page is on the disk. The page is read from disk and installed in physical memory. This displaces a page of memory, which has to be written back to disk.

A paging algorithm determines which page is the best one to remove: the least used, or the least recently used, are obvious choices. The need to access the disk is called a page fault.

[Figure: a logical address from the cpu is looked up in the page table; on a HIT the physical address is used and the data returned to the cpu; on a MISS the table gives the location on disk, the pager decides which page to swap out, and the required page is swapped in.]
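The lookup can be sketched with a least-recently-used paging algorithm. The page size, number of frames and the free-frame bookkeeping are invented for the sketch; Python's OrderedDict stands in for the LRU ordering.

```python
# Sketch of a page-table lookup with least-recently-used (LRU) replacement.
from collections import OrderedDict

PAGE, NFRAMES = 4096, 4
page_table = OrderedDict()            # logical page -> frame, in LRU order
free_frames = list(range(NFRAMES))

def translate(logical_addr):
    page, offset = divmod(logical_addr, PAGE)
    if page in page_table:            # HIT: page is resident
        page_table.move_to_end(page)  # mark as most recently used
        return page_table[page] * PAGE + offset, "HIT"
    if not free_frames:               # page fault with memory full:
        _, frame = page_table.popitem(last=False)  # swap out the LRU page
        free_frames.append(frame)
    frame = free_frames.pop()
    page_table[page] = frame          # swap the page in from "disk"
    return frame * PAGE + offset, "MISS"

print(translate(0)[1])      # -> MISS (page fault: page 0 swapped in)
print(translate(100)[1])    # -> HIT  (same page, now resident)
```

A real pager would also write the displaced page back to disk if it had been modified; that bookkeeping is omitted here.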
Hit ratio

Just as a cache was a method of speeding up access to main memory, one might regard main memory as a method of speeding up access to disk. Again we need to worry about the hit rate. Whereas cache is, say, an order of magnitude faster than main memory, disk is some 6 orders of magnitude slower than main memory, so k = t_m / t_d can be all but ignored, giving

  s = 1 / (1 - H)

so the speed-up is determined by the hit rate alone. The size of a page and the number of pages in memory are important factors in maintaining a high hit rate. Again, success requires data to be localized in memory.
Disk Thrashing

If a program repeatedly generates page faults, it thrashes the disk. This is a problem with (large) data arrays, such as images. Data are stored in row order, so a pixel in column C, row R may be on a different page from the pixel in column C, row R+1.
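The row-order effect is easy to see with a sketch; the image width and page size here are invented.

```python
# Why scanning an image by column can thrash: with row-order storage, pixel
# (R, C) lives at offset R*WIDTH + C, so stepping down a column advances the
# address by WIDTH per access and soon crosses a page boundary.
WIDTH, PAGE = 1024, 4096

def page_of(row, col):
    return (row * WIDTH + col) // PAGE

# walking along a row stays on one page for a long time...
print(page_of(0, 0), page_of(0, 100))                # -> 0 0
# ...but walking down a column touches a new page every few rows
print(page_of(0, 0), page_of(4, 0), page_of(8, 0))   # -> 0 1 2
```

If the image is too large for all its pages to be resident at once, the column scan generates a page fault every few accesses, while the row scan generates one only every PAGE bytes.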
Multi-tasking

The use of virtual memory is important in multiuser or multitasking systems, allowing several users to feel they have access to large memories. The topic is discussed in depth in Tanenbaum (Chapter 6).

An important feature of the hierarchy built from cache, main memory and disk is its transparency to the user. The hierarchy can be extended further to slower online storage media such as CD-ROMs. The idea here would be that, when accessed, large quantities of material would be brought onto faster disks, but would decay away if unused over a period of hours.