A3 Computer Architecture


A3 Computer Architecture
Engineering Science 3rd year, A3 Lectures
Prof David Murray, david.murray@eng.ox.ac.uk
www.robots.ox.ac.uk/ dwm/courses/3co
Michaelmas 2000

6. Stacks, Subroutines, and Memory Hierarchies (3A3, Michaelmas 2000)

In this lecture... We first continue looking at the support supplied to the high-level programmer by the macro level. In particular we look at the stack area of memory, and at passing parameters to subroutines. We end by looking at ways that memory hierarchies can be built using smaller, faster memories along with larger, slower memories.

Stacks and Subroutines
Looking back at the Bog Standard Architecture, there is a register called SP, the stack pointer. This register holds the address of the next free location in an area of memory reserved by a program as a temporary storage area. (Some architectures instead hold the address of the last occupied location.)
[Figure: the Bog Standard Architecture datapath, showing PC, SP, MAR, MBR, IR with its opcode and address fields, the CU with status and control lines, the ALU, the AC, and memory.]

The stack
The stack pointer uses memory as a last-in, first-out (LIFO) buffer. Usually the stack is placed at the top of memory, as far away from the program as possible. In the figure, the stack currently contains 5 items and grows downwards into free memory. The stack pointer points to the next free location.
[Figure: memory layout with the stack occupying locations 507-511 and growing down; SP=506 points to the next free location; below lie free memory, data allocated during execution, fixed-size data, and the program.]

Push
PUSH (PHA) and PULL (PLA) manipulate the stack.
PHA:  MAR ← SP;  MBR ← AC;  M<MAR> ← MBR;  SP ← SP - 1
The accumulator gets pushed onto the stack at the address pointed to by the stack pointer. The stack pointer is then decremented.
[Figure: before, SP=506 and AC=327, with 1234, 567, 456, 345, 234 at locations 507-511; after the push, 327 is stored at 506, SP=505, and AC is still 327.]

Pull
PLA:  SP ← SP + 1;  MAR ← SP;  MBR ← M<MAR>;  AC ← MBR
The stack pointer is incremented and the content pointed to is transferred to the accumulator.
[Figure: before, SP=506 and AC=327, with 1234, 567, 456, 345, 234 at locations 507-511; after the pull, SP=507 and AC=1234, the stack contents unchanged.]
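The PHA and PLA register-transfer sequences can be sketched as a small Python model. The `Machine` class, the 512-word memory, and the starting values (which mirror the figures) are illustrative assumptions, not part of the original notes:

```python
# Minimal model of the PHA/PLA sequences. Memory is a list; SP holds the
# address of the next free location, and the stack grows downwards.

class Machine:
    def __init__(self, size=512):
        self.mem = [0] * size
        self.sp = size - 6          # SP = 506, five items already stacked
        self.ac = 0

    def pha(self):
        """PHA: M<SP> <- AC; SP <- SP - 1"""
        self.mem[self.sp] = self.ac
        self.sp -= 1

    def pla(self):
        """PLA: SP <- SP + 1; AC <- M<SP>"""
        self.sp += 1
        self.ac = self.mem[self.sp]

m = Machine()
m.ac = 327
m.pha()                  # 327 stored at address 506
print(m.sp, m.mem[506])  # -> 505 327
m.pla()                  # pull it straight back
print(m.sp, m.ac)        # -> 506 327
```

Note that PLA does not erase location 506; as with the real stack, the stale value is simply overwritten by the next push.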

Using the stack during a subroutine
One of the most useful constructs in a high-level language is the subroutine. This allows the programmer to modularize code into small chunks which do a specific task and which may be reused. When we come to compile into assembler, what happens to the subroutines?

main() {
    ...
    xcomp = 4;
    ycomp = 2;
    mod = modulus(xcomp, ycomp);
    ...
}

modulus(a, b) {
    msq = a*a + b*b;
    m = sqrt(msq);
    return(m);
}

sqrt(x) { ... blah ... }

Macro support for subroutines...
You now know that instructions in the different routines will end up in different parts of the program. To call a subroutine, we obviously need to:
- jump to the subroutine's instructions
- transfer the necessary data to the subroutine
- arrange for the result to be transferred back to the calling routine.
Is this enough? When you consider transferring back to the calling routine, the answer has to be no! How do we know where to jump back to? What if the subroutine has messed up the registers? There is an obvious need to store the status quo before jumping, and to restore it after.

Calling by value or by reference
There are two ways of using formal parameters to a subroutine.
1. By value: a copy of the value of the parameter is transferred. This means that even if the subroutine changes that value, when the subroutine returns, the value in the calling routine is unchanged.
2. By reference: the address of the parameter is passed, so that the subroutine works with the original data. If it changes the value, that change is seen by the calling routine on return from the subroutine.
This discussion continues assuming calling by value.
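The distinction can be illustrated in Python (an analogue only: Python passes object references, but an immutable argument behaves like call by value, while mutating a shared object behaves like call by reference; the function names are invented for illustration):

```python
def double_value(x):
    x = x * 2          # rebinds the local copy only
    return x

def double_in_place(v):
    v[0] = v[0] * 2    # mutates the caller's object via the reference

a = 10
double_value(a)
print(a)       # -> 10 : caller's value unchanged, as with call by value

b = [10]
double_in_place(b)
print(b[0])    # -> 20 : change visible to the caller, as with call by reference
```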

Example with no parameters...

      JSR MYSUB   \\ Jump to subroutine
      ADD 103     \\ Rest of calling routine
      ...
MYSUB LDA 592     \\ Subroutine starts
      ...
      RTS         \\ Return from subroutine

The subroutine starts at a labelled address, so JSR is very like JMP. The difference is that in its execute phase JSR pushes the current value of the PC onto the stack, and then loads the operand into the PC. The RTS command ends the subroutine by pulling the stored program counter from the stack. Because the stack is a LIFO buffer, subroutines can be nested to any level until memory runs out.

JSR and RTS
The RTL for the JSR and RTS opcodes is:
JSR:  M<SP> ← PC;  PC ← IR(address);  SP ← SP - 1
RTS:  SP ← SP + 1;  PC ← M<SP>
Check notes!!

Example with parameters
Now we must worry about how to transfer the parameters to the subroutine, and the results back from the subroutine. We'll consider two ways:
- to pass parameters in a reserved area of memory (your pleasure in 3A3E)
- to pass parameters on the stack.
In practice the choice is made by your particular compiler.

Using the stack to pass parameters
During the calling routine... the calling routine pushes the parameters onto the stack in order, then uses JSR to push the return PC value onto the stack.
During the subroutine itself... the figure shows a very correct, but slow, way of using the parameters from the stack.
[Figure: the stack holds Param3, Param2, Param1 and then the return PC. The return PC is pulled into a temporary register, the three parameters are pulled in turn to working storage, and the return PC is then pushed back onto the stack.]

Much more efficient...
During the subroutine, access the parameters by indexed addressing relative to the stack pointer SP. That is, the subroutine would access parameter 1, for example, by writing LDA SP,2.
When the subroutine RTSs, it will pull the return PC off the stack, but the parameters will be left on. However, the calling routine knows how many parameters there were. Rather than pulling them off, it can simply increase the stack pointer by the number of parameters (i.e., by three in our example). There is no need to erase or reset the memory contents, because subsequent pushes onto the stack will overwrite the stale contents.

Example of indexed addressing relative to SP
Set up by the calling routine: the stack holds Param3, Param2, Param1 and then the return PC, with SP pointing to the next free location below.
During the subroutine: use M<SP+1+1> as param 1, M<SP+1+2> as param 2, etc. This can be achieved using indexed addressing: LDA SP+1,1 for param 1.
Just after return to the calling routine: the return PC has been pulled, leaving Param3, Param2, Param1 on the stack. Then SP ← SP+3, and the params have now dropped out of the stack.
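The whole calling convention can be traced in a Python sketch. The memory size, parameter values and return PC are invented for illustration; the push/pull sequences and the final SP ← SP+3 follow the slides:

```python
# Caller pushes three parameters, JSR pushes the return PC, the subroutine
# reads the parameters by indexing relative to SP, and after RTS the caller
# drops the parameters by adjusting SP.

MEM_SIZE = 512
mem = [0] * MEM_SIZE
sp = MEM_SIZE - 1             # SP = 511: next free location; stack grows down

def push(val):
    global sp
    mem[sp] = val
    sp -= 1

def pull():
    global sp
    sp += 1
    return mem[sp]

# Calling routine: push Param3, Param2, Param1; then JSR pushes the return PC.
push(9); push(8); push(7)     # Param3, Param2, Param1
push(1234)                    # return PC (pushed by JSR)

# Subroutine: indexed access relative to SP -- M<SP+1+1> is param 1, etc.
param1 = mem[sp + 2]          # 7
param2 = mem[sp + 3]          # 8
param3 = mem[sp + 4]          # 9

# RTS pulls the return PC; the caller then simply drops the parameters.
return_pc = pull()            # 1234
sp += 3                       # SP back to 511: params dropped, not erased
```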

For the rest of the lecture...
We look first at arranging memory to give the illusion that one has a larger amount of fast SRAM than is actually the case. We look second at arranging disk and memory to give the illusion that we have more main memory than is actually the case.

Cache memory
In Lecture 4 we noted that large main memories are often made using relatively inexpensive but relatively slow dynamic RAM (DRAM). However, in addition there is often a cache of fast static RAM (SRAM). The cache is small, say <1% of main memory size. For example, in a machine with a 64-128 MB main memory the cache might be only 512 KB. Nonetheless, it has a very significant effect on the performance of memory accesses.

Cache memory
The cache memory works by copying parts of the main memory to itself. The cache controller intercepts an address requested by the CPU and determines whether it is in the cache or not. It either declares a HIT, allowing the data to be recovered from cache, or declares a MISS, requiring the data to be fetched from main memory.
[Figure: the CPU talks to the cache controller; a cache HIT is served from the cache, a cache MISS from main memory.]

Cache memory
We need to look at:
1. The method used by the controller to determine if an address is in the cache.
Obviously a small cache memory will only have an appreciable effect on memory performance if the locations that are most frequently accessed are in the cache. So we also need to consider:
2. The method used to decide what should be in the cache.
We will look at a directly mapped cache.

The directly mapped cache
The cache is divided up into a number of memory blocks, each of which contains a number of words. The main memory is many times the size of the cache, and so we partition main memory into cache-sized chunks called sets. A memory address is therefore divided up into three parts: the lsbs, which indicate which word in a block; the middle bits, which define which block in a set; and the msbs, which define which set.
[Figure: the cache holding blocks 0..n; main memory partitioned into sets 0, 1, 2, ..., each holding blocks 0..n; the address A23..A0 split into set, block and word fields.]
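Splitting an address into its three fields is just masking and shifting. A sketch for a 24-bit address bus, with field widths chosen purely for illustration (4-bit word, 10-bit block, 10-bit set; the notes do not fix these):

```python
# Split a 24-bit address (A23..A0) into set / block / word fields.
WORD_BITS, BLOCK_BITS, SET_BITS = 4, 10, 10   # assumed widths

def split_address(addr):
    word  = addr & ((1 << WORD_BITS) - 1)                  # lsbs: word in block
    block = (addr >> WORD_BITS) & ((1 << BLOCK_BITS) - 1)  # middle: block in set
    s     = addr >> (WORD_BITS + BLOCK_BITS)               # msbs: set
    return s, block, word

s, b, w = split_address(0x3A5F2C)
print(s, b, w)    # -> 233 498 12
```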

Directly mapped cache...
Note that the controller need not worry about the lsbs (i.e. the words): a block is the least significant unit of memory in the cache. Each block (0, 1, 2, ...) in the cache is required to come from a block with the same number in the main memory. I.e., cache block 0 holds block 0 of main memory, but from one set or another.
We see that:
1. the cache controller need only know which memory set the block came from: once it knows the set, it knows both the set and the block;
2. the cache is not completely flexible.
[Figure: cache blocks 0..n, each drawn from the same-numbered block of one of main memory's sets 0, 1, 2.]

Hit or Miss? Need a Look-Up Table
To resolve to which set a block in the cache corresponds, the controller maintains a look-up table called a cache tag RAM. The address into the tag RAM is just the block address; the contents are the set number S*.
To determine hit or miss... the controller takes the set and block parts of the address from the CPU, S and B. It addresses the tag RAM with B and recovers the contents S*. If S = S* we have a HIT, otherwise a MISS. How do you build the comparator?
[Figure: the set, block and word fields on the address bus; the block field addresses the tag RAM, whose output S* is compared with S to give HIT or MISS.]
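The tag-RAM lookup can be modelled in a few lines of Python (the class and its method names are invented for illustration; the logic, HIT iff S = S*, is the slide's):

```python
# Direct-mapped cache lookup via a tag RAM: indexed by block number B,
# it stores the set number S* of the block currently held in that slot.

class DirectMappedCache:
    def __init__(self, num_blocks):
        self.tag_ram = [None] * num_blocks   # None: slot not yet filled

    def lookup(self, s, b):
        return self.tag_ram[b] == s          # HIT iff S == S*

    def fill(self, s, b):
        self.tag_ram[b] = s                  # on a MISS, install block b of set s

cache = DirectMappedCache(num_blocks=1024)
assert not cache.lookup(5, 42)   # cold miss
cache.fill(5, 42)
assert cache.lookup(5, 42)       # hit
assert not cache.lookup(7, 42)   # same block number, different set: miss
```

The third lookup shows the inflexibility noted above: block 42 of set 7 and block 42 of set 5 cannot be cached at the same time.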

The comparator
[Figure: a bitwise equality comparator. Each pair of set-address bits (S0 with S0*, S1 with S1*, etc.) is compared; HIT is asserted only when every pair matches, otherwise MISS.]

Hit or Miss?
If a hit: the required word is recovered from cache.
If a miss: the required block is recovered from main memory, the required word is forwarded to the CPU, the block is placed in the cache at block B, and the tag RAM contents at address B are changed to S.

Updating the cache
At first sight both the addressing scheme and the updating scheme seem strange! Is not the fact that we cannot have block B from set S_a and block B from set S_b together in the cache at the same time a major restriction?
Answers: YES! One could imagine accessing data in such a way that the cache failed all the time. NO, because computation usually occurs on instructions/data clustered in memory. Why?
- Compilers cluster the code.
- Compilers cluster allocated data (particularly arrays).
- Run-time memory allocation clusters data (particularly arrays).

Hit-Rates and Speed-Ups
Suppose access times to main and cache memories are t_m and t_c, and suppose the hit ratio (i.e. the probability of a hit) is H. Then the average access time is
  t_ave = H t_c + (1 - H) t_m = H(t_c - t_m) + t_m
so
  t_ave / t_m = H(k - 1) + 1,  where k = t_c / t_m.
The speed-up with the cache is then
  s = t_m / t_ave = 1 / (H(k - 1) + 1).
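Plugging numbers into the speed-up formula shows how strongly it depends on the hit rate (the sample H values are illustrative):

```python
def speedup(H, k):
    """s = t_m / t_ave = 1 / (H*(k - 1) + 1), where k = t_c / t_m."""
    return 1.0 / (H * (k - 1.0) + 1.0)

# Cache ten times faster than main memory (k = 0.1):
for H in (0.0, 0.5, 0.9, 1.0):
    print(H, round(speedup(H, 0.1), 2))
# H = 0 gives no gain (s = 1); H = 1 gives the full factor of 10;
# even H = 0.9 already yields s of about 5.26.
```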

Hit-Rates and Speed-Ups
Speed-up is s = 1 / (H(k - 1) + 1).
[Plot: speed-up against hit rate H from 0 to 1. Top: dependence of speed-up on hit rate for a cache with k = 0.1. Bottom: dependence of speed-up on hit rate for a cache with (unreasonably) k = 0.01, i.e. s = 1/(1 - 0.99H), rising towards 100 as H approaches 1.]

Extending the idea downwards
Nowadays, processors often have two levels of cache. The Pentium III has a 512 KB cache off the CPU and a 32 KB cache on the CPU.
Non-blocking level 1 cache: the Pentium III processor includes two separate 16 KB level 1 (L1) caches, one for instructions and one for data. The L1 cache provides fast access to recently used data, increasing the overall performance of the system.
Non-blocking level 2 cache: certain versions of the Pentium III processor include a discrete, off-die level 2 (L2) cache. This L2 cache consists of a 512 KB unified, non-blocking cache that improves performance over cache-on-motherboard solutions by reducing the average memory access time and by providing fast access to recently used instructions and data. Performance is also enhanced over cache-on-motherboard implementations through a dedicated 64-bit cache bus.

Extending the idea upwards
But we can also think of extending the idea upwards: main memory becomes the cache for a larger, slower memory. Where?

Virtual memory
Main memory holds the current working set of a much larger memory on hard disk. Just as the tag RAM monkeyed around with addresses between main memory and cache, now we have to monkey with those between main memory and disk.
Distinguish two address spaces: a logical or virtual address space, and a physical address space. The CPU deals in logical addresses using the full width of the address bus; e.g., a 32-bit address bus allows logical addressing of 4 GB. The physical address refers to an actual address in main memory, which may be only 64 MB in size. But the user's program/data appears to have access to the logical address space.

Virtual memory...
Memory is divided up into pages (rather than the blocks of the cache). Each page of physical memory can map onto any page of logical memory: a more flexible approach than a directly mapped cache (equivalent to a fully associative cache). A page table is maintained which describes the mapping between logical and physical addresses. When the CPU requests a logical address, the relevant page is determined, and the page table is read to see whether that page is in physical memory.

Paging in Virtual memory...
HIT! The page table returns the physical address. MISS! The page table indicates where the page is on the disk. The page is read from disk and installed in physical memory. This displaces a page of memory, which has to be written back to disk.
[Figure: the logical address from the CPU is looked up in the page table; on a HIT, the physical address is returned and the data goes to the CPU; on a MISS, the page's location on disk is found, the pager decides which page to swap out, and the required page is swapped in from disk.]
A paging algorithm determines which page is the best one to remove: the least used, or the least recently used, are obvious choices. The need to access disk is called a page fault.
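The lookup-then-fault sequence can be sketched as follows. The page size, page-table contents and victim-frame choice are invented for illustration, and write-back of the displaced page is omitted:

```python
# Page-table lookup with a page fault on a MISS.
PAGE_SIZE = 4096

page_table = {0: {"present": True,  "frame": 3},
              1: {"present": False, "disk": 777}}   # page 1 is out on disk
faults = 0

def translate(vaddr):
    global faults
    vpn, offset = divmod(vaddr, PAGE_SIZE)   # virtual page number + offset
    entry = page_table[vpn]
    if not entry["present"]:                 # MISS: page fault
        faults += 1
        entry["frame"] = 5                   # pager installs it in a chosen frame
        entry["present"] = True              # (write-back of the victim omitted)
    return entry["frame"] * PAGE_SIZE + offset

assert translate(100) == 3 * PAGE_SIZE + 100          # HIT
assert translate(PAGE_SIZE + 8) == 5 * PAGE_SIZE + 8  # MISS, then mapped
assert faults == 1
```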

Hit ratio
Just as a cache was a method of speeding up access to main memory, one might regard main memory as a method of speeding up access to disk. Again we need to worry about the hit rate. Whereas main memory is, say, an order of magnitude slower than cache, disk is some 6 orders of magnitude slower than main memory. So k = t_m / t_d can be all but ignored, giving
  s = 1 / (1 - H)
so the speed-up is determined by the hit rate alone. The size of pages and the number of pages in memory are important factors in maintaining a high hit rate. Again success requires data to be localized in memory.

Disk Thrashing
If the program repeatedly generates page faults, it thrashes the disk. This is a problem with (large) data arrays, such as images: data is stored in row order, so a pixel in column C, row R may be on a different page from the pixel in column C, row R+1.

Multi-tasking
The use of virtual memory is important in multiuser or multitasking systems, allowing several users to feel they have access to large memories. The topic is discussed in depth in Tanenbaum (Chapter 6). An important feature of the hierarchy built from cache, main memory and disk is its transparency to the user. The hierarchy can be extended further to slower online storage media such as CD-ROMs. The idea here would be that when accessed, large quantities of material would be brought onto faster disks, but would decay away if unused over a period of hours.