18-742 Advanced Computer Architecture. Test I, February 24, 1998. SOLUTIONS


1. Cache Policies. Consider a cache memory with the following characteristics:
- Unified, direct mapped
- Capacity of 256 bytes (0x100 bytes, where the prefix 0x denotes a hexadecimal number)
- Two blocks per sector; each sector is 2 * 2 * 4 bytes = 0x10 bytes in size
- Two words per block
- 4-byte (32-bit) words, addressed in bytes but accessed only on 4-byte word boundaries
- The bus transfers one 32-bit word at a time, and transfers only those words actually required to be read or written given the design of the cache (i.e., it generates as little bus traffic as it can)
- No prefetching (i.e., demand fetches only; one block allocated on a miss)
- Write-back policy; write-allocate policy

1a) (5 points) Assume the cache starts out completely invalidated. Circle one of four choices (hit, compulsory miss, capacity miss, conflict miss) next to every access, and put a couple of words under each printed line giving a reason for your choice.

read 0x0010: compulsory miss. First access to this block.
read 0x0014: hit. Same block as 0x0010.
read 0x001C: compulsory miss. The sector is already allocated, but this block in the sector is still invalid.
read 0x0110: compulsory miss. First-time access; evicts 0x0010. (Credit was also given for conflict miss, but compulsory miss is the better answer if this is the very first time 0x0110 has been accessed.)
read 0x001C: conflict miss. Was just evicted by 0x0110.

1b) (8 points) Assume the cache starts out completely invalidated. Circle hit or miss next to each access, and indicate the number of words of memory traffic generated by the access.
write 0x0024: miss, bus traffic = 1 word. Compulsory miss; one word is read in to fill the block.
write 0x0020: hit, bus traffic = 0 words. Access to the same block as before; no traffic.
write 0x0124: miss, bus traffic = 3 words. Compulsory miss; both words of the evicted dirty block are written out, and one word is read in to fill up the block.
read 0x0024: miss, bus traffic = 4 words. Conflict miss; both words are written out and read back in (there is only one dirty bit for the block).

2. Multi-level cache organization and performance. Consider a system with a two-level cache having the following characteristics. The system does not use shared bits in its caches.

L1 cache:
- Physically addressed, byte addressed
- Split I/D
- 8 KB combined, evenly split (4 KB each)
- Direct mapped
- 2 blocks per sector
- 1 word per block (8 bytes/word)
- Write-through; no write allocation
- L1 hit time is 1 clock
- L1 average local miss rate is 0.15

L2 cache:
- Virtually addressed, byte addressed
- Unified
- 160 KB
- 5-way set associative; LRU replacement
- 2 blocks per sector
- 1 word per block (8 bytes/word)
- Write-back; write allocate
- L2 hit time is 5 clocks (after an L1 miss)
- L2 average local miss rate is 0.05

System:
- 40-bit physical address space and 52-bit virtual address space
- An L2 miss (transport time) takes 50 clock cycles
- The system uses a sequential forward access model (the usual one from class)

2a) (9 points) Compute the following elements that would be necessary to determine how to configure the cache memory array:

Total number of bits within each L1 cache block: 65 bits. 8 bytes of data per block + 1 valid bit = 65 bits.

Total number of bits within each L1 cache sector: 158 bits. The tag is 40 bits - 12 bits (4 KB) = 28 bits, and is the only sector overhead since the cache is direct mapped: 28 + 65 + 65 = 158 bits.

Total number of bits within each L2 set: 800 bits. Each block is 8 bytes of data + 1 valid bit + 1 dirty bit = 66 bits. The cache is 32 KB * 5 in size, so the tag is 40 bits - 15 bits (32 KB) = 25 bits per sector; with a 3-bit LRU counter per sector this gives 28 overhead bits per sector. (One sector could also be only 37 bits per the trick discussed in class, but that is optional.) Set size = ((66 * 2) + 28) * 5 = 800 bits per set.

2b) (5 points) Compute the average effective memory access time (t_ea, in clocks) for the given 2-level cache under the stated conditions.

t_ea = T_hit1 + P_miss1 * T_hit2 + P_miss1 * P_miss2 * T_transport
t_ea = 1 + (0.15 * 5) + (0.15 * 0.05 * 50) = 2.125 clocks

2c) (10 points) Consider a case in which a task is interrupted, causing both caches described above to be completely flushed by other tasks, and then the original task begins execution again (i.e., a completely cold cache situation) and runs to completion, completely refilling the cache as it runs. What is the approximate time penalty, in clocks, associated with refilling the caches when the original program resumes execution? A restatement of this same question is: assuming that the program runs to completion after it is restarted, how much longer (in clocks) will it take to run than if it had not been interrupted? For this sub-problem:
- The program makes only single-word memory accesses.
- The reason we say "approximate" is that you should compute the L1 and L2 penalties independently and then add them, rather than trying to figure out the coupling between them.
- State any assumptions you feel are necessary to solve this problem.

For the L1 cache, 1K memory references (8 KB / 8 bytes) will suffer an L1 miss before the L1 cache is warmed up. But 15% of these would have been misses anyway. Extra L1 misses = 1K * 0.85 = approximately 870 misses. The L2 cache will suffer (160 KB / 8 bytes) * 0.95 = 19456 extra misses. The penalty due to these misses is approximately 870 * 5 + 19456 * 50 = 977,150 clocks of extra execution time due to the loss of cache context.

A simpler but less accurate estimate, which received 8 of 10 points, was to ignore the miss rates and just compute the raw fill times of the caches, giving 1,029,120 clocks (this assumes a 100% miss rate until the caches are full).

2d) (10 points) For the given 2-level cache design, sketch the data cache behavior of the following program in the graphing area provided (ignore all instruction accesses -- consider data accesses only). Label each point of inflection (i.e., each point at which the graph substantially changes behavior) and the end points of the curve with a letter, and put values for that point in the data table provided, along with a brief statement (3-10 words) of what is causing the change in behavior and how you knew at what array size it would change. The local miss rates given earlier are for a typical program, and obviously do not apply to the program under consideration in this sub-question.

    int touch_array(long *array, int size)
    {
        int i;
        int sum = 0;
        long *p, *limit;

        limit = array + size / sizeof(long);
        for (i = 0; i < 1000000; i++) {
            /* sequentially touch all elements in array */
            for (p = array; p != limit; p++) {
                sum += *p;
            }
        }
        return sum;
    }

Point  Array size  Data t_ea  Reason
A      1 KB        1          All L1 hits
B      4 KB        1          All L1 hits (L1 D-cache is 4 KB)
C      8 KB        6          L1 misses, L2 hits (twice the L1 size)
D      160 KB      6          L1 misses, L2 hits (the L2 size)
E      192 KB      56         L2 misses begin; 160 KB * (1 + 1/5)
F      1000 KB     56         All L2 misses

3. Virtual Memory. Consider a virtual memory system with the following characteristics:
- Page size of 16 KB = 2^14 bytes
- Page table and page directory entries of 8 bytes per entry
- Inverted page table entries of 16 bytes per entry
- 31-bit physical address, byte addressed
- Two-level page table structure (containing a page directory and one layer of page tables)

3a) (9 points) What is the size (in number of address bits) of the virtual address space supported by the above virtual memory configuration?

The page size of 16 KB covers 2^14 bytes = 14 bits of address. Each page table holds 16 KB / 8 bytes = 2^14 / 2^3 = 2^11 entries = 11 bits of address. The page directory is the same size as a page table, and so contributes another 11 bits of address. 14 + 11 + 11 = 36-bit virtual address space.

3b) (10 points) What is the maximum required size (in KB, MB, or GB) of an inverted page table for the above virtual memory configuration?

There must be one inverted page table entry for each page of physical memory. 2^31 / 2^14 = 2^17 pages of physical memory, so the inverted page table must be 2^17 * 16 bytes = 2^17 * 2^4 bytes = 2^21 bytes = 2 MB in size.

4. Cache simulation. The following program was instrumented with Atom and simulated with a cache simulator. The cache simulator was fed only the data accesses (not the instruction accesses) of the following subroutine:

    long test_routine(long *a, int size, int stride)
    {
        long result = 0;
        long *b, *limit;
        int i;

        limit = a + size;
        for (i = 0; i < 100; i++) {
            b = a;
            while (b < limit) {
                result += *b;
                b += stride;
            }
        }
        return result;
    }

The results of sending the Atom outputs through the cache simulator for various values of size and stride (each for a single execution of test_routine) are depicted in the graph below. The cache word size was 8 bytes (matching the size of a long on an Alpha), and a direct-mapped cache was simulated. The simulated cache was invalidated just before entering test_routine. There was no explicit prefetching used. The code was compiled with an optimization switch that got rid of essentially all reads and writes for loop-handling overhead. Use only Cragon terminology for this problem (not dinero terminology).

[Graph: miss ratio (0.0 to 1.0) versus the value of "stride" (0 to 32), with one curve per value of "size": 256, 512, 1024, 2048, 4096, and 8192.]

4a) (9 points) For the given program and the results graph, what was the size of the cache in KB (support your answer with a brief reason)? Remember that in the C programming language pointer arithmetic is automatically scaled according to the size of the data value being pointed to (so, in this code, adding 1 to a pointer actually adds 8 to the address).

Cache size = 2048 * 8 / 2 = 8 KB. Direct-mapped caches reach a 100% miss rate when the array size is >= 2 * the cache size.

4b) (10 points) For this example, what is the cache block size in bytes (support your answer with a brief reason)?

The cache block size is 2 words = 16 bytes, because with a stride of 1 the large arrays have a 50% miss rate, indicating that odd-word-addressed words are fetched as a pair with even-word-addressed cache misses, yielding spatial-locality cache hits when both the even and odd words in the same block are touched.

4c) (15 points) What parameter is responsible for the size = 2K (the data plotted with circles) point of inflection at a stride of 16? What is the value of that parameter, and why?

The cache sector size is 16 * 8 = 128 bytes, because above this stride, sectors are skipped on the first pass through the cache but filled on the second pass, given that the array size is twice the cache size. Alternately, this could be described as having 8 blocks per sector. For a stride of 17, every 16th access skips an entire sector, leaving it still resident in cache for the next iteration through the loop for an 8K array access. (When the strided accesses wrap around, the second iteration through the cache will touch those locations, and they remain in cache.) The effect is less pronounced for larger arrays, but it is there nonetheless. The 100% miss rate at a stride of 32 (sector size * 2) confirms that multiples of 16 result in maximal conflict+capacity misses.
The graph dips further as stride increases because more sectors are skipped in the first half of the array, leaving more holes for data as the second half of the array maps into the cache. An alternate answer is that the parameter is the number of sets (= sectors) in the cache.