COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING




2013/2014, 1st Semester
Sample Exam, January 2014
Duration: 2h00

- No extra material allowed. This includes notes, scratch paper, calculators, etc.
- Give your answers in the available space after each question. You can use either Portuguese or English.
- Be sure to write your name and number on all pages; non-identified pages will not be graded!
- Justify all your answers.
- Don't hurry; you should have plenty of time to finish this test. Skip the questions you are less comfortable with and come back to them later.

I. (1.5 + 1 + 0.5 + 1 + 1.5 + 0.5 + 1.5 = 7.5 val.)

1. Consider two different implementations of the same instruction set architecture. There are four classes of instructions: A, B, C, and D. The clock rate and CPI of each implementation are given in the following table.

Implementation  Clock Rate  CPI Class A  CPI Class B  CPI Class C  CPI Class D
I1              2.5 GHz     2            1.5          2            1
I2              3 GHz       1            2            1            1

a) Consider a program executing 10^6 instructions divided into classes as follows: 10% class A, 20% class B, 50% class C, and 20% class D. Determine which implementation is faster.

IST ID: Name: 1/9
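The weighted-CPI arithmetic behind this kind of question can be sketched as follows. The instruction mix and per-class CPIs come from the table above; the variable and function names are illustrative, not part of the exam.

```python
# Sketch of the global-CPI and execution-time calculation for question I.1a.
# Mix and per-class CPI values are taken from the table in the question.
mix = {"A": 0.10, "B": 0.20, "C": 0.50, "D": 0.20}
cpi = {
    "I1": {"A": 2.0, "B": 1.5, "C": 2.0, "D": 1.0},
    "I2": {"A": 1.0, "B": 2.0, "C": 1.0, "D": 1.0},
}
clock_hz = {"I1": 2.5e9, "I2": 3.0e9}
n_instr = 1e6

def exec_time(impl):
    # Global CPI is the mix-weighted average of the per-class CPIs;
    # execution time = instruction count * CPI / clock rate.
    global_cpi = sum(mix[c] * cpi[impl][c] for c in mix)
    return n_instr * global_cpi / clock_hz[impl]
```

Evaluating `exec_time` for both implementations identifies the faster one directly.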

b) What is the global CPI for each implementation?

c) How much time does each implementation require to execute the program?

d) If, for implementation I1, the number of Class A instructions can be reduced by half at the expense of 10% more Class B instructions, what is the resulting speedup?
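The speedup in part d) follows from comparing cycle counts before and after the change, since the clock rate of I1 is unchanged. A sketch, with the per-class instruction counts derived from the 10/20/50/20 mix over 10^6 instructions:

```python
# Sketch for question I.1d on implementation I1: halve the class-A count
# and add 10% more class-B instructions, then compare total cycles.
counts = {"A": 1e5, "B": 2e5, "C": 5e5, "D": 2e5}
cpi_i1 = {"A": 2.0, "B": 1.5, "C": 2.0, "D": 1.0}

cycles_before = sum(counts[c] * cpi_i1[c] for c in counts)
new_counts = dict(counts, A=counts["A"] / 2, B=counts["B"] * 1.1)
cycles_after = sum(new_counts[c] * cpi_i1[c] for c in new_counts)
# Same clock rate, so speedup is simply the ratio of cycle counts.
speedup = cycles_before / cycles_after
```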

2. Consider the MIPS processor pipeline that was presented in this course, with the five pipeline stages F, D, X, M, and W. Consider also that:
- forwarding mechanisms are implemented to automatically resolve data hazards without stalls, whenever possible;
- no branch prediction mechanism is implemented;
- the branch address is computed in the D stage;
- independent data and program memories exist.

The following code segment was executed on this processor:

           addi $t0, $zero, 0
           lw   $t3, 0($s1)
for_loop:  addi $t1, $t0, -16
           beq  $t1, $0, loop_done
           lw   $t2, 8($s1)
           add  $t3, $t3, $t2
           sw   $t3, 100($s1)
           addi $t0, $t0, 4
           j    for_loop
loop_done:

a) Represent the execution of the first two iterations of the program loop by indicating, for each instruction, the pipeline stages it goes through: F, D, X, M, and W. Do not forget to represent every stall that may occur.

Cycle: 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30

b) What is the global CPI for this program?

c) Perform a full loop unrolling of the program. Estimate the speedup achieved by this operation.

II. (1.5 + 1.5 + 1 + 1 + 1 + 1 = 7 val.)

1. Consider a memory system for a 32-bit processor with separate caches for code and data. Assume that the processor always accesses 32-bit words and that the address space is 2^32 words. The data cache has the following characteristics:
- 64 KB capacity;
- 2-way set associative;
- 2-word blocks;
- write-back, write-allocate;
- LRU replacement policy.

The data bus between the caches and memory is 64 bits wide, thus allowing a cache block to be filled in a single memory access. The following program, which counts the asymmetric positions of a matrix (those where a[i][j] != a[j][i]), is executed on this system.

register int i, j, sym;  /* 32-bit integers in registers */
int a[1024][1024];
...
sym = 0;
for (i = 0; i < 1024; i = i + 1)
    for (j = i; j < 1024; j = j + 1)
        if (a[i][j] != a[j][i])
            sym = sym + 1;

Assume that the variables are allocated sequentially in memory starting at address 0, and that the matrix elements are stored in row-major order (a[0][0], a[0][1], ..., a[1][0], ...).

a) Determine the hit rate in the data cache for this program (ignore the startup misses).
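The cache geometry implied by these parameters can be worked out mechanically; the following sketch derives the set count and bit fields, and maps a matrix element to its set (helper names are illustrative).

```python
# Sketch of the data-cache geometry for question II.1:
# 64 KB capacity, 2-way set associative, 2-word blocks, 4-byte words.
capacity = 64 * 1024                          # bytes
ways = 2
block_bytes = 2 * 4                           # 2 words of 4 bytes each
num_sets = capacity // (block_bytes * ways)   # 4096 sets
offset_bits = (block_bytes - 1).bit_length()  # 3 byte-offset bits
index_bits = (num_sets - 1).bit_length()      # 12 index bits

def cache_set(i, j):
    # Byte address of a[i][j] for a 1024x1024 row-major int matrix
    # starting at address 0, mapped to its set index.
    addr = (i * 1024 + j) * 4
    return (addr >> offset_bits) % num_sets
```

Note that a[i][j] and a[i][j+1] share a block when j is even (the spatial locality of the row traversal), while successive a[j][i] accesses in the column traversal are a full row, 4096 bytes, apart and therefore land 512 sets apart.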

b) Compute the average memory access time for this program. Assume that the cache hit time is 1T and the miss penalty is 10T, where T = 10 ns is the clock period. (If and only if you did not solve the previous question, assume that the hit rate in the data cache is 67%.)

c) Under the same conditions as the previous question, determine the occupation rate of the bus between the cache and main memory.
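The average memory access time combines the hit time with the miss rate weighted by the miss penalty. A sketch using the fallback 67% hit rate stated in the question (substitute the rate derived in part a) where applicable):

```python
# Sketch of the AMAT formula for question II.1b.
T = 10e-9                  # clock period, 10 ns
hit_time = 1 * T
miss_penalty = 10 * T
hit_rate = 0.67            # fallback value given in the question
amat = hit_time + (1 - hit_rate) * miss_penalty
```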

2. The memory architecture of a machine X is summarized in the following table:

Virtual Address  Page Size  PTE Size
54 bits          16 Kbytes  4 bytes

a) Assume that 8 bits of each PTE are reserved for operating-system functions (protection, replacement, valid, modified, etc.) other than those required by the hardware translation algorithm. Derive the largest physical memory size (in bytes) allowed by this PTE format. Make sure you consider all the fields required by the translation algorithm.

b) How large (in bytes) is the page table?

c) Assuming that only one application exists in the system and that the maximum physical memory is devoted to this process, how much physical space (in bytes) is left for the application's data and code?
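The sizing arithmetic for parts a) and b) can be sketched as follows, assuming a flat (single-level) page table; the parameters come from the table above.

```python
# Sketch for question II.2: 54-bit virtual address, 16 KB pages,
# 4-byte PTEs, 8 PTE bits reserved for the OS.
va_bits = 54
page_bytes = 16 * 1024
pte_bits = 4 * 8
os_bits = 8

offset_bits = (page_bytes - 1).bit_length()   # 14 page-offset bits
ppn_bits = pte_bits - os_bits                 # bits left for the frame number
max_physical = (1 << ppn_bits) * page_bytes   # part a): frames * page size
pt_entries = 1 << (va_bits - offset_bits)     # one PTE per virtual page
pt_bytes = pt_entries * (pte_bits // 8)       # part b): flat-table size
```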

III. (1.5 + 1.5 = 3 val.)

Consider a server farm that is being designed to provide 100 TBytes of non-volatile storage, using solid-state hard drives (SHD) of 250 GBytes each.

a) State how many SHDs are needed if redundancy is assured by i) RAID 1, ii) RAID 3, and iii) RAID 5. Justify your answer.

b) Which RAID storage technology would you choose to achieve a lower disk access time: RAID 0 or RAID 2? Justify your answer.
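The disk-count arithmetic for part a) can be sketched as below. Two assumptions are mine, not the exam's: decimal units (1 TB = 10^12 B, 1 GB = 10^9 B), and a single parity group spanning all data disks for RAID 3/5; with several parity groups, one extra disk per group would be needed.

```python
# Sketch for question III.a: 100 TB of data on 250 GB drives.
data_disks = (100 * 10**12) // (250 * 10**9)  # disks for the data alone
raid1 = 2 * data_disks                        # mirroring doubles the count
raid3 = data_disks + 1                        # one dedicated parity disk (single group assumed)
raid5 = data_disks + 1                        # same overhead, parity distributed
```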

IV. (2.5 val.)

Consider a system with two multiprocessor machines with the following configurations:

Machine A: a NUMA machine with two processors, each with a local memory of 512 MB, a local memory access latency of 20 cycles per word, and a remote memory access latency of 60 cycles per word.

Machine B: a UMA machine with two processors and a shared memory of 1 GB with an access latency of 40 cycles per word.

Suppose an application has two threads running on the two processors, and each thread needs to access an entire array of 4096 words. Is it possible to partition this array across the local memories of the NUMA machine so that the application runs faster on it than on the UMA machine? If so, specify the partitioning. If not, by how many cycles would the UMA memory latency have to worsen for a partitioning on the NUMA machine to enable a faster run than on the UMA machine? Assume that memory operations dominate the execution time.
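The latency arithmetic here can be explored with a short sketch. It assumes, as the question suggests, that each thread touches every array word exactly once, that the two threads run fully in parallel (so the finish time is that of the slower thread), and that memory time dominates.

```python
# Sketch for question IV: place x words in thread 1's local memory and
# the remaining N - x words in thread 2's, then compare against UMA.
N, LOCAL, REMOTE, UMA = 4096, 20, 60, 40

def numa_time(x):
    t1 = LOCAL * x + REMOTE * (N - x)   # thread 1: x local, N-x remote words
    t2 = REMOTE * x + LOCAL * (N - x)   # thread 2: the mirror image
    return max(t1, t2)                  # parallel threads: slower one decides

best_numa = min(numa_time(x) for x in range(N + 1))
uma_time = UMA * N
```

Comparing `best_numa` (achieved by the balanced split, x = N/2) with `uma_time` answers the question directly.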