Three Buses. The System Bus. Bus Design. (1) Bus Width and Multiplexing


The System Bus

System bus: connects system components together
- Important: insufficient bandwidth can bottleneck the entire system
- Performance factors: physical length; number and type of connected devices (taps)
[Figure: CPU ($), main memory, and I/O devices (kbd, disk, display ctrl, NIC) on the system (memory-I/O) bus]

Three Buses

[Figure: processor-memory bus joined to an I/O bus through an adapter; backplane bus with CPU, memory, and I/O devices all on one bus]
Processor-memory bus: connects CPU and memory, no direct I/O interface
+ Short, few taps: fast, high-bandwidth
- System specific
I/O bus: connects I/O devices, no direct processor-memory interface
- Longer, more taps: slower, lower-bandwidth
+ Industry standard
- Connects to the processor-memory bus through an adapter
Backplane bus: CPU, memory, and I/O connected to the same bus
+ Industry standard, cheap (no adapters needed)
- Processor-memory performance compromised

Bus Design

A bus consists of data lines, address lines, and control lines
Goals:
- High performance: low latency and high bandwidth
- Standardization: flexibility in dealing with many devices
- Low cost
The processor-memory bus emphasizes performance, then cost; I/O and backplane buses emphasize standardization, then performance
Design issues:
1. Width/multiplexing: are wires shared or separate?
2. Clocking: is the bus clocked or not?
3. Switching: how/when is bus control acquired and released?
4. Arbitration: how do we decide who gets the bus next?

(1) Bus Width and Multiplexing

Wider
+ More bandwidth
- More expensive and more susceptible to skew
Multiplexed: address and data share the same lines
+ Cheaper
- Less bandwidth
Burst transfers: multiple sequential data transactions for a single address
+ Increase bandwidth at relatively little cost

(2) Bus Clocking

Synchronous: clocked
+ Fast
- Bus must be short to minimize clock skew
Asynchronous: un-clocked
+ Can be longer: no clock skew; deals with devices of different speeds
- Slower: requires a hand-shaking protocol
For example, an asynchronous read with multiplexed data/address lines and 3 control lines:
1. Processor drives address onto bus, asserts Request line
2. Memory asserts Ack line, processor stops driving
3. Memory drives data on bus, asserts DataReady line
4. Processor asserts Ack line, memory stops driving
Processor-memory buses are synchronous; I/O and backplane buses are asynchronous or slow-clock synchronous

(3) Bus Switching

Atomic: bus busy between request and reply
+ Simple
- Low utilization
Split-transaction: requests/replies can be interleaved
+ Higher utilization, higher throughput
- Complex: requires sending IDs to match replies to requests

(4) Bus Arbitration

Bus master: component that can initiate a bus request
A bus typically has several masters, including the processor; I/O devices can also be masters (why? see in a bit)
Arbitration: choosing a master among multiple requests; try to implement priority and fairness (no device starves)
Daisy-chain: devices connect to the bus in priority order; high-priority devices intercept/deny requests by low-priority ones
± Simple, but slow and can't ensure fairness
Centralized: a special arbiter chip collects requests and decides
± Ensures fairness, but the arbiter chip may itself become a bottleneck
Distributed: everyone sees all requests simultaneously; back off and retry if not the highest-priority request
± No bottlenecks and fair, but needs a lot of control lines

Standard Bus Examples

                 PCI             SCSI           USB
Type             Backplane       I/O            I/O
Width            32-64 bits      8-32 bits      1 bit
Multiplexed?     Yes             Yes            Yes
Clocking         33 (66) MHz     5 (10) MHz     Asynchronous
Data rate        133 (266) MB/s  10 (20) MB/s   0.2, 1.5, 80 MB/s
Arbitration      Distributed     Distributed    Daisy-chain
Maximum masters  1024            7-31           127
Maximum length   0.5 m           2.5 m          -

USB (universal serial bus): popular for low/moderate-bandwidth external peripherals
+ Packetized interface (like TCP), extremely flexible
+ Also supplies power to the peripheral

This Unit: I/O

[Figure: system layers — application, OS, compiler, firmware, CPU, I/O, memory, digital circuits, gates & transistors]
I/O system structure: devices, controllers, and buses
Device characteristics: disks
Bus characteristics
I/O control: polling and interrupts; DMA

I/O Control and Interfaces

Now that we know how devices and buses work:
- How does I/O actually happen?
- How does the CPU give commands to I/O devices?
- How do devices execute data transfers?
- How does the CPU know when devices are done?

Sending Commands to Devices

Remember: only the OS can do this! Two options:
I/O instructions
- OS only? Instructions must be privileged (only the OS can execute them)
- E.g., IA-32
Memory-mapped I/O
- A portion of the physical address space is reserved for I/O
- The OS maps these physical addresses to device control registers
- Stores/loads to these addresses are commands to devices: main memory ignores them, devices recognize and respond
- The address specifies both the device and the command
- These addresses are not cached (why?)
- OS only? I/O physical addresses are only mapped in the OS address space
- E.g., almost every architecture other than IA-32 (see a pattern?)

Querying Device Status

Now that we've sent a command to a device, how do we query its status?
- So that we know if the data we asked for is ready?
- So that we know if the device is ready to receive the next command?
Polling: "Ready now? How about now? How about now???"
- Processor queries the device status register (e.g., with a memory-mapped load)
- Loops until it gets the status it wants (ready for next command), or tries again a little later
+ Simple
- Waste of the processor's time: the processor is much faster than the device

Polling Overhead: Example #1

Parameters: 500 MHz CPU; polling event takes 400 cycles
Overhead for polling a mouse 30 times per second?
- Cycles per second for polling = (30 polls/s) * (400 cycles/poll) = 12,000 cycles/second
- (12,000 cycles/second) / (500M cycles/second) = 0.002% overhead
+ Not bad

Polling Overhead: Example #2

Same parameters: 500 MHz CPU, polling event takes 400 cycles
Overhead for polling a 4 MB/s disk with a 16 B interface?
- Must poll often enough not to miss data from the disk
- Polling rate = (4 MB/s) / (16 B/poll), far higher than the mouse polling rate
- Cycles per second for polling = [(4 MB/s) / (16 B/poll)] * (400 cycles/poll) = 100M cycles/second
- (100M cycles/second) / (500M cycles/second) = 20% overhead
- Bad: this is the overhead of polling alone, not actual data transfer
- Really bad if the disk is not being used (pure overhead!)

Interrupt-Driven I/O

Interrupts: an alternative to polling
- The device generates an interrupt when its status changes or data is ready
- The OS handles interrupts just like exceptions (e.g., page faults); the identity of the interrupting device is recorded in the ECR (exception cause register)
I/O interrupts are asynchronous
- Not associated with any one instruction
- Don't need to be handled immediately
I/O interrupts are prioritized
- Synchronous interrupts (e.g., page faults) have the highest priority
- High-bandwidth devices have higher priority than low-bandwidth ones

Interrupt Overhead

Parameters: 500 MHz CPU; polling event takes 400 cycles; interrupt handler takes 400 cycles; data transfer takes 100 cycles; a 4 MB/s disk with a 16 B interface transfers data only 5% of the time
Note: when the disk is transferring data, the interrupt rate equals the polling rate
- Percent of time the processor spends transferring data: 0.05 * (4 MB/s)/(16 B/xfer) * [(100 c/xfer)/(500M c/s)] = 0.25%
- Overhead for polling: (4 MB/s)/(16 B/poll) * [(400 c/poll)/(500M c/s)] = 20%
- Overhead for interrupts: + 0.05 * (4 MB/s)/(16 B/int) * [(400 c/int)/(500M c/s)] = 1%

Direct Memory Access (DMA)

Interrupts remove the overhead of polling, but still require the OS to transfer data one word at a time
- OK for low-bandwidth devices: mice, microphones, etc.
- Bad for high-bandwidth devices: disks, monitors, etc.
Direct Memory Access (DMA): transfer data between I/O and memory without processor control
- Transfers entire blocks (e.g., pages, video frames) at a time
- Can use bus burst transfer mode if available
- Only interrupts the processor when done (or if an error occurs)

DMA Controllers

To do DMA, an I/O device is attached to a DMA controller
- Multiple devices can be connected to one DMA controller
- The controller itself is seen as a memory-mapped I/O device: the processor initializes the start memory address, transfer size, etc.
- The DMA controller takes care of bus arbitration and transfer details
- So that's why buses support arbitration and multiple masters!
[Figure: CPU ($), main memory, DMA controller, disk, display, and NIC on the bus]

I/O Processors

A DMA controller is a very simple component: it may be as simple as an FSM with some local memory
Some I/O requires complicated sequences of transfers
I/O processor (IOP): a heavier DMA controller that executes instructions
- Can be programmed to do complex transfers
- E.g., a programmable network card
[Figure: CPU ($), main memory, disk, display, and an IOP attached to the NIC on the bus]

DMA Overhead

Parameters: 500 MHz CPU; interrupt handler takes 400 cycles; data transfer takes 100 cycles; a 4 MB/s disk with a 16 B interface transfers data 50% of the time; DMA setup takes 1600 cycles, transferring one 16 KB page at a time
- Processor overhead for interrupt-driven I/O: 0.5 * (4M B/s)/(16 B/xfer) * [(500 c/xfer)/(500M c/s)] = 12.5%
- Processor overhead with DMA: the processor only gets involved once per page, not once per 16 B
+ 0.5 * (4M B/s)/(16K B/page) * [(2000 c/page)/(500M c/s)] = 0.05%

DMA and the Memory Hierarchy

DMA is good, but it is not without challenges
Without DMA: the processor initiates all data transfers
- All transfers go through address translation (+ transfers can be of any size and cross virtual page boundaries)
- All values are seen by the cache hierarchy (+ caches never contain stale data)
With DMA: DMA controllers initiate data transfers
- Do they use virtual or physical addresses?
- What if they write data to a cached memory location?

DMA and Address Translation

Which addresses does the processor specify to the DMA controller?
Virtual DMA
+ Can specify large cross-page transfers
- The DMA controller has to do address translation internally
- The controller contains a small translation buffer (TB); the OS initializes its contents when it requests an I/O transfer
Physical DMA
+ The DMA controller is simple
- Can only do transfers up to a page in size; the OS breaks large transfers into page-size chunks

DMA and Caching

Caches are good
- They reduce the CPU's observed instruction and data access latency
+ But they also reduce the CPU's use of the memory bus
+ Leaving the majority of memory/bus bandwidth for DMA
But they also introduce a coherence problem for DMA
- Input problem: DMA writes into the memory version of a cached location; the cached version is now stale
- Output problem (write-back caches only): DMA reads the memory version of a dirty cached location; it outputs a stale value

Solutions to the Coherence Problem

Route all DMA I/O accesses to the cache
+ Solves the problem
- Expensive: the CPU must contend with DMA for access to the caches
Disallow caching of I/O data
+ Also works
- Expensive in a different way: CPU access to those regions is slow
Selective flushing/invalidation of cached data
- Flush all dirty blocks in the I/O region
- Invalidate blocks in the I/O region as DMA writes those addresses
+ The high-performance solution: hardware cache coherence mechanisms do this
- Expensive in yet a third way: must implement this mechanism

H/W Cache Coherence (see ECE 252, 259)

[Figure: CPU with TLBs, I$, virtually-indexed/physically-tagged D$, L2, main memory, and disk on the bus; VA and PA paths shown]
D$ and L2 snoop bus traffic
- Observe DMA transactions
- Check whether written addresses are resident
- Self-invalidate those blocks
+ Doesn't require access to the data part
- Does require access to the tag part; may need a 2nd copy of the tags for this (that's OK, tags are smaller than data)
Bus addresses are physical
- L2 is easy (physical index/tag)
- D$ is harder (virtual index/physical tag); reverse translation? No; remember: page size vs. D$ size

Designing an I/O System for Bandwidth

Approach
- Find the bandwidths of the individual components
- Configure the components you can change to match the bandwidth of the bottleneck component you can't
Example (from the P&H textbook); parameters:
- 300 MIPS CPU, 100 MB/s backplane bus
- 50K OS insns + 100K user insns per I/O operation
- SCSI-2 controllers (20 MB/s): each accommodates up to 7 disks
- 5 MB/s disks with t_seek + t_rotation = 10 ms, 64 KB reads
Determine: what is the maximum sustainable I/O rate, and how many SCSI-2 controllers and disks does it require?

First: determine the I/O rates of the components we can't change
- CPU: (300M insns/s) / (150K insns/IO) = 2000 IO/s
- Backplane: (100M B/s) / (64K B/IO) = 1562 IO/s
- Peak I/O rate is determined by the bus: 1562 IO/s
Second: configure the remaining components to match that rate
- Disk: 1 / [10 ms/IO + (64K B/IO) / (5M B/s)] = 43.9 IO/s
- How many disks? (1562 IO/s) / (43.9 IO/s) = 36 disks
- How many controllers? (43.9 IO/s) * (64K B/IO) = 2.81M B/s per disk; (20M B/s) / (2.81M B/s) = 7.1 disks per SCSI-2 controller; (36 disks) / (7 disks/controller) = 6 SCSI-2 controllers

Designing an I/O System for Latency

The previous system was designed for bandwidth; some systems have latency requirements as well
- E.g., a database system may require a maximum or average latency
Latencies are harder to deal with than bandwidths
- Unloaded system: few concurrent I/O transactions; latency is easy to calculate
- Loaded system: many concurrent I/O transactions; contention can lead to queuing, and latencies can rise dramatically
- Queuing theory can help if transactions obey a fixed distribution; otherwise simulation is needed
- Caveat: real I/O systems are modeled with simulation

Summary

Role of the OS
I/O device characteristics
- Data bandwidth
- Disks: structure and latency (seek, rotation, transfer, controller delays)
Bus characteristics
- Processor-memory, I/O, and backplane buses
- Width, multiplexing, clocking, switching, arbitration
I/O control
- I/O instructions vs. memory-mapped I/O
- Polling vs. interrupts
- Processor-controlled data transfer vs. DMA
- Interaction of DMA with the memory system