CS250 VLSI Systems Design, Lecture 8: Memory. John Wawrzynek, Krste Asanovic, with John Lazzaro and Yunsup Lee (TA). UC Berkeley, Fall 2010
CMOS Bistable
- Cross-coupled inverters used to hold state in CMOS
- Static storage in a powered cell; no refresh needed
- If a storage node leaks or is pushed slightly away from its correct value, the non-linear transfer function of the high-gain inverter removes the noise and recirculates the correct value
- To write new state, have to force the nodes to the opposite state
CMOS Transparent Latch
- Latch is transparent (output follows input) when the clock is high, holds the last value when the clock is low
- Transmission gate switch with both pMOS and nMOS passes both ones and zeros well
- Optional input and output buffers
- Schematic symbols exist for both polarities; the complementary version is transparent on clock low
Latch Operation
- Clock high: latch transparent
- Clock low: latch holding
Flip-Flop as Two Latches
- A sample (master) latch followed by a hold (slave) latch
- This is how standard cell flip-flops are built (usually with extra input/output buffers)
Small Memories from Stdcell Latches
- Data held in transparent-low latches
- Write-address decoder selects a latch row; write by clocking the latch
- Combinational logic for the read port (synthesized), selected by a read-address decoder, with an optional read output latch
- Add additional ports by replicating the read and write port logic (multiple write ports need a mux in front of each latch)
- Expensive to add many ports
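A behavioral sketch of this organization (hypothetical Python, not RTL; the class and names are illustrative): the write-address decoder clocks exactly one latch row, and the read port is a combinational mux on the read address.

```python
# Behavioral model of a small latch-based memory with one write port
# and one read port. Hypothetical sketch; the real stdcell version
# holds data in transparent-low latches clocked by the write decoder.

class LatchFileModel:
    def __init__(self, num_words, width):
        self.width = width
        self.words = [0] * num_words   # one entry per latch row

    def write(self, addr, data):
        # Write-address decoder enables (clocks) exactly one latch row.
        self.words[addr] = data & ((1 << self.width) - 1)

    def read(self, addr):
        # Read port is a combinational mux selected by the read address.
        return self.words[addr]

rf = LatchFileModel(num_words=8, width=32)
rf.write(3, 0xDEADBEEF)
assert rf.read(3) == 0xDEADBEEF
```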
6-Transistor SRAM (Static RAM)
- Large on-chip memories are built from arrays of static RAM bitcells
- Each bit cell holds a bistable (cross-coupled inverters) plus two access transistors connecting it to the wordline and the bit/bit-bar lines
- Other clocking and access logic is factored out into the periphery
Intel's 22nm SRAM cell: 0.092 um^2 SRAM cell for high-density applications; 0.108 um^2 SRAM cell for low-voltage applications. [Bohr, Intel, Sept 2009]
General SRAM Structure
- Bitline prechargers across the top of the array; address decode and wordline drivers along the side
- Differential read sense amplifiers and differential write drivers at the bottom of the columns
- Inputs: address, write enable, write data; output: read data
- Usually a maximum of 128-256 bits per row or column
Address Decoder Structure
- 2:4 predecoders convert address bit pairs (A1:A0, A3:A2) into unary 1-of-4 encodings
- Each of the 16 word lines (Word Line 0 through Word Line 15) combines one line from each predecoder with a clocked word line enable
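To make the predecoding concrete, here is a small sketch (hypothetical Python, assuming the 4-bit address and two 2:4 predecoders shown on the slide): each predecoder produces a unary 1-of-4 code, and each wordline ANDs one line from each predecoder with the clocked enable.

```python
# Sketch of a 16-wordline decoder built from two 2:4 predecoders.
# Hypothetical illustration of the slide's structure, not a netlist.

def predecode_2to4(bits):
    # Convert a 2-bit value into a unary 1-of-4 (one-hot) code.
    return [1 if bits == i else 0 for i in range(4)]

def decode(addr, wordline_enable):
    lo = predecode_2to4(addr & 0b11)          # A1:A0
    hi = predecode_2to4((addr >> 2) & 0b11)   # A3:A2
    # Each wordline is the AND of one low predecode line, one high
    # predecode line, and the clocked wordline enable.
    return [hi[w >> 2] & lo[w & 0b11] & wordline_enable
            for w in range(16)]

assert decode(5, 1).index(1) == 5   # exactly wordline 5 fires
assert sum(decode(5, 1)) == 1
assert sum(decode(5, 0)) == 0       # enable low: no wordline fires
```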
Read Cycle
1) Precharge the bitlines and the sense amp
2) Pulse the wordline; develop a bitline differential voltage
3) Disconnect the bitlines from the sense amp, activate the sense pulldown, and develop full-rail data signals into the set-reset output latch
The pulses are generated by internal self-timed signals, often using replica circuits representing critical paths.
Write Cycle
1) Precharge the bitlines
2) Open the wordline; pull down one bitline full rail
Write enable can be controlled on a per-bit level. If the bit lines are not driven during a write, the cell retains its value (the access looks like a read to the cell).
Column-Muxing at Sense Amps
- Difficult to pitch-match a sense amp to the tight SRAM bit-cell spacing, so often 2-8 columns share one sense amp via column-select signals
- Impacts power dissipation, as multiple bitline pairs swing for each bit read
Building Larger Memories
- Large arrays are constructed by tiling multiple leaf arrays, sharing decoders and I/O circuitry (e.g., a sense amp attached to the arrays above and below it)
- Leaf array limited in size to 128-256 bits per row/column due to the RC delay of wordlines and bitlines
- Tiling also reduces power, by activating only the selected sub-bank
- In larger memories, delay and energy are dominated by the I/O wiring
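One way to see the tiling is as a partition of the address bits (a hypothetical sketch; the field widths are illustrative, chosen to keep each leaf array within the 128-256 bit row/column limit mentioned above):

```python
# Hypothetical address-bit partition for a tiled SRAM: low bits pick
# the column-mux select within a leaf array, middle bits the row
# (wordline), high bits the sub-bank. Only the selected sub-bank is
# activated, which is what saves power in larger memories.

COLMUX_BITS = 2   # 4:1 column muxing at the sense amps
ROW_BITS    = 7   # 128 rows per leaf array
BANK_BITS   = 2   # 4 sub-banks

def split_address(addr):
    col  = addr & ((1 << COLMUX_BITS) - 1)
    row  = (addr >> COLMUX_BITS) & ((1 << ROW_BITS) - 1)
    bank = (addr >> (COLMUX_BITS + ROW_BITS)) & ((1 << BANK_BITS) - 1)
    return bank, row, col

bank, row, col = split_address(0x2A5)
print(f"bank={bank} row={row} colmux={col}")
```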
Adding More Ports
- Differential read or write ports: each port adds its own wordline and differential bitline pair to the cell (WordlineA with BitA/BitA-bar, WordlineB with BitB/BitB-bar)
- Optional single-ended read port: an extra wordline driving a single read bitline
Memory Compilers
- In an ASIC flow, memory compilers are used to generate layout for the SRAM blocks in a design; a modern SoC often contains hundreds of memory instances
- Memory generators can also produce built-in self-test (BIST) logic, to speed manufacturing testing, and redundant rows/columns to improve yield
- The compiler can be parameterized by number of words, number of bits per word, desired aspect ratio, number of sub-banks, degree of column muxing, etc.
- Area, delay, and energy consumption are a complex function of the design parameters and the generation algorithm; worth experimenting with the design space
- ASIC libraries usually provide only single-port (one read or write) and two-port (one read and one write) SRAM generators
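As a rough illustration of that parameter space (a hypothetical sketch; real compilers use proprietary area/delay/energy models, and this geometry formula is only an assumption for illustration):

```python
# Hypothetical sketch of memory-compiler parameters and the derived
# leaf-array geometry. Checks the 128-256 bit row/column guideline
# from earlier in the lecture; not a real generator's model.

from dataclasses import dataclass

@dataclass
class SramParams:
    words: int
    bits_per_word: int
    sub_banks: int = 1
    col_mux: int = 4          # columns sharing one sense amp

    def leaf_geometry(self):
        # Each leaf row holds col_mux interleaved words per data bit.
        rows = self.words // (self.sub_banks * self.col_mux)
        cols = self.bits_per_word * self.col_mux
        assert rows <= 256 and cols <= 256, "leaf array too large"
        return rows, cols

print(SramParams(words=4096, bits_per_word=32, sub_banks=4).leaf_geometry())
```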
Small Memories
- Compiled SRAM arrays usually have a high overhead due to peripheral circuits, BIST, and redundancy
- Small memories are usually built from latches and/or flip-flops in a stdcell flow
- The cross-over point is usually around 1K bits of storage
- Should try the design both ways
Memory Design Patterns
Multiport Memory Design Patterns
Often we require multiple access ports to a common memory.
- True Multiport Memory: as described earlier in lecture, completely independent read and write port circuitry
- Banked Multiport Memory: interleave lesser-ported banks to provide higher bandwidth
- Stream-Buffered Multiport Memory: use a single wider access port to provide multiple narrower streaming ports
- Cached Multiport Memory: use a large single-port main memory, but add caches to service each access port
True Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a shared common memory.
Solution: Provide separate read and write ports to each bit cell for each requester.
Applicability: Where unpredictable access latency to the shared memory cannot be tolerated.
Consequences: High area, energy, and delay cost for a large number of ports. Must define behavior when multiple writes occur on the same cycle to the same word (e.g., prohibit, provide priority, or combine writes).
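A behavioral sketch of the write-conflict policies listed under Consequences (hypothetical Python, not any real design): on same-cycle writes to one word, either raise an error, let the highest-priority port win, or combine the writes.

```python
# Hypothetical model of a true multiport memory with multiple write
# ports. Illustrates the same-cycle write-conflict policies from the
# slide: "prohibit", "priority" (lowest port number wins here), or
# "combine" (here: bitwise OR of the written data).

class MultiportMemModel:
    def __init__(self, num_words, policy="priority"):
        self.words = [0] * num_words
        self.policy = policy

    def cycle(self, writes, read_addrs):
        # writes: list of (port, addr, data) tuples for this cycle.
        # Read ports sample the old state before the writes land.
        reads = [self.words[a] for a in read_addrs]
        by_addr = {}
        for port, addr, data in writes:
            by_addr.setdefault(addr, []).append((port, data))
        for addr, ws in by_addr.items():
            if len(ws) > 1:
                if self.policy == "prohibit":
                    raise ValueError(f"same-cycle write conflict at {addr}")
                if self.policy == "combine":
                    combined = 0
                    for _, d in ws:
                        combined |= d
                    ws = [(0, combined)]
                else:                      # "priority"
                    ws = [min(ws)]         # lowest port number wins
            self.words[addr] = ws[0][1]
        return reads

m = MultiportMemModel(4, policy="priority")
m.cycle(writes=[(0, 2, 0xAA), (1, 2, 0x55)], read_addrs=[])
assert m.cycle(writes=[], read_addrs=[2]) == [0xAA]
```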
True Multiport Example: Itanium-2 Regfile
Intel Itanium-2 [Fetzer et al., IEEE JSSC 2002]
Itanium-2 Regfile Timing
[Figure: IEU state and timing diagram]
Banked Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Divide the memory capacity into smaller banks, each of which has fewer ports. Requests are distributed across banks using a fixed hashing scheme. Multiple requesters arbitrate for access to the same bank/port.
Applicability: Requesters can tolerate variable latency for accesses. Accesses are distributed across the address space so as to avoid hotspots.
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide interconnect between each requester and each bank/port. The total number of banks times ports per bank can be greater than, equal to, or less than the total number of external access ports.
Banked Multiport Memory
[Figure: Ports A and B connect through arbitration and a crossbar to Banks 0-3]
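A minimal behavioral sketch of the structure in the figure (hypothetical Python): low address bits hash requests to banks, and when two ports hit the same bank in one cycle, the arbitration loser must retry, which is the variable latency this pattern accepts.

```python
# Hypothetical model of a banked multiport memory: N single-ported
# banks behind an arbiter. Low address bits pick the bank (a fixed
# hashing scheme); same-cycle requests to one bank conflict, and all
# but the arbitration winner are stalled for a retry.

NUM_BANKS = 4

def bank_of(addr):
    return addr % NUM_BANKS       # fixed hash: low address bits

def arbitrate(requests):
    # requests: list of (port, addr). Returns (granted, stalled);
    # in this sketch the lower-numbered port wins a conflict.
    granted, stalled, busy = [], [], set()
    for port, addr in sorted(requests):
        b = bank_of(addr)
        (stalled if b in busy else granted).append((port, addr))
        busy.add(b)
    return granted, stalled

g, s = arbitrate([(0, 8), (1, 12), (2, 5)])
print("granted:", g, "stalled:", s)   # ports 0 and 1 both map to bank 0
```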
Banked Multiport Memory Example: Pentium (P5) 8-way interleaved data cache, with two ports. Dual-ported TLB and dual-ported cache tags; bank conflict detection; single-ported and interleaved cache data. [Figure 7: Dual-access data cache; Alpert et al., IEEE Micro, May 1993]
Stream-Buffered Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory, where each requester usually makes multiple sequential accesses.
Solution: Organize the memory to have a single wide port. Provide each requester with an internal stream buffer that holds the width of data returned/consumed by each access. Each requester can access its own stream buffer without contention, but arbitrates with the others for the wide port when filling or draining its stream buffer.
Applicability: Requesters make mostly sequential requests and can tolerate variable latency for accesses.
Consequences: Requesters must wait an arbitration delay to determine if a request will complete. Have to provide a stream buffer for each requester. Need sufficient access width to serve the aggregate bandwidth demands of all requesters, but a wide data access can be wasted if not all of it is used by the requester. Have to specify the memory consistency model between ports (e.g., provide stream flush operations).
Stream-Buffered Multiport Memory
[Figure: Ports A and B each access a private stream buffer (Stream Buffer A, Stream Buffer B); the buffers arbitrate for a single wide memory]
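A sketch of one read-side stream port (hypothetical Python; the line width is illustrative): each narrow read drains the port's private stream buffer, and only a buffer miss arbitrates for the single wide memory access.

```python
# Hypothetical model of one read-side stream port over a wide memory.
# Each wide access fetches WORDS_PER_LINE words into the port's
# private stream buffer; sequential narrow reads then hit the buffer
# without contention, and only a miss goes to the shared wide port.

WORDS_PER_LINE = 4

class StreamReadPort:
    def __init__(self, wide_mem):
        self.wide_mem = wide_mem    # list of lines (lists of words)
        self.buf_tag = None         # which line the buffer holds
        self.buf = []

    def read(self, addr):
        line = addr // WORDS_PER_LINE
        if line != self.buf_tag:
            # Miss: arbitrate for the wide port (not modeled) and
            # refill the whole line in one wide access.
            self.buf = list(self.wide_mem[line])
            self.buf_tag = line
        return self.buf[addr % WORDS_PER_LINE]

mem = [[4 * l + i for i in range(WORDS_PER_LINE)] for l in range(8)]
port = StreamReadPort(mem)
assert [port.read(a) for a in range(6)] == [0, 1, 2, 3, 4, 5]  # 2 wide accesses
```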
Stream-Buffered Multiport Examples
IBM Cell microprocessor local store [Chen et al., IBM, 2005]: "The SPU's SIMD support can perform operations on sixteen 8-bit integers, eight 16-bit integers, four 32-bit integers, or four single-precision floating-point numbers per cycle. At 3.2 GHz, each SPU is capable of performing up to 51.2 billion 8-bit integer operations or 25.6 GFLOPs in single precision."
Cached Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a large shared common memory.
Solution: Provide each access port with a local cache of recently touched addresses from the common memory, and use a cache coherence protocol to keep the cache contents in sync.
Applicability: Request streams have significant temporal locality, and limited communication between different ports.
Consequences: Requesters will experience variable delay depending on access pattern and the operation of the cache coherence protocol. Tag overhead in area, delay, and energy per access. Complexity of the cache coherence protocol.
Cached Multiport Memory
[Figure: Ports A and B each access a private cache (Cache A, Cache B), connected through arbitration and interconnect to the common memory]
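A toy sketch of the pattern (hypothetical Python, far simpler than a real coherence protocol): each port gets a one-entry write-through cache, and a write through one port invalidates the matching entry in its peers' caches to keep contents in sync.

```python
# Toy model of the cached multiport pattern: per-port single-entry
# write-through caches kept in sync by invalidation. Hypothetical
# sketch; a real design needs a full cache coherence protocol.

class CachedPort:
    def __init__(self, mem, peers):
        self.mem, self.peers = mem, peers
        self.tag, self.data = None, None  # one-entry cache

    def read(self, addr):
        if self.tag != addr:              # miss: fetch from common memory
            self.tag, self.data = addr, self.mem[addr]
        return self.data                  # hit: no memory access needed

    def write(self, addr, data):
        self.mem[addr] = data             # write through to common memory
        self.tag, self.data = addr, data
        for p in self.peers:              # invalidate peers' stale copies
            if p.tag == addr:
                p.tag = None

mem = {0: 10, 1: 11}
a = CachedPort(mem, peers=[])
b = CachedPort(mem, peers=[a])
a.peers.append(b)
assert a.read(0) == 10
b.write(0, 99)                            # invalidates port A's copy
assert a.read(0) == 99                    # A re-fetches and sees the new value
```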
Replicated State Multiport Memory
Problem: Require simultaneous read and write access by multiple independent agents to a small shared common memory. Cannot tolerate variable latency of access.
Solution: Replicate the storage and divide the read ports among the replicas. Each replica has enough write ports to keep all replicas in sync.
Applicability: Many read ports required, and variable latency cannot be tolerated.
Consequences: Potential increase in latency between some writers and some readers.
Replicated State Multiport Memory
[Figure: Write Ports 0 and 1 drive both Copy 0 and Copy 1; the read ports are divided between the copies]
Example: Alpha 21264 Regfile clusters
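A minimal sketch (hypothetical Python, loosely in the spirit of the 21264's split regfile): every write is broadcast to all copies so they stay identical, while the read ports are divided among the copies.

```python
# Hypothetical model of replicated-state multiport memory: the read
# ports are split across replicas, and every write is broadcast to
# all replicas so their contents stay identical.

class ReplicatedMem:
    def __init__(self, num_words, num_copies=2):
        self.copies = [[0] * num_words for _ in range(num_copies)]

    def write(self, addr, data):
        for copy in self.copies:          # broadcast to every replica
            copy[addr] = data

    def read(self, port, addr):
        # Each read port is wired to one replica only.
        copy = self.copies[port % len(self.copies)]
        return copy[addr]

m = ReplicatedMem(num_words=32)
m.write(7, 123)
assert m.read(0, 7) == m.read(1, 7) == 123  # both replicas agree
```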
Memory Hierarchy Design Patterns
Use a small fast memory together with a large slow memory to provide the illusion of a large fast memory.
- Explicitly managed local stores
- Automatically managed cache hierarchies