DRAM Circuit and Architecture Basics Overview Terminology Access Protocol Architecture Word Line Storage element (capacitor) Bit Line Switching element
DRAM Circuit Basics DRAM Cell Word Line Bit Line Storage element (capacitor) In/Out Buffers DRAM Column Decoder Sense Amps... Bit Lines... Switching element Decoder... Word Lines... Memory Array
, Bitlines and Wordlines DRAM Circuit Basics Defined Bit Lines Word Line of DRAM Size: 8 Kb @ 256 Mb SDRAM node 4 Kb @ 256 Mb RDRAM node
DRAM Circuit Basics Sense Amplifier I 1 2 4 5 Sense and Amplify 3 6 6 s shown
DRAM Circuit Basics Sense Amplifier II : Precharged precharged to V cc /2 1 2 4 5 Sense and Amplify 3 6 V cc (logic 1) Gnd (logic 0) V cc /2
DRAM Circuit Basics Sense Amplifier III : Destructive Read 1 2 4 5 Sense and Amplify 3 6 Wordline Driven V cc (logic 1) Gnd (logic 0) V cc /2
DRAM Access Protocol ROW ACCESS DRAM Column Decoder CPU BUS MEMORY CONTROLLER AKA: OPEN a DRAM Page/ or ACT (Activate a DRAM Page/) or RAS ( Strobe) In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array
once the data is valid on ALL of the bit lines, you can select a subset of the bits and send them to the output buffers... CAS picks one of the bits big point: cannot do another RAS or precharge of the lines until finished reading the column data... can t change the values on the bit lines or the output of the sense amps until it has been read by the memory controller DRAM Circuit Basics Column Defined Column: Smallest addressable quantity of DRAM on chip SDRAM*: column size == chip data bus width (4, 8,16, 32) RDRAM: column size!= chip data bus width (128 bit fixed) SDRAM*: get n columns per access. n = (1, 2, 4, 8) RDRAM: get 1 column per access. 4 bit wide columns #0 #1 #2 #3 #4 #5 One of DRAM * SDRAM means SDRAM and variants. i.e. DDR SDRAM
DRAM Access Protocol COLUMN ACCESS I DRAM Column Decoder CPU BUS MEMORY CONTROLLER READ Command or CAS: Column Strobe In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array
DRAM Access Protocol Column Access II then the data is valid on the data bus... depending on what you are using for in/out buffers, you might be able to overlap a litttle or a lot of the data transfer with the next CAS to the same page (this is PAGE MODE) CPU BUS Out MEMORY CONTROLLER In/Out Buffers Decoder... Word Lines... Column Decoder Sense Amps... Bit Lines... Memory Array DRAM... with optional additional CAS: Column Strobe note: page mode enables overlap with CAS
DRAM Speed Part I NOTE How fast can I move data from DRAM cell to sense amp? Column Decoder DRAM CPU BUS t RCD MEMORY CONTROLLER In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array RCD ( Command Delay)
DRAM Speed Part II How fast can I get data out of sense amps back into memory controller? t CAS aka t CASL aka t CL CPU BUS MEMORY CONTROLLER CAS: Column Strobe CASL: Column Strobe Latency CL: Column Strobe Latency In/Out Buffers Decoder... Word Lines... Column Decoder Sense Amps... Bit Lines... Memory Array DRAM
DRAM Speed Part III How fast can I move data from DRAM cell into memory controller? DRAM Column Decoder CPU BUS MEMORY CONTROLLER t RAC = t RCD + t CAS In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array RAC (Random Access Delay)
DRAM Speed Part IV How fast can I precharge DRAM array so I can engage another RAS? DRAM Column Decoder CPU t RP BUS MEMORY CONTROLLER In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array RP ( Precharge Delay)
DRAM Speed Part V How fast can I read from different rows? DRAM Column Decoder CPU BUS MEMORY CONTROLLER t RC = t RAS + t RP In/Out Buffers Decoder... Word Lines... Sense Amps... Bit Lines... Memory Array RC ( Cycle Time)
DRAM Speed Summary I What do I care about? t RCD t CAS t RP t RC = t RAS + t RP t RAC = t RCD + t CAS Seen in ads. Easy to explain Easy to sell Embedded systems designers DRAM manufactuers Computer Architect: Latency bound code i.e. linked list traversal RAS: Strobe CAS: Column Strobe RCD: Command Delay RAC :Random Access Delay RP : Precharge Delay RC : Cycle Time
DRAM Speed Summary II DRAM Type PC133 SDRAM Frequency Bus Width (per chip) Peak Bandwidth (per Chip) Random Access Time (t RAC ) Cycle Time (t RC ) 133 16 200 MB/s 45 ns 60 ns DDR 266 133 * 2 16 532 MB/s 45 ns 60 ns PC800 RDRAM 400 * 2 16 1.6 GB/s 60 ns 70 ns FCRAM 200 * 2 16 0.8 GB/s 25 ns 25 ns RLDRAM 300 * 2 32 2.4 GB/s 25 ns 25 ns DRAM is slow But doesn t have to be t RC < 10ns achievable Higher die cost Not adopted in standard Not commodity Expensive
DRAM latency isn t deterministic because of CAS or RAS+CAS, and there may be significant queuing delays within the CPU and the memory controller Each transaction has some overhead. Some types of overhead cannot be pipelined. This means that in general, longer bursts are more efficient. DRAM latency CPU A F B Mem Controller C D E 1 DRAM E 2 /E 3 A: Transaction request may be delayed in Queue B: Transaction request sent to Memory Controller C: Transaction converted to Command Sequences (may be queued) D: Command/s Sent to DRAM E 1 : Requires only a CAS or E 2 : Requires RAS + CAS or E 3: Requires PRE + RAS + CAS F: Transaction sent back to CPU DRAM Latency = A + B + C + D + E + F
DRAM Architecture Basics PHYSICAL ORGANIZATION NOTE x2 DRAM Column Decoder x4 DRAM Column Decoder x8 DRAM Column Decoder Buffers Sense Amps... Bit Lines... Buffers Sense Amps... Bit Lines... Buffers Sense Amps... Bit Lines... Decoder.... Memory Array Decoder.... Memory Array Decoder.... Memory Array x2 DRAM x4 DRAM x8 DRAM This is per bank Typical DRAMs have 2+ banks
let s look at the interface another way.. the say the data sheets portray it. [explain] main point: the RAS\ and CAS\ signals directly control the latches that hold the row and column addresses... DRAM Architecture Basics Read Timing for Conventional DRAM RAS CAS Access Column Access Transfer Column Column DQ out out
DRAM Evolutionary Tree since DRAM s inception, there have been a stream of changes to the design, from FPM to EDO to Burst EDO to SDRAM. the changes are largely structural modifications -- nimor -- that target THROUGHPUT. [discuss FPM up to SDRAM Everything up to and including SDRAM has been relatively inexpensive, especially when considering the pay-off (FPM was essentially free, EDO cost a latch, PBEDO cost a counter, SDRAM cost a slight re-design). however, we re run out of free ideas, and now all changes are considered expensive... thus there is no consensus on new directions and myriad of choices has appeared [ do LATENCY mods starting with ESDRAM... and then the INTERFACE mods ] Conventional DRAM (Mostly) Structural Modifications Targeting Throughput FPM EDO P/BEDO SDRAM ESDRAM Interface Modifications Targeting Throughput.............. MOSYS FCRAM $ VCDRAM Structural Modifications Targeting Latency Rambus, DDR/2 Future Trends
DRAM Evolution NOTE Read Timing for Conventional DRAM Access Column Access Transfer Overlap RAS Transfer CAS Column Column DQ out out
FPM aallows you to keep th esense amps actuve for multiple CAS commands... much better throughput problem: cannot latch a new value in the column address buffer until the read-out of the data is complete DRAM Evolution Read Timing for Fast Page Mode RAS Access Column Access Transfer Overlap Transfer CAS Column Column Column DQ out out out
solution to that problem -- instead of simple tri-state buffers, use a latch as well. by putting a latch after the column mux, the next column address command can begin sooner DRAM Evolution Read Timing for Extended Out RAS Access Column Access Transfer Overlap Transfer CAS Column Column Column DQ out out out
by driving the col-addr latch from an internal counter rather than an external signal, the minimum cycle time for driving the output bus was reduced by roughly 30% DRAM Evolution Read Timing for Burst EDO RAS Access Column Access Transfer Overlap Transfer CAS Column DQ
pipeline refers to the setting up of the read pipeline... first CAS\ toggle latches the column address, all following CAS\ toggles drive data out onto the bus. therefore data stops coming when the memory controller stops toggling CAS\ DRAM Evolution Read Timing for Pipeline Burst EDO RAS Access Column Access Transfer Overlap Transfer CAS Column DQ
main benefit: frees up the CPU or memory controller from having to control the DRAM s internal latches directly... the controller/cpu can go off and do other things during the idle cycles instead of wait... even though the time-to-first-word latency actually gets worse, the scheme increases system throughput DRAM Evolution Read Timing for Synchronous DRAM Clock RAS CAS Command Access Column Access Transfer Overlap Transfer ACT READ Col DQ (RAS + CAS + OE... == Command Bus)
output latch on EDO allowed you to start CAS sooner for next accesss (to same row) DRAM Evolution Inter- Read Timing for ESDRAM Regular CAS-2 SDRAM, R/R to same bank Clock Command latch whole row in ESDRAM -- allows you to start precharge & RAS sooner for thee next page access -- HIDE THE PRECHARGE OVERHEAD. ACT READ Col PRE Bank ACT READ Col DQ ESDRAM, R/R to same bank Clock Command ACT READ PRE ACT READ Col Bank Col DQ
neat feature of this type of buffering: write-around DRAM Evolution Write-Around in ESDRAM Regular CAS-2 SDRAM, R/W/R to same bank, rows 0/1/0 Clock Command ACT READ PRE ACT WRITE PRE ACT READ Col Bank Col Bank Col DQ ESDRAM, R/W/R to same bank, rows 0/1/0 Clock Command ACT READ PRE ACT WRITE READ Col Bank Col Col DQ (can second READ be this aggressive?)
main thing... it is like having a bunch of open row buffers (a la rambus), but the problem is that you must deal with the cache directly (move into and out of it), not the DRAM banks... adds an extra couple of cycles of latency... however, you get good bandwidth if the data you want is cache, and you can prefetch into cache ahead of when you want it... originally targetted at reducing latency, now that SDRAM is CAS-2 and RCD-2, this make sense only in a throughput way DRAM Evolution Internal Structure of Virtual Channel Bank B Bank A 2Kb Segment 2Kb Segment 2Kb Segment 2Kbit 16 Channels (segments) # DQs Input/Output Buffer DQs $ 2Kb Segment Decoder Sense Amps Sel/Dec Activate Prefetch Restore Read Write Segment cache is software-managed, reduces energy
FCRAM opts to break up the data array.. only activate a portion of the word line DRAM Evolution Internal Structure of Fast Cycle RAM SDRAM FCRAM 8K rows requires 13 bits tto select... FCRAM uses 15 (assuming the array is 8k x 1k... the data sheet does not specify) 13 bits Decoder Decoder 8M Array 8M Array 15 bits (8Kr x 1Kb) (?) Sense Amps Sense Amps t RCD = 15ns (two clocks) t RCD = 5ns (one clock) Reduces access time and energy/access
MoSys takes this one step further... DRAM with an SRAM interface & speed but DRAM energy [physical partitioning: 72 banks] auto refresh -- how to do this transparently? the logic moves tthrough the arrays, refreshing them when not active. but what is one bank gets repeated access for a long duration? all other banks will be refreshed, but that one will not. solution: they have a bank-sized CACHE of lines... in theory, should never have a problem (magic) DRAM Evolution Internal Structure of MoSys 1T-SRAM addr Bank Select Auto Refresh.............. $ DQs