A Mixed Time-Criticality SDRAM Controller




NEST COBRA CA4, MeAOW, 3-9-23
Sven Goossens, Benny Akesson, Kees Goossens

Mixed Time-Criticality
Embedded multi-core systems are getting more complex:
- More applications are integrated
- The applications themselves get more complex
- The functionality/energy ratio increases
This is driven by power, area and cost constraints. It results in a mix of applications of different time-criticalities sharing hardware resources:
Firm real-time + soft real-time = mixed real-time
The hardware can no longer be tailored for a specific time-criticality class.

SDRAM Controllers
DRAM:
- Most commonly used off-chip memory resource
- Shared across firm real-time (FRT) and soft real-time (SRT) applications
- Performance metrics: bandwidth (throughput) and latency (response time)
- Performance is difficult to bound because it depends on locality
Firm real-time controllers:
- Maximize worst-case performance
- Simple, analyzable command scheduler
- No attention for average-case performance
- Do not exploit locality across requests
Soft real-time controllers:
- Maximize average-case performance
- Complex, high-performance command scheduler
- Guaranteeable performance is usually low
- Exploit locality as much as possible
Mixed real-time (MRT) controllers, requirements:
- For FRT: guarantee enough worst-case performance to satisfy the requirements
- For SRT: maximize the average-case performance
How can locality be exploited by an MRT controller?

Outline
- Introduction
- Firm Real-Time Performance
- Conservative Open-Page Policy
- Reconfigurable Controller Architecture
- Conclusions

Firm Real-Time Performance
Our approach: do not schedule individual commands at run time. Instead, use command sequences computed at design time, called patterns, and schedule those. Select the right memory map / configuration for the mix of applications.
[Figure: an example read pattern for a DDR3-800 in configuration (BI 2, BC 4), drawn as a sequence of ACT, RD and NOP commands with the pattern length marked.]
Terms used in the figure:
- Banks Interleaved (BI): the number of banks interleaved in the access.
- Burst Count (BC): each read command results in a burst data transfer. The number of bursts per bank is called the burst count.
- Burst Length (BL): the number of words per read command. They are transferred in BL/2 clock cycles.
- Interface Width (IW): the width of the memory interface.
- Access Granularity (AG): the number of bytes read or written in a pattern, AG = BI × BC × BL × IW.
- (Gross) efficiency: the fraction of time that the data bus is occupied in the worst case.
The designer can choose the bank interleaving and burst count. Each configuration results in a different trade-off between bandwidth, latency and power.
Goossens, Kouters, Akesson, Goossens, Memory Map Selection for Firm Real-time SDRAM Controllers, Proc. DATE 2012
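The access-granularity relation can be checked with a few lines of arithmetic. The sketch below is illustrative only: the function names, the 2-byte interface width, BL = 8 and the 40-cycle pattern length are assumptions chosen for the example, not values from the slides. It also looks at a single pattern, whereas the worst-case efficiency of a controller depends on the whole pattern set.

```python
def access_granularity(bi: int, bc: int, bl: int, iw_bytes: int) -> int:
    """Bytes transferred by one pattern: AG = BI * BC * BL * IW."""
    return bi * bc * bl * iw_bytes


def gross_efficiency(bi: int, bc: int, bl: int, pattern_cycles: int) -> float:
    """Fraction of a pattern's cycles in which the data bus carries data.

    Each burst occupies the data bus for BL/2 clock cycles.
    """
    data_cycles = bi * bc * bl // 2
    if data_cycles > pattern_cycles:
        raise ValueError("pattern is too short to carry its own data")
    return data_cycles / pattern_cycles


if __name__ == "__main__":
    # Hypothetical configuration: BI 2, BC 4, BL 8, 2-byte interface,
    # and an assumed pattern length of 40 cycles.
    ag = access_granularity(bi=2, bc=4, bl=8, iw_bytes=2)
    eff = gross_efficiency(bi=2, bc=4, bl=8, pattern_cycles=40)
    print(f"AG = {ag} bytes, gross efficiency = {eff:.2f}")
```

Raising BI or BC increases the access granularity and usually the efficiency, at the cost of a longer pattern and therefore a longer worst-case latency per request.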

Firm Real-Time Performance
[Figure: worst-case bandwidth (GB/s) versus power (W) per memory configuration, for DDR2, DDR3 and LPDDR2 devices. Points are labeled BI,BC. All memories in this graph run at 400 MHz. Pareto-optimal points are connected, isolines denote energy efficiency in GB/J, and the peak bandwidth is marked.]
Select the configuration based on the real-time requirements of the requestors, and their request sizes.


Open- vs. Close-Page Policy
[Figure: timing diagram of the same sequence of requests served under a close-page policy and under an open-page policy, with the request arrivals marked on the time axis. A = activate, P = precharge. Color indicates locality (and request origin).]
For the blue requestor, the open-page policy:
- Increases the worst-case execution time
- Reduces the average-case execution time
We would like to improve average-case performance for SRT applications without hurting the FRT guarantees.
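The trade-off between the two policies can be illustrated with a toy latency model. The sketch below is an assumption-laden simplification: the cycle costs are placeholders rather than DDR3 timings, and latency is counted from the start of service until the data is returned, so the precharge of a close-page access (issued after the read) is treated as off the critical path.

```python
from typing import Callable, List

# Placeholder cycle costs for illustration only, not real device timings.
T_ACT, T_RD, T_PRE = 10, 8, 10


def close_page_latency(row_hit: bool) -> int:
    """Close-page: the row is closed after every access, so each request
    pays activate + read, whether or not it targets the previous row."""
    return T_ACT + T_RD


def open_page_latency(row_hit: bool) -> int:
    """Open-page: the row is left open. A hit pays only the read; a miss
    must precharge the old row and activate the new one first."""
    return T_RD if row_hit else T_PRE + T_ACT + T_RD


def total_latency(policy: Callable[[bool], int], hits: List[bool]) -> int:
    """Sum the latency of a trace of row hits (True) and misses (False)."""
    return sum(policy(hit) for hit in hits)


if __name__ == "__main__":
    local_trace = [True, True, True, False]      # mostly row hits
    worst_trace = [False, False, False, False]   # every request misses
    for name, trace in (("local", local_trace), ("worst", worst_trace)):
        print(name,
              "close-page:", total_latency(close_page_latency, trace),
              "open-page:", total_latency(open_page_latency, trace))
```

With these placeholder costs the open-page policy wins on the trace with locality and loses on the all-miss trace, which is the behavior described for the blue requestor.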

Conservative Open-Page Policy
Key idea:
- Do not precharge if the next request is known to target the open row
- Precharge if the next address is not known in time, or in case of a miss
[Figure: the same request sequence served under a close-page policy, the conservative open-page policy and an open-page policy, with the request arrivals marked on the time axis.]
Goossens, Akesson, Goossens, Conservative Open-Page Policy for Mixed Time-Criticality Memory Controllers, Proc. DATE 2013

Conservative Open-Page Policy
- The starting point is a predictable memory pattern set, with a bypass in case of a page hit.
- Use explicit precharges instead of auto-precharge flags.
- The conservative open-page policy postpones the precharge as long as possible, to increase the hit window in which we can decide to bypass the precharge and activate.
[Figure: two command timelines (Cmd and Bank rows) for a DDR3-1600 device with an ACT-to-ACT constraint of 38 cycles, with the arrival of the next request marked. The timeline without explicit precharge commands is annotated "Hit window (4 cc)". The timeline with explicit precharge commands is annotated "Hit window (28 cc)" and shows the PRE-to-ACT constraint.]
The conservative open-page policy can be used in an MRT controller:
- Worst-case guarantees are equal to those of a close-page policy.
- Average-case performance is better, leading to lower execution times and lower average-case latencies. SRT applications can even benefit indirectly from the quicker service to FRT requests.
- The execution time reduction depends on the memory load of the application.
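The key idea can be sketched as a single decision function. This is a minimal illustration and not the controller's actual logic: the `Request` type, the `must_precharge` name and the way time is represented (a cycle count compared against the end of the hit window) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    bank: int
    row: int


def must_precharge(open_req: Request,
                   next_req: Optional[Request],
                   known_at: int,
                   hit_window_end: int) -> bool:
    """Decide whether the currently open row has to be precharged.

    The row stays open only if the next request is known before the hit
    window closes and it targets the same bank and row. In every other
    case the controller behaves like a close-page controller.
    """
    if next_req is None or known_at > hit_window_end:
        return True  # next address not known in time
    same_row = (next_req.bank == open_req.bank
                and next_req.row == open_req.row)
    return not same_row  # precharge on a miss, bypass on a hit


if __name__ == "__main__":
    current = Request(bank=0, row=42)
    # A hit that arrives inside the window keeps the row open.
    print(must_precharge(current, Request(0, 42), known_at=10, hit_window_end=28))
    # The same hit arriving too late still forces a precharge.
    print(must_precharge(current, Request(0, 42), known_at=30, hit_window_end=28))
```

Because every uncertain case falls back to a precharge, the worst case is never worse than close-page. A longer hit window simply makes the first branch less likely to trigger.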


Reconfigurable Back-end
[Figure: block diagram of the SDRAM back-end. Blocks: address generator, pattern selector, refresh timer, pattern LUT, command player, pattern memory and an internal configuration bus. Signals: logical address, request type, offset, row/col and bank, pattern base-address and length, commands (RAS, CAS, WE, etc.), address masks and shift-amounts, and configuration data.]
- Patterns are reprogrammable at run time.
- Can support all devices supported by the PHY (all DDR3 devices).
- A different pattern gives a different worst-case bandwidth, latency and power trade-off.
- This allows a different trade-off per use case.
Goossens, Kuijsten, Akesson, Goossens, A Reconfigurable Real-Time SDRAM Controller for Mixed Time-Criticality Systems, Proc. CODES+ISSS 2013
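How a pattern LUT and a command player could fit together is sketched below. Everything in it is hypothetical: the command strings, the pattern contents, the rule that a due refresh overrides the request, and the `reprogram` helper are assumptions made to show the lookup-and-replay idea, not the actual back-end design.

```python
from typing import Dict, List, Tuple

# Pattern memory: one flat list of commands (made-up example contents).
PATTERN_MEMORY: List[str] = [
    "ACT", "NOP", "RD", "NOP", "RD", "NOP", "PRE",   # read pattern
    "ACT", "NOP", "WR", "NOP", "WR", "NOP", "PRE",   # write pattern
    "REF", "NOP", "NOP",                             # refresh pattern
]

# Pattern LUT: pattern name -> (base address, length) in pattern memory.
PATTERN_LUT: Dict[str, Tuple[int, int]] = {
    "read": (0, 7),
    "write": (7, 7),
    "refresh": (14, 3),
}


def select_pattern(request_type: str, refresh_due: bool) -> str:
    """Pattern selector: assume a due refresh overrides the request."""
    return "refresh" if refresh_due else request_type


def play(pattern: str) -> List[str]:
    """Command player: return the pattern's commands, one per clock cycle."""
    base, length = PATTERN_LUT[pattern]
    return PATTERN_MEMORY[base:base + length]


def reprogram(pattern: str, commands: List[str]) -> None:
    """Run-time reconfiguration: store a new sequence and repoint the LUT."""
    PATTERN_LUT[pattern] = (len(PATTERN_MEMORY), len(commands))
    PATTERN_MEMORY.extend(commands)


if __name__ == "__main__":
    print(play(select_pattern("read", refresh_due=False)))
    reprogram("read", ["ACT", "NOP", "RD", "PRE"])
    print(play("read"))
```

Since the scheduler only ever replays whole patterns, swapping a LUT entry changes the bandwidth, latency and power trade-off without touching the scheduling logic.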

Reconfigurable Controller Architecture
[Figure: block diagram of the controller. Labels: memory client 1 and memory client 2, resource front-end with an atomizer, a width converter and a req./resp. queue per client, TDM arbiter, resource bus, SDRAM back-end, SDRAM PHY, and a configuration bus carrying configuration data.]
- Run-time reconfiguration infrastructure (memory mapped)
- Reconfigurable TDM arbiter (predictable and composable during reconfiguration)
- Reconfigurable back-end
- Implemented in SystemC, and on an ML605 Virtex-6 development board from Xilinx
  - 2-port instance: 3754 registers, 9543 LUTs and BRAM
  - 4-port instance: 2265 registers, 46 LUTs and BRAM
  - (Most registers are used in the req./resp. queues, which contain 256 bytes per port)
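A TDM arbiter of the kind named above can be sketched in a few lines. This is a simplified illustration under stated assumptions: the `TdmArbiter` class, its slot-table format and its `reconfigure` method are hypothetical, and the reconfiguration shown here simply swaps the table, which does not capture the predictable and composable reconfiguration the slides refer to.

```python
from collections import deque
from typing import Deque, Dict, List, Optional


class TdmArbiter:
    """Time-division multiplexing: each slot belongs to exactly one port."""

    def __init__(self, slot_table: List[int], num_ports: int) -> None:
        self.num_ports = num_ports
        self.queues: Dict[int, Deque[str]] = {
            port: deque() for port in range(num_ports)
        }
        self.slot_table: List[int] = []
        self.slot = 0
        self.reconfigure(slot_table)

    def enqueue(self, port: int, request: str) -> None:
        self.queues[port].append(request)

    def tick(self) -> Optional[str]:
        """Serve the owner of the current slot, then advance the slot."""
        owner = self.slot_table[self.slot]
        self.slot = (self.slot + 1) % len(self.slot_table)
        queue = self.queues[owner]
        # An unused slot stays idle, so one port never affects another.
        return queue.popleft() if queue else None

    def reconfigure(self, slot_table: List[int]) -> None:
        """Install a new slot table (simplified: restart at slot 0)."""
        if not slot_table or any(
                not 0 <= port < self.num_ports for port in slot_table):
            raise ValueError("invalid slot table")
        self.slot_table = list(slot_table)
        self.slot = 0


if __name__ == "__main__":
    arbiter = TdmArbiter(slot_table=[0, 0, 1], num_ports=2)
    arbiter.enqueue(0, "a0")
    arbiter.enqueue(1, "b0")
    print([arbiter.tick() for _ in range(3)])
```

Leaving unused slots idle wastes bandwidth, but it makes the service each port receives independent of the other ports' behavior, which is what a firm real-time guarantee needs.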


Conclusions
Mixed time-criticality controllers should focus on:
- For FRT: guaranteeing enough worst-case performance to satisfy the requirements
- For SRT: maximizing the average-case performance
Choosing the right memory map / pattern configuration for the mix of applications:
- Trade-offs exist between worst-case bandwidth, latency and power
- Select the configuration that satisfies the firm real-time requirements
Using a conservative open-page policy, some of the locality across requests can be exploited:
- Decreases the gap between worst-case performance and average-case performance
- Reduces average-case latency, and thus average-case execution time, for soft real-time applications
A reconfigurable architecture allows changing the memory map / configuration at run time:
- Select the right trade-off per use case
- Leads to other interesting challenges (see the CODES+ISSS 2013 paper on predictable reconfiguration)

For further information / a broader perspective:
5-tile CompSOC platform:
Sven Goossens <s.l.m.goossens@tue.nl>
Benny Akesson <kessoben@fel.cvut.cz>
Kees Goossens <k.g.w.goossens@tue.nl>
Referred papers: www.svengoossens.nl
Electronic Systems Group, Electrical Engineering Faculty, Eindhoven University of Technology