Memory Architecture and Management in a NoC Platform



Similar documents
Power Reduction Techniques in the SoC Clock Network. Clock Power

Linux flash file systems JFFS2 vs UBIFS

DDR subsystem: Enhancing System Reliability and Yield

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

Putting it all together: Intel Nehalem.

Optimizing Configuration and Application Mapping for MPSoC Architectures

Networking Virtualization Using FPGAs

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

Chapter 2 Heterogeneous Multicore Architecture

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

NAND Flash & Storage Media

Introduction to System-on-Chip

Multi-Threading Performance on Commodity Multi-Core Processors

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

Building Blocks for PRU Development

SOC architecture and design

AMD Opteron Quad-Core

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Lecture 23: Multiprocessors

Resource Efficient Computing for Warehouse-scale Datacenters

Industry First X86-based Single Board Computer JaguarBoard Released

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

FPGA-based Multithreading for In-Memory Hash Joins

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

Price/performance Modern Memory Hierarchy

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Web DNS Peer-to-peer systems (file sharing, CDNs, cycle sharing)

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

Parallel Programming Survey

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Chapter Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig I/O devices can be characterized by. I/O bus connections

Beyond Virtualization: A Novel Software Architecture for Multi-Core SoCs. Jim Ready September 18, 2012

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

Parallel Algorithm Engineering

NAND Flash Architecture and Specification Trends

Tyche: An efficient Ethernet-based protocol for converged networked storage

High Performance or Cycle Accuracy?

OpenSoC Fabric: On-Chip Network Generator

Scaling from Datacenter to Client

ECLIPSE Performance Benchmarks and Profiling. January 2009

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Introduction to AMBA 4 ACE and big.little Processing Technology

7a. System-on-chip design and prototyping platforms

What is a System on a Chip?

Caching Mechanisms for Mobile and IOT Devices

ICRI-CI Retreat Architecture track

Intel Xeon Processor E5-2600

A New Chapter for System Designs Using NAND Flash Memory

A Study of Application Performance with Non-Volatile Main Memory

Performance Characteristics of Large SMP Machines

Advantages of e-mmc 4.4 based Embedded Memory Architectures

System Design Issues in Embedded Processing

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

- Nishad Nerurkar. - Aniket Mhatre

760 Veterans Circle, Warminster, PA Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

The Memory Factor Samsung Green Memory Solutions for energy efficient Systems Ed Hogan 2 /?

A Look Inside Smartphone and Tablets

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Flash-Friendly File System (F2FS)

Kalray MPPA Massively Parallel Processing Array

Binary search tree with SIMD bandwidth optimization using SSE

Implementation Details

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

A Generic Network Interface Architecture for a Networked Processor Array (NePA)

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Redefining Flash Storage Solution

Architecture of Hitachi SR-8000

1 Storage Devices Summary

Power-Aware High-Performance Scientific Computing

LS DYNA Performance Benchmarks and Profiling. January 2009

Cray DVS: Data Virtualization Service

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

How to Run the MQX RTOS on Various RAM Memories for i.mx 6SoloX

An Exploration of Hybrid Hard Disk Designs Using an Extensible Simulator

Low Power AMD Athlon 64 and AMD Opteron Processors

Embedded Parallel Computing

Transcription:

Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011

Overview Motivation State of the Art Data Management Engine Architecture Virtual Address Space Cache Coherence Consistency Synchronization Message Passing based Communication Experiments DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 2 / 27

The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s Required memory access latency: 0.625 cycles Available average access time: 1.475 cycles DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

The Access Bottleneck I/O Bottleneck Processing DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 4 / 27

Bandwidth in 3D ICs Embedded Processor DRAM Die Processing Die (NoC) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay 80 x bandwidth 1/5 latency 1/10 power No area overhead DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

Access Protocols Protocol Characteristics Frequency buswidth Voltage max bandiwtdh Efficiency DDR3 1066 MHz 32 bit 1.5 V 8.532 GB/s 4.17 GB/s/W LPDDR2 533 MHz 32 bit 1.2 V 4.264 GB/s 6.25 GB/s/W Wide I/O 200 MHz 512 bit 1.2 V 12.8 GB/s 25.0 GB/s/W Cost for 1 TB/s Protocol No of ports No of pads Power DDR3 120 3800 240 W LPDDR2 240 7700 160 W Wide I/O 80 41000 40 W DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 6 / 27

Package Architecture Roadmap 3D with TSVs POP with Stacked MCP Stacked MCP Off-Package From: Denis Dutois and Ahmed Jerraya. 3D Integration Opportunities for Interconnect in Mobile Computing Architectures. In: Future Fab International 34 (July 2010) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 7 / 27

Architecture State of the Art Custom memory architectures in many SoCs and embedded systems with no shared memory space No HW support for shared memory space, cache coherence and consistency (e.g. Inte s 48 core SCC) Uniform shared memory space with snooping based cache coherence for small multi-core systems (ARM s Cortex A9) Non-Uniform Cache Architecture (NUCA) in regular many-core SoCs; introduced in 2002: C. Kim, D. Burger, and S. Keckler. An Adaptive, Non-uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In: Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. 2002 No general, flexible, and scalable solution for the many-core era DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 8 / 27

Platform Overview DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 9 / 27

Data Management Engine Architecture DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 10 / 27

Virtual Address Space Translation Page Address Offset Virtual Address Address Translation Table Node Address Page Address Physical Address Offset DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 11 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 1 entry: 6+14=20 bit 3 Byte DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 1 entry: 6+14=20 bit 3 Byte 3MB/node (18.75%) for translation table DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

DME Virtual Address Translation Page Address Offset Virtual Address NO > BADDR Page Address Offset YES Address Translation Table Node Address Page Address Offset Physical Address DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 13 / 27

Supported Partitions local private physical Supported local private virtual - local shared physical - local shared virtual Supported remote private physical - remote private virtual - remote shared physical - remote shared virtual Supported DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 14 / 27

Address Space Management Node 1 Address Space Node 2 Address Space Node 3 Address Space Node 4 Address Space 0X00000000 0X00000000 BADDR 0X00000000 0X00000000 Private, Local local Private, Local local remote BADDR 0X0F00000 local Shared, Local or local Private, Local remote Shared, Local or remote 0XFB00000 BADDR 0XFFF0000 local remote Shared, Local or remote remote 0XFFFC000 local 0XFFFFFFFF 0XFFFFFFFF 0XFFFFFFFF BADDR 0XFFFFFFFF DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 15 / 27

Multi-core Speedup DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 16 / 27

Experiment: Central vs. Distributed Shared DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 17 / 27

Central vs. Distributed Shared : Matrix Multiplication DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 18 / 27

Central vs. Distributed Shared : FFT DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 19 / 27

Summary After computation (multi-core) and communication (NoC), memory access must be parallalized DME parallizes memory handling DME supports Central/distributed memory Private/shared Physical/virtual address space DME features Synchronization support Cache coherence protocols consistency support Message passing communication Dynamic memory allocator Future Work Improve cache coherency Improved high level support for programming models Adapt DME to integrate heterogeneous SoC subsystems DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 20 / 27

Data Management Engine Implementation Optimized for area Optimized for speed Frequency 444 MHz (2.25 ns) 455 MHz (2.2 ns) Area (Logic) 44k NAND gates 51k NAND gates Area (Control Store) 300k NAND gates power consumption [mw] Mini A 6.9 Mini B 7.0 NICU 2.3 CICU 5.2 Synchronizer 0.2 DME total 21.6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 21 / 27

Directory Based Cache Coherence Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Controller Cache Cache Cache Directory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 22 / 27

Directory Structure Node 1 Node 2 Node 3 Directory Node N block DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 23 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 S D = 520MB 13% of total memory 102% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27

Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 S D = 65MB 1.5% of total memory 12.6% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27

Consistency Experiment Initially, int x, y,z = 0; (0,0) CS NODE (0,1) (0,2) // Non-critical section1 x = data1; Reg1 = x; // Lock Acquire Acquire (L); (1,0) (1,1) SYNC NODE (1,2) // Critical Section y = data2; Reg2 = y; // Lock Release Release(L) (2,0) (2,1) (2,2) (a) // Non-critical Section2 z = data3; Reg3 = z; (b) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 26 / 27

Consistency Execution Time 10 x 104 9 8 7 6 Average Code Latency (Release consistency) Max Code Latency (Release consistency) Average Code Latency (Weak consistency) Max Code Latency (Weak consistency) Average Code Latency (Sequential consistency) Max Code Latency (Sequential consistency) Cycles 5 4 3 2 1 0 1x1 1x2 2x2 2x4 4x4 4x8 8x8 Network Size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 27 / 27