Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011
Overview Motivation State of the Art Data Management Engine Architecture Virtual Address Space Cache Coherence Consistency Synchronization Message Passing based Communication Experiments DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 2 / 27
The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27
The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27
The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27
The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27
The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s Required memory access latency: 0.625 cycles Available average access time: 1.475 cycles DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27
The Access Bottleneck I/O Bottleneck Processing DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 4 / 27
Bandwidth in 3D ICs Embedded Processor DRAM Die Processing Die (NoC) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27
Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27
Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay 80 x bandwidth 1/5 latency 1/10 power No area overhead DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27
Access Protocols Protocol Characteristics Frequency buswidth Voltage max bandiwtdh Efficiency DDR3 1066 MHz 32 bit 1.5 V 8.532 GB/s 4.17 GB/s/W LPDDR2 533 MHz 32 bit 1.2 V 4.264 GB/s 6.25 GB/s/W Wide I/O 200 MHz 512 bit 1.2 V 12.8 GB/s 25.0 GB/s/W Cost for 1 TB/s Protocol No of ports No of pads Power DDR3 120 3800 240 W LPDDR2 240 7700 160 W Wide I/O 80 41000 40 W DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 6 / 27
Package Architecture Roadmap 3D with TSVs POP with Stacked MCP Stacked MCP Off-Package From: Denis Dutois and Ahmed Jerraya. 3D Integration Opportunities for Interconnect in Mobile Computing Architectures. In: Future Fab International 34 (July 2010) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 7 / 27
Architecture State of the Art Custom memory architectures in many SoCs and embedded systems with no shared memory space No HW support for shared memory space, cache coherence and consistency (e.g. Inte s 48 core SCC) Uniform shared memory space with snooping based cache coherence for small multi-core systems (ARM s Cortex A9) Non-Uniform Cache Architecture (NUCA) in regular many-core SoCs; introduced in 2002: C. Kim, D. Burger, and S. Keckler. An Adaptive, Non-uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In: Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems. 2002 No general, flexible, and scalable solution for the many-core era DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 8 / 27
Platform Overview DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 9 / 27
Data Management Engine Architecture DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 10 / 27
Virtual Address Space Translation Page Address Offset Virtual Address Address Translation Table Node Address Page Address Physical Address Offset DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 11 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 1 entry: 6+14=20 bit 3 Byte DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes - 2 6 16 MB/node - 2 24 1 GB total memory - 2 30 Page size 1KB - 2 10 Translation table: 1 M entries - 2 20 1 entry: 6+14=20 bit 3 Byte 3MB/node (18.75%) for translation table DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27
DME Virtual Address Translation Page Address Offset Virtual Address NO > BADDR Page Address Offset YES Address Translation Table Node Address Page Address Offset Physical Address DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 13 / 27
Supported Partitions local private physical Supported local private virtual - local shared physical - local shared virtual Supported remote private physical - remote private virtual - remote shared physical - remote shared virtual Supported DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 14 / 27
Address Space Management Node 1 Address Space Node 2 Address Space Node 3 Address Space Node 4 Address Space 0X00000000 0X00000000 BADDR 0X00000000 0X00000000 Private, Local local Private, Local local remote BADDR 0X0F00000 local Shared, Local or local Private, Local remote Shared, Local or remote 0XFB00000 BADDR 0XFFF0000 local remote Shared, Local or remote remote 0XFFFC000 local 0XFFFFFFFF 0XFFFFFFFF 0XFFFFFFFF BADDR 0XFFFFFFFF DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 15 / 27
Multi-core Speedup DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 16 / 27
Experiment: Central vs. Distributed Shared DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 17 / 27
Central vs. Distributed Shared : Matrix Multiplication DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 18 / 27
Central vs. Distributed Shared : FFT DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 19 / 27
Summary After computation (multi-core) and communication (NoC), memory access must be parallalized DME parallizes memory handling DME supports Central/distributed memory Private/shared Physical/virtual address space DME features Synchronization support Cache coherence protocols consistency support Message passing communication Dynamic memory allocator Future Work Improve cache coherency Improved high level support for programming models Adapt DME to integrate heterogeneous SoC subsystems DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 20 / 27
Data Management Engine Implementation Optimized for area Optimized for speed Frequency 444 MHz (2.25 ns) 455 MHz (2.2 ns) Area (Logic) 44k NAND gates 51k NAND gates Area (Control Store) 300k NAND gates power consumption [mw] Mini A 6.9 Mini B 7.0 NICU 2.3 CICU 5.2 Synchronizer 0.2 DME total 21.6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 21 / 27
Directory Based Cache Coherence Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Controller Cache Cache Cache Directory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 22 / 27
Directory Structure Node 1 Node 2 Node 3 Directory Node N block DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 23 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 S D = 520MB 13% of total memory 102% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27
Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27
Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes - 2 6 4 GB total memory - 2 32 Block size 64B - 2 6 Cache size 8MB - 2 23 No of blocks/cache: 128K - 2 17 S D = 65MB 1.5% of total memory 12.6% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27
Consistency Experiment Initially, int x, y,z = 0; (0,0) CS NODE (0,1) (0,2) // Non-critical section1 x = data1; Reg1 = x; // Lock Acquire Acquire (L); (1,0) (1,1) SYNC NODE (1,2) // Critical Section y = data2; Reg2 = y; // Lock Release Release(L) (2,0) (2,1) (2,2) (a) // Non-critical Section2 z = data3; Reg3 = z; (b) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 26 / 27
Consistency Execution Time 10 x 104 9 8 7 6 Average Code Latency (Release consistency) Max Code Latency (Release consistency) Average Code Latency (Weak consistency) Max Code Latency (Weak consistency) Average Code Latency (Sequential consistency) Max Code Latency (Sequential consistency) Cycles 5 4 3 2 1 0 1x1 1x2 2x2 2x4 4x4 4x8 8x8 Network Size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 27 / 27