Memory Architecture and Management in a NoC Platform

Size: px
Start display at page:

Download "Memory Architecture and Management in a NoC Platform"

Transcription

1 Architecture and Management in a NoC Platform Axel Jantsch Xiaowen Chen Zhonghai Lu Chaochao Feng Abdul Nameed Yuang Zhang Ahmed Hemani DATE 2011

2 Overview Motivation State of the Art Data Management Engine Architecture Virtual Address Space Cache Coherence Consistency Synchronization Message Passing based Communication Experiments DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 2 / 27

3 The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

4 The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

5 The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

6 The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

7 The Wall Wulf and McKee predicted in 1995 the impact into the memory wall: t avg = p t c +(1 p) t m t avg...average access latency for one data word p...probability of a cache hit t c...access time to the cache t m...access time to main memory Today we navigate at the edge of this wall. For example the Tilera 64 core chip: Raw processor performance: 443 Gops Required memory bandwidth with two cache levels and 95% hit rate: 7.2Gb/s Available memory bandwidth: 50 Gb/s Required memory access latency: cycles Available average access time: cycles DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 3 / 27

8 The Access Bottleneck I/O Bottleneck Processing DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 4 / 27

9 Bandwidth in 3D ICs Embedded Processor DRAM Die Processing Die (NoC) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

10 Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

11 Bandwidth in 3D ICs Processor DRAM Die Processing Die (NoC) Embedded 8x8 MPSoC Resource size: 2x2 mm2 Switch size: 100x100 um2 TSV pitch: 5 um TSVs/switch: 128@2GHz 256 Gb/s memory bandwidth per core 16 Tb/s aggregate memory bandwidth (200 Gb/s for Tilera) < 10 ns delay 80 x bandwidth 1/5 latency 1/10 power No area overhead DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 5 / 27

12 Access Protocols Protocol Characteristics Frequency buswidth Voltage max bandiwtdh Efficiency DDR MHz 32 bit 1.5 V GB/s 4.17 GB/s/W LPDDR2 533 MHz 32 bit 1.2 V GB/s 6.25 GB/s/W Wide I/O 200 MHz 512 bit 1.2 V 12.8 GB/s 25.0 GB/s/W Cost for 1 TB/s Protocol No of ports No of pads Power DDR W LPDDR W Wide I/O W DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 6 / 27

13 Package Architecture Roadmap 3D with TSVs POP with Stacked MCP Stacked MCP Off-Package From: Denis Dutois and Ahmed Jerraya. 3D Integration Opportunities for Interconnect in Mobile Computing Architectures. In: Future Fab International 34 (July 2010) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 7 / 27

14 Architecture State of the Art Custom memory architectures in many SoCs and embedded systems with no shared memory space No HW support for shared memory space, cache coherence and consistency (e.g. Inte s 48 core SCC) Uniform shared memory space with snooping based cache coherence for small multi-core systems (ARM s Cortex A9) Non-Uniform Cache Architecture (NUCA) in regular many-core SoCs; introduced in 2002: C. Kim, D. Burger, and S. Keckler. An Adaptive, Non-uniform Cache Structure for Wire-Delay Dominated On-Chip Caches. In: Proceedings of the 20th International Conference on Architectural Support for Programming Languages and Operating Systems No general, flexible, and scalable solution for the many-core era DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 8 / 27

15 Platform Overview DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 9 / 27

16 Data Management Engine Architecture DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 10 / 27

17 Virtual Address Space Translation Page Address Offset Virtual Address Address Translation Table Node Address Page Address Physical Address Offset DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 11 / 27

18 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

19 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

20 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

21 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node GB total memory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

22 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node GB total memory Page size 1KB DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

23 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node GB total memory Page size 1KB Translation table: 1 M entries DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

24 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node GB total memory Page size 1KB Translation table: 1 M entries entry: 6+14=20 bit 3 Byte DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

25 Virtual Address Space Translation S T = (log 2 N +log 2 M N log 2 P S ) (M T /P S ) S T... Size of translation table per node N... Number of nodes M N... per node P S... Page size M T... Total memory size 64 nodes MB/node GB total memory Page size 1KB Translation table: 1 M entries entry: 6+14=20 bit 3 Byte 3MB/node (18.75%) for translation table DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 12 / 27

26 DME Virtual Address Translation Page Address Offset Virtual Address NO > BADDR Page Address Offset YES Address Translation Table Node Address Page Address Offset Physical Address DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 13 / 27

27 Supported Partitions local private physical Supported local private virtual - local shared physical - local shared virtual Supported remote private physical - remote private virtual - remote shared physical - remote shared virtual Supported DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 14 / 27

28 Address Space Management Node 1 Address Space Node 2 Address Space Node 3 Address Space Node 4 Address Space 0X X BADDR 0X X Private, Local local Private, Local local remote BADDR 0X0F00000 local Shared, Local or local Private, Local remote Shared, Local or remote 0XFB00000 BADDR 0XFFF0000 local remote Shared, Local or remote remote 0XFFFC000 local 0XFFFFFFFF 0XFFFFFFFF 0XFFFFFFFF BADDR 0XFFFFFFFF DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 15 / 27

29 Multi-core Speedup DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 16 / 27

30 Experiment: Central vs. Distributed Shared DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 17 / 27

31 Central vs. Distributed Shared : Matrix Multiplication DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 18 / 27

32 Central vs. Distributed Shared : FFT DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 19 / 27

33 Summary After computation (multi-core) and communication (NoC), memory access must be parallalized DME parallizes memory handling DME supports Central/distributed memory Private/shared Physical/virtual address space DME features Synchronization support Cache coherence protocols consistency support Message passing communication Dynamic memory allocator Future Work Improve cache coherency Improved high level support for programming models Adapt DME to integrate heterogeneous SoC subsystems DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 20 / 27

34 Data Management Engine Implementation Optimized for area Optimized for speed Frequency 444 MHz (2.25 ns) 455 MHz (2.2 ns) Area (Logic) 44k NAND gates 51k NAND gates Area (Control Store) 300k NAND gates power consumption [mw] Mini A 6.9 Mini B 7.0 NICU 2.3 CICU 5.2 Synchronizer 0.2 DME total 21.6 DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 21 / 27

35 Directory Based Cache Coherence Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Processor Node Cache Cache Cache Cache Processor Node Processor Node Processor Node Controller Cache Cache Cache Directory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 22 / 27

36 Directory Structure Node 1 Node 2 Node 3 Directory Node N block DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 23 / 27

37 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

38 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

39 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

40 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory Block size 64B DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

41 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory Block size 64B Cache size 8MB DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

42 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory Block size 64B Cache size 8MB No of blocks/cache: 128K DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

43 Directory Size 1 entry / memory block Node 1 Node 2 Node 3 Directory Node N block S D = S T /S B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory Block size 64B Cache size 8MB No of blocks/cache: 128K S D = 520MB 13% of total memory 102% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 24 / 27

44 Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27

45 Directory Size 1 entry / cache block Node 1 Node 2 Node 3 Directory Node N Cache block Cache 1 Cache 2 Cache 3 S D = N C B (N +1) S D... Size of directory S T... Total memory N... No of nodes C B... blocks/cache S B... Block size S C... Cache size 64 nodes GB total memory Block size 64B Cache size 8MB No of blocks/cache: 128K S D = 65MB 1.5% of total memory 12.6% of total cache size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 25 / 27

46 Consistency Experiment Initially, int x, y,z = 0; (0,0) CS NODE (0,1) (0,2) // Non-critical section1 x = data1; Reg1 = x; // Lock Acquire Acquire (L); (1,0) (1,1) SYNC NODE (1,2) // Critical Section y = data2; Reg2 = y; // Lock Release Release(L) (2,0) (2,1) (2,2) (a) // Non-critical Section2 z = data3; Reg3 = z; (b) DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 26 / 27

47 Consistency Execution Time 10 x Average Code Latency (Release consistency) Max Code Latency (Release consistency) Average Code Latency (Weak consistency) Max Code Latency (Weak consistency) Average Code Latency (Sequential consistency) Max Code Latency (Sequential consistency) Cycles x1 1x2 2x2 2x4 4x4 4x8 8x8 Network Size DATE 2011 Tutorial MPSoC Hardware/Software Architectural and Design Challenges/Solutions, Axel Jantsch, KTH 27 / 27

Power Reduction Techniques in the SoC Clock Network. Clock Power

Power Reduction Techniques in the SoC Clock Network. Clock Power Power Reduction Techniques in the SoC Network Low Power Design for SoCs ASIC Tutorial SoC.1 Power Why clock power is important/large» Generally the signal with the highest frequency» Typically drives a

More information

Linux flash file systems JFFS2 vs UBIFS

Linux flash file systems JFFS2 vs UBIFS Linux flash file systems JFFS2 vs UBIFS Chris Simmonds 2net Limited Embedded Systems Conference UK. 2009 Copyright 2009, 2net Limited Overview Many embedded systems use raw flash chips JFFS2 has been the

More information

DDR subsystem: Enhancing System Reliability and Yield

DDR subsystem: Enhancing System Reliability and Yield DDR subsystem: Enhancing System Reliability and Yield Agenda Evolution of DDR SDRAM standards What is the variation problem? How DRAM standards tackle system variability What problems have been adequately

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719

Putting it all together: Intel Nehalem. http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Putting it all together: Intel Nehalem http://www.realworldtech.com/page.cfm?articleid=rwt040208182719 Intel Nehalem Review entire term by looking at most recent microprocessor from Intel Nehalem is code

More information

Optimizing Configuration and Application Mapping for MPSoC Architectures

Optimizing Configuration and Application Mapping for MPSoC Architectures Optimizing Configuration and Application Mapping for MPSoC Architectures École Polytechnique de Montréal, Canada Email : Sebastien.Le-Beux@polymtl.ca 1 Multi-Processor Systems on Chip (MPSoC) Design Trends

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi

Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi Optimizing NUCA Organizations and Wiring Alternatives for Large Caches with CACTI 6.0 Naveen Muralimanohar Rajeev Balasubramonian Norman P Jouppi University of Utah & HP Labs 1 Large Caches Cache hierarchies

More information

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29

RAID. RAID 0 No redundancy ( AID?) Just stripe data over multiple disks But it does improve performance. Chapter 6 Storage and Other I/O Topics 29 RAID Redundant Array of Inexpensive (Independent) Disks Use multiple smaller disks (c.f. one large disk) Parallelism improves performance Plus extra disk(s) for redundant data storage Provides fault tolerant

More information

Chapter 2 Heterogeneous Multicore Architecture

Chapter 2 Heterogeneous Multicore Architecture Chapter 2 Heterogeneous Multicore Architecture 2.1 Architecture Model In order to satisfy the high-performance and low-power requirements for advanced embedded systems with greater fl exibility, it is

More information

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook)

COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) COMP 422, Lecture 3: Physical Organization & Communication Costs in Parallel Machines (Sections 2.4 & 2.5 of textbook) Vivek Sarkar Department of Computer Science Rice University vsarkar@rice.edu COMP

More information

NAND Flash & Storage Media

NAND Flash & Storage Media ENABLING MULTIMEDIA NAND Flash & Storage Media March 31, 2004 NAND Flash Presentation NAND Flash Presentation Version 1.6 www.st.com/nand NAND Flash Memories Technology Roadmap F70 1b/c F12 1b/c 1 bit/cell

More information

Introduction to System-on-Chip

Introduction to System-on-Chip Introduction to System-on-Chip COE838: Systems-on-Chip Design http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer Engineering Ryerson University

More information

Multi-Threading Performance on Commodity Multi-Core Processors

Multi-Threading Performance on Commodity Multi-Core Processors Multi-Threading Performance on Commodity Multi-Core Processors Jie Chen and William Watson III Scientific Computing Group Jefferson Lab 12000 Jefferson Ave. Newport News, VA 23606 Organization Introduction

More information

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads A Scalable VISC Processor Platform for Modern Client and Cloud Workloads Mohammad Abdallah Founder, President and CTO Soft Machines Linley Processor Conference October 7, 2015 Agenda Soft Machines Background

More information

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand

Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand Exploiting Remote Memory Operations to Design Efficient Reconfiguration for Shared Data-Centers over InfiniBand P. Balaji, K. Vaidyanathan, S. Narravula, K. Savitha, H. W. Jin D. K. Panda Network Based

More information

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip Outline Modeling, simulation and optimization of Multi-Processor SoCs (MPSoCs) Università of Verona Dipartimento di Informatica MPSoCs: Multi-Processor Systems on Chip A simulation platform for a MPSoC

More information

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting

Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting Emerging storage and HPC technologies to accelerate big data analytics Jerome Gaysse JG Consulting Introduction Big Data Analytics needs: Low latency data access Fast computing Power efficiency Latest

More information

enabling Ultra-High Bandwidth Scalable SSDs with HLnand

enabling Ultra-High Bandwidth Scalable SSDs with HLnand www.hlnand.com enabling Ultra-High Bandwidth Scalable SSDs with HLnand May 2013 2 Enabling Ultra-High Bandwidth Scalable SSDs with HLNAND INTRODUCTION Solid State Drives (SSDs) are available in a wide

More information

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association Making Multicore Work and Measuring its Benefits Markus Levy, president EEMBC and Multicore Association Agenda Why Multicore? Standards and issues in the multicore community What is Multicore Association?

More information

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers

The Orca Chip... Heart of IBM s RISC System/6000 Value Servers The Orca Chip... Heart of IBM s RISC System/6000 Value Servers Ravi Arimilli IBM RISC System/6000 Division 1 Agenda. Server Background. Cache Heirarchy Performance Study. RS/6000 Value Server System Structure.

More information

Building Blocks for PRU Development

Building Blocks for PRU Development Building Blocks for PRU Development Module 1 PRU Hardware Overview This session covers a hardware overview of the PRU-ICSS Subsystem. Author: Texas Instruments, Sitara ARM Processors Oct 2014 2 ARM SoC

More information

SOC architecture and design

SOC architecture and design SOC architecture and design system-on-chip (SOC) processors: become components in a system SOC covers many topics processor: pipelined, superscalar, VLIW, array, vector storage: cache, embedded and external

More information

AMD Opteron Quad-Core

AMD Opteron Quad-Core AMD Opteron Quad-Core a brief overview Daniele Magliozzi Politecnico di Milano Opteron Memory Architecture native quad-core design (four cores on a single die for more efficient data sharing) enhanced

More information

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule All Programmable Logic Hans-Joachim Gelke Institute of Embedded Systems Institute of Embedded Systems 31 Assistants 10 Professors 7 Technical Employees 2 Secretaries www.ines.zhaw.ch Research: Education:

More information

Lecture 23: Multiprocessors

Lecture 23: Multiprocessors Lecture 23: Multiprocessors Today s topics: RAID Multiprocessor taxonomy Snooping-based cache coherence protocol 1 RAID 0 and RAID 1 RAID 0 has no additional redundancy (misnomer) it uses an array of disks

More information

Resource Efficient Computing for Warehouse-scale Datacenters

Resource Efficient Computing for Warehouse-scale Datacenters Resource Efficient Computing for Warehouse-scale Datacenters Christos Kozyrakis Stanford University http://csl.stanford.edu/~christos DATE Conference March 21 st 2013 Computing is the Innovation Catalyst

More information

Industry First X86-based Single Board Computer JaguarBoard Released

Industry First X86-based Single Board Computer JaguarBoard Released Industry First X86-based Single Board Computer JaguarBoard Released HongKong, China (May 12th, 2015) Jaguar Electronic HK Co., Ltd officially launched the first X86-based single board computer called JaguarBoard.

More information

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano

Intel Itanium Quad-Core Architecture for the Enterprise. Lambert Schaelicke Eric DeLano Intel Itanium Quad-Core Architecture for the Enterprise Lambert Schaelicke Eric DeLano Agenda Introduction Intel Itanium Roadmap Intel Itanium Processor 9300 Series Overview Key Features Pipeline Overview

More information

FPGA-based Multithreading for In-Memory Hash Joins

FPGA-based Multithreading for In-Memory Hash Joins FPGA-based Multithreading for In-Memory Hash Joins Robert J. Halstead, Ildar Absalyamov, Walid A. Najjar, Vassilis J. Tsotras University of California, Riverside Outline Background What are FPGAs Multithreaded

More information

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC

New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC New Dimensions in Configurable Computing at runtime simultaneously allows Big Data and fine Grain HPC Alan Gara Intel Fellow Exascale Chief Architect Legal Disclaimer Today s presentations contain forward-looking

More information

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK

HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK HETEROGENEOUS HPC, ARCHITECTURE OPTIMIZATION, AND NVLINK Steve Oberlin CTO, Accelerated Computing US to Build Two Flagship Supercomputers SUMMIT SIERRA Partnership for Science 100-300 PFLOPS Peak Performance

More information

Price/performance Modern Memory Hierarchy

Price/performance Modern Memory Hierarchy Lecture 21: Storage Administration Take QUIZ 15 over P&H 6.1-4, 6.8-9 before 11:59pm today Project: Cache Simulator, Due April 29, 2010 NEW OFFICE HOUR TIME: Tuesday 1-2, McKinley Last Time Exam discussion

More information

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID

LSI SAS inside 60% of servers. 21 million LSI SAS & MegaRAID solutions shipped over last 3 years. 9 out of 10 top server vendors use MegaRAID The vast majority of the world s servers count on LSI SAS & MegaRAID Trust us, build the LSI credibility in storage, SAS, RAID Server installed base = 36M LSI SAS inside 60% of servers 21 million LSI SAS

More information

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING 2013/2014 1 st Semester Sample Exam January 2014 Duration: 2h00 - No extra material allowed. This includes notes, scratch paper, calculator, etc.

More information

Web Email DNS Peer-to-peer systems (file sharing, CDNs, cycle sharing)

Web Email DNS Peer-to-peer systems (file sharing, CDNs, cycle sharing) 1 1 Distributed Systems What are distributed systems? How would you characterize them? Components of the system are located at networked computers Cooperate to provide some service No shared memory Communication

More information

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS

HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS HETEROGENEOUS SYSTEM COHERENCE FOR INTEGRATED CPU-GPU SYSTEMS JASON POWER*, ARKAPRAVA BASU*, JUNLI GU, SOORAJ PUTHOOR, BRADFORD M BECKMANN, MARK D HILL*, STEVEN K REINHARDT, DAVID A WOOD* *University of

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems

A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems 1 A PRAM and NAND Flash Hybrid Architecture for High-Performance Embedded Storage Subsystems Chul Lee Software Laboratory Samsung Advanced Institute of Technology Samsung Electronics Outline 2 Background

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections

Chapter 6. 6.1 Introduction. Storage and Other I/O Topics. p. 570( 頁 585) Fig. 6.1. I/O devices can be characterized by. I/O bus connections Chapter 6 Storage and Other I/O Topics 6.1 Introduction I/O devices can be characterized by Behavior: input, output, storage Partner: human or machine Data rate: bytes/sec, transfers/sec I/O bus connections

More information

Beyond Virtualization: A Novel Software Architecture for Multi-Core SoCs. Jim Ready September 18, 2012

Beyond Virtualization: A Novel Software Architecture for Multi-Core SoCs. Jim Ready September 18, 2012 Beyond Virtualization: A Novel Software Architecture for Multi-Core SoCs Jim Ready September 18, 2012 How HW guys view the world SW Software HW How SW guys view the world SW HW Reality The SoC Software

More information

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters

COSC 6374 Parallel Computation. Parallel I/O (I) I/O basics. Concept of a clusters COSC 6374 Parallel Computation Parallel I/O (I) I/O basics Spring 2008 Concept of a clusters Processor 1 local disks Compute node message passing network administrative network Memory Processor 2 Network

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework Examples Software crisis

More information

NAND Flash Architecture and Specification Trends

NAND Flash Architecture and Specification Trends NAND Flash Architecture and Specification Trends Michael Abraham (mabraham@micron.com) NAND Solutions Group Architect Micron Technology, Inc. August 2012 1 Topics NAND Flash Architecture Trends The Cloud

More information

Tyche: An efficient Ethernet-based protocol for converged networked storage

Tyche: An efficient Ethernet-based protocol for converged networked storage Tyche: An efficient Ethernet-based protocol for converged networked storage Pilar González-Férez and Angelos Bilas 30 th International Conference on Massive Storage Systems and Technology MSST 2014 June

More information

High Performance or Cycle Accuracy?

High Performance or Cycle Accuracy? CHIP DESIGN High Performance or Cycle Accuracy? You can have both! Bill Neifert, Carbon Design Systems Rob Kaye, ARM ATC-100 AGENDA Modelling 101 & Programmer s View (PV) Models Cycle Accurate Models Bringing

More information

OpenSoC Fabric: On-Chip Network Generator

OpenSoC Fabric: On-Chip Network Generator OpenSoC Fabric: On-Chip Network Generator Using Chisel to Generate a Parameterizable On-Chip Interconnect Fabric Farzad Fatollahi-Fard, David Donofrio, George Michelogiannakis, John Shalf MODSIM 2014 Presentation

More information

Scaling from Datacenter to Client

Scaling from Datacenter to Client Scaling from Datacenter to Client KeunSoo Jo Sr. Manager Memory Product Planning Samsung Semiconductor Audio-Visual Sponsor Outline SSD Market Overview & Trends - Enterprise What brought us to NVMe Technology

More information

ECLIPSE Performance Benchmarks and Profiling. January 2009

ECLIPSE Performance Benchmarks and Profiling. January 2009 ECLIPSE Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox, Schlumberger HPC Advisory Council Cluster

More information

Vorlesung Rechnerarchitektur 2 Seite 178 DASH

Vorlesung Rechnerarchitektur 2 Seite 178 DASH Vorlesung Rechnerarchitektur 2 Seite 178 Architecture for Shared () The -architecture is a cache coherent, NUMA multiprocessor system, developed at CSL-Stanford by John Hennessy, Daniel Lenoski, Monica

More information

Power Analysis of Link Level and End-to-end Protection in Networks on Chip

Power Analysis of Link Level and End-to-end Protection in Networks on Chip Power Analysis of Link Level and End-to-end Protection in Networks on Chip Axel Jantsch, Robert Lauter, Arseni Vitkowski Royal Institute of Technology, tockholm May 2005 ICA 2005 1 ICA 2005 2 Overview

More information

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

From Bus and Crossbar to Network-On-Chip. Arteris S.A. From Bus and Crossbar to Network-On-Chip Arteris S.A. Copyright 2009 Arteris S.A. All rights reserved. Contact information Corporate Headquarters Arteris, Inc. 1741 Technology Drive, Suite 250 San Jose,

More information

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller

In-Memory Databases Algorithms and Data Structures on Modern Hardware. Martin Faust David Schwalb Jens Krüger Jürgen Müller In-Memory Databases Algorithms and Data Structures on Modern Hardware Martin Faust David Schwalb Jens Krüger Jürgen Müller The Free Lunch Is Over 2 Number of transistors per CPU increases Clock frequency

More information

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09

Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors. NoCArc 09 Performance Evaluation of 2D-Mesh, Ring, and Crossbar Interconnects for Chip Multi- Processors NoCArc 09 Jesús Camacho Villanueva, José Flich, José Duato Universidad Politécnica de Valencia December 12,

More information

Mini System 101 Our Price: $669

Mini System 101 Our Price: $669 Mini System 101 Our Price: $669 Mini System 102 Our Price: $610 Processor Features 667MHz front side bus, 512KB L2 cache and 1.33GHz processor speed. with 1024 x 600 resolutions delivers intense detail

More information

Copyright 2013, Oracle and/or its affiliates. All rights reserved.

Copyright 2013, Oracle and/or its affiliates. All rights reserved. 1 Oracle SPARC Server for Enterprise Computing Dr. Heiner Bauch Senior Account Architect 19. April 2013 2 The following is intended to outline our general product direction. It is intended for information

More information

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER

DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER DEPLOYING AND MONITORING HADOOP MAP-REDUCE ANALYTICS ON SINGLE-CHIP CLOUD COMPUTER ANDREAS-LAZAROS GEORGIADIS, SOTIRIOS XYDIS, DIMITRIOS SOUDRIS MICROPROCESSOR AND MICROSYSTEMS LABORATORY ELECTRICAL AND

More information

Measuring Cache and Memory Latency and CPU to Memory Bandwidth

Measuring Cache and Memory Latency and CPU to Memory Bandwidth White Paper Joshua Ruggiero Computer Systems Engineer Intel Corporation Measuring Cache and Memory Latency and CPU to Memory Bandwidth For use with Intel Architecture December 2008 1 321074 Executive Summary

More information

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance.

Agenda. Enterprise Application Performance Factors. Current form of Enterprise Applications. Factors to Application Performance. Agenda Enterprise Performance Factors Overall Enterprise Performance Factors Best Practice for generic Enterprise Best Practice for 3-tiers Enterprise Hardware Load Balancer Basic Unix Tuning Performance

More information

Introduction to AMBA 4 ACE and big.little Processing Technology

Introduction to AMBA 4 ACE and big.little Processing Technology Introduction to AMBA 4 and big.little Processing Technology Ashley Stevens Senior FAE, Fabric and Systems June 6th 2011 Updated July 29th 2013 Page 1 of 15 Why AMBA 4? The continual requirement for more

More information

7a. System-on-chip design and prototyping platforms

7a. System-on-chip design and prototyping platforms 7a. System-on-chip design and prototyping platforms Labros Bisdounis, Ph.D. Department of Computer and Communication Engineering 1 What is System-on-Chip (SoC)? System-on-chip is an integrated circuit

More information

International Summer School on Embedded Systems

International Summer School on Embedded Systems International Summer School on Embedded Systems Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences Shenzhen, July 30 -- August 3, 2012 Sponsored by Chinese Academy of Sciences and

More information

What is a System on a Chip?

What is a System on a Chip? What is a System on a Chip? Integration of a complete system, that until recently consisted of multiple ICs, onto a single IC. CPU PCI DSP SRAM ROM MPEG SoC DRAM System Chips Why? Characteristics: Complex

More information

September 25, 2007. Maya Gokhale Georgia Institute of Technology

September 25, 2007. Maya Gokhale Georgia Institute of Technology NAND Flash Storage for High Performance Computing Craig Ulmer cdulmer@sandia.gov September 25, 2007 Craig Ulmer Maya Gokhale Greg Diamos Michael Rewak SNL/CA, LLNL Georgia Institute of Technology University

More information

Caching Mechanisms for Mobile and IOT Devices

Caching Mechanisms for Mobile and IOT Devices Caching Mechanisms for Mobile and IOT Devices Masafumi Takahashi Toshiba Corporation JEDEC Mobile & IOT Technology Forum Copyright 2016 Toshiba Corporation Background. Unified concept. Outline The first

More information

ICRI-CI Retreat Architecture track

ICRI-CI Retreat Architecture track ICRI-CI Retreat Architecture track Uri Weiser June 5 th 2015 - Funnel: Memory Traffic Reduction for Big Data & Machine Learning (Uri) - Accelerators for Big Data & Machine Learning (Ran) - Machine Learning

More information

Intel Xeon Processor E5-2600

Intel Xeon Processor E5-2600 Intel Xeon Processor E5-2600 Best combination of performance, power efficiency, and cost. Platform Microarchitecture Processor Socket Chipset Intel Xeon E5 Series Processors and the Intel C600 Chipset

More information

A New Chapter for System Designs Using NAND Flash Memory

A New Chapter for System Designs Using NAND Flash Memory A New Chapter for System Designs Using Memory Jim Cooke Senior Technical Marketing Manager Micron Technology, Inc December 27, 2010 Trends and Complexities trends have been on the rise since was first

More information

A Study of Application Performance with Non-Volatile Main Memory

A Study of Application Performance with Non-Volatile Main Memory A Study of Application Performance with Non-Volatile Main Memory Yiying Zhang, Steven Swanson 2 Memory Storage Fast Slow Volatile In bytes Persistent In blocks Next-Generation Non-Volatile Memory (NVM)

More information

Performance Characteristics of Large SMP Machines

Performance Characteristics of Large SMP Machines Performance Characteristics of Large SMP Machines Dirk Schmidl, Dieter an Mey, Matthias S. Müller schmidl@rz.rwth-aachen.de Rechen- und Kommunikationszentrum (RZ) Agenda Investigated Hardware Kernel Benchmark

More information

Advantages of e-mmc 4.4 based Embedded Memory Architectures

Advantages of e-mmc 4.4 based Embedded Memory Architectures Embedded NAND Solutions from 2GB to 128GB provide configurable MLC/SLC storage in single memory module with an integrated controller By Scott Beekman, senior business development manager Toshiba America

More information

System Design Issues in Embedded Processing

System Design Issues in Embedded Processing System Design Issues in Embedded Processing 9/16/10 Jacob Borgeson 1 Agenda What does TI do? From MCU to MPU to DSP: What are some trends? Design Challenges Tools to Help 2 TI - the complete system The

More information

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks

Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks WHITE PAPER July 2014 Achieving Real-Time Business Solutions Using Graph Database Technology and High Performance Networks Contents Executive Summary...2 Background...3 InfiniteGraph...3 High Performance

More information

- Nishad Nerurkar. - Aniket Mhatre

- Nishad Nerurkar. - Aniket Mhatre - Nishad Nerurkar - Aniket Mhatre Single Chip Cloud Computer is a project developed by Intel. It was developed by Intel Lab Bangalore, Intel Lab America and Intel Lab Germany. It is part of a larger project,

More information

760 Veterans Circle, Warminster, PA 18974 215-956-1200. Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA 18974.

760 Veterans Circle, Warminster, PA 18974 215-956-1200. Technical Proposal. Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA 18974. 760 Veterans Circle, Warminster, PA 18974 215-956-1200 Technical Proposal Submitted by: ACT/Technico 760 Veterans Circle Warminster, PA 18974 for Conduction Cooled NAS Revision 4/3/07 CC/RAIDStor: Conduction

More information

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging In some markets and scenarios where competitive advantage is all about speed, speed is measured in micro- and even nano-seconds.

More information

The Memory Factor Samsung Green Memory Solutions for energy efficient Systems Ed Hogan E.Hogan@samsung.com 2 /?

The Memory Factor Samsung Green Memory Solutions for energy efficient Systems Ed Hogan E.Hogan@samsung.com 2 /? The Memory Factor Samsung Green Memory Solutions for energy efficient Systems Ed Hogan E.Hogan@samsung.com 2 / 28 2 /? YYYY.MM.DD / 홍길동 책임 / xxxxxx팀 Special requirements of hosting on Memory Dedicated

More information

A Look Inside Smartphone and Tablets

A Look Inside Smartphone and Tablets A Look Inside Smartphone and Tablets Devices and Trends John Scott-Thomas TechInsights Semicon West July 9, 2013 Teardown 400 phones and tablets a year Four areas: Customer Focus Camera Display Manufacturer

More information

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak

Cray Gemini Interconnect. Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Cray Gemini Interconnect Technical University of Munich Parallel Programming Class of SS14 Denys Sobchyshak Outline 1. Introduction 2. Overview 3. Architecture 4. Gemini Blocks 5. FMA & BTA 6. Fault tolerance

More information

Flash-Friendly File System (F2FS)

Flash-Friendly File System (F2FS) Flash-Friendly File System (F2FS) Feb 22, 2013 Joo-Young Hwang (jooyoung.hwang@samsung.com) S/W Dev. Team, Memory Business, Samsung Electronics Co., Ltd. Agenda Introduction FTL Device Characteristics

More information

Kalray MPPA Massively Parallel Processing Array

Kalray MPPA Massively Parallel Processing Array Kalray MPPA Massively Parallel Processing Array Next-Generation Accelerated Computing February 2015 2015 Kalray, Inc. All Rights Reserved February 2015 1 Accelerated Computing 2015 Kalray, Inc. All Rights

More information

Binary search tree with SIMD bandwidth optimization using SSE

Binary search tree with SIMD bandwidth optimization using SSE Binary search tree with SIMD bandwidth optimization using SSE Bowen Zhang, Xinwei Li 1.ABSTRACT In-memory tree structured index search is a fundamental database operation. Modern processors provide tremendous

More information

Implementation Details

Implementation Details LEON3-FT Processor System Scan-I/F FT FT Add-on Add-on 2 2 kbyte kbyte I- I- Cache Cache Scan Scan Test Test UART UART 0 0 UART UART 1 1 Serial 0 Serial 1 EJTAG LEON_3FT LEON_3FT Core Core 8 Reg. Windows

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

A Generic Network Interface Architecture for a Networked Processor Array (NePA)

A Generic Network Interface Architecture for a Networked Processor Array (NePA) A Generic Network Interface Architecture for a Networked Processor Array (NePA) Seung Eun Lee, Jun Ho Bahn, Yoon Seok Yang, and Nader Bagherzadeh EECS @ University of California, Irvine Outline Introduction

More information

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi

DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs. Presenter: Bo Zhang Yulin Shi DACOTA: Post-silicon Validation of the Memory Subsystem in Multi-core Designs Presenter: Bo Zhang Yulin Shi Outline Motivation & Goal Solution - DACOTA overview Technical Insights Experimental Evaluation

More information

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations

Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations Microelectronic System Design Research Group University Kaiserslautern www.eit.uni-kl.de/wehn Breaking the Interleaving Bottleneck in Communication Applications for Efficient SoC Implementations Norbert

More information

Redefining Flash Storage Solution

Redefining Flash Storage Solution Redefining Flash Storage Solution Through Capacity + Efficiency + Performance + Form PRODUCT GUIDE Holistic Approach to Redefine Flash Storage Novachips is a leading provider of a broad range of Flash

More information

Architecture of Hitachi SR-8000

Architecture of Hitachi SR-8000 Architecture of Hitachi SR-8000 University of Stuttgart High-Performance Computing-Center Stuttgart (HLRS) www.hlrs.de Slide 1 Most of the slides from Hitachi Slide 2 the problem modern computer are data

More information

1 Storage Devices Summary

1 Storage Devices Summary Chapter 1 Storage Devices Summary Dependability is vital Suitable measures Latency how long to the first bit arrives Bandwidth/throughput how fast does stuff come through after the latency period Obvious

More information

Power-Aware High-Performance Scientific Computing

Power-Aware High-Performance Scientific Computing Power-Aware High-Performance Scientific Computing Padma Raghavan Scalable Computing Laboratory Department of Computer Science Engineering The Pennsylvania State University http://www.cse.psu.edu/~raghavan

More information

Improving System Scalability of OpenMP Applications Using Large Page Support

Improving System Scalability of OpenMP Applications Using Large Page Support Improving Scalability of OpenMP Applications on Multi-core Systems Using Large Page Support Ranjit Noronha and Dhabaleswar K. Panda Network Based Computing Laboratory (NBCL) The Ohio State University Outline

More information

LS DYNA Performance Benchmarks and Profiling. January 2009

LS DYNA Performance Benchmarks and Profiling. January 2009 LS DYNA Performance Benchmarks and Profiling January 2009 Note The following research was performed under the HPC Advisory Council activities AMD, Dell, Mellanox HPC Advisory Council Cluster Center The

More information

Cray DVS: Data Virtualization Service

Cray DVS: Data Virtualization Service Cray : Data Virtualization Service Stephen Sugiyama and David Wallace, Cray Inc. ABSTRACT: Cray, the Cray Data Virtualization Service, is a new capability being added to the XT software environment with

More information

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis

Performance Metrics and Scalability Analysis. Performance Metrics and Scalability Analysis Performance Metrics and Scalability Analysis 1 Performance Metrics and Scalability Analysis Lecture Outline Following Topics will be discussed Requirements in performance and cost Performance metrics Work

More information

How to Run the MQX RTOS on Various RAM Memories for i.mx 6SoloX

How to Run the MQX RTOS on Various RAM Memories for i.mx 6SoloX Freescale Semiconductor, Inc. Document Number: AN5127 Application Note Rev. 1, 05/2015 How to Run the MQX RTOS on Various RAM Memories for i.mx 6SoloX 1 Introduction This document describes how to customize

More information

An Exploration of Hybrid Hard Disk Designs Using an Extensible Simulator

An Exploration of Hybrid Hard Disk Designs Using an Extensible Simulator An Exploration of Hybrid Hard Disk Designs Using an Extensible Simulator Pavan Konanki Thesis submitted to the Faculty of the Virginia Polytechnic Institute and State University in partial fulfillment

More information

Low Power AMD Athlon 64 and AMD Opteron Processors

Low Power AMD Athlon 64 and AMD Opteron Processors Low Power AMD Athlon 64 and AMD Opteron Processors Hot Chips 2004 Presenter: Marius Evers Block Diagram of AMD Athlon 64 and AMD Opteron Based on AMD s 8 th generation architecture AMD Athlon 64 and AMD

More information

Embedded Parallel Computing

Embedded Parallel Computing Embedded Parallel Computing Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors Tomas Nordström Course webpage:: Course responsible and examiner: Tomas

More information