Embedded Parallel Computing



Lecture 5 - The anatomy of a modern multiprocessor, the multicore processors
Tomas Nordström
Course webpage: <http://www.hh.se/do8003>
Course responsible and examiner: Tomas Nordström, Tomas.Nordstrom@hh.se; Room E313; Tel: +46 35 16 7334

Outline
- Modern Multicore
- Symmetrical Multiprocessing (SMP)
- Multicore
- ARM Multi-core architecture

MIMD: Symmetrical Multiprocessing
A multi-core architecture with Symmetrical Multiprocessing (SMP) is defined by the following characteristics:
- The architecture consists of two or more identical CPU cores.
- All cores share a common system memory and are controlled by a single operating system.
- Each CPU can operate independently on different workloads and, whenever possible, can also share workloads with the other CPUs.

Multicore [Wikipedia]
- A multi-core processor is a single computing component with two or more independent actual processors (called "cores"), which are the units that read and execute program instructions.
- A many-core processor is a multi-core processor in which the number of cores is large enough that traditional multiprocessor techniques are no longer efficient, largely because of congestion in supplying instructions and data to the many processors. Several tens of cores!
[Figure: An Intel Core 2 Duo E6750 dual-core processor]

Multicore
- Symmetric multiprocessing (SMP) designs using discrete CPUs have existed for a long time.
- Thus the issues of implementing a multicore processor architecture and supporting it with software are well known.
- Utilizing a proven processing-core design without architectural changes reduces design risk significantly.

Single Core Design is Hitting the f... Wall
Greatly diminished gains in processor performance from increasing the operating frequency. This is due to three primary factors:
- The memory wall
- The ILP wall
- The power wall

Multicore SMP
- The proximity of multiple CPU cores on the same die allows the cache coherency circuitry to operate at a much higher clock rate than is possible if the signals have to travel off-chip.
- Combining equivalent CPUs on a single die significantly improves the performance of cache snoop (also called bus snooping) operations.

Example: ARM MPCore

Thread-level Parallelism
- For thread-level parallelism, ARM needed to improve exception handling to prepare for the increased complexity of handling multithreading on multiple processors.
- These requirements added inherent complexity to the interrupt handler, scheduler, and context switch.

MPCore Semaphores
Earlier ARM architectures implemented semaphores with the swap instruction, which held the external bus until completion. One processor could therefore hold the entire bus, locking out all other processors. Unacceptable!
ARMv6 introduced two new instructions, load-exclusive (LDREX) and store-exclusive (STREX), which take advantage of an exclusive monitor in memory:
- LDREX loads a value from memory and sets the exclusive monitor to watch that location.
- STREX checks the exclusive monitor and, if no other write has taken place to that location, performs the store to memory and returns a value to indicate whether the data was written.
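The LDREX/STREX behaviour described above can be sketched as a small software model. This is an illustration only, not ARM's implementation: the class ExclusiveMonitor, the function atomic_increment, and the interfere hook are hypothetical names, and the model runs sequentially rather than on real hardware. It follows the ARM convention that STREX reports 0 on success and 1 on failure.

```python
class ExclusiveMonitor:
    """Toy model of an exclusive monitor over a word-addressed memory."""

    def __init__(self):
        self.memory = {}
        self.watch = {}  # cpu id -> address currently tagged by that CPU

    def ldrex(self, cpu, addr):
        """Load a value and tag the address for this CPU."""
        self.watch[cpu] = addr
        return self.memory.get(addr, 0)

    def store(self, addr, value):
        """Ordinary store: clears every CPU's tag on this address."""
        self.memory[addr] = value
        for cpu in [c for c, a in self.watch.items() if a == addr]:
            del self.watch[cpu]

    def strex(self, cpu, addr, value):
        """Store only if this CPU's tag is intact; 0 = written, 1 = not."""
        if self.watch.get(cpu) != addr:
            return 1
        self.store(addr, value)  # also clears the tags of other CPUs
        return 0


def atomic_increment(mon, cpu, addr, interfere=None):
    """Retry loop: LDREX, compute, STREX, repeat until the store succeeds."""
    attempts = 0
    while True:
        attempts += 1
        value = mon.ldrex(cpu, addr)
        if interfere is not None and attempts == 1:
            interfere()  # hypothetical hook: another CPU writes in between
        if mon.strex(cpu, addr, value + 1) == 0:
            return attempts
```

Unlike the swap instruction, nothing is locked while the new value is computed; a conflicting write simply makes the store fail, and the loop retries with the fresh value.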

Physically Tagged Caches
Should the cache use virtual or physical addresses?
- A virtually tagged cache must be flushed every time a context switch takes place, because the cache contains old virtual-to-physical translations.
- In ARM11 with MPCore, the memory management unit logic resides between the level 1 cache and the processor core, so the cache is tagged with physical addresses.

Atomic Instructions
- Traditionally, swap-based and compare-and-exchange-based semaphores have been used to control access to critical data.
- SMP systems often aim for lock-free synchronization.

cmpxchg8b
Many lock-free routines use the Intel cmpxchg8b instruction because it can compare and exchange 8 bytes of data atomically. Typically, this involves 4 bytes for the payload and 4 bytes to distinguish between payload versions that could otherwise have the same value: the so-called A-B-A problem.

The ARM exclusives provide atomicity using the data address rather than the data value, so routines can atomically exchange data without experiencing the A-B-A problem. Exploiting this would, however, require rewriting much of the existing two-word exclusive code. Consequently, ARM added instructions for performing load-and-store exclusives with various payload sizes, including 8 bytes, thus ensuring the direct portability of existing multithreaded code.
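The A-B-A problem and the payload-plus-version remedy can be illustrated with a small sequential sketch. The names Cell, cas_value, cas_versioned, and run_aba are hypothetical, and the "threads" are simulated by ordering the calls by hand; no real atomic instruction is involved.

```python
class Cell:
    """A shared word holding a payload and a version counter."""

    def __init__(self, payload):
        self.payload = payload
        self.version = 0

    def cas_value(self, expected, new):
        """Compare the payload only, as a single-word exchange would."""
        if self.payload != expected:
            return False
        self.payload = new
        return True

    def cas_versioned(self, expected, expected_version, new):
        """Compare payload and version together, as with an 8-byte exchange."""
        if (self.payload, self.version) != (expected, expected_version):
            return False
        self.payload = new
        self.version += 1
        return True


def run_aba(use_version):
    """Return True if thread 1's exchange succeeds after an A -> B -> A change."""
    cell = Cell("A")
    seen, seen_version = cell.payload, cell.version  # thread 1 reads "A"
    # Thread 2 runs while thread 1 is paused: A -> B -> A.
    cell.cas_versioned("A", cell.version, "B")
    cell.cas_versioned("B", cell.version, "A")
    if use_version:
        return cell.cas_versioned(seen, seen_version, "C")
    return cell.cas_value(seen, "C")
```

With the payload-only comparison, thread 1 cannot tell that the data changed and changed back, so its exchange succeeds; with the version field, the stale version number makes the exchange fail.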

Misc MP improvements
- Improved access to localized data
- Power-conscious spin-locks
- Weakly ordered memory consistency

Two Main Enhancements
The ARM11 multiprocessor includes two main SMP enhancements:
- Generic Interrupt Controller (GIC), providing interprocessor communication
- Snoop Control Unit (SCU), an intelligent memory-communication system providing cache coherence

Cache Coherency
The ARM11 MPCore implements a Snoop Control Unit (SCU) between the processors, operating at CPU frequency. This configuration also provides a very rapid path for data to move directly between each CPU's cache.


MESI
- Modified: The cache line is present only in the current cache and is dirty; it has been modified from the value in main memory. The cache is required to write the data back to main memory at some time in the future, before permitting any other read of the (no longer valid) main memory state. The write-back changes the line to the Exclusive state.
- Exclusive: The cache line is present only in the current cache, but is clean; it matches main memory. It may be changed to the Shared state at any time, in response to a read request. Alternatively, it may be changed to the Modified state when writing to it.
- Shared: Indicates that this cache line may be stored in other caches of the machine and is "clean"; it matches main memory. The line may be discarded (changed to the Invalid state) at any time.
- Invalid: Indicates that this cache line is invalid.
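The four states can be exercised with a minimal sketch that tracks a single cache line across several caches. The class MesiLine and its writebacks counter are hypothetical, and the model is a textbook snooping simplification (no bus timing, no evictions), not the SCU's actual logic.

```python
M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"


class MesiLine:
    """State of one cache line in each of n_caches caches."""

    def __init__(self, n_caches):
        self.state = [I] * n_caches
        self.writebacks = 0  # number of write-backs to main memory

    def read(self, cpu):
        if self.state[cpu] != I:
            return  # read hit: no state change
        holders = [c for c in range(len(self.state))
                   if c != cpu and self.state[c] != I]
        for c in holders:
            if self.state[c] == M:
                self.writebacks += 1  # dirty data must reach memory first
            self.state[c] = S
        # Alone in the system: Exclusive. Otherwise: Shared.
        self.state[cpu] = S if holders else E

    def write(self, cpu):
        for c in range(len(self.state)):
            if c != cpu:
                if self.state[c] == M:
                    self.writebacks += 1
                self.state[c] = I  # invalidate every other copy
        self.state[cpu] = M
```

For example, with two caches: a read by CPU 0 gives Exclusive, a write makes it Modified, and a subsequent read by CPU 1 forces one write-back and leaves both copies Shared.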

MOESI
The processor maintains cache coherence with an optimized version of the MESI (modified, exclusive, shared, invalid) protocol. In addition to the four common MESI protocol states, there is a fifth "Owned" state representing data that is both modified and shared. This avoids the need to write modified data back to main memory before sharing it.
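The benefit of the Owned state can be sketched with a single-line model in which a Modified line that another cache reads becomes Owned instead of being written back. MoesiLine, evict, and the writebacks counter are hypothetical names, and the transitions are a simplified textbook version rather than any specific processor's protocol.

```python
M, O, E, S, I = "Modified", "Owned", "Exclusive", "Shared", "Invalid"


class MoesiLine:
    """State of one cache line in each of n_caches caches."""

    def __init__(self, n_caches):
        self.state = [I] * n_caches
        self.writebacks = 0  # number of write-backs to main memory

    def read(self, cpu):
        if self.state[cpu] != I:
            return  # read hit: no state change
        holders = [c for c in range(len(self.state))
                   if c != cpu and self.state[c] != I]
        for c in holders:
            if self.state[c] == M:
                self.state[c] = O  # supply the dirty data cache-to-cache
            elif self.state[c] == E:
                self.state[c] = S
        self.state[cpu] = S if holders else E

    def write(self, cpu):
        for c in range(len(self.state)):
            if c != cpu:
                self.state[c] = I  # current data moves to the writer
        self.state[cpu] = M

    def evict(self, cpu):
        if self.state[cpu] in (M, O):
            self.writebacks += 1  # dirty data reaches memory only now
        self.state[cpu] = I
```

In this model, sharing a modified line costs no memory write-back; the write-back is deferred until the owning cache evicts the line.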


Interrupt System
Generic Interrupt Controller (GIC)
- External interrupts
- Internal interrupts
Example: One processor allocates virtual memory -> all others need to update their memory translations -> ARM uses the GIC to quickly signal that between processors.

Distributed Interrupt Controller
The Distributed Interrupt Controller interfaces to the MP11 CPUs and handles:
- masking of interrupts
- prioritization of the interrupts
- distribution of the interrupts to the target MP11 CPUs
- tracking the status of interrupts
- generation of interrupts by software
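The five responsibilities listed above can be sketched as a toy model. The class Distributor and all of its method names are hypothetical and do not correspond to GIC registers; the sketch assumes that a lower priority number means a more urgent interrupt, drops masked interrupts outright, and ignores preemption.

```python
class Distributor:
    """Toy interrupt distributor: masking, priority, targets, status, SGIs."""

    def __init__(self, n_cpus):
        self.n_cpus = n_cpus
        self.enabled = set()   # masking: only enabled IRQs are forwarded
        self.priority = {}     # irq -> priority (lower value = more urgent)
        self.targets = {}      # irq -> set of target CPUs
        self.status = {}       # (irq, cpu) -> "pending" or "active"

    def configure(self, irq, priority, targets, enabled=True):
        self.priority[irq] = priority
        self.targets[irq] = set(targets)
        if enabled:
            self.enabled.add(irq)
        else:
            self.enabled.discard(irq)

    def raise_irq(self, irq):
        """Hardware interrupt: mark pending on every configured target CPU."""
        if irq in self.enabled:
            for cpu in self.targets[irq]:
                self.status.setdefault((irq, cpu), "pending")

    def send_software_irq(self, irq, cpus):
        """Software-generated interrupt aimed at an explicit list of CPUs."""
        if irq in self.enabled:
            for cpu in cpus:
                self.status.setdefault((irq, cpu), "pending")

    def acknowledge(self, cpu):
        """Return the most urgent pending IRQ for this CPU, or None."""
        pending = [irq for (irq, c), st in self.status.items()
                   if c == cpu and st == "pending"]
        if not pending:
            return None
        irq = min(pending, key=lambda i: (self.priority[i], i))
        self.status[(irq, cpu)] = "active"
        return irq

    def end_of_interrupt(self, irq, cpu):
        self.status.pop((irq, cpu), None)
```

The virtual-memory example from the interrupt-system slide would map onto send_software_irq: the allocating CPU signals all other CPUs, and each one acknowledges the interrupt and updates its translations.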


NVIDIA Tegra 2

Applications Using MPCore
Frostbite is an example of a game engine that employs job-based parallelism. This engine is used by the popular Battlefield: Bad Company series of games, and it can use as many threads as the underlying hardware platform provides. The engine performs the primary Game and Render tasks on the GPU and divides the other system-related work into jobs. Each job typically consists of 15K to 200K lines of C++ code, with an average job size of around 25K lines of code. Most of these jobs are independent, while some have interdependencies. Each frame of the game typically contains two hundred to three hundred jobs, and the engine assigns the jobs to all available hardware cores.
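Job-based parallelism with interdependencies can be sketched as follows. This is not Frostbite's scheduler: run_frame and the job names in the test are hypothetical, jobs are run in simple waves of mutually independent work, and CPython threads illustrate the scheduling structure rather than true multi-core speed-up.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def run_frame(jobs, deps, workers=None):
    """Run one frame's jobs on a pool sized to the available cores.

    jobs: name -> callable taking no arguments
    deps: name -> iterable of job names that must finish first
    Returns (results by job name, list of waves in execution order).
    """
    workers = workers or os.cpu_count() or 1
    done, results, waves = set(), {}, []
    remaining = set(jobs)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while remaining:
            # A job is ready once all of its dependencies have completed.
            ready = sorted(j for j in remaining
                           if set(deps.get(j, ())) <= done)
            if not ready:
                raise ValueError("dependency cycle among jobs")
            outputs = pool.map(lambda name: jobs[name](), ready)
            for name, output in zip(ready, outputs):
                results[name] = output
            done.update(ready)
            remaining.difference_update(ready)
            waves.append(ready)
    return results, waves
```

Independent jobs land in the same wave and are handed to the pool together; a job with dependencies waits for a later wave.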

Task Level Parallelism on Frostbite Game Engine

Questions
Study-support questions

Links
- Multi-core <http://en.wikipedia.org/wiki/multi_core>
- SMP - Symmetric Multiprocessor System <http://en.wikipedia.org/wiki/symmetric_multiprocessor>
- ABA problem <http://en.wikipedia.org/wiki/aba_problem>
- MESI <http://en.wikipedia.org/wiki/mesi>
- MOESI <http://en.wikipedia.org/wiki/moesi_protocol>
- Embedded moves to multicore <http://embedded-computing.com/embedded-moves-multicore>
- Goodacre, J.; Sloss, A.N., "Parallelism and the ARM instruction set architecture," IEEE Computer, vol. 38, no. 7, pp. 42-50, July 2005. doi: 10.1109/MC.2005.239 <http://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=1463106&isnumber=31455>
- Goodacre, J., "Details of a New Cortex Processor Revealed, Cortex-A9," presentation at the ARM Developers' Conference, October 2007. <http://www.arm.com/files/downloads/cortex-a9_devcon-talk_introduction_final-02.pdf>
- Stevens, A., "Introduction to AMBA 4 ACE," ARM white paper, June 6, 2011. <http://www.arm.com/files/pdf/cachecoherencywhitepaper_6june2011.pdf>