Making Multicore Work and Measuring Its Benefits
Markus Levy, President, EEMBC and the Multicore Association
Agenda
- Why multicore?
- Standards and issues in the multicore community
- What is the Multicore Association?
- What is EEMBC?
- The Multicore Communications API
- Multicore performance and scalability
- Expected outcome in 2007
Multicore for Multiple Reasons
- Increased compute density
- Centralization of distributed processing
  - All cores can run entirely different things; for example, four machines running in one package
- Functional partitioning
  - While possible with multiprocessors, multicore benefits from proximity and possible data sharing
  - Core 1 runs security, Core 2 runs the routing algorithm, Core 3 enforces policy, etc.
- Asymmetric multiprocessing (AMP)
  - Minimizes overhead and synchronization issues
  - Core 1 runs a legacy OS, Core 2 runs an RTOS, other cores handle a variety of processing tasks (i.e., workloads where applications can be optimized)
- Parallel pipelining
  - Takes advantage of proximity
- The performance opportunity
Ideal Trend of Transistor Utilization
[Figure: performance (GOPS, log scale from 0.01 to 1000) vs. time (1992-2010), tracking transistor counts alongside successive single-core techniques: pipelining, superscalar, out-of-order (OOO) execution, and SMT/FGMT/CGMT.]
Moore's Gap Justifies More Cores
[Figure: the same performance-vs.-time chart, with "Moore's Gap" marked between the transistor-count curve and delivered single-core performance.]
Decomposing Moore's Gap
- Diminishing returns from single-CPU mechanisms:
  - Pipelining
  - Caching
  - Functional units
- Wire delays
- Power envelopes
Closing Moore's Gap: Less is More
- Increase a resource only if, for every 1% increase in area, there is at least a 1% increase in performance
  - Performance increase >= area increase: the percentage performance gain must be at least the percentage increase in area (and hence power)
- Remember the power of n: we can double performance by going from n to 2n cores, and 2n cores have 2x the area (and power)
- KILL Rule for Multicore: Kill If Less than Linear
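Restated as an inequality (my own rendering of the slide's rule, writing the performance and area deltas as ΔP and ΔA):

```latex
% KILL Rule: grow a resource only if performance scales at least
% linearly with the added area.
\[
  \frac{\Delta P}{P} \;\ge\; \frac{\Delta A}{A}
\]
% Core-count check: going from n to 2n cores doubles the area, and ideal
% scaling doubles the performance, so both sides equal 1 and the rule is
% met exactly; any sublinear speedup would violate it ("Kill If Less
% than Linear").
```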
The Performance Opportunity
- Assumptions: miss rate = 1%, latency to main memory = 100 cycles, 90nm -> 65nm = 2x transistors; smaller CPI (cycles per instruction) is better
- Baseline at 90nm: Processor A with 1x cache, CPI = 1 + 0.01 x 100 = 2
- Push the single core: at 65nm, Processor A with 3x cache (miss rate drops to 0.6%), CPI = 1 + 0.006 x 100 = 1.6
- Go multicore: at 65nm, Processors A and B, each with 1x cache, effective CPI = (1 + 0.01 x 100)/2 = 1
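A minimal sketch of the arithmetic behind these numbers; the function name and parameters are my own illustration, not from the slides:

```c
#include <stdio.h>

/* Effective CPI for a simple in-order core: base CPI of 1 plus the
 * average memory stall per instruction (miss rate x miss latency).
 * Dividing by the core count gives the throughput-equivalent CPI
 * when the workload splits perfectly across cores (ideal scaling). */
static double effective_cpi(double miss_rate, double latency, int cores)
{
    return (1.0 + miss_rate * latency) / cores;
}

int main(void)
{
    printf("90nm, 1 core, 1x cache:  %.1f\n", effective_cpi(0.010, 100, 1)); /* 2.0 */
    printf("65nm, 1 core, 3x cache:  %.1f\n", effective_cpi(0.006, 100, 1)); /* 1.6 */
    printf("65nm, 2 cores, 1x cache: %.1f\n", effective_cpi(0.010, 100, 2)); /* 1.0 */
    return 0;
}
```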
Multicore Issues to Solve
- Hardware/software communication interconnect
- Modeling, simulation, and hardware/software co-validation
- Inter-core resource management
- Debugging: connectivity, synchronization
- Distributed power management
- Load balancing
- Algorithm partitioning
- Performance analysis
Enabling the Multicore Ecosystem
- Initial engagement began in May 2005
- Multicore ecosystem enablement through industry-wide participation, including system developers, processor vendors, infrastructure developers, device manufacturers, EDA vendors, and software and application developers
- Current efforts: communications APIs and a debug API
Analyzing Multicore Processing Performance
- Embedded Microprocessor Benchmark Consortium (EEMBC): industry-standard benchmarks since 1997
- Evaluating current and future development of MP platforms
- Uncovering MP bottlenecks
- Comparing multiprocessor vs. uniprocessor
The Abundance of Parallelism
- Networking/telecom
- Security
- General purpose: databases, web servers, multiple tasks
  - Databases for differentiating services, etc.
- Graphics and gaming
- Communications, cellphones
- Wireless
- Video imaging
Possible Approaches
- Shared memory: the semantics of a thread-based API such as POSIX
- Message based
- Accelerator based: today's asymmetric, heterogeneous MP solutions lack a common programming paradigm
- Both the shared-memory and message-based approaches are valid for embedded applications, covering both cache-coherent and distributed-memory architectures
Multicore Debug Challenges
- Need a higher level of abstraction with common debug semantics
  - No one wants 16 different debugger instances running to debug a 16-core system!
- Need aggregate control and state inspection
- Dynamic debugging needed: the ability to look at things dynamically, without stopping the cores
- The Multicore Association plans to develop a standard debug protocol/API so vendors can support this type of development
Communication API Target Domain
- Referred to as MCAPI
- Embedded multiprocessing requiring task-to-task communication (sketched below)
  - Multiple cores on a chip and multiple chips on a board
  - Closely distributed and/or tightly coupled systems
  - Heterogeneous and homogeneous systems: cores, interconnects, memory architectures, OSes
- Scalable: from 2 to 1000s of cores
- Assumes the cost of messaging/routing is less than the computation of a node
- Allows implementations with significantly lower latency, overhead, and memory footprint than MPI and TIPC
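As a rough illustration of the task-to-task messaging style MCAPI targets, here is a hypothetical sketch in C. The specification was still in draft at the time, so every name below (mcapi_create_endpoint, mcapi_msg_send, and so on) is an illustrative assumption, not the published API; the prototypes stand in for a runtime implementation:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical MCAPI-style types; the real draft spec may differ. */
typedef int mcapi_status_t;
typedef int mcapi_endpoint_t;

/* Assumed primitives: create a local endpoint, look up a remote one,
 * and exchange connectionless messages between them. */
mcapi_endpoint_t mcapi_create_endpoint(int port, mcapi_status_t *status);
mcapi_endpoint_t mcapi_get_endpoint(int node, int port, mcapi_status_t *status);
void mcapi_msg_send(mcapi_endpoint_t from, mcapi_endpoint_t to,
                    const void *buf, size_t len, mcapi_status_t *status);

/* Task on core 0: send a sensor reading to the task on core 1. */
void producer_task(void)
{
    mcapi_status_t st;
    mcapi_endpoint_t local  = mcapi_create_endpoint(/*port=*/0, &st);
    mcapi_endpoint_t remote = mcapi_get_endpoint(/*node=*/1, /*port=*/0, &st);
    uint32_t reading = 42;
    mcapi_msg_send(local, remote, &reading, sizeof reading, &st);
}
```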
Unified API for Closely Distributed Embedded Systems
- Forms a layer on top of which other abstractions or applications may be built
- Provides the OS vendor with a uniform foundation for present and future products, and abstracts the platform capabilities
- Provides the processor vendor with better determinism than MPI and a reduced support burden
- Provides software developers with better ease of use and more powerful development methodologies
[Diagram: layered architecture with applications and further abstractions (OS/middleware) sitting on CAPI, over the interconnect topology, with CPU/DSP cores and hardware accelerators below. Courtesy of PolyCore Software.]
CAPI Features and Performance Considerations
- Very small runtime footprint: an OS is okay, but not required
- Significantly smaller code size; significantly lower latency
- Low-level messaging API and low-level streaming API
- High efficiency
  - Enables efficient use of hardware accelerators
  - Zero-copy features: no unnecessary data movement (see the sketch below)
- First draft of the specification targeted at Q2 2007
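To illustrate what a zero-copy receive path might look like: the runtime hands the application a pointer into its own buffer rather than copying the payload out, and the application releases the buffer when done. All names here are illustrative assumptions, since the spec was still in draft; the prototypes stand in for a runtime implementation:

```c
#include <stddef.h>

typedef int mcapi_status_t;
typedef int mcapi_pktchan_recv_hndl_t;

/* Assumed zero-copy primitives: recv yields a pointer into a
 * runtime-owned buffer (no copy); release returns it to the pool. */
void mcapi_pktchan_recv(mcapi_pktchan_recv_hndl_t h, void **buf,
                        size_t *len, mcapi_status_t *status);
void mcapi_pktchan_release(void *buf, mcapi_status_t *status);

void process_packets(mcapi_pktchan_recv_hndl_t h)
{
    mcapi_status_t st;
    void *pkt;
    size_t len;
    for (;;) {
        mcapi_pktchan_recv(h, &pkt, &len, &st);  /* no data copy */
        /* ... inspect and process the packet in place ... */
        mcapi_pktchan_release(pkt, &st);         /* hand buffer back */
    }
}
```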
Automotive Case Study in Engine Control
- Sensors are continuously monitored to determine the appropriate engine parameter settings
- One general-purpose processor handles the data tasks polling the sensors
- Another general-purpose processor handles the scheduled control task, which reads the data collected by the data tasks and computes values to apply to various actuators in the engine
- This workload and the associated data transfer imply that synchronization between the control and data tasks must be minimal, to avoid negative impacts on the latency of the control task
Fast Synchronization Between Data and Control Tasks
[Diagram: one general-purpose processor core runs data tasks polling an array of sensors (1); a second core reads the data and computes values to apply to the actuators, with messages sending the updates (2); non-blocking communication moves on if data is not yet available for a given sensor (3).]
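One way to get this kind of minimal synchronization is a per-sensor atomic mailbox: the data task publishes the latest reading, and the control task reads it without ever blocking. A minimal sketch using C11 atomics; the structure and names are my own illustration, not from the case study:

```c
#include <stdatomic.h>
#include <stdint.h>

#define NUM_SENSORS 10

/* One lock-free mailbox per sensor: the data task overwrites the
 * latest reading; the control task reads whatever is current. */
static _Atomic uint32_t sensor_mailbox[NUM_SENSORS];
static _Atomic int      sensor_fresh[NUM_SENSORS];

/* Data task (core 0): poll a sensor and publish its reading. */
void publish_reading(int sensor, uint32_t value)
{
    atomic_store_explicit(&sensor_mailbox[sensor], value,
                          memory_order_release);
    atomic_store_explicit(&sensor_fresh[sensor], 1, memory_order_release);
}

/* Control task (core 1): non-blocking read; skips sensors with no
 * new data, so a stale sensor never delays the control loop. */
void control_step(void)
{
    for (int s = 0; s < NUM_SENSORS; s++) {
        if (!atomic_exchange_explicit(&sensor_fresh[s], 0,
                                      memory_order_acquire))
            continue;  /* no update yet; move on (non-blocking) */
        uint32_t v = atomic_load_explicit(&sensor_mailbox[s],
                                          memory_order_acquire);
        /* ... compute and apply actuator values from v ... */
        (void)v;
    }
}
```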
Packet Processing Case Study
- Streams contain packets of 64 bytes to 1 KB; an example could be a TCP engine
- Multicore chip contains 3 to 100 cores, depending on required bandwidth
- Each core has a small amount of local private memory or cache
- Some cores may be specialized hardware accelerators
- Inter-core communication uses the on-chip interconnect
- Cores can also communicate through some common shared memory
Packet Transfer Between Cores
- Mode 1: packets streamed between the modules without going through external shared memory
- Mode 2: packets placed in shared memory between modules; modules access the packets from shared memory
- Hybrid mode: packets placed in shared memory, while packet descriptors and metadata are streamed directly between cores (sketched below)
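A minimal sketch of what the hybrid mode's streamed descriptor might carry; the struct layout and names are assumptions for illustration:

```c
#include <stdint.h>

/* Hybrid mode: the payload stays in shared memory; only this small
 * descriptor is streamed core-to-core over the on-chip interconnect. */
struct pkt_descriptor {
    uint32_t shmem_offset;  /* where the payload sits in shared memory */
    uint16_t length;        /* payload length: 64 bytes to 1 KB */
    uint16_t flags;         /* protocol/processing metadata */
};

/* The consuming core resolves the descriptor to a payload pointer
 * without the payload itself ever crossing the interconnect. */
static inline uint8_t *pkt_payload(uint8_t *shmem_base,
                                   const struct pkt_descriptor *d)
{
    return shmem_base + d->shmem_offset;
}
```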
Quick Synchronization
- Communication between the cores follows a stream pattern, occurring once or twice per packet
- Processing is both parallel and pipelined
- Some modules are compute intensive; others do little computation but require quick inspection of metadata
Benchmarking Multicore Scalability
- Single versus multiprocessor
- Memory bandwidth
- OS support for scheduling
- What's important?
Benchmarking Multicore
- Software assumes homogeneity across processing elements
- Scalability of general-purpose processing, as opposed to accelerator offloading
- Threads: threading is the standard technique to express concurrency; each thread has symmetric memory visibility
- Abstraction: the test harness abstracts the hardware into a basic software threading model (see the sketch below)
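A minimal sketch of what such a harness might look like with POSIX threads; the structure and names are my own illustration, not EEMBC's actual harness code:

```c
#include <pthread.h>
#include <stdio.h>

#define NUM_WORKERS 4

/* The harness exposes the hardware only through this thread model:
 * each worker gets an ID and runs the same benchmark kernel. */
static void *worker(void *arg)
{
    long id = (long)arg;
    /* ... run the benchmark kernel for this worker's slice ... */
    printf("worker %ld done\n", id);
    return NULL;
}

int main(void)
{
    pthread_t threads[NUM_WORKERS];
    for (long i = 0; i < NUM_WORKERS; i++)
        pthread_create(&threads[i], NULL, worker, (void *)i);
    for (int i = 0; i < NUM_WORKERS; i++)
        pthread_join(threads[i], NULL);
    return 0;
}
```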
Scaling on Multiple Cores
[Chart: scaling on multiple cores, based on preliminary testing; performance across workloads.]
Processing of Multiple Data Streams
- Channels 1 through n feed VoIP processing
- Common code runs over multiple threads
- Shows how well a solution can scale over scalable data inputs
Decompose a Single Task Into Multiple Subtasks
- Example: MPEG decode
- Parallelize a single algorithm to share the workload between multiple processors
- Multiple threads work on a common data set
- Demonstrates fine-grain parallelism support (see the sketch below)
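A minimal sketch of this fine-grain pattern: several threads share one frame buffer, each handling a band of rows. This is not a real MPEG decoder, and all names and dimensions are illustrative:

```c
#include <pthread.h>

#define FRAME_ROWS  480
#define NUM_THREADS 4

/* Shared frame buffer: all threads work on this common data set. */
static unsigned char frame[FRAME_ROWS][640];

struct slice { int row_begin, row_end; };

/* Each thread processes a contiguous band of rows of the same frame,
 * the shared-data, fine-grain parallelism this scenario tests. */
static void *decode_slice(void *arg)
{
    struct slice *s = (struct slice *)arg;
    for (int r = s->row_begin; r < s->row_end; r++) {
        /* ... decode/transform row r of the frame ... */
        frame[r][0] ^= 1;  /* placeholder for real per-row work */
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NUM_THREADS];
    struct slice s[NUM_THREADS];
    int rows_per = FRAME_ROWS / NUM_THREADS;
    for (int i = 0; i < NUM_THREADS; i++) {
        s[i].row_begin = i * rows_per;
        s[i].row_end   = (i == NUM_THREADS - 1) ? FRAME_ROWS
                                                : (i + 1) * rows_per;
        pthread_create(&t[i], NULL, decode_slice, &s[i]);
    }
    for (int i = 0; i < NUM_THREADS; i++)
        pthread_join(t[i], NULL);
    return 0;
}
```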
Execution of Multiple Different Algorithms
- Example: MPEG decode plus MPEG encode running concurrently
- Concurrency over both the data and the instructions
- Tests the scalability of a solution for general-purpose processing
- The system (including the OS) must handle workloads from multiple concurrent algorithms
- Example: a mobile phone supporting a video call (encode and decode algorithms, UI, possibly networking)
Multicore Opportunities and Challenges
- Multicore technology is inevitable; it's time to begin implementing
- The Multicore Association will help ease the transition
- EEMBC will help analyze the performance benefits
- Patent pending on new multicore technique