Institut d'Electronique et des Télécommunications de Rennes (IETR). Image Team




Institut d'Électronique et des Télécommunications de Rennes (IETR), Image Team. March 13, 2015.

The team. IETR Image Team: 10 teacher-researchers, about 15 PhD students and post-docs. Expertise. Image: analysis, compression. Architecture: multi-core, embedded systems. Research themes: image analysis for semantic indexation and embedded vision, 2D/3D image and video coding, cryptography, architecture.

IETR Image Team: Architecture theme

Objectives. Dataflow-based methods and tools for optimizing signal processing applications on distributed and embedded platforms. Optimization criteria: throughput, latency, energy, memory, programming time.

Target Applications. MPEG decoders (with MPEG participation): MPEG-4 Part 2, AVC, SVC, HEVC, SHVC. Computer vision and 3D processing: stereo vision, SLAM. Cryptography: chaotic-based cryptography. Telecommunications: 3GPP LTE eNodeB.

Target Platforms. Texas Instruments Keystone I and II. Zboard with Xilinx Zynq. Odroid with Samsung Exynos 5. Kalray MPPA.

Methods. Optimizing throughput, latency, energy, memory, and programming time through: dataflow programming, SIMD and parallelism, data representation, energy-aware processing.

Software. Open SVC Decoder (C code, x86 ASM): http://sourceforge.net/projects/opensvcdecoder. Open HEVC Decoder (C code, x86 & ARM ASM; FFmpeg): https://github.com/openhvc/openhvc. Orcc Compiler (Java, Xtend): http://orcc.sourceforge.net. PREESM Rapid Prototyping Tool (Java, Xtend): http://preesm.sourceforge.net/website

Academic Partners

Industrial Partners


Introduction: Motivations. [Figure: software productivity gap, log scale, 1990 to 2015.] Transistors per chip: x2 every 18 months. Lines of code per chip: x2 every 10 months. Lines of code per day: x2 every 5 years. Source: ITRS & Hardware-dependent Software, Ecker et al., Springer.

Introduction: Hardware Complexity. [Figure: number of PEs per SoC (0 to 5000) versus year (2010 to 2025).] Source: ITRS System Drivers 2011.

What is PREESM? [Figure: workflow overview. Algorithm and Architecture descriptions enter PREESM; PREESM + C compiler produce code for a Simulator + Debugger + Profiler and for a multicore runtime on a platform made of PEs, DSPs, peripherals, and main memory.]

What is PREESM? (Parallel Real-time Embedded Executives Scheduling Method). A rapid prototyping framework. An open-source project. A set of Eclipse plugins. PREESM tool and applications are available on GitHub.

Using PREESM to design an embedded system. To provide metrics: design of parallel algorithms, throughput/latency evaluation, predictable memory footprints. To build a working prototype: code generation for multicore architectures, guaranteed deadlock-freeness, inter-core communications. For design-space exploration: seamless porting to a new architecture, legacy code reusability.

Inputs. [Figure: PREESM workflow overview, as on the "What is PREESM?" slide, with the Algorithm and Architecture inputs.]

PREESM Inputs: algorithm descriptions using dataflow graphs. Synchronous Dataflow (SDF): actors and data ports, connected by FIFO queues. [Figure: SDF graph with actors A, B, C, and D; each port is annotated with a token rate of 1 or 2.] E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.
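The fixed port rates of an SDF graph determine how many times each actor fires per graph iteration, through the balance equations. The following is a minimal sketch of that computation. The edge list is a hypothetical four-actor graph, not the exact topology of the slide figure, and the function name is illustrative.

```python
from fractions import Fraction
from math import lcm

# Hypothetical SDF graph (an assumption, not read from the slide):
# (producer, consumer, tokens produced per firing, tokens consumed per firing)
EDGES = [
    ("A", "B", 1, 1),
    ("A", "C", 2, 1),
    ("B", "C", 2, 1),
    ("C", "D", 1, 2),
]


def repetition_vector(edges):
    """Solve q[src] * produced == q[dst] * consumed for a connected graph."""
    actors = sorted({a for src, dst, _, _ in edges for a in (src, dst)})
    ratio = {actors[0]: Fraction(1)}
    changed = True
    while changed:  # propagate firing ratios along the edges
        changed = False
        for src, dst, prod, cons in edges:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * prod / cons
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * cons / prod
                changed = True
    for src, dst, prod, cons in edges:  # every edge must be balanced
        if ratio[src] * prod != ratio[dst] * cons:
            raise ValueError("inconsistent rates: no bounded-memory schedule")
    scale = lcm(*(r.denominator for r in ratio.values()))
    return {a: int(r * scale) for a, r in ratio.items()}


print(repetition_vector(EDGES))
```

With these assumed rates, C fires twice per iteration and the other actors fire once.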

PREESM Inputs: algorithm descriptions using dataflow graphs. Data-driven execution: an actor is fired when its input FIFOs contain enough data tokens. [Figure: the SDF graph with actors A, B, C, and D, and its execution on Core 1 in the order A, B, C, C, D.] E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.
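The firing rule can be illustrated with a small single-core simulation. This is a sketch under assumptions: the FIFO table describes a hypothetical graph whose rates were chosen so that one iteration fires C twice, and the firing counts are given rather than derived.

```python
# Hypothetical graph (an assumption, not read from the slide):
# FIFO name -> (producer, consumer, produced per firing, consumed per firing)
FIFOS = {
    "AB": ("A", "B", 1, 1),
    "AC": ("A", "C", 2, 1),
    "BC": ("B", "C", 2, 1),
    "CD": ("C", "D", 1, 2),
}
FIRINGS = {"A": 1, "B": 1, "C": 2, "D": 1}  # firings per graph iteration


def run_iteration(fifos, firings):
    """Fire actors one at a time; an actor may fire only if every
    input FIFO holds at least as many tokens as it consumes."""
    tokens = {name: 0 for name in fifos}
    remaining = dict(firings)
    order = []
    while any(remaining.values()):
        for actor in sorted(remaining):
            if remaining[actor] == 0:
                continue
            inputs = [(n, f[3]) for n, f in fifos.items() if f[1] == actor]
            if all(tokens[n] >= need for n, need in inputs):
                for n, need in inputs:
                    tokens[n] -= need          # consume input tokens
                for n, f in fifos.items():
                    if f[0] == actor:
                        tokens[n] += f[2]      # produce output tokens
                remaining[actor] -= 1
                order.append(actor)
                break
        else:
            raise RuntimeError("deadlock: no actor can fire")
    return order, tokens


order, tokens = run_iteration(FIFOS, FIRINGS)
print(order, tokens)
```

After one full iteration every FIFO is back to its initial (empty) state, which is what allows the same schedule to repeat indefinitely.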

PREESM Inputs: algorithm descriptions using dataflow graphs. Expression of parallelisms: task, data, pipeline, and actor-internal parallelism. [Figure: the SDF graph with actors A, B, C (fired x2), and D mapped onto Core 1, Core 2, and Core 3, with annotations for pipeline, task, data, and internal parallelism.] E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.

PREESM Inputs: PiSDF (Parameterized and Interfaced Synchronous Dataflow). [Figure: example PiSDF graph. Recoverable labels: "in" and "out" interfaces; Read Header, Read, Filter, Send, SetNbSlices, and Kernel; Image Size; parameters Size (= 4) and N (= 2); port rates Size and Size/N.]

PREESM Inputs: PiSDF (Parameterized and Interfaced Synchronous Dataflow). PiSDF is: hierarchical and compositional, statically parameterizable, dynamically reconfigurable, with lightweight runtime overhead. PiSDF fosters: predictability, parallelism, developer-friendliness. K. Desnos, M. Pelcat, J.-F. Nezan, S. S. Bhattacharyya, S. Aridhi, PiMM: Parameterized and Interfaced Dataflow Meta-Model for MPSoCs Runtime Reconfiguration, SAMOS XIII.
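To illustrate what parameterized rates mean in practice, here is a minimal sketch assuming a single edge where a producer emits Size tokens per firing and a consumer reads Size/N tokens per firing. The function names and the edge itself are hypothetical. Once the parameters are set, the rates are plain integers and the edge behaves like an ordinary SDF edge.

```python
def resolve_rates(params):
    """Turn symbolic rates (Size, Size/N) into integer SDF rates."""
    size, n = params["Size"], params["N"]
    if size % n != 0:
        raise ValueError("Size must be a multiple of N to get integer rates")
    return {"produced": size, "consumed": size // n}


def consumer_firings(params):
    """Firings of the consumer for one firing of the producer."""
    rates = resolve_rates(params)
    return rates["produced"] // rates["consumed"]


# Static configuration, then a reconfiguration of N between two iterations.
print(consumer_firings({"Size": 4, "N": 2}))
print(consumer_firings({"Size": 4, "N": 4}))
```

Changing N changes the degree of data parallelism exposed by the consumer without touching the graph structure.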

PREESM Inputs: S-LAM (System-Level Architecture Model). [Figure legend. Processing Element: Operator. Communication Nodes: Parallel Node, Contention Node. Communication Enablers: RAM, DMA. Links: Directed Data Link, Undirected Data Link, Set-up Link.] M. Pelcat, J.-F. Nezan, J. Piat, J. Croizer and S. Aridhi, A System-Level Architecture Model for Rapid Prototyping of Heterogeneous Multicore Embedded Systems, DASIP 2009.

PREESM Inputs: S-LAM (System-Level Architecture Model). [Figure: example S-LAM graph with core1, core2, core3, a DMA, a RAM, and a node CN at 1 Gbit/s.] M. Pelcat, J.-F. Nezan, J. Piat, J. Croizer and S. Aridhi, A System-Level Architecture Model for Rapid Prototyping of Heterogeneous Multicore Embedded Systems, DASIP 2009.

PREESM Inputs: S-LAM (System-Level Architecture Model). [Figure: S-LAM graph of a platform with two DSPs (DSP 1 and DSP 2). Recoverable labels: core1, core2, core3, TCP2, VCP2, DMA, SCR, RIO; link speeds 1 Gb/s and 2 GB/s.]

PREESM Inputs: algorithm/architecture independence. PiSDF graphs are architecture-independent. S-LAM graphs are application-independent. Scenario: defines the information and constraints for the deployment of a specific algorithm on a specific architecture, including mapping constraints and heterogeneous timing constraints.
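The separation between architecture model and scenario can be pictured with two small data structures. This is only a sketch: the class names, fields, and values are hypothetical and do not reflect the S-LAM or scenario file formats used by the tool.

```python
from dataclasses import dataclass


@dataclass
class Architecture:
    operators: set
    # communication node -> (speed in bytes per second, connected operators)
    comm_nodes: dict


@dataclass
class Scenario:
    allowed: dict   # actor -> operators allowed to run it (mapping constraints)
    timings: dict   # (actor, operator) -> execution time in seconds


def transfer_time(arch, src, dst, size_bytes):
    """Time to move size_bytes between two operators over a shared node."""
    if src == dst:
        return 0.0
    for speed, members in arch.comm_nodes.values():
        if src in members and dst in members:
            return size_bytes / speed
    raise ValueError(f"no communication node links {src} and {dst}")


def can_map(scenario, actor, operator):
    return operator in scenario.allowed.get(actor, set())


# Hypothetical three-core platform sharing one 1 Gbit/s node (125e6 bytes/s).
ARCH = Architecture(
    operators={"core1", "core2", "core3"},
    comm_nodes={"CN": (125_000_000, {"core1", "core2", "core3"})},
)
SCENARIO = Scenario(
    allowed={"A": {"core1"}, "B": {"core1", "core2", "core3"}},
    timings={("A", "core1"): 0.002, ("B", "core2"): 0.004},
)

print(transfer_time(ARCH, "core1", "core2", 1_000_000))
print(can_map(SCENARIO, "A", "core2"))
```

Keeping the constraints in the scenario means the same algorithm graph can be paired with another architecture by swapping only these two objects.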

Deployment. [Figure: PREESM workflow overview, as on the "What is PREESM?" slide.]

PREESM Deployment: mapping/scheduling for static graphs. State-of-the-art algorithms (FAST, List, ...). Latency and load balancing optimization. Customizable accuracy (w.r.t. communications). [Figure: schedule over core1, core2, core3, and core4.]
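As an illustration of the list-scheduling family mentioned on this slide, here is a deliberately simple greedy sketch. It is not the scheduler implemented in PREESM: it ignores communication costs, assumes homogeneous cores, and expects the tasks to be listed in topological order. All task names and durations are hypothetical.

```python
def list_schedule(durations, deps, cores):
    """Greedy list scheduling: each task goes to the core where it can
    start earliest. `durations` must be in topological order."""
    core_free = {c: 0 for c in cores}   # time at which each core is idle
    finish = {}
    mapping = {}
    for task in durations:
        ready = max((finish[p] for p in deps.get(task, [])), default=0)
        best = min(cores, key=lambda c: max(core_free[c], ready))
        start = max(core_free[best], ready)
        finish[task] = start + durations[task]
        core_free[best] = finish[task]
        mapping[task] = (best, start)
    return mapping, max(finish.values())


# Hypothetical single-rate task graph and timings.
DURATIONS = {"A": 2, "B": 3, "C1": 4, "C2": 4, "D": 1}
DEPS = {"B": ["A"], "C1": ["A", "B"], "C2": ["A", "B"], "D": ["C1", "C2"]}

mapping, makespan = list_schedule(DURATIONS, DEPS, ["core1", "core2"])
print(mapping, makespan)
```

In this example the two firings of C run in parallel on different cores, which shortens the schedule compared with a sequential execution taking 14 time units.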

PREESM Deployment: mapping/scheduling for reconfigurable PiSDF. SPiDER: Synchronous Parameterized and Interfaced Dataflow Embedded Runtime. Master tasks: run jobs, map and schedule, manage graphs, monitor and trace. Slave task: run jobs. [Figure: a master and several slaves exchanging jobs, timings, and parameters, with data passed through a pool of data FIFOs.]

PREESM Deployment: memory optimizations for static graphs. Bounding the memory needs of an application graph to: evaluate the memory requirements, adjust the size of the architecture memory, assess the optimality of a memory allocation. [Figure: memory axis starting at 0. Below the lower bound: insufficient memory. Between the lower bound and the upper bound: possible allocated memory. Between the upper bound and the available memory: wasted memory.]
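One common way to obtain such bounds, assumed here for illustration, is to describe which buffers can never share memory (an exclusion relation): the upper bound is the sum of all buffer sizes, and a lower bound is the heaviest set of mutually exclusive buffers. The sizes and exclusions are hypothetical, and the brute-force search is only suitable for tiny examples.

```python
from itertools import combinations

# Hypothetical buffers (sizes in bytes) and pairs that may not overlap.
SIZES = {"AB": 100, "BC": 150, "CD": 50, "D": 25}
EXCLUSIONS = {frozenset(p) for p in [("AB", "BC"), ("BC", "CD"), ("CD", "D")]}


def upper_bound(sizes):
    """No reuse at all: every buffer gets its own memory."""
    return sum(sizes.values())


def lower_bound(sizes, exclusions):
    """Heaviest group of buffers that all exclude each other."""
    best = 0
    names = list(sizes)
    for k in range(1, len(names) + 1):
        for group in combinations(names, k):
            pairs = combinations(group, 2)
            if all(frozenset(pair) in exclusions for pair in pairs):
                best = max(best, sum(sizes[b] for b in group))
    return best


print(lower_bound(SIZES, EXCLUSIONS), upper_bound(SIZES))
```

Any valid allocation needs at least the lower bound, and allocating more than the upper bound is pure waste.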

PREESM Deployment: memory optimizations for static graphs. Graph-level memory reuse optimization. [Figure: an SDF graph with actors A, B, C, and D; its single-rate equivalent with firings A, B1, B2, C1, C2, D1, D2; an execution order on Core 1 and Core 2; and the resulting memory allocation. Buffer sizes shown: AB1 100, AB2 100, B1C1 150, B2C2 150, C1C2 75, C1D1 50, C2D2 50, D1 25, D2 25.]
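Memory reuse can be sketched as a first-fit placement: buffers that never hold live data at the same time are allowed to overlap. This is a generic heuristic shown under assumptions, not the allocator used by the tool, and the buffer sizes and exclusions are hypothetical.

```python
# Hypothetical buffers (sizes in bytes) and pairs that may not overlap.
SIZES = {"AB": 100, "BC": 150, "CD": 50, "D": 25}
EXCLUSIONS = {frozenset(p) for p in [("AB", "BC"), ("BC", "CD"), ("CD", "D")]}


def allocate(sizes, exclusions):
    """Place buffers largest first, each at the lowest offset that does
    not overlap an already placed buffer it is exclusive with."""
    offsets = {}
    for buf in sorted(sizes, key=sizes.get, reverse=True):
        busy = sorted(
            (offsets[o], offsets[o] + sizes[o])
            for o in offsets
            if frozenset((buf, o)) in exclusions
        )
        offset = 0
        for start, end in busy:
            if offset + sizes[buf] <= start:
                break                    # the gap before `start` is big enough
            offset = max(offset, end)    # otherwise skip past this range
        offsets[buf] = offset
    total = max(offsets[b] + sizes[b] for b in offsets)
    return offsets, total


offsets, total = allocate(SIZES, EXCLUSIONS)
print(offsets, total)
```

Here the allocation needs 250 bytes instead of the 325 bytes required when every buffer has its own memory.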

PREESM Deployment: memory optimizations for static graphs. Buffer merging technique for SDF graphs. [Figure: actor A writes buffer AB (30), read by B; B writes BC (20) to C and BD (10) to D. Without buffer merging, memory holds AB (30), BC (20), and BD (10) separately. With buffer merging, BC (20) and BD (10) are placed in the memory of AB (30).]
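The arithmetic of the figure can be written out as a short sketch, assuming an actor whose output buffers can be stored inside its input buffer. The function names are hypothetical, and the condition under which an actor actually permits such merging is not modeled.

```python
def footprint(input_size, output_sizes, merge):
    """Memory needed around one actor, with or without buffer merging."""
    if not merge:
        return input_size + sum(output_sizes.values())
    if sum(output_sizes.values()) > input_size:
        raise ValueError("outputs do not fit inside the input buffer")
    return input_size


def merged_offsets(output_sizes):
    """Offsets of the output buffers inside the input buffer."""
    offsets, cursor = {}, 0
    for name, size in output_sizes.items():
        offsets[name] = cursor
        cursor += size
    return offsets


OUTPUTS = {"BC": 20, "BD": 10}
print(footprint(30, OUTPUTS, merge=False), footprint(30, OUTPUTS, merge=True))
print(merged_offsets(OUTPUTS))
```

With the sizes of the figure, merging brings the footprint around actor B from 60 down to 30.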

PREESM Deployment: memory optimizations for static graphs. Merging of multiple input/output buffers. 48% less memory than state-of-the-art techniques. Techniques are independent of the host language. No modification of the SDF MoC or of the application graphs.

PREESM Deployment: energy optimization platform. Odroid with Exynos 5 big.LITTLE: four A7 cores and four A15 cores.

PREESM Deployment: energy optimization setup. [Figure: an image processing application on core1 to core4 reports a QoS P-value (for example P=0, P=0.5, P=1) to a Linux-based runtime (Åbo Akademi), which controls DVFS and DPM on the Odroid Exynos 5 big.LITTLE with four A7 cores and four A15 cores.]

PREESM Deployment: energy optimization results. 20% energy savings on a parallel Sobel filter with sequential post-processing, with respect to the Linux Completely Fair Scheduler and the on-demand governor. S. Holmbacka, E. Nogues, M. Pelcat, S. Lafond, and J. Lilius, Energy Efficiency and Performance Management of Parallel Dataflow Applications, DASIP 2014, Madrid.

Outputs. [Figure: PREESM workflow overview, as on the "What is PREESM?" slide.]

PREESM Outputs: generation of self-timed multicore code. [Figure: graph with actors A, B, C, and D, and a timeline showing the actor executions on two operators, o1 and o2.]
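Self-timed execution means that each core runs its actors in a fixed order decided at compile time, while the exact start times are set at run time by blocking synchronization. The sketch below imitates this with two Python threads and a blocking queue. It is an illustration only: the generated code targets C, and the actor bodies and names here are hypothetical.

```python
import queue
import threading


def run_self_timed(iterations=3):
    fifo_ab = queue.Queue()      # FIFO from actor A (core 1) to actor B (core 2)
    log = []
    lock = threading.Lock()

    def record(entry):
        with lock:
            log.append(entry)

    def core1():
        # Fixed firing order on core 1: A, A, A, ...
        for i in range(iterations):
            record(("A", i))
            fifo_ab.put(i * i)   # send the token to core 2

    def core2():
        # Fixed firing order on core 2: B, B, B, ...
        for _ in range(iterations):
            value = fifo_ab.get()    # blocks until A has produced a token
            record(("B", value))

    threads = [threading.Thread(target=core1), threading.Thread(target=core2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return log


print(run_self_timed())
```

The interleaving of A and B entries in the log may differ between runs, but B always receives the tokens in the order A produced them.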

PREESM Outputs: code generation for multiple targets. Multi-C6X DSPs: TMS320c6678 from Texas Instruments; supports the activation of the DSP caches. Multi-x86 and multi-ARM CPUs: Linux and Windows, pthread. OMAP4 heterogeneous platform: dual-core ARM Cortex-A9, 2 Cortex-M3, and a C64xT DSP.

Demo Time. [Figure: PREESM workflow overview, as on the "What is PREESM?" slide.]

Summary: PREESM features. Open-source tool: available on GitHub. Research-oriented tool: new models, optimizations, scheduling. Eclipse-based integrated tool: several plug-ins, metamodels. Extended web tutorials: http://preesm.sourceforge.net/website

Questions? http://preesm.sf.net @PreesmProject