Institut d'Électronique et des Télécommunications de Rennes (IETR)
Image Team (Équipe Image)
March 13, 2015
IETR Image Team
The team:
- 10 teacher-researchers
- ~15 PhD students & post-docs
Expertise:
- Image: analysis, compression
- Architecture: multi-core, embedded systems
Research themes:
- Image analysis for semantic indexing and embedded vision
- 2D/3D image and video coding
- Cryptography
- Architecture
IETR Image: Architecture theme
Objectives
Dataflow-based methods and tools for optimizing signal processing applications on distributed and embedded platforms:
- Throughput
- Latency
- Energy
- Memory
- Programming time
Target Applications
- MPEG decoders: MPEG-4 Part 2, AVC, SVC, HEVC, SHVC (MPEG participation)
- Computer vision and 3D processing: stereo vision, SLAM
- Cryptography: chaotic-based cryptography
- Telecommunications: 3GPP LTE eNodeB
Target Platforms
- Texas Instruments Keystone I and II
- Zboard with Xilinx Zynq
- Odroid with Samsung Exynos 5
- Kalray MPPA
Methods
Optimizing: throughput, latency, energy, memory, programming time
- Dataflow programming
- SIMD & parallelism
- Data representation
- Energy-aware processing
Software
- Open SVC Decoder (C code, x86 ASM): http://sourceforge.net/projects/opensvcdecoder
- Open HEVC Decoder (C code, x86 & ARM ASM), FFmpeg: https://github.com/OpenHEVC/openHEVC
- Orcc Compiler (Java, Xtend): http://orcc.sourceforge.net
- PREESM Rapid Prototyping Tool (Java, Xtend): http://preesm.sourceforge.net/website
Academic Partners
Industrial Partners
Introduction: Motivations
Software Productivity Gap (log scale, 1990 to 2015):
- Lines of code/chip: x2 every 10 months
- Transistors/chip: x2 every 18 months
- Lines of code/day: x2 every 5 years
Source: ITRS & Hardware-dependent Software, Ecker et al., Springer
Introduction: Hardware Complexity
[Figure: number of PEs per SoC (0 to 5000) versus year (2010 to 2025)]
Source: ITRS System Drivers 2011
What is PREESM?
[Figure: PREESM workflow overview. Inputs: Algorithm and Architecture. PREESM is used together with a C compiler, a Simulator + Debugger + Profiler and a Multicore Runtime. Target platform: PEs and DSPs with Peripherals and Main Memory]
What is PREESM? (Parallel Real-time Embedded Executives Scheduling Method)
- A rapid prototyping framework
- An open-source project
- A set of Eclipse plugins
- Tool and applications available on GitHub
Using PREESM to design an embedded system:
- Design of parallel algorithms
To provide metrics:
- Throughput/latency evaluation
- Predictable memory footprints
To build a working prototype:
- Code generation for multicore architectures
- Guaranteed deadlock-freeness
- Inter-core communications
For design-space exploration:
- Seamless porting to a new architecture
- Legacy code reusability
Inputs
[Figure: PREESM workflow overview]
PREESM Inputs
Algorithm descriptions using Dataflow Graphs: Synchronous Dataflow (SDF)
- Actors and data ports
- FIFO queues
[Figure: SDF graph with actors A, B, C and D connected by FIFOs, with production and consumption rates of 1 or 2 on each port]
E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.
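Because SDF rates are fixed, the number of times each actor must fire per graph iteration can be computed ahead of time from the balance equations (for every FIFO, tokens produced equal tokens consumed). The sketch below illustrates that computation. The edge list `EXAMPLE_EDGES` and the function name `repetition_vector` are hypothetical: the rates are chosen for illustration and are not a transcription of the slide's figure, and this is not PREESM code.

```python
from fractions import Fraction
from functools import reduce
from math import gcd


def repetition_vector(actors, edges):
    """Solve the SDF balance equations q[src] * prod == q[dst] * cons.

    edges: list of (src, prod_rate, dst, cons_rate) tuples.
    Returns the smallest positive integer firing counts per actor.
    """
    ratio = {actors[0]: Fraction(1)}  # firing rates relative to the first actor
    changed = True
    while changed:  # propagate rates along edges until nothing new is learned
        changed = False
        for src, prod, dst, cons in edges:
            if src in ratio and dst not in ratio:
                ratio[dst] = ratio[src] * prod / cons
                changed = True
            elif dst in ratio and src not in ratio:
                ratio[src] = ratio[dst] * cons / prod
                changed = True
    if len(ratio) != len(actors):
        raise ValueError("graph is not connected")
    for src, prod, dst, cons in edges:
        if ratio[src] * prod != ratio[dst] * cons:
            raise ValueError("inconsistent rates: no periodic schedule exists")
    # Scale the rational rates to the smallest integer vector.
    scale = reduce(lambda a, b: a * b // gcd(a, b),
                   (r.denominator for r in ratio.values()), 1)
    counts = {a: int(r * scale) for a, r in ratio.items()}
    common = reduce(gcd, counts.values())
    return {a: c // common for a, c in counts.items()}


# Hypothetical graph: (producer, tokens produced, consumer, tokens consumed)
EXAMPLE_EDGES = [("A", 1, "B", 1), ("A", 2, "C", 1),
                 ("B", 2, "D", 1), ("C", 1, "D", 1)]

if __name__ == "__main__":
    print(repetition_vector(["A", "B", "C", "D"], EXAMPLE_EDGES))
```

For this assumed graph, A and B fire once per iteration while C and D fire twice. A graph whose rates cannot be balanced is rejected with an error.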
PREESM Inputs
Algorithm descriptions using Dataflow Graphs: data-driven execution
- An actor is fired when its input FIFOs contain enough data tokens.
[Figure: the SDF graph with actors A, B, C and D, and its execution on Core 1: A, B, C, C, D]
E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.
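The firing rule can be illustrated with a small single-core simulation: an actor fires only when every input FIFO holds at least as many tokens as the actor consumes. This is a minimal sketch under stated assumptions. The graph `EDGES`, the firing counts `REPETITIONS` and the function `run_iteration` are hypothetical names and values chosen for illustration; they are not taken from the slide's figure or from PREESM.

```python
def run_iteration(edges, repetitions):
    """Fire actors in data-driven fashion for one graph iteration.

    edges: list of (src, prod_rate, dst, cons_rate) tuples.
    repetitions: {actor: number of firings in one iteration}.
    Returns (firing order, tokens left on each FIFO).
    """
    tokens = [0] * len(edges)          # one token counter per FIFO
    remaining = dict(repetitions)
    order = []
    while any(remaining.values()):
        for actor, left in remaining.items():
            if left == 0:
                continue
            inputs = [i for i, e in enumerate(edges) if e[2] == actor]
            # Firing rule: enough tokens on every input FIFO.
            if all(tokens[i] >= edges[i][3] for i in inputs):
                for i in inputs:
                    tokens[i] -= edges[i][3]      # consume
                for i, e in enumerate(edges):
                    if e[0] == actor:
                        tokens[i] += e[1]         # produce
                remaining[actor] -= 1
                order.append(actor)
                break
        else:
            raise RuntimeError("deadlock: no actor can fire")
    return order, tokens


# Hypothetical graph: (producer, tokens produced, consumer, tokens consumed)
EDGES = [("A", 1, "B", 1), ("A", 2, "C", 1),
         ("B", 2, "D", 1), ("C", 1, "D", 1)]
REPETITIONS = {"A": 1, "B": 1, "C": 2, "D": 2}

if __name__ == "__main__":
    order, leftover = run_iteration(EDGES, REPETITIONS)
    print(order, leftover)
```

With these assumed rates the firing order is A, B, C, C, D, D, and every FIFO is empty again at the end of the iteration, which is what allows the schedule to repeat indefinitely with bounded memory.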
PREESM Inputs
Algorithm descriptions using Dataflow Graphs: expression of parallelisms
- Task / Data / Pipeline / Internal
[Figure: the SDF graph with actors A, B, C and D (C and D fired x2) mapped onto Core 1, Core 2 and Core 3, annotated with task parallelism, data parallelism, pipeline parallelism and internal parallelism]
E. Lee and D. Messerschmitt, Synchronous data flow, Proceedings of the IEEE, 1987.
PREESM Inputs
PiSDF (Parameterized and Interfaced Synchronous Dataflow)
[Figure: PiSDF graph with actors Read Header, Read, Filter, Send, SetNbSlices and Kernel, data interfaces in and out, parameters Size (=4) and N (=2), and port rates expressed as Size and Size/N]
PREESM Inputs
PiSDF (Parameterized and Interfaced Synchronous Dataflow)
PiSDF is:
- Hierarchical & compositional
- Statically parameterizable
- Dynamically reconfigurable
- Lightweight runtime overhead
PiSDF fosters:
- Predictability
- Parallelism
- Developer-friendliness
K. Desnos, M. Pelcat, J.-F. Nezan, S. S. Bhattacharyya, S. Aridhi, PiMM: Parameterized and Interfaced Dataflow Meta-Model for MPSoCs Runtime Reconfiguration, SAMOS XIII
PREESM Inputs
S-LAM (System-Level Architecture Model)
- Processing Element: Operator
- Communication Nodes: Parallel Node, Contention Node
- Communication Enablers: RAM, DMA
- Links: Directed Data Link, Undirected Data Link, Set-up Link
M. Pelcat, J.-F. Nezan, J. Piat, J. Croizer and S. Aridhi, A System-Level Architecture Model for Rapid Prototyping of Heterogeneous Multicore Embedded Systems, DASIP 2009
PREESM Inputs
S-LAM (System-Level Architecture Model)
[Figure: example S-LAM graph with core1, core2, core3, a DMA and a RAM connected through a communication node (CN) at 1 Gbit/s]
M. Pelcat, J.-F. Nezan, J. Piat, J. Croizer and S. Aridhi, A System-Level Architecture Model for Rapid Prototyping of Heterogeneous Multicore Embedded Systems, DASIP 2009
PREESM Inputs
S-LAM (System-Level Architecture Model)
[Figure: S-LAM graph of a two-DSP platform (DSP 1 and DSP 2). Each DSP has core1, core2, core3, TCP2, VCP2 and a DMA connected to an SCR; the two DSPs are linked through RIO. Link speeds shown: 1 Gb/s, 2 GB/s, 2 GB/s]
PREESM Inputs
Algorithm/Architecture independence:
- PiSDF graphs are architecture-independent
- S-LAM graphs are application-independent
Scenario: defines information/constraints for the deployment of a specific algorithm on a specific architecture
- Mapping constraints
- Heterogeneous timing constraints
Deployment
[Figure: PREESM workflow overview]
PREESM Deployment
Mapping/Scheduling for static graphs
- State-of-the-art algorithms (FAST, List, ...)
- Latency and load balancing optimization
- Customizable accuracy (w.r.t. communications)
[Figure: schedule of the application on core1, core2, core3 and core4]
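As an illustration of the list scheduling family mentioned above, the following is a deliberately simplified sketch: ready tasks are picked by priority and placed on the core that lets them start earliest. It ignores communication costs and assumes identical cores, so it does not reflect PREESM's actual scheduler. The function `list_schedule`, the task names and the timings are hypothetical.

```python
def list_schedule(durations, deps, n_cores):
    """Greedy list scheduling on identical cores, ignoring communications.

    durations: {task: execution time}
    deps: {task: [predecessor tasks]}
    Returns (schedule, makespan); schedule holds (task, core, start, end).
    """
    finish = {}                    # task -> completion time
    core_free = [0] * n_cores      # time at which each core becomes idle
    schedule = []
    remaining = set(durations)
    while remaining:
        ready = [t for t in remaining
                 if all(p in finish for p in deps.get(t, []))]
        if not ready:
            raise ValueError("cyclic dependencies")
        # Priority rule (an assumption): longest task first, ties by name.
        task = min(ready, key=lambda t: (-durations[t], t))
        earliest = max((finish[p] for p in deps.get(task, [])), default=0)
        # Choose the core offering the earliest start time.
        core = min(range(n_cores), key=lambda c: max(core_free[c], earliest))
        start = max(core_free[core], earliest)
        finish[task] = start + durations[task]
        core_free[core] = finish[task]
        schedule.append((task, core, start, finish[task]))
        remaining.remove(task)
    return schedule, max(finish.values())


# Hypothetical fork-join application with made-up timings.
DURATIONS = {"A": 10, "B1": 20, "B2": 20, "C": 5}
DEPS = {"B1": ["A"], "B2": ["A"], "C": ["B1", "B2"]}

if __name__ == "__main__":
    sched, makespan = list_schedule(DURATIONS, DEPS, n_cores=2)
    for entry in sched:
        print(entry)
    print("makespan:", makespan)
```

In this example the two B tasks run in parallel on the two cores, giving a latency of 35 time units instead of 55 on a single core. A more accurate model would add transfer times between cores, which is the kind of accuracy trade-off the slide refers to.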
PREESM Deployment
Mapping/Scheduling for reconfigurable PiSDF
SPIDER: Synchronous Parameterized and Interfaced Dataflow Embedded Runtime
Master tasks:
- Run jobs
- Map & Schedule
- Manage graphs
- Monitor & Trace
Slave task:
- Run jobs
[Figure: a Master and Slaves exchanging Jobs, Params and Timings, with Data passing through a pool of data FIFOs]
PREESM Deployment
Memory optimizations for static graphs
Bounding the memory needs of an application graph to:
- Evaluate the memory requirements
- Adjust the size of architecture memory
- Assess the optimality of a memory allocation
[Figure: available memory axis starting at 0. Below the lower bound: insufficient memory. Between the lower and upper bounds: possible allocated memory. Above the upper bound: wasted memory]
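The slide does not say how the bounds are derived. One possible way to obtain them, shown here purely as an assumption, is to model which buffers can never share memory as an exclusion graph: the upper bound is then the sum of all buffer sizes (no reuse at all), and a lower bound is the heaviest group of buffers that mutually exclude each other. The function `memory_bounds`, the buffer names and the sizes are hypothetical, and the brute-force search is only practical for small graphs.

```python
from itertools import combinations


def memory_bounds(sizes, exclusions):
    """Return (lower, upper) memory bounds for a set of buffers.

    sizes: {buffer: size}
    exclusions: pairs of buffers that must not overlap in memory.
    """
    upper = sum(sizes.values())                 # every buffer gets its own space
    excl = {frozenset(pair) for pair in exclusions}
    lower = max(sizes.values(), default=0)      # a single buffer always fits the bound
    names = sorted(sizes)
    # Brute-force search for the heaviest clique of mutually exclusive buffers.
    for k in range(2, len(names) + 1):
        for group in combinations(names, k):
            if all(frozenset(pair) in excl for pair in combinations(group, 2)):
                lower = max(lower, sum(sizes[b] for b in group))
    return lower, upper


# Hypothetical chain A -> B -> C -> D: an actor's input and output buffers
# are alive at the same time, but AB and CD never are.
SIZES = {"AB": 100, "BC": 150, "CD": 50}
EXCLUSIONS = [("AB", "BC"), ("BC", "CD")]

if __name__ == "__main__":
    print(memory_bounds(SIZES, EXCLUSIONS))
```

For this assumed chain, any allocation needs at least 250 units (AB and BC together), while allocating more than 300 units would only waste memory.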
PREESM Deployment
Memory optimizations for static graphs
Graph-level memory reuse optimization
[Figure: an SDF graph with actors A, B, C and D, and its expansion into firings A, B1, B2, C1, C2, D1, D2 scheduled on Core 1 and Core 2 (execution order as listed: A, B1, C2, D1, B2, C1, D2). Buffers and sizes: AB1 (100), AB2 (100), B1C1 (150), B2C2 (150), C1C2 (75), C1D1 (50), C2D2 (50), D1 (25), D2 (25)]
PREESM Deployment
Memory optimizations for static graphs
Buffer merging technique for SDF graphs
[Figure: actor A sends buffer AB (30) to actor B, which sends buffer BC (20) to C and buffer BD (10) to D. Without buffer merging, AB, BC and BD occupy separate memory. With buffer merging, BC and BD share the memory of AB]
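The saving in this example can be worked through with a small layout computation. The sketch below assumes that actor B may write its outputs over its input (for instance because it only splits or forwards the data), which is the condition under which merging is safe; the function `allocate` is a hypothetical helper, not part of PREESM. Buffer sizes are the ones shown on the slide.

```python
def allocate(input_buffer, output_buffers, merge):
    """Lay out one actor's input buffer and output buffers in memory.

    input_buffer: (name, size); output_buffers: list of (name, size).
    merge=True places the outputs inside the input buffer's range,
    merge=False places them after it.
    Returns ({buffer: (offset, size)}, total memory used).
    """
    name, size = input_buffer
    layout = {name: (0, size)}
    offset = 0 if merge else size
    for out_name, out_size in output_buffers:
        layout[out_name] = (offset, out_size)
        offset += out_size
    total = max(start + length for start, length in layout.values())
    return layout, total


if __name__ == "__main__":
    outputs = [("BC", 20), ("BD", 10)]
    print(allocate(("AB", 30), outputs, merge=False))
    print(allocate(("AB", 30), outputs, merge=True))
```

Without merging the three buffers need 30 + 20 + 10 = 60 units. With merging, BC and BD fit inside the 30 units of AB, so the footprint is halved for this actor.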
PREESM Deployment
Memory optimizations for static graphs
- Merging of multiple input/output buffers.
- 48% less memory than state-of-the-art techniques.
- Techniques are independent from the host language.
- No modification of the SDF MoC/application graphs.
PREESM Deployment
Energy optimization: platform
- Odroid with Exynos 5 big.LITTLE: 4 x A7 cores and 4 x A15 cores
PREESM Deployment
Energy optimization setup
[Figure: an image processing application running on core1 to core4 reports its QoS as a P-value (P=0, P=0.5, P=1) to a Linux-based runtime (Åbo Akademi) that drives DVFS and DPM on the Odroid Exynos 5 big.LITTLE (4 x A7, 4 x A15)]
PREESM Deployment
Energy optimization results
- 20% energy savings on a parallel Sobel + sequential postprocessing w.r.t. the Linux Completely Fair Scheduler and on-demand governor
S. Holmbacka, E. Nogues, M. Pelcat, S. Lafond, and J. Lilius, Energy Efficiency and Performance Management of Parallel Dataflow Applications, DASIP 2014, Madrid
Outputs
[Figure: PREESM workflow overview]
PREESM Outputs
Generation of self-timed multicore code
[Figure: graph with actors A, B, C and D, and its timeline on two operators: o1 runs Actor A, Actor B and Actor D; o2 runs Actor C]
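In a self-timed execution, each core runs its actors in a fixed order decided at compile time and synchronizes with other cores only through blocking FIFOs, without a global clock. The sketch below mimics that behaviour in Python with threads and queues. It is an illustration only: the generated code discussed in the slides is C, and the function `run_self_timed`, the actor bodies and the FIFO names are hypothetical.

```python
import queue
import threading


def run_self_timed(iterations=3):
    """Run actors A, B, D on one thread and actor C on another.

    Each thread follows a fixed firing order; blocking FIFO reads are
    the only synchronization between the two threads.
    """
    a_to_b, a_to_c, b_to_d, c_to_d = (queue.Queue() for _ in range(4))
    results = []

    def core1():
        for i in range(iterations):
            a_to_b.put(i)                      # actor A: produce for B
            a_to_c.put(i)                      # actor A: produce for C
            b_to_d.put(a_to_b.get() + 1)       # actor B
            # actor D: blocks until actor C has delivered its token
            results.append(b_to_d.get() + c_to_d.get())

    def core2():
        for _ in range(iterations):
            c_to_d.put(a_to_c.get() * 10)      # actor C: blocks until A has fired

    threads = [threading.Thread(target=core1), threading.Thread(target=core2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


if __name__ == "__main__":
    print(run_self_timed())
```

Whatever the relative speed of the two threads, the results are identical from run to run, because the order of firings on each core and the FIFO order of tokens are both fixed.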
PREESM Outputs
Code generation for multiple targets
- Multi-C6X DSPs: TMS320C6678 from Texas Instruments. Supports the activation of the DSP caches.
- Multi-x86 and multi-ARM CPUs: Linux and Windows, pthread
- OMAP4 heterogeneous platform: dual-core ARM Cortex-A9, 2 Cortex-M3, and a C64xT DSP.
Demo Time
[Figure: PREESM workflow overview]
Summary: PREESM features
- Open-source tool: available on GitHub
- Research-oriented tool: new models, optimizations, scheduling
- Eclipse-based integrated tool: several plug-ins, metamodels
- Extended web tutorials: http://preesm.sourceforge.net/website
Questions?
http://preesm.sf.net
@PreesmProject