Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems


Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems
A. Carbon, Y. Lhuillier, H.-P. Charles
CEA LIST, DACLE division, Embedded Computing and Embedded Software Laboratories, France
Contact: alexandre.carbon@cea.fr | www.cea.fr
24th IEEE International Conference on Application-specific Systems, Architectures and Processors, June 5-7, 2013, Washington D.C., USA

Outline
- Context: virtualization and JIT emergence
- JIT optimization opportunities (based on the LLVM framework)
- Hardware accelerator proposal
- Experimental results
- Conclusion

Context: parallelism emergence
[diagram: five identical CPU cores]

Context: parallelism emergence
- Heterogeneity development in computing systems
- ILP/TLP-based multi-cores and wide-SIMD GPUs
[diagram: four CPU cores and a GPU]

Context: parallelism emergence
- Heterogeneity development in embedded systems
- Emergence of heterogeneous asymmetric many-core processors (AMPs)
[diagram: CPU, DSP, CPU, HW accelerator, GPU]
- PPE: a major concern of embedded systems designers

Context: virtualization emergence
- Code deployment on such architectures: high development cost to efficiently target one AMP
- Code portability has become a major issue
[diagram: gap between software and hardware]

Context: virtualization emergence
- Emergence of virtualization abstraction layers, first mentioned in the 1960s (IBM 360/67)
- Java Virtual Machines, CLI, LLVM => development of virtual machines (VMs)
[diagram: software / virtualization layer / hardware]

Context: virtualization emergence
- Virtualization layers were initially based on interpretation
- They suffer from considerable performance overheads
[diagram: software / virtualization layer (interpretation) / hardware]

Context: Just-In-Time emergence
- Today, virtualization layers couple interpretation with Just-In-Time (JIT) compilation
[diagram: software / virtualization layer (interpretation + JIT compilation) / hardware]

Context: Just-In-Time emergence
- JIT compilation: widely used on general-purpose processors (GPPs)
- Performance and consumption overheads in embedded systems
- Two kinds of existing optimizations to reduce the JIT impact:
  - software optimizations
  - system design specialization: specialized dedicated resources, or additional standard dedicated resources [1]
- The complexity of JIT compilation (pointer-based algorithms) limits the performance gains [1]
- Proposal: tuned hardware, associated with these dedicated resources, to manage the JIT compilation algorithms
[1] Ting Cao et al., "The yin and yang of power and performance for asymmetric hardware and managed software", ISCA '12.

Outline
- Context: virtualization and JIT emergence
- JIT optimization opportunities (based on the LLVM framework)
- Hardware accelerator proposal
- Experimental results
- Conclusion

LLVM framework
- LLVM bytecode compiler (LLC): used in many projects
- Profiling to identify LLC's most critical parts; experiments on an ARM Cortex-A5 model
- Associative array management and dynamic memory allocation: on average 24% of LLC execution time

LLVM: existing software optimizations
- New abstract data types (optimized versions of the STL C++ ADTs)
  - Provide [multi]map and [multi]set abstract data types
  - Hash-table implementations rather than sorted trees
  - STL C++: still used when performance is not a key issue
- Specialized allocators (e.g. RecycleAllocator)
  - Keep track of recently deallocated objects in order to reuse them
  - Avoid frequent allocations and deallocations
- Despite these software optimizations, associative array management and memory allocation still prevail
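The recycling-allocator idea mentioned on this slide can be sketched in a few lines. This is a minimal illustration of the general technique, not LLVM's actual RecycleAllocator; the class and member names are made up for the example.

```cpp
#include <cstddef>
#include <vector>
#include <cassert>

// Sketch of a recycling allocator: freed fixed-size blocks are kept on
// a free list and handed back on the next allocation, avoiding repeated
// round-trips through the general-purpose heap allocator.
class RecyclingPool {
  std::vector<void*> free_list_;  // recently deallocated blocks
  std::size_t block_size_;
public:
  explicit RecyclingPool(std::size_t block_size) : block_size_(block_size) {}

  void* allocate() {
    if (!free_list_.empty()) {            // reuse a recycled block
      void* p = free_list_.back();
      free_list_.pop_back();
      return p;
    }
    return ::operator new(block_size_);   // otherwise fall back to the heap
  }

  void deallocate(void* p) { free_list_.push_back(p); }  // recycle, don't free

  ~RecyclingPool() {
    for (void* p : free_list_) ::operator delete(p);
  }
};
```

Allocating, deallocating, then allocating again returns the same block, which is exactly the behavior that saves the allocator round-trips in hot JIT paths.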

LLVM normalization
- Our goal: an alternative acceleration of associative arrays and dynamic memory allocation
- Means: standardization of LLC, i.e. a solution based on standard libraries, for reuse
  - Using only the STL C++ library for the [multi]map and [multi]set ADTs
  - Using C's standard memory-allocation library (dlmalloc-based)
- Transferring the optimizations to a hardware accelerator
- Solution portability: accelerator reuse benefits all pointer-based algorithms that make massive use of associative arrays and dynamic memory allocation

Cliquez pour modifier le style du Outline titre Context : Virtualization JIT emergence JIT optimization opportunities (based on the LLVM framework) Hardware accelerator proposal Experiments results Conclusion 15

RB-Tree hardware accelerator
- Current implementation of associative arrays and memory allocation:
  - The standard C++ map/set libraries use a red-black tree (RB-tree) representation: a binary tree with a coloring property
  - C's memory allocator (dlmalloc) uses associative arrays to associate data sizes with free memory chunks, implemented with hash tables and doubly linked lists
- Systematic usage of RB-trees: proposing an implementation of dlmalloc based on RB-trees, modifying the allocator without modifying its user interface
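The size-to-free-chunk mapping described here can be illustrated with an RB-tree-backed container: std::multimap is a red-black tree in common STL implementations, and lower_bound gives a best-fit lookup directly. This is only a sketch of the mapping, under the assumption of a naive best-fit policy; real dlmalloc chunk bookkeeping is far more involved, and the names below are illustrative.

```cpp
#include <cstddef>
#include <map>
#include <cassert>

// Free chunks indexed by size in an RB-tree (std::multimap), instead of
// dlmalloc's hash-table bins of doubly linked lists.
std::multimap<std::size_t, void*> free_chunks;  // size -> free chunk

// Best-fit: take the smallest free chunk whose size is >= the request.
void* take_best_fit(std::size_t request) {
  auto it = free_chunks.lower_bound(request);  // RB-tree lower bound
  if (it == free_chunks.end()) return nullptr; // no fit: would grow the heap
  void* chunk = it->second;
  free_chunks.erase(it);                       // chunk is no longer free
  return chunk;
}
```

Because both map/set and the allocator would then rely on the same RB-tree operations (lower bound, insert, erase with rebalancing), one hardware accelerator can serve both, which is the point of the slide.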

Hardware acceleration description
- New RB-tree node structure, held in 128 bits: the key is digested into 31 bits, with the color in the last bit, preserving the sorting order
[figure: (a) initial structure: parent, left and right pointers, full-size key, color flag; (b) proposed 128-bit structure: parent, left and right pointers, plus one 32-bit word packing the 31-bit key digest and the color bit]
- Proposing hardware-accelerated instructions: specialized instructions for the RB-tree management functions, accelerating traversals, key look-ups, balanced insertion and removal
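The proposed 128-bit node can be sketched as a C++ struct for a 32-bit target such as the Cortex-A5 (three 32-bit pointers plus one packed word). The field names and the packing helper are illustrative, not taken from the paper's RTL; the one property carried over from the slide is that packing the 31-bit digest above the color bit preserves the digest's sorting order.

```cpp
#include <cstdint>

// Sketch of the proposed 128-bit RB-tree node (on a 32-bit ARM target,
// each pointer is 32 bits, so the whole node fits in 128 bits).
struct rb_node128 {
  rb_node128* parent;
  rb_node128* left;
  rb_node128* right;
  std::uint32_t digest_color;  // bits [31:1]: 31-bit key digest, bit [0]: color
};

// Pack a 31-bit digest and the color bit into one word. Putting the
// digest in the upper bits keeps integer comparison of the packed word
// consistent with digest order.
constexpr std::uint32_t pack(std::uint32_t digest31, bool red) {
  return (digest31 << 1) | (red ? 1u : 0u);
}
```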

Proposed ISA extension
15 new instructions, added to the roughly 400 of the ARM ISA:

Instruction            | Function            | Used by
RBTINC Rd, Rm          | increment           | map::iterator, set::iterator
RBTDEC Rd, Rm          | decrement           | map::iterator, set::iterator
RBTLOW Rd, Rn, Rm      | lower bound         | map::lower_bound, set::lower_bound, malloc, realloc
RBTUP Rd, Rn, Rm       | upper bound         | map::upper_bound, set::upper_bound
RBTDEL Rn, Rm          | rebalance for erase | map::erase, set::erase, malloc, realloc, free
RBTINS<L|R> Rn, Rm, Rs | insert rebalance    | map::insert, set::insert, malloc, realloc, free

Each instruction reads at most 3 registers and writes at most 1; the multi-cycle instructions hide the iterative computations.
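What an RBTLOW-style lower-bound instruction computes can be modeled in software: an iterative walk from the root to the leftmost node whose key is at least the search key. In the proposed hardware this whole loop is hidden inside one multi-cycle instruction. The model below is a simplification under stated assumptions: it uses plain keys and omits the parent pointer and the 128-bit digest layout of the actual nodes.

```cpp
#include <cstdint>

// Simplified node for modeling the traversal (no parent, no color).
struct node {
  node* left;
  node* right;
  std::uint32_t key;
};

// Software model of a lower-bound lookup: return the node with the
// smallest key >= `key`, or nullptr if no such node exists. The loop
// body is the iterative computation an RBTLOW-like instruction hides.
node* rbt_lower_bound(node* root, std::uint32_t key) {
  node* result = nullptr;
  while (root) {
    if (root->key >= key) {  // candidate found; try to improve it leftward
      result = root;
      root = root->left;
    } else {
      root = root->right;    // everything here is too small
    }
  }
  return result;
}
```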

HW accelerator implementation
- New Cortex-A5 pipeline integrating the proposed HW accelerator
- Full FSM size estimation: 161 states, 223 transitions

Outline
- Context: virtualization and JIT emergence
- JIT optimization opportunities (based on the LLVM framework)
- Hardware accelerator proposal
- Experimental results
- Conclusion

Experimental results
- Instrumented an ARM Cortex-A5 instruction-set simulator (ISS) with a cache simulator
- Speedups obtained over the standardized LLC version: 29% gain for the SW-optimized version, 50% for the HW-accelerated one
- 15% gain for the HW-accelerated version compared to the SW-optimized one

Experimental results
- Evolution of the time spent in memory allocation and associative array management (relative to total execution time): from 41% down to 24% (SW) and 12% (HW)
- Raw speedup on memory allocation and associative arrays: 5x

Outline
- Context: virtualization and JIT emergence
- JIT optimization opportunities (based on the LLVM framework)
- Hardware accelerator proposal
- Experimental results
- Conclusion

Conclusion
- Interest of dedicated resources for virtualization services
  - Limited gains for JIT compilation with SW optimizations alone
  - Strong impact of dynamic memory allocation and associative array management on execution time
- Proposal: tuned HW for JIT compilation, coupled to dedicated resources
  - HW accelerator hidden behind standard libraries; valuable for all pointer-based algorithms
  - ISA extension for the RB-tree management functions
- Results: 15% gain compared to SW optimizations in LLVM code generation; 5x raw speedup for memory allocation and associative array management
- Next acceleration opportunities: instruction graph handling

Thank you. Questions?
Centre de Grenoble: 17 rue des Martyrs, 38054 Grenoble Cedex
Centre de Saclay: Nano-Innov PC 172, 91191 Gif-sur-Yvette Cedex