THE VELOX STACK Patrick Marlier (UniNE)



Similar documents
64-Bit NASM Notes. Invoking 64-Bit NASM

Multi-core Programming System Overview

Cloned Transactions: A New Execution Concept for Transactional Memory

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

Thesis Proposal: Improving the Performance of Synchronization in Concurrent Haskell

Towards OpenMP Support in LLVM

Making Multicore Work and Measuring its Benefits. Markus Levy, president EEMBC and Multicore Association

Lecture 7: Machine-Level Programming I: Basics Mohamed Zahran (aka Z)

Introduction to Virtual Machines

EE8205: Embedded Computer System Electrical and Computer Engineering, Ryerson University. Multitasking ARM-Applications with uvision and RTX

Virtualization. Pradipta De

Weighted Total Mark. Weighted Exam Mark

Scalability evaluation of barrier algorithms for OpenMP

GPU Parallel Computing Architecture and CUDA Programming Model

PART IV Performance oriented design, Performance testing, Performance tuning & Performance solutions. Outline. Performance oriented design

Last Class: OS and Computer Architecture. Last Class: OS and Computer Architecture

System/Networking performance analytics with perf. Hannes Frederic Sowa

Course Development of Programming for General-Purpose Multicore Processors

Debugging with TotalView

INTEL PARALLEL STUDIO EVALUATION GUIDE. Intel Cilk Plus: A Simple Path to Parallelism

Software and the Concurrency Revolution

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

Scheduling Task Parallelism" on Multi-Socket Multicore Systems"

Frysk The Systems Monitoring and Debugging Tool. Andrew Cagney

Embedded Programming in C/C++: Lesson-1: Programming Elements and Programming in C

Building Applications Using Micro Focus COBOL

OS Virtualization Frank Hofmann

CS 377: Operating Systems. Outline. A review of what you ve learned, and how it applies to a real operating system. Lecture 25 - Linux Case Study

CS423 Spring 2015 MP4: Dynamic Load Balancer Due April 27 th at 9:00 am 2015

RISC-V Software Ecosystem. Andrew Waterman UC Berkeley

SSC - Concurrency and Multi-threading Java multithreading programming - Synchronisation (I)

Parallel Computing. Shared memory parallel programming with OpenMP

Design and Implementation of the Heterogeneous Multikernel Operating System

How To Write A Multi Threaded Software On A Single Core (Or Multi Threaded) System

Chapter 6, The Operating System Machine Level

Virtual Computing and VMWare. Module 4

KITES TECHNOLOGY COURSE MODULE (C, C++, DS)

picojava TM : A Hardware Implementation of the Java Virtual Machine

Understanding Hardware Transactional Memory

Monitors, Java, Threads and Processes

An Easier Way for Cross-Platform Data Acquisition Application Development

Lecture 6: Semaphores and Monitors

Automatic Logging of Operating System Effects to Guide Application-Level Architecture Simulation

NVIDIA CUDA GETTING STARTED GUIDE FOR MAC OS X

A Unified View of Virtual Machines

CHAPTER 1 INTRODUCTION

SYCL for OpenCL. Andrew Richards, CEO Codeplay & Chair SYCL Working group GDC, March Copyright Khronos Group Page 1

SEER PROBABILISTIC SCHEDULING FOR COMMODITY HARDWARE TRANSACTIONAL MEMORY. 27 th Symposium on Parallel Architectures and Algorithms

PARALLEL JAVASCRIPT. Norm Rubin (NVIDIA) Jin Wang (Georgia School of Technology)

Operating Systems and Networks

General Introduction

Introduction to GPU hardware and to CUDA

Hardware RAID vs. Software RAID: Which Implementation is Best for my Application?

Andreas Herrmann. AMD Operating System Research Center

Intel Application Software Development Tool Suite 2.2 for Intel Atom processor. In-Depth

1/5/2013. Technology in Action

Mutual Exclusion using Monitors

A Case Study - Scaling Legacy Code on Next Generation Platforms

AMD CodeXL 1.7 GA Release Notes

Example of Standard API

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Facing the Challenges for Real-Time Software Development on Multi-Cores

A JIT Compiler for Android s Dalvik VM. Ben Cheng, Bill Buzbee May 2010

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Java Interview Questions and Answers

Instruction Set Architecture (ISA)

Advanced Multiprocessor Programming

VMware and CPU Virtualization Technology. Jack Lo Sr. Director, R&D

Running applications on the Cray XC30 4/12/2015

The Click2NetFPGA Toolchain. Teemu Rinta-aho, Mika Karlstedt, Madhav P. Desai USENIX ATC 12, Boston, MA, 13 th of June, 2012

Validating Java for Safety-Critical Applications

CSE 373: Data Structure & Algorithms Lecture 25: Programming Languages. Nicki Dell Spring 2014

ControlMaestro Upgrade guide

Control 2004, University of Bath, UK, September 2004

Performance monitoring with Intel Architecture

Basics in Energy Information (& Communication) Systems Virtualization / Virtual Machines

C++ Programming Language

Neptune. A Domain Specific Language for Deploying HPC Software on Cloud Platforms. Chris Bunch Navraj Chohan Chandra Krintz Khawaja Shams

Introduction to Cloud Computing

Outline. Outline. Why virtualization? Why not virtualize? Today s data center. Cloud computing. Virtual resource pool

X86-64 Architecture Guide

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Replication on Virtual Machines

5. Calling conventions for different C++ compilers and operating systems

14 Failover and Instant Failover

GTask Developing asynchronous applications for multi-core efficiency

Migrating Application Code from ARM Cortex-M4 to Cortex-M7 Processors

The BSN Hardware and Software Platform: Enabling Easy Development of Body Sensor Network Applications

SystemC Tutorial. John Moondanos. Strategic CAD Labs, INTEL Corp. & GSRC Visiting Fellow, UC Berkeley

12. Introduction to Virtual Machines

Security Vulnerability Notice

Intel TSX (Transactional Synchronization Extensions) Mike Dai Wang and Mihai Burcea

Multi-Threading Performance on Commodity Multi-Core Processors

Transcription:

THE VELOX STACK Patrick Marlier (UniNE) 06.09.20101

THE VELOX STACK / OUTLINE 2

APPLICATIONS Real applications QuakeTM Game server C code with OpenMP Globulation 2 Real-Time Strategy Game C++ code using STL and Boost library RMS-TM Recognition, Mining and Synthesis domain applications C and C++ code A compiler with TM is required Benchmarks TM-Unit Unit testing of TM library: observe semantic, correctness, performance STMBench7 C++ code Workloads with different operations on a set of graphs, indexes 3

TRANSACTIONAL MEMORY API Based on Draft Specification of Transactional Language Constructs for C++ Transaction transaction keyword Transaction attributes Transaction (default): transaction [[atomic]] Transaction with possible irrevocable statement: transaction [[relaxed]] Function attributes undoable function: transaction_safe attribute irrevocable function: transaction_unsafe attribute undoable function with possible irrevocability: transaction_callable attribute uninstrumented function: transaction_pure attribute Wrapped function: transaction_wrap attribute Extra statements transaction_cancel transaction_abort 4

C/C++ COMPILERS WITH TRANSACTIONAL MEMORY DTMC Using LLVM and LLVM-GCC LLVM front-end (LLVM-GCC) with TM-API support Transactification using IR instrumentation in LLVM Can generate multiple code paths using different instrumentation or none Can detect aligned memory accesses Link Time Optimization even for transactional accesses Research oriented for fast prototyping GCC Fully integrated in GCC : -fgnu-tm Complex instrumentation with «after read», «after write», «for write» barriers Should be integrated in GCC 4.7 It is still active so don t hesitate to contribute, it can make TM mainstream. All compilers are API and ABI compatible. 5

TRANSACTIONAL MEMORY ABI Based on Transactional Memory Application Binary Interface (TM ABI) Specification 1.0 extern long cntr; void increment() { transaction { cntr = cntr + 5; } } I want to Start a transaction Read an unsigned int at address X Code transformation (internal) extern long cntr; void increment() { _ITM_beginTransaction(...); long l_cntr = (long) _ITM_RU8(&cntr); l_cntr = l_cntr + 5; _ITM_WU8(&cntr, l_cntr); _ITM_commitTransaction(); } Transformed to ABI call _ITM_beginTransaction() _ITM_RU4(X) Write a long value V at address Y _ITM_WU8(Y, V) Switch to irrevocable mode Commit a transaction _ITM_changeTransactionMode() _ITM_commitTransaction() Standardized interface in order to plug any TM library that support TM-ABI 6

TRANSACTIONAL MEMORY LIBRARY TinySTM Timestamp and lock based STM Several flavors with various trade-offs Write-back or write-through Encounter- or commit-time locking Visible reads Serial and concurrent irrevocability ABI compatible Transaction-safe memory allocator Advanced Contention Management Kill transaction and lock stealing Extensible via external modules 7

HARDWARE TRANSACTIONAL MEMORY AMD Advanced Synchronization Facilities (ASF) Extension of the x86_64 ISA Designed to be implementable in high-volume microprocessors Near cycle-accurate simulator Early release instruction, monitor instructions, DCAS: MOV R8, RAX MOV R9, RBX retry: SPECULATE JNZ retry MOV RCX, 1 LOCK MOV R10, [mem1] LOCK MOV RBX, [mem2] CMP R8, R10 JNZ out CMP R9, RBX JNZ out LOCK MOV [mem1], RDI LOCK MOV [mem2], RSI XOR RCX, RCX out: COMMIT MOV RAX, R10 Example of Double Compare And Swap using ASF Start speculative region RCX indicates if DCAS fails or succeeds Load mem1 Load mem2 Compare mem1 with old value (R8) If not equal leave Compare mem2 with old value (R9) If not equal leave Store new value in mem1 Store new value in mem2 Indicates DCAS success Commit speculative region (restore RAX) 8

HARDWARE AND HYBRID TRANSACTIONAL MEMORY Hardware Transactional Memory using pure ASF implementation (with LTO) extern long cntr; void increment() { _ITM_beginTransaction(...); long l_cntr = (long) _ITM_RU8(&cntr); l_cntr = l_cntr + 5; _ITM_WU8(&cntr, l_cntr); _ITM_commitTransaction(); } SPECULATE JNZ retry_or_switch_to_stm LOCK MOV RCX, cntr ADD RCX, 5 LOCK MOV cntr, RCX COMMIT Hybrid Transactional Memory using AMD ASF Extension of TinySTM Takes advantage that ASF allows the use of non speculative accesses in speculative region STM and HTM transactions can execute concurrently Overcome HTM limitations 9

HARDWARE TRANSACTIONAL MEMORY AMD Advanced Synchronization Facilities (pure ASF) 10

AND A LOT MORE TM-Dietlibc: TM aware libraries Libc functions compatible with transactions TM-Sched: TM aware kernel Time slice extension, transaction scheduling, soft real-time support Gibraltar: Dynamic instrumentation of binary code No need to re-compile libraries and transform legacy code to transactional code PTLSIM-MM: HTM without need of explicit instrumentation BeeFarm: FPGA-based multiprocessor system-on-chip with HTM But also in the Java side: STMBench7: Benchmark for TM TM-Java: Language extension for TM support Deuce: JAVA STM run-time and instrumentation framework Several STM library: LSA, TL2 11

THANK YOU Questions? http://www.velox-project.eu/ 12