Compilers and Tools for Software Stack Optimisation



Similar documents
Wiggins/Redstone: An On-line Program Specializer

Monitoring, Tracing, Debugging (Under Construction)

Stacey D. Son Consultant/SRI International. BSDCan Developer s Summit 15-May-2013

Example of Standard API

Full and Para Virtualization

Multi-/Many-core Modeling at Freescale

Valgrind BoF Ideas, new features and directions

STLinux Software development environment

Android Development: a System Perspective. Javier Orensanz

Reducing Dynamic Compilation Latency

Mobile Application Development Android

RISC-V Software Ecosystem. Andrew Waterman UC Berkeley

RTOS Debugger for ecos

QEMU, a Fast and Portable Dynamic Translator

Lecture 1 Introduction to Android

Android Architecture. Alexandra Harrison & Jake Saxton

ANDROID DEVELOPER TOOLS TRAINING GTC Sébastien Dominé, NVIDIA

PERFORMANCE ANALYSIS OF KERNEL-BASED VIRTUAL MACHINE

GETTING STARTED WITH ANDROID DEVELOPMENT FOR EMBEDDED SYSTEMS

Reminders. Lab opens from today. Many students want to use the extra I/O pins on

Virtualization and Other Tricks.

Overview. The Android operating system is like a cake consisting of various layers.

Chapter 3 Operating-System Structures

How To Develop Android On Your Computer Or Tablet Or Phone

Evading Android Emulator

Running a Program on an AVD

Finding Performance and Power Issues on Android Systems. By Eric W Moore

Light-Weight and Resource Efficient OS-Level Virtualization Herbert Pötzl

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

IO Visor Project Overview

Visualizing gem5 via ARM DS-5 Streamline. Dam Sunwoo ARM R&D December 2012

Multi-GPU Load Balancing for Simulation and Rendering

Real-time Debugging using GDB Tracepoints and other Eclipse features

Thingsquare Technology

Chapter 14 Virtual Machines

Helping you avoid stack overflow crashes!

Android Virtualization from Sierraware. Simply Secure

Multi-core Programming System Overview

Intel Application Software Development Tool Suite 2.2 for Intel Atom processor. In-Depth

Lecture 5. User-Mode Linux. Jeff Dike. November 7, Operating Systems Practical. OSP Lecture 5, UML 1/33

I Control Your Code Attack Vectors Through the Eyes of Software-based Fault Isolation. Mathias Payer, ETH Zurich

Performance Counter. Non-Uniform Memory Access Seminar Karsten Tausche

An Introduction to Android. Huang Xuguang Database Lab. Inha University

An Easier Way for Cross-Platform Data Acquisition Application Development

The sphinx simulator project

COS 318: Operating Systems. Virtual Machine Monitors

Frysk The Systems Monitoring and Debugging Tool. Andrew Cagney

High Performance or Cycle Accuracy?

Sequential Performance Analysis with Callgrind and KCachegrind

Building Embedded Systems

Application-Level Debugging and Profiling: Gaps in the Tool Ecosystem. Dr Rosemary Francis, Ellexus

An Introduction to Android

Operating System Overview. Otto J. Anshus

RED HAT DEVELOPER TOOLSET Build, Run, & Analyze Applications On Multiple Versions of Red Hat Enterprise Linux

A JIT Compiler for Android s Dalvik VM. Ben Cheng, Bill Buzbee May 2010

Software Tracing of Embedded Linux Systems using LTTng and Tracealyzer. Dr. Johan Kraft, Percepio AB

Performance tuning Xen

Sequential Performance Analysis with Callgrind and KCachegrind

Overview of CS 282 & Android

Red Hat Linux Internals

Dynamically Translating x86 to LLVM using QEMU

Carlos Villavieja, Nacho Navarro Arati Baliga, Liviu Iftode

Embedded Systems. Review of ANSI C Topics. A Review of ANSI C and Considerations for Embedded C Programming. Basic features of C

How To Test Your Code

Sistemi ad agenti Principi di programmazione di sistema

Embedded Linux Platform Developer

A general-purpose virtualization service for HPC on cloud computing: an application to GPUs

Compromise-as-a-Service

Understand and Build Android Programming Environment. Presented by: Che-Wei Chang

Instruction Set Architecture (ISA)

Virtualization. Clothing the Wolf in Wool. Wednesday, April 17, 13

Operating System Structures

Experimental Evaluation of Distributed Middleware with a Virtualized Java Environment

The Advanced JTAG Bridge. Nathan Yawn 05/12/09

Betriebssysteme KU Security

Hacking your Droid ADITYA GUPTA

Some Future Challenges of Binary Translation. Kemal Ebcioglu IBM T.J. Watson Research Center

Virtualization Technologies

Friendly ARM MINI2440 & Dalvik Virtual Machine with Android

12. Introduction to Virtual Machines

x86 ISA Modifications to support Virtual Machines

evm Virtualization Platform for Windows

10 STEPS TO YOUR FIRST QNX PROGRAM. QUICKSTART GUIDE Second Edition

General Introduction

Development With ARM DS-5. Mervyn Liu FAE Aug. 2015

The programming language C. sws1 1

Mobile Devices - An Introduction to the Android Operating Environment. Design, Architecture, and Performance Implications

Performance Analysis of Android Platform

Virtualization of Linux based computers: the Linux-VServer project

An Implementation Of Multiprocessor Linux

Dongwoo Kim : Hyeon-jeong Lee s Husband

Microkernels, virtualization, exokernels. Tutorial 1 CSC469

ELEC 377. Operating Systems. Week 1 Class 3

Basics of VTune Performance Analyzer. Intel Software College. Objectives. VTune Performance Analyzer. Agenda

Topics. Introduction. Java History CS 146. Introduction to Programming and Algorithms Module 1. Module Objectives

Chapter 2 System Structures

Optimizing Linux Performance

ITG Software Engineering

Hypervisors. Introduction. Introduction. Introduction. Introduction. Introduction. Credits:

Introduction to Virtual Machines

Transcription:

Compilers and Tools for Software Stack Optimisation EJCP 2014 2014/06/20 christophe.guillon@st.com

Outline Compilers for a Set-Top-Box Compilers Potential Auto Tuning Tools Dynamic Program instrumentation System calls interposition

COMPILERS FOR A SET-TOP-BOX

Set-Top-Box at Home System on Chip (SoC) Video Encode & Decode Audio Encode & Decode Connectivity and Storage Drivers and Application Middlewares Rich Content Applications

Set-Top-Box System on Chip Applications ~20 programmable cores 2x ARM 4x ST231 2x STxP70 9 x SLIM 1 ARM Mali

Cross Compilers Applications Platform Development Compilers: GCC LLVM Open64

Embedded Compilers Applications Runtime Compilers: OpenCL/LLVM - AOT OpenGL ARM - AOT Kernel opts - AOT V8 JSC - JIT SFX JSC - JIT Adobe ASC - JIT Dalvik JC - JIT HotSpot JC - JIT Android NCI - AOT

Other Compilers Applications x86 Host Emulation Tools Compilers: QEMU (DBT) Valgrind (DBT) ARM FastSIM (SIM) HCE Compilers

COMPILERS POTENTIAL

Compilers Potential and Cost When? Static Ahead of Time Just In Time Increasing Knowledge Increasing Runtime Cost How Often? One shot Iterative Continuous Increasing Potential Increasing Potential Cost

Optimizations Space Exploration Perf v.s. Size Best Tradeoffs

Defaults are Uneffectives tuned choice for performance -O3 -O2 -O3 level desastrous -Os

Defaults Choices are Limited tuned choice for size -O3 -O2 -Os -Os choice for size whatever the perf

Optimizing Applications Choose the right domain specific language Fallback to C/C++ Ensure Functionality and Unitary Testing at First Develop Representative Benchmarks Profile against the Benchmarks Use your Domain Expertise Use your Target Architecture Expertise Exploit the Compiler potential Iterate, but preserve maintainability Should you care?

AUTO TUNING TOOLS

Motivations Automate the process of optimization Build, run, observe, iterate, Complement programmer s optimisation Reproduce as changes are made Explore more routes Exploit knowledge of compilers and architectures Explore transparently Inject optimisations without sources modification

Auto Tuning through Iterative Compilation Iterative Compilation Initial Compilation Run, observe & refine 2nd Variant Run, observe & refine Final Variant Multi variant Iterative Compilation Run, observe, select some and refine Run, observe, select some and refine Initial Variants set 2nd populations set Final populations set Run, observe & select best Auto Tuned Variant

Cost of Auto Tuning Populations sets (size P) 1st: base (Ox X fdo X lto) 2nd: inlining variants 3nd: loop transfo variants 4th: back-end variants Select 3 Select 3 Select 3 For each hot function (H functions) Complexity: 0(10.P.H) compilation+run Typical total time cost (P=100, H=10): => 10 000 compilation+run

Parallelizing Auto Tuning Populations sets (size P) 1st: base (Ox X fdo X lto) 2nd: inlining variants 3nd: loop transfo variants 4th: back-end variants For each hot function (H functions) Parallelized on computation farm (N nodes) Complexity: 0((P/N + 3.(3P/N)) * H) compilation+run Parallel compilation time (P=100, N=400, H=10) => 40 compilation+run (not 25)

Tools for Auto Tuning Achieving Seamless Auto Tuning on Computation Farm Compilation: audit build & produce variants * Run: on chip or emulation ** Profiling: instrumentation *** Observation: system perf/size **** For instance * with Proot + CARE ** with QEMU for instance *** with QEMU + dynamic instrumentation **** with bin/size and bin/perf tools for instance

Demo & Refs Ref to auto tuning video: http://youtu.be/l5qpwgiajus Auto tuning frameworks TACT ATOS (not yet disclosed) Genetic exploration Tools

DYNAMIC PROGRAM INSTRUMENTATION

Motivations Program performance analysis cycles count, profiles, call graphs, Profile driven compiler optimizations I-cache placement, edge profiling, data dep. estimation, Program debugging call traces, syscall traces, memory checks, Processor architecture analysis instructions usage, cache behavior,

Dynamic Binary Translation Program Loader Disassembler Compiler Target Buffer Manager Virtual Memory Manager System calls Emulator Debugger Bridge Instrumentation Interface Processes, Threads and Filesystem Monitor

Example with QEMU QEMU, a versatile open source tool used for: Platform emulation (Google SDK, ) Devices emulation (VirtualBox, ) Program emulation (ScratchBox, ) QEMU is a Dynamic Binary Translator (DBT) Hence it contains a compiler: TCG (Tiny Code Generator)

Program Instrumentation Initial Program A B Initial program flow A B B A Instrumented program flow A T B T B T A Instrumented Program Ins A Execution environment Ins B 1 2 T Trace Output Hello A Hello B Hello A Hello B

Binary Translation (i.e. QEMU ARM -> x86) 1 Host Execution Environment (x86) QEMU Guest program flow A B B A Guest (ARM) Program A B 6 2 A Generic TCG IR (LIR) 4 5 A 10 8 B 9 3 7 B

QEMU guests & hosts Alpha Arm Cris X86 Lm32 Microblaze Mips Ppc S390x Sh4 Sparc Unicore32 ST200 TCG IR neutral Arm Hppa X86 Ia64 Mips Ppc Ppc64 S390 Sparc

Execution Time Instrumentation 1 Host Execution Environment QEMU Guest Program A 2 3 Ins A 4 6 Exec event TCG Plugins 5 Ins A 8 T 7 Initial program flow A A Instrumented emulated program flow Ins plu gin T A Ins plu gin T A

Instruction Count Plugin /* A simple plugin for counting executed guest instructions. * Usage : cc shared o icount.so icount.c * qemu-i386 tcg-plugin./icount.so the_program */ #include <stdint.h> #include <stdio.h> #include «qemu_tpi.h» uint64_t total_icount; /* Instruction count updated at each block execution. */ void qemu_tpi_block_execution_event(qemu_tpi_t *tpi) { total_icount += TPI_tb_icount(tpi); } void qemu_tpi_fini(qemu_tpi_t *tpi) { printf("instructions: %"PRIu64"\n", total_icount); }

Translation Time Instrumentation Host Execution Environment 1 QEMU Guest Program A Ins 2 3 A 4 TCG Plugins Translation event 5 Ins A 7 6 T Initial program flow A A Instrumented emulated program flow Ins T A Ins T A

Inlined Instruction Count Plugin /* The translation event based code for the icount.c plugin. */ void qemu_tpi_block_translation_event(qemu_tpi_t *tpi) { /* Instruction count update inlined at each block translation. * Code is generated into the TCG IR buffer directly. */ TPIv_ptr ptr = TPI_const_ptr(tpi, &total_icount); TPIv_i64 total = TPI_temp_new_i64(tpi); TPI_gen_ld_i64(tpi, total, ptr, 0); TPIv_i64 icount = TPI_const_i64(tpi, TPI_tb_icount(tpi)); TPI_gen_add_i64(tpi, total, total, icount); TPI_temp_free_i64(tpi, icount); TPI_gen_st_i64(tpi, total, ptr, 0); TPI_temp_free_i64(tpi, total); TPI_temp_free_ptr(tpi, ptr); }

Profile Plugin Example $ qemu-i386 tcg-plugin profile.so sha1-386.exe SHA1=15dd99a1991e0b3826fede3deffc1feba42278e6 Instrs bytes blocks symbol 106497664 364229691 1792028 SHA1Transform 571297 1815640 133175 SHA1Update 961 3399 168 SHA1Final 23590 85199 6141 main 18 40 3 libc_csu_init 16 43 5.init 2093 12552 2087.plt 55 146 10.text 12 28 3.fini 175102 504418 50840 unknown/libc.so.6

Instrumentation Overhead 30 25 Emulation of i386 code on a Core 2 Duo @2.0Mhz SPECINT2000 20 15 10 5 native qemu inline offline 0 Emulation cost is x15 (qemu) Instrumentation overhead is 30% (offline) for a I-count plugin Overhead reduced to 3% with translation time interface (inline)

Demo & Refs Ref to live demo Dynamic Binary Translators and Intrumentation Tools Valgrind Pine DynamoRIO QEMU

SYSTEM CALLS INTERPOSITION

Motivation Intercept communications between user programs and the kernel Modify the behavior (side effects) of the user program without recompiling Take full control (debugger)

Linux ptrace inteface Attach fork(), ptrace(getpid(), ) and exec(), or ptrace(program_pid, ) Monitor wait() and catch TRAP signal Handle get syscall() info and program state, and modify syscall()/state and continue

Interposition and Context Switch All syscall() from the tracee are notified just before start just before end tracer attach enter exit enter exit tracee syscall 1 syscall 2 4 context switchs per syscall Idle Running

Interposition Example Emulate a change of the / (root) location equivalent to chroot() but in user-land tracee behaves as in / but in your /mydir Trace open(/file_path) (and all calls with paths) Tracee: open( /etc/passwd ) Tracer: /etc/passwd -> /mydir/etc/passwd Well, it s not that easy to fake passwords, why? Also what about the loader, /mydir/lib/ld.so?

Applications Debugging gdb, strace Instrumentation Valgrind Cross filesystem execution QEMU, Proot Archiving CARE, CDE

CARE Comprehensive Archiving For each path seen in a syscall if ( not_seen(path) && not_existing(path)) copy path to /archive/path For Reproducible Execution For each path seen in a syscall transform path into /archive/path

Demo and Refs See CARE demo video http://youtu.be/xipzm66_3w0 CARE: http://reproducible.io PRoot: http://proot.me