Runtime Code Generation for Code Polymorphism



Similar documents
Hardware Acceleration for Just-In-Time Compilation on Heterogeneous Embedded Systems

Secure data processing: Blind Hypervision

Code generation under Control

Side Channel Analysis and Embedded Systems Impact and Countermeasures

On-Line Diagnosis using Orthogonal Multi-Tone Time Domain Reflectometry in a Lossy Cable

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Embedded System Hardware - Processing (Part II)

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

The new 32-bit MSP432 MCU platform from Texas

Wiggins/Redstone: An On-line Program Specializer

On Demand Loading of Code in MMUless Embedded System

Instruction Set Design

Computer Architecture TDTS10

SOC architecture and design

Instruction Set Architecture

Administration. Instruction scheduling. Modern processors. Examples. Simplified architecture model. CS 412 Introduction to Compilers

Triathlon of Lightweight Block Ciphers for the Internet of Things

Interpreters and virtual machines. Interpreters. Interpreters. Why interpreters? Tree-based interpreters. Text-based interpreters

VLIW Processors. VLIW Processors

Speeding up GPU-based password cracking

Rethinking SIMD Vectorization for In-Memory Databases


MAQAO Performance Analysis and Optimization Tool

Java Card. Smartcards. Demos. . p.1/30

Software implementation of Post-Quantum Cryptography

ARM Microprocessor and ARM-Based Microcontrollers

CS521 CSE IITG 11/23/2012

Coherent sub-thz transmission systems in Silicon technologies: design challenges for frequency synthesis

A Predictive Model for Cache-Based Side Channels in Multicore and Multithreaded Microprocessors

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

UTBB-FDSOI 28nm : RF Ultra Low Power technology for IoT

Software based Finite State Machine (FSM) with general purpose processors

The ARM Architecture. With a focus on v7a and Cortex-A8

FPGA area allocation for parallel C applications

Pipelining Review and Its Limitations

İSTANBUL AYDIN UNIVERSITY

Freescale Semiconductor, I

Sleuth: Automated Verification of Software Power Analysis Countermeasures

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

NVIDIA Tools For Profiling And Monitoring. David Goodwin

Keil C51 Cross Compiler

STM32 F-2 series High-performance Cortex-M3 MCUs

Binary search tree with SIMD bandwidth optimization using SSE

Comparative Performance Review of SHA-3 Candidates

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Microsemi Security Center of Excellence

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

A Performance Study of Load Balancing Strategies for Approximate String Matching on an MPI Heterogeneous System Environment

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

COMPUTER ORGANIZATION ARCHITECTURES FOR EMBEDDED COMPUTING

Networking Virtualization Using FPGAs

Embedded Software development Process and Tools: Lesson-4 Linking and Locating Software

«A 32-bit DSP Ultra Low Power accelerator»

Optimizing Code for Accelerators: The Long Road to High Performance

CEA LIST activity on Cable Monitoring and Diagnosis

AES Power Attack Based on Induced Cache Miss and Countermeasure

Building Blocks for PRU Development

Fast Software AES Encryption

Smart Cards a(s) Safety Critical Systems

Dongwoo Kim : Hyeon-jeong Lee s Husband

Using the CoreSight ITM for debug and testing in RTX applications

NFQL: A Tool for Querying Network Flow Records [6]

Lightweight Cryptography From an Engineers Perspective

Architectures and Platforms

Introduction to Virtual Machines

Optimizing Parallel Reduction in CUDA. Mark Harris NVIDIA Developer Technology

Performance Investigations. Hannes Tschofenig, Manuel Pégourié-Gonnard 25 th March 2015

High Performance or Cycle Accuracy?

Performance Analysis and Optimization Tool

Energy-Efficient, High-Performance Heterogeneous Core Design

Fastboot Techniques for x86 Architectures. Marcus Bortel Field Application Engineer QNX Software Systems

Fast Arithmetic Coding (FastAC) Implementations

GPU Hardware Performance. Fall 2015

GameTime: A Toolkit for Timing Analysis of Software

Research Statement. Hung-Wei Tseng

ARM Processors and the Internet of Things. Joseph Yiu Senior Embedded Technology Specialist, ARM

AMD GPU Architecture. OpenCL Tutorial, PPAM Dominik Behr September 13th, 2009

Fast Matching of Binary Features

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

High-speed image processing algorithms using MMX hardware

Driving force. What future software needs. Potential research topics

Operating Systems. Virtual Memory

UM0586 User manual. STM32 Cryptographic Library. Introduction

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

IMPROVING PERFORMANCE OF RANDOMIZED SIGNATURE SORT USING HASHING AND BITWISE OPERATORS

Institut d Electronique et des Télécommunications de Rennes. Equipe Image

Managing Adaptability in Heterogeneous Architectures through Performance Monitoring and Prediction

CHAPTER 7: The CPU and Memory

LSN 2 Computer Processors

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

A Multi-layered Domain-specific Language for Stencil Computations

Physical Security: Status and Outlook

Transcription:

Runtime Code Generation for Code Polymorphism Workshop on Runtime Code Generation for Secured Embedded Devices Damien Couroussé 2015-12-03 www.cea.fr

Runtime Cliquez pour Code modifier Generation: le style Motivation du titre Pitch: some code optimisations are not accessible to static compilers Unknown data Sometimes, the hardware is also unknown, at least partially Delay code optimisations at runtime Constant propagation, elimination of dead code, Strength reduction, Loop unrolling, Instruction scheduling, etc. Drive code performance by runtime-changing constraints Bounds : power / energy / execution time Heterogeneous cores : accelerators, specialised instructions (runtime) code specialisation Runtime code generation for unusual purposes (e.g. security)? (runtime) code polymorphism CEA. All rights reserved DACLE Division January 2014 2

Cliquez pour modifier le style du Outline titre Tools for runtime code generation in embedded systems for performance for security CEA. All rights reserved DACLE Division January 2014 3

Cliquez Approaches pour modifier for code le specialisation style du titre CEA. All rights reserved DACLE Division January 2014 4

Cliquez pour modifier Code generation le style du flow titre CEA. All rights reserved DACLE Division January 2014 5

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr ]# } Type int_t int 32 #(vec_len) Alloc int_t v lw v, vec_addr add v, v, #(val) sw vec_addr, v degoal DSL: Source to source converted to standard C code Standard C code CEA. All rights reserved DACLE Division January 2014 6

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr ]# } Type int_t int 32 #(vec_len) Alloc int_t v lw v, vec_addr add v, v, #(val) sw vec_addr, v When executed Program memory: ldr r1, [r0] add r1, #1 str r1, [r0] add r0, #4 ldr r2, [r0] add r2, #1 str r2, [r0] add r0, #4 CEA. All rights reserved DACLE Division January 2014 7

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr Type int_t int 32 #(vec_len) Alloc int_t v Interface: pointer to code buffer and I/O registers Type definitions and variable allocations ]# } lw v, vec_addr add v, v, #(val) sw vec_addr, v Instructions CEA. All rights reserved DACLE Division January 2014 8

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr ]# } Type int_t int 32 #(vec_len) Alloc int_t v lw v, vec_addr add v, v, #(val) sw vec_addr, v Determined by the application and fixed in the final machine code CEA. All rights reserved DACLE Division January 2014 9

Cliquez pour modifier Example le of style degoal du code titre Simple program example: void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr Type int_t int 32 #(vec_len) Alloc int_t v ]# } lw v, vec_addr add v, v, #(val) sw vec_addr, v Inline run-time constants CEA. All rights reserved DACLE Division January 2014 10

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition add v, v, #(val) void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr 4 90 1024 Type int_t int 32 #(vec_len) Alloc int_t v ]# } lw v, vec_addr add v, v, #(val) sw vec_addr, v Inline run-time constants Run #1 Run #2 Run #N ADD R0, #4 ADD_T2 R0, R0 #1024 ADD R0, #90 CEA. All rights reserved DACLE Division January 2014 11

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition Type int_t int 32 #(vec_len) void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr 1 2 4 Type int_t int 32 #(vec_len) Alloc int_t v ]# } lw v, vec_addr add v, v, #(val) sw vec_addr, v Inline run-time constants Run #1 Run #2 Run #N ADD R0, #4 Scalar/SISD ADD R0, #4 ADD R1, #4 VADD Q0, #4 Vector/SIMD (if enabled) CEA. All rights reserved DACLE Division January 2014 12

Cliquez pour modifier Example le of style degoal du code titre Simple program example: vector addition void gen_vector_add(void *buffer, int vec_len, int val) { #[ Begin buffer Prelude vec_addr Type int_t int 32 1 Alloc int_t r ]# int i; for (i = 0; i < vec_len; ++i) { #[ lw r, vec_addr add r, r, #(val) sw vec_addr, r add vec_addr, #(sizeof(int)) ]# } } Loop unrolling: Copy-paste a block of instructions CEA. All rights reserved DACLE Division January 2014 13

degoal Portable DSL Cliquez pour modifier le style Features du titre Complex variables (registers) Typed Vector support, dynamically sized Mix runtime data binary code Easily extended with domain-specific or hardwarespecific instructions (e.g. multimedia) Results Auto-adaptative dynamic libraries Runtime portable optimization Multiple performance metrics: Faster generated code Smaller generated code Code generation 3 order of magnitude faster than JIT/LLVM Code generators 4 orders of magnitude smaller than JITs/LLVM CEA. All rights reserved DACLE Division January 2014 14

Cliquez degoal pour modifier supported le architectures style du titre ARCHITECTURE STATUS FEATURES ARM32 ARM Cortex-A, Cortex-M [Thumb-2, VFP, NEON] SIMD, [IO/OoO] STxP70 [including FPx] (STHORM / P2012) SIMD, VLIW (2-way) K1 (Kalray MPPA) SIMD, VLIW (5-way) PTX (Nvidia GPUs) MIPS 32-bits MSP430 (TI microcontroler) Up to < 1kB RAM CROSS CODE GENERATION supported (e.g. generate code for STxP70 from an ARM Cortex-A) [IO/OoO]: Instruction scheduling for in-order and out-of-order cores CEA. All rights reserved DACLE Division January 2014 15

Cliquez pour modifier le style du Outline titre Tools for runtime code generation in embedded systems for performance Application use cases for security Video compression (Itanium) Linear algebra on GPUs (Nvidia) Arithmetic computing on Micro-controllers for the IoT (ARM Cortex-M, TI MSP430) Auto-tuning for embedded applications (ARM Cortex-M + VFP + NEON) Memory allocation in MPSoCs CEA. All rights reserved DACLE Division January 2014 16

Cliquez pour modifier le style du Outline titre Tools for runtime code generation in embedded systems for performance for security Runtime code generation as a Software Protection for Embedded Systems against Physical Attacks CEA. All rights reserved DACLE Division January 2014 17

Cliquez pour modifier Code le polymorphism style du titre Definition Regularly changing the behavior of a (secured) component, at runtime, while maintaining unchanged its functional properties, with runtime code generation What for? Protection against reverse engineering of SW the secured code is not available before runtime the secured code regularly changes its form (code generation interval ω) Protection against physical attacks polymorphism changes the spatial and temporal properties of the secured code: side channel fault attacks combine with usual SW protections against focused attacks Compatible with State-of-the-Art HW SW Countermeasures How? degoal: runtime code generation for embedded systems fast code generation tiny memory footprint: proof of concept on TI's MSP430 (512 B RAM) CEA. All rights reserved DACLE Division January 2014 18

State of the Art Cliquez Countermeasures pour modifier le polymorphism style du titre Random register renaming [May 2011a, Agosta 2012] Out-of-Order execution : At the instruction level [May 2011b, Bayrak 2012] At the control flow level [Agosta 2014, Crane 2015] Execution of dummy instructions [Ambrose 2007, Coron 2009, Coron 2010] A few proof-of-concept implementations, not suitable for embedded devices [Amarilli 2011, Amarilli 2011, Agosta 2012] Our approach Pure software portability, genericity Combination of all the polymorphic levers found in the state of the art Modest overhead (execution time memory footprint) With runtime code generation CEA. All rights reserved DACLE Division January 2014 19

Polymorphic Cliquez pour runtime modifier code le style generation du titre inputs Secured Component outputs alea CEA. All rights reserved DACLE Division January 2014 20

Polymorphic Cliquez pour runtime modifier code le style generation du titre inputs Secured Component outputs alea inputs Runtime Code Generator Polymorphic Instance outputs alea Opportunities for polymorphic code generation Random register allocation Random instruction selection Instruction shuffling Insertion of noise instructions CEA. All rights reserved DACLE Division January 2014 21

Demo 8-bit AES STM32 (Cortex-M3)

Cliquez Application pour modifier to AES: AddRoundKey() le style du titre void addroundkey_compilette( cdg_insn_t* code, uint8_t* key_addr, uint8_t *state_addr) { #[ Begin code Prelude Type reg32 int 32 Alloc reg32 state, key, i } ]#; mv i, #(16) loop: rtn End sub i, i, #(1) lb state, @(#(state_addr) + i) // state = state_addr[i] lb key, @(#(key_addr) + i) xor state, key sb @(#(state_addr) + i), state bneq loop, i, #(0) // key= key_addr[i] CEA. All rights reserved DACLE Division January 2014 23

Cliquez pour modifier le style du Demo titre Random register allocation Greedy algorithm: each register is allocated among one of the free registers remaining Has an impact on: The management of the context (ABI) Instruction selection CEA. All rights reserved DACLE Division January 2014 24

Cliquez pour modifier le style du Demo titre Instruction selection Replace an instruction by a semantically equivalent sequence of one or several instructions Select the sequence in a list of equivalences Examples: CEA. All rights reserved DACLE Division January 2014 25

Cliquez pour modifier le style du Demo titre Instruction shuffling Reorder instructions, but do not break the semantics of the code! Defs Uses Do not take into account result latency and issue latency Special treatments for special instructions. E.g. branch instructions CEA. All rights reserved DACLE Division January 2014 26

Cliquez pour modifier le style du Demo titre Insertion of noise Insertion of fake instructions, that have no effect on the result of the program. Controlled by a probability of insertion p Can insert any instruction: nop Arithmetic (add, xor ) Memory accesses (lw, lb, ) Power hungry ones (mul, mac ) CEA. All rights reserved DACLE Division January 2014 27

Cliquez pour AES modifier 8-bit. Memory le style footprint du titre Polymorphic code generation library AES.cdg degoal + arm-noneeabi-gcc -Wl,--gc-sections Binary image Runtime code generation Polymorphic instance of AES Polymorphic code generation library (64 variants) Average: Min: Max: 22395,5 bytes 19632 bytes 25208 bytes Binary Image: size increase vs the reference implementation Reference: 73279 bytes Polymorphic version, max size: 76543 bytes. Max difference: 3264 bytes Size of the polymorphic instance Size increase ~ 1.30 Variable overhead. Highly depends on the code generation settings CEA. All rights reserved DACLE Division January 2014 28

Cliquez AES pour 8-bit. modifier Performance le style overhead du titre Execution times (cycles) Min Max Avg. Reference AES 6385 6385 6385 Code generator 5671 12910 9345 Polymorphic instance 7185 9745 8303 Execution time overhead Interval of code generation COGITO [Agosta, 2012] ( : extrapolation) 1 2,76 398 5 1,59 80,4 20 1,37 8,94 100 1,31 5,00 2000 1,30 1,27 11600 1,30 1,10 Evaluation platform: ARM Cortex-M3, arm-none-eabi-gcc v4.8.1 CEA. All rights reserved DACLE Division January 2014 29

Cliquez pour modifier le style Conclusion du titre degoal: build runtime code generators (a.k.a compilettes) Code specialisation on runtime data values Code specialisation on characteristics of the hardware Applicable to embedded systems constrained wrt computing power and memory resources Polymorphic runtime code generation Generic software countermeasure The generated code can exploit hardware characteristics available Compatible with state-of-the-art software countermeasures Variation of the observation in time and space CEA. All rights reserved DACLE Division January 2014 30

Centre de Grenoble 17 rue des Martyrs 38054 Grenoble Cedex Centre de Saclay Nano-Innov PC 172 91191 Gif sur Yvette Cedex

Cliquez pour Bibliographic modifier le style references du titre [Agosta 2012] Agosta, G., Barenghi, A., Pelosi, G.: A code morphing methodology to automate power analysis countermeasures. In: DAC. pp. 77 82. ACM (2012) [Agosta 2014] Agosta, G., Barenghi, A., Pelosi, G., and Scandale, M.. 2014. A Multiple Equivalent Execution Trace Approach to Secure Cryptographic Embedded Software. DAC 2014. [Ambrose 2007] Ambrose, Jude Angelo, Roshan G. Ragel, and Sri Parameswaran. "A smart random code injection to mask power analysis based side channel attacks." Hardware/Software Codesign and System Synthesis (CODES+ ISSS), 2007 5th IEEE/ACM/IFIP International Conference on. IEEE, 2007. [Amarilli 2011] Amarilli, A., Müller, S., Naccache, D., Page, D., Rauzy, P., Tunstall, M.: Can Code Polymorphism Limit Information Leakage? In: WISTP (2011) [Bayrak 2012] Bayrak, A.G., Velickovic, N., Ienne, P., Burleson, W.: An architecture-independent instruction shuffler to protect against side-channel attacks. TACO (2012) [Coron 2009] Coron, J. S., Kizhvatov, I. (2009). An efficient method for random delay generation in embedded software. In Cryptographic Hardware and Embedded Systems-CHES 2009 (pp. 156-170). Springer Berlin Heidelberg. [Coron 2010] Coron, J. S., Kizhvatov, I. (2010). Analysis and improvement of the random delay countermeasure of CHES 2009. In Cryptographic Hardware and Embedded Systems, CHES 2010 (pp. 95-109). Springer Berlin Heidelberg. [Crane 2015] Crane, S., Homescu, A., Brunthaler, S., Larsen, P., Franz, M. (2015). Thwarting cache side-channel attacks through dynamic software diversity. In Network And Distributed System Security Symposium, NDSS (Vol. 15). [May 2001a] May, D., Muller, H., Smart, N.: Random register renaming to foil DPA. In: CHES, vol. LNCS 2162, pp. 28 38. Springer (2001) [May 2001b] May, D., Muller, H.L., Smart, N.P.: Non-deterministic processors. In: ACISP 01. pp. 115 129. Springer (2001) [Morpho 2013] Protection of Applets against hidden-channel analyses. US Patent 2013/0312110 A1 CEA. All rights reserved DACLE Division January 2014 32