Fast Software AES Encryption

Similar documents

Fast Implementations of AES on Various Platforms

High-Performance Modular Multiplication on the Cell Processor

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

Implementation of Full -Parallelism AES Encryption and Decryption

Efficient Software Implementation of AES on 32-bit Platforms

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

IJESRT. [Padama, 2(5): May, 2013] ISSN:

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Hardware Implementation of AES Encryption and Decryption System Based on FPGA

Speeding up GPU-based password cracking

Design and Verification of Area-Optimized AES Based on FPGA Using Verilog HDL

Next Generation GPU Architecture Code-named Fermi

Automata Designs for Data Encryption with AES using the Micron Automata Processor

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

An Instruction Set Extension for Fast and Memory-Efficient AES Implementation

The Advanced Encryption Standard: Four Years On

AES Power Attack Based on Induced Cache Miss and Countermeasure

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

GPGPU Computing. Yong Cao

SeChat: An AES Encrypted Chat

Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan

The Advanced Encryption Standard (AES)

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Enhancing Advanced Encryption Standard S-Box Generation Based on Round Key

Efficient Software Implementation of AES on 32-Bit Platforms

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

A PPENDIX H RITERIA FOR AES E VALUATION C RITERIA FOR

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

CS 758: Cryptography / Network Security

CSCE 465 Computer & Network Security

Introduction to GPU Programming Languages

Computer Graphics Hardware An Overview

Block Cipher Speed and Energy Efficiency Records on the MSP430: System Design Trade-Offs for 16-bit Embedded Applications

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

An Energy Efficient ATM System Using AES Processor

Introduction to Cloud Computing

New AES software speed records

FPGA IMPLEMENTATION OF AN AES PROCESSOR

NVIDIA CUDA Software and GPU Parallel Computing Architecture. David B. Kirk, Chief Scientist

Rijndael Encryption implementation on different platforms, with emphasis on performance

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

Introduction to GPU Architecture

NVIDIA GeForce GTX 580 GPU Datasheet

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

synthesizer called C Compatible Architecture Prototyper(CCAP).

GPU Programming Strategies and Trends in GPU Computing

Speeding Up RSA Encryption Using GPU Parallelization

Implementation and Design of AES S-Box on FPGA

Separable & Secure Data Hiding & Image Encryption Using Hybrid Cryptography

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Improving Performance of Secure Data Transmission in Communication Networks Using Physical Implementation of AES

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Design and Implementation of Asymmetric Cryptography Using AES Algorithm

GPU Parallel Computing Architecture and CUDA Programming Model

ELECTENG702 Advanced Embedded Systems. Improving AES128 software for Altera Nios II processor using custom instructions

GPUs for Scientific Computing

GeoImaging Accelerator Pansharp Test Results

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Software implementation of Post-Quantum Cryptography

Optimizing Code for Accelerators: The Long Road to High Performance

Survey on Enhancing Cloud Data Security using EAP with Rijndael Encryption Algorithm

Secret File Sharing Techniques using AES algorithm. C. Navya Latha Garima Agarwal Anila Kumar GVN

GPU(Graphics Processing Unit) with a Focus on Nvidia GeForce 6 Series. By: Binesh Tuladhar Clay Smith

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Combining Mifare Card and agsxmpp to Construct a Secure Instant Messaging Software

7a. System-on-chip design and prototyping platforms

ultra fast SOM using CUDA

New AES software speed records

NVIDIA VIDEO ENCODER 5.0

22S:295 Seminar in Applied Statistics High Performance Computing in Statistics

Generations of the computer. processors.

ARM Microprocessor and ARM-Based Microcontrollers

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

Hash Function JH and the NIST SHA3 Hash Competition

Design and Implementation of Low-area and Low-power AES Encryption Hardware Core

Introduction to GPU hardware and to CUDA

CLOUD GAMING WITH NVIDIA GRID TECHNOLOGIES Franck DIARD, Ph.D., SW Chief Software Architect GDC 2014

Performance Investigations. Hannes Tschofenig, Manuel Pégourié-Gonnard 25 th March 2015

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Triathlon of Lightweight Block Ciphers for the Internet of Things

Optimization. NVIDIA OpenCL Best Practices Guide. Version 1.0

LSN 2 Computer Processors

Stream Processing on GPUs Using Distributed Multimedia Middleware

FPGA-based Multithreading for In-Memory Hash Joins

Binary search tree with SIMD bandwidth optimization using SSE

SierraVMI Sizing Guide

~ Greetings from WSU CAPPLab ~

White Paper. Shay Gueron Intel Architecture Group, Israel Development Center Intel Corporation

Fast Hash-Based Signatures on Constrained Devices

Transcription:

Calhoun: The NPS Institutional Archive Faculty and Researcher Publications Faculty and Researcher Publications 2010 Fast Software AES Encryption Osvik, Dag Arne Proceedings FSE'10 Proceedings of the 17th international conference on Fast software encryption, pp. 75-93 http://hdl.handle.net/10945/40195

Fast Software AES Encryption Dag Arne Osvik 1 Joppe W. Bos 1 Deian Stefan 2 David Canright 3 1 Laboratory for Cryptologic Algorithms, EPFL, CH-1015 Lausanne, Switzerland 2 Dept. of Electrical Engineering, The Cooper Union, NY 10003, New York, USA 3 Applied Math., Naval Postgraduate School, Monterey CA 93943, USA 1 / 20

Fast Software AES Encryption Dag Arne Osvik 1 Joppe W. Bos 1 Deian Stefan 2 David Canright 3 Presented by: Onur Özen1 1 Laboratory for Cryptologic Algorithms, EPFL, CH-1015 Lausanne, Switzerland 2 Dept. of Electrical Engineering, The Cooper Union, NY 10003, New York, USA 3 Applied Math., Naval Postgraduate School, Monterey CA 93943, USA 1 / 20

Outline Introduction Motivation Related Work Contributions The Advanced Encryption Standard Target Platforms The 8-bit AVR Microcontroller The 32-bit Advanced RISC Machine The Cell Broadband Engine Architecture The NVIDIA Graphics Processing Unit Conclusions 2 / 20

Motivation Advanced Encryption Standard Rijndael announced in 2001 as the AES. One of the most widely used cryptographic primitives. IP Security, Secure Shell, Truecrypt RFID and low-power authentication methods Key tokens, RF-based Remote Access Control Many intensive efforts to speed up AES in both hard- and software. Related work E. Käsper and P. Schwabe. Faster and Timing-Attack Resistant AES-GCM. CHES 2009. P. Bulens, et al. Implementation of the AES-128 on Virtex-5 FPGAs AFRICACRYPT 2008. O. Harrison and J. Waldron. Practical Symmetric Key Cryptography on Modern Graphics Hardware. USENIX Sec. Symp. 2008. S. Rinne, et al. Performance Analysis of Contemporary Light-Weight Block Ciphers on 8-bit Microcontrollers. EED 2007. K. Shimizu, et al. Cell Broadband Engine Support for Privacy, Security, and Digital Rights Management Applications. 2005. 3 / 20

Contributions New software speed records for various architectures Target both ends of the performance spectrum low-end microcontrollers high-end architectures processing many streams simultaneously Specific optimizations for each platform 4 / 20

Contributions New software speed records for various architectures 8-bit AVR microcontrollers compact, efficient single stream AES version 32-bit Advanced RISC Machine one of the most widely used processors for mobile applications efficient single stream AES version (T -table based) Synergistic processing elements of the Cell broadband engine widely available in the PS3 video game console single instruction multiple data (SIMD) architecture process 16 streams in parallel (bytesliced) NVIDIA graphics processing unit first AES decryption implementation single instruction multiple threads (SIMT) architecture process thousand of streams in parallel (T -table based) 4 / 20

The Advanced Encryption Standard Fixed block length version of the Rijndael block cipher Key-iterated block cipher with 128-bit state and block length Support for 128-, 192-, and 256-bit keys Strong security properties only related key attacks on AES-192 and AES-256 no attacks on full AES-128 Very efficient in hardware and software (Created by Jeff Moser) 5 / 20

Round Function Steps 1 SubBytes: 3 MixColumns: a 00 a 01 a 02 a 03 S-box b 00 b 01 b 02 b 03 2 3 1 1 1 2 3 1 1 1 2 3 3 1 1 2 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 00 a 01 a 02 a 03 b 00 b 01 b 02 b 03 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 2 ShiftRows: 4 AddRoundKey: a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 k 00 a 01 k 02 k 03 b 00 b 01 b 02 b 03 a 10 a 11 a 12 a 13 a 11 a 12 a 13 a 10 a 10 a 11 a 12 a 13 k 10 a 11 k 12 k 13 b 10 b 11 b 12 b 13 a 20 a 21 a 22 a 23 a 22 a 23 a 20 a 21 a 20 a 21 a 22 a 23 k 20 a 21 k 22 k 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 a 33 a 30 a 31 a 32 a 30 a 31 a 32 a 33 k 30 k 31 k 32 k 33 b 30 b 31 ab 32 b 33 6 / 20

Efficient T -table implementation Combine SubBytes, ShiftRows, MixColumns using the standard T -table approach. Update each column (0 j 3): [s j0, s j1, s j2, s j3 ] T = T 0 [a c0 0] T 1 [a c1 1] T 2 [a c2 2] T 3 [a c3 3] k j, where each T i is 1KB and k j is the jth column of the round key. T i s are rotations of one table. Example (j = 0): a 00 a 01 a 02 a 03 k 00 s 00 s 01 s 02 s 03 a 10 a 11 a 12 a 13 k 10 s 10 s 11 s 12 s 13 T 0 T 1 T 2 T 3 a 20 a 21 a 22 a 23 k 20 s 20 s 21 s 22 s 23 a 30 a 31 a 32 a 33 k 30 s 30 s 31 s 32 s 33 7 / 20

Target Platforms Atmel AVRs DigiKey Cell B.E ARM NVIDIA GPUs IBM Systems 8 / 20

Microcontrollers AVR Modified Harvard architecture 32 8-bit registers 3 16-bit pointer registers Registers are addressable Mostly single-cycle execution 1 2 to 384KB flash memory 0 to 32KB SRAM 0 to 4KB EEPROM ARM Harvard/von Neumann architecture 16 32-bit registers available Conditional execution Optional modification of flags Mostly single-cycle execution Load/store multiple registers Inline barrel shifter 9 / 20

Microcontrollers Results Reference Key Encryption Decryption Code size scheduling (cycles) (cycles) (bytes) 8-bit AVR microcontroller Rinne et al. precompute 3,766 4,558 3,410 Poettering (Fast) precompute 2,474 3,411 3,098 Poettering (Furious) precompute 2,739 3,579 1,570 Poettering (Fantastic) precompute 4,059 4,675 1,482 Otte precompute 2,555 6,764 2,070 Otte precompute 2,555 3,193 2,580 New - Low RAM precompute 2,153 2,901 1,912 New - Fast precompute 1,993 2,901 1,912 32-bit ARM microprocessor Atasu et. al precompute 639 638 5,966 New precompute 544-3,292 10 / 20

Cell Broadband Engine Architecture Use the Synergistic Processing Elements runs at 3.2 GHz 128-bit wide SIMD-architecture two instructions per clock cycle (dual pipeline) in-order processor rich instruction set: i.e. all distinct binary operations f : {0, 1} 2 {0, 1} are present. Expensive QS22 Blade Servers (2 8 Es) PCI express cards: cryptographic accelerator ({ 4, 8 } Es) Cheap PS3 video game console (6 Es) 11 / 20

Dual pipeline Example Balancing pipeline loads may lead to more instructions but fewer cycles. 16-way SIMD xtime: a x (a F 2 8 = F 2 [x]/(x 8 + x 4 + x 3 + x + 1)) Similar techniques for 16-way SIMD S-box lookups. x x and cmp-gt rotate-left 2 2 4 4 shift-left and and gather 4 4 2 xor form-mask 2 4 xtime (x) rotate-right 2 4 and xor 2 xtime (x) 2 4 even, 1 odd latency: 8 cycles 3 even, 4 odd latency: 20 cycles 12 / 20

U Results Comparison 20 17.1 Cycles/byte 15 10 5 12.4 11.3 13.9 K. Shimizu et al. 2005 This article 0 Encryption Decryption Throughput per PS3: 13.8 (encryption) and 10.8 Gbps (decryption) 13 / 20

NVIDIA Graphic Processing Units Contain 12-30 simultaneous multiprocessors (SMs): 8 streaming processors (s) 16KB 16-way banked fast shared memory 8192/16384 32-bit registers 8KB constant memory cache 6KB-8KB texture cache 2 special function units instruction fetch and scheduling unit GeForce 8800GTX: 16 SMs @ 1.35GHz GTX 295: 2 30 SMs @ 1.24GHz 16KB Shared Mem 16KB Shared Mem 16KB Shared Mem Scheduler Scheduler Scheduler Register File Register File SFU SFU SFUConst $ SFU Text $ SFU SFU Const $ Text $ Const $ Text $ Register File 14 / 20

AES GPU Implementation Each thread processes multiple blocks hide memory latency using double buffering! Group multiple (e.g., 16) threads into stream groups share a common key, expanded in texture/shared memory Execute 16 (scheduling-unit) stream groups per thread block allows any caching to shared memory with no collisions! Launch multiple thread blocks per grid increase device utilization! All global memory access is coalesced maximize memory access throughput! 15 / 20

AES GPU Implementation (cont.) On-the-fly version all threads cache 1 st round key to shared memory Texture-memory version threads of same stream group share a common key Round-key read Shared/texture memory thread stream group Threads of same stream group share a common key. 16 / 20

AES GPU Implementation (cont.) PTX pre-expanded key (benchmark-only) version threads in a thread block share 1 key. Round-key (broadcast) read Constant memory Threads in same thread block share a common key. 17 / 20

AES GPU Implementation (cont.) T -tables placed in shared memory: T-table read Shared memory Avg: 35% collisions Still the fastest! Will further improve on Fermi Even with collisions greatest speedup! 18 / 20

GPU Results Comparison Cycles/byte 1.8 1.6 1.4 1.2 1 0.8 0.6 0.4 1.3 1.71 0.7 0.52 0.65 0.64 0.32 0.32 S. A. Manavski 2007 (8800GTX) J. Yang et al. 2007 (ATI HD 2900 XT) O. Harrison et al. 2008 (8800GTX) This article (8800GTX) PTX This article (8800GTX) This article (single-gpu GTX295) 0.2 0 Encryption Decryption Encryption: 30.9 and 16.8 Gbps on single GTX 295 GPU and 8800GTX. Decryption: 30.8 and 16.6 Gbps on single GTX 295 GPU and 8800GTX. 19 / 20

Conclusions Our new AES-128 software speed records for encryption and decryption x times faster compared to the previous records 8-bit AVR 1.24 encryption 1.10 decryption smaller code size 32-bit ARM 1.17 encryption Cell Broadband Engine (E) 1.10 encryption 1.23 decryption NVIDIA GPU 1.2 encryption 1.34 encryption (kernel-only) First decryption implementation All numbers subject to further improvements 20 / 20

Conclusions Our new AES-128 software speed records for encryption and decryption x times faster compared to the previous records 8-bit AVR 1.24 encryption 1.10 decryption smaller code size 32-bit ARM 1.17 encryption Cell Broadband Engine (E) 1.10 encryption 1.23 decryption NVIDIA GPU 1.2 encryption 1.34 encryption (kernel-only) First decryption implementation All numbers subject to further improvements To be continued... 20 / 20