Fast Implementations of AES on Various Platforms

Similar documents
Fast Software AES Encryption

High-Performance Modular Multiplication on the Cell Processor

Implementation of Full -Parallelism AES Encryption and Decryption

Introduction to GP-GPUs. Advanced Computer Architectures, Cristina Silvano, Politecnico di Milano 1

SeChat: An AES Encrypted Chat

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Efficient Software Implementation of AES on 32-bit Platforms

The Advanced Encryption Standard: Four Years On

IBM CELL CELL INTRODUCTION. Project made by: Origgi Alessandro matr Teruzzi Roberto matr IBM CELL. Politecnico di Milano Como Campus

IJESRT. [Padama, 2(5): May, 2013] ISSN:

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Speeding up GPU-based password cracking

Hardware Implementation of AES Encryption and Decryption System Based on FPGA

CSCE 465 Computer & Network Security

ANALYSIS OF RSA ALGORITHM USING GPU PROGRAMMING

Design and Verification of Area-Optimized AES Based on FPGA Using Verilog HDL

Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

Automata Designs for Data Encryption with AES using the Micron Automata Processor

The Advanced Encryption Standard (AES)

An Instruction Set Extension for Fast and Memory-Efficient AES Implementation

GPGPU Computing. Yong Cao

Combining Mifare Card and agsxmpp to Construct a Secure Instant Messaging Software

Enhancing Advanced Encryption Standard S-Box Generation Based on Round Key

Next Generation GPU Architecture Code-named Fermi

CS 758: Cryptography / Network Security

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

E6895 Advanced Big Data Analytics Lecture 14:! NVIDIA GPU Examples and GPU on ios devices

Optimizing Code for Accelerators: The Long Road to High Performance

Survey on Enhancing Cloud Data Security using EAP with Rijndael Encryption Algorithm

AES Power Attack Based on Induced Cache Miss and Countermeasure

Rijndael Encryption implementation on different platforms, with emphasis on performance

OpenCL Optimization. San Jose 10/2/2009 Peng Wang, NVIDIA

Secret File Sharing Techniques using AES algorithm. C. Navya Latha Garima Agarwal Anila Kumar GVN

Improving Performance of Secure Data Transmission in Communication Networks Using Physical Implementation of AES

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

An Energy Efficient ATM System Using AES Processor

Network Security. Omer Rana

FPGA IMPLEMENTATION OF AN AES PROCESSOR

NVIDIA GeForce GTX 580 GPU Datasheet

Efficient Software Implementation of AES on 32-Bit Platforms

Cryptography and Network Security

Design and Implementation of Asymmetric Cryptography Using AES Algorithm

AES1. Ultra-Compact Advanced Encryption Standard Core. General Description. Base Core Features. Symbol. Applications

GPU Computing with CUDA Lecture 2 - CUDA Memories. Christopher Cooper Boston University August, 2011 UTFSM, Valparaíso, Chile

A PPENDIX H RITERIA FOR AES E VALUATION C RITERIA FOR

Modern Block Cipher Standards (AES) Debdeep Mukhopadhyay

ultra fast SOM using CUDA

GPUs for Scientific Computing

Cryptography and Network Security: Summary

Implementation and Design of AES S-Box on FPGA

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

Introduction to Cloud Computing

Introduction to GPU Architecture

Speeding Up RSA Encryption Using GPU Parallelization

How To Encrypt With A 64 Bit Block Cipher

Area Optimized and Pipelined FPGA Implementation of AES Encryption and Decryption

LBM BASED FLOW SIMULATION USING GPU COMPUTING PROCESSOR

The Advanced Encryption Standard (AES)

Purpose Computer Hardware Configurations... 6 Single Computer Configuration... 6 Multiple Server Configurations Data Encryption...

GPU Parallel Computing Architecture and CUDA Programming Model

This Unit: Putting It All Together. CIS 501 Computer Architecture. Sources. What is Computer Architecture?

Keywords Web Service, security, DES, cryptography.

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

Separable & Secure Data Hiding & Image Encryption Using Hybrid Cryptography

synthesizer called C Compatible Architecture Prototyper(CCAP).

A PERFORMANCE EVALUATION OF COMMON ENCRYPTION TECHNIQUES WITH SECURE WATERMARK SYSTEM (SWS)

Cryptography and Network Security Chapter 3

Network Security. Chapter 3 Symmetric Cryptography. Symmetric Encryption. Modes of Encryption. Symmetric Block Ciphers - Modes of Encryption ECB (1)

Introduction to GPU Programming Languages

THÈSE DE DOCTORAT DE L UNIVERSITÉ PIERRE ET MARIE CURIE. Spécialité: Informatique, Télécommunication et Électronique. Karim Moussa Ali Abdellatif

Computer Architecture TDTS10

Programming models for heterogeneous computing. Manuel Ujaldón Nvidia CUDA Fellow and A/Prof. Computer Architecture Department University of Malaga

Computer Graphics Hardware An Overview

CUDA SKILLS. Yu-Hang Tang. June 23-26, 2015 CSRC, Beijing

Dynamic Profiling and Load-balancing of N-body Computational Kernels on Heterogeneous Architectures

Hardware-Aware Analysis and. Presentation Date: Sep 15 th 2009 Chrissie C. Cui

Thanks, But No Thanks

White Paper. Shay Gueron Intel Architecture Group, Israel Development Center Intel Corporation

GPU Programming Strategies and Trends in GPU Computing

Karsten Nohl, Breaking GSM phone privacy

Network Security. Security. Security Services. Crytographic algorithms. privacy authenticity Message integrity. Public key (RSA) Message digest (MD5)

Performance Evaluation of AES using Hardware and Software Codesign

Applications to Computational Financial and GPU Computing. May 16th. Dr. Daniel Egloff

Design and Implementation of Low-area and Low-power AES Encryption Hardware Core

Real-time Visual Tracker by Stream Processing

FPGA-based Multithreading for In-Memory Hash Joins

DataTrust Backup Software. Whitepaper Data Security. Version 6.8

AStudyofEncryptionAlgorithmsAESDESandRSAforSecurity

Computer Architecture. Secure communication and encryption.

SPINS: Security Protocols for Sensor Networks

Transcription:

Fast Implementations of AES on Various Platforms Joppe W. Bos 1 Dag Arne Osvik 1 Deian Stefan 2 1 EPFL IC IIF LACAL, Station 14, CH-1015 Lausanne, Switzerland {joppe.bos, dagarne.osvik}@epfl.ch 2 Dept. of Electrical Engineering, The Cooper Union, NY 10003, New York, USA stefan@cooper.edu 1 / 19

Outline Introduction Motivation Previous Work Contributions The Advanced Encryption Standard Target Platforms The 8-bit AVR Microcontroller The Cell Broadband Engine Architecture The NVIDIA Graphics Processing Unit Conclusions 2 / 19

Motivation Advanced Encryption Standard Rijndael announced in 2001 as the AES. One of the most widely used cryptographic primitives. IP Security, Secure Shell, Truecrypt RFID and low-power authentication methods Key tokens, RF-based Remote Access Control Many intensive efforts to speed up AES in both hard- and software. Related work E. Käsper and P. Schwabe. Faster and Timing-Attack Resistant AES-GCM. CHES 2009. P. Bulens, et al. Implementation of the AES-128 on Virtex-5 FPGAs AFRICACRYPT 2008. O. Harrison and J. Waldron. Practical Symmetric Key Cryptography on Modern Graphics Hardware. USENIX Sec. Symp. 2008. S. Rinne, et al. Performance Analysis of Contemporary Light-Weight Block Ciphers on 8-bit Microcontrollers. SPEED 2007. K. Shimizu, et al. Cell Broadband Engine Support for Privacy, Security, and Digital Rights Management Applications. 2005. 3 / 19

Contributions New software speed records for various architectures 8-bit AVR microcontrollers compact, efficient single stream AES version Synergistic processing elements of the Cell broadband engine widely available in the PS3 video game console single instruction multiple data (SIMD) architecture process 16 streams in parallel (bytesliced) NVIDIA graphics processing unit first AES decryption implementation single instruction multiple threads (SIMT) architecture process thousand of streams in parallel (T -table based) 4 / 19

The Advanced Encryption Standard Fixed block length version of the Rijndael block cipher Key-iterated block cipher with 128-bit state and block length Support for 128-, 192-, and 256-bit keys Strong security properties no attacks on full AES-128 Very efficient in hardware and software 5 / 19

AES-128 Block Cipher Algorithm consists of 5 steps: 1 Key expansion: 128-bit N r + 1 = 11 128-bit round keys 2 State initialization: initial state plaintext block 128-bit key 3 Round transformation: apply round function on state N r 1 times 4 Final round transformation: apply the modified round function Core of AES, the round function, consists of the following steps: SubBytes, ShiftRows, MixColumns, and AddRoundKey. 6 / 19

AES-128 Block Cipher Algorithm consists of 5 steps: 1 Key expansion: 128-bit N r + 1 = 11 128-bit round keys 2 State initialization: initial state plaintext block 128-bit key 3 Round transformation: apply round function on state N r 1 times 4 Final round transformation: apply the modified round function Core of AES, the round function, consists of the following steps: SubBytes, ShiftRows, MixColumns, and AddRoundKey. Decryption follows the same procedure - round function steps are the inverse and run in reverse order 6 / 19

Round Function Steps 1 SubBytes: S-box a 00 a 01 a 02 a 03 b 00 b 01 b 02 b 03 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 7 / 19

Round Function Steps 1 SubBytes: S-box a 00 a 01 a 02 a 03 b 00 b 01 b 02 b 03 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 2 ShiftRows: a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 a 10 a 11 a 12 a 13 a 11 a 12 a 13 a 10 a 20 a 21 a 22 a 23 a 22 a 23 a 20 a 21 a 30 a 31 a 32 a 33 a 33 a 30 a 31 a 32 7 / 19

Round Function Steps 1 SubBytes: 1 MixColumns: a 00 a 01 a 02 a 03 S-box b 00 b 01 b 02 b 03 2 3 1 1 1 2 3 1 1 1 2 3 3 1 1 2 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 00 a 01 a 02 a 03 b 00 b 01 b 02 b 03 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 2 ShiftRows: a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 a 10 a 11 a 12 a 13 a 11 a 12 a 13 a 10 a 20 a 21 a 22 a 23 a 22 a 23 a 20 a 21 a 30 a 31 a 32 a 33 a 33 a 30 a 31 a 32 7 / 19

Round Function Steps 1 SubBytes: 1 MixColumns: a 00 a 01 a 02 a 03 S-box b 00 b 01 b 02 b 03 2 3 1 1 1 2 3 1 1 1 2 3 3 1 1 2 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 00 a 01 a 02 a 03 b 00 b 01 b 02 b 03 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 10 a 11 a 12 a 13 b 10 b 11 b 12 b 13 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 a 20 a 21 a 22 a 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 b 30 b 31 b 32 b 33 2 ShiftRows: 2 AddRoundKey: a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 a 00 a 01 a 02 a 03 k 00 a 01 k 02 k 03 b 00 b 01 b 02 b 03 a 10 a 11 a 12 a 13 a 11 a 12 a 13 a 10 a 10 a 11 a 12 a 13 k 10 a 11 k 12 k 13 b 10 b 11 b 12 b 13 a 20 a 21 a 22 a 23 a 22 a 23 a 20 a 21 a 20 a 21 a 22 a 23 k 20 a 21 k 22 k 23 b 20 b 21 b 22 b 23 a 30 a 31 a 32 a 33 a 33 a 30 a 31 a 32 a 30 a 31 a 32 a 33 k 30 k 31 k 32 k 33 b 30 b 31 ab 32 b 33 7 / 19

Target Platforms NVIDIA GPUs Cell B.E Atmel AVRs IBM Systems 8 / 19

Advanced Virtual RISC Architecture Modified Harvard architecture 32 8-bit registers 16-bit pointer registers Registers are addressable Mostly single-cycle execution 1 2KB to 384KB flash memory 0 to 32KB SRAM 0 to 4KB EEPROM 9 / 19

Encryption Comparison AVR 3600 3400 3200 Code Size in Bytes 3000 2800 2600 2400 2200 2000 1800 1600 1400 1500 2000 2500 3000 3500 4000 4500 Cycles for Encryption Rinne et al. Poettering Otte New 10 / 19

Decryption Comparison AVR Code Size in Bytes 3600 3400 3200 3000 2800 2600 2400 2200 2000 1800 1600 1400 2500 3000 3500 4000 4500 5000 5500 6000 6500 7000 Cycles for Decryption Rinne et al. Poettering Otte New 11 / 19

Cell Broadband Engine Architecture Use the Synergistic Processing Elements runs at 3.2 GHz 128-bit wide SIMD-architecture two instructions per clock cycle (dual pipeline) in-order processor rich instruction set: i.e. all distinct binary operations f : {0, 1} 2 {0, 1} are present. Expensive QS22 Blade Servers (2 8 SPEs) Cheap PS3 video game console (6 SPEs) 12 / 19

SPU Results Comparison 20 17.1 Cycles/byte 15 10 5 12.4 11.7 14.4 K. Shimizu et al. 2005 This article 0 Encryption Decryption Throughput per PS3: 13.2 (encryption) and 10.8 Gbps (decryption) Work-in-progress, fill both pipelines Current version: 1752 odd and 2764 even instructions for encryption. 13 / 19

NVIDIA Graphic Processing Units Contain 12-30 simultaneous multiprocessors (SMs): 8 streaming processors (SPs) 16KB 16-way banked fast shared memory 8192/16384 32-bit registers 8KB constant memory cache 6KB-8KB texture cache 2 special function units instruction fetch and scheduling unit GeForce 8800GTX: 16 SMs @ 1.35GHz GTX 295: 2 30 SMs @ 1.24GHz 16KB Shared Mem 16KB Shared Mem 16KB Shared Mem Scheduler SP Scheduler Scheduler SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP SP Register File Register File SP SFU SP SFU SP SP SFUConst $ SFU Text $ SFU SFU Const $ Text $ Const $ Text $ Register File 14 / 19

AES GPU Implementation Combine SubBytes, ShiftRows, MixColumns using the standard T -table approach. Update each column (0 j 3): [s j0, s j1, s j2, s j3 ] T = T 0 [a c0 0] T 1 [a c1 1] T 2 [a c2 2] T 3 [a c3 3] k j, where each T i is 1KB and k j is the jth column of the round key. 15 / 19

AES GPU Implementation Combine SubBytes, ShiftRows, MixColumns using the standard T -table approach. Update each column (0 j 3): [s j0, s j1, s j2, s j3 ] T = T 0 [a c0 0] T 1 [a c1 1] T 2 [a c2 2] T 3 [a c3 3] k j, where each T i is 1KB and k j is the jth column of the round key. Example (j = 0): a 00 a 01 a 02 a 03 T 0 b 00 k 00 s 00 s 01 s 02 s 03 a 10 a 11 a 12 a 13 T 1 b 10 k 10 s 10 s 11 s 12 s 13 a 20 a 21 a 22 a 23 T 2 b 20 k 20 s 20 s 21 s 22 s 23 a 30 a 31 a 32 a 33 T 3 b 30 k 30 s 30 s 31 s 32 s 33 15 / 19

AES GPU Implementation Combine SubBytes, ShiftRows, MixColumns using the standard T -table approach. Update each column (0 j 3): [s j0, s j1, s j2, s j3 ] T = T 0 [a c0 0] T 1 [a c1 1] T 2 [a c2 2] T 3 [a c3 3] k j, where each T i is 1KB and k j is the jth column of the round key. Example (j = 0): a 00 a 01 a 02 a 03 T 0 b 00 k 00 s 00 s 01 s 02 s 03 a 10 a 11 a 12 a 13 T 1 b 10 k 10 s 10 s 11 s 12 s 13 a 20 a 21 a 22 a 23 T 2 b 20 k 20 s 20 s 21 s 22 s 23 a 30 a 31 a 32 a 33 T 3 b 30 k 30 s 30 s 31 s 32 s 33 Optimization approach: launch thread blocks containing multiple independent groups of 16 (1/2-warp) streams. 15 / 19

AES GPU Implementation (cont.) Key expansion: 1 On-the-fly: allows thousands of independent streams speed dependent on T -access speed multi-block speed improvement: cache few round keys / stream in shared memory; 16-streams/group no bank conflicts! 16 / 19

AES GPU Implementation (cont.) Key expansion: 1 On-the-fly: allows thousands of independent streams speed dependent on T -access speed multi-block speed improvement: cache few round keys / stream in shared memory; 16-streams/group no bank conflicts! 2 Texture memory: keys alive between kernel launches: multi-block encryption is faster than on-the-fly! thread count limited by texture cache size 16 / 19

AES GPU Implementation (cont.) Key expansion: 1 On-the-fly: allows thousands of independent streams speed dependent on T -access speed multi-block speed improvement: cache few round keys / stream in shared memory; 16-streams/group no bank conflicts! 2 Texture memory: keys alive between kernel launches: multi-block encryption is faster than on-the-fly! thread count limited by texture cache size 3 Shared memory: 16 round key column reads with no bank conflicts single kernel multi-block encryption is the fastest! thread count limited by shared memory size 16 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. All shared memory used by 1 thread block very low device utilization. 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. All shared memory used by 1 thread block very low device utilization. Lazy approach Place T -tables in order. 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. All shared memory used by 1 thread block very low device utilization. Lazy approach Place T -tables in order. On average: 6/16 collisions, so remaining reads are parallel. Allows for multiple blocks/sm higher device occupancy. 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. All shared memory used by 1 thread block very low device utilization. Lazy approach Place T -tables in order. On average: 6/16 collisions, so remaining reads are parallel. Allows for multiple blocks/sm higher device occupancy. combine with on-the-fly key scheduling or key expansion in texture memory. 17 / 19

AES GPU Implementation (cont.) Placement of T -tables: 1 Constant memory: simple and very quick approach unless encrypting same block with same key: almost all T -accesses are serialized combine with any key scheduling algorithm 2 Shared memory: Collision-free approach T i, i > 0 are rotations of T 0: place 1KB T 0 in each bank. All shared memory used by 1 thread block very low device utilization. Lazy approach Place T -tables in order. On average: 6/16 collisions, so remaining reads are parallel. Allows for multiple blocks/sm higher device occupancy. combine with on-the-fly key scheduling or key expansion in texture memory. 3 Texture memory: ongoing work, but estimates are lower than lazy shared memory approach. 17 / 19

GPU Results Comparison Cycles/byte 2 1.5 1 0.5 1.3 1.71 1.56 0.74 0.76 0.17 0.19 S. A. Manavski 2007 (8800GTX) J. Yang et al. 2007 (ATI HD 2900 XT) O. Harrison et al. 2008 (8800GTX) This article (8800GTX) This article (GTX295) 0 Encryption Decryption Encryption: 59.6 and 14.6 Gbps on the GTX 295 and 8800GTX, respectively. Decryption: 52.4 and 14.3 Gbps on the GTX 295 and 8800GTX, respectively. 18 / 19

Conclusions AES-128 software speed records for encryption and decryption 8-bit AVR 1.24 encryption 1.10 decryption smaller code size Cell Broadband Engine (SPE) 1.06 encryption 1.18 decryption NVIDIA GPU 1.75 encryption First decryption implementation All numbers subject to further improvements 19 / 19

Conclusions AES-128 software speed records for encryption and decryption 8-bit AVR 1.24 encryption 1.10 decryption smaller code size Cell Broadband Engine (SPE) 1.06 encryption 1.18 decryption NVIDIA GPU 1.75 encryption First decryption implementation All numbers subject to further improvements To be continued... 19 / 19