Calhoun: The NPS Institutional Archive
Faculty and Researcher Publications, 2010
Fast Software AES Encryption
Osvik, Dag Arne
Proceedings FSE'10: Proceedings of the 17th International Conference on Fast Software Encryption, pp. 75-93
http://hdl.handle.net/10945/40195
Fast Software AES Encryption
Dag Arne Osvik (1), Joppe W. Bos (1), Deian Stefan (2), David Canright (3)
Presented by: Onur Özen (1)
(1) Laboratory for Cryptologic Algorithms, EPFL, CH-1015 Lausanne, Switzerland
(2) Dept. of Electrical Engineering, The Cooper Union, NY 10003, New York, USA
(3) Applied Math., Naval Postgraduate School, Monterey CA 93943, USA
Outline
- Introduction
  - Motivation
  - Related Work
  - Contributions
- The Advanced Encryption Standard
- Target Platforms
  - The 8-bit AVR Microcontroller
  - The 32-bit Advanced RISC Machine
  - The Cell Broadband Engine Architecture
  - The NVIDIA Graphics Processing Unit
- Conclusions
Motivation
Advanced Encryption Standard
- Rijndael was announced in 2001 as the AES.
- One of the most widely used cryptographic primitives:
  - IP Security, Secure Shell, TrueCrypt
  - RFID and low-power authentication methods
  - Key tokens, RF-based remote access control
- Many intensive efforts to speed up AES in both hardware and software.
Related work
- E. Käsper and P. Schwabe. Faster and Timing-Attack Resistant AES-GCM. CHES 2009.
- P. Bulens, et al. Implementation of the AES-128 on Virtex-5 FPGAs. AFRICACRYPT 2008.
- O. Harrison and J. Waldron. Practical Symmetric Key Cryptography on Modern Graphics Hardware. USENIX Sec. Symp. 2008.
- S. Rinne, et al. Performance Analysis of Contemporary Light-Weight Block Ciphers on 8-bit Microcontrollers. EED 2007.
- K. Shimizu, et al. Cell Broadband Engine Support for Privacy, Security, and Digital Rights Management Applications. 2005.
Contributions
- New software speed records for various architectures
- Target both ends of the performance spectrum:
  - low-end microcontrollers
  - high-end architectures processing many streams simultaneously
- Specific optimizations for each platform
Contributions
New software speed records for various architectures
- 8-bit AVR microcontrollers
  - compact, efficient single-stream AES version
- 32-bit Advanced RISC Machine
  - one of the most widely used processors for mobile applications
  - efficient single-stream AES version (T-table based)
- Synergistic processing elements of the Cell broadband engine
  - widely available in the PS3 video game console
  - single instruction multiple data (SIMD) architecture
  - process 16 streams in parallel (bytesliced)
- NVIDIA graphics processing unit
  - first AES decryption implementation
  - single instruction multiple threads (SIMT) architecture
  - process thousands of streams in parallel (T-table based)
The Advanced Encryption Standard
- Fixed block length version of the Rijndael block cipher
- Key-iterated block cipher with 128-bit state and block length
- Support for 128-, 192-, and 256-bit keys
- Strong security properties:
  - only related-key attacks on AES-192 and AES-256
  - no attacks on full AES-128
- Very efficient in hardware and software
[Figure credit: created by Jeff Moser]
Round Function Steps
The state is a 4 x 4 byte matrix; $a_{r,c}$ is the input byte in row $r$, column $c$, and $b_{r,c}$ is the output byte of the step.
1. SubBytes: $b_{r,c} = S(a_{r,c})$, the S-box applied to every byte.
2. ShiftRows: row $r$ is rotated left by $r$ positions, $b_{r,c} = a_{r,(c+r) \bmod 4}$; for example, row 1 becomes $(a_{11}, a_{12}, a_{13}, a_{10})$.
3. MixColumns: every column is multiplied by a fixed matrix over $\mathbb{F}_{2^8}$:
   $\begin{pmatrix} b_{0c} \\ b_{1c} \\ b_{2c} \\ b_{3c} \end{pmatrix} = \begin{pmatrix} 2 & 3 & 1 & 1 \\ 1 & 2 & 3 & 1 \\ 1 & 1 & 2 & 3 \\ 3 & 1 & 1 & 2 \end{pmatrix} \begin{pmatrix} a_{0c} \\ a_{1c} \\ a_{2c} \\ a_{3c} \end{pmatrix}$
4. AddRoundKey: $b_{r,c} = a_{r,c} \oplus k_{r,c}$.
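The four steps can be sketched in a few lines of Python. This is a plain reference sketch for illustration, not the optimized code the slides describe. The function names, the column-major byte layout (`state[4*c + r]`), and the choice to compute the S-box rather than hard-code it are all assumptions made here.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B  # reduce by the AES polynomial
        b >>= 1
    return result


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """S-box = multiplicative inverse in GF(2^8) followed by the affine map."""
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):  # x^254 is the inverse of x (and maps 0 to 0)
            inv = gf_mul(inv, x)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


# The state is a list of 16 bytes in column-major order:
# the byte in row r, column c is state[4*c + r].
def sub_bytes(state):
    return [SBOX[b] for b in state]


def shift_rows(state):
    # Row r is rotated left by r positions.
    return [state[4 * ((c + r) % 4) + r] for c in range(4) for r in range(4)]


def mix_columns(state):
    out = []
    for c in range(4):
        a = state[4 * c: 4 * c + 4]
        for r in range(4):
            # Row r of the circulant matrix (2, 3, 1, 1).
            out.append(gf_mul(2, a[r]) ^ gf_mul(3, a[(r + 1) % 4])
                       ^ a[(r + 2) % 4] ^ a[(r + 3) % 4])
    return out


def add_round_key(state, round_key):
    return [s ^ k for s, k in zip(state, round_key)]


def aes_round(state, round_key):
    """One full (non-final) AES round."""
    return add_round_key(mix_columns(shift_rows(sub_bytes(state))), round_key)
```

Calling `aes_round(list(state_bytes), list(round_key_bytes))` on a 16-byte state and round key returns the next state as a list of 16 integers.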
Efficient T-table implementation
- Combine SubBytes, ShiftRows, and MixColumns using the standard T-table approach.
- Update each column $j$ ($0 \le j \le 3$), where $a_{r,c}$ is the state byte in row $r$, column $c$:
  $[s_{0j}, s_{1j}, s_{2j}, s_{3j}]^T = T_0[a_{0c_0}] \oplus T_1[a_{1c_1}] \oplus T_2[a_{2c_2}] \oplus T_3[a_{3c_3}] \oplus k_j$,
  where $c_i = (i + j) \bmod 4$, each $T_i$ is 1 KB, and $k_j$ is the $j$th column of the round key.
- The $T_i$ are rotations of one table.
- Example ($j = 0$): the column $(s_{00}, s_{10}, s_{20}, s_{30})$ is computed from $T_0[a_{00}]$, $T_1[a_{11}]$, $T_2[a_{22}]$, $T_3[a_{33}]$ and the key column $(k_{00}, k_{10}, k_{20}, k_{30})$.
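A minimal Python sketch of the T-table round follows. It assumes 32-bit table words with the row-0 byte in the most significant position, and a state stored as 16 bytes in column-major order. The helper names (`ror32`, `ttable_round`) and the on-the-fly S-box construction are illustrative choices, not taken from the slides.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def rotl8(x, n):
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    sbox = []
    for x in range(256):
        inv = 1
        for _ in range(254):  # x^254 is the inverse of x (and maps 0 to 0)
            inv = gf_mul(inv, x)
        sbox.append(inv ^ rotl8(inv, 1) ^ rotl8(inv, 2)
                    ^ rotl8(inv, 3) ^ rotl8(inv, 4) ^ 0x63)
    return sbox


def ror32(w, n):
    """Rotate a 32-bit word right by n bits."""
    return ((w >> n) | (w << (32 - n))) & 0xFFFFFFFF


SBOX = build_sbox()

# T0[a] packs the column (2*S(a), S(a), S(a), 3*S(a)); 256 words of 4 bytes = 1 KB.
T0 = [(gf_mul(2, s) << 24) | (s << 16) | (s << 8) | gf_mul(3, s) for s in SBOX]
# The other three tables are byte rotations of T0.
T1 = [ror32(w, 8) for w in T0]
T2 = [ror32(w, 16) for w in T0]
T3 = [ror32(w, 24) for w in T0]


def ttable_round(state, round_key):
    """One AES round: four lookups and four XORs per output column.

    state and round_key are 16-byte sequences in column-major order,
    so the byte in row r, column c is state[4*c + r].
    """
    out = bytearray()
    for j in range(4):
        k_j = int.from_bytes(round_key[4 * j: 4 * j + 4], "big")
        word = (T0[state[4 * j]]
                ^ T1[state[4 * ((j + 1) % 4) + 1]]
                ^ T2[state[4 * ((j + 2) % 4) + 2]]
                ^ T3[state[4 * ((j + 3) % 4) + 3]]
                ^ k_j)
        out += word.to_bytes(4, "big")
    return bytes(out)
```

Storing only `T0` and rotating at lookup time trades 3 KB of table space for one rotation per lookup, which is attractive on cores where the rotation is free.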
Target Platforms
- Atmel AVRs
- ARM
- Cell B.E.
- NVIDIA GPUs
(Images: DigiKey, IBM Systems)
Microcontrollers
AVR
- Modified Harvard architecture
- 32 8-bit registers
- 3 16-bit pointer registers
- Registers are addressable
- Mostly single-cycle execution
- 1/2 to 384 KB flash memory
- 0 to 32 KB SRAM
- 0 to 4 KB EEPROM
ARM
- Harvard/von Neumann architecture
- 16 32-bit registers available
- Conditional execution
- Optional modification of flags
- Mostly single-cycle execution
- Load/store multiple registers
- Inline barrel shifter
Microcontrollers: Results

Reference                Key scheduling   Encryption (cycles)   Decryption (cycles)   Code size (bytes)
8-bit AVR microcontroller
Rinne et al.             precompute       3,766                 4,558                 3,410
Poettering (Fast)        precompute       2,474                 3,411                 3,098
Poettering (Furious)     precompute       2,739                 3,579                 1,570
Poettering (Fantastic)   precompute       4,059                 4,675                 1,482
Otte                     precompute       2,555                 6,764                 2,070
Otte                     precompute       2,555                 3,193                 2,580
New - Low RAM            precompute       2,153                 2,901                 1,912
New - Fast               precompute       1,993                 2,901                 1,912
32-bit ARM microprocessor
Atasu et al.             precompute       639                   638                   5,966
New                      precompute       544                   -                     3,292
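The speed-up over the best earlier entry can be checked directly from the cycle counts in the table. The short sketch below takes the fastest previous result per column as the baseline; that pairing is an assumption about how the comparison is made, and the function name is illustrative.

```python
def speedup(previous_cycles, new_cycles):
    """Ratio of the previous best cycle count to the new one."""
    return previous_cycles / new_cycles


avr_enc = speedup(2474, 1993)  # Poettering (Fast) vs. New - Fast
avr_dec = speedup(3193, 2901)  # Otte (2,580-byte version) vs. New
arm_enc = speedup(639, 544)    # Atasu et al. vs. New

print(f"AVR encryption: {avr_enc:.2f}x")
print(f"AVR decryption: {avr_dec:.2f}x")
print(f"ARM encryption: {arm_enc:.2f}x")
```

The three ratios round to 1.24, 1.10, and 1.17.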
Cell Broadband Engine Architecture
Use the Synergistic Processing Elements (SPEs):
- run at 3.2 GHz
- 128-bit wide SIMD architecture
- two instructions per clock cycle (dual pipeline)
- in-order processor
- rich instruction set: e.g., all distinct binary operations $f : \{0, 1\}^2 \to \{0, 1\}$ are present
Expensive
- QS22 Blade Servers (2 × 8 SPEs)
- PCI Express cards: cryptographic accelerator ({4, 8} SPEs)
Cheap
- PS3 video game console (6 SPEs)
Dual pipeline: Example
- Balancing pipeline loads may lead to more instructions but fewer cycles.
- 16-way SIMD xtime: $a \mapsto a \cdot x$ for $a \in \mathbb{F}_{2^8} = \mathbb{F}_2[x]/(x^8 + x^4 + x^3 + x + 1)$.
- Similar techniques for 16-way SIMD S-box lookups.
[Figure: two instruction graphs computing xtime(x).
 Variant 1: and, cmp-gt, shift-left, and, xor; 4 even, 1 odd instructions; latency: 8 cycles.
 Variant 2: rotate-left, and, gather, form-mask, rotate-right, and, xor; 3 even, 4 odd instructions; latency: 20 cycles.]
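The idea of applying xtime to 16 bytes at once can be emulated in Python by packing the bytes into one 128-bit integer. The sketch below is only an analogy to the mask/shift/and/xor pattern named above: it is not SPE code, it says nothing about pipeline scheduling, and the constant and function names are made up for illustration.

```python
LANES = 16
HIGH_BITS = int.from_bytes(b"\x80" * LANES, "big")  # top bit of every byte
LOW_BITS = int.from_bytes(b"\x7f" * LANES, "big")   # low 7 bits of every byte


def xtime_scalar(a):
    """Multiply one byte by x in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    a <<= 1
    return (a ^ 0x11B) if a & 0x100 else a


def xtime16(v):
    """Apply xtime to all 16 bytes packed in the 128-bit integer v."""
    overflow = v & HIGH_BITS            # lanes whose top bit is set
    shifted = (v & LOW_BITS) << 1       # per-byte shift, no carry across lanes
    reduction = (overflow >> 7) * 0x1B  # 0x1B in every overflowing lane
    return shifted ^ reduction


data = bytes(range(0, 256, 17))  # 0x00, 0x11, ..., 0xff
packed = int.from_bytes(data, "big")
result = xtime16(packed).to_bytes(LANES, "big")
```

Clearing the top bit before the shift keeps each byte inside its own lane, which is what a SIMD unit would otherwise guarantee in hardware.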
SPU Results: Comparison
[Bar chart: cycles/byte for encryption and decryption]

Implementation            Encryption (cycles/byte)   Decryption (cycles/byte)
K. Shimizu et al. 2005    12.4                       17.1
This article              11.3                       13.9

Throughput per PS3: 13.8 Gbps (encryption) and 10.8 Gbps (decryption)
NVIDIA Graphics Processing Units
Contain 12-30 streaming multiprocessors (SMs), each with:
- 8 streaming processors (SPs)
- 16 KB of 16-way banked fast shared memory
- 8192/16384 32-bit registers
- 8 KB constant memory cache
- 6 KB-8 KB texture cache
- 2 special function units (SFUs)
- instruction fetch and scheduling unit
Devices:
- GeForce 8800GTX: 16 SMs @ 1.35 GHz
- GTX 295: 2 × 30 SMs @ 1.24 GHz
[Figure: SM block diagram with scheduler, register file, SFUs, 16 KB shared memory, constant cache, and texture cache]
AES GPU Implementation
- Each thread processes multiple blocks: hides memory latency using double buffering.
- Group multiple (e.g., 16) threads into stream groups: they share a common key, expanded in texture/shared memory.
- Execute 16 (scheduling-unit) stream groups per thread block: allows any caching to shared memory with no collisions.
- Launch multiple thread blocks per grid: increases device utilization.
- All global memory access is coalesced: maximizes memory access throughput.
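The thread hierarchy described here can be pictured as a simple index computation. The sketch below assumes a linear thread numbering, 16 threads per stream group, and 16 stream groups per thread block; the function names and the flat key numbering are hypothetical and only illustrate how threads of one stream group end up sharing a key.

```python
THREADS_PER_GROUP = 16  # example group size from the slide
GROUPS_PER_BLOCK = 16   # stream groups per thread block
THREADS_PER_BLOCK = THREADS_PER_GROUP * GROUPS_PER_BLOCK


def locate(global_thread_id):
    """Return (thread block, stream group within block, lane within group)."""
    block, thread_in_block = divmod(global_thread_id, THREADS_PER_BLOCK)
    group, lane = divmod(thread_in_block, THREADS_PER_GROUP)
    return block, group, lane


def key_index(global_thread_id):
    """Index of the key used by this thread: one key per stream group."""
    block, group, _lane = locate(global_thread_id)
    return block * GROUPS_PER_BLOCK + group
```

With these sizes a thread block holds 256 threads, and any two threads whose ids fall in the same run of 16 use the same key.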
AES GPU Implementation (cont.)
- On-the-fly version: all threads cache the 1st round key to shared memory.
- Texture-memory version: threads of the same stream group share a common key.
[Figure: round-key reads from shared/texture memory; threads of the same stream group share a common key.]
AES GPU Implementation (cont.)
- PTX pre-expanded key (benchmark-only) version: threads in a thread block share one key.
[Figure: round-key (broadcast) reads from constant memory; threads in the same thread block share a common key.]
AES GPU Implementation (cont.)
- T-tables placed in shared memory.
- Average: 35% collisions. Still the fastest; will further improve on Fermi.
- Even with collisions, the greatest speedup.
[Figure: T-table reads from shared memory]
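To see why data-dependent T-table lookups collide in banked shared memory, a simplified bank model helps. The sketch below assumes 16 banks with consecutive 32-bit words in consecutive banks, and treats threads that read the same word as a broadcast rather than a conflict. It is a rough model only: real hardware rules are more detailed, and this code makes no attempt to reproduce the 35% figure.

```python
NUM_BANKS = 16  # 16-way banked shared memory


def serialization_degree(word_indices):
    """Worst-case number of distinct words requested from a single bank.

    word_indices holds one T-table index (a 32-bit word offset) per thread.
    A result of 1 means the lookups can proceed without a bank conflict
    under this simplified model.
    """
    words_per_bank = {}
    for index in word_indices:
        words_per_bank.setdefault(index % NUM_BANKS, set()).add(index)
    return max(len(words) for words in words_per_bank.values())


def has_conflict(word_indices):
    return serialization_degree(word_indices) > 1
```

For example, indices 0..15 hit 16 different banks, while indices 0, 16, 32, ... all fall into bank 0 and would be served one after another.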
GPU Results: Comparison
[Bar chart: cycles/byte (axis from 0 to 1.8) for encryption and decryption.
 Series: S. A. Manavski 2007 (8800GTX); J. Yang et al. 2007 (ATI HD 2900 XT); O. Harrison et al. 2008 (8800GTX); This article (8800GTX) PTX; This article (8800GTX); This article (single-GPU GTX 295).
 Bar labels in extraction order: 1.3, 1.71, 0.7, 0.52, 0.65, 0.64, 0.32, 0.32.]
- Encryption: 30.9 and 16.8 Gbps on a single GTX 295 GPU and the 8800GTX, respectively.
- Decryption: 30.8 and 16.6 Gbps on a single GTX 295 GPU and the 8800GTX, respectively.
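The cycles/byte and Gbps figures can be related by a one-line conversion. The sketch below assumes that cycles/byte is measured against the GPU clock for the whole device, using the clock rates given on the hardware slide (1.35 GHz for the 8800GTX, 1.24 GHz for one GTX 295 GPU). Under that assumption the result only approximately matches the reported throughput, since the plotted values are rounded.

```python
def throughput_gbps(cycles_per_byte, clock_ghz):
    """Throughput in Gbit/s: 8 bits per byte times bytes processed per nanosecond."""
    return 8.0 * clock_ghz / cycles_per_byte


gtx295 = throughput_gbps(0.32, 1.24)      # about 31 Gbps
gf8800_low = throughput_gbps(0.65, 1.35)  # about 16.6 Gbps
gf8800_high = throughput_gbps(0.64, 1.35) # about 16.9 Gbps
```

These estimates bracket the reported 16.6-16.8 Gbps on the 8800GTX and land close to the reported 30.8-30.9 Gbps on a single GTX 295 GPU.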
Conclusions
Our new AES-128 software speed records for encryption and decryption, as speed-up factors (× faster) compared to the previous records:
- 8-bit AVR: 1.24× encryption, 1.10× decryption, smaller code size
- 32-bit ARM: 1.17× encryption
- Cell Broadband Engine (SPE): 1.10× encryption, 1.23× decryption
- NVIDIA GPU: 1.2× encryption, 1.34× encryption (kernel-only), first decryption implementation
All numbers subject to further improvements.