New AES software speed records

Similar documents

New AES software speed records

ChaCha, a variant of Salsa20

Salsa20/8 and Salsa20/12

Software implementation of Post-Quantum Cryptography

Password-based encryption in ZIP files

Fast Software AES Encryption

ELECTENG702 Advanced Embedded Systems. Improving AES128 software for Altera Nios II processor using custom instructions

CSCE 465 Computer & Network Security

Fast Implementations of AES on Various Platforms

SeChat: An AES Encrypted Chat

High-speed high-security cryptography on ARMs

White Paper. Shay Gueron Intel Architecture Group, Israel Development Center Intel Corporation

A single register, called the accumulator, stores the. operand before the operation, and stores the result. Add y # add y from memory to the acc

Caml Virtual Machine File & data formats Document version: 1.4

Efficient Software Implementation of AES on 32-bit Platforms

M.S. Project Proposal. SAT Based Attacks on SipHash

GPU File System Encryption Kartik Kulkarni and Eugene Linkov

Hash Function JH and the NIST SHA3 Hash Competition

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Cryptography and Network Security

Breakthrough AES Performance with. Intel AES New Instructions

Designing Hash functions. Reviewing... Message Authentication Codes. and message authentication codes. We have seen how to authenticate messages:

MAC. SKE in Practice. Lecture 5

How To Attack A Block Cipher With A Key Key (Dk) And A Key (K) On A 2Dns) On An Ipa (Ipa) On The Ipa 2Ds (Ipb) On Pcode)

Intel Advanced Encryption Standard (AES) New Instructions Set

AES-GCM software performance on the current high end CPUs as a performance baseline for CAESAR competition

Intel s New AES Instructions for Enhanced Performance and Security

The Misuse of RC4 in Microsoft Word and Excel

Fast Arithmetic Coding (FastAC) Implementations

Sosemanuk, a fast software-oriented stream cipher

Table of Contents. Bibliografische Informationen digitalisiert durch

TCG Algorithm Registry. Family 2.0" Level 00 Revision April 17, Published. Contact:

The mathematics of RAID-6

Implementation of Full -Parallelism AES Encryption and Decryption

Memory ICS 233. Computer Architecture and Assembly Language Prof. Muhamed Mudawar

A Comparative Study Of Two Symmetric Encryption Algorithms Across Different Platforms.

MarshallSoft AES. (Advanced Encryption Standard) Reference Manual

RPDO 1 TPDO 1 TPDO 5 TPDO 6 TPDO 7 TPDO 8

Haswell Cryptographic Performance

QUIC. Quick UDP Internet Connections. Multiplexed Stream Transport over UDP. IETF-88 TSV Area Presentation

CLOUD COMPUTING SECURITY ARCHITECTURE - IMPLEMENTING DES ALGORITHM IN CLOUD FOR DATA SECURITY

Communicating with devices

Error oracle attacks and CBC encryption. Chris Mitchell ISG, RHUL

Processing Multiple Buffers in Parallel to Increase Performance on Intel Architecture Processors

High-Performance Modular Multiplication on the Cell Processor

CPU Organization and Assembly Language

ELEG3924 Microprocessor Ch.7 Programming In C

Using the RDTSC Instruction for Performance Monitoring

Fast Software Encryption: Designing Encryption Algorithms for Optimal Software Speed on the Intel Pentium Processor

Cryptography and Network Security. Prof. D. Mukhopadhyay. Department of Computer Science and Engineering. Indian Institute of Technology, Kharagpur

Network Security - ISA 656 Introduction to Cryptography

How To Write A Hexadecimal Program

CSci 530 Midterm Exam. Fall 2012

EE361: Digital Computer Organization Course Syllabus

Performance Evaluation of NAS Parallel Benchmarks on Intel Xeon Phi

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

AES-GCM for Efficient Authenticated Encryption Ending the Reign of HMAC-SHA-1?

GPON Section Reporter: 王依盈

Soft-Starter SSW-06 V1.6X

The Stream Cipher HC-128

Efficient Low-Level Software Development for the i.mx Platform

Definition of RAID Levels

CPU performance monitoring using the Time-Stamp Counter register

Network Security. Security. Security Services. Crytographic algorithms. privacy authenticity Message integrity. Public key (RSA) Message digest (MD5)

Triathlon of Lightweight Block Ciphers for the Internet of Things

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉志尉 National Chiao Tung University

Instruction Set Architecture (ISA)

ETHERNET ENCRYPTION MODES TECHNICAL-PAPER

Technical Specification Digital Video Broadcasting (DVB); Content Scrambling Algorithms for DVB-IPTV Services using MPEG2 Transport Streams

Wireless ATA: A New Data Transport Protocol for Wireless Storage

Secret File Sharing Techniques using AES algorithm. C. Navya Latha Garima Agarwal Anila Kumar GVN

ACR122 NFC Contactless Smart Card Reader

LSN 2 Computer Processors

Computer Architecture. Secure communication and encryption.

Message authentication

Pipelining Review and Its Limitations

Nemo 96HD/HD+ MODBUS

1 Step 1: Select... Files to Encrypt 2 Step 2: Confirm... Name of Archive 3 Step 3: Define... Pass Phrase

Cryptography and Network Security Chapter 3

Performance Characterization of SPEC CPU2006 Integer Benchmarks on x Architecture

Chapter 8. Network Security

Secondary Storage. Any modern computer system will incorporate (at least) two levels of storage: magnetic disk/optical devices/tape systems

Automata Designs for Data Encryption with AES using the Micron Automata Processor

How To Understand And Understand The History Of Cryptography

Application Note. Introduction AN2471/D 3/2003. PC Master Software Communication Protocol Specification

How To Encrypt With A 64 Bit Block Cipher

Advanced Encryption Standard by Example. 1.0 Preface. 2.0 Terminology. Written By: Adam Berent V.1.7

Modes of Operation of Block Ciphers

NXP & Security Innovation Encryption for ARM MCUs

Comparative Performance Review of SHA-3 Candidates

Visualisation of potential weakness of existing cipher engine implementations in commercial on-the-fly disk encryption software

Advanced Encryption Standard by Example. 1.0 Preface. 2.0 Terminology. Written By: Adam Berent V.1.5

Transcription:

New AES software speed records Daniel J. Bernstein and Peter Schwabe 16.12.2008 Indocrypt 2008

The estream project ECRYPT Stream Cipher Project Running from 2004 to 2008 to identify promising new stream ciphers Established an open benchmarking suite for stream ciphers For comparison contains 128-bit AES in counter mode Reference implementation by Gladman Other authors added faster implementations This talk describes our new implementation New AES software speed records 2

One round of AES in C Input and Output Input: 16-byte state given in 4 32-bit integers y0, y1, y2, y3 Output: 16-byte state given in 4 32-bit integers z0, z1, z2, z3 Tables and keys uint32 rk[44], T0[256], T1[256], T2[256], T3[256]; The rst round of AES z0 = T0[ y0 >> 24 ] ^ T1[(y1 >> 16) & 0xff] \ ^ T2[(y2 >> 8) & 0xff] ^ T3[ y3 & 0xff] ^ rk[4]; z1 = T0[ y1 >> 24 ] ^ T1[(y2 >> 16) & 0xff] \ ^ T2[(y3 >> 8) & 0xff] ^ T3[ y0 & 0xff] ^ rk[5]; z2 = T0[ y2 >> 24 ] ^ T1[(y3 >> 16) & 0xff] \ ^ T2[(y0 >> 8) & 0xff] ^ T3[ y1 & 0xff] ^ rk[6]; z3 = T0[ y3 >> 24 ] ^ T1[(y0 >> 16) & 0xff] \ ^ T2[(y1 >> 8) & 0xff] ^ T3[ y2 & 0xff] ^ rk[7]; New AES software speed records 3

What a machine is really doing Tables and keys unsigned char rk[176], T0[1024], T1[1024], T2[1024], T3[1024]; The rst round of AES z0 = *(uint32 *)(rk + 16); z1 = *(uint32 *)(rk + 20); z2 = *(uint32 *)(rk + 24); z3 = *(uint32 *)(rk + 28); z0 ^= *(uint32 *) (T0 + ((y0 >> 22) & 0x3fc)) ^ *(uint32 *) (T1 + ((y1 >> 14) & 0x3fc)) \ ^ *(uint32 *) (T2 + ((y2 >> 6) & 0x3fc)) ^ *(uint32 *) (T3 + ((y3 << 2) & 0x3fc)); z1 ^= *(uint32 *) (T0 + ((y1 >> 22) & 0x3fc)) ^ *(uint32 *) (T1 + ((y2 >> 14) & 0x3fc)) \ ^ *(uint32 *) (T2 + ((y3 >> 6) & 0x3fc)) ^ *(uint32 *) (T3 + ((y0 << 2) & 0x3fc)); z2 ^= *(uint32 *) (T0 + ((y2 >> 22) & 0x3fc)) ^ *(uint32 *) (T1 + ((y3 >> 14) & 0x3fc)) \ ^ *(uint32 *) (T2 + ((y0 >> 6) & 0x3fc)) ^ *(uint32 *) (T3 + ((y1 << 2) & 0x3fc)); z3 ^= *(uint32 *) (T0 + ((y3 >> 22) & 0x3fc)) ^ *(uint32 *) (T1 + ((y0 >> 14) & 0x3fc)) \ ^ *(uint32 *) (T2 + ((y1 >> 6) & 0x3fc)) ^ *(uint32 *) (T3 + ((y2 << 2) & 0x3fc)); New AES software speed records 4

Interleaving tables Tables and keys unsigned char rk[176], tables[4096]; #define T0 (&tables[0]) #define T1 (&tables[4]) #define T2 (&tables[2048]) #define T3 (&tables[2052]) The rst round of AES z0 = *(uint32 *)(rk + 16); z1 = *(uint32 *)(rk + 20); z2 = *(uint32 *)(rk + 24); z3 = *(uint32 *)(rk + 28); z0 ^= *(uint32 *) (T0 + ((y0 >> 21) & 0x7f8)) ^ *(uint32 *) (T1 + ((y1 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y2 >> 5) & 0x7f8)) ^ *(uint32 *) (T3 + ((y3 << 3) & 0x7f8)); z1 ^= *(uint32 *) (T0 + ((y1 >> 21) & 0x7f8)) ^ *(uint32 *) (T1 + ((y2 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y3 >> 5) & 0x7f8)) ^ *(uint32 *) (T3 + ((y0 << 3) & 0x7f8)); z2 ^= *(uint32 *) (T0 + ((y2 >> 21) & 0x7f8)) ^ *(uint32 *) (T1 + ((y3 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y0 >> 5) & 0x7f8)) ^ *(uint32 *) (T3 + ((y1 << 3) & 0x7f8)); z3 ^= *(uint32 *) (T0 + ((y3 >> 21) & 0x7f8)) ^ *(uint32 *) (T1 + ((y0 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y1 >> 5) & 0x7f8)) ^ *(uint32 *) (T3 + ((y2 << 3) & 0x7f8)); New AES software speed records 5

z0 = *(uint32 *)(rk + 16); z1 = *(uint32 *)(rk + 20); z2 = *(uint32 *)(rk + 24); z3 = *(uint32 *)(rk + 28); z0 ^= *(uint32 *) (T0 + ((y0 >> 21) & 0x7f8)) \ ^ *(uint32 *) (T1 + ((y1 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y2 >> 5) & 0x7f8)) \ ^ *(uint32 *) (T3 + ((y3 << 3) & 0x7f8)); z1 ^= *(uint32 *) (T0 + ((y1 >> 21) & 0x7f8)) \ ^ *(uint32 *) (T1 + ((y2 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y3 >> 5) & 0x7f8)) \ ^ *(uint32 *) (T3 + ((y0 << 3) & 0x7f8)); z2 ^= *(uint32 *) (T0 + ((y2 >> 21) & 0x7f8)) \ ^ *(uint32 *) (T1 + ((y3 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y0 >> 5) & 0x7f8)) \ ^ *(uint32 *) (T3 + ((y1 << 3) & 0x7f8)); z3 ^= *(uint32 *) (T0 + ((y3 >> 21) & 0x7f8)) \ ^ *(uint32 *) (T1 + ((y0 >> 13) & 0x7f8)) \ ^ *(uint32 *) (T2 + ((y1 >> 5) & 0x7f8)) \ ^ *(uint32 *) (T3 + ((y2 << 3) & 0x7f8)); New AES software speed records 6

Baseline for all architectures Each round has 20 loads, 16 shifts, 16 masks and 16 xors New AES software speed records 7

Baseline for all architectures Each round has 20 loads, 16 shifts, 16 masks and 16 xors Last round is slightly dierent: Needs 16 more mask instructions Four load instructions to load input, four xors with key stream, four stores for output... some more overhead Results in 720 instructions needed to encrypt a block of 16 bytes Specically: 208 loads, 4 stores, 508 integer instructions New AES software speed records 7

Baseline for all architectures Each round has 20 loads, 16 shifts, 16 masks and 16 xors Last round is slightly dierent: Needs 16 more mask instructions Four load instructions to load input, four xors with key stream, four stores for output... some more overhead Results in 720 instructions needed to encrypt a block of 16 bytes Specically: 208 loads, 4 stores, 508 integer instructions Two strategies to make these instructions execute fast: Reduce the number of instructions Make the machine execute the instructions as fast as possible New AES software speed records 7

UltraSPARC (Not the most common architecture but an easy-to-understand example) New AES software speed records 8

UltraSPARC (Not the most common architecture but an easy-to-understand example) 64-bit architecture In-order execution Up to 4 instructions per cycle At most 2 integer instructions per cycle At most 1 load/store instruction per cycle Instructions have latencies 24 integer registers available New AES software speed records 8

Fast AES on the UltraSPARC Reminder: 208 loads, 4 stores, 508 integer instructions Only one load or store per cycle ( at least 212 cycles) Only 2 integer instructions per cycle ( at least 254 cycles) Make sure to have 2 integer instructions each cycle Make sure to not have extra cycles for load/store instructions Result: 254 cycles/block 15.98 cycles/byte in the estream benchmarking framework for encryption of 4096 bytes New AES software speed records 9

Padded registers (saving 80 instructions) Idea: The UltraSPARC is a 64-bit architecture, pad 32-bit values with zeros, i.e. 0xc66363a5 becomes 0x0c60063006300a50 Do that consistently for values in registers, the tables and the round keys, then: Without padded registers p00 = (uint32) y0 >> 21 p01 = (uint32) y0 >> 13 p02 = (uint32) y0 >> 5 p03 = (uint32) y0 << 3 p00 &= 0x7f8 p01 &= 0x7f8 p02 &= 0x7f8 p03 &= 0x7f8 With padded registers p00 = (uint64) y0 >> 48 p01 = (uint64) y0 >> 32 p02 = (uint64) y0 >> 16 p01 &= 0xff0 p02 &= 0xff0 p03 = y0 & 0xff0 New AES software speed records 10

Padded registers (saving 120 instructions) Idea: The UltraSPARC is a 64-bit architecture, pad 32-bit values with zeros, i.e. 0xc66363a5 becomes 0x0c60063006300a50 Do that consistently for values in registers, the tables and the round keys, then: Without padded registers p00 = (uint32) y0 >> 21 p01 = (uint32) y0 >> 13 p02 = (uint32) y0 >> 5 p03 = (uint32) y0 << 3 p00 &= 0x7f8 p01 &= 0x7f8 p02 &= 0x7f8 p03 &= 0x7f8 Padded registers and 32-bit shift p00 = (uint64) y0 >> 48 p01 = (uint64) y0 >> 32 p02 = (uint32) y0 >> 16 p01 &= 0xff0 p03 = y0 & 0xff0 New AES software speed records 10

Counter-mode caching (saving about 100 instructions) Idea: Make use of the fact that AES is used in CTR mode A 16-byte counter is increased by 1 for every block 15 bytes of the counter remain constant for 256 blocks In the rst round most computations only depend on these 15 bytes 12 bytes of the output of the rst round remain constant for 256 blocks In the second round most computations only depend on these 12 bytes New AES software speed records 11

Counter-mode caching (saving about 100 instructions) Idea: Make use of the fact that AES is used in CTR mode A 16-byte counter is increased by 1 for every block 15 bytes of the counter remain constant for 256 blocks In the rst round most computations only depend on these 15 bytes 12 bytes of the output of the rst round remain constant for 256 blocks In the second round most computations only depend on these 12 bytes Do all these computations only once for 256 subsequent blocks and cache the result. New AES software speed records 11

Results Encryption speed given in cycles/byte for encrypting a block of 4096 bytes Veriable estream results Unveriable claims Architecture This paper Best previous Unpublished code UltraSPARC III 12.06 20.75 (Bernstein) 16.875 (Lipmaa) Power PC G4 14.57 16.26 (Wu) 25.06 (Ahrens) Pentium 4 f12 14.15 16.97 (Bernstein) 15 (Osvik) Core 2 6fb 10.57 12.27 (Wu) 14.5 (Matsui & Nakajima) Athlon 64 10.43 13.32 (Wu) 10.62 (Matsui) New AES software speed records 12

Results Encryption speed given in cycles/byte for encrypting a block of 4096 bytes Veriable estream results Unveriable claims Architecture This paper Best previous Unpublished code UltraSPARC III 12.06 20.75 (Bernstein) 16.875 (Lipmaa) Power PC G4 14.57 16.26 (Wu) 25.06 (Ahrens) Pentium 4 f12 14.15 16.97 (Bernstein) 15 (Osvik) Core 2 6fb 10.57 12.27 (Wu) 14.5 (Matsui & Nakajima) Athlon 64 10.43 13.32 (Wu) 10.62 (Matsui) Some post-paper results Measurements within the ebacs project on a Core 2 10676: 9.31 cycles/byte Young reports 9.12 cycles/byte on a Core 2 6fb after integrating optimizations described in this paper New AES software speed records 12

Further information Paper and Software http://cr.yp.to/aes-speed.html estream and ebacs http://www.ecrypt.eu.org/stream/ http://bench.cr.yp.to/ qhasm http://cr.yp.to/qhasm.html http://cryptojedi.org/programming/qhasm-doc/ New AES software speed records 13