Haswell Cryptographic Performance

White Paper
Sean Gulley, Vinodh Gopal
IA Architects, Intel Corporation
July 2013
329282-001

Executive Summary

The new Haswell microarchitecture featured in the 4th generation Intel Core Processor family implements several new instructions designed to improve cryptographic processing performance. Additionally, Haswell improves the performance characteristics of critical instructions, such as the Advanced Encryption Standard New Instructions (AES-NI), over the previous Intel microarchitectures code named Sandy Bridge and Ivy Bridge.

This paper details the new and improved features Haswell offers for implementations of the Secure Hash Algorithm (SHA)[1], various modes of the Advanced Encryption Standard (AES)[2], and Rivest-Shamir-Adleman (RSA). It introduces many of the enhancements made to the Haswell microarchitecture that are intended to improve the performance of the most heavily used cryptographic algorithms. Per-cycle performance increases of 15% to 100% are demonstrated on a single thread of an Intel Core i7 processor 4770 (formerly Haswell) core over an Intel Core i7 processor 2600 (formerly Sandy Bridge) core.

The Intel Embedded Design Center provides qualified developers with web-based access to technical resources. Access Intel Confidential design materials, step-by-step guidance, application reference solutions, training, Intel's tool loaner program, and connect with an e-help desk and the embedded community. Design Fast. Design Smart. Get started today. http://www.intel.com/p/en_us/embedded.

Contents

Overview
Haswell Improvements
AES
SHA
Public Key
Performance Gains
Methodology
Conclusion
Acknowledgements
References

Tables

Table 1. SHA Single Buffer Performance (cycles/byte)
Table 2. SHA Multi-Buffer Performance (cycles/byte)
Table 3. AES Performance (cycles/byte)
Table 4. Modular Exponentiation Performance (cycles)

Overview

Cryptographic algorithms are the backbone of any secure platform or communication channel. Use cases such as storage confidentiality, executable integrity, and secure web connections all require strong cryptographic algorithms to provide good security. One of the main obstacles to security is the adverse impact on a user's experience or system performance due to the high computational cost of cryptographic algorithms. Since the introduction of AES-NI in Intel processors in 2010, Intel has continuously focused on improving encryption/decryption, secure hashing, and public/private key operations in an effort to proliferate better security solutions for consumers and enterprises.

Several features in the Haswell microarchitecture, first introduced in June 2013 on the 4th generation Intel Core Processor family, have been specifically designed to lower the cost of secure implementations. The performance of two modes of AES operation, Cipher Block Chaining (CBC) and Galois Counter Mode (GCM), is examined to demonstrate how Haswell's improved AES-NI performance is designed to accelerate secure networking (e.g., Secure Socket Layer (SSL)/Transport Layer Security (TLS) and IPSec) and general encryption needs. Micro-architectural improvements in the core and new instructions are detailed to provide insight into the single and multi-buffer performance gains of SHA-1, SHA-256, and SHA-512 operations. Significant improvements in modular exponentiation, useful in public key algorithms such as RSA, are also examined.

Haswell Improvements

The enhancements to Haswell for cryptographic performance come in three flavors: new instructions, micro-architectural improvements, and novel software implementations. There are several new VEX-encoded General Purpose Register (GPR) instructions in the Bit Manipulation Instructions (BMI) feature group that aid SHA (RORX) and RSA (MULX) performance. Additionally, the new Advanced Vector Extensions 2 (AVX2) instructions, which promote vector integer operations from 128 bits to 256 bits, increase SHA single buffer and multi-buffer performance. The increase from 3 Arithmetic Logic Units (ALUs) to 4 on Haswell also benefits RSA and SHA single buffer. Micro-architectural improvements to the latency and throughput of the AESENC/AESDEC and PCLMULQDQ instructions drive higher AES performance.
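To make the role of rotates in SHA concrete, the following is a minimal Python model of the two "big sigma" functions that FIPS 180 defines for SHA-256. It is an illustration only, not the authors' optimized code: the function names are hypothetical, and Python cannot express RORX itself. What it does show is that each function rotates the same 32-bit input by three different amounts, so an implementation needs the original value to survive each rotate.

```python
MASK32 = 0xFFFFFFFF  # SHA-256 operates on 32-bit words


def rotr32(x, n):
    """Rotate the 32-bit value x right by n bits (0 < n < 32)."""
    return ((x >> n) | (x << (32 - n))) & MASK32


def big_sigma0(a):
    """SHA-256 Sigma0: three rotates of the same input, XORed together."""
    return rotr32(a, 2) ^ rotr32(a, 13) ^ rotr32(a, 22)


def big_sigma1(e):
    """SHA-256 Sigma1: three rotates of the same input, XORed together."""
    return rotr32(e, 6) ^ rotr32(e, 11) ^ rotr32(e, 25)


if __name__ == "__main__":
    # The input word is read three times per function and never modified.
    print(hex(big_sigma0(0x6A09E667)))
    print(hex(big_sigma1(0x510E527F)))
```

With a rotate that overwrites its source, the input would have to be copied before each of the three rotates; a rotate that writes to a separate destination avoids those copies.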

AES

Haswell has reduced the latency of the AES instructions from 8 cycles on the 2nd generation Intel Core Processor family (code name Sandy Bridge) to 7 cycles. Additionally, the throughput has been improved by reducing the number of micro-operations. The reduction in latency helps serial modes of AES operation, such as CBC Encrypt. The increase in throughput aids parallel modes of operation, such as CBC Decrypt or multi-buffer.

The microarchitecture of the PCLMULQDQ instruction has been significantly modified to halve the latency (down to 7 cycles) and quadruple the throughput (to one instruction every 2 cycles). These major changes, in concert with the AES instruction improvements, have dramatically increased performance for implementations of algorithms such as AES-GCM [3].

SHA

New features in Haswell have led to improved SHA performance both for single data buffer hashing and for hashing multiple independent data buffers simultaneously. For single buffer hashing, the new RORX instruction results in faster rotates, a key function in SHA processing, by allowing more operations to occur in parallel due to the non-destructive property of the instruction [4]. The wider 256-bit YMM registers used by AVX2 instructions bring two benefits to SHA processing: faster message schedule calculations for single buffer, and double the number of data buffers that can be processed using multi-buffer.

Public Key

Large integer arithmetic is the foundation for many cryptographic algorithms, such as RSA, Elliptic Curve Cryptography (ECC), and Diffie-Hellman (DH) key exchange. These algorithms generally require the multiplication of very large integers (e.g., 1024 bits). Given that the multipliers on Intel processors are only 64 bits wide, the full operation requires many multiplications and additions. The new MULX instruction was specifically designed with this in mind and provides two key advantages over the traditional MUL instruction [5]. First, MULX provides greater register flexibility by allowing the two destination registers to be distinct from the sources. Second, the arithmetic flags are untouched, thus allowing MULX to be mixed with add-carry instructions without corrupting the carry chain.
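The amount of work hidden in "many multiplications and additions" can be sketched with a schoolbook multiplication over 64-bit limbs. This is a Python model under stated assumptions, not the implementation from [5]: the helper names (`mul64`, `to_limbs`, `from_limbs`, `mul_limbs`) are hypothetical, and Python's arbitrary-precision integers stand in for registers. `mul64` returns the high and low halves separately, in the way a widening 64x64 multiply delivers two 64-bit results, and the inner loop carries a running carry alongside each product.

```python
MASK64 = (1 << 64) - 1


def mul64(a, b):
    """Model of a widening 64x64-bit multiply: return (high, low) 64-bit halves."""
    product = a * b
    return product >> 64, product & MASK64


def to_limbs(n, count):
    """Split a non-negative integer into `count` little-endian 64-bit limbs."""
    return [(n >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into one integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def mul_limbs(x, y):
    """Schoolbook product of two limb lists; uses len(x) * len(y) multiplies."""
    out = [0] * (len(x) + len(y))
    for i, xi in enumerate(x):
        carry = 0
        for j, yj in enumerate(y):
            hi, lo = mul64(xi, yj)
            total = out[i + j] + lo + carry   # addition chain for this column
            out[i + j] = total & MASK64
            carry = hi + (total >> 64)        # always fits in 64 bits
        out[i + len(y)] = carry
    return out


if __name__ == "__main__":
    a = 2**1024 - 159          # arbitrary 1024-bit test operands
    b = 3**600
    result = from_limbs(mul_limbs(to_limbs(a, 16), to_limbs(b, 16)))
    print(result == a * b)
```

Two 1024-bit operands are 16 limbs each, so this method issues 16 x 16 = 256 limb multiplies, each interleaved with carry-propagating additions. That interleaving is where a multiply that leaves the flags untouched matters.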

Performance Gains

Cryptographic performance on Haswell has shown gains of 15% to over 100%¹ over implementations of the same algorithms on the Intel microarchitecture code name Sandy Bridge. We highlight the performance of a variety of the most common cryptographic algorithms in use today on an Intel Core i7 processor 4770 (formerly Haswell) core and an Intel Core i7 processor 2600 (formerly Sandy Bridge) core. All test results are given in cycles/byte, or in CPU cycles for RSA, in order to provide an accurate representation of the microarchitecture's capabilities and to eliminate any frequency discrepancies. For more accurate single thread performance comparisons, we disabled Intel Turbo Boost Technology.

Methodology

Timing is measured using the rdtsc() function, which returns the processor time stamp counter (TSC). The TSC is the number of clock cycles since the last reset. TSC_initial is the TSC value recorded before the function being measured is called. After the function completes, rdtsc() is called again to record the new cycle count, TSC_final. The effective cycle count for the called routine is computed as

# of cycles = (TSC_final - TSC_initial)

We measured the performance of the functions on data buffers of 1024 bytes (or, in the case of modular exponentiation, just the function itself). We called the functions to hash, encrypt, or decrypt the same buffer a large number of times, collecting many timing measurements. We discarded the first and last eighth of the samples, sorted the timings, and then discarded the largest and smallest quarter, leaving the remaining quarter to be averaged. Finally, the average value was divided by the 1024-byte buffer size to express the performance in cycles per byte.

For the AES performance comparisons, the same code is run on the Intel microarchitecture code name Sandy Bridge and on Haswell. The SHA and RSA performance comparisons use different code on Haswell in order to include the new instructions.
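The sample-trimming procedure can be sketched as follows. This is one possible reading rather than the authors' measurement harness: the function name `trimmed_cycles_per_byte` is hypothetical, the input is assumed to be a list of (TSC_final - TSC_initial) values in the order they were collected, and each "quarter" is assumed to mean a quarter of the original sample count, which is the reading under which exactly one quarter of the samples remains.

```python
def trimmed_cycles_per_byte(samples, buffer_bytes=1024):
    """Average the retained cycle counts and convert to cycles per byte.

    samples: cycle counts in collection order (assumed input format).
    """
    n = len(samples)
    eighth = n // 8
    quarter = n // 4

    # Drop the first and last eighth in collection order (warm-up and tail).
    kept = samples[eighth:n - eighth]

    # Sort, then drop the smallest and largest quarter (of the original count).
    kept = sorted(kept)
    kept = kept[quarter:len(kept) - quarter]
    if not kept:
        raise ValueError("not enough samples to trim")

    return sum(kept) / len(kept) / buffer_bytes


if __name__ == "__main__":
    # 16 made-up samples: noisy ends, a few outliers, and a stable core.
    samples = [50000, 40000,
               9000, 4096, 1000, 4096, 9000, 2000,
               4096, 3000, 9000, 4096, 4000, 9000,
               60000, 70000]
    print(trimmed_cycles_per_byte(samples))
```

For modular exponentiation, where results are reported in cycles rather than cycles per byte, the same trimming would apply with `buffer_bytes=1`.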
Note: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance

Table 1. SHA Single Buffer Performance (cycles/byte)¹

Algorithm    Sandy Bridge    Haswell    Haswell Gain
SHA-1                5.44       3.80           1.43x
SHA-256             12.82       8.59           1.49x
SHA-512              8.60       6.27           1.37x

The combination of the new AVX2 instructions (instead of AVX) for the message schedule, the new BMI instructions used in the rounds processing, and the addition of the 4th ALU yields a significant SHA performance gain of 1.37X to 1.49X¹.

Table 2. SHA Multi-Buffer Performance (cycles/byte)¹

Algorithm    Sandy Bridge    Haswell    Haswell Gain
SHA-1                2.11       1.13           1.87x
SHA-256              5.12       2.67           1.92x
SHA-512              6.72       3.32           2.02x

Multi-Buffer SHA performance increases ~2X¹ due to the doubling of the SIMD integer operation width to 256 bits in AVX2. For SHA-1 and SHA-256, eight independent buffers can be processed at once because there are eight 32-bit lanes in the 256-bit YMM register. For SHA-512, which operates on 64-bit words, four independent data buffers can be processed in parallel.

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Refer to the Performance of Multi-Hash section on page 9. For more information go to http://www.intel.com/performance.
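The lane arithmetic behind the multi-buffer figures can be written out as a short Python sketch. It only models the data layout, with one lane per independent buffer; the names `WORD_BITS`, `buffers_in_flight`, and `lanewise_add` are hypothetical, and no actual SIMD instructions are involved.

```python
# Word size used by each hash's internal arithmetic (32-bit for SHA-1 and
# SHA-256, 64-bit for SHA-512).
WORD_BITS = {"SHA-1": 32, "SHA-256": 32, "SHA-512": 64}


def buffers_in_flight(algorithm, register_bits):
    """Number of independent buffers that fit in one vector register,
    assuming one lane (one word) per buffer."""
    return register_bits // WORD_BITS[algorithm]


def lanewise_add(a, b, word_bits):
    """Model of a SIMD add: each lane wraps around independently."""
    mask = (1 << word_bits) - 1
    return [(x + y) & mask for x, y in zip(a, b)]


if __name__ == "__main__":
    for name in WORD_BITS:
        print(name,
              "128-bit:", buffers_in_flight(name, 128),
              "256-bit:", buffers_in_flight(name, 256))
    # One vector holds the same state word from 8 different SHA-256 buffers.
    state = [0xFFFFFFFF, 1, 2, 3, 4, 5, 6, 7]
    delta = [1] * 8
    print(lanewise_add(state, delta, 32))
```

Going from 128-bit to 256-bit registers doubles the lane count for every algorithm in the table, which is consistent with the roughly 2X gains reported in Table 2.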

Table 3. AES Performance (cycles/byte)¹

Algorithm              Sandy Bridge    Haswell    Haswell Gain
AES-128-CBC Encrypt            5.21       4.52           1.15x
AES-128-CBC Decrypt            0.76       0.64           1.19x
AES-256-CBC Encrypt            7.21       6.27           1.15x
AES-256-CBC Decrypt            1.02       0.89           1.15x
AES GCM Encrypt                2.76       1.26           2.19x

The decrease in AES instruction latency from 8 cycles to 7 is evident in the CBC Encrypt performance: the measured 1.15x gain essentially matches the 8/7 (about 1.14x) latency ratio. For the throughput improvement, CBC Decrypt (and CBC Encrypt Multi-Buffer, not shown) now comes close to a perfect one round per cycle; for AES-128, ten rounds over a 16-byte block would be 0.625 cycles/byte, against the measured 0.64. The GCM gain is a dramatic 2.19X¹ because the microarchitecture of the PCLMULQDQ instruction on Haswell has been significantly improved.

Table 4. Modular Exponentiation Performance (cycles)¹

Algorithm    Sandy Bridge      Haswell    Haswell Gain
512-bit           231,750      173,348           1.34x
1024-bit        1,752,092    1,318,895           1.33x

Modular exponentiation is the most compute-intensive portion of the RSA algorithm, hence we focus on this calculation. For RSA 2048 decryption (private key operation), two 1024-bit modular exponentiation calculations are required, assuming the use of the Chinese Remainder Theorem. The 1.33X¹ performance increase is a solid return on investment for modifying code to use MULX.
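The reason an RSA private key operation reduces to two half-size modular exponentiations can be seen in a small Python sketch of Chinese Remainder Theorem (CRT) decryption. This is a textbook illustration with deliberately tiny, insecure primes, not the measured implementation: the function name `rsa_decrypt_crt` and the toy key are assumptions chosen for the example, and `pow(q, -1, p)` requires Python 3.8 or later.

```python
def rsa_decrypt_crt(c, p, q, d):
    """Recover m = c^d mod (p*q) using two exponentiations modulo p and q."""
    dp = d % (p - 1)                 # reduced private exponent for p
    dq = d % (q - 1)                 # reduced private exponent for q
    q_inv = pow(q, -1, p)            # q^-1 mod p (Python 3.8+)

    m1 = pow(c % p, dp, p)           # first half-size modular exponentiation
    m2 = pow(c % q, dq, q)           # second half-size modular exponentiation

    # Garner recombination of the two residues into the result mod p*q.
    h = (q_inv * (m1 - m2)) % p
    return m2 + h * q


if __name__ == "__main__":
    # Toy key for illustration only: n = 61 * 53, e = 17, d = 2753.
    p, q, e, d = 61, 53, 17, 2753
    n = p * q
    message = 65
    ciphertext = pow(message, e, n)
    print(rsa_decrypt_crt(ciphertext, p, q, d) == pow(ciphertext, d, n))
```

With a 2048-bit modulus, p and q are each 1024 bits, so the two `pow` calls above correspond to the 1024-bit modular exponentiations timed in Table 4.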

Conclusion

The cryptographic application performance gains provided by Haswell over the Intel microarchitecture code name Sandy Bridge range from 1.15X to 2.19X¹. The AES improvements can be seen out of the box with a processor upgrade. The significant SHA and RSA performance increases require code updates to take advantage of the new Haswell feature set. Intel has been aggressively enabling common software libraries, such as OpenSSL*, with these code updates to ensure peak performance of most applications on Haswell processors. For more information on enabling custom software libraries, see the reference papers.

Acknowledgements

We thank Jim Guilford, Erdinc Ozturk, David Cote, and Wajdi Feghali for their substantial contributions to this work.

References

[1] Federal Information Processing Standards Publication 180-2, Secure Hash Standard, http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
[2] Federal Information Processing Standards Publication 197, Advanced Encryption Standard, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[3] E. Ozturk, V. Gopal, Enabling High-Performance Galois-Counter-Mode on Intel Architecture Processors, October 2012, http://www.intel.com/content/www/us/en/intelligent-systems/networksecurity/enabling-high-performance-gcm.html
[4] J. Guilford, K. Yap, V. Gopal, Fast SHA-256 Implementations on Intel Architecture Processors, May 2012, https://www-ssl.intel.com/content/www/us/en/intelligent-systems/inteltechnology/sha-256-implementations-paper.html
[5] E. Ozturk, J. Guilford, V. Gopal, W. Feghali, New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors, August 2012, https://www-ssl.intel.com/content/www/us/en/intelligent-systems/inteltechnology/ia-large-integer-arithmetic-paper.html

Authors

Sean Gulley and Vinodh Gopal are IA Architects with the Datacenter and Connected Systems Group at Intel Corporation.

Acronyms

AES     Advanced Encryption Standard
ALU     Arithmetic Logic Unit
AVX     Advanced Vector Extensions
BMI     Bit Manipulation Instructions
CBC     Cipher Block Chaining
DH      Diffie-Hellman
ECC     Elliptic Curve Cryptography
GCM     Galois Counter Mode
GPR     General Purpose Register
IA      Intel Architecture
NI      New Instructions
RSA     Rivest-Shamir-Adleman
SHA     Secure Hash Algorithm
SIMD    Single Instruction Multiple Data
SSL     Secure Socket Layer
TLS     Transport Layer Security

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Intel Turbo Boost Technology requires a PC with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your PC manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost.

Intel, Intel Turbo Boost Technology, Intel Hyper-Threading Technology, Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2013 Intel Corporation. All rights reserved.