Haswell Cryptographic Performance




White Paper

Sean Gulley
Vinodh Gopal
IA Architects
Intel Corporation

Haswell Cryptographic Performance

July 2013
329282-001

Executive Summary

The new Haswell microarchitecture featured in the 4th generation Intel Core Processor family implements several new instructions designed to improve cryptographic processing performance. Additionally, Haswell improves the performance characteristics of critical instructions, such as the Advanced Encryption Standard New Instructions (AES-NI), over the previous Intel microarchitectures code named Sandy Bridge and Ivy Bridge. This paper details the new and improved features Haswell has to offer related to cryptographic algorithm implementations of the Secure Hash Algorithm (SHA) [1], various modes of the Advanced Encryption Standard (AES) [2], and Rivest-Shamir-Adleman (RSA).

The paper introduces many of the enhancements made to the Haswell microarchitecture intended to improve the performance of the most heavily used cryptographic algorithms. Increases of 15% to over 100% per cycle are demonstrated on a single thread of an Intel Core i7 processor 4770 (formerly Haswell) core over an Intel Core i7 processor 2600 (formerly Sandy Bridge) core.

The Intel Embedded Design Center provides qualified developers with web-based access to technical resources. Access Intel Confidential design materials, step-by-step guidance, application reference solutions, training, Intel's tool loaner program, and connect with an e-help desk and the embedded community. Design Fast. Design Smart. Get started today. http://www.intel.com/p/en_us/embedded.

Contents

Overview
Haswell Improvements
    AES
    SHA
    Public Key
Performance Gains
    Methodology
Conclusion
Acknowledgements
References

Tables

Table 1. SHA Single Buffer Performance (cycles/byte)
Table 2. SHA Multi-Buffer Performance (cycles/byte)
Table 3. AES Performance (cycles/byte)
Table 4. Modular Exponentiation Performance (cycles)

Overview

Cryptographic algorithms are the backbone of any secure platform or communication channel. Use cases such as storage confidentiality, executable integrity, and secure web connections all require strong cryptographic algorithms to provide good security. One of the main obstacles to security is the adverse impact on a user's experience or system performance due to the high computational cost of cryptographic algorithms. Since the introduction of AES-NI in Intel Processors in 2010, Intel has continuously focused on improving encryption/decryption, secure hashing, and public/private key operations in an effort to proliferate better security solutions for consumers and enterprises.

Several features in the Haswell microarchitecture, first introduced in June 2013 on the 4th generation Intel Core Processor family, have been specifically designed to lower the cost of secure implementations. The performance of two modes of AES operation, Cipher Block Chaining (CBC) and Galois Counter Mode (GCM), is examined to demonstrate how Haswell's improved AES-NI performance is designed to accelerate secure networking (e.g., Secure Socket Layer (SSL)/Transport Layer Security (TLS), IPSec) and general encryption needs. Micro-architectural improvements in the core and new instructions are detailed to provide insight into the dramatic single and multi-buffer performance of SHA-1, SHA-256, and SHA-512 operations. Significant improvements in modular exponentiation functionality, useful in public key algorithms such as RSA, are also examined.

Haswell Improvements

The enhancements to Haswell for cryptographic performance come in three flavors: new instructions, micro-architectural improvements, and novel software implementations. There are several new VEX-encoded General Purpose Register (GPR) instructions in the Bit Manipulation Instructions (BMI) feature group that aid in SHA (RORX) and RSA (MULX) performance increases. Additionally, the new Advanced Vector Extensions 2 (AVX2) instructions, which promote vector integer operations from 128 bits to 256 bits, increase SHA single buffer and multi-buffer performance. The increase from three Arithmetic Logic Units (ALUs) to four on Haswell also benefits RSA and SHA single buffer. Micro-architectural improvements to the latency and throughput of the AESENC/AESDEC and PCLMULQDQ instructions drive higher AES performance.

AES

Haswell has reduced the latency of the AES instructions from 8 cycles on the 2nd generation Intel Core Processor family microarchitecture down to 7. Additionally, the throughput has been optimized by reducing the number of micro-operations. The reduction in latency helps serial modes of AES operation, such as CBC Encrypt. The increase in throughput aids parallel modes of operation, such as CBC Decrypt or multi-buffer. The PCLMULQDQ instruction microarchitecture has been significantly modified to halve the latency (down to 7 cycles) and increase the throughput 4X (2 cycles). These major changes, in concert with the AES instruction improvements, have dramatically increased performance for implementations of algorithms such as AES-GCM [3].

SHA

New features in Haswell have led to improved SHA performance for single data buffer hashing and for hashing multiple independent data buffers simultaneously. For single buffer hashing, the new RORX instruction results in faster rotates, a key function in SHA processing, by allowing more operations to occur in parallel due to the non-destructive property of the instruction [4]. The wider 256-bit YMM registers used in AVX2 instructions bring two benefits to SHA processing: faster message schedule calculations for single buffer, and double the number of data buffers that can be processed using multi-buffer.

Public Key

Large integer arithmetic is the foundation for many cryptographic algorithms, such as RSA, Elliptic Curve Cryptography (ECC), and Diffie-Hellman (DH) key exchange. These algorithms generally require the multiplication of very large integers (e.g., 1024 bits). Given that the multipliers on Intel processors are only 64 bits wide, the full operation requires many multiplications and additions. The new MULX instruction was specifically designed with this in mind to provide two key advantages over the traditional MUL instruction [5]. First, MULX provides greater register flexibility by allowing the two destination registers to be distinct from the sources. Second, the arithmetic flags are untouched, thus allowing MULX to be mixed with add-carry instructions without corrupting the carry chain.
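As a rough illustration of why a 1024-bit product takes many 64-bit multiplications and additions, the following Python sketch models schoolbook multiplication over 64-bit limbs. The helper names (mulx, to_limbs, from_limbs, multiply_limbs) are hypothetical and chosen only for this example. The mulx function merely mimics the two-register (high, low) result of a 64x64-bit multiply; it does not model flags, register allocation, or timing.

```python
MASK64 = (1 << 64) - 1


def mulx(a, b):
    """Model a 64x64 -> 128-bit multiply, returning the (high, low) 64-bit halves."""
    product = a * b
    return product >> 64, product & MASK64


def to_limbs(value, count):
    """Split a non-negative integer into `count` little-endian 64-bit limbs."""
    return [(value >> (64 * i)) & MASK64 for i in range(count)]


def from_limbs(limbs):
    """Reassemble little-endian 64-bit limbs into a single integer."""
    return sum(limb << (64 * i) for i, limb in enumerate(limbs))


def multiply_limbs(x_limbs, y_limbs):
    """Schoolbook multiplication: one 64-bit multiply per pair of limbs."""
    result = [0] * (len(x_limbs) + len(y_limbs))
    for i, x in enumerate(x_limbs):
        carry = 0
        for j, y in enumerate(y_limbs):
            high, low = mulx(x, y)
            # Accumulate the low half plus the incoming carry into this column.
            total = result[i + j] + low + carry
            result[i + j] = total & MASK64
            # The high half and any overflow propagate to the next column.
            carry = high + (total >> 64)
        result[i + len(y_limbs)] = carry
    return result


if __name__ == "__main__":
    x = (1 << 1024) - 12345
    y = (1 << 1023) + 987654321
    product = from_limbs(multiply_limbs(to_limbs(x, 16), to_limbs(y, 16)))
    print(product == x * y)
```

Two 1024-bit operands occupy 16 limbs each, so the inner loop performs 16 x 16 = 256 limb multiplications, each followed by additions that carry into the next column. That interleaving of multiplies and carry-propagating adds is the pattern the text describes.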

Performance Gains

Cryptographic performance on Haswell has shown gains of 15% to over 100%¹ over the Intel microarchitecture code name Sandy Bridge implementations of the same algorithms. We highlight the performance of a variety of the most common cryptographic algorithms in use today on an Intel Core i7 processor 4770 (formerly Haswell) core and an Intel Core i7 processor 2600 (formerly Sandy Bridge) core. All test results are given in cycles/byte, or just CPU cycles (for RSA), in order to provide an accurate representation of the microarchitecture's capabilities and to eliminate any frequency discrepancies. For more accurate single thread performance comparisons, we disabled Intel Turbo Boost Technology.

Methodology

Timing is measured using the rdtsc() function, which returns the processor time stamp counter (TSC). The TSC is the number of clock cycles since the last reset. TSC_initial is the TSC recorded before the function to be measured is called. After the function completes, rdtsc() is called again to record the new cycle count, TSC_final. The effective cycle count for the called routine is computed as:

# of cycles = TSC_final - TSC_initial

We measured the performance of the functions on data buffers of size 1024 bytes (or, in the case of modular exponentiation, just the function itself). We called the functions to hash, encrypt, or decrypt the same buffer a large number of times, collecting many timing measurements. We discarded the first and last 1/8th of the samples, sorted the timings, and then discarded the largest and smallest quarters, leaving the remaining quarter to be averaged. Finally, the average value was divided by the 1024 byte buffer size to express the performance in cycles per byte.

For AES performance comparisons, the same code is run on the Intel microarchitecture code name Sandy Bridge and on Haswell. The SHA and RSA performance comparisons use different code on Haswell to include the new instructions.
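The sample filtering described above can be sketched in Python as follows. This is a minimal sketch under one stated assumption: the "largest/smallest quarter" is taken to mean a quarter of the original sample count at each end, which is the reading under which exactly one quarter of the samples remains. The function name cycles_per_byte and its parameters are hypothetical, and the input is assumed to be a list of cycle counts already collected in measurement order.

```python
def cycles_per_byte(samples, buffer_size=1024):
    """Reduce raw cycle counts to cycles/byte using the filtering described above.

    Assumption: quarters are measured against the original sample count.
    """
    n = len(samples)
    if n < 8:
        raise ValueError("need at least 8 samples to apply the filter")
    eighth = n // 8
    quarter = n // 4
    # Drop the first and last eighth of the run (in measurement order).
    kept = samples[eighth:n - eighth]
    # Sort, then drop the smallest and largest quarter of the original count.
    kept = sorted(kept)
    kept = kept[quarter:len(kept) - quarter]
    average_cycles = sum(kept) / len(kept)
    return average_cycles / buffer_size


if __name__ == "__main__":
    # Made-up cycle counts, for illustration only (not measured data).
    demo = [99999, 88888,
            5000, 5100, 5120, 5120, 5120, 5120, 5130, 5200, 5300, 6000, 4000, 4500,
            77777, 66666]
    print(cycles_per_byte(demo))
```

Dropping samples by position removes warm-up and tail effects, while dropping them by rank after sorting removes outliers such as interrupts. For modular exponentiation, which is reported in cycles rather than cycles/byte, the same filter would apply with buffer_size set to 1.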
Note: Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. For more information go to http://www.intel.com/performance.

Table 1. SHA Single Buffer Performance (cycles/byte)¹

Algorithm    Intel microarchitecture code name Sandy Bridge    Haswell    Haswell Gain
SHA-1        5.44                                              3.80       1.43x
SHA-256      12.82                                             8.59       1.49x
SHA-512      8.60                                              6.27       1.37x

Through the combination of new AVX2 instructions (instead of AVX) for the message schedule, the new BMI instructions used in the rounds processing, and the addition of the 4th ALU, SHA single buffer performance increases by a significant 1.37X to 1.49X¹.

Table 2. SHA Multi-Buffer Performance (cycles/byte)¹

Algorithm    Intel microarchitecture code name Sandy Bridge    Haswell    Haswell Gain
SHA-1        2.11                                              1.13       1.87x
SHA-256      5.12                                              2.67       1.92x
SHA-512      6.72                                              3.32       2.02x

Multi-buffer SHA performance increases ~2X¹ due to the doubling of the SIMD integer operation width to 256 bits in AVX2. For SHA-1 and SHA-256, eight independent buffers can be processed at once because there are eight 32-bit lanes in the 256-bit YMM register. For SHA-512, which operates on 64-bit words, four independent data buffers can be processed in parallel.

¹ Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Configurations: Refer to the Performance of Multi-Hash section on page 9. For more information go to http://www.intel.com/performance.
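The lane counts and the Haswell Gain column for multi-buffer SHA follow from simple arithmetic, sketched below in Python. The cycles/byte figures are copied from Table 2; the names WORD_BITS, TABLE_2, lanes, and gain are hypothetical and exist only for this illustration.

```python
# Word size in bits for each hash: SHA-1 and SHA-256 use 32-bit words,
# SHA-512 uses 64-bit words.
WORD_BITS = {"SHA-1": 32, "SHA-256": 32, "SHA-512": 64}

# Multi-buffer cycles/byte copied from Table 2: (Sandy Bridge, Haswell).
TABLE_2 = {
    "SHA-1": (2.11, 1.13),
    "SHA-256": (5.12, 2.67),
    "SHA-512": (6.72, 3.32),
}


def lanes(register_bits, word_bits):
    """Number of independent buffers that fit side by side in one SIMD register."""
    return register_bits // word_bits


def gain(old_cycles_per_byte, new_cycles_per_byte):
    """Speedup ratio, rounded to two decimals as in the Haswell Gain column."""
    return round(old_cycles_per_byte / new_cycles_per_byte, 2)


if __name__ == "__main__":
    for name, bits in WORD_BITS.items():
        sandy_bridge, haswell = TABLE_2[name]
        print(name,
              "128-bit lanes:", lanes(128, bits),
              "256-bit lanes:", lanes(256, bits),
              "gain:", gain(sandy_bridge, haswell))
```

Doubling the register width doubles the lane count for every word size, which is consistent with the roughly 2X gains reported in Table 2.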

Table 3. AES Performance (cycles/byte)¹

Algorithm              Intel microarchitecture code name Sandy Bridge    Haswell    Haswell Gain
AES-128-CBC Encrypt    5.21                                              4.52       1.15x
AES-128-CBC Decrypt    0.76                                              0.64       1.19x
AES-256-CBC Encrypt    7.21                                              6.27       1.15x
AES-256-CBC Decrypt    1.02                                              0.89       1.15x
AES GCM Encrypt        2.76                                              1.26       2.19x

The decrease in latency from 8 cycles to 7 cycles is obvious in the CBC Encrypt performance. The 8/7 ratio is efficiently achieved. For the throughput improvement, we now achieve close to a perfect one round per cycle for CBC Decrypt (and for CBC Encrypt Multi-Buffer, not shown). The GCM performance gain is a dramatic 2.19X¹ because the microarchitecture of the PCLMULQDQ instruction on Haswell has been significantly improved.

Table 4. Modular Exponentiation Performance (cycles)¹

Algorithm    Intel microarchitecture code name Sandy Bridge    Haswell      Haswell Gain
512-bit      231,750                                           173,348      1.34x
1024-bit     1,752,092                                         1,318,895    1.33x

Modular exponentiation is the most compute-intensive portion of the RSA algorithm, hence we focus on this calculation. For RSA 2048 decryption (private key operation), two 1024-bit modular exponentiation calculations are required, assuming the use of the Chinese Remainder Theorem. The 1.33X¹ performance increase is a solid return on investment for modifying code to use MULX.
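The CBC rows of Table 3 can be sanity-checked against a simple latency and throughput model, sketched here in Python. The model is an idealization, not the authors' method: it assumes 10 rounds for AES-128 and 14 rounds for AES-256 over 16-byte blocks (per FIPS 197 [2]), treats CBC Encrypt as fully serialized on the AES instruction latency, treats CBC Decrypt as limited only by issuing one round per cycle, and ignores XOR, load/store, and loop overhead. The function names are hypothetical.

```python
BLOCK_BYTES = 16
ROUNDS = {128: 10, 256: 14}  # AES rounds per block for each key size


def cbc_encrypt_bound(key_bits, latency_cycles):
    """Idealized cycles/byte when every round waits on the previous one."""
    return ROUNDS[key_bits] * latency_cycles / BLOCK_BYTES


def cbc_decrypt_bound(key_bits, rounds_per_cycle=1.0):
    """Idealized cycles/byte when independent blocks keep the AES unit busy."""
    return ROUNDS[key_bits] / rounds_per_cycle / BLOCK_BYTES


if __name__ == "__main__":
    for key_bits in (128, 256):
        print("AES-%d CBC Encrypt, 8-cycle latency:" % key_bits,
              cbc_encrypt_bound(key_bits, 8))
        print("AES-%d CBC Encrypt, 7-cycle latency:" % key_bits,
              cbc_encrypt_bound(key_bits, 7))
        print("AES-%d CBC Decrypt, 1 round/cycle:" % key_bits,
              cbc_decrypt_bound(key_bits))
```

Under these assumptions the model gives 5.0 and 4.375 cycles/byte for AES-128-CBC Encrypt at 8 and 7 cycles of latency, against measured values of 5.21 and 4.52, and 0.625 cycles/byte for AES-128-CBC Decrypt, against a measured 0.64 on Haswell. The small remaining gap is what one would expect from the overhead the model leaves out.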

Conclusion

The cryptographic application performance gains provided by Haswell over the Intel microarchitecture code name Sandy Bridge range from 1.15X to 2.19X¹. The AES improvements can be seen out of the box with a processor upgrade. The significant SHA and RSA performance increases require code updates to take advantage of the new Haswell feature set. Intel has been aggressively enabling common software libraries, such as OpenSSL*, with these code updates to ensure peak performance of most applications on Haswell processors. For more information on enabling custom software libraries, see the reference papers.

Acknowledgements

We thank Jim Guilford, Erdinc Ozturk, David Cote, and Wajdi Feghali for their substantial contributions to this work.

References

[1] Federal Information Processing Standards Publication 180-2, Secure Hash Standard, http://csrc.nist.gov/publications/fips/fips180-2/fips180-2.pdf
[2] Federal Information Processing Standards Publication 197, Advanced Encryption Standard, http://csrc.nist.gov/publications/fips/fips197/fips-197.pdf
[3] E. Ozturk, V. Gopal, Enabling High-Performance Galois-Counter-Mode on Intel Architecture Processors, October 2012, http://www.intel.com/content/www/us/en/intelligent-systems/networksecurity/enabling-high-performance-gcm.html
[4] J. Guilford, K. Yap, V. Gopal, Fast SHA-256 Implementations on Intel Architecture Processors, May 2012, https://www-ssl.intel.com/content/www/us/en/intelligent-systems/inteltechnology/sha-256-implementations-paper.html
[5] E. Ozturk, J. Guilford, V. Gopal, W. Feghali, New Instructions Supporting Large Integer Arithmetic on Intel Architecture Processors, August 2012, https://www-ssl.intel.com/content/www/us/en/intelligent-systems/inteltechnology/ia-large-integer-arithmetic-paper.html

Authors

Sean Gulley and Vinodh Gopal are IA Architects with the Datacenter and Connected Systems Group at Intel Corporation.

Acronyms

AES     Advanced Encryption Standard
ALU     Arithmetic Logic Unit
AVX     Advanced Vector Extensions
BMI     Bit Manipulation Instructions
CBC     Cipher Block Chaining
DH      Diffie-Hellman
ECC     Elliptic Curve Cryptography
GCM     Galois Counter Mode
GPR     General Purpose Register
IA      Intel Architecture
NI      New Instructions
RSA     Rivest-Shamir-Adleman
SHA     Secure Hash Algorithm
SIMD    Single Instruction Multiple Data
SSL     Secure Socket Layer
TLS     Transport Layer Security

INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. EXCEPT AS PROVIDED IN INTEL'S TERMS AND CONDITIONS OF SALE FOR SUCH PRODUCTS, INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO SALE AND/OR USE OF INTEL PRODUCTS INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. A "Mission Critical Application" is any application in which failure of the Intel Product could result, directly or indirectly, in personal injury or death. SHOULD YOU PURCHASE OR USE INTEL'S PRODUCTS FOR ANY SUCH MISSION CRITICAL APPLICATION, YOU SHALL INDEMNIFY AND HOLD INTEL AND ITS SUBSIDIARIES, SUBCONTRACTORS AND AFFILIATES, AND THE DIRECTORS, OFFICERS, AND EMPLOYEES OF EACH, HARMLESS AGAINST ALL CLAIMS COSTS, DAMAGES, AND EXPENSES AND REASONABLE ATTORNEYS' FEES ARISING OUT OF, DIRECTLY OR INDIRECTLY, ANY CLAIM OF PRODUCT LIABILITY, PERSONAL INJURY, OR DEATH ARISING IN ANY WAY OUT OF SUCH MISSION CRITICAL APPLICATION, WHETHER OR NOT INTEL OR ITS SUBCONTRACTOR WAS NEGLIGENT IN THE DESIGN, MANUFACTURE, OR WARNING OF THE INTEL PRODUCT OR ANY OF ITS PARTS. Intel may make changes to specifications and product descriptions at any time, without notice. Designers must not rely on the absence or characteristics of any features or instructions marked "reserved" or "undefined". Intel reserves these for future definition and shall have no responsibility whatsoever for conflicts or incompatibilities arising from future changes to them. The information here is subject to change without notice. Do not finalize a design with this information. 
The products described in this document may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available on request. Contact your local Intel sales office or your distributor to obtain the latest specifications and before placing your product order. Copies of documents which have an order number and are referenced in this document, or other Intel literature, may be obtained by calling 1-800-548-4725, or go to: http://www.intel.com/design/literature.htm

Hyper-Threading Technology requires a computer system with a processor supporting HT Technology and an HT Technology-enabled chipset, BIOS and operating system. Performance will vary depending on the specific hardware and software you use. For more information including details on which processors support HT Technology, see here.

64-bit computing on Intel architecture requires a computer system with a processor, chipset, BIOS, operating system, device drivers and applications enabled for Intel 64 architecture. Performance will vary depending on your hardware and software configurations. Consult with your system vendor for more information.

Intel Turbo Boost Technology requires a PC with a processor with Intel Turbo Boost Technology capability. Intel Turbo Boost Technology performance varies depending on hardware, software and overall system configuration. Check with your PC manufacturer on whether your system delivers Intel Turbo Boost Technology. For more information, see http://www.intel.com/technology/turboboost.

Intel, Intel Turbo Boost Technology, Intel Hyper Threading Technology, Intel Xeon are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the U.S. and other countries.

*Other names and brands may be claimed as the property of others.

Copyright 2013 Intel Corporation. All rights reserved.