Improved Method for Parallel AES-GCM Cores Using FPGAs



Similar documents
Mobility management and vertical handover decision making in heterogeneous wireless networks

ibalance-abf: a Smartphone-Based Audio-Biofeedback Balance System

Implementation of Full -Parallelism AES Encryption and Decryption

A usage coverage based approach for assessing product family design

IJESRT. [Padama, 2(5): May, 2013] ISSN:

Design and Verification of Area-Optimized AES Based on FPGA Using Verilog HDL

THÈSE DE DOCTORAT DE L UNIVERSITÉ PIERRE ET MARIE CURIE. Spécialité: Informatique, Télécommunication et Électronique. Karim Moussa Ali Abdellatif

A graph based framework for the definition of tools dealing with sparse and irregular distributed data-structures

Study on Cloud Service Mode of Agricultural Information Institutions

Global Identity Management of Virtual Machines Based on Remote Secure Elements

Faut-il des cyberarchivistes, et quel doit être leur profil professionnel?

Managing Risks at Runtime in VoIP Networks and Services

Expanding Renewable Energy by Implementing Demand Response

Implementation and Design of AES S-Box on FPGA

Hardware Implementation of AES Encryption and Decryption System Based on FPGA

QASM: a Q&A Social Media System Based on Social Semantics

Partial and Dynamic reconfiguration of FPGAs: a top down design methodology for an automatic implementation

VR4D: An Immersive and Collaborative Experience to Improve the Interior Design Process

Overview of model-building strategies in population PK/PD analyses: literature survey.

ANIMATED PHASE PORTRAITS OF NONLINEAR AND CHAOTIC DYNAMICAL SYSTEMS

Use of tabletop exercise in industrial training disaster.

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Area Optimized and Pipelined FPGA Implementation of AES Encryption and Decryption

40G MACsec Encryption in an FPGA

FPGA IMPLEMENTATION OF AN AES PROCESSOR

Performance Evaluation of Encryption Algorithms Key Length Size on Web Browsers

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

Discussion on the paper Hypotheses testing by convex optimization by A. Goldenschluger, A. Juditsky and A. Nemirovski.

Additional mechanisms for rewriting on-the-fly SPARQL queries proxy

Aligning subjective tests using a low cost common set

Bitstream Encryption and Authentication using AES-GCM in Dynamically Reconfigurable Systems

Minkowski Sum of Polytopes Defined by Their Vertices

Application-Aware Protection in DWDM Optical Networks

Online vehicle routing and scheduling with continuous vehicle tracking

An update on acoustics designs for HVAC (Engineering)

Towards Unified Tag Data Translation for the Internet of Things

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan

SELECTIVELY ABSORBING COATINGS

Improving Performance of Secure Data Transmission in Communication Networks Using Physical Implementation of AES

An Energy Efficient ATM System Using AES Processor

New implementions of predictive alternate analog/rf test with augmented model redundancy

Design and Implementation of Low-area and Low-power AES Encryption Hardware Core

Information Technology Education in the Sri Lankan School System: Challenges and Perspectives

SeChat: An AES Encrypted Chat

Efficient Software Implementation of AES on 32-bit Platforms

GDS Resource Record: Generalization of the Delegation Signer Model

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

Pavithra.S, Vaishnavi.M, Vinothini.M, Umadevi.V

AES (Rijndael) IP-Cores

Exploiting Stateful Inspection of Network Security in Reconfigurable Hardware

synthesizer called C Compatible Architecture Prototyper(CCAP).

The Advanced Encryption Standard (AES)

ELECTENG702 Advanced Embedded Systems. Improving AES128 software for Altera Nios II processor using custom instructions

Enhancing Advanced Encryption Standard S-Box Generation Based on Round Key

Flauncher and DVMS Deploying and Scheduling Thousands of Virtual Machines on Hundreds of Nodes Distributed Geographically

Polymorphic AES Encryption Implementation

The Advanced Encryption Standard (AES)

Proposal for the configuration of multi-domain network monitoring architecture

Automata Designs for Data Encryption with AES using the Micron Automata Processor

What Development for Bioenergy in Asia: A Long-term Analysis of the Effects of Policy Instruments using TIAM-FR model

AES1. Ultra-Compact Advanced Encryption Standard Core. General Description. Base Core Features. Symbol. Applications

ETHERNET ENCRYPTION MODES TECHNICAL-PAPER

Fast Implementations of AES on Various Platforms

Multi-Layered Cryptographic Processor for Network Security

Heterogeneous PLC-RF networking for LLNs

ANALYSIS OF SNOEK-KOSTER (H) RELAXATION IN IRON

Secret File Sharing Techniques using AES algorithm. C. Navya Latha Garima Agarwal Anila Kumar GVN

Running an HCI Experiment in Multiple Parallel Universes

Survey on Enhancing Cloud Data Security using EAP with Rijndael Encryption Algorithm

Hinky: Defending Against Text-based Message Spam on Smartphones

CSCE 465 Computer & Network Security

An Instruction Set Extension for Fast and Memory-Efficient AES Implementation

A DA Serial Multiplier Technique based on 32- Tap FIR Filter for Audio Application

Networking Virtualization Using FPGAs

An integrated planning-simulation-architecture approach for logistics sharing management: A case study in Northern Thailand and Southern China

AES Power Attack Based on Induced Cache Miss and Countermeasure

Floating Point Fused Add-Subtract and Fused Dot-Product Units

ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT

Area optimized in storage area network using Novel Mix column Transformation in Masked AES

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

Chapter 6 CDMA/802.11i

Territorial Intelligence and Innovation for the Socio-Ecological Transition

Towards Collaborative Learning via Shared Artefacts over the Grid

Cobi: Communitysourcing Large-Scale Conference Scheduling

FPGA IMPLEMENTATION OF AES ALGORITHM

A modeling approach for locating logistics platforms for fast parcels delivery in urban areas

Security+ Guide to Network Security Fundamentals, Third Edition. Chapter 6. Wireless Network Security

Cracks detection by a moving photothermal probe

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source)

Transcription:

Improved Method for Parallel -GCM Cores Using FPGAs Karim Moussa Ali Abdellatif, Roselyne Chotin-Avot, abib Mehrez To cite this version: Karim Moussa Ali Abdellatif, Roselyne Chotin-Avot, abib Mehrez. Improved Method for Parallel -GCM Cores Using FPGAs. Reconfigurable Computing and FPGAs (ReConFig), Dec 2013, Cancun, Mexico. 2013, <10.1109/ReConFig.2013.6732299>. <hal-01160904> AL Id: hal-01160904 http://hal.upmc.fr/hal-01160904 Submitted on 8 Jun 2015 AL is a multi-disciplinary open access archive for the deposit and dissemination of scientific research documents, whether they are published or not. The documents may come from teaching and research institutions in France or abroad, or from public or private research centers. L archive ouverte pluridisciplinaire AL, est destinée au dépôt et à la diffusion de documents scientifiques de niveau recherche, publiés ou non, émanant des établissements d enseignement et de recherche français ou étrangers, des laboratoires publics ou privés.

Improved Method for Parallel -GCM Cores Using FPGAs Karim M. Abdellatif, R. Chotin-Avot, and. Mehrez LIP6-SoC Laboratory, University of Paris VI, France {karim.abdellatif, roselyne.chotin-avot, habib.mehrez}@lip6.fr C[2] C[3] C[i] Abstract This paper proposes an efficient method for implementing parallel -GCM cores using FPGAs. The proposed method improves the performance of the parallel architecture (Throughput/Slice). Presented architectures can be used for applications which require encryption and authentication with slow changing keys like Virtual Private Networks (VPNs). Our architectures were evaluated using Virtex5 FPGAs. It is shown that the performance of the presented parallel -GCM architecture outperforms the previously reported ones. Keywords-Parallel -GCM, FPGAs C[1] Figure 1. operation I. INTRODUCTION -GCM [1] simultaneously provides confidentiality, integrity and authenticity assurances on the data. It supports not only high speed authenticated encryption but also protection against bit-flipping attacks. It can be implemented in hardware to achieve high speeds with low cost and low latency. Software implementations can achieve excellent performance by using table-driven field operations. GCM was designed to meet the need for an authenticated encryption mode that can efficiently achieve speeds of 10 Gbps and higher in hardware. It contains an engine in CTR mode and a Galois ash () module. It is well-suited for wireless, optical, and magnetic recording systems due to its multi-gbps authenticated encryption speed, outstanding performance, minimal computational latency as well as high intrinsic degree of pipelining and parallelism. New communication standards like IEEE 802.1ae [2] and NIST 800-38D have considered employing GCM to enhance their performance. Although in ASIC technologies, several architectures of the -GCM reaching the 100 Gbps throughput have been demonstrated by [3], to the best of our knowledge no real designs for field-programmable gate array (FPGA) devices reaching the same performances have been so far presented. Therefore, we present efficient method for implementing parallel -GCM architectures by taking the advantage of slow changing key applications. Section II presents a background of -GCM. After that, our proposed parallel -GCM is presented in Section III. Then, we report our implementation results in Section IV. Finally, Section V concludes our work. II. RELATED WORK OF -GCM As shown in Fig. 1, the function (authentication part) is composed of chained GF(2 ) multipliers and bit- wise exclusive-or (XOR) operations. Serial implementation of GF(2 ) multiplier performs the multiplication process in clock cycles. Parallel method can be implemented like [4] and it takes only one clock cycle. Karatsuba Ofman Algorithm (KOA) was used by [5] to reduce the complexity (consumed area) of the GF(2 ) multiplier. In order to reduce the data path of KOA multiplier, pipelining concept was accomplished by [6]. Although the use of pipelining concept for KOA decreases the data path and increases the operating frequency, the number of clock cycles to process a number of -bits is increased. This is because the output is fed back and XORed to the next input and there is a latency resulting from pipelining. An example of this problem is shown in [6], their GF multiplier achieves the multiplication of 8 frames of -bits in 19 clock cycles because of using the pipelining concept. ence, their throughput is calculated as follows: T hroughput(mbps) = F max(mz) (8/19) (1) If is fixed, the multiplier is called fixed operand GF(2 ) multiplier [7] which can be used efficiently (smaller area) on FPGAs as the circuit is specialized for and a new reconfiguration is uploaded into the FPGA with the new specialization in case of changing the key. Two methods of pipelined (composite field and BRAMs) were accomplished with KOA multiplier by [5] using virtex4 FPGA. Zhou et al. [6] used three methods for pipelined implementation (composite field, BRAMs, and LUT) with pipelined KOA to increase the operating frequency of the overall architecture. enzen et al. [8] proposed 4-parallel -GCM using pipelined KOA. Their design achieved the authentication of 18 frames of -bits in 11 clock cycles because of the latency resulting from the pipelined KOA. As a result, their throughput is calculated c 2013 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. DOI: 10.1109/ReConFig.2013.6732299

Plaintext C[i] Precomputed K0 Add Round Key SubBytes Precomputed K1 Round 1 Plaintext ShiftRows Precomputed K2 Round 2 Synthesized Key MixColumns Ciphertext(C) X[i] () Precomputed K10 Round 10 Add Round Key Figure 3. operation as follows: Ciphertext(C) Figure 2. key-synthesized based T hroughput(mbps) = F max(mz) (18/11) (2) III. EFFICIENT PARALLEL -GCM FOR SLOW CANGING KEY APPLICATIONS USING FPGAS Virtual Private Networks (VPNs) are widely employed to connect private local area networks to remote locations. VPNs use -GCM for encryption and authentication. In this kind of networks, the secret key used for encryption and authentication is changed weekly, monthly or yearly. Current commercial security appliances of VPNs allow a throughput from 40 to 60 Gbit/s [9],[10]. Another example of slow changing keys application is embedded system memory protection [11] which requires infrequent key changes over weeks or months. The previous applications are considered slow key changing applications. Therefore, implementing the key expansion is particularly expensive in terms of hardware cost. The GF multiplier used for authentication is a challenge because its data path is longer than and pipelining method does not solve this problem as described before. has a key expansion or key schedule operation, which takes the main key and derives from it subkeys K r (10, 12, and 14 for -, -192, and - 256, respectively), where r denotes the corresponding round number. For our case, we concentrate on -. In our hardware implementation, constant key specialization on the FPGA is used. The precomputed keys are generated using a C code. After, these keys are synthesized into the architecture of. As a result, the key expansion scheme is reduced from the architecture of. As we look for high speed architectures, subpipelining is used to obtain high throughput. Fig. 2 shows synthesized key, where all keys are precomputed and synthesized into the architecture. As a result of using key-synthesized, the operand of the function is also fixed because it is generated by applying the block cipher to the zero block. Therefore, the proposed multiplier by [7] is suitable because it is based on fixed operand multiplier. In order to implement parallel architectures of - GCM using key-synthesized method, parallel s must be constructed to meet the requirement of key-synthesized method (i.e, one of the two operands of every must be fixed). Previous parallel schemes of [3], [12], [7] are not suitable because the two operands of every are varied during the running time operation. As a result, their architectures are not suitable for key-synthesized approach. Therefore, constructing parallel s which have a fixed operand for every multiplier is very important for high speed applications. Fig. 3 shows the GF(2 ) multiplication between an value and a -bit input value C. function for block i is defined in Eq. 3. X i = (C i X i 1 ) (3) In order to accomplish parallel architectures for higher throughput, Eq. 3 is rewritten to fit the parallel scheme as shown in Eq. 4. For every multiplier, there is a fixed operand as shown in Fig. 4. For example, Multiplier i has as an operand, Multiplier i 1 has 2,..., and Multiplier 1 has i. X i = (C i X i 1 ) = (C i ) (X i 1 ) = (C i ) [(C i 1 X i 2 ) 2 ] = (C i ) (C i 1 2 ) [(C i 2 X i 3 ) 3 ] = (C i ) (C i 1 2 ) (C i 2 3 ) [(C i 3 X i 4 ) 4 ] = ((C i ) } {{ } (C i 1 2 ) } {{ } (C i 2 3 ) } {{ } (C i 3 4 ) } {{ }... (C 2 i 1 ) } {{ } (C 1 i ) } {{ } (4)

Table I ARDWARE COMPARISON FPGA type Design Key SubBytes Slices BRAM Max-Freq Thr. Thr./Slice Mz Gbit/s Mbps/Slice This work Virtex5 -GCM LUT 3211 0 216.3 27.7 8.62 This work Virtex5 4-parallel -GCM LUT 12152 0 200 102.4 8.42 [6] Virtex5 -GCM LUT 4628 41 324 17.5 3.77 [8] Virtex5 4-parallel -GCM LUT 14799 0 233 48.8 3.29 C C C C i 2 i 1 3 i 1 i i 2 2 Figure 4. Proposed parallel with fixed operand during running time operation Counter4 synthesized Counter3 Plaintext 4 Plaintext 3 Plaintext 2 Plaintext 1 2 synthesized 3 synthesized C 1 Ciphertext 4(C) Ciphertext 3(C) Ciphertext 2(C) Ciphertext 1(C) Figure 5. Counter 2 Counter1 4 synthesized Presented 4-parallel -GCM using key-synthesized method Unlike previous work, this method makes the parallel architecture of suitable for key-synthesized method as the values i, i 1,... get synthesized into the architecture of the parallel. After solving the problem of parallel s using keysynthesized method, these parallel multipliers are combined with parallel cores to perform parallel -GCM. Fig. 5 shows an example of 4-parallel -GCM architecture using key-synthesized method. For parallel pipelined, the round keys are precomputed and synthesized into the architecture. In terms of parallel, the used operands (, 2, 3, 4 ) are also precomputed and synthesized into the architecture. VPNs infrastructure [9] can benefit from our keysynthesized parallel -GCM due to the nature of slow changing key operation. IV. ARDWARE COMPARISON We coded two architectures (-GCM and 4- parallel -GCM) in VDL and targeted to Virtex5 (XC5VLX220). ModelSim 6.5c was used for simulation. Xilinx Synthesize Technology (XST) is used to perform the synthesize and ISE9.2 was adopted to run the Place And Route (PAR). Table I shows the hardware comparison between our results and previous work. Note the filled dots in the Key column. Key is synthesized into the architecture when denoted by, otherwise, the key schedule is implemented when denoted by. The LUT approach for SubBytes is especially interesting on Virtex5 devices because 6-input Look-Up-Tables (LUT) combined with multiplexors allow an efficient implementation of the SubBytes stage. Therefore, we implemented the SubBytes of using LUT approach because the final target is Virtex5. On Virtex5 platform, our key-synthesized based -GCM core reaches the throughput of 27.7 Gbps with the area consumption of 3211 slices. By comparing our results of -GCM to the previous work, the comparison shows that our performance (Thr. /Slice) is better than [6]. The operating frequency presented by [6] is better than ours because they used pipelined KOA but the overall throughput is lower than ours because their achieves the multiplication of 8 frames of -bits in 19 clock cycles. Therefore, their throughput is calculated as described in Eq. 1. A 4-parallel -GCM module operates at 200 Mz on Virtex-5. In total, it consumes 12152 Slices. This implementation achieves throughput that reaches to 102.4 Gbps. enzen et al. [8] proposed 4-parallel -GCM using pipelined KOA. Their design achieves the authentication of 18 frames of -bits in 11 clock cycles because of the latency resulting from the pipelined KOA. As a result, their throughput is calculated as described in Eq. 2. It is clear that our work outperforms the previously reported ones. Therefore, proposed architectures can be used

efficiently for slow changing key applications like VPNs and embedded memory protection. V. CONCLUSION In this paper, we presented the performance improvement for implementing parallel -GCM cores using FPGAs. We integrated this solution by solving the complexity of the calculation (Eq. 4). With our proposed parallel -GCM, every multiplier has a fixed operand. Therefore, our parallel -GCM is suitable for key-synthesized method. Our comparison to previous work reveals that our architectures are more performance-efficient (Thr./Slice). [12] J. Wang, G. Shou, Y. u, and Z. Guo, igh-speed Architectures for Based on Efficient Bit-parallel Multipliers, IEEE International Conference on Wireless Communications, Networking and Information Security (WCNIS), pp. 582 586, 2010. ACKNOWLEDGMENT This work is a part of the project Robust FPGA ANR 11 INS-02, funded by The French National Research Agency, The Pole Systematic and The Pole Minalogic. REFERENCES [1] D. McGrew and J. Viega, The Security and Performance of the Galois/Counter Mode (GCM) of Operation, Progress in Cryptology- INDOCRYPT 2004, pp. 377 413, 2005. [2] IEEE Standard for Local and metropolitan area networks Media Access Control () Security Amendment 1: Galois Counter Mode Advanced Encryption Standard 256 (GCM--256) Cipher Suite, IEEE. [3] A. Satoh, T. Sugawara, and T. Aoki, igh-speed Pipelined ardware Architecture for Galois Counter Mode, Information Security, pp. 118 129, 2007. [4] A. Satoh, igh-speed ardware Architectures for Authenticated Encryption Mode GCM, Circuits and Systems, 2006. ISCAS 2006. Proceedings. 2006 IEEE International Symposium on, pp. 4 pp, 2006. [5] G. Zhou,. Michalik, and L. insenkamp, Efficient and igh- Throughput Implementations of -GCM on FPGAs, International Conference on Field-Programmable Technology (FPT), pp. 185 192, 2007. [6] G. Zhou and. Michalik, Improving Throughput of -GCM with Pipelined Karatsuba Multipliers on FPGAs, Reconfigurable Computing: Architectures, Tools and Applications, pp. 193 203, 2009. [7] J. Crenne, P. Cotret, G. Gogniat, R. Tessier, and J. Diguet, Efficient Key-Dependent Message Authentication in Reconfigurable ardware, International Conference on Field-Programmable Technology (FPT), pp. 1 6, 2011. [8] L. enzen and W. Fichtner, FPGA Parallel-Pipelined -GCM Core for 100G Ethernet Applications, Proceedings of the ESSCIRC, pp. 202 205, 2010. [9] C. Corporation, Cisco ASA 5500 Series Adaptive Security Appliances, 2011. [Online]. Available: http://www.cisco.com/en/us/prod/collateral/vpndevc/ps6032/ \ps6094/ps6120/prod-brochure0900aecd80285492.pdf [10] Stonesoft, Security Engine Firewall/VPN, 2011. [Online]. Available: http://www.stonesoft.com/export/download/pdf/ datasheet-stonesoft-3206.pdf [11] R. Vaslin, G. Gogniat, J. Diguet, R. Tessier, D. Unnikrishnan, and K. Gaj, Memory Security Management for Reconfigurable Embedded Systems, International Conference on ICECE Technology, pp. 153 160, 2008.