synthesizer called C Compatible Architecture Prototyper(CCAP).



Similar documents
Implementation of Full -Parallelism AES Encryption and Decryption

IJESRT. [Padama, 2(5): May, 2013] ISSN:

Design and Verification of Area-Optimized AES Based on FPGA Using Verilog HDL

The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

Efficient Software Implementation of AES on 32-bit Platforms

Hardware Implementation of AES Encryption and Decryption System Based on FPGA

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

FPGA IMPLEMENTATION OF AES ALGORITHM

International Association of Scientific Innovation and Research (IASIR) (An Association Unifying the Sciences, Engineering, and Applied Research)

Architectures and Platforms

Implementation and Design of AES S-Box on FPGA

ELECTENG702 Advanced Embedded Systems. Improving AES128 software for Altera Nios II processor using custom instructions

Area Optimized and Pipelined FPGA Implementation of AES Encryption and Decryption

7a. System-on-chip design and prototyping platforms

FPGA IMPLEMENTATION OF AN AES PROCESSOR

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

SeChat: An AES Encrypted Chat

Polymorphic AES Encryption Implementation

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Improving Performance of Secure Data Transmission in Communication Networks Using Physical Implementation of AES

AES Power Attack Based on Induced Cache Miss and Countermeasure

Performance Evaluation of AES using Hardware and Software Codesign

An Instruction Set Extension for Fast and Memory-Efficient AES Implementation

Rapid System Prototyping with FPGAs

Rijndael Encryption implementation on different platforms, with emphasis on performance

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT

Software-Programmable FPGA IoT Platform. Kam Chuen Mak (Lattice Semiconductor) Andrew Canis (LegUp Computing) July 13, 2016

Low Cost System on Chip Design for Audio Processing

The Advanced Encryption Standard: Four Years On

Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan

A PERFORMANCE EVALUATION OF COMMON ENCRYPTION TECHNIQUES WITH SECURE WATERMARK SYSTEM (SWS)

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

High-Level Synthesis for FPGA Designs

RAPID PROTOTYPING OF DIGITAL SYSTEMS Second Edition

A HARDWARE IMPLEMENTATION OF THE ADVANCED ENCRYPTION STANDARD (AES) ALGORITHM USING SYSTEMVERILOG

Enhancing Advanced Encryption Standard S-Box Generation Based on Round Key

Fast Implementations of AES on Various Platforms

Quartus II Software Design Series : Foundation. Digitale Signalverarbeitung mit FPGA. Digitale Signalverarbeitung mit FPGA (DSF) Quartus II 1

Networking Remote-Controlled Moving Image Monitoring System

Eli Levi Eli Levi holds B.Sc.EE from the Technion.Working as field application engineer for Systematics, Specializing in HDL design with MATLAB and

Digital Systems Design! Lecture 1 - Introduction!!

Floating Point Fused Add-Subtract and Fused Dot-Product Units

Serial port interface for microcontroller embedded into integrated power meter

Area optimized in storage area network using Novel Mix column Transformation in Masked AES

AES (Rijndael) IP-Cores

ELEC 5260/6260/6266 Embedded Computing Systems

Efficient Software Implementation of AES on 32-Bit Platforms

All Programmable Logic. Hans-Joachim Gelke Institute of Embedded Systems. Zürcher Fachhochschule

Improved Method for Parallel AES-GCM Cores Using FPGAs

Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner

The Advanced Encryption Standard (AES)

AC : PRACTICAL DESIGN PROJECTS UTILIZING COMPLEX PROGRAMMABLE LOGIC DEVICES (CPLD)

Design and Verification of Nine port Network Router

Cryptography and Network Security. Prof. D. Mukhopadhyay. Department of Computer Science and Engineering. Indian Institute of Technology, Kharagpur

Automata Designs for Data Encryption with AES using the Micron Automata Processor

9/14/ :38

Accelerate Cloud Computing with the Xilinx Zynq SoC

Fast Software AES Encryption

Survey on Enhancing Cloud Data Security using EAP with Rijndael Encryption Algorithm

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

AES implementation on Smart Card

40G MACsec Encryption in an FPGA

AES-CBC Software Execution Optimization

How To Design A Single Chip System Bus (Amba) For A Single Threaded Microprocessor (Mma) (I386) (Mmb) (Microprocessor) (Ai) (Bower) (Dmi) (Dual

How To Design An Image Processing System On A Chip

Computer Engineering: Incoming MS Student Orientation Requirements & Course Overview

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source)

AES Cipher Modes with EFM32

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Codesign: The World Of Practice

Hardware Encryption for Embedded Systems

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

LogiCORE IP AXI Performance Monitor v2.00.a

System-on. on-chip Design Flow. Prof. Jouni Tomberg Tampere University of Technology Institute of Digital and Computer Systems.

Design and Implementation of Low-area and Low-power AES Encryption Hardware Core

Pavithra.S, Vaishnavi.M, Vinothini.M, Umadevi.V

A Framework for Automatic Generation of Configuration Files for a Custom Hardware/Software RTOS

An Energy Efficient ATM System Using AES Processor

Performance Oriented Management System for Reconfigurable Network Appliances

Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor

Side Channel Analysis and Embedded Systems Impact and Countermeasures

A Comparative Study Of Two Symmetric Encryption Algorithms Across Different Platforms.

Offline HW/SW Authentication for Reconfigurable Platforms

Design and Implementation of an On-Chip timing based Permutation Network for Multiprocessor system on Chip

Hardware Task Scheduling and Placement in Operating Systems for Dynamically Reconfigurable SoC

DS1104 R&D Controller Board

White Paper. Shay Gueron Intel Architecture Group, Israel Development Center Intel Corporation

Chapter 2 Features of Embedded System

Computer Systems Structure Input/Output

Example-driven Interconnect Synthesis for Heterogeneous Coarse-Grain Reconfigurable Logic

Implementation of emulated digital CNN-UM architecture on programmable logic devices and its applications

Transcription:

Speed Improvement of AES Encryption using hardware accelerators synthesized by C Compatible Architecture Prototyper(CCAP) Hiroyuki KANBARA Takayuki NAKATANI Naoto UMEHARA Nagisa ISHIURA Hiroyuki TOMIYAMA ASTEM RI, Shimogyo, Kyoto, 600-8813, JAPAN Graduate School of Science&Engineering, Ritsumeikan University, Kusatsu, 525-8577, JAPAN School of Science&Technology, Kwansei Gakuin University, Sanda, 669-1337, JAPAN Graduate School of Information Science, Nagoya University, Chigusa, Nagoya, 464-8603, JAPAN hls@ksc.kwansei.ac.jp Abstract The authors are developping a high-level synthesizer called C Compatible Architecture Prototyper(CCAP). CCAP compiles ANSI C program which is a part of embedded software and generates an application specific hardware accelerator in HDL. Synthesized accelerator has an ability to read/write main memory and executes calculation faster than an embedded processor. CCAP offers an arbiter circuit which makes it possible for the synthesized accelerator and a processor to access main memory in parallel. In this paper we report the speed improvement of AES Encryption using design methodology of CCAP synthesizer. I. Introduction Recently handheld devices like mobile phones or personal digital assistant(pda)s need powerful computing power for capturing movie or internet browsing. Sensor network devices with signal-processing and communication capability is one of the future trends of ubiquitous computing. Requirements to these embedded systems like handheld devices or sensor network devices are increasingly highvolume and low-cost. A common solution for high-volume and low-cost embedded systems production is the sytem on a chip(soc), which includes an embedded processor, a memory and I/O peripherals. Usually clock rates of the embedded processors are one-tenth of processors for personal computer(pc)s. Because an energy source for the handheld devices or the sensor network devices is commonly a battery and power consumption of the embedded processor must be under several watts. Performance requirements to the embedded systems have become much higher, sometimes almost the same computation ability compared with PCs. accerelation is one of the techniques to improve perfomance of the embedded systems. Takayuki NAKATANI is currently with SHIMADZU Corporation and Naoto UMEHARA is currently with KAWASAKI MICROELEC- TRONICS, Inc. accelerators are designed for computationary intensive software code. For example, Gaussian or Sobel filter functions are used for signal processing. Block cipher encryption/decryption is used for secure communication on internet. These functions can be five-to-ten times faster when its calculation is accelerated by an application specific hardware. accerelation is very useful approach but design efficiency of the hardware accerelator has not been improved. Because design steps for the hardware accrelerator as follows are difficult and time-consuming. design hardware accelerator using Descrption Language develop device driver program for invoking the hardware accelerator from software verification of the hardware accelerator including the driver program The authors are developping a high-level synthesizer called C Compatible Architecture Prototyper (CCAP) [1][2]. CCAP compiles ANSI C program which is a part of embedded software and generates an application specific hardware accelerator in HDL. Synthesized accelerator executes faster than a cpu and accesses to main memory. CCAP offers an arbiter circuit which makes it possible for the synthesized accelerator and a cpu to access main memory in parallel. These features of CCAP synthesizer make it possible for a software designer to get much faster calculation result by the synthesized hardware accelerator, without hardware designer s assistance. In this paper we report the speed improvement of AES Encryption using CCAP synthesizer. AES encryption mainly consists of 5 stages : KeyExpansion,, SubBytes, ShiftRows and Mix- Columns. AESencryptionsoftwareiswritteninANSIC and above mentioned 5 stages are implemented in separate 5 functions. accelerators for 3 functions : Sub- Bytes, ShiftRows and MixColumns are designed by human designer from ANSI C functions.

SoC Functional Spec. in ANSI C System Simulation Function(s) in ANSI C Functional Spec. in ANSI C Functions Software Software CCAP High-Level Synthesizer PC FPGA SoC Software in ANSI C (s) in HDL CCAP Design Target Fig. 1. System Design Flow using CCAP Synthesizer Fig. 2. Design Target of CCAP Synthesizer These 3 hardware accelerators speed up the AES encryption almost 5.0 times faster, compared with software execution by an embedded processor which is Mips R3000 compatible. Circuit size of the 3 accelerators with the arbitor circuit is only 59% of the embedded processor. This design result shows that solution using CCAP hadware synthesizer is enough efficient compared with multi-processors solution. II. High-Level Synthesizer CCAP system design flow using high-level synthesizer CCAP is show in Fig. 1. In this flow, bottleneck functions of SoC Functional Spec. in ANSI C are found in the System Simulation step and CCAP High-Level Synthesizer generates (s) in HDL from the bottleneck function(s) called Function(s) in ANSI C. system designer describes all requirements to a SoC in ANSI C. ANSI C is a commonly used programming language for embedded software. The described ANSI C program can be compiled by ANSI C compiler and executed on PCs. Using embedded sytem simulator, execution speed of the software by an embedded software, or speed improvement by hardware accereletors synthesized by CCAP is estimated. The system designer finds out bottleneck function(s) and generates hardware accelerator(s) using CCAP. We call these ANSI C functions for hardware synthesis hardware functions. CCAP synthesizes hardware accelerators in HDL from the hardware functions in ANSI C. Driver software for each synthesized accelerator is generated by CCAP. The system designer compiles the orignal ANSI C program and the generated driver software by a cross compiler for an embedded processor. Fig. 2 shows our prototyping scheme that FPGA will be used for verification and performance estimation. The synthesized hardware accelerators and the embedded processor will be implemented within the FPGA logic. The compiled ANSI C program will be executed by the FPGA prototype embedded system and the system designer will find out that the embedded processor and the accelerators give same functions done by the software execution the embedded processor and the accelerators offer enough performance Following these design steps, the system designer co-designs hardware accelerators in HDL and embedded software in ANSI C. The system designer completes prototyping of the embedded system without help of hardware designers. To realize this hardware/software co-design methodology, CCAP high-level synthesizer has features as follows. CCAP handles ANSI C software as it is. The system designer does not have to change ANSI C hardware functions for synthesizing hardware accelerators. CCAP compiles pointer and external variables of ANSI C. This makes it possible for synthesized hardware accelerators and an embedded processor to share same memory address space. III. AES Encryption Argorithm Advanced Encryption Standard(AES) is a block cipher adopted as an encryption standard by the U.S. government. [3][4]. Fig. 3 shows procedure of the AES encryption. The AES has a fixed block size input(in) of 128bits(16bytes) and a key(key) size of 128, 192, 256bits. Rounds of loop change depending on the KEY size, 9 rounds for 128bits, 11 rounds for 192bits and 13 rounds for 256bits. In the KeyExpansion step, the round key for the step is generated from the original key through a process called Key Scheduling. The KeyExpansion step is usually done once because the same round key will be used for encryption of one input stream. In the SubBytes step, each input byte of 4!_4 array of bytes is replaced by another byte, from look-up tabel called S-Box. This 4!_4 array of bytes for the AES encryption is callled the State matrix. In the ShiftRows step, each byte of the State matrix is shifted cyclically certain number of steps. In the

KEY KeyExpansion IN loop (ShiftRows) (SubBytes) SubBytes Arbiter ShiftRows MixColumns Main Memory (MixColumn) Fig. 4. Configuration for AES Speed Improvement SubBytes Fig. 3. AES Encryption Procidure ShiftRows OUT MixColumns step, each column of the State matrix is multiplied with a fixed polynomial. In the step, each byte of the state is combined with a byte of the round key using the XOR operations. The following is speed improvement of the AES Encryption based on CCAP synthesizer design methodology. IV. Speed Improvement of AES Encryption Five different configurations as follow are used to estimate speed improvement of the AES encryption. All steps in the AES encryption are software(executed by an embedded ShiftRows step is executed by hardware accelerators SubBytes step is executed by hardware accelerators MixColumns step is executed by hardware accelerators ShiftRows SubBytes and MixColumns steps are executed by hardware accelerators and other steps are software(executed by an embedded For each configuration, we estimated circuit size and clock cycles taken for 128 bits input stream to be encrypted by 128bits length user key. The KeyExpansion step remains software for every configuration, assuming that the KeyExpansion step does not become bottleneck. Becuase the KeyExpansion step is done only once and other ShiftRows, SubBytes and Mix- Columns steps are repeated. RT-level simulation using ModelSim-XE Verilog Simulator is used to count clock cycles for encrypting 128bits length input stream by 128bits length user key. MIPS R3000 compatible processor in RTL level description is prepared for the embedded processor. This embedded processor does not include FPU, MMU and cache memory for instructions and data that original R3000 processor has. AES encryption software is compiled by GNU C Compiler for R3000. architecture for an embedded processor and hardware accelerators is shown in Fig. 4. The embedded processor, the arbiter circuit and the hardware acclerators for ShiftRows, SubBytes and Mix- Columns are synthesized for XILINX Vertex series FPGA by ISE Foundation tools. Circuit size of each configuration isshownin slices tobeusedforconfiguringvertexseries FPGA. Main memory in Fig. 4 is not included for estimating the circuit size. Arbiter in Fig. 4 is not a target of high-level synthesis by CCAP. We prepared the arbiter circuit in Verilog HDL which allows 3 hardware acclerators and an embedded processor to read/write the main memory in parallel. Table I shows circuit size and maximum delay time for each hardware module. The maximum delay time of the embedded R3000-compatible processor is longer than any other ones of hardware modules. This result means that adding hardware accleretors does not affect clock rates of the embedded processor. The sum of the circuit size of the arbitor, hardware accleretors for ShiftRows, hardware accleretors for MixColumn and hardware accleretors for ShiftRows is 59% of the size of the processor. In Table II clock cycles for the AES encryption and circuit size of each configuration are shown. acceleration of the SubBytes step is the most effective for speed improvement. The hardware accelerelator for SubBytes has 4 lookup tables for the S-Box and circuit size of the Sub- Bytes acclerelator becomes bigger than any other accelera-

tor This 4 lookup tables makes it possible to replace input byte to another byte in parallel in the SubBytes step. Taking increase in circuit size, hardware acceleration for MixColumns step is the most efficient. In the MixColumns step, each input byte of the 4!_4 State matrix is replaced by other byte. When the MixColumns step is software implementation, this replacement is done sequentially by the embedded software. In the hardware acclerelator for Mix- Columns, only hard-wired logic is needed to execute this replacement in parallel. When 3 steps of ShiftRows, SubBytes, MixColumns are implemented as hardware accelerators, execution speed of the AES encryption becomes almost 5.0 times faster than software execution. With this configuration, circuit size including the embedded processor and the arbitor is 1.58 times bigger than the embedded processor. System design approach with CCAP synthesizer is more efficient than multi-processor solution. [4] National Institute of Standards and Technology(NAIST), Advanced Encryption Standard(AES), FIPS Publication 197, 2001, http://csrc.nist.gov/encryption/aes/index.html V. Conclusion We have presented an embedded system design approach using high-level synthesizer CCAP. AES encryption embedded software is accelerated by converting bottleneck software functions into hardwares based on the proposed design approach. We confirms that CCAP high-level synthesizer approach is efficient to synthesize hardware from embedded software. Acknowledgement We would like to thank Prof. Yamazki at Ritsumeikan University for making AES encryption software open. We would also like to thank Mr. Sugihara of Kyoto University, Mr. Nishimura and the memebers of Ishiura Laboratory of Kwansei Gakuin University for their discussion and suggestions. This work is in part supported by KAKENHI 19700040. References [1] M.Nishimura, K.Nishiguchi, N.Ishiura, H.Kanbara, H.Tomiyama, Y.Takatsukasa, M.Kotani, High-Lebel Synthesis of Variable Accesses and Function Calls in Software Compatible Synthsizer CCAP, in Proc. Workshop on Synthesis And System Integration of Mixed Information Technologies(SASIMI) 2006, pp.29-34, 2006 [2] M.Nishimura, N.Ishiura, Y.Ishimori, H.Kanbara, H.Tomiyama, Calling Software Functions from Functions, in Proc. Workshop on Synthesis And System Integration of Mixed Information Technologies(SASIMI) 2007, 2007 [3] Daemen,J. and Rijmen, AES Proposal : Rijndael, http://csrc.nist.gov/encryption/aes/rijndael/rijndael.pdf

TABLE I Circuit Size and Maximum Delay of, Arbiter, and Acclerelators Module Circuit Size Maximum Delay (Slices) (ns) 2825 12.4 Arbiter 528 5.71 Acclerelator (ShiftRows) 210 1.95 Acclerelator (SubBytes) 470 1.97 Acclerelator (MixColumn) 367 2.01 TABLE II Speed Improvement of AES Encryption and Increase in Circuit Size Configuration Execution Speed Circuit Size Clock Cycles Improvement Ratio Slices Increase Ratio (AES Software) 26794 1.00 2825 1.00 and Arbitor - - 3353 1.18, Arbitor and Accleretor(ShiftRows) 24964 1.07 3563 1.26, Arbitor and Accleretor(SubBytes) 21964 1.21 3823 1.35, Arbitor and Accleretor(MixColumns) 12047 2.22 3720 1.32, Arbitor Accleretor(ShiftRows) Accleretor(SubBytes) 5387 4.97 4400 1.58 and Accleretor(MixColumns)