Hardware Implementations of RSA Using Fast Montgomery Multiplications. ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner





Overview
- Introduction
- Functional Specifications
- Implemented Design and Optimizations
- Tools
- Testing
- Results
- Conclusions

Introduction: RSA Encryption / Decryption
- Used worldwide to secure data transmission
- Public / private key based
- Large (512-bit +) keys required for protection of data
- Large keys = slower decryption times
[Figure: Alice encrypts message M with the public key (RSA Encryption) and sends the CypherData to Bob, who recovers M with the private key (RSA Decryption); a hacker on the channel sees only X.]

Introduction Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf pp 2.

Functional Specification: RSA Encryption / Decryption Algorithm
To calculate Y = X^E mod N:
    S = X
    Y = 1
    for (i = 0 to k-1) {
        if (E_i = 1) Y = Y * S mod N
        S = S * S mod N
    }
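The loop on this slide is right-to-left binary (square-and-multiply) exponentiation. A minimal Python sketch, assuming E_i denotes bit i of the exponent and k is its bit length; the function name `modexp_rtl` is illustrative and not from the slides:

```python
def modexp_rtl(x, e, n):
    """Right-to-left binary exponentiation: returns x**e mod n."""
    s = x % n                          # S holds x^(2^i) mod n
    y = 1 % n                          # Y accumulates the result
    for i in range(e.bit_length()):    # k = bit length of the exponent
        if (e >> i) & 1:               # E_i = 1
            y = (y * s) % n
        s = (s * s) % n
    return y
```

The multiply (Y) and the square (S) in each round do not depend on each other, which is why a hardware design can run them side by side.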

Functional Specification Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf pp 21.

Functional Specification Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf pp 22.

Functional Specification: Montgomery Multiplication
MP (A, B, N) Algorithm:
    S[0] = 0;
    for i in 0 to k-1 loop
        q_i = (S[i]_0 + A_i * B_0) mod 2;
        S[i+1] = (S[i] + A_i * B + q_i * N) / 2;
    end loop;
    return S[k];
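A minimal Python sketch of the radix-2 Montgomery product above, assuming N is odd, A < 2^k, and that the subscript 0 denotes the least significant bit. The name `mont_mul` is illustrative. As on the slide, there is no final subtraction, so the result is only congruent to A * B * 2^(-k) mod N and may lie above N:

```python
def mont_mul(a, b, n, k):
    """Radix-2 Montgomery product: returns S with S * 2**k == a * b (mod n).

    Assumes n is odd and a < 2**k. The result is below b + n,
    not necessarily below n.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = (s + a_i * b) & 1            # q_i = (S[i]_0 + A_i * B_0) mod 2
        s = (s + a_i * b + q * n) >> 1   # the sum is even, so the shift is exact
    return s
```

Choosing q_i to cancel the low bit is what makes the division by 2 exact in every round, so no trial division by N is ever needed.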

Functional Specification: 5-to-2 Montgomery Multiplication
MP52 (A1, A2, B1, B2, N) Algorithm:
    S1[0] = 0; S2[0] = 0;
    for i in 0 to k-1 loop
        q_i = ((S1[i]_0 + S2[i]_0) + (A_i * (B1_0 + B2_0))) mod 2;
        S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + A_i * (B1 + B2) + q_i * N) / 2;
    end loop;
    return S1[k], S2[k];
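A hedged Python sketch of the 5-to-2 variant, modelling each carry-save adder with bitwise operations on unbounded integers. Two assumptions are not stated on the slide: CSR is read as "carry-save representation" (a sum word plus a carry word), and A_i is taken as bit i of A1 + A2, resolved before the loop. The helper names `csa` and `mont_mul_5to2` are illustrative:

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mont_mul_5to2(a1, a2, b1, b2, n, k):
    """Returns (S1, S2) with (S1 + S2) * 2**k == (a1 + a2) * (b1 + b2) (mod n).

    Assumes n is odd and a1 + a2 < 2**k.
    """
    a = a1 + a2                        # assumption: A_i is bit i of A1 + A2
    b0 = (b1 ^ b2) & 1                 # (B1_0 + B2_0) mod 2
    s1 = s2 = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = ((s1 ^ s2) & 1) ^ (a_i & b0)
        # Five operands are reduced to two by three chained CSAs.
        s, c = csa(s1, s2, a_i * b1)
        s, c = csa(s, c, a_i * b2)
        s, c = csa(s, c, q * n)
        s1, s2 = s >> 1, c >> 1        # both words are even at this point
    return s1, s2
```

No carry ever propagates across the full word inside the loop; the only full-width addition is the final S1 + S2, done once after the last multiplication.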

Functional Specification: RSA Encryption / Decryption with 5-to-2 MP
RSA (C, d, N)
    K = 2^(2k) mod N;
    P1, P2 = 5to2_MontMult( K, 0, C, 0, N );
    R1, R2 = 5to2_MontMult( K, 0, 1, 0, N );
    for i in 0 to d k loop
        if d[i] = 1
            R1, R2 = 5to2_MontMult( R1, R2, P1, P2, N );
        P1, P2 = 5to2_MontMult( P1, P2, P1, P2, N );
    end loop;
    M1, M2 = 5to2_MontMult( 1, 0, R1, R2, N );
    return M1 + M2;
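The flow on this slide (convert into the Montgomery domain with K = 2^(2k) mod N, exponentiate, convert back by multiplying with 1) can be sketched in Python. This is a simplification under stated assumptions, not the slide's circuit: it uses a single-word Montgomery product with a conditional final subtraction instead of the carry-save pairs, and it takes the loop bound as the bit length of d because the bound on the slide is garbled. The names `mont_mul_reduced` and `rsa_mont` are illustrative:

```python
def mont_mul_reduced(a, b, n, k):
    """Montgomery product a * b * 2**(-k) mod n, fully reduced.

    Assumes n is odd, n < 2**k, and a, b < n.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = (s + a_i * b) & 1
        s = (s + a_i * b + q * n) >> 1
    return s - n if s >= n else s      # simplification: keep results below n

def rsa_mont(c, d, n, k):
    """Computes c**d mod n using only Montgomery products."""
    K = pow(2, 2 * k, n)                  # K = 2^(2k) mod N
    p = mont_mul_reduced(K, c, n, k)      # C in the Montgomery domain
    r = mont_mul_reduced(K, 1, n, k)      # 1 in the Montgomery domain
    for i in range(d.bit_length()):       # assumed bound: bit length of d
        if (d >> i) & 1:
            r = mont_mul_reduced(r, p, n, k)
        p = mont_mul_reduced(p, p, n, k)
    return mont_mul_reduced(1, r, n, k)   # leave the Montgomery domain
```

For example, with the textbook modulus N = 3233 (k = 12 bits), `rsa_mont(c, 2753, 3233, 12)` agrees with Python's built-in `pow(c, 2753, 3233)` for any c below N.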

Functional Specification: Addition Chains
- Sequence of additions to produce a large number
- Each step in the sequence is the sum of two numbers previously in the chain
  e.g. 27 = 1, 2, 3, 6, 12, 24, 27
- Expanded to a sequence of multiplications
  X^27 = X^1, X^2, X^3, X^6, X^12, X^24, X^27
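A short Python sketch of how an addition chain drives exponentiation: each chain element is the sum of two earlier elements, so each new power is the product of two powers already computed. The function name `pow_by_chain` is illustrative:

```python
def pow_by_chain(x, chain, n):
    """Evaluate x**chain[-1] mod n with one multiplication per chain step.

    chain must start at 1, and every later element must be the sum of
    two elements that appear earlier in the chain.
    """
    powers = {1: x % n}
    for target in chain[1:]:
        # Find an earlier element whose complement is also in the chain.
        a = next(a for a in powers if (target - a) in powers)
        powers[target] = (powers[a] * powers[target - a]) % n
    return powers[chain[-1]]
```

With the chain 1, 2, 3, 6, 12, 24, 27 this computes X^27 in six multiplications, whereas square-and-multiply on 27 = 11011 in binary performs a square and a possible multiply for each of the five exponent bits.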

Functional Specification: Addition Chains
- Use memory (registers) to store intermediate results
- Use memory to store and serve addition chain commands to the multiplier circuit
Command structure:
    Field:         C    Destination   Operand2   Operand1
    Width (bits):  2    Log2 R        Log2 R     Log2 R
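A hedged Python sketch of a command word with these four fields and of a register file that executes such commands. Several points are assumptions: R is taken to be the number of registers, the fields are packed most significant first in the order C, Destination, Operand2, Operand1, and the 2-bit C field is carried through without interpretation because the slide does not say what it encodes. All names are illustrative:

```python
def pack(c, dest, op2, op1, r_bits):
    """Pack fields into one word laid out as C | Destination | Operand2 | Operand1."""
    word = c & 0b11
    for field in (dest, op2, op1):
        word = (word << r_bits) | field
    return word

def unpack(word, r_bits):
    """Inverse of pack: returns (c, dest, op2, op1)."""
    mask = (1 << r_bits) - 1
    op1 = word & mask
    op2 = (word >> r_bits) & mask
    dest = (word >> (2 * r_bits)) & mask
    c = (word >> (3 * r_bits)) & 0b11
    return c, dest, op2, op1

def run_chain(commands, x, n, r_bits):
    """Execute commands on R = 2**r_bits registers; register 0 starts as x."""
    regs = [0] * (1 << r_bits)
    regs[0] = x % n
    for word in commands:
        _c, dest, op2, op1 = unpack(word, r_bits)   # C is ignored in this sketch
        regs[dest] = (regs[op1] * regs[op2]) % n
    return regs

# Program for X^27 via the chain 1, 2, 3, 6, 12, 24, 27 with R = 4 registers.
R_BITS = 2
program = [
    pack(0, 1, 0, 0, R_BITS),   # r1 = r0 * r0  -> X^2
    pack(0, 1, 0, 1, R_BITS),   # r1 = r1 * r0  -> X^3
    pack(0, 2, 1, 1, R_BITS),   # r2 = r1 * r1  -> X^6
    pack(0, 2, 2, 2, R_BITS),   # r2 = r2 * r2  -> X^12
    pack(0, 2, 2, 2, R_BITS),   # r2 = r2 * r2  -> X^24
    pack(0, 2, 1, 2, R_BITS),   # r2 = r2 * r1  -> X^27
]
```

After `run_chain(program, x, n, R_BITS)`, register 2 holds X^27 mod n. Only three registers are live at any time, which illustrates why the register count R, and with it the command width of 2 + 3 * Log2 R bits, can stay small.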

Functional Specification: RSA Shell Unit
Block: RSA with Montgomery Multiplication
Inputs: clock, reset, data_avail, data_in, key_in, start, full
Outputs: data_read, write, data_out

Functional Specification: RSA Chain Shell Unit
Block: Addition Chain RSA with Montgomery Multiplication
Inputs: clock, reset, command_avail, command_in, data_avail, data_in, start, full
Outputs: command_read, data_read, ready, write, data_out

Implemented Design
Criteria:
- Maximize throughput
- Minimize clock period
- Minimize area (on selected chips)
RSA considerations:
- Encryption is trivial
- Decryption is the bottleneck of the RSA process
- High throughput allows more decryptions in a shorter amount of time

Implemented Design
Design architectures:
- Sequential multipliers: small area, incremental results, small latency per round
- Tree and array multipliers: large area, one result per clock cycle, pipeline-ready
Choice: Sequential Montgomery Multiplier
- Best fit to the algorithm, small footprint, high clock rates
- Algorithm is difficult to pipeline

Implemented Design RSA Diagrams

Optimizations: 5-to-2 Montgomery Multiplier
[Figure: Ss[i], Sc[i], Ai * Bs, Ai * Bc, and q * N are reduced through three CSA stages to Ss[i+1], Sc[i+1].]

Optimizations: 4-to-2 Montgomery Multiplier
Calculate D before running the multiplier: D1, D2 = CSA( Bs, Bc, N )
    Xi   q    Q1   Q2
    0    0    0    0
    0    1    Bs   Bc
    1    0    N    0
    1    1    D1   D2
[Figure: Ss[i], Sc[i], Q1, and Q2 are reduced through two CSA stages to Ss[i+1], Sc[i+1].]
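A hedged Python sketch of the 4-to-2 idea: D = B + N is formed once in carry-save form, so each round only has to add one selected pair (Q1, Q2) to the running sum and needs two CSA levels instead of three. The column order of the flattened selection table is hard to recover with confidence, so the sketch selects by meaning (the B pair when the multiplier bit is set, N when q is set, the D pair when both are set); that reading is an assumption, as are the names `csa` and `mont_mul_4to2`:

```python
def csa(x, y, z):
    """3-to-2 carry-save adder: returns (s, c) with s + c == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1

def mont_mul_4to2(a, bs, bc, n, k):
    """Returns (Ss, Sc) with (Ss + Sc) * 2**k == a * (bs + bc) (mod n).

    Assumes n is odd and a < 2**k.
    """
    d1, d2 = csa(bs, bc, n)            # D = B + N, computed once up front
    b0 = (bs ^ bc) & 1                 # low bit of B
    ss = sc = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q = ((ss ^ sc) & 1) ^ (a_i & b0)
        if a_i and q:
            q1, q2 = d1, d2            # add B + N
        elif a_i:
            q1, q2 = bs, bc            # add B only
        elif q:
            q1, q2 = n, 0              # add N only
        else:
            q1, q2 = 0, 0
        s, c = csa(ss, sc, q1)         # four operands, two CSA levels
        s, c = csa(s, c, q2)
        ss, sc = s >> 1, c >> 1
    return ss, sc
```

The trade is one extra CSA evaluation before the loop and a 4-way multiplexer per round in exchange for one fewer adder level per round.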

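Optimizations: 2x 5-to-2 Montgomery Multiplier
[Figure: two chained MP stages processing two multiplier bits per pass. The first stage takes Ss[i], Sc[i], A0*Bs, A0*Bc, and q0*N; the second takes A1*Bs, A1*Bc, and q1*N and produces Ss[i+2], Sc[i+2], with Ss[i+1], Sc[i+1] as the intermediate pair. A0 and A1 are produced by full adders (FA, with cin and cout) fed by S0, C0, S1, C1 from a register that shifts by two.]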

Tools
FPGA:
- ActiveHDL 7.1 Build 1583 Expert Edition
- Xilinx ISE 7.1i
- Synplicity Synthesis Pro 8.0
ASIC:
- Synopsys Design Analyzer (version X-2005.09)
The above tools were used in the ECE 203 lab, remotely on CPE02, and on personal laptops.

Testing
Test vector generation:
- Official vectors from RSA Security
- Personally developed vectors
Software RSA implementation:
- Limited program for performing RSA encryption / decryption
- Regular exponentiation
- Montgomery modular exponentiation

Testing: Addition Chains
- Attempted a personal addition chain generation tool
- Few available sources for generating chains, especially for bit lengths > 12 bits (4096)
- Only able to perform tests with simple exponentiation and non-optimal chains

Results
Target FPGA: Xilinx Virtex 4VLX160FF1513, Speed Grade 12
- Device is oversized for all designs
- Chosen to eliminate area from consideration
- Main optimization targets are speed and throughput
Target ASIC: 90nm TCBN90G TSMC library

Results: FPGA, K = 128 bits
Architecture      Area (CLB Slices)   Area (Gate Count)   Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         1,790               33,173              4.505               221.976
MP 4-to-2         2,102               40,921              5.238               190.913
2x MP 5-to-2      2,950               45,928              5.665               176.523
Addition Chain    4,323               317,106             10.831              92.328

Results: ASIC, K = 128 bits
Architecture      Area            Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         69314.063       2.26                442.478
MP 4-to-2         81535.914       2.26                442.478
2x MP 5-to-2      89671.680       2.26                442.478
Addition Chain    222,200.297     3.17                315.457

Results: FPGA, K = 256 bits
Architecture      Area (CLB Slices)   Area (Gate Count)   Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         3,341               63,078              5.212               191.865
MP 4-to-2         3,863               77,981              5.592               178.827
2x MP 5-to-2      5,153               82,967              6.134               163.026
Addition Chain    4,091               209,099             10.119              98.824

Results: ASIC, K = 256 bits
Architecture      Area            Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         134022.063      2.26                442.478
MP 4-to-2         158533.406      2.26                442.478
2x MP 5-to-2      174450.938      2.26                442.478
Addition Chain    NA              NA                  NA

Results: FPGA, K = 512 bits
Architecture      Area (CLB Slices)   Area (Gate Count)   Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         6,123               124,171             8.613               116.104
MP 4-to-2         7,308               153,800             9.387               106.530
2x MP 5-to-2      9,426               162,503             11.137              89.791
Addition Chain    7,665               413,271             11.315              88.378

Results: FPGA, K = 1024 bits
Architecture      Area (CLB Slices)   Area (Gate Count)   Clock Period (ns)   Clock Frequency (MHz)
MP 5-to-2         12,105              245,423             10.177              98.261
MP 4-to-2         14,524              304,439             11.426              87.520
2x MP 5-to-2      18,964              324,667             14.495              68.990
Addition Chain    16,649              842,231             27.091              36.913

Results: RSA with MP waveforms
[Figure: waveforms for 2x MP5-2, MP4-2, and MP5-2.]

Results RSA with Addition Chains waveform:

Results: Latency and throughput
Circuit            Bit Length   Latency (ns)   Throughput (kb / s)
MP5-2              128          75.55          1460
MP5-2              256          349.59         690
MP5-2              512          2271.09        210
MP5-2              1024         10702.64       85
MP4-2              128          87.84          1690
MP4-2              256          370.78         740
MP4-2              512          2475.18        230
MP4-2              1024         12016.15       96
2x MP5-2           128          47.87          2670
2x MP5-2           256          204.15         1250
2x MP5-2           512          1471.18        350
2x MP5-2           1024         7629.27        130
Addition Chains    128          176.07         730
Addition Chains    256          670.95         380
Addition Chains    512          2983.56        170
Addition Chains    1024         28490.25       36
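A small worked check of how latency and throughput relate, assuming throughput is the block size in bits divided by the latency of one exponentiation. One further assumption is needed: the latency figures are treated as microseconds, because that is the unit under which the 2x MP5-2 and Addition Chains rows reproduce the listed kb/s values, although the column is labelled ns. The function name `throughput_kbps` is illustrative:

```python
def throughput_kbps(bits, latency_us):
    """Block size divided by block latency, returned in kb/s.

    latency_us is the latency of one block in microseconds (assumed unit).
    """
    bits_per_second = bits / (latency_us * 1e-6)
    return bits_per_second / 1e3
```

For the 128-bit 2x MP5-2 row, `throughput_kbps(128, 47.87)` gives about 2674 kb/s, which matches the listed 2670 after rounding; the 1024-bit Addition Chains row gives about 36 kb/s.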

Conclusions: Recommendations
- MP 4-to-2 design: 18% area increase, 11% speed increase
- 2x MP 5-to-2 design: 60% area increase, 30% speed increase
- Addition Chains design: only a benefit when the chain performs fewer multiplies than both the square and addition portions of Montgomery Multiplication

Conclusions
Considerations:
- Algorithmic improvements when performing 2x MP?
- 4x MP vs. 2x MP performance?
Difficulties:
- RSA test vector generation
- Addition chain command and vector generation
- System resources on CPE01 and CPE02 for FPGA and ASIC synthesis

Questions