Hardware Implementations of RSA Using Fast Montgomery Multiplications ECE 645 Prof. Gaj Mike Koontz and Ryon Sumner
Overview
- Introduction
- Functional Specifications
- Implemented Design and Optimizations
- Tools
- Testing
- Results
- Conclusions
Introduction
RSA Encryption / Decryption
- Used worldwide to secure data transmission
- Public / private key based
- Large keys (512 bits or more) are required to protect data
- Larger keys mean slower decryption
[Figure: Alice encrypts message M with the public key (RSA Encryption); the cipher data crosses the channel, where a hacker cannot recover M; Bob decrypts it with the private key (RSA Decryption) to recover M.]
Introduction Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof. Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf p. 2.
Functional Specification
RSA Encryption / Decryption Algorithm
To calculate Y = X^E mod N:
    S = X
    Y = 1
    for (i = 0 to k-1) {
        if (E_i = 1) Y = Y * S mod N
        S = S * S mod N
    }
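The right-to-left loop above can be sketched in software as follows. This is a minimal Python model rather than the hardware design itself; the function name `modexp_right_to_left` is made up for illustration, and Python's built-in `pow` serves only as a cross-check.

```python
def modexp_right_to_left(x: int, e: int, n: int) -> int:
    """Compute x**e mod n by scanning the bits of e from least to most significant."""
    s = x % n   # S holds x^(2^i) mod n at iteration i
    y = 1 % n   # Y accumulates the result
    while e > 0:
        if e & 1:            # E_i = 1: multiply the current power into Y
            y = (y * s) % n
        s = (s * s) % n      # square for the next bit
        e >>= 1
    return y


# Cross-check against the built-in modular exponentiation.
assert modexp_right_to_left(7, 27, 143) == pow(7, 27, 143)
```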
Functional Specification Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof. Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf p. 21.
Functional Specification Picture taken from Lecture 11 Exponentiation, Multi-Precision Arithmetic in Software. George Mason University. Prof. Gaj. http://teal.gmu.edu/courses/ece645/viewgraphs_s06/lecture11_exp_sw_2.pdf p. 22.
Functional Specification
Montgomery Multiplication
MP(A, B, N) algorithm:
    S[0] = 0;
    for i in 0 to k-1 loop
        q_i = (S[i]_0 + A_i * B_0) mod 2;
        S[i+1] = (S[i] + A_i * B + q_i * N) / 2;
    end loop;
    return S[k];
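A bit-serial software model of the MP loop might look like the sketch below. It assumes N is odd, N < 2^k and A, B < N. The final conditional subtraction is an addition not shown on the slide, and the name `mont_mult` is illustrative only.

```python
def mont_mult(a: int, b: int, n: int, k: int) -> int:
    """Bit-serial Montgomery product: returns a * b * 2^(-k) mod n.

    Assumes n is odd, n < 2^k, and 0 <= a, b < n.
    """
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1            # same parity as S[i]_0 + A_i * B_0
        s = (s + a_i * b + q_i * n) >> 1   # the sum is even, so the shift is exact
    # Assumption: a final conditional subtraction (not on the slide)
    # brings the result into the range [0, n).
    return s - n if s >= n else s


# The result times 2^k must be congruent to a * b modulo n.
n, k, a, b = 239, 8, 100, 200
r = mont_mult(a, b, n, k)
assert (r << k) % n == (a * b) % n
```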
Functional Specification
5-to-2 Montgomery Multiplication
MP 5-to-2 (A1, A2, B1, B2, N) algorithm:
    S1[0] = 0; S2[0] = 0;
    for i in 0 to k-1 loop
        q_i = ((S1[i]_0 + S2[i]_0) + A_i * (B1_0 + B2_0)) mod 2;
        S1[i+1], S2[i+1] = CSR(S1[i] + S2[i] + A_i * (B1 + B2) + q_i * N) / 2;
    end loop;
    return S1[k], S2[k];
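The carry-save update can be modeled in software as a sketch like the one below. It assumes that the five operands are reduced by three 3:2 carry-save adders, that the multiplier A is available as an ordinary integer (the slide passes it as the pair A1, A2), and that N is odd. The names `csa` and `mont_mult_5to2` are illustrative.

```python
def csa(x: int, y: int, z: int):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def mont_mult_5to2(a: int, b1: int, b2: int, n: int, k: int):
    """Montgomery product with B and the running sum S kept as carry-save pairs.

    Returns (s1, s2) with (s1 + s2) * 2^k congruent to a * (b1 + b2) mod n.
    The pair is not reduced into [0, n).
    """
    s1 = s2 = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = ((s1 & 1) + (s2 & 1) + a_i * ((b1 & 1) + (b2 & 1))) & 1
        t1, t2 = csa(s1, s2, a_i * b1)
        t3, t4 = csa(t1, t2, a_i * b2)
        s1, s2 = csa(t3, t4, q_i * n)
        # The total is even and the carry word always has a zero LSB,
        # so both words can be shifted right without losing a bit.
        s1 >>= 1
        s2 >>= 1
    return s1, s2


n, k, a = 239, 8, 100
s1, s2 = mont_mult_5to2(a, 150, 50, n, k)
assert ((s1 + s2) << k) % n == (a * 200) % n
```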
Functional Specification
RSA Encryption / Decryption with 5-to-2 MP
RSA(C, d, N)
    K = 2^(2k) mod N;
    P1, P2 = 5to2_MontMult(K, 0, C, 0, N);
    R1, R2 = 5to2_MontMult(K, 0, 1, 0, N);
    for i in 0 to d_k loop
        if d[i] = 1
            R1, R2 = 5to2_MontMult(R1, R2, P1, P2, N);
        P1, P2 = 5to2_MontMult(P1, P2, P1, P2, N);
    end loop;
    M1, M2 = 5to2_MontMult(1, 0, R1, R2, N);
    return M1 + M2;
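The overall flow (convert to Montgomery form, square and multiply, convert back) can be sketched with plain integers instead of carry-save pairs. This is a simplified model under stated assumptions: N is odd, k is taken as the bit length of N, the loop runs over the bits of d, and `mont_mult` includes a final conditional subtraction. The key values are a small textbook example, not one of the test vectors used in the project.

```python
def mont_mult(a: int, b: int, n: int, k: int) -> int:
    """Radix-2 Montgomery product a * b * 2^(-k) mod n (n odd, a, b < n < 2^k)."""
    s = 0
    for i in range(k):
        a_i = (a >> i) & 1
        q_i = (s + a_i * b) & 1
        s = (s + a_i * b + q_i * n) >> 1
    return s - n if s >= n else s  # assumption: final reduction into [0, n)


def rsa_mont(c: int, d: int, n: int) -> int:
    """Compute c^d mod n using Montgomery products only (n odd, 0 <= c < n)."""
    k = n.bit_length()
    big_k = pow(2, 2 * k, n)          # K = 2^(2k) mod N
    p = mont_mult(big_k, c, n, k)     # C in Montgomery form: C * 2^k mod N
    r = mont_mult(big_k, 1, n, k)     # 1 in Montgomery form: 2^k mod N
    for i in range(d.bit_length()):   # scan d from the least significant bit
        if (d >> i) & 1:
            r = mont_mult(r, p, n, k)
        p = mont_mult(p, p, n, k)
    return mont_mult(1, r, n, k)      # leave Montgomery form


# Small illustrative key: n = 61 * 53, e = 17, d = 2753.
n, e, d = 3233, 17, 2753
message = 65
cipher = pow(message, e, n)
assert rsa_mont(cipher, d, n) == message
```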
Functional Specification
Addition Chains
- A sequence of additions that produces a large number
- Each step in the sequence is the sum of two numbers already in the chain
- e.g. 27: 1, 2, 3, 6, 12, 24, 27
- Extended to a sequence of multiplications, X^27: X^1, X^2, X^3, X^6, X^12, X^24, X^27
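As an illustration, the chain for 27 can drive a modular exponentiation with six multiplications. The sketch below is a software model only; the helper name `power_by_chain` and the modulus are chosen for the example.

```python
def power_by_chain(x: int, chain: list, n: int) -> int:
    """Compute x^chain[-1] mod n, using one multiplication per chain step.

    Every element after the first must be the sum of two earlier elements.
    """
    if not chain or chain[0] != 1:
        raise ValueError("an addition chain must start with 1")
    powers = {1: x % n}  # maps exponent -> x^exponent mod n
    for step in chain[1:]:
        # Find two exponents already in the chain that add up to this step.
        pair = next(((a, step - a) for a in powers if step - a in powers), None)
        if pair is None:
            raise ValueError(f"{step} is not a sum of two earlier chain elements")
        powers[step] = (powers[pair[0]] * powers[pair[1]]) % n
    return powers[chain[-1]]


chain_27 = [1, 2, 3, 6, 12, 24, 27]
assert power_by_chain(5, chain_27, 1000003) == pow(5, 27, 1000003)
```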
Functional Specification
Addition Chains
- Use memory (registers) to store intermediate results
- Use memory to store and serve addition chain commands to the multiplier circuit
- Command structure (field labels from the slide): C, Destination, Operand2, Operand1, with widths marked in terms of log2 R
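A register-file view of this scheme can be sketched as below. The command format here is a simplification: each command is a Python tuple (destination, operand1, operand2) of register indices rather than a packed bit field, and the register count and function name are hypothetical.

```python
def run_chain_commands(x: int, commands: list, n: int, num_registers: int = 4) -> list:
    """Execute multiply commands against a small register file.

    Register 0 is preloaded with x mod n. Each command (dest, op1, op2)
    stores regs[op1] * regs[op2] mod n into regs[dest].
    """
    regs = [0] * num_registers
    regs[0] = x % n
    for dest, op1, op2 in commands:
        regs[dest] = (regs[op1] * regs[op2]) % n
    return regs


# Commands for the chain 1, 2, 3, 6, 12, 24, 27 using three registers.
commands_27 = [
    (1, 0, 0),  # r1 = x^2
    (2, 1, 0),  # r2 = x^3
    (1, 2, 2),  # r1 = x^6
    (1, 1, 1),  # r1 = x^12
    (1, 1, 1),  # r1 = x^24
    (1, 1, 2),  # r1 = x^27
]
regs = run_chain_commands(5, commands_27, 1000003)
assert regs[1] == pow(5, 27, 1000003)
```

Reusing r1 for successive squarings keeps the register count small; only the operand that is needed again later (x^3 in r2) has to be preserved.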
Functional Specification
RSA Shell Unit (RSA with Montgomery Multiplication)
Inputs: clock, reset, data_avail, data_in, key_in, start, full
Outputs: data_read, write, data_out
Functional Specification
RSA Chain Shell Unit (Addition Chain RSA with Montgomery Multiplication)
Inputs: clock, reset, command_avail, command_in, data_avail, data_in, start, full
Outputs: command_read, data_read, ready, write, data_out
Implemented Design
Criteria:
- Maximize throughput
- Minimize clock period
- Minimize area (on selected chips)
RSA considerations:
- Encryption is comparatively cheap
- Decryption is the bottleneck of the RSA process
- High throughput allows more decryptions in a given amount of time
Implemented Design
Design architectures:
- Sequential multipliers: small area, incremental results, small latency per round
- Tree and array multipliers: large area, one result per clock cycle, pipeline-ready
Choice: sequential Montgomery multiplier
- Best fit to the algorithm, small footprint, high clock rates
- The algorithm is difficult to pipeline
Implemented Design RSA Diagrams
Optimizations
5-to-2 Montgomery Multiplier
[Diagram: three carry-save adders (CSA) reduce the five inputs Ss[i], Sc[i], Ai * Bs, Ai * Bc and q * N to the two outputs Ss[i+1], Sc[i+1].]
Optimizations
4-to-2 Montgomery Multiplier
Calculate D before running the multiplier: D1, D2 = CSA(Bs, Bc, N)

Xi  q   Q1  Q2
0   0   0   0
0   1   N   0
1   0   Bs  Bc
1   1   D1  D2

[Diagram: two carry-save adders (CSA) reduce Ss[i], Sc[i], Q1 and Q2 to Ss[i+1], Sc[i+1].]
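The precomputation idea can be modeled in software as follows. In this sketch the addend pair (Q1, Q2) is chosen so that Q1 + Q2 equals Xi * B + q * N, which is the selection implied by the Montgomery update S + Xi * B + q * N. The multiplier A is taken as a plain integer and N is assumed odd. The function names are illustrative.

```python
def csa(x: int, y: int, z: int):
    """3:2 carry-save adder: returns (sum, carry) with sum + carry == x + y + z."""
    return x ^ y ^ z, ((x & y) | (x & z) | (y & z)) << 1


def mont_mult_4to2(a: int, bs: int, bc: int, n: int, k: int):
    """Montgomery product using a precomputed D = B + N in carry-save form.

    Returns (ss, sc) with (ss + sc) * 2^k congruent to a * (bs + bc) mod n.
    """
    d1, d2 = csa(bs, bc, n)  # computed once, before the loop
    # Addend pair selected by (Xi, q); each pair sums to Xi * B + q * N.
    select = {(0, 0): (0, 0), (0, 1): (n, 0), (1, 0): (bs, bc), (1, 1): (d1, d2)}
    ss = sc = 0
    for i in range(k):
        x_i = (a >> i) & 1
        q = (ss + sc + x_i * (bs + bc)) & 1
        q1, q2 = select[(x_i, q)]
        t1, t2 = csa(ss, sc, q1)   # only two CSA levels per iteration
        ss, sc = csa(t1, t2, q2)
        ss >>= 1                   # total is even and the carry LSB is zero,
        sc >>= 1                   # so both shifts are exact
    return ss, sc


n, k, a = 239, 8, 100
ss, sc = mont_mult_4to2(a, 150, 50, n, k)
assert ((ss + sc) << k) % n == (a * 200) % n
```

Compared with reducing five operands, this trades one CSA level in the loop for one extra CSA evaluation and a 4-way selection per iteration.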
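Optimizations
2x 5-to-2 Montgomery Multiplier
[Diagram: two chained MP stages process two multiplier bits (A0, A1) per step. The first stage takes Ss[i], Sc[i], A0*Bs, A0*Bc and q0*N; the second takes A1*Bs, A1*Bc and q1*N and produces Ss[i+2], Sc[i+2]. Other labels on the slide: Shift x2 Register, S0, C0, S1, C1, and full adders (FA) with cin / cout.]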
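Behaviorally, the 2x arrangement performs two Montgomery steps per clock cycle, which halves the cycle count. The sketch below models that with plain integers instead of carry-save pairs and assumes k is even and N is odd. The names `mp_step` and `mont_mult_2x` are illustrative.

```python
def mp_step(s: int, a_bit: int, b: int, n: int) -> int:
    """One radix-2 Montgomery step: (S + A_i * B + q * N) / 2."""
    q = (s + a_bit * b) & 1
    return (s + a_bit * b + q * n) >> 1


def mont_mult_2x(a: int, b: int, n: int, k: int):
    """Process two bits of A per modeled clock cycle.

    Returns (s, cycles) with s * 2^k congruent to a * b mod n.
    The result is not reduced into [0, n).
    """
    if k % 2:
        raise ValueError("this sketch assumes an even bit length k")
    s = 0
    cycles = 0
    for i in range(0, k, 2):
        s = mp_step(s, (a >> i) & 1, b, n)        # first stage, bit A0
        s = mp_step(s, (a >> (i + 1)) & 1, b, n)  # second stage, bit A1
        cycles += 1
    return s, cycles


n, k, a, b = 239, 8, 100, 200
s, cycles = mont_mult_2x(a, b, n, k)
assert (s << k) % n == (a * b) % n
assert cycles == k // 2
```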
Tools
FPGA:
- Active-HDL 7.1 Build 1583 Expert Edition
- Xilinx ISE 7.1i
- Synplicity Synplify Pro 8.0
ASIC:
- Synopsys Design Analyzer (version X-2005.09)
The above tools were used in the ECE 203 lab, remotely on CPE02, and on personal laptops.
Testing
Test vector generation:
- Official vectors from RSA Security
- Personally developed vectors
Software RSA implementation:
- Limited program for performing RSA encryption/decryption
- Regular exponentiation
- Montgomery modular exponentiation
Testing
Addition chains:
- Attempted to write a personal addition chain generation tool
- Few sources are available for generating chains, especially for exponents longer than 12 bits (above 4096)
- Only able to perform tests with simple exponentiation and non-optimal chains
Results
Target FPGA: Xilinx Virtex 4VLX160FF1513, speed grade 12
- Device is oversized for all designs
- Chosen to eliminate area from consideration
- Main optimization targets are speed and throughput
Target ASIC: 90 nm TSMC library TCBN90G
Results: K = 128 bits

FPGA
Architecture      Area (CLB slices)   Area (gate count)   Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         1,790               33,173              4.505               221.976
MP 4-to-2         2,102               40,921              5.238               190.913
2x MP 5-to-2      2,950               45,928              5.665               176.523
Addition Chain    4,323               317,106             10.831              92.328

ASIC
Architecture      Area           Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         69,314.063     2.26                442.478
MP 4-to-2         81,535.914     2.26                442.478
2x MP 5-to-2      89,671.680     2.26                442.478
Addition Chain    222,200.297    3.17                315.457

Results: K = 256 bits

FPGA
Architecture      Area (CLB slices)   Area (gate count)   Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         3,341               63,078              5.212               191.865
MP 4-to-2         3,863               77,981              5.592               178.827
2x MP 5-to-2      5,153               82,967              6.134               163.026
Addition Chain    4,091               209,099             10.119              98.824

ASIC
Architecture      Area           Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         134,022.063    2.26                442.478
MP 4-to-2         158,533.406    2.26                442.478
2x MP 5-to-2      174,450.938    2.26                442.478
Addition Chain    NA             NA                  NA

Results: K = 512 bits

FPGA
Architecture      Area (CLB slices)   Area (gate count)   Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         6,123               124,171             8.613               116.104
MP 4-to-2         7,308               153,800             9.387               106.530
2x MP 5-to-2      9,426               162,503             11.137              89.791
Addition Chain    7,665               413,271             11.315              88.378

Results: K = 1024 bits

FPGA
Architecture      Area (CLB slices)   Area (gate count)   Clock period (ns)   Clock frequency (MHz)
MP 5-to-2         12,105              245,423             10.177              98.261
MP 4-to-2         14,524              304,439             11.426              87.520
2x MP 5-to-2      18,964              324,667             14.495              68.990
Addition Chain    16,649              842,231             27.091              36.913
Results RSA with MP waveforms: 2x MP5-2 MP4-2 MP5-2
Results RSA with Addition Chains waveform:
Results: Latency and throughput

Circuit           Bit length   Latency (µs)   Throughput (kb/s)
MP5-2             128          75.55          1460
MP5-2             256          349.59         690
MP5-2             512          2271.09        210
MP5-2             1024         10702.64       85
MP4-2             128          87.84          1690
MP4-2             256          370.78         740
MP4-2             512          2475.18        230
MP4-2             1024         12016.15       96
2x MP5-2          128          47.87          2670
2x MP5-2          256          204.15         1250
2x MP5-2          512          1471.18        350
2x MP5-2          1024         7629.27        130
Addition Chains   128          176.07         730
Addition Chains   256          670.95         380
Addition Chains   512          2983.56        170
Addition Chains   1024         28490.25       36
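The derived columns can be cross-checked with a short calculation. The sketch below assumes that throughput is the bit length divided by the latency, with the latency figures read as microseconds, and that clock frequency is the reciprocal of the clock period. It uses the 2x MP5-2 row at 128 bits and the 4.505 ns clock period reported for MP 5-to-2 at K = 128. The helper names are illustrative.

```python
def throughput_kbps(bits: int, latency_us: float) -> float:
    """Throughput in kb/s for one block of `bits` processed in `latency_us` microseconds."""
    return bits / (latency_us * 1e-6) / 1e3


def clock_frequency_mhz(period_ns: float) -> float:
    """Clock frequency in MHz from a clock period in nanoseconds."""
    return 1e3 / period_ns


# 2x MP5-2 at 128 bits: about 2674 kb/s, listed as 2670 in the table.
print(round(throughput_kbps(128, 47.87)))
# MP 5-to-2 at K = 128: about 221.976 MHz for a 4.505 ns period.
print(round(clock_frequency_mhz(4.505), 3))
```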
Conclusions
Recommendations (relative to the MP 5-to-2 design):
- MP 4-to-2 design: 18% area increase, 11% speed increase
- 2x MP 5-to-2 design: 60% area increase, 30% speed increase
- Addition chains design: only beneficial when the chain performs fewer multiplications than the square and multiply portions of the Montgomery exponentiation combined
Conclusions
Considerations:
- Are there algorithmic improvements when performing 2x MP?
- How does 4x MP compare with 2x MP in performance?
Difficulties:
- RSA test vector generation
- Addition chain command and vector generation
- System resources on CPE01 and CPE02 for FPGA and ASIC synthesis
Questions