FPGA and ASIC Implementation of Rho and P-1 Methods of Factoring Master's Thesis Presentation Ramakrishna Bachimanchi Director: Dr. Kris Gaj
Contents Introduction Background Hardware Architecture FPGA and ASIC Design Flow Results Conclusions
RSA In 1977 Ron Rivest, Adi Shamir & Leonard Adleman developed the first public-key cryptosystem, which they called RSA
RSA
Public key {e, N}; private key {d, P, Q}
Alice (encryption with {e, N}) -> Network -> Bob (decryption with {d, P, Q})
$N = P \cdot Q$, where P, Q are large prime factors
$e \cdot d \equiv 1 \pmod{(P-1)(Q-1)}$
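A minimal sketch of the key relations on this slide, using toy values. The primes 61 and 53, the exponent 17, and the message 65 are illustrative assumptions, not values from the presentation; real keys use primes of hundreds of digits.

```python
# Toy RSA key pair (assumed small values, for illustration only).
P, Q = 61, 53
N = P * Q                     # public modulus, N = P * Q
phi = (P - 1) * (Q - 1)       # (P-1)(Q-1)
e = 17                        # public exponent, coprime to phi
d = pow(e, -1, phi)           # private exponent: e*d = 1 mod phi (Python 3.8+)

def encrypt(m):
    """Alice encrypts with the public key {e, N}."""
    return pow(m, e, N)

def decrypt(c):
    """Bob decrypts with the private exponent d."""
    return pow(c, d, N)

message = 65
cipher = encrypt(message)
recovered = decrypt(cipher)
```

Anyone who can factor N into P and Q can recompute d, which is why the rest of the presentation focuses on factoring.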
Common Applications of RSA
Secure WWW, SSL (Browser, Network, WebServer)
S/MIME, PGP (Alice, Bob)
Recommended key sizes for RSA
Size of the RSA key = size of $N = P \cdot Q$
Old standard: individual users, 512 bits (155 decimal digits)
New standard: short-term use (up to 2010), 1024 bits; long-term use, 2048 bits
Factoring RSA
RSA-200 (663 bits) factored by Bahr, Boehm, Franke and Kleinjung
When? Dec 2003 to May 2005
Effort? First stage: about 1 year on various machines, equivalent to 55 years on an Opteron 2.2 GHz CPU. Second stage: 3 months on a cluster of 80 2.2 GHz Opterons connected via a gigabit network
Number Field Sieve
Best algorithm to factor large numbers
Complexity: sub-exponential time and memory
N = number to factor, k = number of bits of N
Exponential function: $e^{k}$
Sub-exponential function: $e^{k^{1/3} (\ln k)^{2/3}}$
Polynomial function: $a \cdot k^{m}$
Steps of Number Field Sieve (NFS)
Polynomial Selection
Relation Collection: Sieving; Mini-factoring of 200-bit & 350-bit numbers (Pollard rho, p-1 method, ECM)
Linear Algebra
Square Root
Rho Algorithm
Pollard's Rho Method Birthday paradox: If 23 or more random people are in a room (or even if they aren't) there is a more than 50% probability that the birthdays of two of them fall on the same day of the year.
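The birthday bound can be checked numerically. The sketch below assumes 365 equally likely birthdays; the function name is hypothetical.

```python
def birthday_collision_probability(people, days=365):
    """Probability that at least two of `people` share a birthday,
    assuming `days` equally likely and independent birthdays."""
    p_all_distinct = 1.0
    for i in range(people):
        p_all_distinct *= (days - i) / days
    return 1.0 - p_all_distinct

p22 = birthday_collision_probability(22)
p23 = birthday_collision_probability(23)
```

The probability first exceeds 50% at 23 people. The same effect makes a pseudo-random sequence modulo a factor q repeat after roughly the square root of q steps, which is what the rho method exploits.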
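Pollard's rho method - Example
$N = 97 \cdot 1889 = 183\,233$, $x_{i+1} = x_i^2 + 1 \bmod N$
i:           0  1  2   3    4      5      6       7      8     9
x_i:         2  5  26  677  91864  15449  102236  39678  5749  69062
x_i mod 97:  2  5  26  95   5      26     95      5      26    95
Cycle mod q: $x_1 \equiv x_4 \equiv x_7 \pmod q$ (5), $x_2 \equiv x_5 \equiv x_8 \pmod q$ (26), $x_3 \equiv x_6 \equiv x_9 \pmod q$ (95); $x_0 = 2$ lies on the tail
$x_1 \equiv x_4 \pmod q \Rightarrow q \mid (x_1 - x_4)$; together with $q \mid N$ this gives $q \mid \gcd(x_1 - x_4, N)$
$q = \gcd(-91\,859,\ 183\,233) = 97$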
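The numbers in this example can be reproduced with a few lines of Python. This is only a check of the slide's arithmetic; the variable names are illustrative.

```python
from math import gcd

N = 97 * 1889                 # 183233, the number from the example

def f(x):
    """Iteration function x -> x^2 + 1 mod N used in the example."""
    return (x * x + 1) % N

# x_0 .. x_9, starting from x_0 = 2
xs = [2]
for _ in range(9):
    xs.append(f(xs[-1]))

# The same sequence reduced modulo the (normally unknown) factor 97
residues = [x % 97 for x in xs]

# x_1 and x_4 collide modulo 97, so their difference shares a factor with N
q = gcd(xs[1] - xs[4], N)
```

In practice the residues modulo q cannot be computed because q is unknown; the collision is detected only through the gcd.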
Pollard's Rho Method
[Figure: rho-shaped graph of the sequence $x_0, x_1, x_2, \ldots$ mod q. A tail leads from $x_0$ to $x_s$, followed by a cycle $x_s, x_{s+1}, \ldots, x_{e-1}$ that closes at $x_e \equiv x_s \pmod q$; period = e - s.]
$x_s \equiv x_e \pmod q$, $x_{s+1} \equiv x_{e+1} \pmod q$, ..., $x_{s+k} \equiv x_{e+k} \pmod q$
Rho Algorithm - Floyd's Version
Initialize $b \leftarrow c \leftarrow x_0$
1. choose the polynomial as $f(x) = x^2 + a$
2. calculate $b \leftarrow f(b) \bmod n$ and $c \leftarrow f(f(c)) \bmod n$
3. compute $d \leftarrow \gcd(b - c, n)$
4. if $1 < d < n$, a non-trivial factor of n is found
5. if $d = 1$ go to step 2; if $d = n$ change a and go to step 1
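The steps above can be sketched directly in Python. The defaults x0 = 2 and a = 1 and the step limit are assumptions chosen to match the earlier numerical example; on d = n this sketch returns None and leaves the retry with a different a to the caller.

```python
from math import gcd

def rho_floyd(n, x0=2, a=1, max_steps=100_000):
    """Pollard's rho with Floyd cycle detection, one gcd per step.

    Returns a non-trivial factor of n, or None if none was found
    (in which case the caller may retry with a different a).
    """
    b = c = x0
    for _ in range(max_steps):
        b = (b * b + a) % n           # b <- f(b)
        c = (c * c + a) % n           # c <- f(f(c))
        c = (c * c + a) % n
        d = gcd(b - c, n)
        if 1 < d < n:
            return d                  # non-trivial factor found
        if d == n:
            return None               # cycle hit for all factors at once
    return None

factor = rho_floyd(183233)
```

For n = 183233 the factor 97 appears at the third step, when $x_3 \equiv x_6 \pmod{97}$.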
Rho Method - Floyd's Version
[Table: triangular array of candidate differences $x_j - x_i$, one row per j (row j lists $x_j - x_{j+1}, x_j - x_{j+2}, \ldots, x_j - x_i$). In row k, Floyd's version uses only the difference $x_k - x_{2k}$.]
Pollard's Rho Algorithm - Floyd's Version
$f(x) = x^2 + a$ with $a \notin \{-2, 0\}$
# iterations t < 100 q_max (q_max is the maximum factor we expect to find using rho method)
We choose random $x_0$ in the range (0, n-1) and $x_1 = f(x_0)$
V2 (updated by f(f()))   V1 (updated by f())   d
x_0                                            d = 1
x_2                      x_1                   d = d*(x_2 - x_1)
x_4                      x_2                   d = d*(x_4 - x_2)
x_6                      x_3                   d = d*(x_6 - x_3)
...
x_t                      x_{t/2}               d = d*(x_t - x_{t/2})
x_{t+2}                  x_{(t+2)/2}           d = d*(x_{t+2} - x_{(t+2)/2})
...
x_{2i}                   x_i                   d = d*(x_{2i} - x_i)
x_{2(i+1)}               x_{i+1}               d = d*(x_{2i+2} - x_{i+1})
...
x_{2t}                   x_t                   d = d*(x_{2t} - x_t)
* $x_{2i+2} = f(f(x_{2i}))$, $x_{i+1} = f(x_i)$
q = gcd(d, n)
Minimization for area and/or memory
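Rho Algorithm - Floyd's Version Contd.
Inputs: $x_0$, $a$, $f(x) = x^2 + a$, $N$, $t$ (even, $t \ge 2$)
Outputs: q (such that $q \mid N$)
$v_1 \leftarrow x_1 = f(x_0)$, $v_2 \leftarrow x_2 = f(x_1)$, $temp \leftarrow v_2 - v_1 = x_2 - x_1$, $d \leftarrow 1$
for (i = 2; i <= t; i++) {
  $v_2 \leftarrow v_2^2$
  $v_2 \leftarrow v_2 + a$
  $v_2 \leftarrow v_2^2$
  $v_2 \leftarrow v_2 + a$
  $v_1 \leftarrow v_1^2$
  $v_1 \leftarrow v_1 + a$
  $temp \leftarrow v_2 - v_1$
  $d \leftarrow d \cdot temp$
}
$q \leftarrow \gcd(d, N)$
* all operations are done modulo N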
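A software sketch of the product-accumulating variant, in which the differences are multiplied together and a single gcd is taken at the end. For simplicity this version accumulates every difference $x_{2i} - x_i$ for i = 1..t, as in the V2/V1/d table; the function name and test values are assumptions.

```python
from math import gcd

def rho_floyd_batched(n, x0, a, t):
    """Floyd's rho with one final gcd: d = prod (x_2i - x_i) mod n."""
    v1 = v2 = x0
    d = 1
    for _ in range(t):
        v1 = (v1 * v1 + a) % n        # v1 = x_i
        v2 = (v2 * v2 + a) % n        # v2 = x_2i (two applications of f)
        v2 = (v2 * v2 + a) % n
        d = d * (v2 - v1) % n         # accumulate the difference
    return gcd(d, n)                  # 1, a proper factor, or n

q = rho_floyd_batched(183233, x0=2, a=1, t=4)
```

Replacing many gcd computations with modular multiplications is what lets the hardware unit get by with only a multiplier and an adder/subtractor; the trade-off is that the result may be N itself if the product picks up every prime factor.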
Rho Method - Brent's Version
[Table: the same triangular array of candidate differences $x_j - x_i$ as for Floyd's version. Brent's version uses, for each k, only the differences $x_{2^k} - x_j$ with $2^k + 2^{k-1} + 1 \le j \le 2^{k+1}$.]
Rho Method - Brent's Version: Sequence of Operations
v2     d                    v1
x_2    d = 1                x_2
x_3
x_4    d = d*(x_4 - x_2)    x_4
x_5
x_6
x_7    d*(x_7 - x_4)
x_8    d*(x_8 - x_4)        x_8
x_9
x_10
x_11
x_12
x_13   d*(x_13 - x_8)
x_14   d*(x_14 - x_8)
x_15   d*(x_15 - x_8)
x_16   d*(x_16 - x_8)       x_16
Minimization for execution time: 24%
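Rho Algorithm - Brent's Version
Inputs: $x_0$, $a$, $f(x) = x^2 + a$, $N$, $t$ (even, $t \ge 2$)
Outputs: q (such that $q \mid N$)
$x_1 \leftarrow f(x_0)$, $v_2 \leftarrow v_1 \leftarrow x_2 = f(x_1)$, $k \leftarrow 1$, $d \leftarrow 1$
for (i = 3; i <= 2t; i++) {
  $v_2 \leftarrow f(v_2)$
  if ($2^k + 2^{k-1} + 1 \le i \le 2^{k+1}$) {
    $temp \leftarrow v_2 - v_1$
    $d \leftarrow d \cdot temp$
  }
  if ($i = 2^{k+1}$) {
    $v_1 \leftarrow v_2$
    $k \leftarrow k + 1$
  }
}
$q \leftarrow \gcd(d, N)$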
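A sketch of Brent's variant following the pseudocode on this slide, with d initialized to 1 as in the sequence-of-operations table. The function name and the test values (taken from the earlier N = 183233 example) are assumptions.

```python
from math import gcd

def rho_brent(n, x0, a, t):
    """Brent's rho with a single final gcd over 2t iterations."""
    def f(x):
        return (x * x + a) % n

    v1 = v2 = f(f(x0))                # both start at x_2
    d, k = 1, 1
    for i in range(3, 2 * t + 1):
        v2 = f(v2)                    # v2 = x_i, one evaluation of f per step
        if 2**k + 2**(k - 1) + 1 <= i <= 2**(k + 1):
            d = d * (v2 - v1) % n     # compare x_i with the saved x_{2^k}
        if i == 2**(k + 1):
            v1 = v2                   # save x_{2^(k+1)} as the new reference
            k += 1
    return gcd(d, n)

q = rho_brent(183233, x0=2, a=1, t=4)
```

Each step costs one evaluation of f instead of three, which is where the reduction in execution time comes from.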
p-1 Algorithm
p-1 Algorithm
Based on Fermat's Little Theorem:
$a^{p-1} \equiv 1 \pmod p$
$a^{m(p-1)} \equiv 1 \pmod p$
$a^{m(p-1)} - 1 \equiv 0 \pmod p$
N: number to be factored; a: any small integer; p: non-trivial factor of N
Choose a small number a, such that $1 < a < N$
Choose a special number k
Compute $a^k \pmod N - 1$
Compute $\gcd(a^k \pmod N - 1,\ N)$
p-1 algorithm
Inputs:
N: number to be factored
a: arbitrary integer such that gcd(a, N) = 1
$B_1$: smoothness bound for Phase 1
$B_2$: smoothness bound for Phase 2
Outputs:
q: factor of N, $1 < q \le N$, or FAIL
p-1 algorithm Phase 1
1: $k \leftarrow \prod_{p_i} p_i^{e_i}$ such that $p_i$ are consecutive primes $\le B_1$ and $e_i$ is the largest exponent such that $p_i^{e_i} \le B_1$ (precomputations)
2: $q_0 \leftarrow a^k \bmod N$ (main computations)
3: $q \leftarrow \gcd(q_0 - 1, N)$ (postcomputations)
4: if $q > 1$
5: return q (factor of N)
6: else
7: go to Phase 2
8: end if
p-1 algorithm Phase 2
09: $d \leftarrow 1$
10: for each prime $p = B_1$ to $B_2$ do (main computations)
11: $d \leftarrow d \cdot (q_0^{p} - 1) \pmod N$
12: end for
13: $q \leftarrow \gcd(d, N)$ (postcomputations)
14: if $q > 1$ then
15: return q
16: else
17: return FAIL
18: end if
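p-1 Phase 1 Numerical example
$N = 1\,740\,719 = 1279 \cdot 1361$, $a = 2$, $B_1 = 20$
$k = 2^4 \cdot 3^2 \cdot 5 \cdot 7 \cdot 11 \cdot 13 \cdot 17 \cdot 19 = 232\,792\,560$
$q_0 = a^k \bmod N = 2^{232\,792\,560} \bmod 1\,740\,719 = 1\,003\,058$
$q = \gcd(1\,003\,058 - 1,\ 1\,740\,719) = 1361$
Why did the method work?
$q - 1 = 1360 = 2^4 \cdot 5 \cdot 17$, so $(q - 1) \mid k$
$a^k \bmod q = a^{(q-1) \cdot m} \bmod q = 1$, so $q \mid a^k - 1$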
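Both phases of the p-1 method can be sketched in software as follows. This is a plain reference version, not the hardware algorithm: the helper names are hypothetical, Phase 2 runs over the primes p with $B_1 < p \le B_2$, and a result equal to N is treated as a failure, which is an assumption beyond the slides.

```python
from math import gcd

def primes_up_to(limit):
    """Sieve of Eratosthenes: all primes <= limit (limit >= 1)."""
    sieve = [True] * (limit + 1)
    sieve[0:2] = [False, False]
    for i in range(2, int(limit ** 0.5) + 1):
        if sieve[i]:
            for j in range(i * i, limit + 1, i):
                sieve[j] = False
    return [i for i, is_p in enumerate(sieve) if is_p]

def phase1_exponent(b1):
    """k = product of the largest prime powers p^e <= B1."""
    k = 1
    for p in primes_up_to(b1):
        pe = p
        while pe * p <= b1:
            pe *= p
        k *= pe
    return k

def p_minus_1(n, a, b1, b2):
    """Return a non-trivial factor of n, or None (FAIL)."""
    # Phase 1: q0 = a^k mod n, q = gcd(q0 - 1, n)
    q0 = pow(a, phase1_exponent(b1), n)
    q = gcd(q0 - 1, n)
    if 1 < q < n:
        return q
    # Phase 2: accumulate (q0^p - 1) over primes B1 < p <= B2
    d = 1
    for p in primes_up_to(b2):
        if p > b1:
            d = d * (pow(q0, p, n) - 1) % n
    q = gcd(d, n)
    return q if 1 < q < n else None

k = phase1_exponent(20)
factor = p_minus_1(1740719, a=2, b1=20, b2=100)
```

With the values of the numerical example, Phase 1 alone succeeds because 1360 divides k, while 1278 = 2 * 3^2 * 71 does not.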
Modular Exponentiation - Sliding Window Method
Input: g, $e = (e_t e_{t-1} \ldots e_1 e_0)_2$ with $e_t = 1$, and an integer $w \ge 1$
Output: $g^e$
1. Precomputation
   $g_1 \leftarrow g$, $g_2 \leftarrow g^2$
   For i from 1 to $(2^{w-1} - 1)$ do: $g_{2i+1} \leftarrow g_{2i-1} \cdot g_2$
2. $A \leftarrow 1$, $i \leftarrow t$
3. While $i \ge 0$ do the following:
   if $e_i = 0$ then do: $A \leftarrow A^2$, $i \leftarrow i - 1$
   otherwise ($e_i \ne 0$), find the longest bitstring $e_i e_{i-1} \ldots e_l$ such that $i - l + 1 \le w$ and $e_l = 1$, and do the following:
   $A \leftarrow A^{2^{i-l+1}} \cdot g_{(e_i e_{i-1} \ldots e_l)_2}$, $i \leftarrow l - 1$
4. Return(A)
Sliding Window Method - Example
Calculating $g^{50}$, $e = (110010)_2$, window size 2
Pre-computations: $g^3$
Main computations, $A \leftarrow 1$:
[11]0010: window size = 2 and the value = 11 = 3, $A \leftarrow (A)^4 \cdot g^3 = g^3$
11[0]010: $A \leftarrow A^2 = g^6$
110[0]10: $A \leftarrow A^2 = g^{12}$
1100[1]0: window size = 1 and the value = 1 = 1, $A \leftarrow (A)^2 \cdot g^1 = g^{25}$
11001[0]: $A \leftarrow A^2 = g^{50}$
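The sliding window method can be sketched in Python as below. The modulus parameter n is added here because the p-1 unit performs modular exponentiation; the function name and test modulus are assumptions.

```python
def sliding_window_pow(g, e, n, w=2):
    """Compute g^e mod n with left-to-right sliding windows of width <= w."""
    if e == 0:
        return 1 % n
    # Precompute the odd powers g^1, g^3, ..., g^(2^w - 1)
    g2 = g * g % n
    odd = {1: g % n}
    for j in range(3, 1 << w, 2):
        odd[j] = odd[j - 2] * g2 % n

    A = 1
    i = e.bit_length() - 1            # index of the leading 1 bit
    while i >= 0:
        if not (e >> i) & 1:          # bit e_i = 0: square only
            A = A * A % n
            i -= 1
        else:
            # Longest window e_i..e_l with length <= w that ends in a 1 bit
            l = max(i - w + 1, 0)
            while not (e >> l) & 1:
                l += 1
            for _ in range(i - l + 1):
                A = A * A % n
            window = (e >> l) & ((1 << (i - l + 1)) - 1)
            A = A * odd[window] % n
            i = l - 1
    return A

result = sliding_window_pow(3, 50, 1_000_003, w=2)
```

For e = 50 and w = 2 the loop performs the same steps as the example: $g^3$, $g^6$, $g^{12}$, $g^{25}$, $g^{50}$.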
Hardware Architecture
Top-level View
[Block diagram: a host computer connected through I/O to the FPGA / ASIC, which contains the control unit, global memory (RAM), and the Rho, p-1, or unified units.]
Low Level Arithmetic Units
Montgomery Multiplication
[Block diagram of the Montgomery multiplier: w-bit inputs B and M and operand A held in a shift register (bit Ai), carry-save registers S1 and S2 shifted right by one position per step, quotient bit qi, a CSR42 stage built from two CSAs, and a 32-bit output C with start, read, and done_mul control signals.]
Based on McIvor, McLoone, et al., Asilomar 2003: full-length CSAs, word-length CPAs
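As a behavioural reference for the multiplier, a bit-serial (radix-2) Montgomery multiplication can be sketched as below. This is an assumed software model with ordinary integers; it does not reproduce the carry-save representation (S1, S2) of the hardware, and the function name is hypothetical.

```python
def montgomery_multiply(a, b, m, nbits):
    """Return a * b * 2^(-nbits) mod m for odd m, with a, b < m < 2^nbits."""
    s = 0
    for i in range(nbits):
        a_i = (a >> i) & 1            # next bit of A, least significant first
        s += a_i * b
        q_i = s & 1                   # q_i makes s + q_i * m even
        s = (s + q_i * m) >> 1        # exact division by 2
    return s - m if s >= m else s     # s < 2m here, so one subtraction suffices

# Check against a direct computation (Python 3.8+ for the modular inverse)
m, nbits = 183233, 18
R_inv = pow(1 << nbits, -1, m)
product = montgomery_multiply(12345, 54321, m, nbits)
```

Each iteration needs only additions and a one-bit shift, which is why the operation maps well onto carry-save adders.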
Addition / Subtraction
[Block diagram of the adder/subtractor: a 32 x 32 memory with a LUT, inputs A_M and M selected by A_M_Choice, two 32-bit registers A and B feeding a 32-bit adder with carry in/out, an add_sub control input, a sign output, and a 32-bit output C.]
Original design
Global Memory - Rho
[Memory map, 32-bit words (bits 31..0):]
n for unit 1, n for unit 2, ..., n for unit m
Same for all units: x_0, a, t (no. of iterations)
Local Memory - Rho
[Block diagram: local memory (addresses 0 to 63) with 32-bit data_in and data_out, 6-bit address inputs Aaddr and Baddr, and write enable WEA; stored values include M, temp, V1, V2, a, and d.]
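Computation Flow
MUL: for i = 1 to 2t-1: $v_2 \leftarrow v_2^2$; if cond1: $d \leftarrow d \cdot temp$
ADD/SUB: for i = 1 to 2t-1: $v_2 \leftarrow v_2 + a$; if cond1: $temp \leftarrow v_2 - v_1$
cond1: $2^k + 2^{k-1} + 1 \le i - 1 \le 2^{k+1}$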
Control Unit - Rho Memory Initialization Main Computations Reading Out Results
Global Memory p-1
[Memory maps, 32-bit words (bits 31..0), addresses 0 to 511:]
Phase 1: N for unit 1, N for unit 2, ..., N for unit m; initial values for all units: g_1, g_2, k
Phase 2: GCD_table[1] ... GCD_table[GMAXD] (determines j such that $1 \le j \le D$ and gcd(j, D) = 1); M_min, M_max; prime_table[1], prime_table[2], ..., prime_table[pmax_D] (determines m, j such that P = m.D - j is a prime)
Local Memory p-1
[Memory maps, 32-bit words (bits 31..0), addresses 0 to 511:]
a) Phase 1: N, g_1, g_2, g_3, ..., g_s (*s = 2^k - 1), d = g^e
b) Phase 2: N, d, d^2, d^11, d^13, ..., d^209, d^D, d^{m.D}, d^{m.D} - d^j
Control Unit
Phase 1: Memory Initialization, Pre-Computations, Modular Exponentiation, Reading Out Results
Phase 2: Memory Initialization, Main-Computations, Reading Out Results
Unified Architecture ADD/SUB Local Memory for p-1 Control Unit MUL Local Memory for Rho Global Memory
Control Unit Memory Initialization Rho-Computations P-1 -Computations Reading Out Results
Control Unit Total 17 state machines with 140 states 5 state machines with 45 states in Rho 12 state machines with 103 states in P-1 5 Shift registers 9 Registers 13 Counters 22 Comparators Original design
Design Flow
FPGA vs ASIC
FPGA (Field Programmable Gate Array): array of logic blocks; switchable interconnect resources; final user can set switches; immediate use (zero fab time); not good for high volume applications
ASIC (Application Specific Integrated Circuit): standard cells and macros; requires full manufacturing sequence; good for high volume applications
FPGA Design Flow Design Entry Design Verification Specification RTL Description (VHDL / Verilog HDL) Functional Simulation Synthesis Post-Synthesis Simulation Implementation Timing Simulation Configuration On Chip Testing
ASIC Design Flow Front-End Design Synthesis Timing Analysis Design Analyzer Primetime Back-End Design Floorplanning Placement Clock Tree Synthesis Astro Routing Design for Manufacturing
Results
Families of Xilinx FPGA Devices
Low-cost: Spartan 3 (< $130*), Spartan 3E (< $35*)
High-performance: Virtex II (< $2,700*), Virtex 4 (< $3,000*)
*approximate cost of the largest device per unit for a batch of 10,000 units
FPGA Implementation of Single Units
Target device: Virtex II XC2V6000-6
Results                 Rho           P-1           Unified
CLB Slices              1,680 (4%)    1,749 (5%)    2,042 (6%)
LUTs                    2,714 (4%)    2,875 (4%)    3,451 (5%)
FFs                     1,518 (2%)    1,645 (2%)    1,740 (2%)
BRAMs                   0/144         2/144         2/144
Max. Clock Frequency    130 MHz       131 MHz       115 MHz
Number of unified units per FPGA
[Bar chart:]
Spartan 3 XC3S5000-5 (low-cost): 19
Virtex II XC2V6000-6 (high-performance): 21
Spartan 3E XC3S1600-5 (low-cost): 8
Virtex 4 XC4VLX200-11 (high-performance): 42
Performance: Unified Operations per Second
[Bar chart:]
Spartan 3 XC3S5000-5 (low-cost): 581
Virtex II XC2V6000-6 (high-performance): 819 (x 1.41 vs. Spartan 3)
Spartan 3E XC3S1600-5 (low-cost): 290
Virtex 4 XC4VLX200-11 (high-performance): 2,262 (x 7.8 vs. Spartan 3E)
Performance to cost ratio: Unified Operations per second per $100
[Bar chart:]
Spartan 3 XC3S5000-5 (low-cost): 447 (x 14.9 vs. Virtex II)
Virtex II XC2V6000-6 (high-performance): 30
Spartan 3E XC3S1600-5 (low-cost): 828 (x 11 vs. Virtex 4)
Virtex 4 XC4VLX200-11 (high-performance): 75
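The performance-to-cost figures can be approximately reproduced from the throughput and price slides. The sketch below assumes that each "<" price bound is the actual unit price and that the throughput values map to the devices as 581, 819, 290, and 2,262 operations per second; both are assumptions, so the results match the chart only to within rounding.

```python
# (unified operations per second, assumed unit price in dollars)
devices = {
    "Spartan 3 XC3S5000-5":  (581, 130),
    "Virtex II XC2V6000-6":  (819, 2700),
    "Spartan 3E XC3S1600-5": (290, 35),
    "Virtex 4 XC4VLX200-11": (2262, 3000),
}

# Operations per second per $100 spent on devices
ops_per_100_dollars = {
    name: 100 * ops / price for name, (ops, price) in devices.items()
}

for name, value in ops_per_100_dollars.items():
    print(f"{name}: {value:.1f}")
```

The low-cost devices come out ahead because their price falls much faster than their throughput.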
ASIC - Layout of p-1 - floorplanning
Layout of p-1 - placement
Layout of p-1 clock tree synthesis
Layout of p-1 Global Routing
Layout of p-1 Detailed Routing
Results - ASIC Implementation
Operation                                                  rho         p-1         Unified architecture
Area                                                       1.15 mm2    1.21 mm2    1.8 mm2
Max. Clock Frequency                                       200 MHz     200 MHz     200 MHz
Time for execution                                         3.52 ms     9.56 ms     13.1 ms
# of operations per second (using maximum no. of units)    96,022      34,100      16,615
Core utilization ratio                                     70%         70%         65%
Area of Virtex II FPGA is 19.68 x 19.8 mm2 (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)
FPGA vs ASIC - Area
[Bar chart: number of units in the area of one Virtex II FPGA]
Rho: ASIC 338, FPGA 20 (x 17)
P-1: ASIC 322, FPGA 23 (x 14)
Unified: ASIC 216, FPGA 21 (x 10)
Area of Virtex II FPGA is 19.68 x 19.8 mm2 (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)
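The ASIC bar values are consistent with dividing the estimated Virtex II die area by the area of one ASIC unit. The sketch below assumes that interpretation and uses the per-unit areas from the ASIC results table; the dictionary names are illustrative.

```python
# Estimated Virtex II die area in mm^2 (19.68 mm x 19.8 mm)
fpga_area = 19.68 * 19.8

asic_unit_area = {"rho": 1.15, "p-1": 1.21, "unified": 1.8}   # mm^2 per unit
fpga_units = {"rho": 20, "p-1": 23, "unified": 21}            # units per FPGA

# Whole ASIC units that fit into the same silicon area
asic_units = {k: int(fpga_area / area) for k, area in asic_unit_area.items()}

# Area advantage of the ASIC, rounded as on the chart
ratio = {k: round(asic_units[k] / fpga_units[k]) for k in asic_units}
```

Under these assumptions the computed counts are 338, 322, and 216 units, giving ratios of about 17, 14, and 10.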
Rho in an ASIC 130 nm Global Memory Local Memory
ASIC 130 nm vs. Virtex II 6000, rho (20 units)
Area of Virtex II 6000: 19.68 mm x 19.80 mm (estimation by R.J. Lim Fong, MS Thesis, VPI, 2004)
Area of an ASIC with equivalent functionality: 2.7 mm x 2.82 mm
Ratio: 51x
ASICs vs. FPGAs
Source: I. Kuon, J. Rose, University of Toronto, "Measuring the Gap Between FPGAs and ASICs," IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, vol. 26, no. 2, Feb 2007.
Contributions
Verified the VHDL code through functional and timing simulation, by comparison with a reference software implementation written in C.
Ported the VHDL code to 4 different families of FPGA devices and to a standard-cell ASIC based on the 130 nm TSMC library.
Conclusions
Low-cost FPGA devices, such as Spartan 3, outperformed high-performance devices, such as Virtex II, in terms of performance to cost ratio by a factor of 14.9.
The ASIC implementation outperforms the FPGA by a factor of 50* in terms of area and by a factor of 1.5 in terms of clock frequency.
*For rho the factor is 50; for the other architectures it may be less.
Conclusions
Low-cost FPGA devices Spartan 3 and Spartan 3E are suitable for code-breaking.
An ASIC implementation is suitable when a large number of chips (>1,000,000) is considered.
Future Work
Implementation of Trial Division in hardware
Implementation of ECM in hardware using one multiplier and one adder/subtractor
Integrating Trial Division, Rho, P-1 and ECM to build a co-factoring machine
Experiments on COPACOBANA
Thank you! Questions???