Computers and Electrical Engineering 35 (2009) 54–58
Journal homepage: www.elsevier.com/locate/compeleceng

An area/performance trade-off analysis of a GF($2^m$) multiplier architecture for elliptic curve cryptography

Miguel Morales-Sandoval, Claudia Feregrino-Uribe, René Cumplido *, Ignacio Algredo-Badillo
Computer Science Department, National Institute for Astrophysics, Optics and Electronics, Luis Enrique Erro No. 1, Tonantzintla, Pue. 72840, Mexico

Article history: Received 24 January 2007. Received in revised form 26 November 2007. Accepted 27 May 2008. Available online 31 August 2008.

Abstract

A hardware architecture for GF($2^m$) multiplication and its evaluation in a hardware architecture for elliptic curve scalar multiplication is presented. The architecture is a parameterizable digit-serial implementation for any field order $m$. Area/performance trade-off results of the hardware implementation of the multiplier in an FPGA are presented and discussed.
© 2008 Elsevier Ltd. All rights reserved.

1. Introduction

Finite fields like the binary GF($2^m$) and the prime GF($p$) have been used successfully in error correction codes and cryptographic algorithms. In elliptic curve cryptography (ECC), the overall performance of cryptographic ECC schemes is largely determined by arithmetic in GF($2^m$), with inversion and multiplication being the most time-consuming operations. According to the literature, arithmetic in GF($2^m$) binary fields using polynomial basis leads to efficient hardware implementations of ECC. Some works related to hardware implementation of ECC have reported parameterizable GF($2^m$) arithmetic units to compute the most time-consuming operation in elliptic curve cryptography, the scalar multiplication. Those architectures are based on a diversity of multiplication algorithms, for example: Massey-Omura multipliers [1], linear feedback shift register multipliers [2], Karatsuba [3,4], and digit-serial multipliers [5]. Other works have studied and implemented GF($2^m$) multipliers using polynomial basis, like [8,9].
Others have used different algorithms, like the Montgomery multiplication [10,11]. Although, from the architectural point of view, it is well known that the arithmetic unit has a big impact on the timing and area of hardware for scalar multiplication, it is not clear whether the architecture performance is due to the parallelism in the multipliers, the number of multipliers, or the kind of multipliers used. This technical communication presents the hardware architecture of a GF($2^m$) digit-serial multiplier and evaluates the area/performance trade-off, considering various digit sizes and finite field orders.

* Corresponding author. Tel.: +52 222 2663100; fax: +52 222 2663152. E-mail address: rcumplido@inaoep.mx (R. Cumplido).
0045-7906/$ - see front matter © 2008 Elsevier Ltd. All rights reserved. doi:10.1016/j.compeleceng.2008.05.008

2. GF($2^m$) multiplication architecture

Multiplication in GF($2^m$) in polynomial basis is the operation $A(x)B(x) \bmod F(x)$, which can be computed using a variety of algorithms proposed in the literature. On the one hand, serial or bit-serial algorithms consider each individual bit of the operand $B(x)$, which implies a multiplication latency of $m$ clock cycles. On the other hand, digit-serial multipliers consider a group of $d$ bits of operand $B(x)$ at a time and perform the multiplication in $m/d$ cycles. However, it is not clear which is the best size of $d$ for this kind of multiplier to achieve an appropriate performance that meets the constraints of a specific application. Varying the size of the digit makes it possible to explore the cost in area and the performance improvements from a serial implementation up to a parallel multiplication architecture.

At each iteration, the operand $A(x)$ is multiplied by a group of bits of operand $B(x)$ and the result is reduced modulo $F(x)$. The result is added accumulatively to the result of the next iteration, considering the following bits of $B(x)$, until all $B(x)$ bits are processed. The reduction in the operation latency comes with an increase in the complexity of each step of the multiplication. For our implementation, we consider the digit-serial Algorithm 1 [6], the same algorithm used for the work reported in [5], and show the different area/time results when the digit size is varied. This will help designers to select suitable parameters when implementing architectures for high-level applications like cryptographic algorithms or error correction code algorithms.

Algorithm 1. Digit-serial multiplication: multiplication in GF($2^m$)
Require: $A(x)$, $B(x)$ in GF($2^m$), $F(x)$ the irreducible polynomial of degree $m$ ($m + 1$ coefficients)
Ensure: $C(x) = A(x)B(x) \bmod F(x)$
1: $C(x) \leftarrow B_{s-1}(x)A(x) \bmod F(x)$
2: for $k$ from $s-2$ down to 0 do
     $C(x) \leftarrow x^d C(x)$
     $C(x) \leftarrow C(x) + B_k(x)A(x) \bmod F(x)$
   end for

Since $B(x)$ is an element of GF($2^m$) in polynomial basis, it is viewed as the polynomial $b_{m-1}x^{m-1} + b_{m-2}x^{m-2} + \cdots + b_1 x + b_0$.
For a positive digit number $d < m$, the polynomial $B(x)$ can be grouped so that it can be expressed as

$$B(x) = x^{(s-1)d}B_{s-1}(x) + x^{(s-2)d}B_{s-2}(x) + \cdots + x^{d}B_1(x) + B_0(x),$$

where $s = \lceil m/d \rceil$ and each word $B_i(x)$ is defined as follows:

$$B_i(x) = \begin{cases} \sum_{j=0}^{d-1} b_{id+j}\,x^j & \text{if } 0 \le i < s-1, \\ \sum_{j=0}^{(m \bmod d)-1} b_{id+j}\,x^j & \text{if } i = s-1. \end{cases}$$

If $x^d$ is factored from the grouped representation of $B(x)$, the resulting expression is

$$B(x) = x^d\bigl(x^d\bigl(\cdots x^d\bigl(x^d B_{s-1}(x) + B_{s-2}(x)\bigr) + \cdots\bigr) + B_1(x)\bigr) + B_0(x).$$

This last representation of operand $B(x)$ is used in Algorithm 1 to compute the field multiplication. That is,

$$A(x)B(x) \bmod F(x) = \Bigl[x^d\bigl(x^d\bigl(\cdots x^d\bigl(x^d B_{s-1}(x)A(x) + B_{s-2}(x)A(x)\bigr) + \cdots\bigr) + B_1(x)A(x)\bigr) + B_0(x)A(x)\Bigr] \bmod F(x).$$

At each iteration, the accumulator $C(x)$ is multiplied by $x^d$ and the result is added to the product of $A(x)$ and the next word $B_i(x)$ of $B(x)$. The partial result $C(x)$ is reduced modulo $F(x)$:

$$\begin{aligned}
\text{Initialization:}\quad & C(x) = B_{s-1}(x)A(x) \bmod F(x) \\
\text{Iteration } s-2\text{:}\quad & C(x) = x^d C(x) \bmod F(x) = x^d B_{s-1}(x)A(x) \bmod F(x) \\
& C(x) = x^d B_{s-1}(x)A(x) + B_{s-2}(x)A(x) \bmod F(x) \\
\text{Iteration } s-3\text{:}\quad & C(x) = x^d C(x) \bmod F(x) = x^d\bigl(x^d B_{s-1}(x)A(x) + B_{s-2}(x)A(x)\bigr) \bmod F(x) \\
& C(x) = x^d\bigl(x^d B_{s-1}(x)A(x) + B_{s-2}(x)A(x)\bigr) + B_{s-3}(x)A(x) \bmod F(x)
\end{aligned}$$

The proposed architecture for Algorithm 1 is shown on the left side of Fig. 1. A finite state machine controls the data flow, executing the loop in Algorithm 1. At each iteration, a new digit of $d$ bits from $B(x)$ is processed, so the operation is performed in $\lceil m/d \rceil$ cycles. The operations $x^d C(x)$ and $B_i(x)A(x)$ are computed using parallel combinatorial multipliers, each of which multiplies a polynomial of degree $d-1$ by a polynomial of degree $m-1$. With $U(x)$ a polynomial of degree $d-1$, $u_{d-1}x^{d-1} + u_{d-2}x^{d-2} + \cdots + u_1 x + u_0$, and $A(x)$ a polynomial of degree $m-1$, the parallel multiplication is

$$U(x)A(x) \bmod F(x) = u_{d-1}x^{d-1}A(x) \bmod F(x) + u_{d-2}x^{d-2}A(x) \bmod F(x) + \cdots + u_1 xA(x) \bmod F(x) + u_0 A(x) \bmod F(x).$$

The operation $xA(x) \bmod F(x)$ is a left shift of $A(x)$ followed by a reduction modulo $F(x)$. Thus, the value $x^i A(x) \bmod F(x)$ is the shifted and reduced version of $x^{i-1}A(x) \bmod F(x)$, so each value $x^i A(x) \bmod F(x)$ can be generated sequentially starting with $x^0 A(x)$.
Finally, each $x^i A(x) \bmod F(x)$ value is added depending on the bit value of $u_i$. These operations are executed by the parallel multiplier shown on the right side of Fig. 1.
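The data flow described above can be mirrored in a small software model. The following Python sketch is only an illustration, not the VHDL design of this work: field elements are held as integers whose bit $i$ is the coefficient of $x^i$, and the function names (`shift_reduce`, `parallel_mul`, `digit_serial_mul`) are hypothetical. The multiplication by $x^d$ is done here by $d$ repeated shift-and-reduce steps, which is one possible way to realize that step in software. The small field GF($2^4$) with $F(x) = x^4 + x + 1$ in the usage line is an assumption chosen for readability.

```python
def shift_reduce(a: int, f: int, m: int) -> int:
    """Return x*A(x) mod F(x): shift left once, subtract (XOR) F(x) if degree m appears."""
    a <<= 1
    if (a >> m) & 1:
        a ^= f
    return a


def parallel_mul(u: int, a: int, f: int, m: int, d: int) -> int:
    """Return U(x)*A(x) mod F(x) for a d-bit U(x).

    Generates x^i * A(x) mod F(x) sequentially (the S&R chain) and adds
    (XORs) the i-th value whenever bit u_i is set.
    """
    acc = 0
    term = a  # x^0 * A(x)
    for i in range(d):
        if (u >> i) & 1:
            acc ^= term
        term = shift_reduce(term, f, m)
    return acc


def digit_serial_mul(a: int, b: int, f: int, m: int, d: int) -> int:
    """Software model of Algorithm 1: A(x)*B(x) mod F(x), d bits of B(x) per iteration."""
    s = -(-m // d)           # s = ceil(m / d) digits
    mask = (1 << d) - 1

    def digit(k: int) -> int:
        return (b >> (k * d)) & mask  # word B_k(x)

    c = parallel_mul(digit(s - 1), a, f, m, d)      # initialization
    for k in range(s - 2, -1, -1):
        for _ in range(d):                          # C(x) <- x^d * C(x) mod F(x)
            c = shift_reduce(c, f, m)
        c ^= parallel_mul(digit(k), a, f, m, d)     # C(x) <- C(x) + B_k(x)*A(x)
    return c


# Illustrative field (assumption): GF(2^4) with F(x) = x^4 + x + 1.
# x * x^3 = x^4 = x + 1, i.e. 0b0011.
print(digit_serial_mul(0b0010, 0b1000, 0b10011, 4, 2))  # prints 3
```

The result does not depend on the digit size: $d$ only changes how many bits of $B(x)$ are consumed per iteration, which is the parameter explored in the rest of this communication.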
[Figure 1. Left panel: GF($2^m$) digit multiplier architecture, with the $B(x)$ shift register delivering one digit at a time, two combinatorial multipliers producing $Q_1(x)$ and $A(x)U(x) \bmod F(x)$, a $d$-bit shift producing $Q_2(x)$, a XOR gate and the $C(x)$ register, whose final content is $C(x) = A(x)B(x) \bmod F(x)$. Right panel: parallel combinatorial multiplier built from a chain of S&R blocks, where S&R = $A(x)x \bmod F(x)$, selected by the bits $u_0, u_1, u_2, \ldots, u_{d-2}, u_{d-1}$ of $U(x)$.]
Fig. 1. Hardware architecture for digit serial finite field multiplication.

The operation $x^d C(x) \bmod F(x)$ is computed in two steps. Using the polynomial representation of $C(x)$,

$$\begin{aligned}
x^d C(x) \bmod F(x) &= x^d\bigl(c_{m-1}x^{m-1} + c_{m-2}x^{m-2} + \cdots + c_{m-d}x^{m-d} + c_{m-d-1}x^{m-d-1} + \cdots + c_1 x + c_0\bigr) \bmod F(x) \\
&= x^d\bigl(c_{m-1}x^{m-1} + c_{m-2}x^{m-2} + \cdots + c_{m-d}x^{m-d}\bigr) \bmod F(x) + x^d\bigl(c_{m-d-1}x^{m-d-1} + \cdots + c_1 x + c_0\bigr) \bmod F(x) \\
&= \bigl(c_{m-1}x^{m+d-1} + c_{m-2}x^{m+d-2} + \cdots + c_{m-d}x^{m}\bigr) \bmod F(x) + \bigl(c_{m-d-1}x^{m-1} + \cdots + c_1 x^{d+1} + c_0 x^{d}\bigr) \bmod F(x) \\
&= Q_1(x) \bmod F(x) + Q_2(x) \bmod F(x).
\end{aligned}$$

$Q_2(x)$ is a polynomial of degree $m-1$, corresponding to the $m-d$ least significant bits of $C(x)$ shifted $d$ positions to the left, so $Q_2(x)$ does not need to be reduced. By factoring $x^m$ from $Q_1(x)$, we obtain $Q_1(x) = x^m\bigl(c_{m-1}x^{d-1} + c_{m-2}x^{d-2} + \cdots + c_{m-d}\bigr)$. Since $F(x)$ is a trinomial or pentanomial of degree $m$ of the form $F(x) = x^m + g(x)$, where $g(x)$ is a polynomial of degree $g$, the equivalence $x^m \equiv g(x)$ can be used. Here $g(x)$ corresponds to all bits of $F(x)$ except bit $m$. Thus, $Q_1(x) \bmod F(x) = g(x)\bigl(c_{m-1}x^{d-1} + c_{m-2}x^{d-2} + \cdots + c_{m-d}\bigr)$. That is, the operation is a multiplication of $g(x)$, of degree $g$, by a polynomial of degree $d-1$ corresponding to the $d$ most significant bits of $C(x)$. The resulting polynomial is of degree at most $g + d$. In all the cases of the polynomial $F(x)$ used in the tests, for the finite fields $m \in \{163, 233, 283, 409, 571\}$ and digits $d \in \{1, 4, 8, 16, 32\}$, the value $g + d < m$, so no reduction is necessary. The polynomial $g(x)$ is expanded to a polynomial of degree $m-1$ so that $Q_1(x) \bmod F(x)$ can be computed using the parallel combinatorial multiplier. All these computations are performed by the modules in the architecture for the multiplier, which includes the parallel multipliers, a $d$-bit left-shift module, two registers and a 3-input XOR gate.

3. Implementation and results

The architecture was designed in VHDL, and simulated and validated using Active-HDL and a test program in C. The architecture is parameterizable in the field order for any value of $m$. The average system throughput of the architecture was obtained by synthesizing it for several finite field orders for the reconfigurable device xc2v2000 FPGA, using Xilinx's tools. The multiplier was implemented for the field orders $m$ = 163, 233, 283, 409 and 571 recommended by NIST [7] for elliptic curve cryptography, and for the field $m$ = 277 recommended by IPSec. Due to the large number of I/O pins in the architecture, the GF($2^m$) multiplier was implemented together with an I/O interface. This is a finite state machine that gets the input parameters $A(x)$ and $B(x)$ as 32-bit words and, once the operation is computed, delivers the result in several 32-bit words. The results presented in the figures include the I/O interface.

We also investigated the performance of the multiplier considering the processing time. Fig. 2 shows the processing time for specific finite fields and digit sizes, and Fig. 3 shows the area resources required for each one of these finite fields and digit sizes. From these figures, it can be observed that the bigger the digit, the better the performance, but the higher the area requirements. Latency of the multiplier is mainly reduced by the size of the digit. From Fig. 2, it is seen that the difference in timing between the digit sizes of 16 and 32 bits is not significant, thus the extra cost in terms of area for digit sizes greater than 32 bits is not justified.
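The diminishing return from larger digits can already be seen in the cycle count $\lceil m/d \rceil$ stated in Section 2. The short Python sketch below tabulates that count for the field orders and digit sizes used in this work. It is an idealized model under stated assumptions: one digit is consumed per clock cycle, and the I/O interface, control overhead and the change in clock period with $d$ are all ignored. The function names are hypothetical.

```python
# Idealized latency model (assumption): one digit of B(x) per clock cycle,
# so a multiplication takes ceil(m / d) cycles. I/O and control are ignored.

FIELD_ORDERS = (163, 233, 277, 283, 409, 571)  # field orders used in this work
DIGIT_SIZES = (1, 4, 8, 16, 32)                # digit sizes used in this work


def cycles(m: int, d: int) -> int:
    """Number of iterations of the digit-serial loop: ceil(m / d)."""
    return -(-m // d)  # integer ceiling division


def cycle_table() -> dict:
    """Map each field order m to its cycle counts, one per digit size."""
    return {m: [cycles(m, d) for d in DIGIT_SIZES] for m in FIELD_ORDERS}


for m, row in cycle_table().items():
    cells = "  ".join(f"d={d}: {c}" for d, c in zip(DIGIT_SIZES, row))
    print(f"m = {m}: {cells}")
```

For $m$ = 163 the model gives 163, 41, 21, 11 and 6 cycles: going from $d$ = 1 to $d$ = 4 removes 122 cycles, whereas going from $d$ = 16 to $d$ = 32 removes only 5.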
[Figure 2. Plot of time (us) versus $m$ (field order), with one curve per digit size: digit = 1, 4, 8, 16, 32.]
Fig. 2. Time (us) to compute GF($2^m$) multiplication using different parallelism grade and finite field orders.

[Figure 3. Plot of area (slices) versus $m$ (field order), with one curve per digit size: digit = 1, 4, 8, 16, 32.]
Fig. 3. Area (slices) resources for different parallelism grade and finite field orders.

The application constraints will guide the selection of the best implementation parameters. As an example of an application, consider a reconfigurable architecture for scalar multiplication in elliptic curve cryptography that manages several finite field orders, but only assigns a fixed space for the field multiplier. For each specific finite field order, there is a digit size that maximizes the performance of the multiplier for a fixed area. For example, with 40 K gates the best performer is a 4-digit field multiplier for the field $m$ = 571. In this area, we could also implement an 8-digit or a 16-digit multiplier for the fields $m$ = 409 and $m$ = 277, respectively, and so on. It is worth mentioning that the results presented in this technical communication were obtained from place and route optimized for speed and without keeping the hierarchical structure of the design.

Finally, Table 1 shows a comparison of the area results and performance achieved in this work against the results presented in [8] for several kinds of parallel multipliers, using the field $m$ = 233 and the same technology, a Virtex-2 FPGA xc2v6000-4. In this comparison, the I/O interface was not used. The results show that the digit-serial solution requires less area, 10 times lower for $d$ = 32 compared to the parallel implementation of the classical multiplier, at the cost of six more clock cycles. In all the cases, the digit-serial multiplier has a greater frequency, which implies that this module can be integrated into other designs working at high frequencies.
The multiplier using $d$ = 1 achieves better timing (269 MHz, 0.6 us) compared with the bit-serial implementation in [9] (42 MHz, 7.4 us) for the finite field $m$ = 163. Other works have implemented finite field multipliers and used them in elliptic curve coprocessors or processors [1–5], but the results for the standalone multiplier are not available. Others have implemented the multiplier for the GF($p$) finite field, so a direct comparison is not possible [10,11].
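As a rough cross-check of the timing quoted for $d$ = 1 and $m$ = 163, the cycle count $\lceil m/d \rceil$ can be divided by the clock frequency. The Python sketch below does only that. It assumes one digit per clock cycle with no I/O or control overhead, and the function name is hypothetical, so it is an estimate rather than a reproduction of the measured results.

```python
def multiplication_time_us(m: int, d: int, freq_mhz: float) -> float:
    """Estimated time of one field multiplication in microseconds.

    Assumes ceil(m / d) clock cycles and nothing else (no I/O, no control
    overhead). Cycles divided by MHz gives microseconds.
    """
    n_cycles = -(-m // d)  # integer ceiling division
    return n_cycles / freq_mhz


# d = 1, m = 163 at the 269 MHz quoted in the text.
print(f"{multiplication_time_us(163, 1, 269.0):.3f} us")  # prints 0.606 us
```

The estimate of about 0.6 us agrees with the figure quoted above for the bit-serial configuration of this multiplier.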
Table 1
Area and time comparison results

Ref.       Multiplier         LUT/FF          Slices   Gate count   Clock period (ns)   Frequency (MHz)
[8]        Classical (est.)   37,296/37,552   -        528,427      13.0                77
[8]        Hybrid Karatsuba   11,746/13,941   -        182,007      11.07               90.3
[8]        Massey-Omura       36,857/8,543    -        289,489      15.91               62.8
[8]        Sunar-Koc          45,435/41,942   -        608,149      10.73               93.2
This work  d = 1              484/477         246      6731         4.26                234.8
This work  d = 4              1188/634        766      12,95        4.82                207.2
This work  d = 8              2115/71         1384     18,394       5.32                187.7
This work  d = 16             4004/889        2436     31,139       6.19                161.5
This work  d = 32             711/1349        4457     53,647       6.59                151.6

4. Conclusions

An area/performance trade-off analysis for a digit-serial GF($2^m$) finite field multiplication was presented. The size of the digit to use in an application of the proposed multiplier architecture will be guided by the area assigned to the multiplier. Also, the required processing time, and which other digits can be used to maximize the performance for other field orders using greater digits, should be taken into account.

Acknowledgments

The first author thanks the National Council for Science and Technology from Mexico (CONACyT) for financial support through the scholarship number 171577.

References

[1] Ernest M, Klupsch S, Hauck O, Huss SA. Rapid prototyping for hardware accelerated elliptic curve public key cryptosystems. In: Proceedings of 12th IEEE workshop on rapid system prototyping, RSP 2001, Monterey, CA; June 2001. p. 24–31.
[2] Bednara M, Daldrup M, Gathen J, Shokrollahi J, Teich J. Reconfigurable implementation of elliptic curve crypto algorithms. In: IPDPS '02: Proceedings of the 16th international parallel and distributed processing symposium. Washington, DC, USA: IEEE Computer Society; 2002. p. 284–91.
[3] Ernest M, Jung M, Madlener F, Huss S, Blümel R. A reconfigurable system on chip implementation for elliptic curve cryptography over GF($2^m$). In: Proceedings of the 4th international workshop on cryptographic hardware and embedded systems CHES 2002. Lecture notes in computer science, vol. 2523. Redwood Shores, CA: Springer; 2002. p. 381–99.
[4] Saquib N, Rodriguez F, Diaz A. A parallel architecture for fast computation of elliptic curve scalar multiplication over GF($2^m$). In: Proceedings of 11th reconfigurable architectures workshop, RAW '04, Sta. Fe, USA; April 2004. p. 26–7.
[5] Lutz J, Hasan A. High performance FPGA based elliptic curve cryptographic co-processor. ITCC '04: international conference on information technology: coding and computing, vol. 2. IEEE Society Press; 2004. p. 486–92.
[6] Lutz Jonathan. High performance elliptic curve cryptographic co-processor. Master's thesis, University of Waterloo; 2003.
[7] NIST. Recommended elliptic curves for federal government use. <http://csrc.nist.gov/csrc/fedstandards.html>; 1999.
[8] Grabbe C, Bednara M, Teich J, von zur Gathen J, Shokrollahi J. FPGA designs of parallel high performance GF($2^{233}$) multipliers. In: Proceedings of IEEE ISCAS '03, vol. II; 2003. p. 268–71.
[9] Kitsos P, Theodoridis G, Koufopavlou O. An efficient reconfigurable multiplier architecture for Galois field GF($2^m$). Microelectron J 2003;34(10).
[10] Savaş E, Tenca AF, Koç ÇK. A scalable and unified multiplier architecture for finite fields GF($p$) and GF($2^m$). Cryptographic hardware and embedded systems, LNCS no. 1965; August 2000. p. 281–96.
[11] Tenca AF, Koc Cetin K. A scalable architecture for modular multiplication based on Montgomery's algorithm. IEEE Trans Comput 2003;52(9):1215–21.