RSA Cryptography using Designed Processor and MicroBlaze Soft Processor in FPGAs

Md. Nazrul Islam Mondal
Dept. of CSE, Rajshahi University of Engineering and Technology, Rajshahi-6204, Bangladesh

Md. Al Mamun
Dept. of CSE, Rajshahi University of Engineering and Technology, Rajshahi-6204, Bangladesh

Boshir Ahmed
Dept. of CSE, Rajshahi University of Engineering and Technology, Rajshahi-6204, Bangladesh

ABSTRACT
Some applications, such as RSA encryption/decryption, need integer arithmetic on operands with many bits. Conventional CPUs cannot perform such operations directly, because their instructions support only integers of a fixed width, say 64 bits. Since a CPU must repeat fixed-width arithmetic operations to process such numbers, it incurs considerable overhead when executing applications that involve integer arithmetic with many bits. Hardware algorithms for such applications can instead be implemented in FPGAs for further acceleration. However, implementing a hardware algorithm is usually very complicated, and debugging hardware is hard. The main contribution of this paper is an intermediate approach between software and hardware using FPGAs. More specifically, this paper presents a processor based on the FDFM (Few DSP slices and Few Memory blocks) approach that supports arithmetic operations with flexibly many bits, and implements it in an FPGA. To show the potential of the designed processor, 128-bit RSA encryption/decryption is implemented and compared with the MicroBlaze soft processor in an FPGA. The resulting processor uses only one DSP48E1 slice and four Block RAMs (BRAMs), and RSA encryption software on it runs in 0.42 ms. In contrast, MicroBlaze uses three DSP48E1 slices and 70 BRAMs and runs in 152.28 ms. Hence, the proposed processor is significantly more efficient than the MicroBlaze soft processor in terms of both resource usage and execution time. The proposed processor can also be used for arithmetic on longer operands, such as 2048 bits, without further modification, and is therefore more flexible.
General Terms
RSA Security Algorithm

Keywords
Multiple-Length-Arithmetic, MicroBlaze, Soft Processor, Montgomery Modular Multiplication in RSA, FPGA, DSP Slices, Block RAMs.

1. INTRODUCTION
An FPGA is a programmable logic device designed to be configured by the customer or designer, using an HDL (Hardware Description Language), after manufacturing. FPGA chips combine a relatively low price with programmability [1], [2], [3], and hence have become widely used. The readers may refer to circuit implementations in FPGAs [4], [5], [6], [7], [8], [9], [10], [11], [12] that accelerate computation.

Integers that exceed the range a CPU can process directly are called multiple double length numbers or multiple precision numbers, and computation on such numbers is called multiple-length arithmetic. More specifically, integer arithmetic on multiple-length numbers longer than 64 bits cannot be performed directly by conventional 64-bit CPUs, because their instructions support only fixed 64-bit integers. To execute such applications, CPUs need to repeat 64-bit arithmetic operations on those numbers, which increases the execution overhead. Alternatively, hardware algorithms for such applications can be implemented in FPGAs to speed up computation. However, implementing a hardware algorithm is usually very complicated, and debugging hardware is hard. Since low-level instructions, represented by 0s and 1s, are almost impossible to understand even for an expert, debugging an algorithm at this level is very hard. Moreover, to implement a hardware algorithm written in a hardware description language such as Verilog HDL, users need sufficient knowledge of hardware elements such as registers, which makes the task complicated for non-experts and beginners.
Instructions in assembly language, by contrast, are written with alphanumeric symbols instead of 0s and 1s. This is almost similar to a high-level language written in English, which makes instructions and algorithms easy to read, modify, and debug for non-experts and beginners.

The main contributions of this paper are as follows. To show the potential of the proposed processor, 128-bit RSA is implemented and compared with the MicroBlaze soft processor in an FPGA; the comparison shows that the proposed processor is significantly more efficient in terms of resource usage and execution time. The proposed flexible-length-arithmetic processor, based on the FDFM approach, can compute on integers with flexibly many bits, even longer than 2048 bits, using a single machine instruction. To make debugging and further development easy, an intermediate approach between software and hardware is presented. The designed processor is flexible in that it can be used for integers of flexibly many bits, such as 64 bits, 128 bits, or even more than 2048 bits, without further modification.

Since the designed processor is based on the FDFM approach, let us explain the approach briefly. The key idea of the FDFM approach is to use few DSP slices and few block RAMs to perform routine computations, which makes it a resource-efficient approach. Consider a simple example. Figure 1 (1) illustrates a hardware algorithm to compute the output of an FIR (Finite Impulse Response) filter, $y_i = a_0 x_i + a_1 x_{i-1} + a_2 x_{i-2} + a_3 x_{i-3}$. A conventional approach to implementing the FIR is to use four DSP slices, as illustrated in Figure 1 (2) [13]. In this conventional approach, the number of DSP slices must be the same as the number of multipliers in the hardware algorithm. The FDFM approach, however, uses one or a few DSP slices and one or a few block RAMs to implement the FIR. Figure 1 (3) shows the FDFM approach using one DSP slice and one block RAM to implement the same FIR. Note that the coefficients $a_0, a_1, \ldots$ are stored in the block RAM.

Figure 1: FDFM approach over conventional one for FIR

The readers may refer to the papers [14], [15], [16], [17] for details about the FDFM approach and the conventional approach.

The most common FPGA architecture consists of an array of logic blocks, I/O pads, block RAMs, and routing channels. Furthermore, embedded DSP blocks integrated into an FPGA provide higher performance and a broader range of applications. Figure 2 illustrates the Virtex-6 FPGA developed by Xilinx. Each CLB (Configurable Logic Block) in Virtex-6 consists of two sub-logic blocks called slices. Using the LUTs (Look-Up Tables) and flip-flops in the slices, various combinatorial and sequential circuits can be implemented.

Figure 2: Internal Configuration of Virtex-6 FPGA

The significant points of the results are summarized as follows:
1. A flexible-length-arithmetic processor is proposed for applications that require arithmetic operations on numbers longer than 64 bits. Even numbers of 2048 bits or more can be computed by the designed processor without any modification.
2. The proposed processor is flexible in the sense that it supports arithmetic operations on numbers with flexibly many bits, that is, numbers of variable size longer than 64 bits.
3. Finally, the potential of this processor is checked by implementing 128-bit RSA and comparing it with the MicroBlaze soft processor, which shows that the proposed processor is significantly more efficient.

The rest of this paper is organized as follows. Section 2 briefly describes the multiple-length-arithmetic operation. Section 3 describes the proposed architecture. RSA cryptography as an application is described briefly in Section 4; the readers may refer to the previous paper [23] for details. Section 5 describes experimental results and discussions. Finally, Section 6 concludes this work.

2. MULTIPLE-LENGTH-ARITHMETIC OPERATION
The main purpose of this section is to describe multiple-length-arithmetic operations. Suppose that A and B are two multiple-length numbers of 1024 bits each. These numbers are partitioned into several blocks of 17 bits. First, let us see how a multiple-length number of 1024 bits is stored in the data memory. Figure 3(a) shows a data memory (BRAM). Every 17-bit data block together with a 1-bit flag forms a bit block of 18 bits. The MSB (Most Significant Bit) of each bit block is the flag; a flag set to 1 indicates the end of a multiple-length number stored in the data memory, as shown in Figure 3(b). In this figure, the 1024-bit multiple-length number A is divided into 61 blocks of 17 bits, $a_0, a_1, \ldots, a_{60}$. Every 17-bit block of A, together with its 1-bit flag, is stored in a different location of the data memory (BRAM).
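The storage layout described above can be sketched in Python as a quick sanity check. This is a minimal sketch under stated assumptions, not the processor's actual memory image: the helper names `pack_words` and `unpack_words` are hypothetical, the least significant block is assumed to be stored first, and the end flag is assumed to sit on the most significant block.

```python
DIGIT_BITS = 17                       # data bits per BRAM word, as in Section 2
DIGIT_MASK = (1 << DIGIT_BITS) - 1
FLAG = 1 << DIGIT_BITS                # MSB of the 18-bit word marks the end of a number


def pack_words(value, bit_length):
    """Split `value` into 18-bit words: 17 data bits plus a 1-bit end flag.

    Assumption: least significant block first, flag on the last block.
    """
    count = -(-bit_length // DIGIT_BITS)          # ceil(bit_length / 17)
    words = [(value >> (DIGIT_BITS * k)) & DIGIT_MASK for k in range(count)]
    words[-1] |= FLAG
    return words


def unpack_words(memory, start=0):
    """Read blocks from `memory[start:]` until a word with the flag set."""
    value = 0
    for k, word in enumerate(memory[start:]):
        value |= (word & DIGIT_MASK) << (DIGIT_BITS * k)
        if word & FLAG:
            return value
    raise ValueError("no end-of-data flag found")


if __name__ == "__main__":
    a = (1 << 1024) - 12345
    words = pack_words(a, 1024)
    print(len(words))                 # number of 18-bit words for 1024 bits
    print(unpack_words(words) == a)
```

With a 1024-bit operand this yields 61 words, matching the block count in the text; the flag lets a reader of the memory find the end of a number without knowing its length in advance.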

Figure 3: Data of 1024-bit Length Is Stored in Memory (BRAM)

For the benefit of readers, Figure 4 describes the instruction memory as well as the instruction format for multiple-length (multi-double long) data. Figure 4(a) represents an instruction memory in which a 53-bit instruction together with a 1-bit flag can be stored at any address of the instruction memory. Here, a flag set to 1 indicates the last instruction to be executed. Note that the addresses of the instruction memory are handled by the Program Counter (PC).

Fig. 4: An Instruction Memory and an Instruction Format for Multi-Double Data

Let us give an example of a multiplication of two multi-double long data. Other arithmetic operations on multi-double long data, such as addition, subtraction, division, and comparison, can also be performed. Suppose u and v represent two multi-double long data. Let us multiply u by v and store the result in w, that is, $w = u \cdot v$. An assembly instruction for this computation is as follows:

MUL A, B, C

In the above instruction, A, B and C are operands of 16 bits each, which can indicate $2^{16}$ different addresses $0, 1, \ldots, 2^{16} - 1$ of the data memory (BRAM), and MUL is the 5-bit OPCODE that determines the operation on the operands (in this case multiplication), as illustrated in Figure 4(b). The readers may refer to the paper [23], in which an example of multi-double long multiplication is described in detail. Algorithm 1 shows the multiplication of two multi-double long data u and v.

Algorithm 1: Multi-Double Long Multiplication
B: radix of the operands, $B = 2^{17}$
n: number of radix-$2^{17}$ digits in u
m: number of radix-$2^{17}$ digits in v
Input: $u = \sum_{i=0}^{n-1} u_i \cdot B^i$, $v = \sum_{i=0}^{m-1} v_i \cdot B^i$
Output: $w = u \cdot v$
1. $w_i \leftarrow 0$ for $i = 0, \ldots, n-1$
2. for j = 0 to m-1 do
3.   $c \leftarrow 0$
4.   for i = 0 to n-1 do
5.     $\{c, w_{i+j}\} \leftarrow w_{i+j} + u_i \cdot v_j + c$
6.   end for
7.   $w_{n+j} \leftarrow c$
8. end for
9. return $w = \sum_{i=0}^{n+m-1} w_i \cdot B^i$

Because of the page limitation, the readers may refer to the previous paper [23] for a simple example of the above algorithm.

3. PROPOSED ARCHITECTURE
Let us briefly describe the proposed processor architecture for multiple-length-arithmetic operations. The designed processor consists of a program counter, an instruction memory, address counters, a data memory, an ALU, registers, and control units, as illustrated in Figure 5 and Figure 6. Because of the page limitation, the readers may refer to the previous paper [23] for details.

Figure 5: The Proposed Processor Architecture
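As a software cross-check of the multi-double long multiplication that the MUL instruction performs (Algorithm 1 in Section 2), the following Python sketch applies the same digit-by-digit scheme to lists of radix-$2^{17}$ digits. The function names are illustrative assumptions and do not come from the paper; the result array is zero-initialized before the outer loop.

```python
DIGIT_BITS = 17
DIGIT_MASK = (1 << DIGIT_BITS) - 1    # B - 1 with B = 2**17


def to_digits(x, count):
    """Radix-2**17 digits of x, least significant first."""
    return [(x >> (DIGIT_BITS * k)) & DIGIT_MASK for k in range(count)]


def from_digits(digits):
    """Inverse of to_digits."""
    return sum(digit << (DIGIT_BITS * k) for k, digit in enumerate(digits))


def multi_length_mul(u, v):
    """Schoolbook product of two digit lists, following Algorithm 1."""
    n, m = len(u), len(v)
    w = [0] * (n + m)                         # result digits, all zero initially
    for j in range(m):
        c = 0                                 # carry, always below 2**17
        for i in range(n):
            t = w[i + j] + u[i] * v[j] + c    # fits in 34 bits
            c, w[i + j] = t >> DIGIT_BITS, t & DIGIT_MASK
        w[n + j] = c
    return w


if __name__ == "__main__":
    a = (1 << 1024) - 3
    b = (1 << 1000) + 12345
    product = from_digits(multi_length_mul(to_digits(a, 61), to_digits(b, 61)))
    print(product == a * b)
```

Each inner step is one 17-bit by 17-bit multiply-accumulate, which is the kind of operation a single DSP slice can be expected to handle; the sum $w_{i+j} + u_i v_j + c$ never exceeds $2^{34} - 1$, so the carry always fits in one digit.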

Figure 6: ALU Architecture

In the following section, RSA cryptography is implemented using the proposed architecture, programmed in assembly language. A total of 117 assembly instructions, each 54 bits wide, are needed to implement the modular exponentiation algorithm. In this paper, only the assembly code for Montgomery multiplication is shown, as illustrated in the previous paper [23]. In the assembly code below, the registers R1 and R2 hold the inputs X and Y, each of 64 bits. The 64-bit modulus M is also given. The registers R3, R4, and R5 hold intermediate results, and the final result of the Montgomery multiplication is stored in either R6 or C. Note that the 64-bit data in register R1 is divided into several blocks of 17 bits each, and these are stored in block registers R1_0, R1_1, R1_2, R1_3 (lower block to higher block). The other registers are organized in the same way.

[Assembly Code for Montgomery Multiplication]
R1=X, R2=Y; R3=0, R4=0, R5=0, R6=0; C=0;
01: MUL,  R1_0, R2_0, R3_0      ; X (R1) x Y (R2), stored in R3
02: MOVI, R3_0, R4_0, R3_3      ; copy blocks of R3 into R4
03: MASK, R4_3, R4_3, 1FFF      ; clear the upper 4 bits of R4_3
04: MUL,  R4_0, M^{-1}_0, R5_0  ; R4 x (-M^{-1}), stored in R5
05: MOVI, R5_0, R4_0, R5_3      ; copy blocks of R5 into R4
06: MASK, R4_3, R4_3, 1FFF      ; clear the upper 4 bits of R4_3
07: MUL,  R4_0, M_0, R5_0       ; R4 x M, stored in R5
08: ADD,  R3_0, R5_0, R6_0      ; store R3 + R5 in R6
09: SHR,  R6_3, R6_0, 13        ; 64-bit shift right of R6
0A: CMP,  R6_0, M_0             ; R6 compared with M
0B: JC,   0D                    ; if R6 < M, go to 0D
0C: SUB,  R6_0, M_0, R6_0       ; R6 - M is the result in R6
0D: MOV,  R6_0, C_0             ; result in R6 moved to C

4. AN APPLICATION OF RSA CRYPTOGRAPHY USING THE PROPOSED PROCESSOR
This section briefly reviews RSA cryptography, which is described in detail in paper [15]. Using the proposed processor, the same algorithm is implemented in software instead of the HDL used in paper [15], which makes modifications easy for a non-expert or a beginner.
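The assembly listing for Montgomery multiplication can be mirrored step by step in software. The sketch below is one reading of that listing for 64-bit operands, not the processor's actual semantics: the function name and the way $-M^{-1} \bmod 2^{64}$ is obtained are assumptions, and Python's built-in `pow` with a negative exponent (Python 3.8 or later) stands in for a precomputed constant.

```python
R_BITS = 64
R_MASK = (1 << R_BITS) - 1


def montgomery_single_pass(x, y, m, m_neg_inv):
    """Return x * y * 2**-64 mod m for odd m < 2**64 and x, y < m."""
    r3 = x * y                      # 01: MUL   R3 = X * Y
    r4 = r3 & R_MASK                # 02-03: MOVI + MASK keep the low 64 bits
    r5 = r4 * m_neg_inv             # 04: MUL   R5 = R4 * (-M^-1)
    r4 = r5 & R_MASK                # 05-06: MOVI + MASK give q (low 64 bits)
    r5 = r4 * m                     # 07: MUL   R5 = q * M
    r6 = (r3 + r5) >> R_BITS        # 08-09: ADD then 64-bit right shift
    if r6 >= m:                     # 0A-0B: CMP/JC skip the subtraction if R6 < M
        r6 -= m                     # 0C: SUB
    return r6                       # 0D: MOV   result copied to C


if __name__ == "__main__":
    m = (1 << 64) - 59                                  # an odd 64-bit modulus
    m_neg_inv = (-pow(m, -1, 1 << R_BITS)) & R_MASK     # -M^-1 mod 2**64
    x, y = 0x0123456789ABCDEF, 0xFEDCBA9876543210
    expected = x * y * pow(2, -R_BITS, m) % m
    print(montgomery_single_pass(x, y, m, m_neg_inv) == expected)
```

The low 64 bits of `r3 + r5` are zero by the choice of q, so the right shift is an exact division, and a single conditional subtraction suffices because the shifted value is below $2M$ whenever $X, Y < M$.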
In RSA [18], the modular exponentiation $C = P^E \bmod M$ or $P = C^D \bmod M$ is computed, where P and C are the plain text and the cipher text, and (E, M) and (D, M) are the encryption and decryption keys. Usually, the bit length of P, E, D, and M is 512 or larger. Also, the modular exponentiation is repeatedly computed for fixed E, D, and M, and various P and C. Since the modulo operation is very costly in terms of computing time and hardware resources, Montgomery modular multiplication [19], [20], [21], which does not directly compute the modulo operation, is used. It is an efficient method for calculating modular exponentiation.

Three R-bit numbers X, Y, and M are given, and $(X \cdot Y + q \cdot M) \cdot 2^{-R} \bmod M$ is computed, where the integer q is selected such that the least significant R bits of $X \cdot Y + q \cdot M$ are zero. The value of q can be computed as follows. Let $(-M^{-1})$ denote the minimum non-negative number such that $(-M^{-1}) \cdot M \equiv -1$ (or $2^R - 1$) $\pmod{2^R}$. Since M is odd, $(-M^{-1}) < 2^R$ always holds. Then q can be selected such that $q = ((X \cdot Y) \cdot (-M^{-1}))[R-1, 0]$, where $[R-1, 0]$ denotes the least significant R bits. For such q, the bits $(X \cdot Y + q \cdot M)[R-1, 0]$ are zero. The readers may refer to paper [23] for an example.

Radix-$2^r$ Montgomery multiplication is shown in Algorithm 2. In Algorithm 2, $d = \lceil R/r \rceil$ is the number of digits in the radix-$2^r$ operands. The multiplier Y is partitioned into r-bit digits, and $Y_i$ represents the i-th digit of Y. Therefore, Y is given by $Y = \sum_{i=0}^{d-1} 2^{ri} \cdot Y_i$. After d loop iterations, the R-bit Montgomery multiplication has been computed. Thus, Montgomery multiplication can be computed by multiplication, addition, and shift operations, without modulo operations.

Algorithm 2: Radix-$2^r$ Montgomery Multiplication
radix-$2^r$, $d = \lceil R/r \rceil$, $X, Y, M \in \{0, 1, \ldots, 2^R - 1\}$,
$Y = \sum_{i=0}^{d-1} 2^{ri} \cdot Y_i$, $Y_i \in \{0, 1, \ldots, 2^r - 1\}$,
$(-M^{-1}) \cdot M \equiv -1 \pmod{2^r}$, $-M^{-1} \in \{0, 1, \ldots, 2^r - 1\}$
Input: X, Y, M, $-M^{-1}$
Output: $S = X \cdot Y \cdot 2^{-dr} \bmod M$
1. $S_0 \leftarrow 0$
2. for i = 0 to d-1 do
3.   $q_i \leftarrow ((S_i + X \cdot Y_i) \cdot (-M^{-1})) \bmod 2^r$
4.   $S_{i+1} \leftarrow (X \cdot Y_i + q_i \cdot M + S_i) / 2^r$
5. end for
6. if ($M \le S_d$) then $S \leftarrow S_d - M$ else $S \leftarrow S_d$

Since $X \cdot Y + q \cdot M \equiv X \cdot Y \pmod{M}$, we can write $(X \cdot Y + q \cdot M) \cdot 2^{-R} \bmod M = X \cdot Y \cdot 2^{-R} \bmod M$. The remainder of this section shows how Montgomery modular multiplication is used to compute $C = P^E \bmod M$.
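Algorithm 2 can be checked numerically with a short Python sketch. It is an illustration under assumptions rather than the paper's implementation: the function name is hypothetical, $-M^{-1} \bmod 2^r$ is derived with Python's `pow(m, -1, 2**r)` (Python 3.8 or later) instead of being supplied as an input, and the operands are assumed to satisfy $X, Y < M$ with M odd.

```python
def montgomery_radix(x, y, m, r_bits, r):
    """Radix-2**r Montgomery product: x * y * 2**(-d*r) mod m, d = ceil(R/r)."""
    d = -(-r_bits // r)                         # number of r-bit digits of Y
    mask = (1 << r) - 1
    m_neg_inv = (-pow(m, -1, 1 << r)) & mask    # -M^-1 mod 2**r
    s = 0                                       # line 1: S_0 = 0
    for i in range(d):                          # line 2
        y_i = (y >> (r * i)) & mask             # i-th digit of Y
        q = ((s + x * y_i) * m_neg_inv) & mask  # line 3
        s = (x * y_i + q * m + s) >> r          # line 4: exact division by 2**r
    if m <= s:                                  # line 6: final subtraction
        s -= m
    return s


if __name__ == "__main__":
    m = (1 << 128) - 159                        # an odd 128-bit modulus
    x = 0x0123456789ABCDEF0123456789ABCDEF
    y = 0x0FEDCBA9876543210FEDCBA987654321
    d = -(-128 // 17)
    expected = x * y * pow(2, -d * 17, m) % m
    print(montgomery_radix(x, y, m, 128, 17) == expected)
```

Each iteration keeps the running sum below $X + M$, which is why a single conditional subtraction at the end is enough to bring the result into the range $[0, M)$.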
Suppose $C = P^E \bmod M$ needs to be computed. For simplicity, assume that E is a power of two. Since R and M are fixed, assume also that $2^{2R} \bmod M$ is computed beforehand. First, compute $P \cdot (2^{2R} \bmod M) \cdot 2^{-R} \bmod M = P \cdot 2^R \bmod M$ using the Montgomery modular multiplication. Then compute the square $(P \cdot 2^R \bmod M) \cdot (P \cdot 2^R \bmod M) \cdot 2^{-R} \bmod M = P^2 \cdot 2^R \bmod M$. By repeating this square computation with the Montgomery modular multiplication, $P^E \cdot 2^R \bmod M$ is obtained. After that, a multiplication by 1 is performed, that is, $(P^E \cdot 2^R \bmod M) \cdot 1 \cdot 2^{-R} \bmod M = P^E \bmod M$ is computed. In this way, the cipher text C is obtained.

Algorithm 3 shows the modular exponentiation using the Montgomery multiplication of Algorithm 2. In Algorithm 3, $E_b$ represents the size of E in bits. The inputs $2^{2dr} \bmod M$ and $-M^{-1}$ are given. To use Montgomery modular multiplication, C and P are converted from 1 and P in the 1st and 2nd lines, respectively. Lines 1, 2, 4, 5, and 7 of Algorithm 3 can be computed using the Montgomery multiplication of Algorithm 2.

Algorithm 3: Modular Exponentiation
$0 \le E \le 2^{E_b} - 1$, $E = \sum_{i=0}^{E_b - 1} 2^i \cdot E_i$, $E_i \in \{0, 1\}$
Input: P, E, M, $-M^{-1}$, $2^{2dr} \bmod M$
Output: $C = P^E \bmod M$
1. $C \leftarrow (2^{2dr} \bmod M) \cdot 1 \cdot 2^{-dr} \bmod M$;
2. $P \leftarrow (2^{2dr} \bmod M) \cdot P \cdot 2^{-dr} \bmod M$;
3. for i = $E_b - 1$ downto 0 do
4.   $C \leftarrow C \cdot C \cdot 2^{-dr} \bmod M$;
5.   if $E_i = 1$ then $C \leftarrow C \cdot P \cdot 2^{-dr} \bmod M$;
6. end for
7. $C \leftarrow C \cdot 1 \cdot 2^{-dr} \bmod M$;

Let {A : B} denote the concatenation of A and B. For example, if $A = (FF)_{16}$ and $B = (EC)_{16}$, then $\{A : B\} = (FFEC)_{16}$. Algorithm 4 is an improved version of Algorithm 2. Considering the features of the target Virtex-6 FPGA, radix-$2^{17}$ is selected. Let R denote the size of the Montgomery multiplier operands X, Y, and M. Also, $d = \lceil R/17 \rceil$ is the number of radix-$2^{17}$ digits of the operands. In the algorithm, the condition $17d \ge R + 3$ is introduced so that the subtraction in the 6th line of Algorithm 2 can be omitted. If the condition is satisfied, it is guaranteed that at least three 0 bits are padded to the most significant bits of the most significant digit as redundancy. Due to the stringent page limitation, the proof is omitted. However, $M \ge C$ is always satisfied in the modular exponentiation shown in Algorithm 3. Further, in practical RSA encryption, the operand size is a power of two, such as 512, 1024, 2048, or 4096 bits. For the radix-$2^{17}$ system with $d = \lceil R/17 \rceil$, the condition $17d \ge R + 3$ is satisfied for the 512-, 1024-, and 2048-bit sizes (for example, $R = 1024$ gives $d = 61$ and $17 \cdot 61 = 1037 \ge 1027$). For $R = 4096$, $d = 241$ gives $17 \cdot 241 = 4097 < 4099$, so the condition is not satisfied as is.
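The flow of Algorithm 3 can be sketched in Python as follows. This is a minimal sketch under assumptions: `montgomery_product` is a plain full-width Montgomery product standing in for Algorithm 2, `k` plays the role of $dr$, the exponent size $E_b$ is taken to be `e.bit_length()`, and the constants $2^{2k} \bmod M$ and $-M^{-1}$ are computed on the fly with Python's `pow` (Python 3.8 or later) rather than supplied as inputs.

```python
def montgomery_product(a, b, m, k, m_neg_inv):
    """Return a * b * 2**-k mod m for odd m < 2**k and a, b < m."""
    mask = (1 << k) - 1
    t = a * b
    q = ((t & mask) * m_neg_inv) & mask
    s = (t + q * m) >> k
    return s - m if s >= m else s


def mod_exp(p, e, m, k):
    """Compute p**e mod m with the structure of Algorithm 3."""
    m_neg_inv = (-pow(m, -1, 1 << k)) & ((1 << k) - 1)
    r2 = pow(2, 2 * k, m)                                  # 2**(2k) mod M, precomputed
    c = montgomery_product(r2, 1, m, k, m_neg_inv)         # line 1: 1 in Montgomery form
    pm = montgomery_product(r2, p, m, k, m_neg_inv)        # line 2: P in Montgomery form
    for i in range(e.bit_length() - 1, -1, -1):            # line 3
        c = montgomery_product(c, c, m, k, m_neg_inv)      # line 4: square
        if (e >> i) & 1:                                   # line 5: multiply when E_i = 1
            c = montgomery_product(c, pm, m, k, m_neg_inv)
    return montgomery_product(c, 1, m, k, m_neg_inv)       # line 7: leave Montgomery form


if __name__ == "__main__":
    m = (1 << 128) - 159          # an odd 128-bit modulus, for illustration only
    p = 0x00112233445566778899AABBCCDDEEFF
    print(mod_exp(p, 65537, m, 136) == pow(p, 65537, m))
```

Here `k = 136` corresponds to eight radix-$2^{17}$ digits for a 128-bit modulus. Every step inside the loop is a Montgomery product, so no explicit modulo operation appears in the main computation.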
If the condition $17d \ge R + 3$ is not satisfied, one redundant digit needs to be appended at the most significant digit.

Algorithm 4: Montgomery Algorithm
radix-$2^{17}$, $d = \lceil R/17 \rceil$, $17d \ge R + 3$,
$X, Y, M, S \in \{0, 1, \ldots, 2^R - 1\}$,
$-M^{-1}, \alpha, \beta, \gamma, C_\alpha, C_\beta \in \{0, 1, \ldots, 2^{17} - 1\}$, $C_\gamma, C_S \in \{0, 1\}$,
$X = \sum_{i=0}^{d-1} 2^{17i} \cdot X_i$, $X_i \in \{0, 1, \ldots, 2^{17} - 1\}$, $X_d = 0$,
$Y = \sum_{i=0}^{d-1} 2^{17i} \cdot Y_i$, $Y_i \in \{0, 1, \ldots, 2^{17} - 1\}$,
$M = \sum_{i=0}^{d-1} 2^{17i} \cdot M_i$, $M_i \in \{0, 1, \ldots, 2^{17} - 1\}$, $M_d = 0$,
$S_i = \sum_{j=0}^{d-1} 2^{17j} \cdot S_{(i,j)}$, $S_{(i,j)} \in \{0, 1, \ldots, 2^{17} - 1\}$, $S_{(i,d)} = 0$
Input: X, Y, M, $-M^{-1}$
Output: $S = X \cdot Y \cdot 2^{-17d} \bmod M$
1. $S_0 \leftarrow 0$
2. for i = 0 to d-1 do
3.   $q_i \leftarrow ((X_0 \cdot Y_i + S_{(i,0)}) \cdot (-M^{-1})) \bmod 2^{17}$
4.   $C_\alpha, C_\beta, C_\gamma, C_S \leftarrow 0$
5.   for j = 0 to d do
6.     $\{C_\alpha : \alpha\} \leftarrow X_j \cdot Y_i + C_\alpha$
7.     $\{C_\beta : \beta\} \leftarrow q_i \cdot M_j + C_\beta$
8.     $\{C_\gamma : \gamma\} \leftarrow \alpha + \beta + C_\gamma$
9.     $\{C_S : S_{(i+1,j-1)}\} \leftarrow \gamma + S_{(i,j)} + C_S$
10.   end for
11. end for

Algorithm 4 is a radix-$2^{17}$ digit-serial Montgomery algorithm derived from Algorithm 2. In other words, one 17-bit digit is processed every clock cycle. For this reason, the operands X, Y, M, and S are split into 17-bit digits $X_j$, $Y_i$, $M_j$, and $S_{(i,j)}$, respectively. The loop from the 2nd to the 11th line of Algorithm 4 corresponds to the 2nd to 5th lines of Algorithm 2. Comparing the two algorithms, $S_{i+1} \leftarrow (X \cdot Y_i + q_i \cdot M + S_i) / 2^r$ in the 4th line of Algorithm 2 corresponds to the digit-serial processing in the 4th to 10th lines of Algorithm 4. In Algorithm 4, $C_\alpha$, $C_\beta$, $C_\gamma$, and $C_S$ are carries that are added in the next loop iteration: $C_\alpha$ and $C_\beta$ are 17-bit carries for the 17-bit MACC, and $C_\gamma$ and $C_S$ are 1-bit carries for the 17-bit addition. For example, in the 6th line, $X_j$ and $Y_i$ are multiplied and the 17-bit carry $C_\alpha$ is added, giving a 34-bit result. The upper 17 bits of the result form the carry $C_\alpha$, which is added in the next iteration. The lower 17 bits form $\alpha$, which is used in the 8th line. Carries are kept in both the 17-bit MACC and the 17-bit adder to prevent a long carry chain, which would cause circuit delay.
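The digit-serial processing of Algorithm 4 can be traced in software with the Python sketch below. It is a sketch under assumptions, not the hardware design: the function names are hypothetical, $-M^{-1} \bmod 2^{17}$ is derived with `pow` (Python 3.8 or later), a redundant digit is appended whenever $17d < R + 3$, and the inputs are assumed to satisfy $X, Y < M$ with M odd. Because the final subtraction is omitted, the returned value may lie in $[M, 2M)$ and is congruent to the expected product modulo M.

```python
DIGIT_BITS = 17
DIGIT_MASK = (1 << DIGIT_BITS) - 1


def digit_count(r_bits):
    """d = ceil(R/17), plus one redundant digit if 17*d < R + 3."""
    d = -(-r_bits // DIGIT_BITS)
    return d if DIGIT_BITS * d >= r_bits + 3 else d + 1


def split(value, count):
    """Radix-2**17 digits, least significant first."""
    return [(value >> (DIGIT_BITS * k)) & DIGIT_MASK for k in range(count)]


def montgomery_digit_serial(x, y, m, r_bits):
    """Digit-serial Montgomery product; result is congruent to x*y*2**(-17d) mod m."""
    d = digit_count(r_bits)
    xs, ys, ms = split(x, d + 1), split(y, d), split(m, d + 1)   # X_d = M_d = 0
    m_neg_inv = (-pow(m, -1, 1 << DIGIT_BITS)) & DIGIT_MASK
    s = [0] * (d + 1)                                            # S_(i,d) = 0
    for i in range(d):
        q = ((xs[0] * ys[i] + s[0]) * m_neg_inv) & DIGIT_MASK    # line 3
        ca = cb = cg = cs = 0                                    # line 4
        nxt = [0] * (d + 1)
        for j in range(d + 1):                                   # line 5
            t = xs[j] * ys[i] + ca                               # line 6
            ca, alpha = t >> DIGIT_BITS, t & DIGIT_MASK
            t = q * ms[j] + cb                                   # line 7
            cb, beta = t >> DIGIT_BITS, t & DIGIT_MASK
            t = alpha + beta + cg                                # line 8
            cg, gamma = t >> DIGIT_BITS, t & DIGIT_MASK
            t = gamma + s[j] + cs                                # line 9
            cs, low = t >> DIGIT_BITS, t & DIGIT_MASK
            if j > 0:
                nxt[j - 1] = low        # digit j-1 of S_(i+1)
            # for j == 0 the low digit is zero by the choice of q and is dropped
        s = nxt
    return sum(digit << (DIGIT_BITS * k) for k, digit in enumerate(s))


if __name__ == "__main__":
    m = (1 << 128) - 159
    x = 0x0123456789ABCDEF0123456789ABCDEF
    y = 0x0FEDCBA9876543210FEDCBA987654321
    d = digit_count(128)
    s = montgomery_digit_serial(x, y, m, 128)
    print(s % m == x * y * pow(2, -DIGIT_BITS * d, m) % m)
```

Each inner iteration uses only 17-bit by 17-bit products and short additions, with the carries handed to the next iteration, which reflects the reason given above for keeping carries in both the MACC and the adder.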
The proposed flexible-length-arithmetic processor architecture is used to implement the modular exponentiation algorithm and is evaluated on a Xilinx Virtex-6 XC6VLX240T-3FF1156, programmed in software and synthesized with Xilinx ISE Foundation 13.4. Table I and Table II show the synthesis results for this work. For the benefit of readers, recall the optimal implementation [15], which is evaluated on a Xilinx Virtex-6 FPGA XC6VLX240T-1, programmed in the hardware description language Verilog HDL, and synthesized with Xilinx ISE Foundation 11.4. Note that the optimal implementation, written in HDL, is a specialized design by an expert, so it is difficult for a non-expert, and sometimes even for an expert, to debug or change. The proposed approach, however, lies between software and hardware, which makes debugging easy for non-experts and beginners. This implementation comes close to the optimal one [15]; hence, the implementation in this paper can be said to be almost scalable. However, the optimal one [15] is designed to be implemented in a hardware description language, which makes modifications difficult for a non-expert, because it is a specialized design by an expert.

On the other hand, the implementation of RSA encryption/decryption using the proposed processor architecture is designed to be written in software, more specifically in assembly language; hence, it is easy for a non-expert or a beginner to modify, which makes it flexible. It can even support RSA encryption/decryption with more bits (more than 2048 bits).

5. EXPERIMENTAL RESULTS AND DISCUSSIONS
The proposed flexible-length-arithmetic processor architecture is used to implement the modular exponentiation algorithm and is evaluated on a Xilinx Virtex-6 XC6VLX240T-3FF1156, programmed in software and synthesized with Xilinx ISE Foundation 13.4. Table I and Table II show the synthesis results on Virtex-6 for this work in comparison with the MicroBlaze soft processor.

Table I compares the resources used to implement RSA cryptography. The proposed implementation uses 4 block RAMs and 1 DSP slice. On the other hand, 70 block RAMs and 3 DSP slices are required for the same implementation using the MicroBlaze soft processor. Hence, this work is more resource efficient.

Table II compares the execution time of RSA cryptography. The proposed implementation requires 0.42 ms. On the other hand, the MicroBlaze soft processor needs 152.28 ms for the same implementation. Hence, this work is also more efficient in terms of execution time.

Based on the results in Table I and Table II, this implementation of RSA encryption/decryption using the proposed processor is significantly more efficient (fewer resources, less time) than the MicroBlaze soft processor. The designed processor architecture can also be used to implement 256-bit, 512-bit, 1024-bit, and even more than 2048-bit RSA encryption/decryption without further modifications. Hence, it is flexible.
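The execution times reported in Table II appear consistent with dividing the cycle counts by the clock frequencies in Table I. The short Python sketch below reproduces that arithmetic; the relation time = cycles / frequency is an assumption inferred from the reported numbers rather than stated in the paper, and the dictionary layout is purely illustrative.

```python
def execution_time_ms(cycles, clock_mhz):
    """Execution time in milliseconds for a cycle count at a clock given in MHz."""
    return cycles / (clock_mhz * 1e3)


# Cycle counts from Table II and clock frequencies from Table I.
CLOCK_MHZ = {"microblaze": 150.0, "designed": 299.90}
CYCLES = {
    64: {"microblaze": 3081947, "designed": 34116},
    128: {"microblaze": 22841973, "designed": 127025},
}

if __name__ == "__main__":
    for bits, counts in CYCLES.items():
        t_a = round(execution_time_ms(counts["microblaze"], CLOCK_MHZ["microblaze"]), 2)
        t_b = round(execution_time_ms(counts["designed"], CLOCK_MHZ["designed"]), 2)
        # Ratio taken from the rounded times, as the table values suggest.
        print(bits, t_a, t_b, round(t_a / t_b, 1))
```

Rounded to two decimals, the computed times match the 20.55 ms and 152.28 ms reported for MicroBlaze and the 0.11 ms and 0.42 ms reported for the designed processor; the ratios in Table II agree with the quotients of these rounded times up to rounding.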
Table I: Comparison of Resources Used to Implement 128-bit RSA Cryptography (Target FPGA: Virtex-6)

| Resource | Designed Processor: Flexible-Length-Arithmetic Processor | Soft Processor in FPGA: MicroBlaze |
|---|---|---|
| Slices used | 170/301440 | 6984/301440 |
| Block RAMs used | 4/416 | 70/416 |
| DSP (DSP48E1) slices used | 1/768 | 3/768 |
| Clock frequency [MHz] | 299.90 | 150 |

Table II: Comparison of Time Complexity

| Number of bits, R | 64 bits | 128 bits |
|---|---|---|
| A: Soft Processor MicroBlaze, no. of cycles | 3081947 | 22841973 |
| A: Soft Processor MicroBlaze, worst-case execution time [ms] | 20.55 | 152.28 |
| B: Designed Processor (Flexible-Length-Arithmetic Processor), no. of cycles | 34116 | 127025 |
| B: Designed Processor (Flexible-Length-Arithmetic Processor), worst-case execution time [ms] | 0.11 | 0.42 |
| A/B: ratio of execution time | 186.80 | 362.5 |

6. CONCLUSIONS
In this paper, an intermediate approach between software and hardware is presented using DSP slices and block RAMs in FPGAs. More specifically, a flexible-length-arithmetic processor based on the FDFM approach is presented that supports arithmetic operations on numbers with flexibly many bits, even longer than 2048 bits. The potential of this processor is shown through an implementation of the modular exponentiation algorithm, which is compared with an implementation on the 32-bit MicroBlaze soft processor in an FPGA. The results in Table I and Table II show that this work is significantly more efficient in terms of resource usage and execution time. In the future, it can be applied to RSA cryptography with more bits.

7. REFERENCES
[1] "VIRTEX-6 FPGA Memory Resources (V1.5)," Xilinx Inc., 2010.
[2] "VIRTEX 6 ML605 Hardware USER GUIDE (V1.2.1)," Xilinx Inc., 2010.
[3] "VIRTEX-6 FPGA DSP48E1 SLICE USER GUIDE (V1.3)," Xilinx Inc., 2011.
[4] J. Bordim, Y. Ito, and K. Nakano, "Accelerating the CKY parsing using FPGAs," IEICE Transactions on Information and Systems, vol. E86-D, no. 5, pp. 803-810, 2003.
[5] J. L. Bordim, Y. Ito, and K. Nakano, "Instance-specific solutions to accelerate the CKY parsing for large context-free grammars," International Journal on Foundations of Computer Science, vol. 15, no. 2, pp. 403-416, 2004.
[6] Y. Ito and K.
Nakano, "A hardware-software cooperative approach for the exhaustive verification of the Collatz conjecture," in Proc. of International Symposium on Parallel and Distributed Processing with Applications, 2009, pp. 63-70.
[7] K. Nakano and Y. Yamagishi, "Hardware n choose k counters with applications to the partial exhaustive search," IEICE Transactions on Information and Systems, vol. E88-D, no. 7, 2005.
[8] Y. Ito and K. Nakano, "Efficient exhaustive verification of the Collatz conjecture using DSP blocks of Xilinx FPGAs," International Journal of Networking and Computing, vol. 1, no. 1, pp. 19-62, 2011.
[9] K. Nakano and E. Takamichi, "An image retrieval system using FPGAs," IEICE Transactions on Information and Systems, vol. E86-D, no. 5, pp. 811-818, May 2003.
[10] Y. Ago, Y. Ito, and K. Nakano, "An FPGA implementation for neural networks with the FDFM processor core approach," International Journal of Parallel, Emergent and Distributed Systems, vol. 28, no. 4, pp. 308-320, 2013.
[11] Y. Ito and K. Nakano, "Low-latency connected component labeling using an FPGA," International Journal of Foundations of Computer Science, vol. 21, no. 03, pp. 405-425, 2010.
[12] X. Zhou, N. Tomagou, Y. Ito, and K. Nakano, "Efficient Hough transform on the FPGA using DSP slices and block RAMs," in Proc. of International Parallel and Distributed Processing Symposium Workshops, May 2013, pp. 771-778.
[13] "VIRTEX-6 FPGA DSP48E1 SLICE USER GUIDE (V1.2)," Xilinx Inc., 2009.
[14] Y. Ago, A. Inoue, K. Nakano, and Y. Ito, "The parallel FDFM processor core approach for neural networks," in Proc. of International Conference on Networking and Computing, December 2011, pp. 113-119.
[15] S. Bo, K. Kawakami, K. Nakano, and Y. Ito, "An RSA encryption hardware algorithm using a single DSP Block and single Block RAM on the FPGA," International Journal of Networking and Computing, vol. 1, no. 2, pp. 277-289, July 2011.
[16] Y. Ito, K. Nakano, and S. Bo, "The parallel FDFM processor core approach for CRT-based RSA decryption," International Journal of Networking and Computing, vol. 2, pp. 56-78, 2012.
[17] K. Nakano, K. Kawakami, and K. Shigemoto, "RSA encryption and decryption using the redundant number system on the FPGA," in Proc. IEEE International Symposium on Parallel and Distributed Processing, May 2009, pp. 1-8.
[18] R. L. Rivest, A. Shamir, and L. Adleman, "A method for obtaining digital signatures and public-key cryptosystems," Commun. ACM, vol. 21, no. 2, pp. 120-126, 1978.
[19] T. Blum and C. Paar, "Montgomery modular exponentiation on reconfigurable hardware," in Proc. of the 14th IEEE Symposium on Computer Arithmetic, 1999, pp. 70-77.
[20] P. L. Montgomery, "Modular multiplication without trial division," Math. of Comput., vol. 44, pp. 519-521, 1985.
[21] A. F. Tenca and C. K. Koc, "A scalable architecture for Montgomery multiplication," in Proc. of the First International Workshop on Cryptographic Hardware and Embedded Systems, 1999, pp. 94-108.
[22] M. Niimura and Y. Fuwa, "Improvement of radix-2^k signed-digit number for high speed circuit," Formalized Mathematics, vol. 11, no. 2, pp. 133-137, January 2003.
[23] Md. Nazrul Islam Mondal, Kohan Sai, K. Nakano, and Y. Ito, "A Flexible-Length-Arithmetic Processor Using Embedded DSP Slices and Block RAMs in FPGAs," in Proc. of the First International Symposium on Computing and Networking (CANDAR'13), pp. 75-84, December 2013.

IJCA™: www.ijcaonline.org