Hardware-Software Codesign in Embedded Asymmetric Cryptography Application - a Case Study

Martin Šimka 1, Viktor Fischer 2, and Miloš Drutarovský 1

1 Department of Electronics and Multimedia Communications, Technical University of Košice, Park Komenského 13, 04120 Košice, Slovakia
{Martin.Simka, Milos.Drutarovsky}@tuke.sk
2 Laboratoire Traitement du Signal et Instrumentation, Unité Mixte de Recherche CNRS 5516, Université Jean Monnet, Saint-Etienne, France
fischer@univ-st-etienne.fr

Abstract. This paper presents a case study of a hardware-software codesign of the RSA cipher embedded in reconfigurable hardware. The 16- and 32-bit soft cores of Altera's Nios RISC processor are used as the basic building blocks of the proposed complete embedded solutions. The effect of moving computationally intensive parts of RSA into one or more optimized, parameterized, scalable Montgomery coprocessors is analyzed and compared with a pure software (but still embedded) solution. Advantages of the scalable solution are presented and discussed. The impact of the distribution of tasks between hardware and software on the occupation of logic resources and on the speed of the algorithm is demonstrated and generalized. The study yields a number of guidelines generally applicable to similar designs.

1 Introduction

Cryptography plays an important role in today's information security problems. The security of a system can be enhanced if it is embedded in a hardware chip. Using reconfigurable hardware can be even more advantageous, since numerous standards, protocols and algorithms need to be implemented (although almost never simultaneously) in the same piece of hardware. Another advantage of using a reconfigurable system on a chip (R-SOC) to implement cryptographic algorithms lies in the security of the solution: embedded systems are harder to tap, to decompose and to attack in general.
Asymmetric cryptography (also called public key cryptography) is based on the difficulty of factoring large numbers (e.g. RSA) [1] or on the difficulty of calculating discrete logarithms in a finite field (e.g. ElGamal) [2]. To increase the speed of the encryption/decryption process, the algorithms are most often implemented as a hardware component based on a parallel array of processing elements [4], [5]. However, these components do not implement the cryptographic protocol or key management.

The protocols in public key cryptography are an excellent example for studying the hardware-software codesign concept: the protocol and the key generation and management process are strongly sequential, while the algorithm itself (e.g. modular multiplication of large numbers) is better realized in parallel and pipelined structures. The conflict between the complicated sequential nature of the protocol and the high computational needs of the algorithm, together with the need to adapt the system to various protocols and algorithms, is hard to resolve within a conventional hardware or software concept. Hardware structures are generally fast enough, but they are not suitable for algorithm sequencing and cannot be adapted to algorithm changes. Software adapts more easily, but it is much slower and less secure. An R-SOC offers the best solution: it can consist of an embedded processor and one or more coprocessors, and some or all parts of the system can be reconfigurable.

However, a reconfigurable system on a chip has an extra aspect to be taken into account: both the hardware and the software part of the system are embedded in the same chip. So even an entirely software solution occupies hardware resources inside the chip: logic elements for the processor implementation and, above all, embedded memory for data and program storage. In this paper we study the effect of moving the barrier between the software and the hardware, from an entirely software (but still embedded) solution to an almost entirely hardware solution, on an RSA encryption/decryption scheme. The principle can be easily extended to other asymmetric cryptographic algorithms, including modern elliptic curve algorithms.

The paper is organized as follows. In the next section, the RSA algorithm and the implementation problems of Montgomery modular multiplication are discussed. In the third section we present our parameterized cryptographic coprocessor concept.
The fourth section concerns the choice of an embedded processor and its internal structure. Problems affecting the interfacing of the processor and the coprocessor are also analyzed in this section. In the fifth section we discuss the effect of moving selected tasks from software to hardware on the area occupation and on the speed of the system. Concluding remarks are presented in the final section.

2 RSA and Montgomery multiplication implementation aspects

RSA was proposed by Rivest, Shamir, and Adleman in 1978 [1]. The private key of a user consists of two large primes $p$ and $q$ and a secret exponent $D$. The public key consists of the modulus

    $M = pq$    (1)

and an exponent $E$ such that it satisfies:

    $\mathrm{GCD}(E, (p-1)(q-1)) = 1$    (2)

The secret key $D$ is chosen such that:

    $D = E^{-1} \bmod (p-1)(q-1)$    (3)
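Relations (1)-(3) can be checked with a small numeric sketch. The primes, the exponent and the message below are toy values chosen only for illustration (they are not taken from the paper and are far too small for real use); Python's built-in big integers stand in for the long-precision arithmetic.

```python
from math import gcd

# Toy parameters (illustrative assumptions, not secure and not from the paper).
p, q = 61, 53
M = p * q                    # modulus, equation (1)
phi = (p - 1) * (q - 1)

E = 17                       # public exponent
assert gcd(E, phi) == 1      # condition (2)

D = pow(E, -1, phi)          # secret exponent, equation (3); needs Python 3.8+

X = 65                       # message, 0 <= X < M
Y = pow(X, E, M)             # encryption: Y = X^E mod M
assert pow(Y, D, M) == X     # decryption recovers the message
```

With real key sizes only the lengths of p and q change; the relations stay the same.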

The security of the system rests in part on the difficulty of factoring the published modulus $M$. The basic mathematical operation used by RSA to encrypt a message $X$ is modular exponentiation [2]:

    $Y = X^E \bmod M$    (4)

which binary or general m-ary methods can break into a series of modular multiplications. Decryption is done by calculating:

    $X = Y^D \bmod M$    (5)

All of these computations have to be performed with large $k$-bit integers (typically $k \in \{1024, 2048, \ldots\}$) in order to thwart currently known attacks. To speed up encryption, the use of a short exponent $E$ has been proposed. The International Telecommunications Union (ITU) recommends the Fermat prime $F_4 = 2^{2^4} + 1$. Using $F_4$, only 2 multiplications and 16 squarings are computed (see Algorithm 1). Obviously, the same trick cannot be used for decryption, as the decryption exponent $D$ must be kept secret and in general has $k/2$ non-zero bits. Therefore decryption is much slower.

Algorithm 1 Montgomery exponentiation, R-L binary method
1. $\bar{X} = MM(X, R^2 \bmod M) = XR \bmod M$
2. $A = R \bmod M$
3. for $i = 0$ to $t-1$ do
4.     if $e_i = 1$ then
5.         $A = MM(A, \bar{X})$
6.     $\bar{X} = MM(\bar{X}, \bar{X})$
7. $A = MM(A, 1)$

2.1 Montgomery Multiplication Algorithm

To speed up the modular multiplication and squaring required for the exponentiation in (4) and (5), the well-known Montgomery Multiplication (MM) algorithm [3] is used. It computes the MM product for $k$-bit integers $X$, $Y$:

    $MM(X, Y) = XYR^{-1} \bmod M$    (6)

where $R = 2^k$ and $M$ is an integer in the range $2^{k-1} < M < 2^k$ such that $\mathrm{GCD}(R, M) = 1$. A modular exponentiation is performed by repeated MM. Two common algorithms can be used: the L-R binary method and the R-L binary method (given in Algorithm 1, where $E = (e_{t-1}, \ldots, e_0)_2$ with $e_{t-1} = 1$, and all other variables are $k$-bit integers) [2]. Note that in Algorithm 1 the squaring and the multiplication are independent and may be performed in parallel. The starting point of Algorithm 1 is MM.
While the algorithm is simple and can be controlled by software, the MM is an expensive operation suitable for implementation in an algebraic coprocessor.
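The interplay between the MM product (6) and the R-L binary method of Algorithm 1 can be sketched in software as follows. This is a minimal model under stated assumptions: `mont_mul` uses the textbook bit-serial radix-2 form of Montgomery multiplication, not the word-serial pipelined variant of the coprocessor, and the function names and toy operands are hypothetical.

```python
def mont_mul(x, y, m, k):
    """Return x*y*2^(-k) mod m for odd m < 2^k and 0 <= x, y < m."""
    s = 0
    for i in range(k):
        if (x >> i) & 1:     # scan x bit by bit, add y when the bit is set
            s += y
        if s & 1:            # adding the odd modulus makes s even
            s += m
        s >>= 1              # exact division by 2
    return s - m if s >= m else s   # s < 2m here, one subtraction suffices


def mont_exp(x, e, m, k):
    """Return x^e mod m following the seven steps of Algorithm 1."""
    r = 1 << k                                  # R = 2^k
    x_bar = mont_mul(x, (r * r) % m, m, k)      # step 1: X*R mod M
    a = r % m                                   # step 2
    for i in range(e.bit_length()):             # steps 3-6, right-to-left scan
        if (e >> i) & 1:
            a = mont_mul(a, x_bar, m, k)        # multiplication
        x_bar = mont_mul(x_bar, x_bar, m, k)    # squaring, independent of a
    return mont_mul(a, 1, m, k)                 # step 7: leave Montgomery domain


# Toy check against Python's built-in modular exponentiation.
assert mont_exp(65, 17, 3233, 12) == pow(65, 17, 3233)
```

Inside the loop the multiplication and the squaring read the same `x_bar` but write different variables, which is the independence noted for Algorithm 1.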

Fig. 1. Scalable processing element of the MM coprocessor

3 Parameterized scalable Montgomery multiplication coprocessor

An arithmetic (or cryptographic) unit is called scalable if it can be reused or replicated in order to generate long-precision results independently of the data precision for which the unit was originally designed [4]. Scalability can be used to adapt speed/area constraints to the size and the internal structure of the device. We have tested two different approaches to implementing a scalable processing element (PE): the first (called MWR2MM CSA) is based on a commonly used redundant form built on carry-save adders [4]; the second (called MWR2MM CPA) has an FPLD-optimized architecture [5] based on the carry-propagate structure present in practically all kinds of modern FPLDs (see Figure 1).

The core of both approaches is a modified Multiple Word Radix-2 Montgomery Multiplication (MWR2MM) algorithm [5], which imposes no constraints on the precision of operands. The algorithm performs bit-level computations, produces word-level outputs and provides direct support for a scalable MM coprocessor design. For operands with $k$-bit precision, $e = \lceil k/w \rceil$ words are required (or, for MWR2MM CPA and $(k+2)$-bit precision, $e = \lceil (k+2)/w \rceil$ words). The MWR2MM algorithm scans the operand $Y$ (multiplicand) word-wise and the operand $X$ (multiplier) bit-wise,

so it uses vectors

    $M = (M^{(e-1)}, \ldots, M^{(1)}, M^{(0)})$
    $Y = (Y^{(e-1)}, \ldots, Y^{(1)}, Y^{(0)})$    (7)
    $X = (x_{k-1}, \ldots, x_1, x_0)$

Table 1. Comparison of the PE size and speed for some Altera FPLDs

            Carry Propagate Adders            Carry Save Adders
Family      Length w  Size   Speed            Length w  Size   Speed
            (bits)    (LEs)  (MHz)            (bits)    (LEs)  (MHz)
APEX        8         59     161              8         81     232
            16        115    129              16        161    202
            32        229    99               32        321    170
CYCLONE     8         59     277              8         81     304
            16        115    235              16        161    304
            32        227    221              32        321    304

FPLDs that have dedicated carry logic (e.g. modern Altera and Xilinx FPLDs) offer an optimal implementation of long-precision carry propagate adders (applied in MWR2MM CPA). This PE occupies fewer resources than the PE with MWR2MM CSA, but on the other hand its speed depends significantly on the word width $w$, as shown in Table 1. Moreover, the MWR2MM CPA algorithm requires about 20% fewer EMBs than MWR2MM CSA.

Using the parallelism of the MWR2MM algorithm, a pipelined structure of the coprocessor has been developed. The data path is organized as a cascade chain of PEs (stages) realizing the MWR2MM algorithm and connected to the data memory. The first stage gets data from the memory, performs a computation and propagates the sub-words of $Y$, $M$, and the newly computed sub-results (S for MWR2MM CPA, or 1S and 2S for MWR2MM CSA) to the next stage; the last stage stores data to the memory. The maximum degree of parallelism for this organization is:

    $p_{max} = \lceil e/2 \rceil$    (8)

The computation time of a single MM operation when $n \le p_{max}$ stages are used is:

    $T_{MM} = \frac{1}{f_{clk}} \left( \frac{k^2}{wn} + 2n \right)$    (9)

The MM coprocessor has 3 main parameters ($w$, $e$, and $n$) that can be changed according to the required area of the implemented coprocessor and the required timings for MM computations ($n$, $w$) or the security level ($e$). This approach gives unusual flexibility to the processor-coprocessor codesign. Size limits of

the parameters, which depend for example on the size of the device and/or on the data width of the embedded memory blocks, will be discussed later.

Fig. 2. Block diagram of the Nios processor and the MM coprocessor interconnection

4 Embedded processor and its interfacing with the coprocessor

Nios is a soft-core embedded processor from Altera [11] that includes a CPU optimized for R-SOC integration. This configurable, general-purpose RISC processor can be combined with user-defined logic and programmed into Altera FPLDs. Nios is available in both 16- and 32-bit variants, each with a 16-bit instruction set. The size of the RISC register file can also be chosen as a parameter: configurations with 128, 256 or 512 registers differ in the number of occupied memory blocks. The possibility of adding up to 5 custom instructions to the instruction set of the processor is especially interesting for hardware-software codesign, since operations that have a difficult or long software implementation can be replaced by a custom instruction completed in 1 (combinatorial logic) or several (sequential logic) clock cycles.

The Avalon Bus included in the Nios is a parameterized interface bus used for connecting the Nios and peripherals into a SOC (see Figure 2). The Avalon is an interface that specifies the port connections between master and slave components, and specifies the timing by which these components communicate [11]. Apart from the simple wiring, the Avalon Bus module contains logic which performs these major functions:

- Address decoding, to produce chip-select signals for each peripheral.
- Data bus multiplexing, to transfer data from the selected peripheral to the master.
- Wait-state generation, to add extra clock cycles to read and write accesses when required by the target peripheral.

- Dynamic bus sizing, to automatically execute multiple bus cycles as required to fetch (or store) wide data values from (to) narrow peripherals.
- Interrupt number assignment, to present the correct, prioritized IRQ number to the master when one or more peripherals are requesting an interrupt.

Thanks to these features, the data width of the selected processor (e.g. 16 bits) and of the coprocessor (from 8 to 64 bits) need not be identical. However, some additional clock cycles are needed to convert data during communication between the two components.

Both the coprocessor and the processor data memory are implemented using Embedded Memory Blocks (EMBs) [10]. The EMB offers a dual-port mode, which supports simultaneous reads and writes at two different clock frequencies. When implementing memory, each EMB can be configured in one of the following sizes: 128 × 16, 256 × 8, 512 × 4, 1024 × 2, or 2048 × 1. Since the data width of the memory can vary only in these steps, the parameter $w$ (word width) of the PE should vary in the same steps.

5 Analysis and discussion of selected solutions

To evaluate the software/hardware proportion in the solution and its impact on the size and the speed of the system, we have considered five representative architectures:

- the first is a fully software solution implemented on the 32-bit Nios processor,
- the second is also a software solution, but a hardware-implemented instruction for standard integer multiplication (supported only by the 32-bit Nios) has been added to speed up the execution,
- the third uses the 16-bit Nios processor and the pipelined MM coprocessor,
- the fourth complements the 16-bit Nios with two pipelined MM coprocessors,
- the fifth is a fully hardware solution without the processor.
5.1 Fully software solutions

The time-critical parts of the software implementation (the MM operation) have been programmed in the Nios assembly language, using all known optimization techniques for the target processor. The Separated Operand Scanning MM method [3] was used as the best method for the given Nios RISC architecture. As expected, the first, pure software solution without any hardware support for multiplication has been practically unusable, because in this case 100 clock cycles would be needed to perform one 16 × 16 → 32-bit multiplication. In the second software solution, the 32-bit Nios processor has taken 2583 logic elements (LEs) including a hardware integer multiplier (used by MUL instruction) occupying

446 logic elements and 45 EMBs. Thanks to the hardware support, the MUL instruction performs one 16 × 16 → 32-bit multiplication in 3 clock cycles. Table 2 shows the timings of the RSA operation for the second fully software solution in a Nios clocked at 50 MHz (for encryption $E = F_4$; for decryption the CRT algorithm is used).

Table 2. Execution times of software implementation of RSA on Altera Nios development board [11] (with APEX EP20K200EFC484-2X FPLD clocked at 50 MHz)

Length (e × w)   Method     Encr (ms)   Decr (ms)
1024             SOS32MEM   46          845
2048             SOS32MEM   177         6374

5.2 Processor with one pipelined MM coprocessor

In this version of the design there is no need to implement the 32-bit Nios processor, since multiplication, the most expensive operation, is performed in the coprocessor. Therefore a smaller (16-bit) Nios version (occupying only 1275 LEs and 27 EMBs) has been used. Table 3 presents the RSA timings obtained with the 16-bit MM coprocessor implementing the MWR2MM CPA algorithm, together with the area occupations, which are similar for both lengths ($w$ = 16 bits and $e$ = 64 or 128).

Table 3. Execution times of RSA encryption (using $F_4$) and decryption (with CRT) with MM coprocessor (clocked at 100 MHz) connected to the Nios processor ($f_{clk}$ = 50 MHz)

Length (e × w)   # of stages (n)   Encr (ms)   Decr (ms)   Size (LEs)
1024             2                 14          155         429
1024             4                 10          89          774
1024             8                 8.7         56          1462
1024             16                7.8         39          2837

Length (e × w)   # of stages (n)   Encr (ms)   Decr (ms)
2048             2                 55          1146
2048             4                 42          618
2048             8                 35          354
2048             16                31          222

Note that the times indicated in Table 3 also include the pre-computation of the values X and A performed by the Nios processor. For this reason the overall execution time does not decrease linearly with the number of stages. The MM coprocessor requires extra memory resources to store the sub-results S (this memory is not fully shared with the processor), but on the other hand the program code and the program memory are smaller, since the MM is computed by the coprocessor.
The 16-bit registers of the 16-bit processor are also better suited to implementation in EMBs (128 16-bit registers can be implemented in 2 EMBs), which brings a further saving of EMBs.
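The effect of the number of stages on a single MM operation can be explored with a small calculator. It is a sketch that reads equation (9) as T_MM = (k^2/(wn) + 2n)/f_clk; the function names are hypothetical, and the model covers the coprocessor only, so it does not reproduce the times in Table 3, which also include work done by the Nios processor.

```python
def mm_cycles(k, w, n):
    """Clock cycles of one MM operation, reading equation (9) as k^2/(w*n) + 2n."""
    return k * k / (w * n) + 2 * n


def mm_time_ms(k, w, n, f_clk_hz):
    """Duration of one MM operation in milliseconds at clock frequency f_clk_hz."""
    return mm_cycles(k, w, n) / f_clk_hz * 1e3


# Parameters of Table 3: k = 1024, w = 16 bits, coprocessor clocked at 100 MHz.
for n in (2, 4, 8, 16):
    print(n, mm_cycles(1024, 16, n), mm_time_ms(1024, 16, n, 100e6))
```

In this model each doubling of n nearly halves the MM time as long as the 2n term stays small, so the slower improvement seen in Table 3 is consistent with a fixed software share in the measured times.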

5.3 Processor with two pipelined MM coprocessors

An architecture with two coprocessors can also be applied thanks to the parallelism in Algorithm 1, where A and X inside the loop can be computed independently. Typical decryption exponents D have about k/2 non-zero bits, so a single coprocessor performs about k squarings and k/2 multiplications, i.e. about 1.5k MM operations in sequence. With two separate coprocessors the multiplications overlap with the squarings and about k sequential steps remain, which decreases the average execution time to about 66% of the execution time with one coprocessor of the same size. Similarly, during decryption based on the CRT algorithm, the computations for p and q can be executed in parallel, decreasing the execution time to about 50% of that of the one-coprocessor solution.

However, two coprocessors require twice as many hardware resources (LEs and EMBs). When these resources are available, a better way to exploit hardware concurrency and speed up execution is to double the number of stages in a single coprocessor. In that case twice as many LEs are needed, but the number of EMBs stays unchanged. The number of pipelined stages can be increased up to p max, given by (8). If there are more hardware resources than required for p max, two coprocessors should be used. If not (as for our target device), a single pipelined MM coprocessor with an appropriate number of pipelined stages is the best option from the hardware efficiency point of view.

One further aspect should be mentioned. Hardware efficiency is not the only criterion used for the evaluation of cryptographic hardware. Parallel execution on two coprocessors can potentially increase the resistance of the hardware against side-channel attacks. For this reason two coprocessors (which require more EMBs) can sometimes be optimal.

5.4 Fully hardware solution

Many implementations realizing the whole system as a parallel hardware architecture have been published up to now [8], [9]. Clearly, such solutions are the fastest ones and can be used for high performance systems.
The disadvantage of this kind of solution is that all input data are expected to be already stored in a memory before the computation. In that case even small changes in the implemented protocol may require a redesign of the whole system. Sequential operations such as the precomputation of constants and the control of the computational process are difficult to implement and to modify in hardware. On the other hand, when these operations are controlled by software, the hardware coprocessor does not include a complicated control part and can thus be highly optimized, regular and flexible. Moreover, software control of the process gives the user a very flexible and reusable solution. Therefore we do not see the fully hardware solution as a suitable way to implement a flexible asymmetric encryption algorithm in FPLDs.

6 Conclusions

Parameterized processors embedded in reconfigurable hardware are becoming a standard building block in complex SOC designs. They allow complex sequential algorithms to be coded efficiently, but they are not powerful enough for many typical computationally intensive cryptographic applications. It was demonstrated that executing carefully selected parts of the algorithm in properly optimized coprocessors considerably increases the speed of the complete RSA algorithm. Moreover, it was shown that the hardware resources used in this combined hardware-software design are not more significant than in a pure software (but still embedded) solution, because the combined design can use a simpler (e.g. 16-bit) embedded processor. The parallel use of two or more coprocessors can be advantageous from the security point of view, but one scalable coprocessor with more pipeline stages can in principle reach the same speed. The final embedded system architecture can be adapted to the expected algorithm speed and to the given hardware resources. Thanks to the scalability of the coprocessors, their parameters (e.g. word length, pipeline depth, parallelism) can be modified very easily during the synthesis of the system.

References

1. R. L. Rivest, A. Shamir, and L. Adleman. A method for obtaining digital signatures and public key cryptosystems. Communications of the ACM, 21(2):120-126, February 1978.
2. A. J. Menezes, P. C. van Oorschot, and S. A. Vanstone. Handbook of Applied Cryptography. CRC Press, New York, 1997.
3. C. K. Koc, T. Acar, and B. S. Kaliski Jr. Analyzing and comparing Montgomery multiplication algorithms. IEEE Micro, 16(3):26-33, June 1996.
4. A. F. Tenca and C. K. Koc. A scalable architecture for Montgomery multiplication. In C. K. Koc and C. Paar, editors, Cryptographic Hardware and Embedded Systems, number 1717 in Lecture Notes in Computer Science, pages 94-108, Berlin, Germany, 1999. Springer Verlag.
5. M. Drutarovský, V. Fischer, and M. Šimka. Two Implementation Methods of Scalable Montgomery Coprocessor Embedded in Reconfigurable Hardware. Cryptographic Hardware and Embedded Systems 2003. Submitted.
6. C. K. Koc. RSA hardware implementation. Pages 1-28, August 1995. www.rsa.com.
7. M. Šimka and V. Fischer. Montgomery Multiplication Coprocessor for Altera Nios Embedded Processor. Proceedings of the 5th International Scientific Conference on Electronic Computers and Informatics 2002, pages 206-211, Kosice, Slovakia, October 2002.
8. T. Blum and C. Paar. Montgomery Modular Exponentiation on Reconfigurable Hardware. Proceedings of the 14th IEEE Symposium on Computer Arithmetic (Adelaide, Australia), pages 70-77, 1999.
9. S. E. Eldridge and C. D. Walter. Hardware Implementation of Montgomery's Modular Multiplication Algorithm. IEEE Transactions on Computers, 42(6):693-699, June 1993.
10. APEX 20K Programmable Logic Family, www.altera.com
11. Nios Soft Core Embedded processor, www.altera.com/nios