# The implementation and performance/cost/power analysis of the network security accelerator on SoC applications

Exclusive-OR and multiplication operations in the finite field. Several hardware implementations have been reported: C.C. Wang [7] designed a VLSI architecture for computing multiplications and inverses, A.V. Dinh [8] implemented a low-latency architecture for computing multiplicative inverses and divisions, and Jing, M.H. [9] designed a fast inverse module for AES.

## 3 Hardware accelerator implementation

### 3.1 Hardware acceleration platform

To integrate the hardware accelerators into an SoC, we chose an ARM-based CPU platform and followed the ARM7TDMI coprocessor interface [10] for maximum compliance. Figure 1 is the block diagram of the integrated system and shows the I/O pins of the coprocessors connected to the ARM7TDMI. The coprocessor nCPI, CPA and CPB pins connect to the ARM7TDMI's nCPI, CPA and CPB pins respectively, and all components connect to the system bus (which may be AMBA).

Figure 1 - Integrated system (ARM7TDMI core, CoP0, CoP1 and memory on the system bus; nCPI and CPA/CPB connect the core to the coprocessors)

Figure 2 - DES iteration architecture (input, initial permutation, round function f with round key K_n, inverse initial permutation, output)

CoP0 is coprocessor 0, the DES accelerator, and CoP1 is the AES accelerator.

### 3.2 The design of DES hardware accelerator

The core of the DES computation is 16 iterations of

$$L_n = R_{n-1} \tag{1}$$

$$R_n = L_{n-1} \oplus f(R_{n-1}, K_n) \tag{2}$$

We therefore considered two ways to implement it:

1. Fully parallel design: duplicate the hardware circuit for equations (1) and (2) 16 times and finish the computation in 1 cycle.
2. Sequential design: use a single hardware circuit and run it for 16 rounds to finish the computation.

The fully parallel design is the fastest, since it finishes the computation in a single cycle. However, its cost is too high for the benefit gained, and its long combinational path may become the critical path of the whole system and lower the system clock speed. For this reason, we chose the sequential design for our hardware accelerator.
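The sequential design can be illustrated in software. The following is a minimal Python sketch, not the authors' Verilog: it applies equations (1) and (2) once per loop iteration, the way the single round circuit is reused over 16 cycles. The round function `toy_f` and the round keys are hypothetical placeholders (the real DES f uses an expansion, S-boxes and a permutation, and real round keys are 48 bits), and the initial and inverse initial permutations are omitted.

```python
MASK32 = 0xFFFFFFFF


def toy_f(right, key):
    """Placeholder round function. This is NOT the DES f; it only mixes bits."""
    return ((right ^ key) * 0x9E3779B1 + (right >> 7)) & MASK32


def feistel_16(block, round_keys, f=toy_f):
    """Run 16 Feistel rounds on a 64-bit block, one round per loop iteration."""
    if len(round_keys) != 16:
        raise ValueError("expected 16 round keys")
    left, right = (block >> 32) & MASK32, block & MASK32
    for key in round_keys:
        # Equation (1): L_n = R_{n-1}
        # Equation (2): R_n = L_{n-1} XOR f(R_{n-1}, K_n)
        left, right = right, left ^ f(right, key)
    # Swap the halves at the end, as DES does for its pre-output block R16 L16.
    return (right << 32) | left


# Hypothetical round keys, for illustration only.
round_keys = [(0x0F1E2D3C * (i + 1)) & MASK32 for i in range(16)]

plain = 0x0123456789ABCDEF
cipher = feistel_16(plain, round_keys)
# Deciphering reuses the same datapath with the round keys in reverse order.
recovered = feistel_16(cipher, round_keys[::-1])
```

Because the same loop body serves both directions, only the order of the round keys changes between enciphering and deciphering, which is what makes a single shared round circuit sufficient.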
Figure 2 shows the architecture of our design; each 64-bit block takes 16 cycles. Figure 3 shows the control flow for the AddRoundKey() function. Figure 4 shows the whole hardware design, which has two parts: the key schedule first produces the 16 round keys, one for each round, and the datapath then uses the round keys to encipher or decipher.

Figure 3 - The flow chart of software program (Start; enter input data and input key; key schedule; encryption/decryption; increment counter; decision "Counter > 16" with Yes leading to Get Result and No looping back)
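The figure of 16 cycles per 64-bit block, combined with the 83 MHz clock quoted in Section 4.2, gives a quick estimate of the peak throughput. The Python sketch below is only a back-of-the-envelope check; the function name is hypothetical, and it assumes one block completes every 16 cycles with no key schedule or bus overhead.

```python
def peak_throughput_mbps(block_bits, cycles_per_block, clock_mhz):
    """Peak throughput in Mbit/s when one block completes every cycles_per_block cycles."""
    return block_bits * clock_mhz / cycles_per_block


# 64-bit block, 16 cycles per block, 83 MHz clock.
des_mbps = peak_throughput_mbps(64, 16, 83)
print(des_mbps)  # 332.0
```

The result of 332 Mbps is close to the maximum throughput of 333 Mbps listed in Table 2.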

Figure 4 - DES algorithm block diagram (key schedule, initial permutation IP, round function F, inverse initial permutation IIP)

### 3.3 The design of AES hardware accelerator

The architecture of the AES hardware can also be divided into two parts: the first part performs the key expansion, and the second uses the expanded keys to encipher or decipher. In this hardware we followed Jing, M.H. [9] to implement the multiplier in the finite field, and used table lookup to implement the S-box, which is used heavily in the SubBytes() and SubWord() functions. Figure 5 shows the block diagram of the AES hardware accelerator architecture.

Figure 5 - The block diagram of complete circuit of AES (key expansion, encryption module, decryption module)

Figure 6 - AES integrated parallel key expansion (KeyIn inputs; Rot, Sub and Rcon blocks; four multiplexers)

The key expansion is the core computation of AES and is also complex. Because AES supports three key lengths (128, 192 and 256 bits), we use four multiplexers to control the data path, as Figure 6 shows. The first multiplexer passes its left input if the key length is …; otherwise it passes its right input. The second and third multiplexers pass their upper and left inputs respectively for …; otherwise they pass their lower and right inputs respectively. The fourth multiplexer selects its upper input for …, its middle input for …, or its bottom input for ….

## 4 Experimental results and analysis

### 4.1 DES/AES hardware accelerator performance

We compared the hardware performance with pure software solutions written in C and compiled with ADS 1.2 (the ARM C compiler); the results are shown in Table 1. We enciphered packets of three different sizes with each cipher configuration and counted the cycles needed to finish each job. The hardware is about 3000 times faster than the software solutions.
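The paper does not reproduce its C software baseline. To illustrate what a pure software key expansion involves, which is the step the AES accelerator design handles with its four-multiplexer data path, here is a minimal Python sketch that follows FIPS 197 [2]. All function names are hypothetical. Unlike the hardware, which uses a stored lookup table and the inverse module of [9], the sketch builds the S-box from a brute-force GF(2^8) inverse followed by the standard affine transform.

```python
def gf_mul(a, b):
    """Multiply two bytes in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1 (0x11B)."""
    result = 0
    while b:
        if b & 1:
            result ^= a
        a <<= 1
        if a & 0x100:
            a ^= 0x11B
        b >>= 1
    return result


def gf_inv(a):
    """Multiplicative inverse computed as a^254; 0 maps to 0 by convention."""
    result = 1
    for _ in range(254):
        result = gf_mul(result, a)
    return result if a else 0


def rotl8(x, n):
    """Rotate a byte left by n bits."""
    return ((x << n) | (x >> (8 - n))) & 0xFF


def build_sbox():
    """S-box entry = affine transform of the field inverse."""
    sbox = []
    for x in range(256):
        b = gf_inv(x)
        sbox.append(b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63)
    return sbox


SBOX = build_sbox()


def sub_word(w):
    """Apply the S-box to each byte of a 32-bit word."""
    return int.from_bytes(bytes(SBOX[b] for b in w.to_bytes(4, "big")), "big")


def rot_word(w):
    """Rotate a 32-bit word left by one byte."""
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF


def expand_key(key):
    """Expand a 16-, 24- or 32-byte key into 4 * (Nr + 1) round-key words."""
    nk = len(key) // 4
    if len(key) % 4 or nk not in (4, 6, 8):
        raise ValueError("key must be 16, 24 or 32 bytes")
    nr = nk + 6
    words = [int.from_bytes(key[4 * i:4 * i + 4], "big") for i in range(nk)]
    rcon = 1
    for i in range(nk, 4 * (nr + 1)):
        temp = words[i - 1]
        if i % nk == 0:
            temp = sub_word(rot_word(temp)) ^ (rcon << 24)
            rcon = gf_mul(rcon, 2)
        elif nk > 6 and i % nk == 4:
            temp = sub_word(temp)  # extra SubWord step used only for 256-bit keys
        words.append(words[i - nk] ^ temp)
    return words


# Example key from FIPS 197, Appendix A.1.
round_words = expand_key(bytes.fromhex("2b7e151628aed2a6abf7158809cf4f3c"))
```

The branches on `nk` inside `expand_key` play the role that the multiplexers play in hardware: the key length decides which path each new word takes.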

### 4.2 Cost and power consumption

Our hardware accelerators are implemented in Verilog RTL and synthesized with Synopsys Design Analyzer™ using the TSMC 0.35 µm process. The DES and AES results are shown in Table 2 and Table 3 respectively. The area (gate count) of the DES module is … and the cycle time is … ns. Compared with the Inventra™ DES encryption core [11], a commercial product, our cycle time is 2.06 ns longer but our area is about 1000 gates (25%) smaller. The comparison is shown in Table 4. The area of the AES module is … gates, the cycle time is … ns, and the maximum throughput is 897 Mbps. Compared with the Ocean Logic™ (OL) [12] AES module (Table 5), our design is about 3.63 times faster in cycle count. Because the OL AES module uses a 0.18 µm ASIC process and runs at 200 MHz, the throughput is about the same.

We also implemented an ARM-compatible CPU core for the platform; its power consumption is … mW. The CPU+DES module, running at 83 MHz with a 3.3 V core voltage, consumes … mW. The CPU+AES module, running at 70 MHz with a 3.3 V core voltage, consumes … mW. The DES and AES modules add 12.86% and 64.16% of power respectively.

The performance/cost ratio shows that spending 2978/74345 gates and a little power to gain a 3000-fold speedup is a worthwhile choice. The accelerator does add some power consumption to the SoC system. From the hardware view it consumes more power, but from the task view it shortens the computing time so much that it may save more energy than it consumes. For example, if the hardware accelerator needs 1 second to finish a DES/AES task and the software solution needs up to 3000 seconds, the energy consumed by the system with the accelerator is far smaller than that of the system without it.
According to Amdahl's Law,

$$\text{Execution time after improvement} = \frac{\text{Execution time affected by improvement}}{\text{Amount of improvement}} + \text{Execution time unaffected}$$

and

$$\text{Energy} = \text{power} \times \text{cycle count} \times \text{cycle time}.$$

Together these show that if DES accounts for more than 11.4% of the whole program, the system energy consumption goes down. For AES the ratio is 39.95%. Table 6 shows the details for each module.

Table 1 - The hardware/software performance comparison (columns: Software, Hardware; rows: 48-byte and two other packet sizes)

| DES module synthesis results | Synopsys Design Analyzer, TSMC 0.35 µm |
| --- | --- |
| Gate count | … |
| Power (mW), voltage 3.3 V, frequency 83 MHz | … |
| Cycle time (ns) | … |
| Cycles / block | 16 |
| Max throughput (Mbps) | 333 |

Table 2 - DES module synthesis results with TSMC 0.35 µm process

Table 3 - AES module synthesis results with TSMC 0.35 µm process (rows: gate count; cycle time (ns); power (mW) at 3.3 V; key expansion (KE) cycles; cipher cycles; KE throughput (Mbps); cipher throughput (Mbps))
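The 11.4% break-even ratio quoted from Amdahl's Law in Section 4.2 can be checked numerically. The Python sketch below is a simplified model under stated assumptions, not the authors' calculation: both systems run at the same clock (so cycle time cancels), the extra 12.86% of power is drawn for the whole run, and the 3000-fold speedup applies to the cipher portion only. The function names are hypothetical.

```python
def relative_energy(cipher_fraction, extra_power, speedup):
    """Energy with the accelerator divided by energy without it."""
    # Amdahl's Law: only the cipher portion of the run time is sped up.
    remaining_time = (1 - cipher_fraction) + cipher_fraction / speedup
    # Energy = power * time, with power scaled by (1 + extra_power).
    return (1 + extra_power) * remaining_time


def break_even_fraction(extra_power, speedup):
    """Cipher fraction at which relative_energy(...) equals exactly 1."""
    return (1 - 1 / (1 + extra_power)) / (1 - 1 / speedup)


# DES: 12.86% extra power, 3000-fold speedup.
des_threshold = break_even_fraction(0.1286, 3000)
print(f"{des_threshold:.1%}")  # 11.4%
```

Above this fraction the accelerated system uses less energy in total despite drawing more power. The sketch checks only the DES figure; the AES threshold depends on measured values that this simplified model does not capture.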

Table 4 - Comparison with the Inventra™ core (columns: gate count, cycle time (ns); rows: our core, Inventra™ core)

Table 5 - Cycle-count comparison with the OL module (OL_KEXP_ED, OL_AES_ED, our key expansion, our AES, improvement (OL / our design); average improvement 3.63)

Table 6 - Minimal ratio of feature related computation (columns: module, minimal ratio; four ARM-based configurations)

## 5 Conclusion

We present performance/cost/power information about network security hardware accelerators on an SoC. To measure it, we implemented DES/AES modules using the ARM7TDMI coprocessor interface on an ARM-based SoC. Spending a little cost to gain much more performance (over 3000 times faster) is a worthwhile investment. The DES/AES accelerators run at 83 and 70 MHz with a 3.3 V core voltage and add 12.86% and 64.16% of power consumption respectively. However, if DES accounts for more than 11.4% of the whole program, or AES for more than 39.95%, the system energy consumption goes down because the computing time is reduced.

## 6 References

[1] NIST, Data Encryption Standard, FIPS PUB 46-3, October 25, 1999

[2] NIST, Advanced Encryption Standard, FIPS PUB 197, November 26, 2001

[3] R. Rivest, A. Shamir and L. Adleman, A Method for Obtaining Digital Signatures and Public-Key Cryptosystems, Communications of the ACM, 21 (2), pp. …, February …

[4] …

[5] …

[6] Joan Daemen, Vincent Rijmen, AES Proposal: Rijndael, Document Version 2, May 9, …

[7] C.C. Wang, T.K. Truong, H.M. Shao, L.J. Deutsch, J.K. Omura, and I.S. Reed, VLSI architecture for computing multiplications and inverse in GF(2^m), IEEE Transactions on Computers, Volume C-34, No. 6, August 1985

[8] A.V. Dinh, R.J. Bolton, and R. Mason, A Low Latency Architecture for Computing Multiplicative Inverses and Divisions in GF(2^m), IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, Volume 48, No. 8, August 2001

[9] Jing, M.H.; Chen, Y.H.; Chang, Y.T.; Hsu, C.H., The design of a fast inverse module in AES, International Conferences on Info-tech and Info-net, 2001, Volume 3, Page(s): …

[10] ARM7TDMI Data Sheet

[11] Inventra™, DES-core, DES Encryption core

[12] Ocean Logic™, OL_AES AES Core family Rev 1.4

