Hardware Implementation of XTEA

HI-1

Steven M. Aumack, Michael D. Koontz Jr.

Abstract: Important factors to consider when designing a cryptographic system include performance, speed, size, and security. The designer of a cryptosystem often prioritizes these factors according to the specific objectives of the system. The Tiny Encryption Algorithm (TEA) and the Extension of TEA (XTEA) are examples of cryptographic algorithms that were designed with size and simplicity as the main design criteria. Since TEA's presentation to the public in 1994 and XTEA's presentation in 1997, many software implementations of both algorithms have been designed. However, far fewer hardware designs have been implemented. This report details a hardware implementation of XTEA in which speed, rather than size, is the main design priority.

Index Terms: Cryptography, TEA (Tiny Encryption Algorithm), XTEA (Extension of Tiny Encryption Algorithm)

I. INTRODUCTION

The Tiny Encryption Algorithm, or TEA, is a block cipher that was originally designed by David Wheeler and Roger Needham of the Cambridge Computer Laboratory. It was first presented at the Fast Software Encryption Workshop in 1994. TEA was designed with the idea that a cryptographic algorithm could be implemented with smaller code size and less complexity, and still perform as well as or better than other popular cryptographic algorithms such as DES [1].

II. TINY ENCRYPTION ALGORITHM (TEA)

A. Background Information

As discussed above, TEA is a block cipher. A block cipher takes a fixed-size block of plaintext and produces a ciphertext block of identical size. For TEA, the block size is 64 bits. TEA uses a Feistel structure. Cryptographic algorithms that use a Feistel structure have similar operations for encryption and decryption. For example, decryption may require only a reversal of the encryption algorithm or of the key schedule. Such algorithms also use operations such as bit shuffling or linear mixing to help produce an output that is very different from the input. These operations may be repeated for several rounds for increased security. For TEA, the suggested number of rounds is 64. The rounds are implemented in pairs, which results in 32 cycles for one block of plaintext.

TEA uses 32 cycles for encryption or decryption to ensure a higher level of security. This is related to the avalanche effect, which states that in a secure cryptosystem, changing one bit of input should change about half of the output bits. In the case of TEA, roughly six cycles must occur before a change in one input bit affects 32 output bits (half of the 64-bit block). The designers of TEA claim that sixteen cycles should be good enough, but recommend 32 cycles to ensure enough mixing of the output [2].

Since TEA uses a Feistel structure, it uses addition, subtraction, and XOR as reversible operations. The use of addition, subtraction, and XOR operations helps eliminate the need to implement substitution boxes (S-boxes) and permutation boxes (P-boxes) as a part of the design [3].

For security reasons, the key length is set to 128 bits to defeat simple search techniques and to dissuade attackers from brute-force attacks. The key used for TEA is also sometimes called the master key, because in the implementation discussed in the original presentation the master key is subdivided into four derived subkeys, K[0] through K[3].

The key scheduling method used in TEA is also a simple design. For odd rounds, subkeys K[0] and K[1] are used, and for even rounds, subkeys K[2] and K[3] are used. Each cycle also uses a constant called delta as a part of the key scheduling. A different multiple of delta is used in each cycle of TEA. The number delta is derived from the following equation:

$$\delta = (\sqrt{5} - 1) \cdot 2^{31} \qquad (1)$$

Equation (1) yields, after rounding, the number 2654435769, which is 0x9E3779B9 in hexadecimal format.

Figure 1 shows encryption for the TEA algorithm. As previously stated, the only operations performed during encryption are addition, XOR, and left and right shifts. The diagram shows two rounds, or one cycle, of TEA. The Feistel structure of TEA and the four subkeys used for encryption are also easily seen.
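The way delta enters the key schedule can be illustrated with a short C sketch. It assumes 32-bit unsigned arithmetic (`uint32_t`), matching the two 32-bit halves of the 64-bit block; the function name `tea_sum_after` is hypothetical and is not part of the published code.

```c
#include <stdint.h>

#define TEA_DELTA 0x9E3779B9u  /* key schedule constant from equation (1) */

/* Value of the running sum after `cycles` TEA cycles (illustrative helper).
 * Each cycle adds delta once, so cycle i uses i * delta, reduced
 * modulo 2^32 by the wraparound of uint32_t arithmetic. */
uint32_t tea_sum_after(unsigned cycles)
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < cycles; i++)
        sum += TEA_DELTA;
    return sum;
}
```

After 32 cycles the accumulated value equals delta shifted left by five bits (modulo 2^32), which is why decryption can start from that value and subtract delta once per cycle.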

Figure 1 - TEA Block Diagram

B. Software Implementation

In the original presentation of TEA, David Wheeler and Roger Needham also included source code for a software implementation. The designers state that the particular algorithm used in the source code was chosen because it was thought to be a compromise between security and simplicity of design. This algorithm was neither the fastest nor the slowest of those tested prior to the final selection of one algorithm [2]. As a part of the original presentation of TEA, Wheeler and Needham published the following source code.

    void code(long* v, long* k) {
        unsigned long y=v[0], z=v[1], sum=0,   /* set up */
                      delta=0x9e3779b9,        /* a key schedule constant */
                      n=32 ;
        while (n-->0) {                        /* basic cycle start */
            sum += delta ;
            y += ((z<<4)+k[0]) ^ (z+sum) ^ ((z>>5)+k[1]) ;
            z += ((y<<4)+k[2]) ^ (y+sum) ^ ((y>>5)+k[3]) ;
        }                                      /* end cycle */
        v[0]=y ; v[1]=z ;
    }

    void decode(long* v, long* k) {
        unsigned long n=32, sum, y=v[0], z=v[1],
                      delta=0x9e3779b9 ;
        sum=delta<<5 ;
        while (n-->0) {                        /* start cycle */
            z -= ((y<<4)+k[2]) ^ (y+sum) ^ ((y>>5)+k[3]) ;
            y -= ((z<<4)+k[0]) ^ (z+sum) ^ ((z>>5)+k[1]) ;
            sum -= delta ;
        }                                      /* end cycle */
        v[0]=y ; v[1]=z ;
    }

Figure 2 - TEA Source Code [2]

In the software implementation, the source code separates the 64-bit block into two 32-bit numbers labeled y and z. The code assumes that long is a 32-bit type. As previously stated, TEA contains two rounds for one cycle of encryption or decryption. Round one (and subsequent odd rounds) updates y using subkeys K[0] and K[1]. Round two (and subsequent even rounds) updates z using subkeys K[2] and K[3].

III. EXTENSIONS OF TINY ENCRYPTION ALGORITHM (XTEA)

A. Background Information

After some weaknesses and vulnerabilities of TEA were discovered and documented, Wheeler and Needham presented a revised algorithm and called it Extensions of TEA (XTEA). XTEA was first presented in 1997, three years after TEA was first presented.
Like TEA, XTEA is a block cipher that uses a Feistel structure. XTEA also uses the same 64-bit block and 128-bit key as TEA. The same 64 rounds, or 32 cycles, are also recommended for the algorithm. The vulnerabilities of TEA were discovered using differential related-key attacks [4]. XTEA attempts to correct these weaknesses by improving some aspects of the algorithm.

The first change introduced in XTEA is a revised key schedule. In XTEA, the subkeys are introduced more slowly. The subkeys are also selected using two bits of the variable sum, and a shift of 11 is used in the key schedule to help create an irregular sequence of subkeys. The other change introduced in XTEA is a rearrangement of the addition, shift, and XOR operations. Figure 3 shows XTEA. Instead of a fixed placement of the subkeys, the subkeys are now introduced as subkey A and subkey B.
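The selection of subkeys from two bits of sum can be sketched in C. The index expressions `sum & 3` and `(sum >> 11) & 3` are taken from the published XTEA source code [5]; the function name `xtea_subkey_indices`, the output array layout, and the use of `uint32_t` are assumptions made for illustration.

```c
#include <stdint.h>

#define XTEA_DELTA 0x9E3779B9u  /* same key schedule constant as TEA */

/* Records which of the four subkeys each round selects (illustrative helper).
 * idx must hold 2 * cycles entries:
 *   idx[2*i]     index used by the round that updates y (sum before update)
 *   idx[2*i + 1] index used by the round that updates z (sum after update) */
void xtea_subkey_indices(unsigned cycles, unsigned idx[])
{
    uint32_t sum = 0;
    for (unsigned i = 0; i < cycles; i++) {
        idx[2 * i] = sum & 3u;               /* two lowest bits of sum */
        sum += XTEA_DELTA;
        idx[2 * i + 1] = (sum >> 11) & 3u;   /* bits 11 and 12 of sum */
    }
}
```

For the first two cycles this yields the indices 0, 3, 1, 2, so the subkeys are no longer used in the fixed K[0], K[1], K[2], K[3] pattern of TEA.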

Figure 3 - XTEA Block Diagram

B. Software Implementation

Similar to the publication for TEA, when XTEA was published in 1997, Wheeler and Needham also included source code for XTEA. As previously stated, XTEA contains two rounds for one cycle of encryption or decryption. Round one (and subsequent odd rounds) operates on y. The subkey selection in this round depends on the value of sum&3, which is the variable sum ANDed with 3 (0x03, or 0011b). Round two (and subsequent even rounds) operates on z. The subkey selection in this round depends on the value of sum>>11 & 3, which is sum shifted right by 11 and then ANDed with 3.

    void tean(long * v, long * k, long N) {
        unsigned long y=v[0], z=v[1], DELTA=0x9e3779b9 ;
        if (N>0) {                            /* coding */
            unsigned long limit=DELTA*N, sum=0 ;
            while (sum!=limit) {
                y += (((z<<4) ^ (z>>5)) + z) ^ (sum + k[sum&3]) ;
                sum += DELTA ;
                z += (((y<<4) ^ (y>>5)) + y) ^ (sum + k[sum>>11 &3]) ;
            }
        }
        else {                                /* decoding */
            unsigned long sum=DELTA*(-N) ;
            while (sum) {
                z -= (((y<<4) ^ (y>>5)) + y) ^ (sum + k[sum>>11 &3]) ;
                sum -= DELTA ;
                y -= (((z<<4) ^ (z>>5)) + z) ^ (sum + k[sum&3]) ;
            }
        }
        v[0]=y, v[1]=z ;
        return ;
    }

Figure 4 - XTEA Source Code [5]

IV. HARDWARE IMPLEMENTATION OF XTEA

A. Top Level Block Diagram

During the hardware design process, our project team needed to determine the design criteria for the implementation. Our final decision was to implement a hardware design with speed as the main criterion. Therefore, encryption and decryption could be designed separately. Figure 5 shows the top-level design of the hardware.

Figure 5 - Hardware Implementation Top Level Diagram

The definitions of the pins used for the hardware implementation are as follows.

RESET: resets the circuit to an initial state
CLOCK: input clock signal (active high)
DATA_IN: 64-bit data input to the circuit (can be plaintext or ciphertext).
The upper 32 bits of DATA_IN are also used to input the key schedule constant.
KEY: 128-bit key input
ENC_DEC: controls the circuit operating mode (encryption or decryption)
LOAD_DATA: initiates loading of data into the circuit
LOAD_KEY: initiates loading of the key and key schedule constant into the circuit
OUTPUT: 64-bit data output (encrypted ciphertext or decrypted plaintext)
READY: signals that the circuit is ready to accept input

B. Hardware Design Decisions

As previously stated, the goal of the XTEA hardware implementation is to maximize speed. Therefore, we needed to make a few critical design decisions. The first decision was to implement the encryption and decryption algorithms as separate logic. If our design criterion had been to minimize area, we most likely would have shared the same adders, XOR gates, and other logic between encryption and decryption, with the data flowing through the circuit in reverse order for decryption.

The second design decision was how to implement a major building block of the logic. For XTEA, a major component of both encryption and decryption, besides the XOR gate, is the adder. The default adder implemented by the design tools we chose is a ripple-carry adder. For the hardware design of XTEA, we chose to implement a faster adder, the Kogge-Stone parallel prefix adder. Figure 6 shows a 16-bit Kogge-Stone adder.

Figure 6 - Kogge Stone Adder Block Diagram

The third design decision was the flow of logic for the encryption and decryption routines. We decided to implement XTEA using registers. After each calculation at the end of a step, the current value is stored in a register. Each step that ends with a value being stored in a register takes one clock period to complete. There are four steps in every cycle of encryption or decryption, so both the encryption and decryption algorithms take four clock periods to complete one cycle. For 32 cycles, encrypting or decrypting one 64-bit block takes 128 (32*4) clock periods. Two additional clock periods are required: one at the beginning for reading the input data, the key schedule constant delta, and the intermediate sum, and one at the end for writing the data output back to the main registers, for a grand total of 130 clock periods.

C. Encryption and Decryption Algorithms

During the detailed design stage of the hardware implementation, we used the software source code to create a block diagram of the XTEA encryption and decryption algorithms. The first two lines of source code inside the while loop are:

    y+= (z<<4 ^ z>>5) + z ^ sum + k[sum&3];
    sum+=delta;

The first step yields the following three expressions:

    (z<<4 ^ z>>5) + z
    sum + k[sum&3]
    sum+=delta

The first two expressions are XORed together and added to the value of y, which results in the final equation:

    y+= (z<<4 ^ z>>5) + z ^ sum + k[sum&3];

This output is the final result of the second step. See the block diagram below for more details about the first two steps.

Figure 7 - Hardware Implementation Encryption Part II

After this step, the value of sum is updated, and the new value of y is used for the third and fourth steps of encryption. The final line of source code inside the while loop is:

    z+= (y<<4 ^ y>>5) + y ^ sum + k[sum>>11 &3];

The third step yields:

    (y<<4 ^ y>>5) + y
    sum + k[sum>>11 &3]

These two values are XORed together and added to the value of z, which results in the final equation:

    z+= (y<<4 ^ y>>5) + y ^ sum + k[sum>>11 &3];

This final result for z is the output of the fourth step. The new values of sum, y, and z are used in the next iteration of the loop until all 32 cycles of encryption are complete.

Figure 8 - Hardware Implementation Encryption Part I
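The four steps described above can be modeled in software as a rough sketch. This is not the authors' HDL: the register struct, all function names, and the point at which the sum register is committed are assumptions made for illustration, and `ks_add32` is a bitwise C model of a 32-bit Kogge-Stone parallel prefix addition rather than the 16-bit adder shown in Figure 6. A reference loop with native addition, following Figure 4, is included for comparison.

```c
#include <stdint.h>

#define XTEA_DELTA 0x9E3779B9u

/* Bitwise model of a 32-bit Kogge-Stone addition (hypothetical helper).
 * g and p hold group generate/propagate bits; five prefix levels with
 * spans 1, 2, 4, 8, 16 cover all 32 bits. cin must be 0 or 1. */
uint32_t ks_add32(uint32_t a, uint32_t b, uint32_t cin)
{
    uint32_t p0 = a ^ b;           /* per-bit propagate, reused for the sum */
    uint32_t g = a & b;            /* per-bit generate */
    uint32_t p = p0;
    g |= p & cin;                  /* fold the carry-in into bit 0 */
    for (unsigned d = 1; d < 32; d <<= 1) {
        g |= p & (g << d);         /* combine with the group d bits below */
        p &= p << d;
    }
    return p0 ^ ((g << 1) | cin);  /* carry into bit i is the carry out of bit i-1 */
}

/* Assumed register set: main registers plus pipeline registers. */
typedef struct {
    uint32_t y, z, sum;
    uint32_t pl_a, pl_b, next_sum;
} xtea_regs;

/* One encryption cycle as four register-to-register steps. */
void xtea_enc_cycle(xtea_regs *r, const uint32_t k[4])
{
    /* Step 1: three results computed from the current registers. */
    r->pl_a = ks_add32((r->z << 4) ^ (r->z >> 5), r->z, 0);
    r->pl_b = ks_add32(r->sum, k[r->sum & 3u], 0);
    r->next_sum = ks_add32(r->sum, XTEA_DELTA, 0);
    /* Step 2: XOR the two results, add to y, commit the new sum. */
    r->y = ks_add32(r->y, r->pl_a ^ r->pl_b, 0);
    r->sum = r->next_sum;
    /* Step 3: same structure, using the new y and the new sum. */
    r->pl_a = ks_add32((r->y << 4) ^ (r->y >> 5), r->y, 0);
    r->pl_b = ks_add32(r->sum, k[(r->sum >> 11) & 3u], 0);
    /* Step 4: XOR the two results and add to z. */
    r->z = ks_add32(r->z, r->pl_a ^ r->pl_b, 0);
}

/* Stepwise encryption of one block using the model above. */
void xtea_step_encrypt(uint32_t v[2], const uint32_t k[4], unsigned cycles)
{
    xtea_regs r = { v[0], v[1], 0, 0, 0, 0 };
    for (unsigned i = 0; i < cycles; i++)
        xtea_enc_cycle(&r, k);
    v[0] = r.y;
    v[1] = r.z;
}

/* Reference: the encryption loop of Figure 4 with native addition. */
void xtea_ref_encrypt(uint32_t v[2], const uint32_t k[4], unsigned cycles)
{
    uint32_t y = v[0], z = v[1], sum = 0;
    for (unsigned i = 0; i < cycles; i++) {
        y += (((z << 4) ^ (z >> 5)) + z) ^ (sum + k[sum & 3u]);
        sum += XTEA_DELTA;
        z += (((y << 4) ^ (y >> 5)) + y) ^ (sum + k[(sum >> 11) & 3u]);
    }
    v[0] = y;
    v[1] = z;
}
```

Calling `xtea_step_encrypt` and `xtea_ref_encrypt` with the same block, key, and 32 cycles should produce identical outputs, which is a convenient check that a four-step decomposition preserves the behavior of the source code.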

The block diagram for decryption looks similar, with subtle differences. Where encryption uses addition, decryption uses subtraction as the reverse operation. To implement subtraction in hardware, we used an adder with one input inverted and a 1 applied to the carry input. The initial value of sum is also different. Encryption starts with sum equal to zero, whereas decryption starts with sum equal to delta, as given in equation (1), multiplied by 32. For each subsequent cycle of decryption, delta is subtracted from sum to get the new value of sum. Again, this is the reverse of encryption, where delta is added to sum to obtain the next value of sum.

The three lines of source code in the while loop for decryption are:

    z-= (y<<4 ^ y>>5) + y ^ sum + k[sum>>11 &3];
    sum-=delta;
    y-= (z<<4 ^ z>>5) + z ^ sum + k[sum&3];

The first step of decryption yields the following expressions:

    (y<<4 ^ y>>5) + y
    sum + k[sum>>11 &3]
    sum-=delta

The first two expressions are XORed together and subtracted from the value of z, which results in the final equation:

    z-= (y<<4 ^ y>>5) + y ^ sum + k[sum>>11 &3];

This result is the output of the second step of decryption. After this step, the value of sum is updated, and the new value of z is used for the third and fourth steps of decryption. The final line of source code inside the while loop is:

    y-= (z<<4 ^ z>>5) + z ^ sum + k[sum&3];

The third step yields:

    (z<<4 ^ z>>5) + z
    sum + k[sum&3]

These two values are XORed together and subtracted from the value of y, which results in the final equation:

    y-= (z<<4 ^ z>>5) + z ^ sum + k[sum&3];

This final result for y is the output of the fourth step. The new values of sum, y, and z are used in the next iteration of the loop until all 32 cycles of decryption are complete.

D. State Diagram

For the purposes of the hardware design, we implemented a state diagram to help with logic control of the encryption and decryption algorithms. Figure 9 shows the flow of the state diagram.

Figure 9 - Hardware Implementation Flow Diagram

The first action in the state diagram is to assert the reset line to initialize the circuit. After the reset signal is asserted, the rst_sum, ready, and rst_count signals are asserted. The rst_sum signal resets the initial value of sum to zero for encryption and to 32*delta for decryption. The rst_count signal resets the encryption/decryption cycle counter.

The first decision in the state diagram is whether or not to load the key. If load_key is not asserted, the state diagram continues to the next decision. If load_key is asserted, the key for encryption or decryption and the key schedule constant are loaded. The next decision is whether or not to load the data. If the load_data signal is not asserted, the state diagram loops until it is asserted. Once load_data is asserted, the en_data signal is asserted and the input is loaded.

At this point, the next decision is whether to encrypt or decrypt the data. If the enc_data signal is not asserted, the data input is encrypted. If the enc_data signal is asserted, the data input is decrypted. Prior to decrypting the data, the value of sum is loaded; it is then decremented by delta after every cycle of decryption. After this step is complete, one complete cycle of encryption or decryption takes four clock periods. In order to continue to the next clock period, a signal needs to be asserted. For example, the signal enc_l1 needs to be asserted before continuing to the next clock period, and then the signal enc_l2 needs to be asserted before continuing to the one after that. This allows the intermediate values to flow through the pipeline registers.

After the four clock periods complete, another decision needs to be made: whether all 32 cycles have completed. If they have not, the done signal is not asserted and the state diagram loops back to the beginning of the pipeline. If the done signal is asserted, then all 32 cycles have been completed, and the en_out signal is asserted.
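The control flow described above, and the total of 130 clock periods per block, can be illustrated with a simplified C model. The state names below are hypothetical and do not correspond one-to-one to the states S1 through S6 of Figure 9; the model also assumes that the key is already loaded and that each state occupies exactly one clock period.

```c
/* Hypothetical control states: one load state, four step states, one output state. */
typedef enum {
    CTL_LOAD, CTL_STEP1, CTL_STEP2, CTL_STEP3, CTL_STEP4, CTL_OUT
} ctl_state;

/* Counts the clock periods needed to process one block.
 * cycles is the number of XTEA cycles and is assumed to be at least 1. */
unsigned xtea_clock_periods(unsigned cycles)
{
    ctl_state s = CTL_LOAD;
    unsigned clocks = 0;   /* elapsed clock periods */
    unsigned count = 0;    /* completed XTEA cycles (the cycle counter) */

    for (;;) {
        clocks++;          /* every state takes one clock period */
        switch (s) {
        case CTL_LOAD:  s = CTL_STEP1; break;   /* read data, delta, sum */
        case CTL_STEP1: s = CTL_STEP2; break;
        case CTL_STEP2: s = CTL_STEP3; break;
        case CTL_STEP3: s = CTL_STEP4; break;
        case CTL_STEP4:
            count++;
            /* "done" decision: loop back until all cycles have completed */
            s = (count >= cycles) ? CTL_OUT : CTL_STEP1;
            break;
        case CTL_OUT:
            return clocks;                      /* output written */
        }
    }
}
```

With 32 cycles the model returns 1 + 32*4 + 1 = 130, matching the clock period count given in the design decisions.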
When the en_out signal is asserted, the tri-state buffers are enabled and the new data is available at the output of the circuit.

V. RESULTS

After the hardware implementation of XTEA was completed, we recorded data including area, clock frequency, latency, and throughput. First, we need to discuss the target devices that were used for timing analysis. For the FPGA implementation, the Xilinx Virtex 4SX25FF668 was chosen, primarily because of its size. For the ASIC implementation, the 90 nm TCBN90G TSMC library was chosen. This semi-custom library was chosen because it was the library with the smallest transistors available to us (90 nm instead of 120 nm).

Data was recorded for the FPGA hardware implementation, including area (slice flip-flops and look-up tables (LUTs), as well as equivalent gate count), clock period, and clock frequency. The information listed in Table 1 was recorded using Active-HDL ver 7.1.

Table 1 - FPGA Results

Device                                                            FPGA
Area (Slice Flip-Flops)                                           1,081
Area (LUTs)                                                       4,608
Area (Equivalent Gate Count including JTAG gate count for IOBs)   49378.292969/2.419200 = 20,411 NAND Gate Equivalent
Clock Period (ns)                                                 6.403
Clock Frequency (MHz)                                             156.177

Similar data was recorded for the ASIC hardware implementation, including area, clock period, and clock frequency. The information listed in Table 2 was recorded using Synopsys Design Analyzer Version X-2005.09.

Table 2 - ASIC Results

Device                  ASIC
Area                    49,378.292969
Clock Period (ns)       1.5
Clock Frequency (MHz)   666.67

After recording data related to clock frequency, we also recorded the latency of the design. Knowing the latency and that XTEA uses 64-bit blocks, we calculated the throughput in Mbps for both the FPGA and ASIC designs.
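The arithmetic that links clock period, latency, and throughput can be sketched in a few lines of C. The sketch assumes that latency equals the clock period multiplied by the 130 clock periods per block given in the design decisions, and that throughput is converted with factors of 1024 as in the calculation used for Table 3; the function names are hypothetical.

```c
/* Latency of one block in nanoseconds: clock period times clock periods per block. */
double block_latency_ns(double clock_period_ns, unsigned clock_periods)
{
    return clock_period_ns * (double)clock_periods;
}

/* Throughput in Mbps: block size over latency gives bits per second,
 * which is then divided by 1024 twice (bps -> kbps -> Mbps). */
double throughput_mbps(double latency_ns, unsigned block_bits)
{
    double bps = (double)block_bits / (latency_ns * 1e-9);
    return bps / 1024.0 / 1024.0;
}
```

For example, `block_latency_ns(6.403, 130)` gives about 832.39 ns for the FPGA and `block_latency_ns(1.5, 130)` gives 195 ns for the ASIC; passing those latencies and a 64-bit block to `throughput_mbps` gives roughly 73.3 Mbps and 313 Mbps.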
Table 3 contains the latency, which was recorded using Active-HDL ver 7.1, and the throughput, which was calculated.

Table 3 - Throughput for FPGA and ASIC

Device   Latency (ns)   Throughput (Mbps)
FPGA     832.39         73.33
ASIC     195            313

Throughput was calculated in the following manner. The XTEA block size of 64 bits was divided by the latency, which gives the throughput in bits per second (bps). This value was divided by 1024 to get kilobits per second (kbps), and divided by 1024 again to get megabits per second (Mbps). There is a noticeable latency difference between the FPGA and ASIC implementations: the ASIC is a little more than four times faster than the FPGA.

VI. CONCLUSION

This paper provides enough data to show that XTEA can be implemented in hardware, on either an FPGA or an ASIC. The FPGA design reached a clock frequency of 156.177 MHz and a throughput of 73.33 Mbps, and the ASIC design reached 666.67 MHz and 313 Mbps. This particular hardware implementation of XTEA was derived from the published source code of the original designers of TEA and XTEA. From this source code, a block diagram and state diagram were produced to help complete the hardware implementation with successful results.

The work provided in this paper is important because there are not many hardware implementations of TEA or XTEA available to the public. An end user who is interested in using this hardware design should determine whether the security of XTEA is sufficient for the application. The end user should compare the speed and throughput of this hardware implementation with other secret-key block cipher cryptosystems to determine which algorithm will meet the speed and data requirements. The end user should also compare this hardware implementation of XTEA with other hardware implementations that may become public in the future.

For future work, other designers could expand upon this project in a number of ways. A designer could change this implementation to use a fully pipelined architecture. The use of the Kogge-Stone adder will be very beneficial when migrating this design to a fully pipelined architecture. By changing to a fully pipelined design, the designer would also have the opportunity to reduce the critical path for all operations, thus reducing the clock period even more, which results in higher throughput. Another way to expand upon this design would be to add FIFO memory for larger files. For our implementation and simulation, we input only one 64-bit block and did not input any values larger than 64 bits. The implementation could therefore be modified to accept larger inputs and to encrypt or decrypt multiple blocks of data rather than just one block.

REFERENCES

[1] S. Liu, O. Gavrylyako, and P. Bradford, "Implementing the TEA algorithm on Sensors," Proceedings of the 42nd Annual Southeast Regional Conference, pp. 64-69, 2004.
[2] D. Wheeler and R. Needham, "TEA, a tiny encryption algorithm," Proc. Fast Software Encryption: Second International Workshop, Lecture Notes in Computer Science, vol. 1008, pp. 363-366, December 1994.
[3] P. Israsena, "Securing Ubiquitous and Low-Cost RFID Using Tiny Encryption Algorithm," Wireless Pervasive Computing, 2006 1st International Symposium, pp. 1-4, 2006.
[4] J. Kelsey, B. Schneier, and D. Wagner, "Related-Key Cryptanalysis of 3-WAY, Biham-DES, CAST, DES-X, NewDES, RC2, and TEA," Proceedings of the First International Conference on Information and Communication Security, Lecture Notes in Computer Science, vol. 1334, pp. 233-246, 1997.
[5] D. Wheeler and R. Needham, "TEA Extensions," unpublished, October 1997.