Modified Booth Algorithm Carry Save Adder for High-Speed Multiplier

Mahyar Shahsavari
July 2012

Abstract

Designing an optimized processor has been a main concern of computer and hardware designers in recent decades. Many approaches have been tested and implemented. Different methods of addition have been applied, and these lead to different methods of multiplication and division. High-speed multipliers are critical to processor performance: 8.72% of all instructions in a typical scientific program are multiplications (Kyoung, H.L., 2003). In this report, we present a parallel multiplier that applies the modified Booth algorithm together with a carry-save adder (CSA). We enhanced the LEON3 soft-core processor from Gaisler. The implementation is based on a Xilinx Virtex-4 board.

1 Introduction

For multiplication, the conventional iterative add-shift methods are inexpensive to implement in terms of hardware, but the resulting execution speeds are too low to satisfy the increasing demand for high-speed computing. Since the speed of CPUs has increased tremendously in recent years, parallel multipliers can be implemented so that they meet high-speed requirements. Among the different multiplication techniques, the modified Booth algorithm is very prominent. This technique, combined with a carry-save adder (CSA), increases the performance of parallel multipliers. In this paper, we work on enhancing the performance of the 32-bit multiplier of the open-source LEON3 processor. The idea was to integrate concepts from computer architecture and computer arithmetic and to use them to design the required functional units and optimize the overall processor performance. Furthermore, these functional units and the overall processor had to be implemented on a Virtex-4 ML410 FPGA board in order to run the benchmark and measure the performance.
A basic principle of computer design is to focus on the common case: when making a design trade-off, favor the frequent case over the infrequent one. For instance, the instruction fetch and decode unit of a processor may be used much more frequently than a multiplier or divider. Similarly, multiplication is performed more often than division, so more performance can be gained by improving multipliers than by improving dividers.

2 Present schemes used

There are three methods of multiplication: binary multiplication, the array multiplier, and the multiplier and accumulator unit (tree multiplier).

Binary multiplication is a software method, used when the processor has no hardware multiplier. It works, but it is slow. The entire process consists of three steps: partial product generation, partial product reduction and final addition.

The next choice, the array multiplier, is a vast improvement in speed over the traditional bit-serial multipliers. The array multiplier is very regular in structure and uses only short wires to connect each full adder to the next, so it has a very simple and efficient VLSI layout. It is still not fast enough, however, and its area and power consumption are its most obvious shortcomings.

The tree multiplier, in other words the multiplier and accumulator (MAC), can be fully parallelized. Its efficiency can be improved dramatically by using a high-performance CSA and a higher-radix multiplier. Higher-radix multiplication schemes handle more than one bit of the multiplier in each cycle. A higher representation radix leads to fewer digits, so the multiplication algorithm requires fewer cycles, which means fewer partial products.

3 Overview of our design

The multiplication algorithm has four steps, and improving any one of them benefits the whole process. These steps are partial product generation, partial product addition, final addition and accumulation. Several techniques [3], such as the Baugh-Wooley (BW) algorithm, the Booth algorithm (BA) and the modified Booth algorithm (MBA), make partial product generation faster and more efficient. For an n-bit multiplier, the number of summands is n, n/2 and n/2 for BW, BA and MBA respectively. In addition to the encoding step, the BA and MBA algorithms also require generation of the two's complement of the multiplicand, which introduces extra delay.
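One common way to keep this negation cheap, and the one the SUB and 2SUB cases described later in this report rely on, is to form only the bitwise complement and to supply the missing one separately as a carry-in to the adder. The following is a minimal Python sketch of that idea under stated assumptions: the function names and the 32-bit default width are illustrative and are not taken from the LEON3 sources.

```python
def complement_with_carry_in(a, bits=32):
    """Split -a into (bitwise complement of a, carry-in of 1).

    The complement needs no carry chain; the +1 that completes the
    two's complement is deferred to whichever adder consumes the pair.
    """
    mask = (1 << bits) - 1
    a_bar = ~a & mask   # one's complement, truncated to `bits` bits
    cin = 1             # the deferred +1
    return a_bar, cin


def add_with_carry_in(x, y, cin, bits=32):
    """Model of a `bits`-wide adder with a carry-in input."""
    return (x + y + cin) & ((1 << bits) - 1)


# x - a computed as x + not(a) + 1, all in 32-bit wrap-around arithmetic
a_bar, cin = complement_with_carry_in(58)
difference = add_with_carry_in(100, a_bar, cin)
```

Here `difference` equals 100 - 58. The point of the split is that the adder that already exists for summing partial products absorbs the increment, so no separate incrementer sits in front of it.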
The delay for two's-complement generation is not trivial, but it has been consistently neglected in most of the designs proposed in the literature. Improving partial product addition depends on choosing the proper adder. In our design we applied a carry-save adder, as illustrated in Figure 1. A 2n-bit accumulator is required to store the final product.

Figure 1: A partial schematic of the 32-bit CSA adder implementation

The modified Booth algorithm [2] is the method we chose for producing the partial products. In the conventional MBA, three-bit strings of the multiplier are scanned and the appropriate operations are carried out on the multiplicand. We express the n-bit numbers A and B by the sequences $a_{n-1} a_{n-2} \ldots a_0$ and $b_{n-1} b_{n-2} \ldots b_0$, respectively. The product of the two numbers can be written as

$$P = A \times B = \sum_{i=0}^{n-1} a_i 2^i \times \sum_{j=0}^{n-1} b_j 2^j = \sum_{i=0}^{n-1} \sum_{j=0}^{n-1} a_i b_j 2^{i+j} \qquad (1)$$

In a straightforward parallel multiplication of two n-bit numbers, all the partial products are generated simultaneously. Since parallel hardware lends itself only to a fixed number of partial products, the algorithm was modified by MacSorley [1] so that it encodes three-bit strings of the multiplier at a time, with one overlapping bit. The multiplier can be written as

$$B = \sum_{\substack{i=0 \\ i\ \mathrm{even}}}^{n-2} \left(-2b_{i+1} + b_i + b_{i-1}\right) 2^i = \sum_{i=0}^{n/2-1} Q_i 4^i \qquad (2)$$

where $Q_i = -2b_{2i+1} + b_{2i} + b_{2i-1}$ with $b_{-1} = 0$ and $Q_i \in \{-2, -1, 0, +1, +2\}$. The product of the multiplication can be written as

$$P = \sum_{i=0}^{n/2-1} A\, Q_i\, 4^i \qquad (3)$$

An encoder accepts three-bit strings of the multiplier as input and outputs the appropriate control signals, as shown in Figure 2. The truth table for the encoder and the mathematical operation effected by each three-bit sequence of the multiplier are shown in Table 1.

Figure 2: The radix-4 schematic using the Booth encoding method

Table 1: Modified Booth algorithm

    Multiplier bits               Booth modified outputs
    b_{2i+1}  b_{2i}  b_{2i-1}    Z  ADD  2ADD  2SUB  SUB  NEG   Mux Out
       0        0        0        1   0    0     0     0    0      0
       0        0        1        0   1    0     0     0    0      A
       0        1        0        0   1    0     0     0    0      A
       0        1        1        0   0    1     0     0    0      2A
       1        0        0        0   0    0     1     0    1     -2A
       1        0        1        0   0    0     0     1    1     -A
       1        1        0        0   0    0     0     1    1     -A
       1        1        1        1   0    0     0     0    0      0

The control signals generated by the encoder are Z, ADD, 2ADD, 2SUB, SUB and NEG. Z is the signal for which the multiplexer outputs zero instead of the multiplicand. ADD and 2ADD are the signals for which the multiplexer produces the multiplicand and twice the multiplicand, respectively. The SUB and 2SUB control signals make the multiplexer generate the complement of the multiplicand and the complement of twice the multiplicand, respectively. Finally, NEG is 0 or 1 depending on whether the multiplexer generates a positive or a complemented number. Subtraction can be carried out using two's-complement addition. This involves adding one to the complement of the multiplicand at the LSB for the SUB and 2SUB operations. The extra one is generated by the encoding logic.

4 Multiplier Implementation

We designed a multiplier that multiplies two 32-bit signed numbers in 3 cycles. The 32 bits of the multiplicand and 16 bits of the multiplier are fed to the multiple generation block. The 16 outputs of the multiple generation block are combined with the sum and carry from the previous cycle. The lower 16 bits of the sum and the lower 15 bits of the carry are fed into a 16-bit carry-propagate adder (CPA) to produce the lower 16 bits of the product, and after 2 iterations of this process the lower 32 bits of the product are obtained.

After choosing suitable algorithms, the first step is to write the code for the multiplier together with a testbench that verifies the correctness of the design. Once the multiplier has been verified on its own, it replaces the original multiplier in the complete LEON3 project. The VHDL code for the Booth encoder is

    two_a     <= a(30 downto 0) & '0';      -- shift left to produce 2a
    a_bar     <= not a;                     -- complement of a; with cin = '1' this yields -a
    -- complement of 2a: not(2a) = (not a) shifted left with the LSB set,
    -- so that adding cin = '1' at the LSB yields -2a
    two_a_bar <= a_bar(30 downto 0) & '1';

    -- select the Booth encoder output for the three multiplier bits in b
    aa <= a         when b = "001" or b = "010"
     else two_a     when b = "011"
     else two_a_bar when b = "100"               -- cin = '1'
     else a_bar     when b = "101" or b = "110"  -- cin = '1'
     else x"00000000";                           -- "000" and "111": zero

    -- the extra one that completes the two's complement
    cin <= '1' when b = "100" or b = "101" or b = "110" else '0';

    -- sign bit of the selected multiple
    topbit <= a(31)     when b = "001" or b = "010" or b = "011"
         else a_bar(31) when b = "100" or b = "101" or b = "110"
         else '0';

Looking at Table 1 again makes it easier to see how this code checks the three bits of b and, based on them, chooses the Booth encoder output as a partial product. Running the testbench of the designed multiplier (Figure 3) shows results that confirm the multiplication is trustworthy.

Figure 3: Simulation results of testbench (Modelsim)

5 Timing Report

After implementing our design in ISE, the timing summaries in Table 2 were obtained. In order to run the Dhrystone benchmark, we had to implement the modified processor on the Virtex-4 FPGA, which required placement and routing of our design. The actual clock of the design is not the one given in the synthesis report; the actual speed at which the design can run is known only after placing and routing the design on the target FPGA. By taking advantage of the radix-4 Booth encoder, the number of logic levels decreased as well, in addition to the improvement in timing and minimum period. For instance, for the slack (setup path) with source l3.cpu[0].u0/p0/iu0/r.x.result 3 (FF) and destination l3.cpu[0].u0/cmem0/dme.dtags1.dt1.dt0[1].dtags0/xc2v.x0/a9.x[0].r0 (RAM), the number of logic levels decreased from 15 to 13, which could also save a notable amount of area.
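To make the Booth recoding of equations (2) and (3) and the carry-save accumulation of Section 3 concrete, the following is a behavioural Python sketch. It is not the VHDL of this design: it forms the signed multiples directly instead of using the complement-plus-carry-in scheme, it accumulates all partial products in one loop rather than in 3 cycles, and all function names are illustrative assumptions.

```python
def to_signed(x, bits):
    """Interpret the low `bits` bits of x as a two's-complement number."""
    x &= (1 << bits) - 1
    return x - (1 << bits) if x >> (bits - 1) else x


# Radix-4 Booth digit for each triple (b[2i+1], b[2i], b[2i-1])
BOOTH_DIGIT = {0b000: 0, 0b001: 1, 0b010: 1, 0b011: 2,
               0b100: -2, 0b101: -1, 0b110: -1, 0b111: 0}


def booth_digits(b, n=32):
    """Return Q_0 .. Q_{n/2-1} for the n-bit two's-complement multiplier b."""
    ext = (b & ((1 << n) - 1)) << 1     # bit 0 of ext plays the role of b[-1] = 0
    return [BOOTH_DIGIT[(ext >> (2 * i)) & 0b111] for i in range(n // 2)]


def carry_save_add(x, y, z, width):
    """3:2 compression: x + y + z == s + c (mod 2**width), no carry ripple."""
    mask = (1 << width) - 1
    s = (x ^ y ^ z) & mask                                # bitwise sum
    c = (((x & y) | (x & z) | (y & z)) << 1) & mask       # carries, shifted left
    return s, c


def booth_multiply(a, b, n=32):
    """Signed n x n -> 2n-bit multiply using radix-4 Booth digits and a CSA."""
    width = 2 * n
    mask = (1 << width) - 1
    s, c = 0, 0
    for i, q in enumerate(booth_digits(b, n)):
        # partial product A * Q_i * 4**i, wrapped to 2n bits
        pp = ((q * to_signed(a, n)) << (2 * i)) & mask
        s, c = carry_save_add(s, c, pp, width)
    # a single carry-propagate addition resolves the sum/carry pair
    return to_signed(s + c, width)


product = booth_multiply(7, -3)
```

With n = 32 the loop handles 16 partial products instead of 32, and the only carry-propagating addition is the final `s + c`, which is the property the CSA is used for in this report.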

Table 2: Timing Summary

    Processor   Constraints (paths)   Constraints (connections)   Min period   Max freq
    Baseline                                                      ns           MHz
    Modified                                                      ns           MHz

6 Performance Results

After these modifications and the implementation, we compare the two Device Utilization Summaries produced by ISE for the baseline soft core and for the modified one. Figures 4 and 5 are snapshots of the design summary from the Xilinx ISE tools.

Figure 4: Device Utilization Summary for Baseline Processor

Figure 5: Device Utilization Summary for Modified Processor

7 Summary and Conclusion

In this report, we have presented an algorithm for faster and more efficient multiplication. Multiplication is used frequently by the processor, so we expected better processor performance. We implemented our design on the LEON3 32-bit open-source soft core. Our platform was a Xilinx Virtex-4 board, and we applied the same frequency that LEON3 itself uses (80 MHz). With this modification the number of occupied slices decreased, which saves area and consequently reduces power consumption. Our multiplier completes in 3 clock cycles, so the processor is now faster, as shown in the timing summary section.

For future work, I plan to work on the divider and apply a new, efficient algorithm to it. I am also considering a higher clock frequency for this core. Because of time limitations, I could not fully investigate the power reduction due to clock gating and other techniques in this work, but I plan to work on it in the summer. It is also possible to use the low power intelligent tool environment (LITE) with back annotation to investigate the power consumption further, but for lack of time I could not work on that.

As a final comment, I would like to mention that this course project was very interesting and that I learned many things from it. I understood the concept of a soft core, how to use Modelsim and Xilinx ISE, how to write VHDL code, how to check new arithmetic algorithms and ideas, and many other technical things related to computer arithmetic. Through this exercise, we have seen in practice the role of different factors in processor performance, the realization of arithmetic circuits, and their improvement. The only limitation I had, and the one that slowed my progress, was my isolation: I worked alone without enough feedback.

References

[1] Algirdas Avizienis. Binary-compatible signed-digit arithmetic. In Proceedings of the October 27-29, 1964, Fall Joint Computer Conference, Part I, AFIPS 64 (Fall, part I), pages , New York, NY, USA, ACM.

[2] Shiann-Rong Kuang, Jiun-Ping Wang, and Cang-Yuan Guo. Modified Booth multipliers with a regular partial product array. Trans. Cir. Sys., 56(5): , May

[3] Behrooz Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, Oxford, UK,