ESE 570 Final Project

UNIVERSITY OF PENNSYLVANIA
8-bit Fixed Point Arithmetic Logic Unit
Yang Lu*, Feitong Yin, Mengxiao Yan
* Department of Materials Science and Engineering
Department of Electrical and Systems Engineering
Contact E-mail: yanglu1@seas.upenn.edu

I. TOPIC INVESTIGATION

1. Background and Theory

The Arithmetic Logic Unit (ALU) is the core of modern microprocessors. Typically, it is a combinational logic circuit that performs logic operations such as AND, OR and XOR, and arithmetic operations such as addition, subtraction, shift, negate, complement and magnitude comparison. An ALU is composed of several important modules: a Logic Unit, a Shifter, an Algorithm Unit including a full adder, Multiplexers and a Static Flip-flop. These components are controlled by operation codes generated by the microprocessor. In this design, we simply apply the operation codes as inputs and focus on the ALU itself.

The performance of the ALU plays a key role in Very-Large-Scale Integrated (VLSI) circuit design. If the ALU's operation speed is improved, the performance of the whole system is boosted. ALUs may also be tailored with different bit widths for different types of VLSI circuits. Taking these industrial trends into consideration, our team aims at designing a high-speed Arithmetic Logic Unit with 8-bit width. High-speed performance can be pursued at each VLSI design level: circuit (we try to use as few transistors as possible), architecture (each module is placed deliberately in the full chip) and layout (to minimize the RC delay, we optimize the metal connections and decrease the chip area).

Our design procedure is as follows. First, each fundamental gate of the ALU is designed, starting from the simplest inverter and proceeding to the AND, OR and XOR gates. Second, guided by behavioral simulations, each module is built up separately. Then transistor-level schematics and simulations are used to verify our design. Lastly, based on our proposed architecture, these components are connected together in both schematic and layout. Post-layout verification is used to evaluate the performance.

2. System Specification

The following functions for 8-bit data are realized in our design: logic operations (AND, OR, XOR and 1's complement); shift operations (left/right shift and left/right rotate); and algorithm operations (add, subtract, compare and 2's complement). The pins of our chip are listed in Table.1.

Pin Name    Description
A0-A7       8-bit Data Input A
B0-B7       8-bit Data Input B
Clkin       Clock
S0-S3       Operation Codes
C0          Carry in for algorithm unit

Table.1(a) Input Pins

Pin Name    Description
Out0-Out7   8-bit Data Output
C8          Carry out from algorithm unit
comp        Boolean result of the magnitude comparison of A and B (inverse of the sign bit of A-B)
eq          Boolean value of A!=B

Table.1(b) Output Pins

The symbol of the 8-bit ALU (full chip) is shown in Figure.1.

Figure.1 Symbol of 8-bit ALU

Operation codes for each function are shown in Table.2.

Operation Code (S0S1S2S3)   Output Logic   Description
0x00                        A+B            Add
0x01                        A-B            Subtract & Comparison
0x10                        !B+1           Negate
1000                        AB             AND
1001                        A+B            OR
1010                        A XOR B        XOR
1011                        !B             1's Complement
1100                        A              Right Rotate
1101                        A              Left Rotate
1110                        A              Right Shift
1111                        A              Left Shift

Table.2 Operation Codes (x = don't care)

In the end, an 8-bit fixed point ALU is realized with a die area of 1150um x 650um and a maximum clock frequency of 100MHz. Highlights of our design are:

1. It operates on 8-bit data.
2. 13 functions are realized: AND, OR, XOR, equal, 1's complement, right rotate, left rotate, right shift, left shift, add, subtract, comparison, 2's complement.
3. A static flip-flop is added to stabilize the outputs.
4. Power dissipation is only 150mW in the worst case.
5. Only two layers of metal are used for routing, which significantly reduces fabrication difficulty and RC delay.
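The decoding in Table.2 can be summarized as a small behavioral reference model. The sketch below is written in Python purely for illustration and is not part of the original design flow; the function names `alu`, `rotr` and `rotl` are hypothetical, the operation code is passed as a 4-character string, and the negate row follows Table.2 as printed.

```python
MASK = 0xFF  # 8-bit datapath


def rotr(a):
    """Rotate an 8-bit value right by one position."""
    return ((a >> 1) | (a << 7)) & MASK


def rotl(a):
    """Rotate an 8-bit value left by one position."""
    return ((a << 1) | (a >> 7)) & MASK


def alu(a, b, code, c0=0):
    """Return the 8-bit output for operation code S0S1S2S3 (e.g. '1010')."""
    s0, s1, s2s3 = code[0], code[1], code[2:]
    if s0 == "0":
        # Algorithm unit: S1 is a don't-care; code 0x11 is not listed in Table.2.
        ops = {"00": a + b + c0, "01": a - b, "10": -b}
    elif s1 == "0":
        # Logic unit
        ops = {"00": a & b, "01": a | b, "10": a ^ b, "11": ~b}
    else:
        # Shifter (operates on A only)
        ops = {"00": rotr(a), "01": rotl(a), "10": a >> 1, "11": a << 1}
    return ops[s2s3] & MASK


if __name__ == "__main__":
    for code in ("0000", "0001", "1000", "1011", "1100", "1111"):
        print(code, format(alu(0b00010111, 0b00111101, code), "08b"))
```

Masking with 0xFF after every operation models the fixed 8-bit width, so negative intermediate results come out in 2's complement form.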

II. DESIGN

1. Architecture

Our ALU architecture is shown in Figure.2. Three basic building blocks perform the algorithm operations (add, subtract, negate and comparison), the logic operations (AND, OR, XOR and complement) and the shifter operations (left or right rotate/shift). Input signals S2-S3 select the operation performed inside each module, and S0-S1 control the two 8-bit 2-to-1 multiplexers that select which module drives the output. Out0-Out7 are synchronized by the final-stage dynamic flip-flop, with Clkin as its clock input.

Figure.2 Architecture of ALU (blocks: 8-bit Algorithm [Add, Subtract, Negate, Compare], 8-bit Logic [AND, OR, XOR, Complement], 8-bit Shifter [L-Rotate, R-Rotate, L-Shift, R-Shift], two 8-bit MUX2, 8-bit Dyn Flip Flop; signals: A[0:7], B[0:7], Carry in, Carry out, A/B comparison, A=B, S0-S3, Clock, Out[0:7])

2. Individual Block and Gate Descriptions

8-bit MUX2

Controlled by signal S, the 8-bit MUX2 selects one of two 8-bit input channels as its 8-bit output. Its truth table is shown below:

S   Output
0   A0-A7
1   B0-B7

Table.3 Truth table of 8-bit MUX2

Figure.3 8-bit MUX2 symbol

1-bit Ripple Adder

Taking Ai, Bi and Ci as inputs, this unit adds Ai and Bi together with the carry in and outputs SUMi and the corresponding carry out Ci+1. Its truth table is shown below:

Ai  Bi  Ci  SUMi  Ci+1
0   0   0   0     0
0   0   1   1     0
0   1   0   1     0
1   0   0   1     0
0   1   1   0     1
1   1   0   0     1
1   0   1   0     1
1   1   1   1     1

Table.4 Truth table of 1-bit Ripple Adder

It is easy to see from the table that SUMi is 1 when an odd number of inputs are 1, and Ci+1 is 1 when at least two inputs are 1. The schematic view of this unit is shown in part 4: Transistor Level Schematics and Simulations. If we connect eight 1-bit ripple adders in series, we get an 8-bit ripple adder. Symbols of the 1-bit ripple adder and the 8-bit ripple adder are shown in Figure.4.
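Table.4 can be regenerated from the usual full-adder equations, which is a quick way to check each row. The following Python sketch is illustrative only; the function name `full_adder` is hypothetical, and the gate structure mirrors the XOR/AND/OR form used in the behavioral Verilog of part 3.

```python
from itertools import product


def full_adder(a, b, c):
    """1-bit full adder: returns (SUMi, Ci+1) for inputs Ai, Bi, Ci."""
    s = a ^ b ^ c                    # sum is 1 for an odd number of 1 inputs
    cout = (a & b) | ((a ^ b) & c)   # carry is 1 when at least two inputs are 1
    return s, cout


if __name__ == "__main__":
    print("Ai Bi Ci SUMi Ci+1")
    for a, b, c in product((0, 1), repeat=3):
        s, cout = full_adder(a, b, c)
        # The two output bits together must encode the arithmetic sum.
        assert 2 * cout + s == a + b + c
        print(a, b, c, s, cout)
```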

Figure.4(a) 1-bit ripple adder symbol

Figure.4(b) 8-bit ripple adder symbol

Algorithm Unit

The Algorithm Unit performs all calculations, including add, subtract, 2's complement (negate) and comparison. Its core component is an 8-bit full adder implemented as a ripple adder. (The schematic view of this unit is shown in part 4: Transistor Level Schematics and Simulations.) Its truth table is shown in Table.5 and its symbol is shown in Figure.5.

S2  S3  Output
0   0   A+B
0   1   A-B
1   0   !B+1

Table.5 Truth table of algorithm unit

Figure.5 Algorithm unit symbol

Logic Unit

The Logic Unit performs 4 different logic operations: AND, OR, XOR and 1's complement. Selection among these operations is realized by three 8-bit MUX2 controlled by signals S2 and S3. (The schematic view of this unit is shown in part 4: Transistor Level Schematics and Simulations.) Its truth table is shown in Table.6.

S2  S3  Output
0   0   AB
0   1   A+B
1   0   A XOR B
1   1   !B

Table.6 Truth table of logic unit

Figure.6 Logic unit symbol

8-bit Shifter

The 8-bit Shifter performs shifting and rotating in the selected direction on the input data A0-A7. The shifter uses three 8-bit MUX2, two in the first stage and one in the second. The schematic view of this unit is shown in part 4: Transistor Level Schematics and Simulations. Its truth table is shown in Table.7.

Operation Code (S0S1S2S3)   Output
1100                        A Right Rotate
1101                        A Left Rotate
1110                        A Right Shift
1111                        A Left Shift

Table.7 Truth table of 8-bit shifter

Figure.7 8-bit shifter symbol

8-bit Static D-Flip Flop

The 8-bit Static D-Flip Flop is built from eight 1-bit Static D-Flip Flops in parallel and includes a two-phase clock. The schematic view of this unit is shown in part 4: Transistor Level Schematics and Simulations. Its symbol is shown in Figure.8.

Figure.8 8-bit static D-Flip flop symbol

3. Behavioral Simulations

Behavioral simulations of the important blocks are conducted to verify our design. The Verilog code is based on the schematics of each block shown in part 4.

1-bit MUX2

//Verilog HDL for "VerilogTEST", "MY_1MUX2" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_1MUX2 ( out, A, B, S );
  input S;
  input A;
  input B;
  output out;
  wire inv, nand1, nand2;
  // NAND-NAND form: out = A when S = 0, out = B when S = 1
  not (inv, S);
  nand (nand1, A, inv);
  nand (nand2, S, B);
  nand (out, nand1, nand2);
endmodule
`endcelldefine

8-bit MUX2

//Verilog HDL for "VerilogTEST", "MY_8MUX2" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_8MUX2 ( out0, out1, out2, out3, out4, out5, out6, out7,
                  A0, A1, A2, A3, A4, A5, A6, A7,
                  B0, B1, B2, B3, B4, B5, B6, B7, S );
  output out0, out1, out2, out3, out4, out5, out6, out7;
  input A0, A1, A2, A3, A4, A5, A6, A7,
        B0, B1, B2, B3, B4, B5, B6, B7, S;
  // Eight 1-bit multiplexers sharing the select signal S
  MY_1MUX2 mux0 ( out0, A0, B0, S );
  MY_1MUX2 mux1 ( out1, A1, B1, S );
  MY_1MUX2 mux2 ( out2, A2, B2, S );
  MY_1MUX2 mux3 ( out3, A3, B3, S );
  MY_1MUX2 mux4 ( out4, A4, B4, S );
  MY_1MUX2 mux5 ( out5, A5, B5, S );
  MY_1MUX2 mux6 ( out6, A6, B6, S );
  MY_1MUX2 mux7 ( out7, A7, B7, S );
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  A0 = 1'b1; A1 = 1'b0; A2 = 1'b1; A3 = 1'b0;
  A4 = 1'b1; A5 = 1'b0; A6 = 1'b1; A7 = 1'b0;
  B0 = 1'b0; B1 = 1'b1; B2 = 1'b0; B3 = 1'b1;
  B4 = 1'b0; B5 = 1'b1; B6 = 1'b0; B7 = 1'b1;
  S = 1'b0;
  #5 S = 1'b1;
  #5 S = 1'b0;
  #5 S = 1'b1;
  #5 S = 1'b0;
  #5 $finish;
end

Figure.9 Simulation result for 8-bit MUX2

Shifter

//Verilog HDL for "Verilog_TEST", "MY_shifter" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_shifter ( out0, out1, out2, out3, out4, out5, out6, out7,
                    in0, in1, in2, in3, in4, in5, in6, in7, S2, S3 );
  output out7, out6, out5, out4, out3, out2, out1, out0;
  input in0, in1, in2, in3, in4, in5, in6, in7;
  input S2, S3;
  wire ou0, ou1, ou2, ou3, ou4, ou5, ou6, ou7,
       pu0, pu1, pu2, pu3, pu4, pu5, pu6, pu7;
  // Select wiring follows Table.7: S2 picks rotate (0) or shift (1),
  // S3 picks right (0) or left (1).
  // m1, right path: right rotate (S2 = 0) or right shift (S2 = 1)
  MY_8MUX2 m1 (ou0, ou1, ou2, ou3, ou4, ou5, ou6, ou7,
               in1, in2, in3, in4, in5, in6, in7, in0,
               in1, in2, in3, in4, in5, in6, in7, 1'b0, S2);
  // m2, left path: left rotate (S2 = 0) or left shift (S2 = 1)
  MY_8MUX2 m2 (pu0, pu1, pu2, pu3, pu4, pu5, pu6, pu7,
               in7, in0, in1, in2, in3, in4, in5, in6,
               1'b0, in0, in1, in2, in3, in4, in5, in6, S2);
  // m3: direction select
  MY_8MUX2 m3 (out0, out1, out2, out3, out4, out5, out6, out7,
               ou0, ou1, ou2, ou3, ou4, ou5, ou6, ou7,
               pu0, pu1, pu2, pu3, pu4, pu5, pu6, pu7, S3);
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  S2 = 1'b0; S3 = 1'b0;
  in0 = 1'b1; in1 = 1'b1; in2 = 1'b0; in3 = 1'b1;
  in4 = 1'b1; in5 = 1'b0; in6 = 1'b0; in7 = 1'b1;
  #5 S3 = 1'b1;
  #5 S3 = 1'b0;
  #0 S2 = 1'b1;
  #5 S3 = 1'b1;
  #5 $finish;
end
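The two-stage multiplexer structure of the shifter can also be modeled on plain bit lists, which makes the four modes of Table.7 easy to inspect. The Python sketch below is an illustration under stated assumptions, not the authors' test bench: the names `mux8` and `shifter` are hypothetical, index 0 of each list is bit 0, and the select convention (S2 picks rotate or shift, S3 picks right or left) is taken from Table.7 and the schematic description in part 4.

```python
def mux8(a, b, s):
    """8-bit 2-to-1 multiplexer on bit lists: returns a when s = 0, b when s = 1."""
    return list(b) if s else list(a)


def shifter(inp, s2, s3):
    """Two-stage shifter; inp[i] is input bit i, result[i] is output bit i."""
    rrot = inp[1:] + inp[:1]     # out[i] = in[i+1], out[7] = in[0]
    rsh = inp[1:] + [0]          # out[i] = in[i+1], out[7] = 0
    lrot = inp[-1:] + inp[:-1]   # out[i] = in[i-1], out[0] = in[7]
    lsh = [0] + inp[:-1]         # out[i] = in[i-1], out[0] = 0
    right = mux8(rrot, rsh, s2)  # first stage, right path
    left = mux8(lrot, lsh, s2)   # first stage, left path
    return mux8(right, left, s3) # second stage, direction select


if __name__ == "__main__":
    stimulus = [1, 1, 0, 1, 1, 0, 0, 1]  # in0..in7 from the stimulus file
    for s2, s3 in ((0, 0), (0, 1), (1, 0), (1, 1)):
        print(s2, s3, shifter(stimulus, s2, s3))
```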

Figure.10 Simulation result for Shifter

Logic Unit

//Verilog HDL for "VerilogTEST", "MY_logic" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_logic ( eq, L0, L1, L2, L3, L4, L5, L6, L7,
                  A0, A1, A2, A3, A4, A5, A6, A7,
                  B0, B1, B2, B3, B4, B5, B6, B7, S2, S3 );
  output L0, L1, L2, L3, L4, L5, L6, L7, eq;
  input A0, A1, A2, A3, A4, A5, A6, A7;
  input B0, B1, B2, B3, B4, B5, B6, B7;
  input S2, S3;
  wire and0, and1, and2, and3, and4, and5, and6, and7;
  wire or0, or1, or2, or3, or4, or5, or6, or7;
  wire inv0, inv1, inv2, inv3, inv4, inv5, inv6, inv7;
  wire xor0, xor1, xor2, xor3, xor4, xor5, xor6, xor7;
  wire mux10, mux11, mux12, mux13, mux14, mux15, mux16, mux17;
  wire mux20, mux21, mux22, mux23, mux24, mux25, mux26, mux27;
  wire m1, m2, m3, m4, m5, m6;
  // Bitwise AND, OR, NOT (of B) and XOR
  and (and0, A0, B0); and (and1, A1, B1); and (and2, A2, B2); and (and3, A3, B3);
  and (and4, A4, B4); and (and5, A5, B5); and (and6, A6, B6); and (and7, A7, B7);
  or (or0, A0, B0); or (or1, A1, B1); or (or2, A2, B2); or (or3, A3, B3);
  or (or4, A4, B4); or (or5, A5, B5); or (or6, A6, B6); or (or7, A7, B7);
  not (inv0, B0); not (inv1, B1); not (inv2, B2); not (inv3, B3);
  not (inv4, B4); not (inv5, B5); not (inv6, B6); not (inv7, B7);
  xor (xor0, A0, B0); xor (xor1, A1, B1); xor (xor2, A2, B2); xor (xor3, A3, B3);
  xor (xor4, A4, B4); xor (xor5, A5, B5); xor (xor6, A6, B6); xor (xor7, A7, B7);
  // First stage: S3 selects AND/OR and XOR/NOT; second stage: S2 selects between them
  MY_8MUX2 mux1 ( mux10, mux11, mux12, mux13, mux14, mux15, mux16, mux17,
                  and0, and1, and2, and3, and4, and5, and6, and7,
                  or0, or1, or2, or3, or4, or5, or6, or7, S3 );
  MY_8MUX2 mux2 ( mux20, mux21, mux22, mux23, mux24, mux25, mux26, mux27,
                  xor0, xor1, xor2, xor3, xor4, xor5, xor6, xor7,
                  inv0, inv1, inv2, inv3, inv4, inv5, inv6, inv7, S3 );
  MY_8MUX2 mux3 ( L0, L1, L2, L3, L4, L5, L6, L7,
                  mux10, mux11, mux12, mux13, mux14, mux15, mux16, mux17,
                  mux20, mux21, mux22, mux23, mux24, mux25, mux26, mux27, S2 );
  // eq is the OR of all XOR bits: 0 when A = B, 1 when A != B
  or (m1, xor0, xor1);
  or (m2, xor2, xor3);
  or (m3, xor4, xor5);
  or (m4, xor6, xor7);
  or (m5, m1, m2);
  or (m6, m3, m4);
  or (eq, m5, m6);
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  A0 = 1'b1; A1 = 1'b1; A2 = 1'b1; A3 = 1'b0;
  A4 = 1'b0; A5 = 1'b1; A6 = 1'b0; A7 = 1'b0;
  B0 = 1'b0; B1 = 1'b0; B2 = 1'b1; B3 = 1'b0;
  B4 = 1'b0; B5 = 1'b0; B6 = 1'b1; B7 = 1'b1;
  S2 = 1'b0; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 S2 = 1'b1;
  #5 S3 = 1'b0;
  #5 $finish;
end

Figure.10 Simulation result for logic unit

1-bit Ripple Adder

//Verilog HDL for "Verilog_TEST", "MY_adder1" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_adder1 ( C1, SUM0, A0, B0, C0 );
  output C1, SUM0;
  input A0, B0, C0;
  wire o1, a1, a2;
  xor (o1, A0, B0);
  xor (SUM0, o1, C0);   // SUM0 = A0 xor B0 xor C0
  and (a1, o1, C0);
  and (a2, A0, B0);
  or (C1, a1, a2);      // C1 = (A0 xor B0)C0 + A0 B0
endmodule
`endcelldefine

8-bit Ripple Adder

//Verilog HDL for "Verilog_TEST", "MY_8adder" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_8adder ( C8, SUM7, SUM6, SUM5, SUM4, SUM3, SUM2, SUM1, SUM0,
                   A0, A1, A2, A3, A4, A5, A6, A7,
                   B0, B1, B2, B3, B4, B5, B6, B7, C0 );
  output C8, SUM7, SUM6, SUM5, SUM4, SUM3, SUM2, SUM1, SUM0;
  input A0, A1, A2, A3, A4, A5, A6, A7, B0, B1, B2, B3, B4, B5, B6, B7, C0;
  wire C1, C2, C3, C4, C5, C6, C7;
  // The carry out of each stage feeds the carry in of the next stage
  MY_adder1 fa0 (C1, SUM0, A0, B0, C0);
  MY_adder1 fa1 (C2, SUM1, A1, B1, C1);
  MY_adder1 fa2 (C3, SUM2, A2, B2, C2);
  MY_adder1 fa3 (C4, SUM3, A3, B3, C3);
  MY_adder1 fa4 (C5, SUM4, A4, B4, C4);
  MY_adder1 fa5 (C6, SUM5, A5, B5, C5);
  MY_adder1 fa6 (C7, SUM6, A6, B6, C6);
  MY_adder1 fa7 (C8, SUM7, A7, B7, C7);
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  A0 = 1'b1; A1 = 1'b1; A2 = 1'b1; A3 = 1'b1;
  A4 = 1'b1; A5 = 1'b0; A6 = 1'b0; A7 = 1'b0;
  B0 = 1'b0; B1 = 1'b0; B2 = 1'b0; B3 = 1'b0;
  B4 = 1'b0; B5 = 1'b1; B6 = 1'b0; B7 = 1'b0;
  C0 = 1'b0;
  #5 C0 = 1'b1;
  #5 $finish;
end

Figure.11 Simulation result for 8-bit Ripple Adder

Algorithm Unit

//Verilog HDL for "VerilogTEST", "MY_8algorithm" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MY_algorithm8 ( comp, C8, AL0, AL1, AL2, AL3, AL4, AL5, AL6, AL7, C0,
                       A0, A1, A2, A3, A4, A5, A6, A7,
                       B0, B1, B2, B3, B4, B5, B6, B7, S2, S3 );
  output comp, C8, AL0, AL1, AL2, AL3, AL4, AL5, AL6, AL7;
  input C0, A0, A1, A2, A3, A4, A5, A6, A7,
        B0, B1, B2, B3, B4, B5, B6, B7, S2, S3;
  supply0 gnd;
  supply1 vdd;
  wire NB0, NB1, NB2, NB3, NB4, NB5, NB6, NB7;
  wire up0, up1, up2, up3, up4, up5, up6, up7;
  wire dw0, dw1, dw2, dw3, dw4, dw5, dw6, dw7;
  wire C1;
  not (NB0, B0); not (NB1, B1); not (NB2, B2); not (NB3, B3);
  not (NB4, B4); not (NB5, B5); not (NB6, B6); not (NB7, B7);
  // m1: first operand is A (S2 = 0) or zero (S2 = 1)
  MY_8MUX2 m1 (up0, up1, up2, up3, up4, up5, up6, up7,
               A0, A1, A2, A3, A4, A5, A6, A7,
               gnd, gnd, gnd, gnd, gnd, gnd, gnd, gnd, S2);
  // m2: second operand is B (S3 = 0) or !B (S3 = 1)
  MY_8MUX2 m2 (dw0, dw1, dw2, dw3, dw4, dw5, dw6, dw7,
               B0, B1, B2, B3, B4, B5, B6, B7,
               NB0, NB1, NB2, NB3, NB4, NB5, NB6, NB7, S3);
  // m3: carry in is C0 (S3 = 0) or 1 (S3 = 1)
  MY_1MUX2 m3 (C1, C0, vdd, S3);
  // MY_8adder lists its sum ports from SUM7 down to SUM0
  MY_8adder adder8 ( C8, AL7, AL6, AL5, AL4, AL3, AL2, AL1, AL0,
                     up0, up1, up2, up3, up4, up5, up6, up7,
                     dw0, dw1, dw2, dw3, dw4, dw5, dw6, dw7, C1 );
  // comp is the inverse of the sign bit AL7
  nand (comp, AL7, vdd);
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  A0 = 1'b1; A1 = 1'b1; A2 = 1'b1; A3 = 1'b0;
  A4 = 1'b1; A5 = 1'b0; A6 = 1'b0; A7 = 1'b0;
  B0 = 1'b1; B1 = 1'b0; B2 = 1'b1; B3 = 1'b1;
  B4 = 1'b1; B5 = 1'b1; B6 = 1'b0; B7 = 1'b0;
  C0 = 1'b0;
  S2 = 1'b0; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 S2 = 1'b1; S3 = 1'b0;
  #5 $finish;
end

Figure.12 Simulation result for algorithm unit

Full Chip (without Flip Flop)

//Verilog HDL for "VeilogTEST", "MyChip" "functional"
`resetall
`celldefine
`delay_mode_path
`timescale 1ns/10ps
module MyChip ( eq, comp, C8, out0, out1, out2, out3, out4, out5, out6, out7,
                A0, A1, A2, A3, A4, A5, A6, A7,
                B0, B1, B2, B3, B4, B5, B6, B7, C0, S0, S1, S2, S3 );
  output eq, comp, C8, out0, out1, out2, out3, out4, out5, out6, out7;
  input A0, A1, A2, A3, A4, A5, A6, A7,
        B0, B1, B2, B3, B4, B5, B6, B7, C0, S0, S1, S2, S3;
  wire AL0, AL1, AL2, AL3, AL4, AL5, AL6, AL7;
  wire L0, L1, L2, L3, L4, L5, L6, L7;
  wire SH0, SH1, SH2, SH3, SH4, SH5, SH6, SH7;
  wire m0, m1, m2, m3, m4, m5, m6, m7;
  MY_algorithm8 myalgorithm ( comp, C8, AL0, AL1, AL2, AL3, AL4, AL5, AL6, AL7, C0,
                              A0, A1, A2, A3, A4, A5, A6, A7,
                              B0, B1, B2, B3, B4, B5, B6, B7, S2, S3 );
  MY_logic mylogic ( eq, L0, L1, L2, L3, L4, L5, L6, L7,
                     A0, A1, A2, A3, A4, A5, A6, A7,
                     B0, B1, B2, B3, B4, B5, B6, B7, S2, S3 );
  MY_shifter myshifter ( SH0, SH1, SH2, SH3, SH4, SH5, SH6, SH7,
                         A0, A1, A2, A3, A4, A5, A6, A7, S2, S3 );
  // mux1: logic unit (S1 = 0) or shifter (S1 = 1)
  MY_8MUX2 mux1 ( m0, m1, m2, m3, m4, m5, m6, m7,
                  L0, L1, L2, L3, L4, L5, L6, L7,
                  SH0, SH1, SH2, SH3, SH4, SH5, SH6, SH7, S1 );
  // mux2: algorithm unit (S0 = 0) or the output of mux1 (S0 = 1)
  MY_8MUX2 mux2 ( out0, out1, out2, out3, out4, out5, out6, out7,
                  AL0, AL1, AL2, AL3, AL4, AL5, AL6, AL7,
                  m0, m1, m2, m3, m4, m5, m6, m7, S0 );
endmodule
`endcelldefine

// Verilog stimulus file.
// Please do not create a module in this file.
// Default verilog stimulus.
initial
begin
  A0 = 1'b1; A1 = 1'b1; A2 = 1'b1; A3 = 1'b0;
  A4 = 1'b1; A5 = 1'b0; A6 = 1'b0; A7 = 1'b0;
  B0 = 1'b1; B1 = 1'b0; B2 = 1'b1; B3 = 1'b1;
  B4 = 1'b1; B5 = 1'b1; B6 = 1'b0; B7 = 1'b0;
  C0 = 1'b0;
  S0 = 1'b0; S1 = 1'b0; S2 = 1'b0; S3 = 1'b0;
  input0 = 1'b0; input1 = 1'b1;
  #5 S3 = 1'b1;
  #5 S2 = 1'b1; S3 = 1'b0;
  #5 S0 = 1'b1; S2 = 1'b0; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 S2 = 1'b1; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 S1 = 1'b1; S2 = 1'b0; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 S2 = 1'b1; S3 = 1'b0;
  #5 S3 = 1'b1;
  #5 $finish;
end

As shown in Figure.13, our design successfully performs the ALU functions. Based on Table.2, when A=00010111 and B=00111101, the output should be 01010100 (S0S1S2S3 = 0000), 11011010 (S0S1S2S3 = 0001), 00111101 (S0S1S2S3 = 0010), 00010101 (S0S1S2S3 = 1000), 00111111 (S0S1S2S3 = 1001), 00101010 (S0S1S2S3 = 1010), 11000010 (S0S1S2S3 = 1011), 10001011 (S0S1S2S3 = 1100), 00101110 (S0S1S2S3 = 1101), 00001011 (S0S1S2S3 = 1110) and 00101110 (S0S1S2S3 = 1111), consistent with the behavioral simulation.
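The expected outputs listed above can be re-derived with a few lines of arithmetic. The Python sketch below is an independent check, not the authors' simulation; the dictionary names are hypothetical. One assumption is worth stating: the value listed for code 0010 equals B itself, which is what the behavioral algorithm unit produces when S2=1 replaces the A operand with zero and S3=0 keeps B with carry in C0=0, so that row is modeled as 0 + B.

```python
A, B = 0b00010111, 0b00111101
MASK = 0xFF

# Values as listed in the text, keyed by S0S1S2S3.
listed = {
    "0000": "01010100", "0001": "11011010", "0010": "00111101",
    "1000": "00010101", "1001": "00111111", "1010": "00101010",
    "1011": "11000010", "1100": "10001011", "1101": "00101110",
    "1110": "00001011", "1111": "00101110",
}

# Independent re-computation of each operation.
computed = {
    "0000": A + B,                # add
    "0001": A - B,                # subtract (2's complement result)
    "0010": 0 + B,                # assumption: zeroed A operand plus B, C0 = 0
    "1000": A & B,                # AND
    "1001": A | B,                # OR
    "1010": A ^ B,                # XOR
    "1011": ~B,                   # 1's complement of B
    "1100": (A >> 1) | (A << 7),  # right rotate
    "1101": (A << 1) | (A >> 7),  # left rotate
    "1110": A >> 1,               # right shift
    "1111": A << 1,               # left shift
}


def bits(value):
    """Format a value as an 8-bit binary string."""
    return format(value & MASK, "08b")


mismatches = [code for code in listed if bits(computed[code]) != listed[code]]

if __name__ == "__main__":
    print("mismatching codes:", mismatches)
```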

Figure.13 Simulation result for full chip

4. Transistor Level Schematics and Simulations

1-bit MUX2

Signal S selects A or B as the output. When S=0, Out=A; when S=1, Out=B. So we have

$Out = \overline{\overline{A \cdot \overline{S}} \cdot \overline{B \cdot S}} = \overline{S} \cdot A + S \cdot B$

The schematic of the 1-bit MUX2 is shown in Figure.14. Schematics of the NAND gate and inverter are shown in the Appendix.

Figure.14 1-bit MUX2 Schematic

The simulation result of the 1-bit MUX2 is shown in Figure.15. S changes from 0 to 1 and back to 0 with a period of 4ns, while A is fixed at 1 and B is fixed at 0. After S changes, Out is pulled up or down accordingly with a delay of less than 0.5ns.

Figure.15 1-bit MUX2 Simulation
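The NAND-based multiplexer can be checked exhaustively against its specification (Out=A when S=0, Out=B when S=1). The Python sketch below is illustrative; the helper names `nand` and `mux2` are hypothetical, and the gate structure follows the behavioral Verilog for MY_1MUX2 in part 3.

```python
from itertools import product


def nand(x, y):
    """2-input NAND on single bits."""
    return 1 - (x & y)


def mux2(a, b, s):
    """1-bit 2-to-1 multiplexer built from one inverter and three NAND gates."""
    inv = 1 - s
    return nand(nand(a, inv), nand(s, b))


if __name__ == "__main__":
    for a, b, s in product((0, 1), repeat=3):
        # Compare the gate-level result with the behavioral specification.
        assert mux2(a, b, s) == (b if s else a)
        print(a, b, s, mux2(a, b, s))
```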

8-bit MUX2

Figure.16 8-bit MUX2 Schematic

As shown in Figure.16, the 8-bit MUX2 is made by simply placing eight 1-bit MUX2 in parallel with a common control signal S. The test circuit is shown in Figure.17. We set A to 01010101 and B to 10101010 while S changes between 0 and 1 with a period of 4ns.

Figure.17 8-bit MUX2 test circuit

The simulation result is shown in Figure.18. The output signals are consistent with the inputs, with a delay of less than 0.5ns.

Figure.18(a) 8-bit MUX2 simulation result: input signals

Figure.18(b) 8-bit MUX2 simulation result: output signals

Shifter

Figure.19 Shifter Schematics

To realize the functions described in part 2, we connect three 8-bit MUX2 as shown in Figure.19. In the two left MUX2, whose A and B inputs are connected to the appropriate lines of A0-A7 and ground, the S2 signal selects whether the operation is rotate (S2=0) or shift (S2=1). In the right MUX2, the S3 signal selects whether the direction is right (S3=0) or left (S3=1).

Simulations start with fixed S signals and variable A0-A7 inputs (A0, A2, A4 and A6 change with different periods so that they can be distinguished, as shown in Figure.20). Test circuits and simulation results are shown in Figure.21-Figure.24 for S2S3=00, 10, 01 and 11 respectively. The propagation delay is less than 1ns.

Figure.20 Shifter test input signals

Figure.21(a) Shifter test circuit for S2S3=00 (right rotate)

Figure.21(b) Shifter simulation result for S2S3=00 (right rotate)

As shown in Figure.21(b), the A0 signal is right rotated to SH7. A2, A4 and A6 are right rotated to SH1, SH3 and SH5 respectively.

Figure.22(a) Shifter test circuit for S2S3=10 (right shift)

Figure.22(b) Shifter simulation result for S2S3=10 (right shift)

Comparing Figure.22(b) with Figure.21(b), we can clearly tell the difference between rotate and shift. In Figure.22(b), SH7 is always 0, while in Figure.21(b), SH7 is the input A0.

Figure.23(a) Shifter test circuit for S2S3=01 (left rotate)

Figure.23(b) Shifter simulation result for S2S3=01 (left rotate)

As shown in Figure.23(b), the A7 signal (1) is left rotated to SH0. A0, A2, A4 and A6 are left rotated to SH1, SH3, SH5 and SH7 respectively.

Figure.24(a) Test circuit for S2S3=11 (left shift)

Figure.24(b) Shifter simulation result for S2S3=11 (left shift)

The difference between rotate and shift operations is again visible when comparing Figure.23(b) and Figure.24(b). Another simulation with fixed A0-A7 inputs and variable S2S3 is shown in Figure.25.

Figure.25(a) Shifter control simulation inputs

S2S3 follows the sequence 00, 01, 11, 10, which corresponds to right rotate, left rotate, left shift and right shift.

Figure.25(b) Shifter control simulation outputs

8-bit Inverter Gate

The 8-bit inverter gate is realized by placing eight 1-bit inverter gates in parallel, as shown in Figure.26. The schematic of the inverter gate is shown in the Appendix.

Figure.26 Schematic of 8-bit inverter gate

Its test circuit is shown in Figure.27. We set in0=in7, in1=in6, in2=in5 and in3=in4 with different delay times. The corresponding result is shown in Figure.28.

Figure.27 8-bit inverter gate test circuit

Figure.28(a) 8-bit inverter gate simulation inputs

Figure.28(b) 8-bit inverter gate simulation outputs

8-bit AND Gate

The 8-bit AND gate is realized by placing eight 1-bit AND gates in parallel, as shown in Figure.29. The schematic of the AND gate is shown in the Appendix.

Figure.29 Schematic of 8-bit AND gate

Its test circuit is shown in Figure.30. We set A to 01010101 and B0-B7 as variables. The corresponding result is shown in Figure.31.

Figure.30 8-bit AND gate test circuit

Figure.31(a) 8-bit AND gate simulation inputs

Figure.31(b) 8-bit AND gate simulation outputs

Because A1=A3=A5=A7=0, Out1=Out3=Out5=Out7=0, and because A0=A2=A4=A6=1, Out0=B0, Out2=B2, Out4=B4 and Out6=B6. This is consistent with the result in Figure.31(b).

8-bit OR

The 8-bit OR gate is realized by placing eight 1-bit OR gates in parallel, as shown in Figure.32. The schematic of the OR gate is shown in the Appendix.

Figure.32 Schematic of 8-bit OR gate

Its test circuit is shown in Figure.33. We set A to 01010101 and B0-B7 as variables. The corresponding result is shown in Figure.34.

Figure.33 8-bit OR gate test circuit

Figure.34(a) 8-bit OR gate simulation inputs

Figure.34(b) 8-bit OR gate simulation outputs

Because A0=A2=A4=A6=1, Out0=Out2=Out4=Out6=1, and because A1=A3=A5=A7=0, Out1=B1, Out3=B3, Out5=B5 and Out7=B7. This is consistent with the result in Figure.34(b).

8-bit XOR

The 8-bit XOR gate is realized by placing eight 1-bit XOR gates in parallel, as shown in Figure.35. The schematic of the XOR gate is shown in the Appendix.

Figure.35 Schematic of 8-bit XOR gate

Its test circuit is shown in Figure.36. We set A to 01010101 and B0-B7 as variables. The corresponding result is shown in Figure.37.

Figure.36 8-bit XOR gate test circuit

Figure.37(a) 8-bit XOR gate simulation inputs

Figure.37(b) 8-bit XOR gate simulation outputs

All of the outputs should change with the inputs. Since we set A=01010101, B0=B7, B1=B6, B2=B5 and B3=B4, the output pairs Out0/Out7, Out1/Out6, Out2/Out5 and Out3/Out4 should be complementary. This is verified by the result shown in Figure.37(b).

Logic Unit

Based on Table.2, we use three 8-bit MUX2 with signals S2 and S3 to select among the 4 logic operations AND, OR, XOR and 1's complement of B. The equal function is realized by the expression

$eq = \bigvee_{i=0}^{7} (A_i \oplus B_i)$

so that eq=0 when A=B and eq=1 when A!=B. The schematic of the logic unit is shown in Figure.38.

Figure.38 Schematic of logic unit

Figure.39 Logic unit test circuit

Figure.40(a) Logic unit simulation inputs

Figure.40(b) Logic unit simulation outputs

The logic unit is simulated as follows: let A=00100111 and B=11000100, and let S2S3 vary in the sequence 00, 01, 11, 10 as shown in Figure.40(a). The outputs should therefore be AB=00000100, A+B=11100111, !B=00111011 and A XOR B=11100011, which is what is illustrated in Figure.40(b), with a propagation delay of about 1ns.

Another simulation, testing the output eq, is shown in Figure.41-Figure.42, where the input signals are A=B=00100111 and S2S3 is the same as in the previous test. The output signals change from 00100111 to 00100111, 11011000 and 00000000. At t=3ns, S2S3=10, eq starts to pull down to 0, which means A=B.

Figure.41 Logic unit test circuit to test eq

Figure.42 Logic unit simulation result for eq test

1-bit Ripple Adder

Derived from its truth table, the outputs of the 1-bit ripple adder can be expressed as:

$SUM_i = A_i \oplus B_i \oplus C_i$

$C_{i+1} = A_i B_i + (A_i \oplus B_i) C_i$

The schematic of the 1-bit adder is shown in Figure.43.

Figure.43 Schematic of 1-bit adder

As illustrated in Figure.44, we set Ai=1, Bi=0 and Ci as the variable for the simulation. The corresponding result is shown in Figure.45. Its propagation delay is about 1ns.

Figure.44 1-bit adder test circuit

Figure.45(a) 1-bit adder simulation inputs

Figure.45(b) 1-bit adder simulation outputs

8-bit Ripple Adder

Based on the 1-bit ripple adder, the 8-bit ripple adder is realized as shown in Figure.46.

Figure.46 Schematic of 8-bit adder

Figure.47 8-bit adder test circuit

As shown in Figure.47, we test the 8-bit ripple adder as follows: A is set to 00011111, B is set to 00100000 and C0 is the variable. When C0=0, SUM7-0 is 00111111; when C0=1, SUM7-0 is 01000000. The corresponding result is shown in Figure.48, which indicates that the propagation delay accumulates from SUM0 to SUM7, reaching 4ns. This delay heavily limits the maximum clock frequency, which will be discussed later.

Figure.48(a) 8-bit adder simulation inputs
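The accumulation of delay comes from the carry having to ripple through consecutive stages. The Python sketch below is a logical model only (it does not model nanoseconds); the name `ripple_add` is hypothetical. It reproduces the two sums of this test and counts how many stage carries flip when C0 changes from 0 to 1.

```python
def ripple_add(a, b, c0, width=8):
    """Bit-serial ripple-carry addition; returns (sum, [C0, C1, ..., C8])."""
    carries = [c0]
    total = 0
    for i in range(width):
        ai, bi, ci = (a >> i) & 1, (b >> i) & 1, carries[-1]
        total |= (ai ^ bi ^ ci) << i                   # SUMi
        carries.append((ai & bi) | ((ai ^ bi) & ci))   # Ci+1
    return total, carries


A, B = 0b00011111, 0b00100000
sum_lo, carries_lo = ripple_add(A, B, 0)
sum_hi, carries_hi = ripple_add(A, B, 1)

# Number of stage carries (C1..C8) that change when C0 toggles.
rippled_stages = sum(x != y for x, y in zip(carries_lo[1:], carries_hi[1:]))

if __name__ == "__main__":
    print(format(sum_lo, "08b"), format(sum_hi, "08b"), rippled_stages)
```

For this stimulus the carry ripples through six consecutive stages before it is absorbed at bit 6, which is why the higher sum bits settle last.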

Figure.48(b) 8-bit adder simulation outputs

Algorithm Unit

Based on the 8-bit ripple adder shown above, the algorithm unit relies on the following identity: subtracting B from A is the same as adding A and the 2's complement of B. The inputs of the 8-bit adder are controlled by S2S3. When the add operation is required, S2S3 select A, B and C0 as inputs. When the subtract operation is required, S2S3 select A, !B and a carry in of 1 as inputs, which adds A and the negate of B. In addition, A and B can be compared through the highest-order bit of the output, AL7: in 2's complement, a highest-order bit of 1 means the number is negative. When the negate of B is wanted, S2S3 select 0 in place of A, !B, and a carry in of 1. In this way, all three operations are realized. The schematic of the algorithm unit is shown in Figure.49.
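The operand selection described above can be sketched as a small behavioral model. The Python below is illustrative, with a hypothetical function name `algorithm_unit`; it follows the select wiring of the behavioral Verilog in part 3 (S2 replaces A with zero, S3 replaces B with !B and forces the carry in to 1), and takes comp as the inverse of the sign bit AL7.

```python
MASK = 0xFF


def algorithm_unit(a, b, c0, s2, s3):
    """Return (out, c8, comp) for 8-bit operands a, b and carry in c0."""
    up = 0 if s2 else a              # first adder operand: A or zero
    dw = (~b & MASK) if s3 else b    # second adder operand: B or !B
    cin = 1 if s3 else c0            # carry in: C0 or constant 1
    total = up + dw + cin
    out = total & MASK
    c8 = total >> 8                  # carry out of the 8-bit adder
    comp = 1 - (out >> 7)            # inverse of the sign bit AL7
    return out, c8, comp


if __name__ == "__main__":
    a, b = 0b00100111, 0b00010110
    print("add     ", algorithm_unit(a, b, 0, 0, 0))
    print("subtract", algorithm_unit(a, b, 0, 0, 1))
    print("negate  ", algorithm_unit(a, b, 0, 1, 1))
```

With this wiring, S2S3=01 computes A + !B + 1 = A - B and S2S3=11 computes 0 + !B + 1 = -B, both modulo 256.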

Figure.49 Schematic of algorithm unit

Simulations of the algorithm unit for the different S2S3 values are done separately. For S2S3=00 (add operation), we set A=00100111 and B=0001(0/1)11(0/1), where B3 and B0 are variables. For the B3B0 sequence 00, 01, 11, 10, the corresponding outputs should be 00111101, 00111110, 01000110 and 01000101. The test circuit and simulation results are shown in Figure.50-Figure.51. The propagation delay is 3.5ns in this case. For the worst case, it will be 1ns+0.5*8ns=5ns.

Figure.50 Algorithm unit test circuit for S2S3=00

Figure.51(a) Algorithm unit simulation inputs for S2S3=00

Figure.51(b) Algorithm unit simulation outputs for S2S3=00

For S2S3=11 (negate of B), we set A=00100111 and B=00(0/1)1(0/1)110, where B5 and B3 are variables. For the B5B3 sequence 00, 01, 11, 10, the corresponding outputs should be 11101010, 11100010, 11000010 and 11001010. The test circuit and simulation results are shown in Figure.52-Figure.53. The propagation delay is 1.5ns in this case.

Figure.52 Algorithm unit test circuit for S2S3=11

Figure.53(a) Algorithm unit simulation inputs for S2S3=11

Figure.53(b) Algorithm unit simulation outputs for S2S3=11

For S2S3=01 (subtract), we set A=00100111 and B=00(0/1)1(0/1)110, where B5 and B3 are variables. For the B5B3 sequence 00, 01, 11, 10, the corresponding outputs should be 00010001, 00001001, 11101001 and 11110001. The test circuit and simulation results are shown in Figure.54-Figure.55. The propagation delay is 1.5ns in this case.

Figure.54 Algorithm unit test circuit for S2S3=01

Figure.55(a) Algorithm unit simulation inputs for S2S3=01

Figure.55(b) Algorithm unit simulation outputs for S2S3=01

8-bit Static D-Flip Flop

The 8-bit Static D-Flip Flop is made of eight 1-bit Static D-Flip Flops in parallel sharing the same clock signal input. The two-phase clock and the 1-bit Static D-Flip Flop are shown in the Appendix.

Figure.56 Schematic of 8-bit static D-flip flop

Figure.57 8-bit static D-flip flop test circuit

For the simulation, we set d0=d4, d1=d5, d2=d6 and d3=d7 with different periods so that they can be distinguished, as shown in Figure.58(a). The output of the two-phase clock is shown in Figure.58(b). Output data is captured at the rising edge of ClkP, and the propagation delay is 1ns. With this timing, the d0 and d4 inputs are always captured as 0, which is shown in Figure.58(c).

Figure.58(a) 8-bit static D-flip flop simulation inputs

Figure.58(b) 8-bit static D-flip flop simulation two phase clock outputs

Figure.58(c) 8-bit static D-flip flop simulation outputs

Full Chip

The components of the full chip are connected according to our architecture, as shown in Figure.59. Two 8-bit MUX2 with signals S0 and S1 are used to select among the outputs of the Algorithm Unit, Logic Unit and Shifter. The selected outputs are fed into the flip-flop to make the final outputs more stable.

Figure.59 Schematic of full chip

Firstly, simulations of the shifter and logic functions are conducted with the test circuit shown in Figure.60. We set A=10011011 and B=11000110. S0 is fixed at 1 and S2S3 changes in the sequence 00, 01, 11, 10. In this way, the following functions should be performed: AB, A+B, !B, A XOR B, right rotate of A, left rotate of A, left shift of A and right shift of A. The corresponding results should be 10000010, 11011111, 00111001, 01011101, 11001101, 00110111, 00110110 and 01001101. Simulation results are shown in Figure.61. From Figure.61(b), the propagation delay is approximately 3ns. Therefore, for a single pulse input (only one signal changing), the maximum clock frequency is about 300MHz.

Figure.60 Full chip test circuit for logic and shifter

Figure.61(a) Full chip simulation inputs for logic and shifter

Figure.61(b) Full chip simulation outputs for logic and shifter

Secondly, simulations of the algorithm functions (S0=0) are conducted separately with S2S3 fixed at 00, 01 and 11 respectively. For S2S3=00, the test circuit is shown in Figure.62. A is set to 00111111 and B is set to 00100000. C0 is the variable. When C0=0, Out7-Out0 should be 01011111. When C0=1, Out7-Out0 should be 01100000. Simulation results are shown in Figure.63. The initial data stored in the flip-flop is 00000000. After about 3ns, Out7-Out0 changes to 01011111, the output for C0=0. This means the propagation delay without the carry-in effect is 3ns, which is consistent with the value obtained for the logic unit and shifter. After C0 is set to 1 at t=6ns, we observe Out5 pulling up to 1 at t=12ns. Therefore, the total propagation delay for the algorithm function is 6ns. The corresponding maximum clock frequency is about 150MHz.

Figure.62 Full chip test circuit for algorithm 00

Figure.63(a) Full chip simulation inputs for algorithm 00

Figure.63(b) Full chip simulation outputs for algorithm 00

For S2S3=11, the test circuit is shown in Figure.64. A is set to 00000000 and B7-B1 is set to 0000000. B0 is the variable. When B0=0, Out7-Out0 should be 00000000. When B0=1, Out7-Out0 should be 11111111. Simulation results are shown in Figure.65. The initial data stored in the flip-flop is 11111111. After about 3ns, Out7-Out0 changes to 00000000, the output for B0=0. This means the propagation delay without the carry-in effect is 3ns, which is consistent with the value obtained for the logic unit and shifter. After B0 is set to 1 at t=6ns, we observe Out7 pulling up to 1 at t=13.5ns. Therefore, the total propagation delay for the algorithm function should be 6.5ns. The corresponding maximum clock frequency is about 150MHz.

Figure.65(b) Full chip simulation outputs for algorithm 11

For S2S3=01, the test circuit is shown in Figure.66. A is set to 00000000 and B7-B1 is set to 0000000. B0 is the variable. When B0=0, Out7-Out0 should be 00000000. When B0=1, Out7-Out0 should be 11111111. Simulation results are shown in Figure.67. The initial data stored in the flip-flop is 11111111. After about 3ns, Out7-Out0 changes to 00000000, the output for B0=0. This means the propagation delay without the carry-in effect is 3ns, which is consistent with the value obtained for the logic unit and shifter. After B0 is set to 1 at t=6ns, we observe Out7 pulling up to 1 at t=13.5ns. Therefore, the total propagation delay for the algorithm function should be 6.5ns. The corresponding maximum clock frequency is about 150MHz. C8=1 is caused by the calculation of the negate of 0. The output signal comp pulls down to 0 when B0=1, because A=0<B=1.
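The clock-frequency figures quoted in this part follow from taking the reciprocal of the propagation delay. The short Python sketch below makes that conversion explicit; the function name `max_clock_mhz` is hypothetical, and the calculation is an upper bound under the simplifying assumption that the clock period only has to cover the measured propagation delay (flip-flop setup time and design margin are ignored). The rounded values quoted in the text sit below these bounds.

```python
def max_clock_mhz(t_pd_ns):
    """Upper bound on clock frequency in MHz for a propagation delay in ns."""
    return 1e3 / t_pd_ns


if __name__ == "__main__":
    # Delays quoted in the text: logic/shifter path, add path, negate/subtract path.
    for label, delay_ns in (("logic and shifter", 3.0),
                            ("algorithm 00", 6.0),
                            ("algorithm 11 and 01", 6.5)):
        print(f"{label}: {delay_ns} ns -> {max_clock_mhz(delay_ns):.0f} MHz")
```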

Figure.67(b) full chip simulation outputs for algorithm 01

III. LAYOUT

1. Layout of Blocks and Full Chip

Area of each block:
8MUX2: 300um x 50um
Shifter: 900um x 100um
Logic Unit: 1000um x 200um
Algorithm Unit: 1100um x 180um
Static D-Flip Flop: 500um x 60um

To minimize the die area, the full chip uses the floor plan shown below:

Figure.68 Floor plan of full chip

The final die area is 1150um x 650um. Layouts of the blocks and the full chip are shown on the following pages. Notice that only two layers of metal have been used for every block and for the full chip.
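As a rough check of how tightly the floor plan packs the blocks, the listed block areas can be compared with the die area. This is a minimal Python sketch that assumes one instance of each listed block; the actual number of 8MUX2 instances on the chip is not stated here, so the resulting utilization is only indicative.

```python
# Block dimensions in um (width, height), as listed above.
# ASSUMPTION: each block is counted once.
blocks_um = {
    "8MUX2": (300, 50),
    "Shifter": (900, 100),
    "Logic Unit": (1000, 200),
    "Algorithm Unit": (1100, 180),
    "Static D-Flip Flop": (500, 60),
}
die_um = (1150, 650)


def area(dims):
    """Area in um^2 of a (width, height) rectangle."""
    width, height = dims
    return width * height


block_area = sum(area(d) for d in blocks_um.values())
die_area = area(die_um)
utilization = block_area / die_area

print(f"Sum of block areas: {block_area} um^2")   # 533000 um^2
print(f"Die area:           {die_area} um^2")     # 747500 um^2
print(f"Utilization:        {utilization:.1%}")   # 71.3%
```

The remaining area would be taken up by inter-block routing, spacing and any additional block instances.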

Figure.69 Layout of 1MUX2

Figure.70 Layout of 8-bit inv

Figure.71 Layout of 8-bit AND

Figure.72 Layout of 8-bit OR

Figure.73 Layout of 8-bit XOR

Figure.74 Layout of 1-bit ripple adder

Figure.75 Layout of 8-bit ripple adder

Figure.76 Layout of 8MUX2

Figure.77 Layout of shifter

Figure.78 Layout of logic unit

Figure.79 Layout of 8-bit static D-Flip Flop

Figure.80 Layout of full chip

2. Extraction and Layout vs. Schematic

Figure.81 Extraction of shifter

Figure.82 Extraction of logic unit

Figure.83 Extraction of algorithm unit

Figure.84 Extraction of 8-bit Static D-Flip Flop

Figure.85 Extraction of full chip

Figure.86 LVS output of shifter

Figure.87 LVS output of logic unit

Figure.88 LVS output of algorithm unit

Figure.89 LVS output of static D-FF

Figure.90 LVS output of full chip

IV. PERFORMANCE EVALUATION

1. Post Layout Verification

As shown before, the algorithm functions have the largest propagation delay. Therefore, to find the maximum clock frequency allowed in this ALU, we only need to focus on this worst case. Post layout verification is done by post layout simulation (PLS) using the same test circuit configurations as in Figure.62 and Figure.66. Figure.91 shows the PLS for S0S1S2S3 = 0000. Compared with Figure.63, the propagation delay increases to 9ns (previously 5ns).

Figure.91(a) Inputs in PLS for 0000 algorithm function

Figure.91(b) Outputs in PLS for 0000 algorithm function
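The clock frequencies quoted in this report follow from the propagation delays if the clock period is bounded only by the worst-case delay. Here is a minimal Python sketch under that assumption; it ignores flip flop setup time and clock skew, which are not quantified here, and the function name max_clock_mhz is illustrative.

```python
def max_clock_mhz(t_pd_ns):
    """Upper bound on clock frequency (MHz) for a propagation delay in ns.

    ASSUMPTION: clock period = propagation delay, with no setup or skew margin.
    """
    return 1e3 / t_pd_ns


cases = [
    ("pre-layout, 6 ns", 6.0),
    ("pre-layout, 6.5 ns", 6.5),
    ("post-layout, 9 ns", 9.0),
]
for label, t_pd in cases:
    print(f"{label}: {max_clock_mhz(t_pd):.1f} MHz")
# pre-layout, 6 ns: 166.7 MHz
# pre-layout, 6.5 ns: 153.8 MHz
# post-layout, 9 ns: 111.1 MHz
```

The report's figures of about 150MHz before layout and 100MHz after layout are these bounds rounded down, which leaves some margin.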

As shown in Figure.92, power dissipation is measured by monitoring the current drawn from Vdd over one cycle. The RMS current is approximately 15mA, which means the power dissipation is about 75mW.

Figure.92 Vdd current in PLS for 0000 algorithm function

Figure.93 shows the PLS for S0S1S2S3 = 0001. Compared with Figure.67, the propagation delay increases to 9ns (previously 5ns).

Figure.93(a) Inputs in PLS for 0001 algorithm function
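The power figure for the 0000 function is the product of the supply voltage and the measured supply current. This is a minimal Python sketch that assumes a 5V supply; that value is inferred from 75mW / 15mA and is not stated explicitly in this section. The names VDD_VOLTS and power_mw are illustrative.

```python
VDD_VOLTS = 5.0  # ASSUMPTION: inferred from 75 mW / 15 mA, not stated in the text


def power_mw(i_rms_ma, vdd_volts=VDD_VOLTS):
    """Power in mW from the RMS supply current in mA (P = Vdd * I)."""
    return vdd_volts * i_rms_ma


print(f"{power_mw(15):.0f} mW")  # 75 mW
```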

Figure.93(b) Outputs in PLS for 0001 algorithm function

As shown in Figure.94, power dissipation is measured by monitoring the current drawn from Vdd over one cycle. The RMS current is approximately 30mA, which means the power dissipation is about 150mW.

Figure.94 Vdd current in PLS for 0001 algorithm function

2. Conclusion

Performance of our designed ALU is summarized in Table.7.

Block Name               Die Area          Maximum Clock Frequency   Power Dissipation
Shifter                  900um x 100um     1GHz                      N.A.
Logic Unit               1000um x 200um    1GHz                      N.A.
Algorithm Unit           1100um x 180um    300MHz                    N.A.
Static Flip Flop         500um x 60um      1GHz                      N.A.
Full Chip (before PLS)   1150um x 650um    150MHz                    N.A.
Full Chip in 0000        1150um x 650um    100MHz                    75mW
Full Chip in 0001        1150um x 650um    100MHz                    150mW

Table.7 Performance Summary for ALU

As the table shows, the maximum clock frequency is mainly constrained by the algorithm operations. This is because we chose a ripple adder instead of a carry look-ahead adder. The propagation delay of a ripple adder grows with the number of bits, because the carry must pass through every stage in turn. Since our ALU is 8 bit, its maximum clock frequency should be about one half that of a 4-bit ALU with the same design. In conclusion, an 8-bit fixed point ALU is realized with a die area of 1150um x 650um and a maximum clock frequency of 100MHz.

Highlights of our design are:
1. It deals with 8-bit data.
2. 13 functions are realized: AND, OR, XOR, equal, 1's complement, right rotate, left rotate, right shift, left shift, add, subtract, comparison, 2's complement.
3. A static flip flop is added to stabilize the outputs.
4. Power dissipation is only 150mW in the worst case.
5. Only two layers of metal are used for routing, which significantly reduces fabrication difficulty and RC delay.
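The claim that an 8-bit ripple adder runs at about half the clock frequency of a 4-bit one can be illustrated with a simple linear delay model. This is a minimal Python sketch, assuming the delay is a fixed overhead plus one identical stage delay per bit. The per-stage value of 1.125ns is a hypothetical number obtained by dividing the 9ns post-layout delay by 8 with zero overhead; it is not a measured stage delay.

```python
def ripple_delay_ns(n_bits, t_stage_ns, t_fixed_ns=0.0):
    """Worst-case ripple adder delay: the carry passes through every stage."""
    return t_fixed_ns + n_bits * t_stage_ns


def f_max_mhz(t_pd_ns):
    """Clock frequency bound in MHz for a propagation delay in ns."""
    return 1e3 / t_pd_ns


T_STAGE_NS = 9.0 / 8  # HYPOTHETICAL per-bit delay, assuming zero fixed overhead

for n_bits in (4, 8):
    t_pd = ripple_delay_ns(n_bits, T_STAGE_NS)
    print(f"{n_bits}-bit: {t_pd:.2f} ns, {f_max_mhz(t_pd):.1f} MHz")
# 4-bit: 4.50 ns, 222.2 MHz
# 8-bit: 9.00 ns, 111.1 MHz
```

With a nonzero fixed overhead (for example multiplexer and flip flop delays), the 8-bit frequency would be somewhat more than half the 4-bit value, so the factor of one half is best read as an approximation.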

APPENDIX

1-bit inverter

1-bit AND gate

1-bit OR gate

1-bit XOR gate

Two Phase Clock

1-bit Static D-Flip Flop
