ECE 5745 Complex Digital ASIC Design Course Overview Christopher Batten School of Electrical and Computer Engineering Cornell University http://www.csl.cornell.edu/courses/ece5745
- Course goal, structure, motivation
  - What is the goal of the course?
  - How is the course structured?
  - Why should students want to take this course?
- Activity 1: Evaluation of Integer Multiplier
- Case Study: Scalar vs. Vector Processors
  - Example design-space exploration
  - Example real ASIC chips
- Activity 2: Brainstorming for Sorting Accelerator
- Course Logistics
The Computer Engineering Stack
The gap between Application and Technology is too large to bridge in one step (but there are exceptions, e.g., a magnetic compass).
Layers, top to bottom: Application, Algorithm, Programming Language, Operating System, Instruction Set Architecture, Microarchitecture, Register-Transfer Level, Gate Level, Circuits, Devices, Technology.
In its broadest definition, computer architecture is the design of the abstraction/implementation layers that allow us to execute information processing applications efficiently using available manufacturing technologies.
What is Computer Architecture?
- Application Requirements
  - Suggest how to improve architecture
  - Provide revenue to fund development
- Technology Constraints
  - Restrict what can be done efficiently
  - New technologies make new arch possible
- Architecture provides feedback to guide application and technology research directions
Key Metrics in Computer Architecture
- Primary Metrics
  - Execution time (cycles/task)
  - Energy (Joules/task)
  - Cycle time (ns/cycle)
  - Area (µm²)
- Secondary Metrics
  - Performance (ns/task)
  - Average power (Watts)
  - Peak power (Watts)
  - Cost ($)
  - Design complexity
  - Reliability
  - Flexibility
Discuss qualitative first-order analysis from ECE 4750 on board.
[Figure: processors (P) with instruction caches (I$) and data caches (D$) connected by networks]
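Several of the secondary metrics follow from the primary ones by simple arithmetic: time per task is cycles per task times cycle time, and average power is energy per task divided by time per task. The Python sketch below illustrates this; the function names and all numbers are made up for illustration and do not come from the course.

```python
def time_per_task_ns(cycles_per_task, cycle_time_ns):
    """Performance in ns/task = (cycles/task) * (ns/cycle)."""
    return cycles_per_task * cycle_time_ns


def average_power_watts(energy_per_task_joules, task_time_ns):
    """Average power = (Joules/task) / (seconds/task)."""
    return energy_per_task_joules / (task_time_ns * 1e-9)


# Hypothetical design point (illustrative values only)
cycles = 1000      # execution time in cycles per task
cycle_ns = 2.0     # cycle time in ns per cycle
energy = 4e-6      # energy in Joules per task

t_ns = time_per_task_ns(cycles, cycle_ns)
p_watts = average_power_watts(energy, t_ns)
print(f"{t_ns} ns/task, {p_watts:.2f} W average")
```

Two designs with the same cycle count can therefore differ in performance if their cycle times differ, which is why both metrics have to be measured.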
Unanswered Questions from ECE 4750
- How can we quantitatively evaluate area, cycle time, and energy?
- How do we actually implement processors, memories, and networks in a real chip?
- How should we analyze application-specific accelerators?
  - Very loosely coupled memory-mapped accelerators
  - More tightly coupled co-processor accelerators
  - Specialized instructions and functional units
[Figure: processors (P) with I$ and D$, accelerators (Xcel), accelerated instructions, and networks]
Application-Specific Accelerators
[Figure: energy (Joules per task) vs. performance (tasks per second), with a processor power constraint. Design points A to F include: simple processor, superscalar w/ deeper pipelines, out-of-order superscalar superpipelined, multicore, and specialized accelerators.]
Application-Specific Accelerators
[Figure: flexibility vs. specialization. Energy efficiency (tasks per Joule) vs. performance (tasks per second), with a design performance constraint and a design power constraint. Regions and points: embedded architectures, high-performance architectures, simple processor, more flexible accelerator, less flexible accelerator, custom ASIC.]
Goal for ECE 5745 is to answer these questions!
- How can we quantitatively evaluate area, cycle time, and energy?
- How do we actually implement processors, memories, and networks in a real chip?
- How should we analyze application-specific accelerators?
  - Very loosely coupled memory-mapped accelerators
  - More tightly coupled co-processor accelerators
  - Specialized instructions and functional units
Full Custom Design vs. Standard-Cell Design
[Figure: piece of a full-custom multiplier array; full-custom layout in 1.0 µm w/ 2 metal layers]
- Full-Custom Design
  - Requires complete customization of all layers of the wafer
  - Designer is free to do anything, anywhere, though the team usually imposes some design discipline
  - Most time-consuming design style; reserved for very high performance or very high volume chips (Intel microprocessors, RF power amps for cellphones)
- Standard-Cell Design
  - Fixed library of standard cells and SRAM memory generators
  - Register-transfer-level description is automatically mapped to this library of standard cells, then these cells are placed and routed automatically
Standard-Cell Design Methodology
- Also called Cell-Based ICs (CBICs)
- Fixed library of cells plus memory generators
- Cells can be synthesized from HDL, or entered in schematics
- Cells placed and routed automatically
- Requires complete set of custom masks for each design
- Currently most popular hard-wired ASIC type (6.884 will use this)
- Cells arranged in rows
- Cells have standard height but vary in width
- Cells designed to connect power, ground, and wells by abutment
- Generated memory arrays (Mem 1, Mem 2)
[Figure: standard-cell layout showing power rails in M1 (VDD rail, GND rail), clock rail (not typical), well contact under power rail, cell I/O on M2, a NAND2 cell, and a flip-flop; ripple-carry adder with carry chain highlighted]
Source: 6.884 Spring 2005, L01 Introduction, 2 Feb 2005.
Standard-Cell Design Methodology
[Figure: tool flow using the educational Synopsys 90nm process and standard cells]
- Verilog RTL -> Verilog Simulator -> Execution Time
- Verilog RTL -> Synthesis -> Place&Route -> Area & Cycle Time, Gate-Level Model, Layout
- Gate-Level Model -> Verilog Simulator -> Switching Activity -> Power Analysis -> Energy
Example Standard-Cell Chip Plot
[Figure: chip plot of a single-lane vector-thread unit w/ 256 registers]
What is Complex Digital ASIC Design?
Complex digital ASIC design is the process of quantitatively exploring the area, cycle time, execution time, and energy trade-offs of various general-purpose and application-specific designs using automated standard-cell CAD tools, and then transforming the most promising design into layout ready for fabrication.
Course Structure
- Prereq: Computer Architecture
- Part 1: ASIC Design Overview
- Part 2: Digital CMOS Circuits
- Part 3: CAD Algorithms
Part 1: ASIC Design Overview
- Topic 1: Hardware Description Languages
- Topic 2: CMOS Devices
- Topic 3: CMOS Circuits
- Topic 4: Full-Custom Design Methodology
- Topic 5: Automated Design Methodologies
- Topic 6: Closing the Gap
- Topic 7: Clocking, Power Distribution, Packaging, and I/O
- Topic 8: Testing and Verification
Part 2: Digital CMOS Circuits
- Topic 9: Combinational Logic
- Topic 10: Sequential State
- Topic 11: Interconnect
Part 3: CAD Algorithms
- Topic 12: Synthesis Algorithms
  - RTL to Logic Synthesis
  - Technology Independent Synthesis
  - Technology Dependent Synthesis
  - Example: x = a'bc + a'bc' and y = b'c' + ab' + ac simplify to x = a'b and y = b'c' + ac
- Topic 13: Physical Design Automation
  - Placement
  - Global Routing
  - Detailed Routing
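The logic simplification on this slide can be checked by exhaustive enumeration, since each function has only three inputs. The Python sketch below is only a brute-force equivalence check with hypothetical helper names; it is not how a synthesis tool performs the optimization.

```python
from itertools import product


def equivalent(f, g, num_inputs=3):
    """Return True if f and g agree on every input combination."""
    return all(
        bool(f(*bits)) == bool(g(*bits))
        for bits in product([0, 1], repeat=num_inputs)
    )


# x = a'bc + a'bc'  simplifies to  x = a'b
def x_before(a, b, c):
    return (not a and b and c) or (not a and b and not c)


def x_after(a, b, c):
    return (not a) and b


# y = b'c' + ab' + ac  simplifies to  y = b'c' + ac
def y_before(a, b, c):
    return (not b and not c) or (a and not b) or (a and c)


def y_after(a, b, c):
    return (not b and not c) or (a and c)


print(equivalent(x_before, x_after), equivalent(y_before, y_after))
```

In the second function, the term ab' is the consensus of b'c' and ac, so dropping it leaves the truth table unchanged.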
Five-Week Design Project
[Figure: flexibility vs. specialization. Energy efficiency (tasks per Joule) vs. performance (tasks per second), with a design performance constraint and a design power constraint. Regions and points: embedded architectures, high-performance architectures, simple processor, more flexible accelerator, less flexible accelerator, custom ASIC.]
Course Motivation: Comp Arch Research Perspective
[Figure: log-scale trends from 1975 to 2015 for transistors (thousands), frequency (MHz), typical power (Watts), and number of cores, annotated with sequential processor performance and parallel processor performance. Labeled chips: MIPS R2K, Intel Pentium 4, DEC Alpha 21264, AMD 4-Core Opteron, Intel 48-Core Prototype.]
Data partially collected by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond.
Cross-Layer Interaction is Critical
Need to quantitatively understand area, cycle time, and energy trade-offs to be able to create new revolutionary architectures that tackle the most challenging physical design issues.
Course Motivation: Circuits Research Perspective
[Figure: chip with regions labeled "Your Digital Circuit Here" and "Your Analog Circuit Here"]
- Fighter airplane: ~100,000 parts
- Intel Sandy Bridge E: 2.27 billion transistors
Cross-Layer Interaction is Critical
Digital circuit researchers need to appreciate the system-level context for their subsystems; cross-layer interaction can generate some of the most exciting research ideas.
Course Motivation: Industry Development Perspective
- Ever increasing chip size and complexity
  - Microprocessors: 100M gates growing to 1000M gates
  - ASICs: 5M to 10M gates growing to 50M to 100M gates
- Ever increasing costs and design team sizes
  - More than $30M for a 10M gate ASIC
  - More than $1M for re-spin in case of error (not including redesign costs)
- 18 months to design, only an 8 month selling opportunity in market
  - Fewer new chip-starts every year
  - Looking for alternatives (e.g., FPGAs)
- For chip-starts that do happen
  - Conservative design
  - No time for exploration
  - Educated guess and then implement
Cross-Layer Interaction is Critical
Most significant impact on area, cycle time, execution time, and energy is usually at architectural and microarchitectural levels.
Scalar Processors with Multithreading
[Figure: programmer's logical view with threads T0, T1, T2, T3, T4, ..., Ti sharing Memory. Legend: control instructions, memory instructions, execution resources, architectural registers.]
Scalar Processors with Multithreading

Vector-Vector Add
  loop:
    load a, a_ptr
    load b, b_ptr
    add c, a, b
    store c, c_ptr
    a_ptr++
    b_ptr++
    c_ptr++
    branch

Masked Filter
  loop:
    load a, a_ptr
    a_ptr++
    branch a = 0
    op0
    op1
    ...
    branch
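The scalar vector-vector add loop can be modeled in a few lines of Python to make the per-element instruction cost explicit. This is an illustrative sketch: the function name and the instruction counter are assumptions, and the count of 8 simply mirrors the eight pseudo-instructions listed in the loop body on the slide.

```python
def scalar_vvadd(a, b):
    """Model the scalar loop: one element per iteration."""
    c = [0] * len(a)
    dynamic_instructions = 0
    i = 0
    while i < len(a):
        a_val = a[i]            # load a, a_ptr
        b_val = b[i]            # load b, b_ptr
        c_val = a_val + b_val   # add c, a, b
        c[i] = c_val            # store c, c_ptr
        i += 1                  # stands in for a_ptr++, b_ptr++, c_ptr++
        # 2 loads + add + store + 3 pointer increments + branch
        dynamic_instructions += 8
    return c, dynamic_instructions


result, count = scalar_vvadd([1, 2, 3, 4], [10, 20, 30, 40])
print(result, count)
```

Under this model the instruction count grows as 8 instructions per element, regardless of how many threads share the work.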
Scalar Processors with Multithreading
[Figure: typical core microarchitecture with instruction memory, multithreaded cores (threads T0 to T3), and data memory; energy vs. performance plot with a MIMD point]
Vector-SIMD Processors
[Figure: programmer's logical view with control threads CT0, ..., CTj, each with elements elm 0, elm 1, elm 2, elm 3, ..., elm i, sharing Memory. Legend: vector-SIMD arithmetic instructions, vector-SIMD memory instructions, architectural vector register with 4 elements.]
Vector-SIMD Processors

Vector-Vector Add
  loop:
    vload A, a_ptr
    vload B, b_ptr
    vadd C, A, B
    vstore C, c_ptr
    a_ptr++
    b_ptr++
    c_ptr++
    branch

Masked Filter
  loop:
    vload A, a_ptr
    vset F, A = 0
    vop0 under flag F
    vop1 under flag F
    ...
    a_ptr++
    branch
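The vector-SIMD version of the vector-vector add can be modeled the same way, processing one vector register's worth of elements per loop iteration. This Python sketch is an assumption-laden illustration: the function name and instruction counter are hypothetical, and the default vector length of 4 is taken from the 4-element architectural vector register shown in the logical view.

```python
def vector_vvadd(a, b, vlen=4):
    """Model the vector-SIMD loop: up to vlen elements per iteration."""
    c = [0] * len(a)
    dynamic_instructions = 0
    for base in range(0, len(a), vlen):
        va = a[base:base + vlen]                # vload A, a_ptr
        vb = b[base:base + vlen]                # vload B, b_ptr
        vc = [x + y for x, y in zip(va, vb)]    # vadd C, A, B
        c[base:base + len(vc)] = vc             # vstore C, c_ptr
        # 2 vloads + vadd + vstore + 3 pointer increments + branch
        dynamic_instructions += 8
    return c, dynamic_instructions


result, count = vector_vvadd(list(range(8)), [1] * 8)
print(result, count)
```

In this model the same eight instructions per iteration are amortized over up to four elements, so the dynamic instruction count for eight elements is 16 rather than the 64 a one-element-per-iteration loop would need.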
Vector-SIMD Processors
[Figure: typical core microarchitecture with instruction memory, control processor (CP), VIU, vector lanes 0 to 3, VMU, and data memory; energy vs. performance plot with MIMD and vector-SIMD points]
Quantitative Area Evaluation
[Figure: chip plot of a single-lane vector-thread unit w/ 256 registers]
Quantitative Area Evaluation
[Figure: stacked bars of normalized area (0.00 to 1.75) for quad-core w/ vertical multithreading (1, 2, 4, 8 threads) and vector-SIMD (1 core w/ 4 lanes, 4 cores w/ 1 lane each). Components: control logic, register file, memory units, floating-point functional units, integer functional units, control processor, instruction cache, data cache.]
- Single-core multi-lane design reduces area by 15%
- Multi-core single-lane design increases area by 20%
Quantitative Performance and Energy Evaluation
[Figure, left: normalized energy per task (0.4 to 1.6) vs. normalized tasks per second (0.8 to 1.6) for single-lane vector (increasing vlen) and multithreaded multicore (increasing number of threads per core).]
[Figure, right: stacked bars of energy per task for quad-core w/ vertical multithreading (1, 2, 4, 8 threads) and vector-SIMD, 4 cores w/ 1 lane (1, 2, 4, 8 elm/lane). Components: control logic, register file, memory units, int func units, control proc, inst cache, data cache, leakage.]
Simple RISC Processor ASIC
[Figure labels: SP Control, SP Datapath, SP Regfile, AHIP Controller, RAM Interface, VCO, four RAM Subbanks (2KB each)]
Figure 2: STC1 chip plot and die photo. On the chip plot, the Exec Unit is colored blue, and the Fetch Unit is colored green and purple.
Simple RISC Processor ASIC
- RISC processor w/ 8 KB SRAM
- TSMC 0.18 µm process
- 1.7 × 2.1 mm
- 610K Transistors
- 450 MHz at 1.8 V
Figure 5: Shmoo plot for a wide voltage range. The horizontal axis plots supply voltage in mV, and the vertical axis plots frequency in MHz. A * indicates correct operation at that operating point and a dot indicates failure. The shmoo plot shows combined results from two different chips.
Scale Vector-Thread Processor ASIC
- TSMC 0.18 µm
- 7.14 Million Transistors
- 16.6 mm² Core Area
[Figure labels: Cache Tag CAMs, Cache Control, Cache Data RAM Banks, XBAR, CP, VMU, VIU, Vector Lanes 0 to 3]
Energy-Efficiency vs. Performance Results of the Scale VT Processor
[Figure: energy-efficiency vs. performance results for the Scale VT processor]
Brainstorming for Sorting Accelerator
- Sort array of integers stored in memory
- Proc sends Xcel base address and size
- Proc waits for Xcel to finish sorting
- Software Baseline
  - Insertion sort, O(n²)
  - 20% of dynamic instructions are lw/sw
- Hardware Accelerator
  - What is ideal speedup assuming we still use insertion sort?
  - What if we use a different algorithm?
  - Sketch hardware impl of a sorting accelerator
  - Where will accelerator be in the energy/perf space?
[Figure: block diagram with P, I$, Xcel, Arbiter, and D$; energy efficiency (tasks per Joule) vs. performance (tasks per second) plot with the software baseline marked]
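As a starting point for the brainstorming, the software baseline can be sketched as a plain insertion sort instrumented with memory-access counters. This Python sketch is only illustrative: the function name and counters are assumptions, the counters track array loads and stores only, and they are not meant to reproduce the 20% lw/sw figure quoted above, which depends on the actual compiled instruction stream.

```python
def insertion_sort_counted(values):
    """Insertion sort that also counts array loads and stores."""
    data = list(values)
    loads = 0
    stores = 0
    for i in range(1, len(data)):
        key = data[i]
        loads += 1                  # load the element to insert
        j = i - 1
        while j >= 0:
            loads += 1              # load data[j] for the comparison
            if data[j] <= key:
                break
            data[j + 1] = data[j]   # shift the larger element right
            stores += 1
            j -= 1
        data[j + 1] = key           # write the element into its slot
        stores += 1
    return data, loads, stores


sorted_data, loads, stores = insertion_sort_counted([3, 1, 2])
print(sorted_data, loads, stores)
```

Running the function on reverse-sorted inputs of increasing size shows the quadratic growth in memory accesses, which is one way to reason about where an accelerator could save time and energy.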
Take-Away Points
- Complex digital ASIC design is the process of quantitatively exploring the area, cycle time, execution time, and energy trade-offs of general-purpose and application-specific designs using automated standard-cell CAD tools, and then transforming the most promising design into layout ready for fabrication
- The course can provide useful experience with cross-layer interaction for students interested in pursuing research in computer architecture or circuits, or students interested in pursuing a career in industry development of ASICs