Embedded System Design. Embedded System Design Modeling, Synthesis, Verification

Similar documents
Computer organization

Modeling Sequential Elements with Verilog. Prof. Chien-Nan Liu TEL: ext: Sequential Circuit

High-Level Synthesis for FPGA Designs

Architectures and Platforms

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Example-driven Interconnect Synthesis for Heterogeneous Coarse-Grain Reconfigurable Logic

Digital Systems Design! Lecture 1 - Introduction!!

Chapter 2 Basic Structure of Computers. Jin-Fu Li Department of Electrical Engineering National Central University Jungli, Taiwan

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

PROBLEMS #20,R0,R1 #$3A,R2,R4

RAPID PROTOTYPING OF DIGITAL SYSTEMS Second Edition

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

EVALUATION OF SCHEDULING AND ALLOCATION ALGORITHMS WHILE MAPPING ASSEMBLY CODE ONTO FPGAS

CPU Organisation and Operation

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

9/14/ :38

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

ARM Microprocessor and ARM-Based Microcontrollers

ECE410 Design Project Spring 2008 Design and Characterization of a CMOS 8-bit Microprocessor Data Path

EE361: Digital Computer Organization Course Syllabus

TIMING DIAGRAM O 8085

Testing & Verification of Digital Circuits ECE/CS 5745/6745. Hardware Verification using Symbolic Computation

A Lab Course on Computer Architecture

Central Processing Unit

System-On Chip Modeling and Design A case study on MP3 Decoder

System on Chip Design. Michael Nydegger

CHAPTER 7: The CPU and Memory

FSMD and Gezel. Jan Madsen

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

路 論 Chapter 15 System-Level Physical Design

7a. System-on-chip design and prototyping platforms

Software Pipelining - Modulo Scheduling

Getting the Most Out of Synthesis

Central Processing Unit (CPU)

(Refer Slide Time: 00:01:16 min)

Aims and Objectives. E 3.05 Digital System Design. Course Syllabus. Course Syllabus (1) Programmable Logic

Preface. Any questions from last time? A bit more motivation, information about me. A bit more about this class. Later: Will review 1st 22 slides

EECS 583 Class 11 Instruction Scheduling Software Pipelining Intro

PROBLEMS. which was discussed in Section

Introducción. Diseño de sistemas digitales.1

More Verilog. 8-bit Register with Synchronous Reset. Shift Register Example. N-bit Register with Asynchronous Reset.

Chapter 4 Register Transfer and Microoperations. Section 4.1 Register Transfer Language

Finite State Machine Design and VHDL Coding Techniques

Life Cycle of a Memory Request. Ring Example: 2 requests for lock 17

Computer Organization and Components

Sequential Circuit Design

Design of Digital Circuits (SS16)

Quartus II Software Design Series : Foundation. Digitale Signalverarbeitung mit FPGA. Digitale Signalverarbeitung mit FPGA (DSF) Quartus II 1

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

CLASS: Combined Logic Architecture Soft Error Sensitivity Analysis

Implementation Details

Microprocessor & Assembly Language

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

VHDL GUIDELINES FOR SYNTHESIS

FINITE STATE MACHINE: PRINCIPLE AND PRACTICE

Techniques for Accurate Performance Evaluation in Architecture Exploration

The 104 Duke_ACC Machine

Software-Programmable FPGA IoT Platform. Kam Chuen Mak (Lattice Semiconductor) Andrew Canis (LegUp Computing) July 13, 2016

Instruction Set Architecture. Datapath & Control. Instruction. LC-3 Overview: Memory and Registers. CIT 595 Spring 2010

Register File, Finite State Machines & Hardware Control Language

CPU Organization and Assembly Language

Computer Systems Design and Architecture by V. Heuring and H. Jordan

SystemC Tutorial. John Moondanos. Strategic CAD Labs, INTEL Corp. & GSRC Visiting Fellow, UC Berkeley

I/O. Input/Output. Types of devices. Interface. Computer hardware

MICROPROCESSOR AND MICROCOMPUTER BASICS

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

A New Paradigm for Synchronous State Machine Design in Verilog

Levels of Programming Languages. Gerald Penn CSC 324

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Microelectronic System-on-Chip Modeling using Objects and their Relationships

COMPUTERS ORGANIZATION 2ND YEAR COMPUTE SCIENCE MANAGEMENT ENGINEERING UNIT 5 INPUT/OUTPUT UNIT JOSÉ GARCÍA RODRÍGUEZ JOSÉ ANTONIO SERRA PÉREZ

YAML: A Tool for Hardware Design Visualization and Capture

Overview. CISC Developments. RISC Designs. CISC Designs. VAX: Addressing Modes. Digital VAX

Modeling a GPS Receiver Using SystemC

Solutions. Solution The values of the signals are as follows:

CS:APP Chapter 4 Computer Architecture. Wrap-Up. William J. Taffe Plymouth State University. using the slides of

Below is a diagram explaining the data packet and the timing related to the mouse clock while receiving a byte from the PS-2 mouse:

Module-I Lecture-I Introduction to Digital VLSI Design Flow

MACHINE ARCHITECTURE & LANGUAGE

Execution Cycle. Pipelining. IF and ID Stages. Simple MIPS Instruction Formats

High-speed image processing algorithms using MMX hardware

ECE 3401 Lecture 7. Concurrent Statements & Sequential Statements (Process)

Codesign: The World Of Practice

Instruction Set Design

l C-Programming l A real computer language l Data Representation l Everything goes down to bits and bytes l Machine representation Language

Introduction to Functional Verification. Niels Burkhardt

Embedded Systems. introduction. Jan Madsen


INTERNATIONAL JOURNAL OF ELECTRONICS AND COMMUNICATION ENGINEERING & TECHNOLOGY (IJECET)

Understanding Verilog Blocking and Non-blocking Assignments

Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor

Contents. System Development Models and Methods. Design Abstraction and Views. Synthesis. Control/Data-Flow Models. System Synthesis Models

MICROPROCESSOR. Exclusive for IACE Students iacehyd.blogspot.in Ph: /422 Page 1

Network Traffic Monitoring an architecture using associative processing.

Client/Server Computing Distributed Processing, Client/Server, and Clusters

Abstract. MEP; Reviewed: GAK 10/17/2005. Solution & Interoperability Test Lab Application Notes 2005 Avaya Inc. All Rights Reserved.

Chapter 13: Verification

Engineering Change Order (ECO) Support in Programmable Logic Design

CHAPTER 4 MARIE: An Introduction to a Simple Computer

Transcription:

Modeling, Synthesis, Verification Daniel D. Gajski, Samar Abdi, Andreas Gerstlauer, Gunar Schirner 5/25/2 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 2

HW Synthesis Design Flow Specification Compilation Estimation HLS Model generation Logic synthesis Layout RTL Component Library Compilation Tool Model Estimation HLS Allocation Binding Scheduling Model Generation RTL Model RTL Tools... 5/25/2 3 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 4 2

RTL Architecture ler FSM controller Programmable controller Datapath components Storage components Functional units Connection components Pipelining Functional unit Datapath Structure Chaining Multicycling li Forwarding Branch prediction Caching 5/25/2 5 RTL Architecture with FSM ler Simple architecture Small number of states Inputs Data Inputs put Logic Signals B B2 RF Input Logic Statust Signals ALU MUL Memory FSM ler puts B3 Datapath Data puts 5/25/2 6 3

RTL Architecture with Programmable ler Complex architecture and datapath pipelining Advanced structural features Large number of states (CW or IS) Inputs Data Inputs PC Cmem or PMem IR or CWR Signals B B2 RF Offset AG Status Address SR ALU MUL Memory Programmable ler B3 Datapath puts Data Inputs 5/25/2 7 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 8 4

Input Specification Programg language (C/C, ) Programg semantics requires presynthesis optimization System description language (SystemC, ) Simulation semantics requires presynthesis optimization /Data flow graph (CDFG) CDFG generation requires dependence analysis Finite state machine with data (FSMD) State interpretation requires some kind of scheduling RTL netlist RTL design that requires only input and output logic synthesis Hardware description language (Verilog / VHDL) HDL description requires RTL library and logic synthesis 5/25/2 9 C Code for Ones Counter Programg language semantics Sequential execution, Coding style to imize coding HW design Parallel execution, Communication through signals : int OnesCounter(int Data){ 2: int Ocount = ; 3: int Temp, Mask = ; 4: while (Data > ) { 5: Temp = Data & Mask; 6 Ocount = Ocount Temp; 7: Data >>= ; 8: } 9: return Ocount; : } Functionbased C code : while() { 2: while (Start == ); 3: Done = ; 4: Data = Input; 5: Ocount = ; 6: Mask = ; 7: while (Data>) { 8: Temp = Data & Mask; 9: Ocount = Ocount Temp; : Data >>= ; : } 2: put = Ocount; 3: Done = ; 4: } RTLbased C code 5/25/2 5

CDFG for Ones Counter /Data flow graph Resembles programg g language g Loops, ifs, basic blocks (BBs) Explicit dependencies dependences between BBs Data dependences inside BBs Missing dependencies between BBs Start Input Data MaskOcount Done Data Mask Ocount Done & >> Data Ocount Done Data > put Done 5/25/2 FSMD for Ones Counter FSMD more detailed then CDFG States may represent clock cycles Conditionals and statements executed concurrently All statement in each state executed concurrently signal and variable assignments executed concurrently FSMD includes scheduling FSMD doesn't specify binding or connectivity 5/25/2 2 6

CDFG and FSMD for Ones Counter 5/25/2 3 RTL Specification for Ones Counter RTL Specification Input logic table Present Inputs: Next put: State Start Data = State Done S S S S S ler and datapath netlist Input and output tables for logic synthesis RTL library needed for netlist S7 S7 S Start Inport put logic table (RF[] = Data, RF[] = Mask, RF[2] = Ocount, RF[3] = Temp) put Logic Signals B B2 RF State RF Read Port A RF Read Port B ALU Shifter RF selector RF Write port S Z S Inport RF[] Z RF[2] RF[2] subtract pass B3 RF[2] Z Input Logic status Shifter ALU RF[2] increment pass B3 RF[] Z RF[] RF[] AND pass B3 RF[3] Z FSM ler B3 Datapath RF[2] RF[3] add pass B3 RF[2] Z Done port RF[] pass shift right B3 RF[] Z S7 RF[2] disable enable 5/25/2 4 7

HDL description of Ones Counter HDL description Same as RTL description Several levels of abstraction Variable binding to storage Operation binding to FUs Transfer binding to connections Partial HLS may be needed ler and datapath netlists must be generated : // 2: always@(posedge clk) 3: begin : output_logic 4: case (state) 5: // 6: : begin 7: B = RF[]; 8: B2 = RF[]; 9: B3 = alu(b, B2, l_and); : RF[3] = B3; : next_state = ; 2: end 3: // 4: S7: begin 5: B = RF[2]; 6: port <= B; 7: done <= ; 8: next_state = S; 9: end 2: endcase 2: end 22: endmodule 5/25/2 5 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 6 8

Profiling and Estimation Presynthesis optimization Preliary scheduling Simple scheduling algorithm Profiling Operation usage Variable lifetimes Connection usage Estimation Performance Cost Power 5/25/2 7 SquareRoot Algorithm (SRA) S a = In b = In2 SQR = ((.875x.5y), x) Start x = (, ) y = (, ) S t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) S7 Done = = t7 5/25/2 8 9

Variable and Operation Usage S S7 a b t t2 x y t3 t4 t5 t6 t7 No. of live variables 2 2 2 3 3 2 S S7 Variable usage Max. no. of units abs 2 2 >> 2 2 No. of operations 2 2 Operation usage S S S7 a = In b = In2 Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = = t7 5/25/2 9 Connectivity usage S S7 Max. no. of units S a = In b = In2 abs 2 2 >> 2 2 No. of 2 2 operations a b t t2 x y t3 t4 t5 t6 t7 Operation usage S Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> abs I O t5 = x t3 abs2 I O I I O t6 = t4 t5 I I I O I O >>3 I O >> I O I I O I I O Connectivity usage S7 t7 = ( t6, x ) Done = = t7 5/25/2 2

Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 2 Datapath Synthesis Variable Merging (Storage Sharing) Operation Merging (FU Sharing) Connection Merging (Bus Sharing) Register merging (RF sharing) Chaining and MultiCycling Data and Pipelining 5/25/2 22

Gain in register sharing Register sharing Grouping variables with nonoverlapping lifetimes Sharing reduces connectivity cost a c b d a, c b, d x y x, y Partial FSMD Datapath without register sharing Datapath with register sharing 5/25/2 23 General partitioning algorithm Compatibility graph Compatibility: Nonoverlapping in time Not using the same resource Noncompatible: Overlapping in time Using the same resource Priority Critical path Same source, same destination no Start Create compatibility graph Merge highest priority nodes Upgrade compatibility graph All nodes incompatible yes Stop 5/25/2 24 2

/ Variable Merging for SRA / / S a = In b = In2 Start (a) Initial compatibility graph / (b) Compatibility graph after merging t3, t5, and t6 S t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 (c) Compatibility graph after merging t, x, and t7 S7 t7 = ( t6, x ) Done = = t7 5/25/2 25 Variable Merging for SRA / / a t4 / t2 y t3 t5 t6 b t x t7 (a) Initial compatibility graph / (d) Compatibility graph after merging t2 and y (b) Compatibility graph after merging t3, t5, and t6 (e) Final compatibility graph R = [ a, t, x, t7 ] R2 = [ b, t2, y, t3, t5, t6 ] R3 = [ t4 ] (c) Compatibility graph after merging t, x, and t7 (f) Final register assignments 5/25/2 26 3

/ Variable Merging for SRA / / / (a) Initial compatibility graph (b) Compatibility graph after merging t3, t5, and t6 (c) Compatibility graph after merging t, x, and t7 (d) Compatibility graph after merging t2 and y R = [ a, t, x, t7 ] R2 = [ b, t2, y, t3, t5, t6 ] R3 = [ t4 ] (e) Final compatibility graph (f) Final register assignments 5/25/2 27 Datapath with Shared Registers Variables combined into registers One functional unit for each operation R R2 R3 a b >> >>3 5/25/2 28 4

Gain in Functional Unit Sharing Functional unit sharing Smaller number of FUs Larger connectivity cost Si x = a b a c b d Sj y = c d x / y Partial FSMD Nonshared design Shared design 5/25/2 29 Operation Merging for SRA / / / / / / 2/ / / / / 2/ / Initial compatibility graph / / / / / / 2/ 2/ Compatibility graph after merging of and 5/25/2 3 5

Operation Merging for SRA / / / / / / 2/ / / / / 2/ / Initial compatibility graph Compatibility graph after merging of, and / / / / / / 2/ 2/ Compatibility graph after merging of and Final graph partitions 5/25/2 3 Datapath with Shared Registers and FUs Variables combined into registers Operations combined into functional units R R2 R3 abs/ abs/// >> >>3 5/25/2 32 6

Connection usage for SRA S a = In b = In2 Start S S S7 A B C D E F G H I J K L M N Connection usage table Merge connections into buses S S7 t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = = t7 5/25/2 33 Connection Merging for SRA Combine connection not used at the same time Priority to same source, same destination Priority to imum groups M I K S S S7 A B C D E F G H I J K L M N Connection usage table J L N Compatibility graph for input buses Compatibility graph for output buses Bus assignment 5/25/2 34 7

Datapath with Shared Registers, FUs and Buses Minimal SRA architecture 3 registers 4 (2) functional units 4b buses 5/25/2 35 Register Merging into RFs Register merging: Port sharing Merge registers with nonoverlapping access times No of ports is equal to simultaneous read/write accesses Register assignment R R2 R R2 R3 S S S7 Register access table S S a = In b = In2 Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) R3 Compatibility graph S7 Done = = t7 5/25/2 36 8

Datapath with Shared RF RF imize connectivity cost by sharing ports In In2 R R3 R2 Bus Bus2 abs/ abs/// H >>3 >> Bus3 Bus4 5/25/2 37 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 38 9

Datapath with Chaining Chaining connects two or more FUs S a = In b = In2 Allows execution of two or more operation in a single clock cycle Improves performance at no cost S Start = t = t2 = x = ( t, t2 ) t3 = ( t, t2 )>>3 t4 = ( t, t2 )>> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = In 2 = t7 5/25/2 39 Datapath with Chained and MultiCycled FUs S a = In b = In2 Start Multicycling allows use of slower FUs Multicycling allows faster clockcycle S t = t2 = x = ( t, t2 ) t3 = ( t, t2 )>>3 [t4]= ( t, t2 )>> t4 = [ ( t, t2 ) >>] Bus Bus 2 In In 2 R R2 R3 t5 = x t3 abs/ abs// t6 = t4 t5 Bus 3 >>3 >> t7 = In ( 2t6, x ) Bus 4 S7 Done = = t7 5/25/2 4 2

Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 4 Pipelining Functional Unit pipelining Two or more operation executing at the same time Datapath pipelining Two or more register transfers executing at the same time Pipelining Two or more instructions generated at the same time 5/25/2 42 2

Functional Unit Pipelining () Operation delay cut in half Shorter clock cycle Dependencies may delay some states Extra NO states reduce performance gain 5/25/2 43 Functional Unit Pipelining (2) Tig diagram with 4 additional NO states S S NO NO NO S7 NO S8 Read R a t t t7 Read R2 b t2 t2 t3 t5 t6 Read R3 t4 ALU stage ALU stage 2 Shifters >>3 >> Write R a t x t7 Write R2 b t2 t3 t5 t6 Write R3 t4 Write t7 5/25/2 44 22

Datapath Pipelining () Registertoregister delay cut in equal parts Much shorter clock cycle Dependencies may delay some states Extra NO states reduce performance gain In In2 R R2 R3 Bus Bus2 ALU Bus3 Bus4 >>3 >> 5/25/2 45 Datapath pipelining (2) S a = In b = In2 In In2 R R2 R3 S Start t = x = ( t, t2 ) t3 = ( t, t2 )>>3 t4 = ( t, t2 )>> t2 = t5 = x t3 t6 = t4 t5 Bus Bus2 ALU >>3 >> Bus3 Bus4 Tig diagram with additional NO clock cycles Cycles 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 Read R a t t x x t7 Read R2 b t2 t2 t3 t5 t6 Read R3 t4 ALUIn(L) a t t x t4 x ALUIn(R) b t2 t2 t3 t5 t6 ALU S7 t7 = ( t6, x ) S8 Done = In 2 = t7 Shifters >>3 >> Write R a t x t7 Write R2 b t2 t3 t5 t6 Write R3 t4 Write t7 5/25/2 46 23

Datapath and Pipelining () Fetch delay cut into several parts Shorter clock cycle Conditionals may delay some states Extra NO states reduce performance gain Inputs signals Data Inputs S PC CMem CWR Register RF Mem Bus a>b x = c d AG Offset Status Signals ALU / Bus2 SR Bus3 y = x ler puts Register Data puts Datapath 5/25/2 47 Data and Pipelining (2) Inputs signals Data Inputs 3 NO cycles for the branch 2 NO cycles for data dependence PC AG CMem Offset CWR Status Signals Register ALU RF Mem Bus Bus2 / SR Bus3 S a>b y = x x = c d ler puts Register Data puts Datapath Tig diagram with additional NO clock cycles Cycle Operation 2 3 4 5 6 7 8 9 Read PC 2 3 4 5 6 7 8 9 2 Read CWR S NO NO NO NO NO Read RF(L) a c x Read RF(R) b d Write ALUIn(L) a c x Write ALUIn(R) b d Write ALU cd x Write RF x y Write SR a>b Write PC 2 3 4/7 5 6 7 8 9 2 5/25/2 48 24

Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 49 Scheduling Scheduling assigns clock cycles to register transfers Nonconstrained scheduling ASAP scheduling ALAP scheduling Constrained scheduling Resource constrained (RC) scheduling Given resources, imize metrics (time, power, ) Time constrained (TC) scheduling Given time, imize resources (FUs, storage, connections) 5/25/2 5 25

C and CDFG for SRA Algorithm a=in b=in2 Start t= t2= x=(t,t2) y=(t,t2) t3=x>>3 t4=y>> t5=xt3 t6=t4t5 t4 t5 t7=(t6,x) Done= =t7 In In 2 a b Start a b >> >>3 Done C flowchart CDFG 5/25/2 5 ASAP and ALAP Scheduling In In 2 a b a b a b S Start a b >> >>3 >> >>3 >> >>3 S7 S8 Done ASAP schedule ALAP schedule 5/25/2 52 26

RC Scheduling Perfrom ASAP a b a b In In 2 a b Start S Perfrom ALAP Detere mobilities a b >> >>3 >> >>3 Create ready list Sort ready list by mobilities >> >>3 S7 Schedule ops from ready list Delete scheduled ops from ready list Add new ops to ready list Done S8 ASAP schedule ALAP schedule no Increment state index All ops scheduled? yes 5/25/2 53 RC Scheduling Perfrom ASAP a b a b a b Perfrom ALAP S Detere mobilities Create ready list >> >>3 >>3 Sort ready list by mobilities >> >>3 >>3 Schedule ops from ready list Delete scheduled ops from ready list Add new ops to ready list S7 S8 >> >> no Increment state index All ops scheduled? yes ASAP schedule ALAP schedule Ready list with mobilities (ALAP ASAP) RC schedule (for single FU and 2 shifters) 5/25/2 54 27

TC Scheduling Perform ASAP a b a b AU units Probability sum/state Shift units Perform ALAP S S Detere mobilities ranges.83 Create probability distribution graphs >> >>3 >>3.83.83.83.83 no All ops scheduled? yes >>.33 no Any gain? yes Schedule op with imum gain S7 S7.5 Schedule op with imum loss S8 ASAP ALAP Initial probability distribution graph 5/25/2 55 Distribution Graphs for TC scheduling Initial probability distribution graph Graph after,, and were scheduled 5/25/2 56 28

Distribution Graphs for TC scheduling AU units Probability sum/state Shift units S >> >>3 >> S7 Graph after,, and were scheduled Graph after,,,, >>3, and >> were scheduled 5/25/2 57 Distribution Graphs for TC scheduling AU units Probability sum/state Shift units AU units Probability sum/state Shift units S S >>3 >>3 >> >> S7 S7 Graph after,,,, >>3, and >> were scheduled Distribution graph for final schedule 5/25/2 58 29

TC Scheduling a b a b a b In In 2 S a b Start a b >> >>3 >>3 >>3 >> >> >> >>3 S7 Done S8 ASAP ALAP TC schedule 5/25/2 59 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 6 3

Interface Synthesis Combine process and channel codes HW and protocol clock cycles may differ Insert a businterface component Communication in three parts: Freely schedulable code Scheduled with process code Schedule constrained code MAC driver from library for selected bus interface Bus interface Implemented by bus interface component from library 5/25/2 6 Bus Interface ler () ler Datapath CMem signals Reg RF Mem offset Bus Bus 2 AG Status signals ALU / ready ack Cntrl Addr Data InData Input logic put logic signals Bus 3 Bus 4 Address INC Write Queue Read Queue MAC driver REQUEST GRANT CONTROL ADDRESS DATA 5/25/2 62 3

Bus Interface ler (2) ler Datapath CMem signals Reg RF Mem Bus offset Bus 2 AG Status signals ALU / Bus 3 Bus 4 ready ack Cntrl Addr Data InData signals put logic Write Read Queue Queue Input Address logic INC MAC driver REQUEST GRANT CONTROL ADDRESS DATA Bus protocol 5/25/2 63 Transducer/ Bridge Translates one protocol into another ler receives data with protocol and writes into queue ler2 reads from queue and sends data with protocol2 PE Bus Bus 2 Transducer PE2 Interrupt Ready Ack Interrupt2 Processor <clk> ler <clk> Ready2 Ack2 ler2 <clk2> Processor2 <clk2> Data Data2 Memory Queue <clk3> Memory2 5/25/2 64 32

Conclusion Synthesis techniques Variable Merging (Storage Sharing) Operation Merging (FU Sharing) Connection Merging (Bus Sharing) Architecture techniques Chaining and MultiCycling Data and Pipelining Forwarding and Caching Scheduling Metric constrained scheduling Interfacing Part of HW component Bus interface unit If too complex, use partial order 5/25/2 65 33