Modeling, Synthesis, Verification Daniel D. Gajski, Samar Abdi, Andreas Gerstlauer, Gunar Schirner 5/25/2 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 2
HW Synthesis Design Flow Specification Compilation Estimation HLS Model generation Logic synthesis Layout RTL Component Library Compilation Tool Model Estimation HLS Allocation Binding Scheduling Model Generation RTL Model RTL Tools... 5/25/2 3 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 4 2
RTL Architecture ler FSM controller Programmable controller Datapath components Storage components Functional units Connection components Pipelining Functional unit Datapath Structure Chaining Multicycling li Forwarding Branch prediction Caching 5/25/2 5 RTL Architecture with FSM ler Simple architecture Small number of states Inputs Data Inputs put Logic Signals B B2 RF Input Logic Statust Signals ALU MUL Memory FSM ler puts B3 Datapath Data puts 5/25/2 6 3
RTL Architecture with Programmable ler Complex architecture and datapath pipelining Advanced structural features Large number of states (CW or IS) Inputs Data Inputs PC Cmem or PMem IR or CWR Signals B B2 RF Offset AG Status Address SR ALU MUL Memory Programmable ler B3 Datapath puts Data Inputs 5/25/2 7 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 8 4
Input Specification Programg language (C/C, ) Programg semantics requires presynthesis optimization System description language (SystemC, ) Simulation semantics requires presynthesis optimization /Data flow graph (CDFG) CDFG generation requires dependence analysis Finite state machine with data (FSMD) State interpretation requires some kind of scheduling RTL netlist RTL design that requires only input and output logic synthesis Hardware description language (Verilog / VHDL) HDL description requires RTL library and logic synthesis 5/25/2 9 C Code for Ones Counter Programg language semantics Sequential execution, Coding style to imize coding HW design Parallel execution, Communication through signals : int OnesCounter(int Data){ 2: int Ocount = ; 3: int Temp, Mask = ; 4: while (Data > ) { 5: Temp = Data & Mask; 6 Ocount = Ocount Temp; 7: Data >>= ; 8: } 9: return Ocount; : } Functionbased C code : while() { 2: while (Start == ); 3: Done = ; 4: Data = Input; 5: Ocount = ; 6: Mask = ; 7: while (Data>) { 8: Temp = Data & Mask; 9: Ocount = Ocount Temp; : Data >>= ; : } 2: put = Ocount; 3: Done = ; 4: } RTLbased C code 5/25/2 5
CDFG for Ones Counter /Data flow graph Resembles programg g language g Loops, ifs, basic blocks (BBs) Explicit dependencies dependences between BBs Data dependences inside BBs Missing dependencies between BBs Start Input Data MaskOcount Done Data Mask Ocount Done & >> Data Ocount Done Data > put Done 5/25/2 FSMD for Ones Counter FSMD more detailed then CDFG States may represent clock cycles Conditionals and statements executed concurrently All statement in each state executed concurrently signal and variable assignments executed concurrently FSMD includes scheduling FSMD doesn't specify binding or connectivity 5/25/2 2 6
CDFG and FSMD for Ones Counter 5/25/2 3 RTL Specification for Ones Counter RTL Specification Input logic table Present Inputs: Next put: State Start Data = State Done S S S S S ler and datapath netlist Input and output tables for logic synthesis RTL library needed for netlist S7 S7 S Start Inport put logic table (RF[] = Data, RF[] = Mask, RF[2] = Ocount, RF[3] = Temp) put Logic Signals B B2 RF State RF Read Port A RF Read Port B ALU Shifter RF selector RF Write port S Z S Inport RF[] Z RF[2] RF[2] subtract pass B3 RF[2] Z Input Logic status Shifter ALU RF[2] increment pass B3 RF[] Z RF[] RF[] AND pass B3 RF[3] Z FSM ler B3 Datapath RF[2] RF[3] add pass B3 RF[2] Z Done port RF[] pass shift right B3 RF[] Z S7 RF[2] disable enable 5/25/2 4 7
HDL description of Ones Counter HDL description Same as RTL description Several levels of abstraction Variable binding to storage Operation binding to FUs Transfer binding to connections Partial HLS may be needed ler and datapath netlists must be generated : // 2: always@(posedge clk) 3: begin : output_logic 4: case (state) 5: // 6: : begin 7: B = RF[]; 8: B2 = RF[]; 9: B3 = alu(b, B2, l_and); : RF[3] = B3; : next_state = ; 2: end 3: // 4: S7: begin 5: B = RF[2]; 6: port <= B; 7: done <= ; 8: next_state = S; 9: end 2: endcase 2: end 22: endmodule 5/25/2 5 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 6 8
Profiling and Estimation Presynthesis optimization Preliary scheduling Simple scheduling algorithm Profiling Operation usage Variable lifetimes Connection usage Estimation Performance Cost Power 5/25/2 7 SquareRoot Algorithm (SRA) S a = In b = In2 SQR = ((.875x.5y), x) Start x = (, ) y = (, ) S t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) S7 Done = = t7 5/25/2 8 9
Variable and Operation Usage S S7 a b t t2 x y t3 t4 t5 t6 t7 No. of live variables 2 2 2 3 3 2 S S7 Variable usage Max. no. of units abs 2 2 >> 2 2 No. of operations 2 2 Operation usage S S S7 a = In b = In2 Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = = t7 5/25/2 9 Connectivity usage S S7 Max. no. of units S a = In b = In2 abs 2 2 >> 2 2 No. of 2 2 operations a b t t2 x y t3 t4 t5 t6 t7 Operation usage S Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> abs I O t5 = x t3 abs2 I O I I O t6 = t4 t5 I I I O I O >>3 I O >> I O I I O I I O Connectivity usage S7 t7 = ( t6, x ) Done = = t7 5/25/2 2
Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 2 Datapath Synthesis Variable Merging (Storage Sharing) Operation Merging (FU Sharing) Connection Merging (Bus Sharing) Register merging (RF sharing) Chaining and MultiCycling Data and Pipelining 5/25/2 22
Gain in register sharing Register sharing Grouping variables with nonoverlapping lifetimes Sharing reduces connectivity cost a c b d a, c b, d x y x, y Partial FSMD Datapath without register sharing Datapath with register sharing 5/25/2 23 General partitioning algorithm Compatibility graph Compatibility: Nonoverlapping in time Not using the same resource Noncompatible: Overlapping in time Using the same resource Priority Critical path Same source, same destination no Start Create compatibility graph Merge highest priority nodes Upgrade compatibility graph All nodes incompatible yes Stop 5/25/2 24 2
/ Variable Merging for SRA / / S a = In b = In2 Start (a) Initial compatibility graph / (b) Compatibility graph after merging t3, t5, and t6 S t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 (c) Compatibility graph after merging t, x, and t7 S7 t7 = ( t6, x ) Done = = t7 5/25/2 25 Variable Merging for SRA / / a t4 / t2 y t3 t5 t6 b t x t7 (a) Initial compatibility graph / (d) Compatibility graph after merging t2 and y (b) Compatibility graph after merging t3, t5, and t6 (e) Final compatibility graph R = [ a, t, x, t7 ] R2 = [ b, t2, y, t3, t5, t6 ] R3 = [ t4 ] (c) Compatibility graph after merging t, x, and t7 (f) Final register assignments 5/25/2 26 3
/ Variable Merging for SRA / / / (a) Initial compatibility graph (b) Compatibility graph after merging t3, t5, and t6 (c) Compatibility graph after merging t, x, and t7 (d) Compatibility graph after merging t2 and y R = [ a, t, x, t7 ] R2 = [ b, t2, y, t3, t5, t6 ] R3 = [ t4 ] (e) Final compatibility graph (f) Final register assignments 5/25/2 27 Datapath with Shared Registers Variables combined into registers One functional unit for each operation R R2 R3 a b >> >>3 5/25/2 28 4
Gain in Functional Unit Sharing Functional unit sharing Smaller number of FUs Larger connectivity cost Si x = a b a c b d Sj y = c d x / y Partial FSMD Nonshared design Shared design 5/25/2 29 Operation Merging for SRA / / / / / / 2/ / / / / 2/ / Initial compatibility graph / / / / / / 2/ 2/ Compatibility graph after merging of and 5/25/2 3 5
Operation Merging for SRA / / / / / / 2/ / / / / 2/ / Initial compatibility graph Compatibility graph after merging of, and / / / / / / 2/ 2/ Compatibility graph after merging of and Final graph partitions 5/25/2 3 Datapath with Shared Registers and FUs Variables combined into registers Operations combined into functional units R R2 R3 abs/ abs/// >> >>3 5/25/2 32 6
Connection usage for SRA S a = In b = In2 Start S S S7 A B C D E F G H I J K L M N Connection usage table Merge connections into buses S S7 t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = = t7 5/25/2 33 Connection Merging for SRA Combine connection not used at the same time Priority to same source, same destination Priority to imum groups M I K S S S7 A B C D E F G H I J K L M N Connection usage table J L N Compatibility graph for input buses Compatibility graph for output buses Bus assignment 5/25/2 34 7
Datapath with Shared Registers, FUs and Buses Minimal SRA architecture 3 registers 4 (2) functional units 4b buses 5/25/2 35 Register Merging into RFs Register merging: Port sharing Merge registers with nonoverlapping access times No of ports is equal to simultaneous read/write accesses Register assignment R R2 R R2 R3 S S S7 Register access table S S a = In b = In2 Start t = t2 = x = ( t, t2 ) y = ( t, t2 ) t3 = x >> 3 t4 = y >> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) R3 Compatibility graph S7 Done = = t7 5/25/2 36 8
Datapath with Shared RF RF imize connectivity cost by sharing ports In In2 R R3 R2 Bus Bus2 abs/ abs/// H >>3 >> Bus3 Bus4 5/25/2 37 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 38 9
Datapath with Chaining Chaining connects two or more FUs S a = In b = In2 Allows execution of two or more operation in a single clock cycle Improves performance at no cost S Start = t = t2 = x = ( t, t2 ) t3 = ( t, t2 )>>3 t4 = ( t, t2 )>> t5 = x t3 t6 = t4 t5 t7 = ( t6, x ) Done = In 2 = t7 5/25/2 39 Datapath with Chained and MultiCycled FUs S a = In b = In2 Start Multicycling allows use of slower FUs Multicycling allows faster clockcycle S t = t2 = x = ( t, t2 ) t3 = ( t, t2 )>>3 [t4]= ( t, t2 )>> t4 = [ ( t, t2 ) >>] Bus Bus 2 In In 2 R R2 R3 t5 = x t3 abs/ abs// t6 = t4 t5 Bus 3 >>3 >> t7 = In ( 2t6, x ) Bus 4 S7 Done = = t7 5/25/2 4 2
Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 4 Pipelining Functional Unit pipelining Two or more operation executing at the same time Datapath pipelining Two or more register transfers executing at the same time Pipelining Two or more instructions generated at the same time 5/25/2 42 2
Functional Unit Pipelining () Operation delay cut in half Shorter clock cycle Dependencies may delay some states Extra NO states reduce performance gain 5/25/2 43 Functional Unit Pipelining (2) Tig diagram with 4 additional NO states S S NO NO NO S7 NO S8 Read R a t t t7 Read R2 b t2 t2 t3 t5 t6 Read R3 t4 ALU stage ALU stage 2 Shifters >>3 >> Write R a t x t7 Write R2 b t2 t3 t5 t6 Write R3 t4 Write t7 5/25/2 44 22
Datapath Pipelining () Registertoregister delay cut in equal parts Much shorter clock cycle Dependencies may delay some states Extra NO states reduce performance gain In In2 R R2 R3 Bus Bus2 ALU Bus3 Bus4 >>3 >> 5/25/2 45 Datapath pipelining (2) S a = In b = In2 In In2 R R2 R3 S Start t = x = ( t, t2 ) t3 = ( t, t2 )>>3 t4 = ( t, t2 )>> t2 = t5 = x t3 t6 = t4 t5 Bus Bus2 ALU >>3 >> Bus3 Bus4 Tig diagram with additional NO clock cycles Cycles 2 3 4 5 6 7 8 9 2 3 4 5 6 7 8 Read R a t t x x t7 Read R2 b t2 t2 t3 t5 t6 Read R3 t4 ALUIn(L) a t t x t4 x ALUIn(R) b t2 t2 t3 t5 t6 ALU S7 t7 = ( t6, x ) S8 Done = In 2 = t7 Shifters >>3 >> Write R a t x t7 Write R2 b t2 t3 t5 t6 Write R3 t4 Write t7 5/25/2 46 23
Datapath and Pipelining () Fetch delay cut into several parts Shorter clock cycle Conditionals may delay some states Extra NO states reduce performance gain Inputs signals Data Inputs S PC CMem CWR Register RF Mem Bus a>b x = c d AG Offset Status Signals ALU / Bus2 SR Bus3 y = x ler puts Register Data puts Datapath 5/25/2 47 Data and Pipelining (2) Inputs signals Data Inputs 3 NO cycles for the branch 2 NO cycles for data dependence PC AG CMem Offset CWR Status Signals Register ALU RF Mem Bus Bus2 / SR Bus3 S a>b y = x x = c d ler puts Register Data puts Datapath Tig diagram with additional NO clock cycles Cycle Operation 2 3 4 5 6 7 8 9 Read PC 2 3 4 5 6 7 8 9 2 Read CWR S NO NO NO NO NO Read RF(L) a c x Read RF(R) b d Write ALUIn(L) a c x Write ALUIn(R) b d Write ALU cd x Write RF x y Write SR a>b Write PC 2 3 4/7 5 6 7 8 9 2 5/25/2 48 24
Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 49 Scheduling Scheduling assigns clock cycles to register transfers Nonconstrained scheduling ASAP scheduling ALAP scheduling Constrained scheduling Resource constrained (RC) scheduling Given resources, imize metrics (time, power, ) Time constrained (TC) scheduling Given time, imize resources (FUs, storage, connections) 5/25/2 5 25
C and CDFG for SRA Algorithm a=in b=in2 Start t= t2= x=(t,t2) y=(t,t2) t3=x>>3 t4=y>> t5=xt3 t6=t4t5 t4 t5 t7=(t6,x) Done= =t7 In In 2 a b Start a b >> >>3 Done C flowchart CDFG 5/25/2 5 ASAP and ALAP Scheduling In In 2 a b a b a b S Start a b >> >>3 >> >>3 >> >>3 S7 S8 Done ASAP schedule ALAP schedule 5/25/2 52 26
RC Scheduling Perfrom ASAP a b a b In In 2 a b Start S Perfrom ALAP Detere mobilities a b >> >>3 >> >>3 Create ready list Sort ready list by mobilities >> >>3 S7 Schedule ops from ready list Delete scheduled ops from ready list Add new ops to ready list Done S8 ASAP schedule ALAP schedule no Increment state index All ops scheduled? yes 5/25/2 53 RC Scheduling Perfrom ASAP a b a b a b Perfrom ALAP S Detere mobilities Create ready list >> >>3 >>3 Sort ready list by mobilities >> >>3 >>3 Schedule ops from ready list Delete scheduled ops from ready list Add new ops to ready list S7 S8 >> >> no Increment state index All ops scheduled? yes ASAP schedule ALAP schedule Ready list with mobilities (ALAP ASAP) RC schedule (for single FU and 2 shifters) 5/25/2 54 27
TC Scheduling Perform ASAP a b a b AU units Probability sum/state Shift units Perform ALAP S S Detere mobilities ranges.83 Create probability distribution graphs >> >>3 >>3.83.83.83.83 no All ops scheduled? yes >>.33 no Any gain? yes Schedule op with imum gain S7 S7.5 Schedule op with imum loss S8 ASAP ALAP Initial probability distribution graph 5/25/2 55 Distribution Graphs for TC scheduling Initial probability distribution graph Graph after,, and were scheduled 5/25/2 56 28
Distribution Graphs for TC scheduling AU units Probability sum/state Shift units S >> >>3 >> S7 Graph after,, and were scheduled Graph after,,,, >>3, and >> were scheduled 5/25/2 57 Distribution Graphs for TC scheduling AU units Probability sum/state Shift units AU units Probability sum/state Shift units S S >>3 >>3 >> >> S7 S7 Graph after,,,, >>3, and >> were scheduled Distribution graph for final schedule 5/25/2 58 29
TC Scheduling a b a b a b In In 2 S a b Start a b >> >>3 >>3 >>3 >> >> >> >>3 S7 Done S8 ASAP ALAP TC schedule 5/25/2 59 Hardware Synthesis Design flow RTL architecture Input specification Specification profiling Highlevel synthesis Chaining and multicycling Data and control pipelining Scheduling Component interfacing Conclusions 5/25/2 6 3
Interface Synthesis Combine process and channel codes HW and protocol clock cycles may differ Insert a businterface component Communication in three parts: Freely schedulable code Scheduled with process code Schedule constrained code MAC driver from library for selected bus interface Bus interface Implemented by bus interface component from library 5/25/2 6 Bus Interface ler () ler Datapath CMem signals Reg RF Mem offset Bus Bus 2 AG Status signals ALU / ready ack Cntrl Addr Data InData Input logic put logic signals Bus 3 Bus 4 Address INC Write Queue Read Queue MAC driver REQUEST GRANT CONTROL ADDRESS DATA 5/25/2 62 3
Bus Interface ler (2) ler Datapath CMem signals Reg RF Mem Bus offset Bus 2 AG Status signals ALU / Bus 3 Bus 4 ready ack Cntrl Addr Data InData signals put logic Write Read Queue Queue Input Address logic INC MAC driver REQUEST GRANT CONTROL ADDRESS DATA Bus protocol 5/25/2 63 Transducer/ Bridge Translates one protocol into another ler receives data with protocol and writes into queue ler2 reads from queue and sends data with protocol2 PE Bus Bus 2 Transducer PE2 Interrupt Ready Ack Interrupt2 Processor <clk> ler <clk> Ready2 Ack2 ler2 <clk2> Processor2 <clk2> Data Data2 Memory Queue <clk3> Memory2 5/25/2 64 32
Conclusion Synthesis techniques Variable Merging (Storage Sharing) Operation Merging (FU Sharing) Connection Merging (Bus Sharing) Architecture techniques Chaining and MultiCycling Data and Pipelining Forwarding and Caching Scheduling Metric constrained scheduling Interfacing Part of HW component Bus interface unit If too complex, use partial order 5/25/2 65 33