New Strategies for System Level Design Daniel Gajski Center for Embedded Computer Systems (CECS) University of California, Irvine gajski@uci.edu
Overview Introduction Issues Models Platforms Tools Benefits Conclusion
Closing the System Gap Real gap: behavior and structure (semantics and syntax)
Simulation Based Methodology Ambiguous semantics of hardware/system level languages Simuletable but not synthesizable or verifiable
In Search of a Solution Algebra: < objects, operations> Arithmetic algebra allows creation of expressions and equivalences
Model Algebra Model algebra: <objects, compositions> Model algebra allows creation of models and model equivalences
Specify-Explore-Refine Methodology Design decisions Model refinement Replacement or re-composition FPGA board
How many models? Minimal set for any methodology (3 is enough) System specification model (application designers) Transaction-level model (system designers) Pin&Cycle accurate model (implementation designers)
Three Models (with Respect to OSI) Pin / Cycle Accurate Model Transaction Level Model Specification Model 7. Application 6. Presentation 5. Session 4. Transport 3. Network 2b. Link + Stream 2a. Media Access Ctrl 2a. Protocol 1. Physical Spec TLM 7. Application 6. Presentation 5. Session 4. Transport 3. Network 2b. Link + Stream 2a. Media Access Ctrl 2a. Protocol 1. Physical Address lines Data lines Control lines P/CAM Source: G Schirner
System Specification CPU Mem Computation Behaviors (in C) B1 B2 v1 Communication Channels (in C) Variables (in C) Arbiter C1 C2 Bridge B3 B4 HW System Definition = (Partial) Platform + (Partial) Spec IP
Transaction-Level Model (TLM) CPU B1 B2 Mem Drivers OS HAL CPU Bus IP Bus B3 B4 HW IP
Pin/Cycle Accurate Model (P/CAM) CPU Program EXE IC Mem Arbiter RTOS HAL Bridge HW Source: D. Gajski et al. P/CAM is downloaded automatically for fast prototyping with FPGAs or ASIC design IP
How many components? Minimal set for any design (4 is enough?) Processing element (PE) Memory Transducer / Bridge Arbiter
General System Model Arbiter 1 Arbiter 2 PE 1.1 Interrupt1.1 Transducer1-2 Interrupt2.1 Interrupt2.2 PE 2.1 (Master) PE 2.2 (Slave) Arbiter 3 PE 1.2 Interrupt3.1 Interrupt3.2 Transducer2-3 PE 3.1 Memory 1 Memory 3 Bus1 Bus2 Bus3
Transducer Model PE1 Addr bus1 Data bus1 Transducer Addr bus2 Data bus2 PE2 Interrupt1 Ready1 Ack_ready1 Interrupt2 Processor1 <clk1> FSMD1 <clk1> Ready2 Ack_ready2 FSMD2 <clk2> Processor2 <clk2> Data1 Data2 Memory1 Queue <clk3> Memory2 Source: H. Cho
Processing Element: NISC technology Direct compilation of C to HW (fastest possible execution) Statically and dynamically reconfigurable (anytime, anywhere) Designed for manufacturability (solving timing closure) const RF / Scratch pad PC CMem CW B1 B2 AG offset status Status ALU MUL Memory address B3 Programmable controller Datapath Multi-cycle units Pipelined units Controller pipelining Datapath pipelining Data forwarding
General System Design Environment Model A Estimation tool Refinement tool GUI Synthesis tool Component library Transforms: t1 t2... tn Verify tool ti Simulation tool Model B
How many tools? Minimal set for any methodology (2 is enough?) Front-End (for application developers) Input: C, C++, Mathlab, UML, Output: TLM Back-End (for SW/HW system designers) Input : TLM Output: Pin/Cycle accurate Verilog/VHDL
ES Environment Decision User Interface (DUI) Validation User Interface (VUI) Create Select Partition Map Compile Replace ESE Front End System Capture + Platform Development ESE Back End SW Development + HW Development Compiler Debugger Stimulate Verify TIMED CYCLE ACCURATE Compile Check Simulate Verify Application Tools : Compilers/Debuggers Commercial Tools : FPGA, ASIC
Benefit: Spec-to-Prototype in 1 Week
Does it work? Intuitively it does Well defined models, rules, transformations, refinements Worked in the past: layout, logic, RTL? System level complexity simplified Proof of concept demonstrated Embedded System Environment (ESE) Automatic model generation Model synthesis and verification Universal IP technology (NISC) Productivity gains greater then 1000 Benefits Large productivity gains Easy design management Easy derivatives Shorter TTM
Design flow with NISC technology for(int i=0; i<8; i++) for(int i=0; i<8; i++) for(int j=0; j<8; j++){ for(int j=0; j<8; j++){ sum=0; sum=0; for(int k=0; k<8; k++) for(int k=0; k<8; k++) sum = sum + A[i][k] B[k][j]; sum = sum + A[i][k] B[k][j]; C[i][j] = sum; C[i][j] = sum; } } Code Refinement for(int i=0; i<8; i++) for(int i=0; i<8; i++) for(int j=0; j<8; j++){ for(int j=0; j<8; j++){ i8 = i 8; i8 = i 8; sum = *(A + i8) *(B + j); sum = *(A i8) *(B + j); sum += *(A + i8 + 1) *(B + 8 + j); sum += *(A + i8 + 1) *(B + 8 + j);...... C[i][j] = sum; C[i][j] = sum; } } Application NISC Compiler Results Application NISC Compiler Results PC NISC CMem AG offset status CW status address const B1 B2 B3 ALU MUL RF Memory NISC Refinement PC NISC CMem CW const offset AG status ALU DR RF OR AR Mem al bl Mul Sum P Add Iterative design & refinement Source: M. Reshadi
DCT with NISC technology Execution Time Power Energy Area 1.4 1.2 1 0.8 0.6 0.4 0.2 0 MIPS NMIPS CDCT1 CDCT2 CDCT3 CDCT4 CDCT5 CDCT6 CDCT7 Manual NMIPS vs. MIPS CDCT3 vs. NMIPS CDCT7 vs. NMIPS Performance Power saving Energy saving Area reduction 1.25X NA NA NA 5.3X 2.1X 11.6X 2.5X 10X 1.3X 12.8X 3X CDCT7 vs. Manual Source: B. Gorjiara 0.83X 1.3X 0 2.1X
MP3 on Xillinx with ESE % chip utilization 100 90 80 70 60 50 40 30 20 10 0 SW+0 SW+1 SW+2 SW+4 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 seconds %Slices %BRAMs Exec. time Design Points Area % of FPGA slices and BRAMS Performance Time to decode 1 frame of MP3 data Source: S. Abdi
MP3 on Xillinx with ESE using NISC % chip utilization 100 90 80 70 60 50 40 30 20 10 0 SW+0 SW+1 SW+2 SW+4 NISC+0 0.35 0.3 0.25 0.2 0.15 0.1 0.05 0 seconds %Slices %BRAMs Exec. time Design Points Area NISC uses fewer FPGA slices and more BRAMs than manual HW Performance NISC comparable to manual HW and much faster than SW
Development Manual Development Time with Time ESE 70 60 person-days 50 40 30 20 ESE SW+0 SW+1 SW+2 SW+4 10 0 Spec. TLM RTL Board models Model Development time Includes time for C, TLM and RTL Verilog coding and debugging ESE drastically cuts RTL and Board development time Source: S. Abdi
Development Time with ESE 70 60 person-days 50 40 30 20 ESE SW+0 SW+1 SW+2 SW+4 10 0 Spec. TLM RTL Board models ESE drastically cuts RTL and Board development time Models can be developed at Spec and TL Synthesizable RTL models are generated automatically by ESE Source: S. Abdi
Validation Time Time with ESE hours 18.06 hrs 17.71 hrs 17.56 hrs 15.93 hrs X seconds 10 9 8 7 6 5 4 3 2 1 0 ESE Spec. TLM RTL Board models SW+0 SW+1 SW+2 SW+4 Simulation time measured on 3.3 GHz processor Emulation time measured on board with Timer ESE cuts validation time from hours to seconds Source: S. Abdi
Validation Time with ESE 10 9 8 seconds 7 6 5 4 3 ESE SW+0 SW+1 SW+2 SW+4 2 1 0 Spec. TLM RTL Board models ESE cuts validation time from hours to seconds No need to verify RTL models Designers can perform high speed validation at TLM and board Source: S. Abdi
Conclusions Extreme makeover is necessary for a new paradigm, where SW = HW = SOC = Embedded Systems Simulation based flow is not acceptable Design methodology is based on scientific principles Model algebra is enabling technology for System design, modeling and simulation System synthesis, verification, and test What is next? Change of mind Application oriented EDA Looking for early adapters
Thank You Daniel Gajski Center for Embedded Computer Systems (CECS) www.cecs.uci.edu