UTDSP: A VLIW DSP Processor in TSMC 0.35 CMOS



Similar documents
Architectures and Platforms

Instruction Set Design

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

ARM Microprocessor and ARM-Based Microcontrollers

VLIW Processors. VLIW Processors

A Lab Course on Computer Architecture

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

Solution: start more than one instruction in the same clock cycle CPI < 1 (or IPC > 1, Instructions per Cycle) Two approaches:

Design Cycle for Microprocessors

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

FLIX: Fast Relief for Performance-Hungry Embedded Applications

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

What is a System on a Chip?

Digitale Signalverarbeitung mit FPGA (DSF) Soft Core Prozessor NIOS II Stand Mai Jens Onno Krah

Fondamenti su strumenti di sviluppo per microcontrollori PIC

Agenda. Michele Taliercio, Il circuito Integrato, Novembre 2001

Software based Finite State Machine (FSM) with general purpose processors

Embedded System Hardware - Processing (Part II)

Rapid System Prototyping with FPGAs

Chapter 1 Computer System Overview


1. Introduction to Embedded System Design

7a. System-on-chip design and prototyping platforms

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

Introduction to the Latest Tensilica Baseband Solutions

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang

A case study of mobile SoC architecture design based on transaction-level modeling

PROBLEMS #20,R0,R1 #$3A,R2,R4

Central Processing Unit (CPU)

Von der Hardware zur Software in FPGAs mit Embedded Prozessoren. Alexander Hahn Senior Field Application Engineer Lattice Semiconductor

Open Architecture Design for GPS Applications Yves Théroux, BAE Systems Canada

Embedded Development Tools

Low Power AMD Athlon 64 and AMD Opteron Processors

Tensilica Software Development Toolkit (SDK)

Extending the Power of FPGAs. Salil Raje, Xilinx

Chapter 2 Heterogeneous Multicore Architecture

Introducción. Diseño de sistemas digitales.1

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

Testing of Digital System-on- Chip (SoC)

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Guidelines for Software Development Efficiency on the TMS320C6000 VelociTI Architecture

Introduction to Exploration and Optimization of Multiprocessor Embedded Architectures based on Networks On-Chip

EMBEDDED SYSTEM BASICS AND APPLICATION

VHDL DESIGN OF EDUCATIONAL, MODERN AND OPEN- ARCHITECTURE CPU

ELEC 5260/6260/6266 Embedded Computing Systems

Pipeline Hazards. Structure hazard Data hazard. ComputerArchitecture_PipelineHazard1

Developing Embedded Applications with ARM Cortex TM -M1 Processors in Actel IGLOO and Fusion FPGAs. White Paper

Mobile Processors: Future Trends

Digital Hardware Design Decisions and Trade-offs for Software Radio Systems

EE282 Computer Architecture and Organization Midterm Exam February 13, (Total Time = 120 minutes, Total Points = 100)

ARM Ltd 110 Fulbourn Road, Cambridge, CB1 9NJ, UK.

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Advanced Computer Architecture-CS501. Computer Systems Design and Architecture 2.1, 2.2, 3.2

WAR: Write After Read

CISC, RISC, and DSP Microprocessors

Reconfigurable Computing. Reconfigurable Architectures. Chapter 3.2

CSE 141L Computer Architecture Lab Fall Lecture 2

Qsys and IP Core Integration

IA-64 Application Developer s Architecture Guide

Five Families of ARM Processor IP

Instruction Set Architecture (ISA)

MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance

Implementation of Web-Server Using Altera DE2-70 FPGA Development Kit

From Bus and Crossbar to Network-On-Chip. Arteris S.A.

The AVR Microcontroller and C Compiler Co-Design Dr. Gaute Myklebust ATMEL Corporation ATMEL Development Center, Trondheim, Norway

Digital Signal Controller Based Automatic Transfer Switch

On-Chip Interconnection Networks Low-Power Interconnect

Best Practises for LabVIEW FPGA Design Flow. uk.ni.com ireland.ni.com

White Paper Streaming Multichannel Uncompressed Video in the Broadcast Environment

Architectural Level Power Consumption of Network on Chip. Presenter: YUAN Zheng

How To Understand The Design Of A Microprocessor

INSTRUCTION LEVEL PARALLELISM PART VII: REORDER BUFFER

System Software Integration: An Expansive View. Overview

SGRT: A Scalable Mobile GPU Architecture based on Ray Tracing

İSTANBUL AYDIN UNIVERSITY

Intel StrongARM SA-110 Microprocessor

ARM Webinar series. ARM Based SoC. Abey Thomas

Router Architectures

Keil C51 Cross Compiler

Chapter 13. PIC Family Microcontroller

TMS320C6000 Programmer s Guide

Computer Systems Design and Architecture by V. Heuring and H. Jordan

CPU Organization and Assembly Language

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

9/14/ :38

SDLC Controller. Documentation. Design File Formats. Verification

Computer Architecture Lecture 2: Instruction Set Principles (Appendix A) Chih Wei Liu 劉 志 尉 National Chiao Tung University

Chapter 2 Logic Gates and Introduction to Computer Architecture

NIOS II Based Embedded Web Server Development for Networking Applications

UNIVERSITY OF CALIFORNIA, DAVIS Department of Electrical and Computer Engineering. EEC180B Lab 7: MISP Processor Design Spring 1995

Serial port interface for microcontroller embedded into integrated power meter

Transcription:

UTDSP: A VLIW DSP Processor in TSMC 0.35 CMOS Sean Hsien-en Peng Supervisor: Prof. Paul Chow Computer Engineering Group University of Toronto Copyright@1999 by Sean Peng speng@eecg.toronto.edu

Motivation Low-Cost, Low-Power and High- Performance DSP Processors are needed for telecommunication and embedded systems VLIW architectures are ideal targets for HLL compilers to exploite parallelism Flexible for application-specific embedded systems ( Current VLIWs: TI TMS320C6x, and Philips R.E.A.L. DSP )

Limitations of Current VLIW DSPs in Cost-Sensitive systems Memory size is increased substantially due to the empty slots in long-instruction words 1. Unable to exploite enough parallelism in applications 2. Loop Unrolling Technique used in compilers Huge memory bandwidth for long-instruction fetching Will be the performance bottleneck when off - chip instruction memory is used

The New TI VelociTI Architecture in TMS320C6x

A Novel Instruction Packing and Decoding Method Based on a two-level horizontal microcode architecure Achieve better packing results while eliminating the need of using crossbar Reduce off-chip instruction memory bandwidth for cost-sensitive systems Patent has been filed for the memory packing design

The UTDSP Memory System

Reduce size of Decoder Memory

Denser packing using two clusters

Combine clusters and slot sharing to have a near-optimal result

The Packer and Software tools

Need complex data structures Design C++ template libraries to provide the container class for different objects List <Inst> A; List <Decoder> B; List <Token> C; Design associated class methods and use operator overloading to ease the packing algorithm development ListA = ListB.merge(ListC); if ( InstA > InstB )...

Better VLIW solution than TI s new VelociTI architecture Similar instruction compaction rate (65%) Eliminate the necessity of using crossbar Can use inexpensive off-chip instruction memory without suffering the bandwidth problem that TI has. Minimize the size of on-chip memory 90% of execution time is spent in 10% of code => only need to store DSP kernel code on chip

The UTDSP Architecture

Instruction Set Highly orthogonal, RISC-like instructions add r1,r2,r3, mult r1,r2,r3, ld (a1), r2 st r4,(a5) JSR, Jmp, BEQ, BLE. Specialized DSP instructions multiply-accumulator, modulo addressing, Zero-Overhead Looping Instruction Achieve optimal performance in DSP kernels Can handle 8-level nested looping, interruptable, good for real-time application

Data Hazards and Data Forward Traditional RISC architectures stall pipelines to solve RAW data hazards

Combine EX/MEM to reduce RAWs and Bypassing Logic Use register indirect addressing mode instead of displacement mode (ld (A2+4), R5) Resolve ALL RAWs now, no need to stall Reduce bypassing logic by 50% Pipeline latency is NOT changed (parallel)

Zero-overhead Looping Inst. reduces branch penalty and size of instruction memory TI suggests assembly programmers using Loop Unrolling to optimize DSP kernel code 1. Difficult 2. Increase code size dramatically UTDSP has a zero-overhead DO instruction

Design Challenge: The PC Unit: Handles 8- level nested DO loop. Allows JSR, Branch, and Interrupts in inner loop

The Forwarding Logic:

VLSI Implementation Strategy P&R a subblock and instantiate it at toplevel methodology doesn t work in processor design Various size of blocks => Need grouping/ merging features in floorplan tools Memory blocks => Need hierarchical P&R Huge interconnection bandwidth between blocks (bypassing) => Need Global Pin Optimization (GPO) 0.35u: Wiring delay dominates gate delay => Need GPO and Area-based router

Benchmark Results UTDSP Quick Facts: Die size: 7.2 x 7.2 mm 170,000 gates 20KByte On-chip SRAM Texas Instruments TMS320C62 Philips R.E.A.L DSP (Core) The UTDSP Clock 200 MHz 70 MHz 70 MHz Pin count > 400 pins 105 pins Func. Units 8 10 7 FIR 4, N_coeff, M_output smaples Mx( N+8)/2 + 6 cycles ~Mx(N+7)/2 + 8 clcles M x (N+6) /2 + 6 cycles

S.O.C Solution for Low-Cost Low- Power Telecommunication and Consumer applications The UTDSP core implemented in synthesizable VHDL can be easily integrated with other system blocks as an IP core Provide a GUI-based architecture simulator to help desingers to evaluate/modify the UTDSP core Provide an interactive assembly debugger comparable to any commercial counterparts Single-step trace, set break-point, memory location probing.

Conclusion: What has been done RTL level VHDL for UTDSP (10,000 lines) UTDSP Long-instruction packer and assembler ( 5,000 line C++ with template) Hierarchical P&R flow using PDP + Silicon Ensemble + Cadence 1999a GUI-based assembly debugger and simulator (6,000 lines in Java ) Potential commercialization will benefit the Canadian industry

The Assembly Debugger

Before Logic Merging: (Cadence PDP3.4C)

After Logic Merging: (Cadence PDP3.4C )

Global Pin Optimization (Cadence PDP 3.4C )

Hierarchical P&R using PDP3.4C, Silicon Ensemble 5.2

Finalized Grouping and Floorplan

Top-level Routing (Silicon Ensemble 5.2 )

Java-HDL System ( Experimental, Not shown in TEXPO99 ) Use Java language to describe digital design Embedd timing info into wires, and ports so that wire delay can be estimated in early design stage. Schematic is automatically generated and is used for visualized simulation => Important for DSP ASIC or pipelined design. You can observe any internal wires without using waveform simulator Synthesizable VHDL is automatically created

Java-HDL code example: public class Register... { Inport p1; Inport p2;... Wire w5 = new Wire(4); // a 4-bit wide bus Wire w6 = new Wire(1); // one -bit.. if ( rst.value == 1 ) w8.drive(0); else if ( event1(clk) ) w8.drive(d);.

Java-HDL Simulation System