DSP Benchmark Results for the Latest Processors (Workshop 427)

Similar documents
CISC, RISC, and DSP Microprocessors

BDTI Solution Certification TM : Benchmarking H.264 Video Decoder Hardware/Software Solutions

Which ARM Cortex Core Is Right for Your Application: A, R or M?

FPGAs for High-Performance DSP Applications

İSTANBUL AYDIN UNIVERSITY

System Considerations

ADVANCED PROCESSOR ARCHITECTURES AND MEMORY ORGANISATION Lesson-12: ARM

Digital Hardware Design Decisions and Trade-offs for Software Radio Systems

Architectures and Platforms

what operations can it perform? how does it perform them? on what kind of data? where are instructions and data stored?

ARM Microprocessor and ARM-Based Microcontrollers

Choosing a Computer for Running SLX, P3D, and P5

7a. System-on-chip design and prototyping platforms

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

The new 32-bit MSP432 MCU platform from Texas

ELEC 5260/6260/6266 Embedded Computing Systems

MICROPROCESSOR REPORT. THE INSIDER S GUIDE TO MICROPROCESSOR HARDWARE

FLIX: Fast Relief for Performance-Hungry Embedded Applications

Introduction to RISC Processor. ni logic Pvt. Ltd., Pune

Am186ER/Am188ER AMD Continues 16-bit Innovation

High-Level Synthesis Tools for Xilinx FPGAs

A Survey on ARM Cortex A Processors. Wei Wang Tanima Dey

A New, High-Performance, Low-Power, Floating-Point Embedded Processor for Scientific Computing and DSP Applications

Mobile Processors: Future Trends

Embedded System Hardware - Processing (Part II)

EMBEDDED SYSTEM BASICS AND APPLICATION

MPSoC Designs: Driving Memory and Storage Management IP to Critical Importance

Software Pipelining. for (i=1, i<100, i++) { x := A[i]; x := x+1; A[i] := x

on an system with an infinite number of processors. Calculate the speedup of

Digital signal processor fundamentals and system design

BUILD VERSUS BUY. Understanding the Total Cost of Embedded Design.

Instruction Set Design

Guidelines for Software Development Efficiency on the TMS320C6000 VelociTI Architecture

Introduction to the Latest Tensilica Baseband Solutions

Codesign: The World Of Practice

Five Families of ARM Processor IP

Achieving Nanosecond Latency Between Applications with IPC Shared Memory Messaging

Processor Architectures

Developing Embedded Applications with ARM Cortex TM -M1 Processors in Actel IGLOO and Fusion FPGAs. White Paper

GPUs for Scientific Computing

EE361: Digital Computer Organization Course Syllabus

12. Introduction to Virtual Machines

Chapter 13. PIC Family Microcontroller

System Design Issues in Embedded Processing

High-speed image processing algorithms using MMX hardware

Quiz for Chapter 1 Computer Abstractions and Technology 3.10

Multi-core architectures. Jernej Barbic , Spring 2007 May 3, 2007

An examination of the dual-core capability of the new HP xw4300 Workstation

Tensilica Software Development Toolkit (SDK)

FPGA-based Multithreading for In-Memory Hash Joins

x64 Servers: Do you want 64 or 32 bit apps with that server?

Intel Labs at ISSCC Copyright Intel Corporation 2012

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Computer Performance. Topic 3. Contents. Prerequisite knowledge Before studying this topic you should be able to:

Unit A451: Computer systems and programming. Section 2: Computing Hardware 1/5: Central Processing Unit

Intel Processors in Industrial Control and Automation Applications Top-to-bottom processing solutions from the enterprise to the factory floor

evm Virtualization Platform for Windows

RISC AND CISC. Computer Architecture. Farhat Masood BE Electrical (NUST) COLLEGE OF ELECTRICAL AND MECHANICAL ENGINEERING

Static Scheduling. option #1: dynamic scheduling (by the hardware) option #2: static scheduling (by the compiler) ECE 252 / CPS 220 Lecture Notes

PRINT SERVER IMPLEMENTATION ALTERNATIVES. An XCD White Paper

Advanced Core Operating System (ACOS): Experience the Performance

EE482: Advanced Computer Organization Lecture #11 Processor Architecture Stanford University Wednesday, 31 May ILP Execution

Low Power AMD Athlon 64 and AMD Opteron Processors

Implementing an In-Service, Non- Intrusive Measurement Device in Telecommunication Networks Using the TMS320C31

Outline. Introduction. Multiprocessor Systems on Chip. A MPSoC Example: Nexperia DVP. A New Paradigm: Network on Chip

High Performance Computing in the Multi-core Area

What is a System on a Chip?

Design Cycle for Microprocessors

ARM Cortex-A9 MPCore Multicore Processor Hierarchical Implementation with IC Compiler

LSN 2 Computer Processors

Lecture 11: Multi-Core and GPU. Multithreading. Integration of multiple processor cores on a single chip.

Introduction to GPU Programming Languages

SOC architecture and design

FPGA Prototyping Primer

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Pipelining Review and Its Limitations

Generations of the computer. processors.

Introduction to Digital System Design

CS 147: Computer Systems Performance Analysis

Introduction to Microprocessors

Lesson 7: SYSTEM-ON. SoC) AND USE OF VLSI CIRCUIT DESIGN TECHNOLOGY. Chapter-1L07: "Embedded Systems - ", Raj Kamal, Publs.: McGraw-Hill Education

Virtual Platforms Addressing challenges in telecom product development

Lecture 3: Modern GPUs A Hardware Perspective Mohamed Zahran (aka Z) mzahran@cs.nyu.edu

ARM Architecture. ARM history. Why ARM? ARM Ltd developed by Acorn computers. Computer Organization and Assembly Languages Yung-Yu Chuang

GSM/GPRS PHYSICAL LAYER ON SANDBLASTER DSP

Chapter 7: Distributed Systems: Warehouse-Scale Computing. Fall 2011 Jussi Kangasharju

Evaluation Report: Accelerating SQL Server Database Performance with the Lenovo Storage S3200 SAN Array

Virtualization. Jukka K. Nurminen

Solutions for Increasing the Number of PC Parallel Port Control and Selecting Lines

BEAGLEBONE BLACK ARCHITECTURE MADELEINE DAIGNEAU MICHELLE ADVENA

Memory Configuration for Intel Xeon 5500 Series Branded Servers & Workstations

Enhance Service Delivery and Accelerate Financial Applications with Consolidated Market Data

Computer Architecture Lecture 3: ISA Tradeoffs. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 1/18/2013

Current and Future Trends in Medical Electronics


Transcription:

Optimized DSP Software Independent DSP Analysis DSP Benchmark Results for the Latest Processors (Workshop 427) Berkeley Design Technology, Inc. 2107 Dwight Way, Second Floor Berkeley, California 94704 USA +1 (510) 665-1600 info@bdti.com http://www.bdti.com Outline Why are benchmarks important? Comparing benchmarking approaches Benchmark results for the latest processors C64xx, SC140, C55xx, BF53x, OMAP, PXA2xx The effect of architecture on benchmark results Architectural trends and trade-offs Emerging challenges in benchmarking Conclusions 2 Embedded Systems Conference Page 1

Why Benchmark? Assess and compare key processor metrics accurately DSP speed Memory efficiency Energy efficiency Cost-performance Compare performance across a wide range of architectures (conventional, VLIW, SIMD, DSPenhanced GPP, etc.) difficult without benchmarks Simple metrics (MIPS, MACS) don t cut it 3 How to Benchmark? A few candidate approaches: Simplified metrics E.g., MIPS (Millions of Instructions Per Second), MOPS, MMACS Full DSP applications E.g., v.90 modem DSP algorithm kernel benchmarks E.g., FIR filter, FFT 4 Embedded Systems Conference Page 2

What's Wrong with MIPS? MIPS and MFLOPS (Millions of Floating-Point Operations per Second) are frequently used as shorthand for processor speed. But are they really meaningful? Two instructions from different processors: DSP16410 A0=A0+P0+P1 P0=Xh*Yh P1=Xl*Yl Y=*R0++ X=*PT0++ TMS320C6414 ADD A0,A3,A0 5 Full-Application Benchmarks This approach has pros and cons Applications tend to be ill-defined Hand-optimization needed Costly, time-consuming to implement Measures programmer as much as processor Measures system, not just processor Sometimes this is an advantage Results useful only for specific app (or similar apps) But if results are available for your app, this not a disadvantage For processors, similar results via simpler approach But this is not true for all DSP implementation technologies 6 Embedded Systems Conference Page 3

Algorithm Kernel Benchmarks The BDTI Benchmarks are based on DSP algorithm kernels The most computationally intensive portions of DSP applications Examples include FFTs, IIR filters, and Viterbi decoders Denorm 11% Application Profile Other 25% Window 25% IDCT 39% Benchmark results are used with application profiling to predict overall performance 7 Algorithm Kernel Benchmarks Advantages: Relevant; chosen by analysis of real DSP apps Kernels are short, allowing Functionality to be precisely specified Benchmarks to be implemented, optimized in a reasonable amount of time Disadvantages: Not practical to implement all possible algorithms Don t reflect application-level optimizations and trade-offs For some implementation technologies, this is a problem Ignores system-level considerations This, too, is a problem for some implementation technologies 8 Embedded Systems Conference Page 4

Vendor Benchmarks Many processor vendors provide benchmark results, but Benchmarks not standardized across vendors Results not independently verified Clock speeds often projected Results are often misused, for example, Comparing results for functionally different benchmarks Comparing fastest chip to slowest from another vendor Comparing vaporware to real silicon Presenting cycle counts as a proxy for performance Cherry-picking benchmark results 9 Benchmark Results for the Latest Processors High-performance processors Texas Instruments TMS320C64xx StarCore SC140 Low-power processors Texas Instruments TMS320C55xx Analog Devices Blackfin (ADSP-BF53x) General-purpose/DSP processors Intel PXA2xx Texas Instruments OMAP5910 10 Embedded Systems Conference Page 5

Overall DSP Speed: BDTImark2000 6480 Higher is faster 3360 3430 1460 930 TI C5502 (300 MHz) ADI BF53x (600 MHz) TI C6414 (720 MHz) StarCore SC140 (300 MHz) Intel PXA2xx (400 MHz) (See www.bdti.com for more scores) 11 Relative Results on Benchmarks The BDTImark2000 shows overall speed results across 12 DSP kernel benchmarks Relative results on specific benchmarks can vary widely 6480 BDTImark2000 Higher is Faster 3430 LMS Adaptive FIR Filter (execution time, in µs) Lower is Faster 0.05 0.06 C6414 (720 MHz) SC140 (300 MHz) C6414 (720 MHz) SC140 (300 MHz) 12 Embedded Systems Conference Page 6

What Factors Affect DSP Speed? Processors DSP speeds are affected by: Parallelism How many parallel operations can be executed per cycle Instruction set Suitability for the task at hand Clock speed Data types Data bandwidth Pipeline depth Instruction latencies Support for DSP-oriented features, e.g., DSP addressing modes Zero-overhead looping Saturation, scaling, rounding 13 Memory Use: BDTI Control Benchmark 256 Lower is better Bytes 146 140 144 140 TI C55xx (8/16/32/48) ADI BF53x (16/32/64) TI C64xx (32) StarCore SC140 (16/32) Intel PXA2xx (16/32) 14 Embedded Systems Conference Page 7

What Factors Affect Memory Use? Processors memory usage affected by: Instruction set Wider instructions take more memory Mixed-width instruction sets becoming popular Use short, simple instructions for simple tasks Use longer instructions for more complex tasks Suitability of instruction set for task at hand Architecture VLIW, SIMD, and deep pipelines all may encourage (or require) optimizations that increase memory use to obtain speed-optimized code Compiler quality (for compiled code) 15 Energy Efficiency: BDTImark2000/mW 11.8 16.9 16.1 13.7 Higher is better 2.6 (estimated) TI C55xx 300 MHz, 1.26V ADI BF53x 600 MHz, 1.2V TI C64xx 300 MHz, 1.0V Motorola MSC8101 (SC140) 300 MHz, 1.5V Intel PXA2xx 200 MHz, 1.0V 16 Embedded Systems Conference Page 8

What Factors Affect Energy Efficiency? Processors energy consumption affected by: Hardware implementation Fabrication process, voltage, circuit design, logic design Memory usage Match between instruction set and task at hand Compiler quality (for compiled code) 17 Cost-Perf: BDTImark2000/$ 375.9 Higher is better 146.2 98.3 29 25.6 TI C55xx 300 MHz, $9.95 ADI BF53x 400 MHz, $5.95 TI C64xx 500 MHz, $44.95 Motorola MSC8101 (SC140) 300 MHz, $118.10 Intel PXA2xx 300 MHz, $27.20 18 Embedded Systems Conference Page 9

What Factors Affect Cost-Perf? Speed Chip cost, which is affected by: Die size Fabrication process Size of on-chip memory Influenced by processor s memory usage On-chip peripherals Good cost-performance results don t necessarily mean chip is suitable for apps with severe cost constraints OEMs don t want to pay for more performance than is needed 19 Architectural Trends VLIW (multi-issue) to increase performance SIMD to increase performance Simplified instruction sets, architectures to increase clock speeds, compilability Mixed-width instruction sets to reduce memory usage Deeper pipelines to enable higher clock speeds DSP-enhanced general-purpose processors (GPPs) 20 Embedded Systems Conference Page 10

Architectural Trends: The Down Side VLIW (multi-issue), SIMD, and deep pipelines can increase Memory use Energy consumption Code-generation complexity Simple instruction sets often increase memory usage More instructions are needed to accomplish a given task Sometimes a processor s legacy constraints are overriding Each processor makes different tradeoffs, depending on its target application top speed is often not the goal! 21 Texas Instruments TMS320C64xx Targets high-performance DSP applications. Goals: Fast Compilable Compatible with earlier C62xx Sacrifices: High memory consumption High chip price for fastest ($199, qty 10K) Difficult to program in assembly language 22 Embedded Systems Conference Page 11

C64xx is Fast Because Highly parallel architecture Based on C62xx: VLIW, up to 8 instructions/cycle Adds SIMD to C62xx Faster, but still compatible Four 16-bit MACs/cycle More powerful instructions BDTImark2000 Higher is Faster Very high clock speed (720 MHz) Deep pipe (11 stages) Most instructions are simple 32-bit, mostly RISC-like Also increases compilability Caching C62xx (300 MHz) 970 6480 3360 3430 930 'C55xx 'BF53x 'C64xx SC140 PXA2xx 23 But C64xx Makes Sacrifices High memory use Wide, uniform-width (32-bit) instructions Mostly simple, RISC-like instructions VLIW, SIMD, deep pipeline Memory use increases chip cost Deep pipe also complicates code generation Multi-cycle latencies Memory Use on Control Benchmark Lower is Better 146 140 256 144 140 'C55xx 'BF53x 'C64xx SC140 PXA2xx 24 Embedded Systems Conference Page 12

StarCore SC140 Targets high-performance DSP applications. Goals: Fast Low memory usage Easy to program in assembly Compilable Low energy consumption Sacrifices: Not as fast as C64xx High chip price (300 MHz MSC8101 is $118, qty 10K) No compatibility with previous architectures 25 SC140 Fast, But Not Fastest Like C64xx, highly parallel architecture VLIW, up to 6 instructions/cycle Like C64xx, four 16-bit MACs/cycle Limited SIMD Mid-range clock speed (300 MHz) About half as high as C64xx Shallow pipe (5 stages) Not deep enough for ultrahigh clock But uniform single-cycle latencies ease code generation, decrease memory use Simple instruction set 1460 BDTImark2000 Higher is Faster 3360 6480 3430 930 'C55xx 'BF53x 'C64xx SC140 PXA2xx 26 Embedded Systems Conference Page 13

SC140 Has Surprisingly Low Memory Use Mixed-width instruction set 16-bit instructions with optional 16-bit prefixes Surprising, since it s VLIW and uses RISC-like instructions Memory Use on Control Benchmark Lower is Better 256 146 140 144 140 'C55xx 'BF53x 'C64xx SC140 PXA2xx 27 Texas Instruments TMS320C55xx Targets low-power, cost-sensitive DSP applications. Goals: Low energy consumption Low memory use Low chip cost Midrange speed Partly compatible with earlier C54xx architecture Sacrifices: Not nearly as fast as high-end DSPs Not very compilable Difficult to program in assembly 28 Embedded Systems Conference Page 14

C55xx Focuses on Power, Cost, Compatibility Not Speed Moderate parallelism Adds limited (2-issue) VLIW capabilities to boost speed while maintaining partial compatibility with C54xx Two MACs/cycle Convoluted architecture (like C54xx) Medium clock speed (300 MHz) 7-stage pipeline Single-cycle latencies Moderately priced ($8-20 qty 10K) C54xx (160 MHz) 1460 3360 6480 BDTImark2000 Higher is Faster 3430 930 'C55xx 'BF53x 'C64xx SC140 PXA2xx 29 Analog Devices ADSP-BF53x Targets low-power, cost-sensitive DSP applications. Goals: Low energy consumption Low memory use Low chip cost Midrange speed Compilable Sacrifices: Not nearly as fast as high-end DSPs No compatibility with previous architectures 30 Embedded Systems Conference Page 15

ADSP-BF53x Balances Power, Cost, Speed Moderate parallelism 3-issue VLIW Two MACs per cycle Somewhat more parallelism than C55xx; not nearly as much as SC140 or C64xx Not constrained by legacy architecture High clock speed (600 MHz) 10-stage pipeline Single-cycle latencies Good energy efficiency Moderately priced 1460 ($6-35 qty 10K) 3360 6480 BDTImark2000 Higher is Faster 3430 930 'C55xx 'BF53x 'C64xx SC140 PXA2xx 31 BF53x and C55xx Both Have Low Memory Usage Both use mixed-width instruction sets BF53x uses 16/32/64-bit instructions C55xx uses instructions ranging from 8-48 bits BF53x instructions (and architecture) are fairly simple BF53x is easy to program in assembly, good compiler target C55xx inherits C54xx instruction set Not as easy to program in assembly as BF53x, but familiar to C54xx programmers Not a good compiler target Results for compiled code would likely favor BF53x 146 140 Memory Use on Control Benchmark Lower is Better 256 144 140 'C55xx 'BF53x 'C64xx SC140 PXA2xx 32 Embedded Systems Conference Page 16

BF53x vs. C55xx Energy Efficiency BF53x supports multiple voltages Good energy efficiency at top speed of 600 MHz/1.2V Energy results at other speed/voltage combos will vary C55xx supports one voltage BDTImark2000/mW Higher is Better 16.9 11.8 'C55xx 300 MHz 1.26V 'BF53x 600 MHz 1.2V 33 Intel PXA2xx (XScale) Targets low-power, cost-sensitive DSP applications where general-purpose processing features or software are needed. Goals: Low memory use Low chip cost Midrange speed Compatible with earlier ARM architectures Support for operating systems, compilers Sacrifices: Not very efficient for DSP Poor energy and cost efficiency 34 Embedded Systems Conference Page 17

PXA2xx Speed Comes from Clock Relatively low level of parallelism Single-issue, mostly general-purpose 32-bit architecture Good compiler target ARM architecture augmented with limited SIMD Two MACs/cycle Few additional DSP-specific features Moderately high clock speed (400 MHz) 7-stage pipeline Multi-cycle latencies Moderately priced ($27-42 qty 10K) 1460 3360 BDTImark2000 Higher is Faster 6480 3430 930 'C55xx 'BF53x 'C64xx SC140 PXA2xx 35 PXA2xx is Efficient in Memory Not surprising; general-purpose architecture is a good match for control code Mixed-width (16/32) instruction set Probably would look even better if benchmark showed compiled code results 256 Memory Use on Control Benchmark Lower is Better 146 146 144 140 'C55xx 'BF53x 'C64xx SC140 PXA2xx 36 Embedded Systems Conference Page 18

PXA2xx Can t Compete With DSPs on Energy, Cost Not very efficient on DSP algorithms High clock rate helps boost speed, but doesn t help with energy or chip cost This is often a drawback of GPPs for DSP BDTImark2000/mW Higher is Better 16.9 16.1 13.7 11.8 BDTImark2000/$ Higher is Better 376 2.6 (est.) 146 98 29 26 'C55xx 'BF53x 'C64xx SC140 PXA2xx 'C55xx 'BF53x 'C64xx SC140 PXA2xx 37 TI OMAP ( C55xx plus ARM9) Targets low-power, cost-sensitive DSP applications where general-purpose processing features or software are needed. Goals: Low energy consumption Low memory use Low chip cost Midrange speed Support for operating systems, compilers Compatibility with C54xx, ARM Sacrifices: C55xx not very compilable Dual-core approach complicates system development 38 Embedded Systems Conference Page 19

OMAP BDTImark2000 Score?? 730 OMAP5910 BDTImark2000 score could range from 186 (ARM only) to 916 (ARM plus C55xx) depending on app partitioning 730(estimated) 930 186 C55xx (150 MHz) ARM9 (150 MHz) (simulated) OMAP5910 (150 MHz) PXA2xx (400 MHz) 39 Emerging Benchmarking Challenges New technologies create benchmarking challenges Multi-core devices DSP-enhanced FPGAs Application-specific processors Customizable processors Reconfigurable processors Evolving applications and tools also lead to new challenges Increasing reliance on C compilers 40 Embedded Systems Conference Page 20

Application Benchmarking For technologies not well served by kernel benchmarks DSP-enhanced FPGAs Application-specific processors Limited applicability Practicality concerns can be partly addressed by Using off-the-shelf implementations where available, or Using simplified applications E.g., BDTI s OFDM Benchmark simplified telecom receiver IQ FIR Filter FFT Slicer Viterbi 41 Compiler Benchmarking Better compilers, more compilable architectures encourage migration to C for DSP Processor selection may hinge on compiler quality But it is difficult to assess how efficient a compiler is for DSP Or to compare two compilers for the same processor Compiler benchmarking sounds simple, but raises complex questions Benchmark the compiler or the compiler+processor? Allow intrinsics? Allow C source code optimizations? Allow limited assembly tweaking? Where to draw the line? 42 Embedded Systems Conference Page 21

Conclusions Today s DSP-oriented processors cannot be meaningfully compared using simplified metrics Relevant, meaningful benchmark results are essential to processor evaluation There is no ideal processor Fastest doesn t mean best The best processor depends on the details of the application Different architectural approaches make different performance trade-offs Understanding these is key to selecting a processor 43 Conclusions Consider all the options Increasing performance overlap between dissimilar architectures Alternatives increasingly viable Application requirements and processor performance are both moving targets Emerging architectures and technologies require benchmarking evolution Factors other than performance are often important Compatibility, tools, off-the-shelf software, 44 Embedded Systems Conference Page 22

For More Information www.bdti.com Free Information BDTImark2000 scores DSP Insider newsletter Pocket Guide to Processors for DSP White papers on processor architectures and benchmarking Article reprints on DSP-oriented processors and applications EE Times IEEE Spectrum IEEE Computer and others comp.dsp FAQ 2001 Edition 45 Embedded Systems Conference Page 23