Introduction to the Latest Tensilica Baseband Solutions




Dr. Chris Rowen, Founder and Chief Technology Officer, Tensilica Inc.

Outline
- The Mobile Wireless Challenge
- Multi-standard Baseband
- Tensilica Fits Baseband Design Styles
- How Efficient is an Xtensa DSP?
- Optimized Building Blocks for 3G and 4G
- BBE64 - the World's Highest Performance DSP Core
- The Next Generation: Efficient Solutions for LTE-Advanced

Baseband Processing: The Key to Mobility
Evolution of Major Cellular Standards (timeline axis: 2006 to 2012, labeled "4G Rollout")

WiMAX Evolution
- Fixed WiMAX
- Mobile WiMAX Wave 1: DL 23 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)
- Mobile WiMAX Wave 2: DL 46 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)

CDMA2000 Evolution
- EVDO Rev 0: DL 2.4 Mbps, UL 153 kbps (in 1.25 MHz)
- EVDO Rev A: DL 3.1 Mbps, UL 1.8 Mbps (in 1.25 MHz)
- EVDO Rev B: DL 14.7 Mbps, UL 4.9 Mbps (in 5 MHz)
- EVDO Rev C: DL 100 Mbps, UL 50 Mbps (in 20 MHz)

3GPP GSM EDGE Evolution
- EDGE: DL 474 kbps, UL 474 kbps
- Enhanced EDGE: DL 1.3 Mbps, UL 653 kbps
- Evolved EDGE: DL 1.89 Mbps, UL 947 kbps

3GPP UMTS Evolution
- HSDPA: DL 14.4 Mbps, UL 384 kbps (in 5 MHz)
- HSUPA/HSDPA: DL 14.4 Mbps, UL 5.76 Mbps (in 5 MHz)
- HSPA Evolution: DL 42 Mbps, UL 11.5 Mbps (in 5 MHz)

3GPP Long Term Evolution
- LTE (Rel 8): DL 150 Mbps, UL 50 Mbps (in 20 MHz)
- LTE Advanced: DL 1 Gbps, UL 500 Mbps (5-10x LTE)

Source: Rysavy Research, "Mobile Broadband: EDGE, HSPA & LTE", 2006

Tensilica Fits All Design Styles
Wireless chip customers in production, across three design styles:
- Small, programmable DPU/controller
- Function-specific, light-weight DPU/DSP
- Baseband engines: DSPs, multi-standard SDR

Only Tensilica offers this range of options and flexibility in choices
- Customers can further extend and customize the cores to their specific cost and performance requirements
- All cores are designed with the same unified tool set

How Efficient is an Xtensa DSP? 1 µW/MHz per MAC
Example: the iterative equalizer for HSPA requires a very high data rate programmable FIR filter: 7.5B complex taps per second.
Approach: an optimized Xtensa processor with FIR shift registers and parallel multiply-add units, implementing a 32-tap complex FIR engine
- 64 complex taps per cycle (16b+16b complex coefficients * 8b+8b complex data samples) = 256 multiplies per cycle
- Low-power operation: 120 MHz in 40LP
Result:
- 125K gates, including the full Xtensa processor, 256 multipliers and 256 adders
- 210 µW/MHz total core power at 100% utilization
- <0.85 µW/MHz per multiply, equivalent to RTL efficiency
- 100% processor based, with full debug, modeling and extensibility
[Block diagram labels: 256 Multiply Add Units, Mem, Base, Mem]
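The numbers on this slide can be cross-checked with a few lines of arithmetic. The sketch below is only a sanity check under stated assumptions: the function names are hypothetical, a complex multiply is counted as four real multiplies (the direct form), and the per-multiply share is simply the total per-MHz core power divided evenly across the multipliers, expressed in thousandths of whatever unit the total uses.

```c
#include <assert.h>
#include <stdint.h>

/* Direct-form complex multiply (a+jb)(c+jd) uses four real multiplies. */
enum { REAL_MULS_PER_COMPLEX_TAP = 4 };

/* Complex taps per second delivered at a given clock rate. */
uint64_t taps_per_second(uint32_t complex_taps_per_cycle, uint64_t clock_hz)
{
    return (uint64_t)complex_taps_per_cycle * clock_hz;
}

/* Real multiplies per cycle needed to sustain that many complex taps. */
uint32_t real_multiplies_per_cycle(uint32_t complex_taps_per_cycle)
{
    return complex_taps_per_cycle * REAL_MULS_PER_COMPLEX_TAP;
}

/* Per-multiplier share of the per-MHz core power, in thousandths of the
 * input unit (integer division, so the result is rounded down). */
uint32_t per_multiply_share_milli(uint32_t core_power_per_mhz,
                                  uint32_t multipliers)
{
    assert(multipliers > 0);
    return (core_power_per_mhz * 1000u) / multipliers;
}
```

With the slide's figures, `taps_per_second(64, 120000000)` gives 7.68 billion complex taps per second, which covers the 7.5B requirement; `real_multiplies_per_cycle(64)` gives 256; and `per_multiply_share_milli(210, 256)` gives 820, that is about 0.82 per multiply, consistent with the "<0.85" claim.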

Tensilica Meets a Breadth of DSP Design Requirements
[Chart: performance (GMACs) versus core size. Cores shown: user-specified DSPs, ConnX Vectra LX DSP, ConnX D2 dual-MAC DSP, ConnX BBE16, ConnX BSP3, ConnX Turbo16, ConnX SSP16, ConnX BBE64-UE and ConnX BBE64-128 (100 GMACs/sec). Xtensa µDSPs: as small as ~0.01 mm² (28nm).]

Tensilica is Number 1 for LTE DSP IP
- Multiple optimized processors giving very high performance per area and power for LTE PHY systems
  - 20% smaller area than a conventional single-DSP implementation
  - 30% lower power than a conventional single-DSP implementation
- C-programming model for all cores
- Further customization of processor core instructions
  - Achieve extremely high performance with custom instructions
  - Direct connectivity to, and operation on, external hardware blocks
- Flexible cores and implementation schemes to meet all customers' system integration requirements
- Libraries and large ecosystem support for a full PHY solution
- World-class single development tool suite for all cores

ConnX Baseband Cores
All the building blocks for multi-standard wireless

ConnX BBE16: 16-MAC DSP
- 16 simultaneous 18-bit x 18-bit MACs per cycle
- 8-way SIMD with 3-issue VLIW
- DSP computation acceleration: FFT, FIR, matrix multiply, complex arithmetic
- Small size, high performance vector DSP with advanced compiler

ConnX SSP16: Soft Stream Processor
- Tailored for high efficiency processing of soft bits
- 16-way SIMD with 2-issue VLIW
- Optimized for 8/10-bit computation
- Optimum soft stream processing performance, automatic vectorization

ConnX Turbo16: Multi-Standard Turbo Decoder
- Implements LTE and HSPA+ turbo decoding up to 150 Mbps
- 16-way SIMD with 4-issue VLIW
- 2000 RISC operations per cycle
- Software programmable turbo decoder at 150 Mbps

ConnX BSP3: Bit Stream Processor
- Advanced bit manipulation operations
- 3-issue VLIW, dual load/store units
- CRC, interleaver, scrambler
- Very high bit processing performance at ultra small size

All ConnX baseband DPUs are derived from the same Xtensa processor
Unified single set of development tools for all ConnX baseband DPUs
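To make the "CRC" entry in the bit-stream workload concrete, here is a minimal bit-serial reference for the 24-bit transport-block CRC used in LTE (generator polynomial 0x864CFB, zero initial value). It is plain portable C for illustration and is not ConnX BSP3 code; the function names and the test message are hypothetical. A bit-manipulation processor would process many bits per cycle, whereas this loop handles one bit per iteration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* LTE CRC24A generator polynomial, including the x^24 term. */
#define CRC24A_POLY 0x1864CFBu

/* Bit-serial CRC over a byte buffer, most significant bit first. */
uint32_t crc24a(const uint8_t *data, size_t len)
{
    assert(data != NULL || len == 0);
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 16;      /* feed next byte into the top */
        for (int bit = 0; bit < 8; bit++) {
            crc <<= 1;
            if (crc & 0x1000000u)            /* x^24 term set: reduce */
                crc ^= CRC24A_POLY;
        }
    }
    return crc & 0xFFFFFFu;
}

/* Usage check: appending the CRC makes the remainder zero, and a
 * single flipped bit makes it non-zero again. Returns 1 on success. */
int crc24a_append_demo(void)
{
    uint8_t block[7] = { 0x12, 0x34, 0x56, 0x78, 0, 0, 0 };
    uint32_t crc = crc24a(block, 4);
    block[4] = (uint8_t)(crc >> 16);
    block[5] = (uint8_t)(crc >> 8);
    block[6] = (uint8_t)crc;
    if (crc24a(block, 7) != 0)
        return 0;
    block[2] ^= 0x01;                        /* inject a single-bit error */
    return crc24a(block, 7) != 0;
}
```

The receiver-side check relies on the fact that, with a zero initial value and no final XOR, the CRC of a message followed by its own CRC is zero.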

Atlas System Block Diagram
Fully Software Programmable LTE Cat4 Reference Design
- Complete subsystem combines PHY hardware with LTE software libraries
- Optimized multi-core SDR design to minimize power and MHz
- Intelligent data movement engines (DMA) hide latency and maximize throughput

Atlas System Block Diagram
Fully Software Programmable LTE Cat4 Reference Design
[Chart: Others versus Tensilica]
Tensilica's User Equipment CAT4 PHY implementation is 20% smaller and 30% lower power
Plus software:
- General DSP libraries
- LTE wireless building blocks
- mimoOn complete LTE SW stack

BBE64 - the World's Highest Performance DSP Core
Design Goals and Philosophy
- World-leading DSP performance for baseband PHY in user equipment and infrastructure
  - Up to 1 GHz in an available 28nm fast standard-cell process, x 128 MACs
- Combine SIMD, VLIW and configurable instruction set features for a large application sweet spot
- Leverage the high memory system bandwidth of Xtensa LX4: 1024b per cycle
- Good control code performance
- Broad range of built-in options and user-defined extensions
- Advanced C compilers eliminate the need for assembly coding
- ConnX BBE16 upward compatibility
- Fully synthesizable RTL, with complete system modeling, verification and back-end flow environment
- Core as a building block for multi-core SOCs

ConnX BBE64 Block Diagram

[Block diagram contents]
- Data memory interface: local data RAM banks, two ports, each 512 bits wide
- Data Load/Store Unit 0 and Unit 1, each with align/pack (16/32/64/128/512 bit to 640 bit)
- Vector register file, 16 x 640 bits: 32 x 20 real, 16 x 20 complex, 64 x 10 real, 32 x 10 complex
- Register file, 8 x 640 bits: 16 x 40 real, 8 x 40 complex
- Register file, 8 x 64 bits: 64 x 1 Boolean
- General register file: 32 bits x 32 bits
- Datapath widths shown: 32 bits, 640 bits, 640 bits
- Instruction memory interface: local memory or cache (1-4 ways), 96 bits wide
- 4-way VLIW instruction decoder, with slots labeled Load, Store, ALU and MAC
- Computation unit: 64-way MAC units, 32-way SIMD ALUs, 32b ALUs

Optimized Architecture for DSP Applications
- 4-way VLIW x 32-way SIMD: 128 DSP ops/cycle
- 16/24/96b 4-issue VLIW: almost any instruction in any slot
- 128 MAC ops/cycle for matrix and filter functions (BBE64-128)
- Guard bits on all DSP data for numerical accuracy
- Protected pipeline: interlocks/bypasses for robustness
- Support for all data types from C: complex/real, scalar/vector, fractional/integer

High Bandwidth Configurable Memory Subsystem Interface
- Dual load/stores with dual 512b memory interfaces
- Full bandwidth on packed and unaligned data vectors
- DMA support for local data memory
- Extensible with special memory ports and direct-connect data queues: up to 4 x 640b per cycle
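The register and throughput figures on this slide are mutually consistent, which a short calculation can confirm. The helper names below are hypothetical, and the peak MAC rate is only an upper bound obtained by multiplying the slide's 128 MACs per cycle by the "up to 1 GHz" clock quoted elsewhere in the deck; it is not a measured result.

```c
#include <assert.h>
#include <stdint.h>

enum { VECTOR_REGISTER_BITS = 640 };

/* Bits occupied by one register view: lanes x element width x components,
 * where components is 1 for real data and 2 for complex data. */
unsigned view_bits(unsigned lanes, unsigned element_bits, unsigned components)
{
    assert(components == 1 || components == 2);
    return lanes * element_bits * components;
}

/* Operations issued per cycle by a VLIW machine with SIMD slots. */
unsigned ops_per_cycle(unsigned vliw_slots, unsigned simd_lanes)
{
    return vliw_slots * simd_lanes;
}

/* Peak MAC rate, assuming every MAC is busy on every cycle. */
uint64_t peak_macs_per_second(unsigned macs_per_cycle, uint64_t clock_hz)
{
    return (uint64_t)macs_per_cycle * clock_hz;
}

/* Aggregate load/store width per cycle across identical units. */
unsigned load_store_bits_per_cycle(unsigned units, unsigned bits_per_unit)
{
    return units * bits_per_unit;
}
```

Every listed view fills a 640-bit register exactly: for example `view_bits(32, 20, 1)`, `view_bits(16, 20, 2)` and `view_bits(8, 40, 2)` all return 640. Likewise `ops_per_cycle(4, 32)` is 128, and `load_store_bits_per_cycle(2, 512)` is the 1024b per cycle quoted for the memory system.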

BBE64 Pipeline

[Pipeline diagram]
- Main stages: I AdrGen, I Data/tag, I Align, Decode, Reg Read, Exec AdrGen, L1 data/tag, L1 Align, WB
- DSP stages: DSP Reg Read, DSP Ex1, DSP Ex2, DSP WB

Two pipeline options:
- 9-stage pipeline: higher MHz or larger memories
- 7-stage pipeline: lower power and area

Wide static in-order issue
- No dynamic branch prediction, but zero-overhead loops and SIMD predication
- Simple length encoding enables single-stage instruction decode and register specifier extraction
- DSP operations start with load return: zero load-use bubbles
- Simplified ALU/MAC operations allow DSP pipe reduction to two stages + write back, for reduced regfile cost: fewer values in flight, better utilization of slots

Data Reorganization Key to SIMD: Selection
Example operation: BBE_SEL32X20 {out vec c, in vec h, in vec l, in vec s}
[Diagram: Vector h and Vector l feed 32 muxes, each 64:1; the select fields come from Vector s]
Options:
- Select Immediate (45 patterns)
- Shuffle (single input vector)
- Shuffle Immediate (75 patterns)
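The select operation can be modeled in scalar C as 32 independent 64:1 multiplexers. The sketch below is a behavioral illustration only, not the definition of BBE_SEL32X20: the function names are hypothetical, each 20-bit lane is held in an `int32_t`, and the assumption that indices 0 to 31 pick from Vector l and 32 to 63 pick from Vector h is made here for illustration and is not stated on the slide.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 32

/* out[i] = element s[i] of the 64-element concatenation of l and h.
 * Assumed ordering: indices 0..31 select from l, 32..63 from h. */
void sel32(int32_t out[LANES], const int32_t h[LANES],
           const int32_t l[LANES], const uint8_t s[LANES])
{
    assert(out != h && out != l);        /* model does not support aliasing */
    for (int i = 0; i < LANES; i++) {
        int idx = s[i] & 63;             /* 6-bit select field per lane */
        out[i] = (idx < LANES) ? l[idx] : h[idx - LANES];
    }
}

/* Usage check: interleave the low halves of l and h with one select.
 * Returns 1 when the output is l0, h0, l1, h1, ... */
int sel32_interleave_demo(void)
{
    int32_t h[LANES], l[LANES], out[LANES];
    uint8_t s[LANES];
    for (int i = 0; i < LANES; i++) {
        l[i] = i;
        h[i] = 100 + i;
    }
    for (int k = 0; k < LANES / 2; k++) {
        s[2 * k]     = (uint8_t)k;             /* from l */
        s[2 * k + 1] = (uint8_t)(LANES + k);   /* from h */
    }
    sel32(out, h, l, s);
    for (int k = 0; k < LANES / 2; k++) {
        if (out[2 * k] != k || out[2 * k + 1] != 100 + k)
            return 0;
    }
    return 1;
}
```

Interleaving, de-interleaving, rotation and broadcast are all instances of the same operation with a different select vector, which is why a single general select covers so many data reorganization patterns.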

Relative Performance on Basic Metrics
[Chart: BBE64 performance relative to BBE16 (BBE16 = 1), by performance metric, on a scale of 0 to 9. Series: BBE16, BBE64-128. Preliminary, subject to change.]

Simple Code Example: 4x4 Complex Matrix Mul

Scalar C code (with DSP-extended scalar types, e.g. complex fractions):

    static xb_cq15 a1[4][4][NSAMPLES];
    static xb_cq15 b1[4][4][NSAMPLES];
    static xb_cq15 c1[4][4][NSAMPLES];

    void mm_auto_opt_4x4_stream_complex()
    {
        int i, j, h;
        for (i = 0; i < 4; i++) {
            /* The body computes four output columns (j .. j+3) per pass. */
            for (j = 0; j < 4; j += 4) {
                /* One 4x4 product per sample h of the stream. */
                for (h = 0; h < NSAMPLES; h++) {
                    c1[i][j][h]   = (xb_cq4_15)(a1[i][0][h] * b1[0][j][h])
                                  + (xb_cq4_15)(a1[i][1][h] * b1[1][j][h])
                                  + (xb_cq4_15)(a1[i][2][h] * b1[2][j][h])
                                  + (xb_cq4_15)(a1[i][3][h] * b1[3][j][h]);
                    c1[i][j+1][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+1][h])
                                  + (xb_cq4_15)(a1[i][1][h] * b1[1][j+1][h])
                                  + (xb_cq4_15)(a1[i][2][h] * b1[2][j+1][h])
                                  + (xb_cq4_15)(a1[i][3][h] * b1[3][j+1][h]);
                    c1[i][j+2][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+2][h])
                                  + (xb_cq4_15)(a1[i][1][h] * b1[1][j+2][h])
                                  + (xb_cq4_15)(a1[i][2][h] * b1[2][j+2][h])
                                  + (xb_cq4_15)(a1[i][3][h] * b1[3][j+2][h]);
                    c1[i][j+3][h] = (xb_cq4_15)(a1[i][0][h] * b1[0][j+3][h])
                                  + (xb_cq4_15)(a1[i][1][h] * b1[1][j+3][h])
                                  + (xb_cq4_15)(a1[i][2][h] * b1[2][j+3][h])
                                  + (xb_cq4_15)(a1[i][3][h] * b1[3][j+3][h]);
                }
            }
        }
    }

Inner-loop compiler-generated code, with vectorization, software pipelining and op-merging (one 4-slot VLIW bundle per line):

    loopgtz a4,.lbb34_mm_auto_opt_4x4_stream_complex
    {bbe_lv32x16s.ip v0,a2,512    nop                          bbe_mula32x18cpackq v5,v11,v0   bbe_mula32x18cpackq v6,v15,v0}
    {bbe_lv32x16s.i  v0,a2,1536   bbe_lv32x16s.i  v3,a2,3584   bbe_mul32x18cpackq  v1,v8,v0    bbe_mul32x18cpackq  v2,v12,v0}
    {bbe_lv32x16s.i  v0,a2,5632   bbe_lv32x16s.ip v4,a2,512    bbe_mula32x18cpackq v1,v9,v0    bbe_mula32x18cpackq v2,v13,v0}
    {bbe_lv32x16s.i  v3,a2,1536   bbe_lv32x16s.i  v7,a2,3584   bbe_mula32x18cpackq v1,v10,v3   bbe_mula32x18cpackq v2,v14,v3}
    {bbe_sv32x16s.ip v5,a3,512    bbe_lv32x16s.i  v0,a2,5632   bbe_mula32x18cpackq v1,v11,v0   bbe_mula32x18cpackq v2,v15,v0}
    {nop                          bbe_sv32x16s.i  v6,a3,1536   bbe_mul32x18cpackq  v5,v8,v4    bbe_mul32x18cpackq  v6,v12,v4}
    {bbe_sv32x16s.ip v1,a3,512    nop                          bbe_mula32x18cpackq v5,v9,v3    bbe_mula32x18cpackq v6,v13,v3}
    {bbe_sv32x16s.i  v2,a3,1536   nop                          bbe_mula32x18cpackq v5,v10,v7   bbe_mula32x18cpackq v6,v14,v7}
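For readers without the Xtensa toolchain, the same streamed 4x4 complex product can be written in portable C and used as a golden reference when checking vectorized output. This is a sketch under stated assumptions: `cq15`, `mm4x4_stream_ref`, `mm4x4_selfcheck` and the value of `NSAMPLES` are hypothetical, products are accumulated in 64-bit integers in place of hardware guard bits, and the final conversion truncates toward zero and saturates, which may differ from the rounding of the DSP-extended types.

```c
#include <assert.h>
#include <stdint.h>

#define NSAMPLES 8   /* hypothetical stream length, for illustration only */

/* Q15 complex sample: a plain-C stand-in for the DSP-extended type. */
typedef struct { int16_t re, im; } cq15;

/* Convert a Q30 sum to Q15, truncating toward zero and saturating. */
static int16_t q30_to_q15_sat(int64_t q30)
{
    int64_t q15 = q30 / 32768;
    if (q15 > INT16_MAX) return INT16_MAX;
    if (q15 < INT16_MIN) return INT16_MIN;
    return (int16_t)q15;
}

/* c[i][j][h] = sum over k of a[i][k][h] * b[k][j][h], for every sample h. */
void mm4x4_stream_ref(cq15 a[4][4][NSAMPLES], cq15 b[4][4][NSAMPLES],
                      cq15 c[4][4][NSAMPLES])
{
    assert(a != c && b != c);            /* output must not alias an input */
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            for (int h = 0; h < NSAMPLES; h++) {
                int64_t re = 0, im = 0;  /* wide sums stand in for guard bits */
                for (int k = 0; k < 4; k++) {
                    cq15 x = a[i][k][h], y = b[k][j][h];
                    re += (int64_t)x.re * y.re - (int64_t)x.im * y.im;
                    im += (int64_t)x.re * y.im + (int64_t)x.im * y.re;
                }
                c[i][j][h].re = q30_to_q15_sat(re);
                c[i][j][h].im = q30_to_q15_sat(im);
            }
        }
    }
}

/* Usage check: with a = 0.5 * identity, every output must equal b / 2.
 * Returns 1 on success. */
int mm4x4_selfcheck(void)
{
    static cq15 a[4][4][NSAMPLES], b[4][4][NSAMPLES], c[4][4][NSAMPLES];
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int h = 0; h < NSAMPLES; h++) {
                a[i][j][h].re = (i == j) ? 16384 : 0;   /* 0.5 in Q15 */
                a[i][j][h].im = 0;
                b[i][j][h].re = (int16_t)(2048 * (i + 1));
                b[i][j][h].im = (int16_t)(-1024 * (j + 1) - 2 * h);
            }
    mm4x4_stream_ref(a, b, c);
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            for (int h = 0; h < NSAMPLES; h++)
                if (c[i][j][h].re != b[i][j][h].re / 2 ||
                    c[i][j][h].im != b[i][j][h].im / 2)
                    return 0;
    return 1;
}
```

The test values are chosen to be even so that halving is exact and the check does not depend on the rounding mode.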

SDR SOC LTE-A (CAT-6): One BBE64 + auxiliary cores

Wrap up: Looking at long-term silicon and system trends
1. Continued focus on energy for mobility and cost
2. Volume is in terminal devices: dominated by access (radios) and presentation (media)
3. Applications continue to migrate to data-center/cloud: dominated by data storage and access
4. Expertise in data-intensive real-time functions is essential:
   - Wireless/DSP
   - Multimedia: audio, video
   - Imaging/recognition/rendering
   - Data manipulation under IP
5. Sea-of-processors SOC design is increasingly real: design simplicity with processor generation, greater technical and market flexibility