Introduction to the Latest Tensilica Baseband Solutions
Dr. Chris Rowen, Founder and Chief Technology Officer, Tensilica Inc.
Outline
- The Mobile Wireless Challenge: Multi-standard Baseband
- Tensilica Fits Baseband Design Styles
- How Efficient is an Xtensa DSP?
- Optimized Building Blocks for 3G and 4G
- BBE64 - the World's Highest Performance DSP Core
- The Next Generation: Efficient Solutions for LTE-Advanced
Baseband Processing: The Key to Mobility
Evolution of Major Cellular Standards (timeline 2006-2012, leading to 4G rollout)

WiMAX Evolution
- Fixed WiMAX
- Mobile WiMAX Wave 1: DL 23 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)
- Mobile WiMAX Wave 2: DL 46 Mbps, UL 4 Mbps (10 MHz, 3:1 TDD)

CDMA2000 Evolution
- EVDO Rev 0: DL 2.4 Mbps, UL 153 Kbps (in 1.25 MHz)
- EVDO Rev 1: DL 3.1 Mbps, UL 1.8 Mbps (in 1.25 MHz)
- EVDO Rev 2: DL 14.7 Mbps, UL 4.9 Mbps (in 5 MHz)
- EVDO Rev 3: DL 100 Mbps, UL 50 Mbps (in 20 MHz)

3GPP GSM EDGE Evolution
- EDGE: DL 474 Kbps, UL 474 Kbps
- Enhanced EDGE: DL 1.3 Mbps, UL 653 Kbps
- Evolved EDGE: DL 1.89 Mbps, UL 947 Kbps

3GPP UMTS Evolution
- HSDPA: DL 14.4 Mbps, UL 384 Kbps (in 5 MHz)
- HSUPA/HSDPA: DL 14.4 Mbps, UL 5.76 Mbps (in 5 MHz)
- HSPA Evolution: DL 42 Mbps, UL 11.5 Mbps (in 5 MHz)

3GPP Long Term Evolution
- LTE (Rel 8): DL 150 Mbps, UL 50 Mbps (in 20 MHz)
- LTE Advanced: DL 1 Gbps, UL 500 Mbps (5-10x LTE)

Source: Rysavy Research, "Mobile Broadband: EDGE, HSPA & LTE", 2006
Tensilica Fits All Design Styles
[Diagram: three design styles used by wireless chip customers in production: small, programmable DPU/controller; function-specific light-weight DPU/DSP; baseband engine DSPs for multi-standard SDR]
- Only Tensilica offers this range of options and flexibility of choice
- Customers can further extend and customize cores to their specific cost and performance requirements
- All cores are designed with the same unified tool set
How Efficient is an Xtensa DSP? 1 µW/MHz per MAC
Example: an iterative equalizer for HSPA requires a very high data rate programmable FIR filter: 7.5B complex taps per second.
Approach: optimized Xtensa processor with FIR shift registers and parallel multiply-add units, implementing a 32-tap complex FIR engine
- 64 complex taps per cycle (16b + 16b complex coefficients * 8b + 8b complex data samples) = 256 multiplies per cycle
- Low power operation: 120 MHz in 40LP
Result:
- 125K gates, including full Xtensa processor, 256 multipliers and 256 adders
- 210 µW/MHz total core power at 100% utilization
- <0.85 µW/MHz per multiply, equivalent to RTL efficiency
- 100% processor based, with full debug, modeling and extensibility
[Diagram: base Xtensa processor and memories extended with 256 multiply-add units]
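The figures on this slide can be cross-checked with a few lines of arithmetic. The sketch below is plain C, not Tensilica code. Its function names are hypothetical, and it assumes each complex tap costs four real multiplies, which is what the slide's 64 taps = 256 multiplies implies.

```c
#include <stdint.h>

/* Hypothetical helper type: a complex accumulator. */
typedef struct { int32_t re, im; } cacc_t;

/* One complex FIR tap, acc += coef * sample, using 4 real multiplies.
   Coefficients are 16b + 16b, data samples are 8b + 8b, as on the slide. */
cacc_t complex_tap(cacc_t acc, int16_t cr, int16_t ci, int8_t xr, int8_t xi)
{
    acc.re += (int32_t)cr * xr - (int32_t)ci * xi;
    acc.im += (int32_t)cr * xi + (int32_t)ci * xr;
    return acc;
}

/* Real multiplies needed per cycle for a given number of complex taps. */
unsigned real_multiplies_per_cycle(unsigned complex_taps_per_cycle)
{
    return 4u * complex_taps_per_cycle;
}

/* Sustained complex taps per second at a given clock frequency in Hz. */
uint64_t complex_taps_per_second(unsigned taps_per_cycle, uint64_t clock_hz)
{
    return (uint64_t)taps_per_cycle * clock_hz;
}

/* Core power per multiplier in nW/MHz, from total core power in uW/MHz
   (integer division, rounds down). */
unsigned nw_per_mhz_per_multiplier(unsigned total_uw_per_mhz, unsigned multipliers)
{
    return total_uw_per_mhz * 1000u / multipliers;
}
```

With the slide's numbers, 64 taps per cycle at 120 MHz gives 7.68B complex taps per second, just above the 7.5B requirement. 210 µW/MHz over 256 multipliers is about 0.82 µW/MHz each, consistent with the quoted <0.85 µW/MHz.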
Tensilica Meets a Breadth of DSP Design Requirements
[Chart: performance (GMACs) versus core size. Cores shown: user-specified DSPs, ConnX VectraLX DSP, ConnX D2 Dual-MAC DSP, ConnX BBE16, ConnX BSP3, ConnX Turbo16, ConnX SSP16, ConnX BBE64-UE, ConnX BBE64-128 (annotated 100 GMACs/sec)]
Xtensa µDSPs: as small as ~0.01 mm² (28 nm)
Tensilica is Number 1 for LTE DSP IP
- Multiple optimized processors giving very high performance per area and power for LTE PHY systems
  - 20% smaller area than a conventional single-DSP implementation
  - 30% lower power than a conventional single-DSP implementation
- C-programming model for all cores
- Further customization of processor core instructions
  - Achieve extremely high performance with custom instructions
  - Direct connectivity to, and operation on, external hardware blocks
- Flexible cores and implementation schemes to meet all customers' system integration requirements
- Libraries and large ecosystem support for a full PHY solution
- World-class single development tool suite for all cores
ConnX Baseband Cores: All the building blocks for multi-standard wireless

ConnX BBE16 - 16-MAC DSP
- 16 simultaneous 18-bit x 18-bit MACs per cycle
- 8-way SIMD with 3-issue VLIW
- DSP computation acceleration: FFT, FIR, matrix multiply, complex arithmetic
- Small size, high performance vector DSP with advanced compiler

ConnX SSP16 - Soft Stream Processor
- Tailored for high efficiency processing of soft bits
- 16-way SIMD with 2-issue VLIW
- Optimized for 8/10-bit computation
- Optimum soft stream processing performance, automatic vectorization

ConnX Turbo16 - Multi-Standard Turbo Decoder
- Implements LTE and HSPA+ turbo decoding up to 150 Mbps
- 16-way SIMD with 4-issue VLIW
- 2000 RISC operations per cycle
- Software programmable turbo decoder at 150 Mbps

ConnX BSP3 - Bit Stream Processor
- Advanced bit manipulation operations
- 3-issue VLIW, dual load/store units
- CRC, interleaver, scrambler
- Very high bit processing performance at ultra small size

All ConnX Baseband DPUs are derived from the same Xtensa processor
Unified single set of development tools for all ConnX baseband DPUs
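As an illustration of the bit-stream work listed for the BSP3 (CRC, interleaver, scrambler), the sketch below is a bit-serial reference for the LTE CRC24A checksum in portable C. It is not BSP3 code and uses no Tensilica intrinsics. The function name `crc24a` is hypothetical, and the generator polynomial 0x864CFB is taken from 3GPP TS 36.212, not from this deck. A bit-stream processor would presumably collapse the inner bit loop into a few wide operations.

```c
#include <stddef.h>
#include <stdint.h>

/* gCRC24A generator polynomial (3GPP TS 36.212), x^24 term implicit. */
#define CRC24A_POLY 0x864CFBu

/* Bit-serial CRC over a byte buffer, MSB first, zero initial state. */
uint32_t crc24a(const uint8_t *data, size_t len)
{
    uint32_t crc = 0;
    for (size_t i = 0; i < len; i++) {
        crc ^= (uint32_t)data[i] << 16;          /* feed byte into top of register */
        for (int b = 0; b < 8; b++) {
            crc <<= 1;
            if (crc & 0x1000000u)                /* bit shifted out of 24-bit register */
                crc ^= 0x1000000u | CRC24A_POLY; /* clear it and subtract the polynomial */
        }
    }
    return crc & 0xFFFFFFu;
}
```

A receiver can check a block by running the same function over the message followed by its three CRC bytes: the result is zero when no errors are detected.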
Atlas System Block Diagram
Fully Software Programmable LTE Cat4 Reference Design
- Complete subsystem combines PHY hardware with LTE software libraries
- Optimized multi-core SDR design to minimize power and MHz
- Intelligent data movement engines (DMA) hide latency and maximize throughput
Atlas System Block Diagram
Fully Software Programmable LTE Cat4 Reference Design
[Chart: others versus Tensilica]
- Tensilica's User Equipment CAT4 PHY implementation is 20% smaller and 30% lower power
- Plus software:
  - General DSP libraries
  - LTE wireless building blocks
  - mimoOn complete LTE SW stack
BBE64 - the World's Highest Performance DSP Core
Design Goals and Philosophy
- World-leading DSP performance for baseband PHY in and infrastructure
  - Up to 1 GHz x 128 MACs in available 28 nm fast standard-cell process
  - Combine SIMD, VLIW and configurable instruction set features for a large application sweet spot
  - Leverage the high memory system bandwidth of Xtensa LX4: 1024b per cycle
- Good control code performance
- Broad range of built-in options and user-defined extensions
- Advanced C compilers eliminate the need for assembly coding
- ConnX BBE16 upward compatibility
- Fully synthesizable RTL, with complete system modeling, verification and back-end flow environment
- Core as building block for multi-core SOC
ConnX BBE64 Block Diagram
[Block diagram; recoverable labels:]
- Instruction Memory Interface: local memory or cache (1-4 ways), 96 bits wide
- 4-way VLIW instruction decoder; slot labels: Load, Store, Load, Store, ALU, MAC, ALU, MAC, ALU
- Data Memory Interface: local data RAM banks, two ports each 512 bits wide
- Data Load/Store Units 0 and 1, each with align/pack (16/32/64/128/512 bit to 640 bit)
- Vector Register File:
  - 16 x 640 bits: 32 x 20 real, 16 x 20 complex, 64 x 10 real, 32 x 10 complex
  - 8 x 640 bits: 16 x 40 real, 8 x 40 complex
  - 8 x 64 bits: 64 x 1 Boolean
- General Register File: 32 x 32 bits
- Computation Unit: two 64-way MACs, three 32-way SIMD ALUs, three 32b ALUs

Optimized Architecture for DSP Applications
- 4-way VLIW x 32-way SIMD = 128 DSP ops/cycle
- 16/24/96b instructions; 4-issue VLIW with almost any instruction in any slot
- 128 MAC ops/cycle for matrix and filter functions (BBE64-128)
- Guard bits on all DSP data for numerical accuracy
- Protected pipeline: interlocks/bypasses for robustness
- Support for all data types from C: complex/real, scalar/vector, fractional/integer

High Bandwidth Configurable Memory Subsystem Interface
- Dual load/stores with dual 512b memory interfaces
- Full bandwidth on packed and unaligned data vectors
- DMA support for local data memory
- Extensible with special memory ports and direct-connect data queues: up to 4 x 640b per cycle
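The guard bits mentioned above can be illustrated with a small portable C model. It assumes, based on the 512-bit to 640-bit load path and the 32 x 20 register layout, that 16-bit samples are held in 20-bit lanes, which leaves 4 guard bits. The function names are hypothetical and nothing here is BBE64 code.

```c
#include <stdbool.h>
#include <stdint.h>

/* True when v fits in a two's-complement field of the given width. */
bool fits_signed(int32_t v, int bits)
{
    int32_t lo = -((int32_t)1 << (bits - 1));
    int32_t hi = ((int32_t)1 << (bits - 1)) - 1;
    return v >= lo && v <= hi;
}

/* Plain sum of n 16-bit samples, kept in a wide accumulator so the
   caller can ask which lane widths could have held the result. */
int32_t sum_samples(const int16_t *x, int n)
{
    int32_t acc = 0;
    for (int i = 0; i < n; i++)
        acc += x[i];
    return acc;
}
```

Under that assumption, sixteen full-scale 16-bit samples sum to 524272, which overflows a 16-bit lane but still fits in 20 bits. A seventeenth full-scale sample would overflow the 20-bit lane as well, so 4 guard bits buy exactly 16 worst-case accumulations.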
BBE64 Pipeline
[Pipeline diagram; stage labels: I AdrGen, I Data/tag, I Align, Decode, Reg Read, Exec, AdrGen, L1 data/tag, L1 Align, WB, DSP Reg Read, DSP Ex1, DSP Ex2, DSP WB]
- Two pipeline options:
  - 9-stage pipeline: higher MHz or larger memories
  - 7-stage pipeline: lower power and area
- Wide static in-order issue
- No dynamic branch prediction, but zero-overhead loops and SIMD predication
- Simple length encoding enables single-stage instruction decode and register specifier extraction
- DSP operations start with load return: zero load-use bubbles
- Simplified ALU/MAC operations allow DSP pipe reduction to two stages + write back, for reduced regfile cost: fewer values in flight, better utilization of slots
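SIMD predication replaces a data-dependent branch with a per-lane Boolean mask, which is why the lack of dynamic branch prediction matters less for vector code. The sketch below models the idea in portable C for a 32-lane vector. The function names and the absolute-value example are illustrative assumptions, not BBE64 operations.

```c
#include <stdbool.h>
#include <stdint.h>

#define LANES 32

/* Compare step: build a per-lane mask, true where the lane is negative. */
void lanes_negative(bool mask[LANES], const int32_t x[LANES])
{
    for (int i = 0; i < LANES; i++)
        mask[i] = x[i] < 0;
}

/* Predicated step: every lane runs the same operation stream; the mask
   only decides which result each lane keeps. Here that yields |x|. */
void predicated_negate(int32_t out[LANES], const int32_t x[LANES],
                       const bool mask[LANES])
{
    for (int i = 0; i < LANES; i++)
        out[i] = mask[i] ? -x[i] : x[i];
}
```

On a SIMD machine each loop body would be a single vector operation, with the mask held in a Boolean register such as the 64 x 1 Boolean registers shown in the block diagram.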
Data Reorganization Key to SIMD: Selection
Example operation: BBE_SEL32X20 {out vec c, in vec h, in vec l, in vec s}
[Diagram: 32 muxes, each 64:1, pick output elements from Vector h and Vector l under control of select fields from Vector s]
Options:
- Select Immediate (45 patterns)
- Shuffle (single input vector)
- Shuffle Immediate (75 patterns)
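A scalar reference model makes the select operation concrete. The sketch below is portable C, not the BBE_SEL32X20 intrinsic. The function name `sel32` is hypothetical, and the index convention (0-31 picks from l, 32-63 picks from h) is an assumption, since the slide only shows that each output lane has a 64:1 choice.

```c
#include <stdint.h>

#define LANES 32

/* Each output lane i takes one of the 64 input elements {l[0..31], h[0..31]},
   chosen by the low 6 bits of select field s[i].
   Assumed convention: indices 0..31 address l, 32..63 address h.
   The output array c must not overlap h or l. */
void sel32(int32_t c[LANES], const int32_t h[LANES], const int32_t l[LANES],
           const uint8_t s[LANES])
{
    for (int i = 0; i < LANES; i++) {
        unsigned idx = s[i] & 63u;
        c[i] = (idx < LANES) ? l[idx] : h[idx - LANES];
    }
}
```

For example, a select vector with s[2k] = k and s[2k+1] = 32 + k interleaves the first 16 elements of l and h, the kind of reordering needed to pack separate real and imaginary streams into complex pairs.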
Relative Performance on Basic Metrics
[Bar chart: BBE64-128 performance relative to BBE16 (BBE16 = 1) for each performance metric; vertical axis from 0 to 9]
Preliminary, subject to change
Simple Code Example: 4x4 Complex Matrix Multiply
Scalar C code (with DSP-extended scalar types, e.g. complex fractions):

    static xb_cq15 a1[4][4][NSAMPLES];
    static xb_cq15 b1[4][4][NSAMPLES];
    static xb_cq15 c1[4][4][NSAMPLES];

    void mm_auto_opt_4x4_stream_complex()
    {
        int i, j, h;
        for (i = 0; i < 4; i++) {
            /* Each pass computes output columns j..j+3, so j advances by 4. */
            for (j = 0; j < 4; j += 4) {
                /* One independent 4x4 product per sample h. */
                for (h = 0; h < NSAMPLES; h++) {
                    c1[i][j][h] =
                        (xb_cq4_15)(a1[i][0][h] * b1[0][j][h]) +
                        (xb_cq4_15)(a1[i][1][h] * b1[1][j][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j][h]) +
                        (xb_cq4_15)(a1[i][3][h] * b1[3][j][h]);
                    c1[i][j+1][h] =
                        (xb_cq4_15)(a1[i][0][h] * b1[0][j+1][h]) +
                        (xb_cq4_15)(a1[i][1][h] * b1[1][j+1][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+1][h]) +
                        (xb_cq4_15)(a1[i][3][h] * b1[3][j+1][h]);
                    c1[i][j+2][h] =
                        (xb_cq4_15)(a1[i][0][h] * b1[0][j+2][h]) +
                        (xb_cq4_15)(a1[i][1][h] * b1[1][j+2][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+2][h]) +
                        (xb_cq4_15)(a1[i][3][h] * b1[3][j+2][h]);
                    c1[i][j+3][h] =
                        (xb_cq4_15)(a1[i][0][h] * b1[0][j+3][h]) +
                        (xb_cq4_15)(a1[i][1][h] * b1[1][j+3][h]) +
                        (xb_cq4_15)(a1[i][2][h] * b1[2][j+3][h]) +
                        (xb_cq4_15)(a1[i][3][h] * b1[3][j+3][h]);
                }
            }
        }
    }

Compiler-generated inner loop, showing vectorization, software pipelining and op-merging (one VLIW bundle per line):

    loopgtz a4,.lbb34_mm_auto_opt_4x4_stream_complex
    {bbe_lv32x16s.ip v0,a2,512   nop                         bbe_mula32x18cpackq v5,v11,v0  bbe_mula32x18cpackq v6,v15,v0}
    {bbe_lv32x16s.i v0,a2,1536   bbe_lv32x16s.i v3,a2,3584   bbe_mul32x18cpackq v1,v8,v0    bbe_mul32x18cpackq v2,v12,v0}
    {bbe_lv32x16s.i v0,a2,5632   bbe_lv32x16s.ip v4,a2,512   bbe_mula32x18cpackq v1,v9,v0   bbe_mula32x18cpackq v2,v13,v0}
    {bbe_lv32x16s.i v3,a2,1536   bbe_lv32x16s.i v7,a2,3584   bbe_mula32x18cpackq v1,v10,v3  bbe_mula32x18cpackq v2,v14,v3}
    {bbe_sv32x16s.ip v5,a3,512   bbe_lv32x16s.i v0,a2,5632   bbe_mula32x18cpackq v1,v11,v0  bbe_mula32x18cpackq v2,v15,v0}
    {nop                         bbe_sv32x16s.i v6,a3,1536   bbe_mul32x18cpackq v5,v8,v4    bbe_mul32x18cpackq v6,v12,v4}
    {bbe_sv32x16s.ip v1,a3,512   nop                         bbe_mula32x18cpackq v5,v9,v3   bbe_mula32x18cpackq v6,v13,v3}
    {bbe_sv32x16s.i v2,a3,1536   nop                         bbe_mula32x18cpackq v5,v10,v7  bbe_mula32x18cpackq v6,v14,v7}
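For readers without the Tensilica toolchain, the same computation can be written in portable C as a reference model. The sketch below is an assumption-laden stand-in: `cint_t`, `mm4x4_stream` and the value of `NSAMPLES` are hypothetical, and it uses plain integer complex arithmetic without the fractional scaling or guard-bit types (`xb_cq15`, `xb_cq4_15`) used on the slide.

```c
#include <stdint.h>

#define NSAMPLES 8   /* hypothetical stream length for illustration */

/* Plain integer complex value; no Q15 scaling is modelled. */
typedef struct { int32_t re, im; } cint_t;

/* For every sample h, compute the 4x4 complex product c = a * b.
   The innermost index is the sample stream, which is the dimension a
   vectorizing compiler can map onto SIMD lanes. */
void mm4x4_stream(cint_t c[4][4][NSAMPLES],
                  cint_t a[4][4][NSAMPLES],
                  cint_t b[4][4][NSAMPLES])
{
    for (int i = 0; i < 4; i++) {
        for (int j = 0; j < 4; j++) {
            for (int h = 0; h < NSAMPLES; h++) {
                cint_t acc = {0, 0};
                for (int k = 0; k < 4; k++) {
                    cint_t x = a[i][k][h], y = b[k][j][h];
                    acc.re += x.re * y.re - x.im * y.im;
                    acc.im += x.re * y.im + x.im * y.re;
                }
                c[i][j][h] = acc;
            }
        }
    }
}
```

Each output element needs 4 complex multiply-accumulates, so a block of four output columns needs 16. That matches the 16 `bbe_mul`/`bbe_mula` operations in the eight bundles of the compiler-generated loop.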
SDR SOC LTE-A (CAT-6): One BBE64 + auxiliary cores
Wrap up: Looking at long-term silicon and system trends
1. Continued focus on energy for mobility and cost
2. Volume is in terminal devices: dominated by access (radios) and presentation (media)
3. Applications continue to migrate to data-center/cloud: dominated by data storage and access
4. Expertise in data-intensive real-time functions essential:
   - Wireless/DSP
   - Multimedia: audio, video
   - Imaging/recognition/rendering
   - Data manipulation under IP
5. Sea-of-processors SOC design increasingly real:
   - design simplicity with processor generation
   - greater technical and market flexibility