Xilinx Zynq-7000 EPP An Extensible Processing Platform Family Vidya Rajagopalan, Vamsi Boppana, Sandeep Dutta, Brad Taylor, Ralph Wittig August 18, 2011
Agenda Xilinx Series 7 Highlights Zynq-7000 EPP Architecture & Silicon Zynq-7000 Software & Applications Summary Page 2
Xilinx 7 Series Highlights 7 Series silicon devices 28 nm Technology, TSMC HPL process 50% reduction in power over 40 nm devices 3 FPGA Fabrics Artix = Low cost, low power FPGA ( 1W FPGA ) Design Green by Xilinx Kintex = Density & performance FPGA ( Market Sweet spot ) Virtex = Highest density and performance FPGA ( More than Moore ) More than Moore density increase Up to 2M logic cells Using Stacked Silicon Interconnect Technology (SSIT) Improved GT bandwidth GT bandwidth increased to 28 GHz Zynq Embedded Processing Platform (EPP) Page 3
More Than Moore Xilinx Stacked Silicon Interconnect Technology Microbumps Access to power / ground / IOs Access to logic regions Leverages ubiquitous image sensor Through-silicon micro-bump technology Vias (TSV) Only bridge power / ground / IOs to C4 bumps Coarse pitch, low density aids manufacturability Etch process (not laser drilled) Passive Silicon Interposer (65nm Generation) 4 conventional metal layers connect micro bumps & TSVs No transistors means low risk and no TSV induced performance degradation Side-by-Side Die Layout Minimal heat flux issues Minimal design tool flow impact 28nm FPGA Slice 28nm FPGA Slice 28nm FPGA Slice 28nm FPGA Slice Microbumps Silicon Interposer Through Silicon Vias Package Substrate C4 Bumps BGA Balls Page 4
Agenda Xilinx Series 7 Highlights Zynq-7000 EPP Architecture & Silicon Zynq-7000 Software & Applications Summary Page 5
Zynq-7020 Device Processor System (PS) ARM Cortex-A9 MPcore Standard Peripherals 32-bit DDR3 / LPDDR2 controller 54 Multi-Use IOs 73 DDR IOs Programmable Logic (PL) 85 K Logic Cells 106K FFs 140 32-Kb Block RAM 220 DSP Blocks Dual 12-bit ADC Secure configuration engine 4 Clock Management Tiles 200 Select IO (1.2-3.3V) Secure Configuration Processor System Processor System Programmable Logic Select IOs DSP BRAM XADC CMT Page 6
Zynq-7000 Device Family Zynq-7000 EPP Devices Z-7010 Z-7020 Z-7030 Z-7040 Processing System Processor Core Processor Extensions Max Frequency Memory External Memory Support Peripherals Dual ARM Cortex -A9 MPCore NEON & Single / Double Precision Floating Point 800MHz L1 Cache 32KB I / D, L2 Cache 512KB, on-chip Memory 256KB DDR3, DDR2, LPDDR2, 2x QSPI, NAND, NOR 2x USB 2.0 (OTG), 2x Tri-mode Gigabit Ethernet, 2x SD/SDIO, 2x UART, 2x CAN 2.0B, 2x I2C, 2x SPI, 4x 32b GPIO Programmable Logic I/O Approximate ASIC Gates ~430K (30k LC) ~1.3M (85k LC) ~1.9M (125k LC) ~3.5M (235k LC) Extensible Block RAM 240KB 560KB 1,060KB 1,860KB Peak DSP Performance (Symmetric FIR) 58 GMACS 158 GMACS 480 GMACS 912 GMACS PCI Express (Root Complex or Endpoint) - Gen2 x4 Gen2 x8 Agile Mixed Signal (XADC) 2x 12bit 1Msps A/D Converter Processor System IO 130 Multi Standards 3.3V IO 100 200 100 200 Multi Standards High Performance 1.8V IO - - 150 150 Multi Gigabit Transceivers - - 4 12 Page 7
Zynq-7000 Processor System (PS) Dual Core Cortex ARM A9 NEON, 512 KB L2 cache 256 KB On-Chip-Memory (OCM) DDR Interface DDR3 Performance High BW utilization Config & Legacy Memory I/F Quad-SPI, NOR, NAND Standard Peripherals GigE Available to PS IO or to Programmable Logic System Level Peripherals Clock generation, Counter Timers 8 Channel DMA controller Coresight Debugging PS Peripherals can be multiplexed onto 54 external Multi-Use-IOs (MIO) MIO[53:0] GigE (2) USB (2) SPI (2) I2C (2) UART (2) CAN (2) GPIO PLL(3) Coresight Config Config PS Peripherals can also be routed through Security the Programmable XADC Logic Zynq 7000 EPP Processing System Cortex A9 NEON 32KB I$/D$ SCU Cortex A9 NEON 32KB I$/D$ Timers, AWDT, GIC, ACP L2 Cache 512KB OCM 256 KB DMA TTC SWDT AMBA AXI Interconnect Programmable Logic Quad SPI CTRL Parallel CTRL NAND CTRL DDR CTRL SDIO (2) General Purpose ACP High Performance Select IO GTs PCIe LPDDR2 DDR2 DDR3 32-bit NOR NAND Quad-SPI Page 8
Zynq-7000 Programmable Logic (PL) Programmable Logic Resources 30K 235 K Logic Cells Dedicated 36 K-bit BRAMs, DSP, CMT XADC dual channel 12-bit ADC Up to 12 GTs with PCIe hard core Up to 300 Select IOs Programmable Logic AXI Interfaces Multiple 32/64 bit AXI interfaces to PL Accelerator Coherency Port (ACP) with access to caches Programmable Logic System Interfaces Interrupts, DMA control Debug High Performance PL Configuration Security Decryption Engine Under 200 ms configuration time from flash Debugging interfaces GigE (2) USB (2) SPI (2) I2C (2) UART (2) CAN (2) GPIO PLL(3) Coresight Config Config Security XADC Zynq 7000 EPP Processing System Cortex A9 NEON 32KB I$/D$ SCU Cortex A9 NEON 32KB I$/D$ Timers, AWDT, GIC, ACP L2 Cache 512KB OCM 256 KB DMA TTC SWDT AMBA AXI Interconnect Programmable Logic Quad SPI CTRL Parallel CTRL NAND CTRL DDR CTRL SDIO (2) General Purpose ACP High Performance Select IO GTs PCIe DDR3 32-bit 4-8 GB/sec Page 9
Customizing Zynq Tools for the Programmable Logic System Builder Clocking Flexible clock sources (PS or PL) Simple clock interfaces Memory and Peripheral access PL access to all memory: Caches, OCM, DDR 2 dedicated DDR ports ensure bandwidth PL access to all peripherals in PS Interconnect AXI Interconnect IP available from Xilinx Optimized for FPGA implementation Debug and Misc. Bidirectional cross-triggers (Coresight and Chipscope) 16 general purpose interrupts from PL to PS Simple Clock Interfaces Relative Latency 3.0 2.5 2.0 1.5 1.0 0.5 D Q Q D To PL Zynq-7000 DDR Utilization vs Latency 0.0 0% 20% 40% 60% 80% 100% DDR Utilization PL clock From PL Page 10
SW managed Programmable Logic (PL) Linux based, remote controlled, programmable logic SW user experience: SoC with integrated PL Configure PL (full and partial) Start/stop & single step clocks Setup & update HW triggers Monitor HW performance counters Observe & sync to PL hardware events PS ARM Coresight H/W Events Cross Trigger PL Config clocks Page 11
Zynq-7000 Power Saving Features Low power 1.0V HPL 28 nm process silicon technology Programmable Logic can be powered off and on as needed 40-90% reduction in static power depending on device Very fast configuration times when loaded from DRAM Low power ARM Cortex-A9 MP Incorporates clock gating and power-down modes Support for LPDDR2 devices Ultra low power self refresh Peripherals shutdown Design Green by Xilinx Page 12
Engineering Insights Process Selection Criteria Power Frequency Process Considerations: 28 HP: Highest Performance HKMG Process (but must be able to afford power; e.g. GPU) 28 HPL: Low power HKMG process (shifts down HP power / performance range) 28 LP: No HKMG low power process (cheaper than HPL, but less performance) Xilinx Reasons for Selecting HPL: Higher performance than LP (at same power level) Higher performance vs HP at FPGA TDP (or lower power at same performance) Power Frequency Page 13
Engineering Insights Finding The Frequency Sweet Spot (within the HPL Process) 2.00 1.50 1.00 Normalized Power vs. Frequency 0.50 450 550 650 750 Worst Setup -40c Hold Timing Histograms Vt usage Typical High Vt Med Vt Low Vt Page 14 Delay Variation 2 1 0 Normalized Path Delay Max Nom Min
Engineering Insights Configuring Interconnect CPU: CPU 800 MHz FPGA 200 Mhz Switch OCM 400 MHz FPGA: OCM: 1 1 1 1 1 1 1 2 3 -stalled- 3 1 1 1 1 1 1 1 1 2 2 2 CPU: CPU 800 MHz FPGA 200 Mhz Switch Threshold OCM 400 MHz FPGA: OCM: 1 1 1 1 1 release 1 2 1 3 1 1 1 1 2 2 2 2 1 1 1 1 3 15 Copyright Copyright 2011 2011 Xilinx Xilinx
Agenda Xilinx Series 7 Highlights Zynq-7000 EPP Architecture & Silicon Zynq-7000 Software & Applications Summary Page 16
Zynq-7000 Use Cases #1 #2 #3 Embedded Control Fabric Datapath SW Acceleration ARM CPU ARM CPU ARM CPU Memory Peripheral Peripheral Datapath Accelerator Ex: Motor Control Ex: Static Video Stream Ex: Interactive Image Processing Use Case #1 Access peripheral configuration registers Use Case #2 Access datapath configuration registers Access datapath memory (coefficient tables) Use Case #3 Low latency/high bandwidth shared work spaces Move data between SW and HW domains Page 17
Application Programming Using Only C Application C/C++ Device Information SW-Centric Design Environment High-Level Synthesis via AutoESL Binary for CPU Bitstream for PL fabric CPU Memory Data Movement Interconnect FPGA Fabric Video Codec Encryption LTE Modem
AutoESL Generated Accelerators C-Based, High-Level Synthesis Tools at Xilinx Application Example: Back Projection Algorithm (recreate CT scan images from samples) gprof Locate SW hot spot function(s) on ARM AutoESL Synthesize hot spot function(s) to HW/PL 52 Floating Point Operators @ 200Mhz Fits in lowest cost Zynq 7010 device 3X Performance vs SW only Page 19
Engineering Insights Data Movement Gotchas Processor Host Program Engine Matrix Mult Memory 1 Memory 2 Memory 3 Using CPU Programmed IO IO dominates accelerator compute time Page 20
Engineering Insights Data Movement Gotchas Processor Engine Memory 1 Memory 2 Memory 3 Using DMA DMA setup time dominates (white space between green bars) Page 21
Xilinx Evolution Towards Multicore Legacy Logic BRAM DSP Glue MicroBlaze ub ub acc1 acc2 Soft Multi-Core + Accelerators ZYNQ Dual A9 acc1 acc2 GPP + Accelerators Future Zynq Next Gen GPP MC Array PE PE PE PE acc1 acc2 GPP + Multicore + Soft Processing Engine + Accelerators Multicore programming models being ported to Zynq Page 22
Agenda Xilinx Series 7 Highlights Zynq 7000 EPP Architecture & Silicon Zynq 7000 Software & Applications Summary Page 23
Summary Zynq SoC Device Family with Integrated Programmable Logic $15 Price Point* / 28nm Fab Process Microcontroller and Accelerator Use Models Industry Standard Tools (ARM Ecosystem, Android, ISE) Emerging Tools (AutoESL, Multicore) Emulation platforms in use for prototyping Available 1H 2012 Android on Zynq emulation board Source: iveia LLC Page 24 * High volume price for smallest device and package, slowest speed grade