Altera SDK for OpenCL v14.1. February 6, 2015

1 Altera SDK for OpenCL v14.1 February 6, 2015

2 Industry Challenges
- A variety of applications are becoming bottlenecked by scalable performance requirements
  - E.g. object detection and recognition, image tracking and processing, cryptography, cloud, search engines, deep packet inspection, etc.
- Overloading CPUs' capabilities
  - Frequencies are capped
  - Processors keep adding more cores
  - Need to coordinate all the cores and manage data
- Product life cycles are long
  - GPU lifespan is short
  - Require re-optimization and regression testing between generations
  - Support agreements for GPUs are costly
- Power dissipation of CPUs and GPUs limits system size
- Maintaining coherency throughout a scalable system

3 OpenCL and FPGAs Address These Challenges
- Power-efficient acceleration
  - Typically 1/5 the power of a GPU and orders of magnitude more performance per watt than a CPU
- FPGA lifecycle of over 15 years
  - GPU lifespan is short and requires re-optimization and testing between generations
  - FPGA OpenCL code is retargeted to future devices without modification
- Our OpenCL flow abstracts away the FPGA hardware flow
  - Puts the FPGA into software engineers' hands
- Our OpenCL SDK allows for streaming IO channels and kernel channels
  - Data movement without host involvement
  - Low-latency data transmission to the accelerator
- Shared virtual memory
  - IBM CAPI and Intel QPI

4 Efficiency via Specialization
[Figure comparing FPGAs, ASICs, and GPUs. Source: Bob Broderson, Berkeley Wireless group]

5 Application Development Paradigm
[Figure labels: ASIC/FPGA programmers, parallel programmers, standard CPU programmers]
OpenCL expands the number of application developers.

6 More SW Engineering Resources than HW?
- 1000:1 software engineers to FPGA designers
- Software engineers are not used to long compile times
OpenCL Solves This!
- Our OpenCL flow abstracts away the FPGA hardware flow, bringing the FPGA to low-level software programmers
  - Software developers write, optimize and debug in their familiar software environment
  - Quartus is run behind the scenes
- Emulator and profiler are software development tools
  - Pushes long compile times to the end
- OpenCL optimization doesn't require a board
  - Allows SW to drive board requirements (.xml file)

7 OpenCL on FPGAs Fits Into All Markets
Data processing algorithms, by market:
- Automotive/Industrial (pedestrian detection, motion estimation)
- Military/Government (crypto, image detection)
- Networking (DPI, SDN, NFV)
- Computer & Storage (HPC, financial, data compression)
- Broadcast, Consumer (video image processing)
- Medical (diagnostic image processing, bioinformatics)

8 OpenCL and FPGA Acceleration in the News
- IBM and Altera Collaborate on OpenCL: "IBM's collaboration with Altera on OpenCL and support of the IBM Power architecture with the Altera SDK for OpenCL can bring more innovation to address Big Data and cloud computing challenges," said Tom Rosamilia, senior vice president, IBM Systems
- Intel Reveals FPGA and Xeon in One Socket: "That allows end users that have applications that can benefit from acceleration to load their IP and accelerate that algorithm on that FPGA as an offload," explained the vice president of Intel's data center group, Diane Bryant
- Search Engine Gets Help From FPGA: "Altera was really interesting in helping with the development the resources they were willing to throw our way were more significant than those from Xilinx" (Microsoft Engr Manager)
- Baidu and Altera Demonstrate Faster Image Classification: Altera Corp. and Baidu, China's largest online search engine, are collaborating on using FPGAs and convolutional neural network (CNN) algorithms for deep learning applications.
- Xilinx Announces SDAccel Development Environment for OpenCL Delivering Up to 25X Better Performance/Watt to the Data Center

9 What is OpenCL?
- A software programming model for software engineers and a software methodology for system architects
  - First industry standard for heterogeneous computing
- Provides increased performance with hardware acceleration
  - Low-level programming language
  - Based on ANSI C99
- Open, royalty-free standard
  - Managed by Khronos Group
  - Altera is an active member
  - Conformance requirements: V1.0 is the current reference, V2.0 is the current release
[Figure labels: Host (C/C++ API), Accelerator (OpenCL C)]

10 Why Does OpenCL Exist?
[Figure labels: Programmability, Performance; CPU (Single-Core, Multi-Core; C/C++, AVX/OpenMP); GPGPU (Driver, Stream API, CUDA); PCIe Accelerator; Heterogeneous Programming Architecture; Programming Language; OpenCL]

11 Programming Language Offerings
Target: GPU, Multi-Core CPU | DSP/Embedded | FPGA
Scope: System (Heterogeneous Platform) | Device | IP Block
Designer: Programmer | Embedded Programmer | Hardware Designer
Design Flow: CUDA/OpenCL | Code Composer Studio (TI C) | Quartus II (Verilog/VHDL)
Design Activity: Task Parallelism, Data Parallelism | Real Time Function Acceleration | HLS, IP Design and Integration
Design Constraints: Throughput/Latency, Power Efficiency | Real Time Execution, Cost | Clock Frequency, Resource Utilization, Interface Requirements, Power
Hardware Knowledge: None (Coding Style Guidelines) | Limited (macro architecture bandwidth level) | Yes (protocol-level, timing closure, micro architecture)
(row label not preserved): Today | Today | PoC

12 HLS vs OpenCL Positioning
OpenCL:
- Targets CPUs, GPUs and FPGAs
- Target user is a software developer
- Implements the FPGA in a software development flow
- Performance is determined by the resources allocated
- Host required
HLS:
- Targets FPGAs
- Target user is an FPGA designer
- Implements the FPGA in the traditional FPGA development flow
- Performance is defined, and the amount of resource needed to achieve it is reported
- Host not required

13 Altera SDK for OpenCL Competitive Differentiator
- Altera's SDK for OpenCL has proven to be a powerful solution for many vendors
  - Won design tool and development software Elektra award in Europe
  - Won Ultimate Product of the Year for
- Actively being used today: "I was extremely happy to get a great performance with such low effort. I was so impressed with how powerful the Altera tool was!" (Senior Engineer, Altera OpenCL Customer)

14 First Conformant OpenCL Solution for FPGAs!!! OpenCL v1.0 specification >8500 Programs tested Supports Arm Host CV and AV SoC

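15 Heterogeneous Platform Model
[Figure: OpenCL platform model. A host with host memory connects to a (compute) device; the device contains global memory and compute units, each made of processing elements. Example platform: x86 host connected over PCIe.]

16 Heterogeneous Platform Model
[Figure: OpenCL platform model with one host and host memory connected to multiple devices, each with its own global memory. Example platform: x86 host connected over PCIe.]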

17 OpenCL Use Model: Abstracting the FPGA Away
Host code (built with a standard gcc compiler into an EXE that runs on the host):
    main() {
        read_data( );
        manipulate( );
        clEnqueueWriteBuffer( );
        clEnqueueNDRangeKernel(,sum, );
        clEnqueueReadBuffer( );
        display_result( );
    }
OpenCL accelerator code (built with the Altera Offline Compiler, which emits Verilog and runs Quartus II to produce an AOCX for the accelerator):
    kernel void sum ( global float *a, global float *b, global float *y) {
        int gid = get_global_id(0);
        y[gid] = a[gid] + b[gid];
    }
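To make the host/accelerator split on this slide concrete, here is a minimal plain-C sketch of what the `sum` kernel computes. It is not OpenCL host code and does not use the Altera runtime: the loop over `gid` merely stands in for the NDRange launch, and the function names `sum_work_item` and `run_ndrange` are hypothetical.

```c
#include <stddef.h>

/* One work-item of the slide's `sum` kernel: y[gid] = a[gid] + b[gid]. */
void sum_work_item(size_t gid, const float *a, const float *b, float *y)
{
    y[gid] = a[gid] + b[gid];
}

/* Hypothetical stand-in for a 1-D NDRange launch: every global ID in
   [0, global_size) runs the same work-item function. On an accelerator
   these work-items are pipelined or run in parallel; here they simply
   run one after another. */
void run_ndrange(size_t global_size, const float *a, const float *b, float *y)
{
    for (size_t gid = 0; gid < global_size; ++gid) {
        sum_work_item(gid, a, b, y);
    }
}
```

Calling `run_ndrange(n, a, b, y)` on three arrays of length `n` fills `y` with the element-wise sum, which is the result the host would read back with `clEnqueueReadBuffer`.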

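18 OpenCL Programming Model
[Figure: host.c includes opencl.h and is compiled with gcc; device.cl is compiled with aoc. Host-side objects shown: driver, platform, context, device, queue, program, kernel, buffer, launch. Stage labels: Acquire, Compute, Visualize.]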

19 The Only Custom Accelerator Solution: Platforms
[Figure: board block diagram. The OpenCL domain (kernel IP and interconnect) is built with the Altera OpenCL Compiler. The IO infrastructure is a prebuilt BSP created with standard HDL tools by an FPGA developer: two DDR3 memory interfaces, four QDRII memory interfaces, two 10Gb MAC/UOE data interfaces to the 10G network, and a PCIe gen2x8 host interface.]

20 Altera Reference Platforms
Platform: Network Enabled | High Performance Computing (HPC)
Requirement: Low Latency | Compute Power/Memory Bandwidth
Architecture: [Figure: OpenCL API, HAL, UMD, KMD on the host; Stratix V FPGA with PCIe, DMA, OpenCL kernels, DDR3, CPLD bridge, CPLD, FLASH, and two 10G UDP ports] | [Figure: OpenCL API, HAL, UMD, KMD on the host; Stratix V FPGA with PCIe, DMA, OpenCL kernels, DDR3]
Global Memory: DDR and QDRII+ | Large amount of DDR
IO Channels: 2x10GbE (MAC/UOE) | None (minimize IP overhead)
Reference Design: OPRA (streaming), Trading (with global memory access) | Option Pricing

21 SoC Reference Platforms
- HPS block removes the complexities of BSP creation
- Coherency between host and accelerator
- OpenCL Platforms page contains the CV SoC devkit platform user's guide
[Figure labels: HPS, HPS DDR3, Stratix V FPGA, H2F/F2H, LWH2F, F2S, CSR (32-bit, 50 MHz), FPGA memory, OpenCL kernels, DVI, DVO, scratch DDR3, camera, monitor]

22 Altera Network Enabled Reference Platform for OpenCL
[Figure: software layer (host.c using the C/C++ API, device.cl in OpenCL C, compiler, reference design) above hardware layer (reference platform with host and device, reference board). Labels: bit RHEL 6.4, Windows 7, s5_hft (S5PH-Q)]

23 Guaranteed Timing Flow
- Inputs to AOC: kernel.cl, Boardspec.xml, and a post-fit QXP partition (PCIe, UniPHY, DMA, ...)
- Synthesis / P&R / STA run on the OpenCL kernels ONLY
- Meet timing?
  - Yes: DONE!
  - No: reconfigure the kernel PLL, re-run STA with the new PLL value, and check timing again

24 Heterogeneous Memory Support
[Figure: host with host memory and host interface; device with compute units (CU), Global Memory1, Global Memory2, and an IO interface]
Memories with different characteristics:
- DDR: sequential access
- QDR: random access
- On-chip: low latency
- MoSys: efficient
- HMC: high capacity
Combine different memories:
- Attribute-based
- Automatic
    kernel void foo( global uint *data __attribute__((buffer_location("qdr"))) ) {
        foo(data[i]);
    }

25 Channels Advantage
[Figure: two board block diagrams side by side, Standard OpenCL versus Altera Vendor Extension (IO and Kernel Channels). Each shows a host and host interface, two DDR3 interfaces, four QDRII interfaces, the interconnect, OpenCL kernels, CvP update, and two 10Gb interfaces to the 10G network.]

26 Kernel Development Flow
- Modify kernel.cl
- Tools, with typical turnaround: x86 emulator (sec), optimization report (min), prototype (min), profiler (hours)
- Checks along the way: Functional bugs? Stall-free pipeline? Memory coalesced? Hardware performance met?
- DONE!

27 x86 Emulator (Beta in v14.1)
- Enables functional debug of kernel code on an x86 system
- Prototype support that lets users run kernels on an x86 platform
- Debug support for Altera vendor-specific features such as channels
- Supports OpenCL syntax, channels, and printf
    kernel void accel( ) {
        gid = get_global_id(0);
        out[gid] = proc(data[gid]);
    }
[Figure: the kernel passes through the x86 kernel compiler and runs as ./kernel_tb]

28 Example: Load to Store Dependency
    kernel void prefixsum( global int* restrict A, unsigned N ) {
        for ( unsigned i = 1 ; i < N ; i++ ) {
            int a = A[i-1];
            A[i] += a;
        }
    }

    ==============================================================================
    *** Optimization Report ***
    ==============================================================================
    Kernel: prefixsum                                                      Ln.Col
    ==============================================================================
    Loop for.body                                                            2.25
      Pipelined execution inferred.
      Successive iterations launched every 321 cycles due to:
        Memory dependency on Load Operation from:                           3.21
          Store Operation                                                    4.7
      Largest Critical Path Contributors:
        49%: Load Operation
          %: Store Operation                                                 4.7
    ==============================================================================
Slide annotations: "Relative cost of global memory to local computation"; "True fix requires restructuring the code".
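The slide notes that the true fix requires restructuring the code but does not show the restructured kernel. One possible restructuring, sketched here in plain C rather than OpenCL C, carries the running sum in a local variable so that no iteration has to load a value the previous iteration just stored. The function names are hypothetical, and the initiation interval the offline compiler would report for such a kernel is not given in the source.

```c
#include <stddef.h>

/* Form from the slide: each iteration loads A[i-1], which the previous
   iteration has just stored, so the load must wait for that store. */
void prefixsum_reload(int *A, size_t N)
{
    for (size_t i = 1; i < N; ++i) {
        int a = A[i - 1];
        A[i] += a;
    }
}

/* Restructured sketch: the running sum lives in a local variable, so the
   loop-carried dependency is a single integer add instead of a round
   trip through global memory. */
void prefixsum_carried(int *A, size_t N)
{
    if (N == 0) {
        return;
    }
    int running = A[0];
    for (size_t i = 1; i < N; ++i) {
        running += A[i];   /* depends only on the previous running sum */
        A[i] = running;    /* store is never read back inside the loop */
    }
}
```

Both functions produce the same prefix sums; only the second avoids reading back its own stores.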

29 Example: Accumulating a Value
    kernel void test( global float* restrict input,
                      global float* restrict output,
                      unsigned N ) {
        float mul = 1.0f;
        for ( unsigned i = 0; i < N; i++ ) {
            mul *= input[ i ];
        }
        *output = mul;
    }

    ==================================================================================
    *** Optimization Report ***
    ==================================================================================
    Kernel: test                                                               Ln.Col
    ==================================================================================
    Loop for.body                                                                5.24
      Pipelined execution inferred.
      Successive iterations launched every 3 cycles due to:
        Data dependency on variable mul                                          4.10
      Largest Critical Path Contributor:
        100%: Fmul Operation                                                      6.7
    ==================================================================================
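The report above attributes the 3-cycle launch interval to the dependency on `mul`. A common way to relax such a dependency, sketched here in plain C and not taken from the slide, is to spread the accumulation over a small shift register of partial products so that each multiply depends on a value computed several iterations earlier. The names and the `DEPTH` value are assumptions; `DEPTH` only needs to be at least the reported interval. Reordering floating-point multiplies can change rounding in general.

```c
#include <stddef.h>

#define DEPTH 4  /* assumed: at least the 3-cycle interval in the report */

/* Direct form from the slide: every multiply waits for the previous one. */
float product_direct(const float *input, size_t N)
{
    float mul = 1.0f;
    for (size_t i = 0; i < N; ++i) {
        mul *= input[i];
    }
    return mul;
}

/* Shift-register sketch: iteration i multiplies into the partial product
   written DEPTH iterations earlier, then the register shifts by one. */
float product_shift_register(const float *input, size_t N)
{
    float partial[DEPTH + 1];
    for (int k = 0; k <= DEPTH; ++k) {
        partial[k] = 1.0f;
    }

    for (size_t i = 0; i < N; ++i) {
        partial[DEPTH] = partial[0] * input[i];
        for (int k = 0; k < DEPTH; ++k) {
            partial[k] = partial[k + 1];
        }
    }

    /* Combine the DEPTH independent partial products at the end. */
    float mul = 1.0f;
    for (int k = 0; k < DEPTH; ++k) {
        mul *= partial[k];
    }
    return mul;
}
```

For inputs whose products are exactly representable, both functions return the same value; the second trades a few extra registers and a short final reduction for a shorter loop-carried dependency.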

30 Rapid Prototyping (Beta in v14.1)
- Increases productivity during application development
  - Uses a library of pre-compiled templates to skip Quartus II compilation
  - Can test small versions of the final design on hardware very quickly
- Ability to generate custom templates based on user kernels
  - Tailors the Rapid Prototyping Template Library to the user
[Figure: standard flow: user program, OpenCL compiler (aoc), Quartus II, HW implementation (~hours). Prototype flow: user program, OpenCL compiler (aoc -march=prototype), configuration template library (~minutes).]

31 Profiler (BETA in v14.1)
- Instruments the pipeline with performance counters and profiling logic
- Transfers the profiling information to the host via the PCIe link
    kernel void accel( ) {
        gid = get_global_id(0);
        out[gid] = a[gid]+b[gid];
    }
[Figure: kernel pipeline of Load, Load, +, Store, with memory mapped registers attached]

32 Profiler (BETA in v14.1)
[Screenshot: profiler views showing bottlenecks, bandwidth, saturation, and pipeline occupancy]

33 OpenCL Host Library & Run Time Environment (RTE)
Host library improvements:
- Lower CPU usage
- Improved scalability
- Lower memory footprint
- Faster run time
SDK & Run Time Environment:
    OS                       SDK (needs ACDS)    RTE
    Windows x86-64           Installer           Installer
    Linux (RHEL) x86-64      Installer, RPM      Installer, RPM
    Linux (RHEL) Power       -                   RPM
    Linux (custom) CV SoC    -                   Tarball

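34 Installable Client Driver (BETA in v14.1)
[Figure: host.c includes opencl.h and calls clGetPlatformIDs; the ICD routes the call to a vendor library such as nvidiaopencl or AlteraOpenCL. Stage labels: Acquire, Compute, Visualize; device.cl]
- Windows: registry key HKEY_LOCAL_MACHINE\SOFTWARE\Khronos\OpenCL\Vendors, with a DWORD value named <library>.dll
- Linux: /etc/OpenCL/vendors/<vendor>.icd, containing <library>.so

35 Altera Client Driver (BETA in v14.1)
[Figure: host.c includes opencl.h and calls clGetPlatformIDs; the ICD routes to nvidiaopencl or AlteraOpenCL. Below AlteraOpenCL, clGetDeviceIDs goes through the ACD. Stage labels: Acquire, Compute, Visualize; device.cl]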

36 OpenCL + FPGA Key Benefits
- Faster development vs. traditional FPGA design flow
  - Puts the FPGA in the software developer's hands
  - Familiar C-based development flow
- Higher performance/watt vs. CPU/GPGPU
  - Implement exactly what you need
  - Pipeline parallel structures
  - Custom interconnect converging with data processing cores
- Lower power vs. CPU/GPGPU
  - Core frequency lower: MHz vs 1GHz
  - Turn off unused logic
  - Up to 1/5 the power
- Portability & obsolescence free
  - Code can transfer between different HW accelerators (CPU, GPGPU, FPGA, etc.)
  - Code ports seamlessly to new generations of the FPGA
  - FPGA life cycle considerably longer than CPUs or GPGPUs

37 Additional Resources

38 Altera SDK for OpenCL Design Flow
Set Up: Getting Started Guide (document)
- Install Quartus II v13.1 with Altera SDK for OpenCL
- Install C compiler or development environment
- Obtain and set up license from the Self Service Licensing Center
- Install the FPGA (OpenCL) board: aocl install
Design: Programming Guide (document)
- Develop kernel code and compile on CPU/GPU for functional correctness
- Build, compile & link the host application (Visual Studio/GCC)
- Compile the OpenCL kernel with the Altera Offline Compiler (aoc)
- Run the application
Optimize: Best Practices (document)
- Optimize kernel for FPGA hardware

39 Additional Altera OpenCL Collateral
- White papers on OpenCL
- OpenCL online demos
- OpenCL design examples
- Instructor-led training
  - Parallel Computing with OpenCL Workshop by Altera (1 day)
  - Optimization of OpenCL for Altera FPGAs Training by Altera (1 day)
- Online training
  - Introduction to Parallel Computing with OpenCL
  - Writing OpenCL Programs for Altera FPGAs
  - Running OpenCL on Altera FPGAs
  - Single-Threaded vs. Multi-Threaded Kernels
  - Building Custom Platforms for Altera SDK for OpenCL
- OpenCL board partners page

40 Application Benchmarking

41 Case Study: GZIP Compression
- OpenCL result compared with hand-coded Verilog: 10% slower, 12% more resources, 3x faster development time
- An Altera summer intern ported and optimized the GZIP algorithm in a little more than a month
- An industry-leading company's FPGA engineer coded it in Verilog in 3 months
- Much lower design effort and design time

42 Results and Conclusions: CHREC/Univ of Florida
Applications: Sobel, Canny, & SURF
OpenCL vs. VHDL productivity table:
- VHDL development time: 6 months
- OpenCL development time: 1 month
OpenCL vs. VHDL performance table:
- Columns: VHDL performance (Stratix 4; predicted Stratix 5) and OpenCL performance (Stratix 5), each as frames/sec and max freq.
- Rows: Sobel, Canny, SURF (numeric values not preserved in the transcription)
Conclusions:
- Productivity: avoid productivity challenges of HDL
  - 6x increase in productivity
  - OpenCL offers a familiar C environment
- Performance: develop fully pipelined kernels
  - Minimum performance cost, < 10% overhead

43 Case Study: Image Classification
- Deep learning algorithm
  - Convolutional neural networking
  - Based on Hinton's CNN
- Early results on Stratix V
  - 2X perf./power vs. GPGPU despite soft floating point
  - 8+ simultaneous kernels vs. 2 on GPGPU
  - Exploiting OpenCL channels between kernels
- A10 expectations
  - Hard floating point
  - Better density and frequency
  - ~4X performance/watt vs. SV
The CIFAR-10 dataset consists of 60000 32x32 colour images in 10 classes, with 6000 images per class. The dataset is split into training images and test images. The classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, truck.
[Figure: 10 random images from each class; Hinton's CNN algorithm]

44 AES Encryption
- Encryption/decryption
- 256-bit key
- Counter (CTR) method
Advantage FPGA
- Integer arithmetic
- Coarse grain bit operations
- Complex decision making
Results
- Columns: Platform, Power (W), Performance (GB/s), Efficiency (MB/s/W)
- Platforms: E5503 Xeon Processor (single core), AMD Radeon HD 7970, PCIe385 A7 Accelerator
- Some entries are marked "est"; numeric values not preserved in the transcription

45 Multi-Asset Barrier Option Pricing
- Monte-Carlo simulation
  - No closed form solution possible
  - High quality random number generator required
  - Billions of simulations required
- Used GPU vendor's example code
Advantage FPGA
- Complex control flow
- Optimizations: channels, loop pipelining
Results
- Columns: Platform, Power (W), Performance (Bsims/s), Efficiency (Msims/s/W)
- Platforms: W3690 Xeon Processor, nvidia Kepler, Bittware S5-PCIe-HQ
- Numeric values not preserved in the transcription

46 Document Filtering
- Unstructured data analytics
- Bloom filter
Advantage FPGA
- Integer arithmetic
- Flexible memory configuration
Results
- Columns: Platform, Power (W), Performance (MTs), Efficiency (MTs/W)
- Platforms: W3690 Xeon Processor, nvidia Tesla C, PCIe385 A7 Accelerator
- Numeric values not preserved in the transcription
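The slide names the Bloom filter without showing one. As a rough illustration of why it maps onto integer arithmetic and simple memory accesses, here is a minimal plain-C sketch. The filter size, the number of hashes, and the seeded FNV-1a-style hash are all assumptions chosen for the example and are not details of the benchmarked design.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOOM_BITS   1024u  /* assumed filter size in bits */
#define BLOOM_HASHES 3u     /* assumed number of hash functions */

typedef struct {
    uint8_t bits[BLOOM_BITS / 8];
} bloom_t;

/* FNV-1a style hash, varied per hash index k by perturbing the basis. */
static uint32_t hash_term(const char *term, uint32_t k)
{
    uint32_t h = 2166136261u ^ (k * 0x9e3779b9u);
    for (const char *p = term; *p != '\0'; ++p) {
        h ^= (uint8_t)*p;
        h *= 16777619u;
    }
    return h;
}

void bloom_init(bloom_t *f)
{
    memset(f->bits, 0, sizeof f->bits);
}

/* Set one bit per hash function. */
void bloom_add(bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = hash_term(term, k) % BLOOM_BITS;
        f->bits[bit / 8] |= (uint8_t)(1u << (bit % 8));
    }
}

/* Returns 0 if the term was definitely never added, 1 if it may have been. */
int bloom_maybe_contains(const bloom_t *f, const char *term)
{
    for (uint32_t k = 0; k < BLOOM_HASHES; ++k) {
        uint32_t bit = hash_term(term, k) % BLOOM_BITS;
        if ((f->bits[bit / 8] & (1u << (bit % 8))) == 0) {
            return 0;
        }
    }
    return 1;
}
```

A filter built from a profile's terms can then screen each incoming document term with a few hashes and bit tests; false positives are possible, false negatives are not.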

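47 Consumer (Japan) Image Processing
- Adaptive weighted images
[Equation for the weighted pixel p_xy in terms of coefficients c_1, c_2, distances d_1, d_2, and weight W; garbled in the transcription]
Advantage FPGA
- Integer arithmetic
Results
- Columns: Platform, Power (W), Performance (FPS), Efficiency (FPS/W)
- Platforms: W3565 Xeon Processor, nvidia Quadro 4000, PCIe385 A7 Accelerator
- Some entries are marked "est"; numeric values not preserved in the transcription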

48 Smith-Waterman
- Sequence alignment
- Scoring matrix
Advantage FPGA
- Integer arithmetic
- SMT streaming
Results
- Columns: Platform, Power (W), Performance (MCUPS), Efficiency (MCUPS/W)
- Platforms: W3565 Xeon Processor, nvidia K, PCIe385 A7 Accelerator
- Numeric values not preserved in the transcription
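For readers unfamiliar with the scoring matrix mentioned on this slide, the following plain-C sketch fills it row by row with a linear gap penalty and returns the best local alignment score. The function name, the length limit, and the scoring parameters are assumptions for illustration; the benchmarked kernel is not shown in the source. Each pass through the inner loop body is one cell update, the unit behind the MCUPS (million cell updates per second) figure.

```c
#include <stddef.h>
#include <string.h>

#define SW_MAX_LEN 64  /* assumed upper bound on sequence length */

static int max2(int x, int y)
{
    return x > y ? x : y;
}

/* Best local alignment score of s against t.
   match > 0; mismatch and gap are negative penalties.
   Returns -1 if either sequence exceeds SW_MAX_LEN. */
int smith_waterman_score(const char *s, const char *t,
                         int match, int mismatch, int gap)
{
    size_t n = strlen(s);
    size_t m = strlen(t);
    if (n > SW_MAX_LEN || m > SW_MAX_LEN) {
        return -1;
    }

    int prev[SW_MAX_LEN + 1] = {0};  /* previous row of the matrix */
    int curr[SW_MAX_LEN + 1] = {0};  /* row being filled */
    int best = 0;

    for (size_t i = 1; i <= n; ++i) {
        curr[0] = 0;
        for (size_t j = 1; j <= m; ++j) {
            int diag = prev[j - 1] + (s[i - 1] == t[j - 1] ? match : mismatch);
            int up   = prev[j] + gap;
            int left = curr[j - 1] + gap;
            int h = max2(0, max2(diag, max2(up, left)));  /* one cell update */
            curr[j] = h;
            best = max2(best, h);
        }
        memcpy(prev, curr, sizeof prev);
    }
    return best;
}
```

Only integer adds, compares, and maxima appear in the cell update, which is consistent with the slide listing integer arithmetic as the FPGA advantage.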

49 Multi Function Printer Image Processing
- RGB output of raster scanner converted to CMYK colorants for printing
Advantage FPGA
- SoC solution
- IO and kernel channels
- Heterogeneous memory accesses
Goal
- 50PPM at A4/letter size
Results
- >40X improvement over C based algorithm on ARM only
  - No NEON coprocessor used
- C6 speed grade part improved 20% to 128PPM

50 Suricata: IDS/IPS Implementation (Cybersecurity)
[Figure: ingress network path from 2x 10 Gbps ETH IO through STD IDS packet processing and analysis, DPI packet processing and analysis, and traffic control (IPS packet manipulation) to 2x 10 Gbps ETH IO; STD rules memory and DPI rules memory (QDR or DDR); IDS/IPS management; mirror for egress network path]
Decoder Kernel (autorun)
- Stream in encoded packets (aoclreadchannel)
- Unpack single streams
- Stream out decoded packets (aoclwritechannel)
Packet Analysis Kernel - IDS (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with STD rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)
Packet Analysis Kernel - DPI (task)
- Stream in decoded packets and store in local memory (aoclreadchannel)
- Parallel regex with DPI rules in global memory (heterogeneous memory support)
- Write results to global memory
- Stream out decoded packets (aoclwritechannel)
Host IDS/IPS Management
- Read results from global memory and log
- Decide to modify or delete packets
Packet Manipulation Kernel - IPS (task)
- Stream in decoded packets (aoclreadchannel)
- Read and process decision from the host
- Stream out decoded packets (aoclwritechannel)
Encoder Kernel (autorun)
- Stream in decoded packets (aoclreadchannel)
- Repack multiple streams
- Stream out encoded packets (aoclwritechannel)

51 Haplotype Caller (Pair-HMM)
- Smith-Waterman-like algorithm
  - Uses hidden Markov models to compare gene sequences
  - 3 stages: Assembler, Pair-HMM (70%), Traversal + Genotyping
  - Floating point (SP + DP)
- C++ code starting point (from Java)
- Whole genome takes 7.6 days!
Results (Platform, Runtime in ms; entries as preserved in the transcription): Java (gatk 2.8) 10,800; Intel Xeon E; nvidia Tesla K40 70; Nallatech SV-D

52 Sobel Filter
- Fundamental image filter algorithm
  - Used commonly in industrial and automotive applications
- Sliding window based design pattern
  - Same shift register structure, except in two dimensions
[Figure: line buffer laid out as rows of WIDTH pixels, with the 3x3 window taps (A B C / E F G / ...) at the start of each row and the label "Pixels enter here"]

53 Task Implementation and Results
Altera OpenCL:
    kernel void sobel(int iters) {
        // Coefficients
        int Gx[3][3] = {{-1,-2,-1},{0,0,0},{1,2,1}};
        int Gy[3][3] = {{-1,0,1},{-2,0,2},{-1,0,1}};
        int rows[2 * COLS + 3]; // line buffer
        int count = 0;
        while (count != iters) {
            // Shift the line buffer
            #pragma unroll
            for (int i = COLS * 2 + 2; i > 0; --i) {
                rows[i] = rows[i - 1];
            }
            rows[0] = read_channel_altera(in_channel);
            int x_dir = 0;
            int y_dir = 0;
            #pragma unroll
            for (int i = 0; i < 3; ++i) {
                #pragma unroll
                for (int j = 0; j < 3; ++j) {
                    x_dir += rows[i * COLS + j] * Gx[i][j];
                    y_dir += rows[i * COLS + j] * Gy[i][j];
                }
            }
            int edge_weight = abs(x_dir) + abs(y_dir);
            write_channel_altera(out_channel, edge_weight);
            ++count;
        }
    }
On our design example website: pport/examples/opencl/sobel-filter.html
Results:
    Device       Resolution    FPS
    Cyclone V    1080p         60
    Stratix V    1080p
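The kernel above reads pixels from a channel, so it cannot be run without the Altera tools. As a rough host-side model of the same line-buffer technique, the plain-C sketch below replaces the channel reads and writes with array accesses. The function name `sobel_stream`, the small `COLS` value, and the zero-initialised line buffer are assumptions made for the example; the slide's kernel leaves the buffer uninitialised.

```c
#include <stdlib.h>

#define COLS 8  /* assumed image width for the example */

/* Streams `iters` pixels from `in`, writing one edge weight per pixel to
   `out`. Mirrors the slide's kernel, with arrays in place of channels. */
void sobel_stream(const int *in, int *out, int iters)
{
    const int Gx[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    const int Gy[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    int rows[2 * COLS + 3] = {0};  /* line buffer: two full rows plus 3 taps */

    for (int count = 0; count < iters; ++count) {
        /* Shift the line buffer by one pixel and insert the new pixel. */
        for (int i = COLS * 2 + 2; i > 0; --i) {
            rows[i] = rows[i - 1];
        }
        rows[0] = in[count];

        /* The 3x3 window is the first three taps of each buffered row. */
        int x_dir = 0;
        int y_dir = 0;
        for (int i = 0; i < 3; ++i) {
            for (int j = 0; j < 3; ++j) {
                x_dir += rows[i * COLS + j] * Gx[i][j];
                y_dir += rows[i * COLS + j] * Gy[i][j];
            }
        }
        out[count] = abs(x_dir) + abs(y_dir);
    }
}
```

Because both coefficient matrices sum to zero, a constant image yields an edge weight of zero once the line buffer has filled; the first outputs reflect the zero-filled buffer rather than real image content.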

How OpenCL enables easy access to FPGA performance?

How OpenCL enables easy access to FPGA performance? How OpenCL enables easy access to FPGA performance? Suleyman Demirsoy Agenda Introduction OpenCL Overview S/W Flow H/W Architecture Product Information & design flow Applications Additional Collateral

More information

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25

FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25 FPGA Acceleration using OpenCL & PCIe Accelerators MEW 25 December 2014 FPGAs in the news» Catapult» Accelerate BING» 2x search acceleration:» ½ the number of servers»

More information

Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms. Doris Chen, Deshanand Singh Jan 24 th, 2013

Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms. Doris Chen, Deshanand Singh Jan 24 th, 2013 Fractal Video Compression in OpenCL: An Evaluation of CPUs, GPUs, and FPGAs as Acceleration Platforms Doris Chen, Deshanand Singh Jan 24 th, 2013 Platform Evaluation Challenges Conducting a fair platform

More information

Embedded Systems: map to FPGA, GPU, CPU?

Embedded Systems: map to FPGA, GPU, CPU? Embedded Systems: map to FPGA, GPU, CPU? Jos van Eijndhoven jos@vectorfabrics.com Bits&Chips Embedded systems Nov 7, 2013 # of transistors Moore s law versus Amdahl s law Computational Capacity Hardware

More information

Data Center and Cloud Computing Market Landscape and Challenges

Data Center and Cloud Computing Market Landscape and Challenges Data Center and Cloud Computing Market Landscape and Challenges Manoj Roge, Director Wired & Data Center Solutions Xilinx Inc. #OpenPOWERSummit 1 Outline Data Center Trends Technology Challenges Solution

More information

Xeon+FPGA Platform for the Data Center

Xeon+FPGA Platform for the Data Center Xeon+FPGA Platform for the Data Center ISCA/CARL 2015 PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA Accelerator Platform Applications and Eco-system

More information

Intel Xeon +FPGA Platform for the Data Center

Intel Xeon +FPGA Platform for the Data Center Intel Xeon +FPGA Platform for the Data Center FPL 15 Workshop on Reconfigurable Computing for the Masses PK Gupta, Director of Cloud Platform Technology, DCG/CPG Overview Data Center and Workloads Xeon+FPGA

More information

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Seeking Opportunities for Hardware Acceleration in Big Data Analytics Seeking Opportunities for Hardware Acceleration in Big Data Analytics Paul Chow High-Performance Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Toronto Who

More information

CFD Implementation with In-Socket FPGA Accelerators

CFD Implementation with In-Socket FPGA Accelerators CFD Implementation with In-Socket FPGA Accelerators Ivan Gonzalez UAM Team at DOVRES FuSim-E Programme Symposium: CFD on Future Architectures C 2 A 2 S 2 E DLR Braunschweig 14 th -15 th October 2009 Outline

More information

GPU System Architecture. Alan Gray EPCC The University of Edinburgh

GPU System Architecture. Alan Gray EPCC The University of Edinburgh GPU System Architecture EPCC The University of Edinburgh Outline Why do we want/need accelerators such as GPUs? GPU-CPU comparison Architectural reasons for GPU performance advantages GPU accelerated systems

More information

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child

Introducing PgOpenCL A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Introducing A New PostgreSQL Procedural Language Unlocking the Power of the GPU! By Tim Child Bio Tim Child 35 years experience of software development Formerly VP Oracle Corporation VP BEA Systems Inc.

More information

Cloud Data Center Acceleration 2015

Cloud Data Center Acceleration 2015 Cloud Data Center Acceleration 2015 Agenda! Computer & Storage Trends! Server and Storage System - Memory and Homogenous Architecture - Direct Attachment! Memory Trends! Acceleration Introduction! FPGA

More information

Extending the Power of FPGAs. Salil Raje, Xilinx

Extending the Power of FPGAs. Salil Raje, Xilinx Extending the Power of FPGAs Salil Raje, Xilinx Extending the Power of FPGAs The Journey has Begun Salil Raje Xilinx Corporate Vice President Software and IP Products Development Agenda The Evolution of

More information

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab

FPGA Accelerator Virtualization in an OpenPOWER cloud. Fei Chen, Yonghua Lin IBM China Research Lab FPGA Accelerator Virtualization in an OpenPOWER cloud Fei Chen, Yonghua Lin IBM China Research Lab Trend of Acceleration Technology Acceleration in Cloud is Taking Off Used FPGA to accelerate Bing search

More information

Networking Virtualization Using FPGAs

Networking Virtualization Using FPGAs Networking Virtualization Using FPGAs Russell Tessier, Deepak Unnikrishnan, Dong Yin, and Lixin Gao Reconfigurable Computing Group Department of Electrical and Computer Engineering University of Massachusetts,

More information

Spectra-Q Engine BACKGROUNDER

Spectra-Q Engine BACKGROUNDER BACKGROUNDER Spectra-Q Engine 2010 s 2000 s 1990 s >50K >500K >5M FPGAs and SoCs have taken huge leaps with next-generation capabilities. These include multi-million logic elements, complex interface protocols,

More information

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011

Graphics Cards and Graphics Processing Units. Ben Johnstone Russ Martin November 15, 2011 Graphics Cards and Graphics Processing Units Ben Johnstone Russ Martin November 15, 2011 Contents Graphics Processing Units (GPUs) Graphics Pipeline Architectures 8800-GTX200 Fermi Cayman Performance Analysis

More information

Parallel Programming Survey

Parallel Programming Survey Christian Terboven 02.09.2014 / Aachen, Germany Stand: 26.08.2014 Version 2.3 IT Center der RWTH Aachen University Agenda Overview: Processor Microarchitecture Shared-Memory

More information

Architectures and Platforms

Architectures and Platforms Hardware/Software Codesign Arch&Platf. - 1 Architectures and Platforms 1. Architecture Selection: The Basic Trade-Offs 2. General Purpose vs. Application-Specific Processors 3. Processor Specialisation

More information

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC

OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC OpenPOWER Outlook AXEL KOEHLER SR. SOLUTION ARCHITECT HPC Driving industry innovation The goal of the OpenPOWER Foundation is to create an open ecosystem, using the POWER Architecture to share expertise,

More information

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads

A Scalable VISC Processor Platform for Modern Client and Cloud Workloads A Scalable VISC Processor Platform for Modern Client and Cloud Workloads Mohammad Abdallah Founder, President and CTO Soft Machines Linley Processor Conference October 7, 2015 Agenda Soft Machines Background

More information

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary

Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary OpenCL Optimization Outline Overview The CUDA architecture Memory optimization Execution configuration optimization Instruction optimization Summary 2 Overall Optimization Strategies Maximize parallel

More information

Direct GPU/FPGA Communication Via PCI Express

Direct GPU/FPGA Communication Via PCI Express Direct GPU/FPGA Communication Via PCI Express Ray Bittner, Erik Ruf Microsoft Research Redmond, USA {raybit,erikruf}@microsoft.com Abstract Parallel processing has hit mainstream computing in the form

More information

Xilinx SDAccel. A Unified Development Environment for Tomorrow s Data Center. By Loring Wirbel Senior Analyst. November 2014. www.linleygroup.
