GPU-based Decompression for Medical Imaging Applications
Slide 1: GPU-based Decompression for Medical Imaging Applications
Al Wegener, CTO, Samplify Systems
160 Saratoga Ave., Suite 150, Santa Clara, CA
(888) LESS-BITS, +1 (408)
Slide 2: Outline
- Medical imaging bottlenecks
- Compression: benefits for medical imaging
- Compression for computed tomography
- Decompression alternatives: FPGA, CPU, and GPU
- Speeding up GPU-based decompression
- Reducing I/O bottlenecks for CUDA applications
- Conclusions
Slide 3: Motivation: Faster Imaging Systems

Parameter        | CT               | Ultrasound        | MRI
Channel count    | 64, ...          | ...               | ...
Sample rate      | 3k-15k samp/sec  | 40M-65M samp/sec  | 10M-100M samp/sec
Bits per sample  | ...              | ...               | ...
Total bandwidth  | 15 GB/sec        | 250 GB/sec        | 26 GB/sec
Storage per scan | 900 GB           | 5-15 TB           | 2-3 TB
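For reference, the total-bandwidth row is just the product of the rows above it. As a worked example with hypothetical round numbers (not taken from the table):

\[
\text{bandwidth} = N_{\text{channels}} \times f_s \times \frac{\text{bits/sample}}{8}
\quad\Longrightarrow\quad
1000 \times (100 \times 10^{6}\ \text{samp/sec}) \times 2\ \text{bytes} = 200\ \text{GB/sec}.
\]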
Slide 4: Imaging Bottlenecks
- CT scanner: 1. slip ring ($5k-$30k); 2. disk drive array ($8k-$50k)
- Ultrasound: 1. ADCs to beamformer ($1k-$5k); 2. beams to display ($1k-$2k)
- MRI: 1. ADCs to console ($500-$5k); 2. image recon to RAM ($200-$2k)
Slide 5: Compression Benefits for Medical Imaging

Modality   | Bottleneck        | Compression Ratio | Before Prism | After Prism
CT         | Slip ring         | 2:1               | $32,000      | $18,000
Ultrasound | RAID arrays       | 2:1               | $40,000      | $20,000
           | ADC to beamformer | 2.5:1             | $3,000       | $1,400
MR         | Optical cables    | 4:1               | $8,000       | $2,000
           | Blade RAM         | 4:1               | 4 GB x 8     | 1 GB x 8
GPU        | Host RAM          | 2:1               | 10 GB/sec    | 20 GB/sec
Slide 6: Prism Lossless Results on CT Signals
Algorithms compared: Lempel-Ziv, Samplify Prism CT, Golomb-Rice, and Huffman, scored on mean compression ratio, mean bits per sample, and best/worst compression.
Data sets: low-signal phantom (shoulders) (6 scans); ...-cm acrylic phantom (semi-circle) (2 scans); air scan; offset scan; ...-cm poly (65 mm offset in Y) (2 scans); anatomical chest phantom; anatomical head phantom; Catphan low-contrast detectability (2 scans); Erlangen abdomen, head, and thorax phantoms; patient head scan (6 scans); patient chest scan.
Result: Prism CT is the best algorithm, obtaining both the highest lossless mean compression ratio (1.95:1) and the best minimum compression ratio (1.59:1), compared to the Huffman, Golomb-Rice, and Lempel-Ziv algorithms.
Source: A. Wegener, R. Senzig, et al., "Real-time compression of raw computed tomography data: technology, architecture, and benefits," SPIE Medical Imaging Conference (Feb. 2009).
Slide 7: Radiologist Scoring of Prism 3:1 (figure)
Slide 8: Samplified CT Scanner
Block diagram: in the GANTRY, Prism CT compresses the 20 Gbps raw data stream 2:1, so the gantry NIC and console NIC links run at 10 Gbps; in the CONSOLE, the NIC feeds the GPU, where Prism CT decompression runs.
Slide 9: Samplify Prism Compression Overview
(Figure: signal plots illustrating peak-to-average ratio, bandlimiting, and effective number of bits; e.g., a 12-bit converter delivering 10.5 effective bits.)
Slide 10: Decompression Alternatives
Two approaches:
1. Decompress in software on the image-reconstruction platform.
2. A dedicated hardware decompression card feeds the CPU/GPU via PCIe.

Alternative HW | Pros                                                        | Cons
Intel CPU      | Std. x86 architecture; std. Wintel/Linux OS                 | Max of 2 sockets; max of 16 cores; relatively expensive
Intel Larrabee | Std. x86 arch + GPU; OpenCL                                 | Not yet available; legacy x86 instruction set
FPGA           | Best bang-for-the-buck; flexible HW                         | Hard to program; few experienced developers
Nvidia GPU     | Cheap & available; most mature GPGPU; lots of cores on die  | Divergent warps; all I/O through host RAM
Slide 11: Baseline: Decompress on Intel CPU
- With only C compiler optimizations (no assembly), Prism decompression runs at ~100 Msamp/sec per core on a 2.3 GHz Core 2.
- Linear speed-up with parallelized OpenMP code: 2 cores, ~200 Msamp/sec; 4 cores, ~400 Msamp/sec.
- Still to be exploited: MMX via compiler intrinsics.
- Integer instructions dominate decompression time, so SSE instructions don't improve decompression speed.
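To make the parallelization concrete, here is a minimal sketch of the packet-parallel OpenMP structure this slide describes. The Packet layout and the decode_packet() stub are our assumptions, not Samplify's code; the real Prism decoder is proprietary.

// Hypothetical host-side skeleton of packet-parallel Prism decompression
// with OpenMP. decode_packet() is a stub standing in for the proprietary
// bit-serial VLC decoder; packets are independent, which is what makes
// per-core throughput add up almost linearly.
#include <omp.h>
#include <stdint.h>
#include <string.h>

typedef struct {
    const uint8_t *bits;    /* this packet's compressed bitstream  */
    size_t         nbits;   /* compressed length in bits           */
    int16_t       *out;     /* destination for decoded samples     */
    int            nsamp;   /* decoded samples per packet          */
} Packet;

static void decode_packet(const Packet *p)
{
    /* Stub: the real decoder walks p->bits symbol by symbol. */
    memset(p->out, 0, (size_t)p->nsamp * sizeof(int16_t));
}

void decompress_all(Packet *packets, int npackets)
{
    /* Each packet is independent, so a parallel-for scales with cores:
       ~100 Msamp/sec/core -> ~400 Msamp/sec on 4 cores. */
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < npackets; i++)
        decode_packet(&packets[i]);
}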
Slide 12: CUDA Decompress: First Effort
- Packetized bitstream; compressed packets reside in texture memory.
- One CUDA thread is responsible for decompressing a single packet of compressed data.
- Each thread decodes n VLC symbols into a shared-memory buffer, then copies them back out to device memory; n < length of the decompressed packet because shared memory is too small.
- GTX 260 performance: 300 Msamp/sec.

CUDA kernel pseudo-code:
do {
    parse a VLC symbol
    bit-whacking ops decode the VLC symbol into decompressed samples
    copy n symbols out to device memory
} while (not all samples in the packet have been decompressed)

- Key problem: VLC decode is inherently bit-serial (symbol i-1 must be decoded before symbol i), which inhibits parallelism.
- Samplify decompression has the advantage of being packetized, so decompression can be parallelized across packets.
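A hedged CUDA sketch of this one-thread-per-packet scheme. The VLC code here is a toy (unary prefix plus variable mantissa) standing in for Prism's actual code, and the PacketDesc layout is assumed; only the structure, parallel across packets and bit-serial within a packet, reflects the slide.

// Sketch: one thread per compressed packet. The inner loop is bit-serial,
// but independent packets decode in parallel across threads.
#include <stdint.h>

#define SAMPLES_PER_PACKET 256   // assumed; num_samples must not exceed it

struct PacketDesc {
    uint32_t bit_offset;   // start of this packet in the compressed stream
    uint32_t num_samples;  // decoded samples in this packet
};

__device__ uint32_t get_bit(const uint32_t *bits, uint32_t pos)
{
    return (bits[pos >> 5] >> (31 - (pos & 31))) & 1u;  // MSB-first
}

__global__ void decompress_packets(const uint32_t *bits,
                                   const PacketDesc *desc,
                                   int16_t *out, int npackets)
{
    int p = blockIdx.x * blockDim.x + threadIdx.x;
    if (p >= npackets) return;

    uint32_t pos = desc[p].bit_offset;
    int16_t *dst = out + p * SAMPLES_PER_PACKET;

    for (uint32_t s = 0; s < desc[p].num_samples; s++) {
        // Bit-serial: symbol s cannot start until symbol s-1 ends.
        uint32_t len = 0;
        while (get_bit(bits, pos++))          // unary prefix "1...10"
            len++;
        int16_t v = 0;
        for (uint32_t b = 0; b < len; b++)    // 'len' mantissa bits
            v = (int16_t)((v << 1) | get_bit(bits, pos++));
        dst[s] = v;
    }
}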
Slide 13: CUDA Decompress: Second Effort
Issues encountered during the first effort:
- The small amount of shared memory (16 KB) prohibits storing one entire decompressed packet of 16-bit samples per thread.
- The decompression kernel is full of conditional control logic, and conditionals lead to divergent warps that reduce overall performance.
IDEA: instead of having each kernel run a while loop (decode n VLC symbols, copy them out to shared memory, then decode the next n VLC symbols, and so on), have the CUDA kernel decode one symbol at a time.
PERFORMANCE: 350 Msamp/sec (17% faster). We expected more than a 17% improvement, although the idea is in line with the "keep your CUDA kernels light-weight" idiom.
Slide 14: CUDA Decompress: A Better Way
- The initial CUDA decompress kernel is filled with conditionals: the while loop contains a 16-case switch block. The profiler reports a high warp-serialize rate (divergent warps).
- Variable-length decode doesn't fit the single-program, multiple-data (SPMD) model.
IDEA: a paradigm shift and change in thought process (very difficult for C programmers!):
- Perform the worst-case work in every thread, all the time (counter-intuitive).
- Sometimes perform needless computations in order to remove conditionals.
- Replace traditional imperative code with a data-driven look-up-table (LUT) approach. The slide shows a very simple before/after example (see the sketch below); more complicated examples involve arrays of pointers.
BIG WIN: linear code speed-up with an increased number of cores!
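A minimal sketch of the conditional-to-LUT rewrite, using a made-up 4-bit prefix decode; Prism's actual tables and the slide's own before/after code are not reproduced here.

// Illustration (hypothetical): replacing a divergent switch with a
// data-driven LUT. All threads execute the same loads and arithmetic,
// so warps no longer diverge on the code path.
#include <stdint.h>

// Before: a 16-way switch on the next 4 bits of the stream -> divergence.
__device__ int decode_switch(uint32_t next4)
{
    switch (next4) {
    case 0:  return 0;   // each case had its own symbol/length logic
    case 1:  return 1;
    /* ... 14 more cases ... */
    default: return -1;
    }
}

// After: one constant-memory table indexed by the same 4 bits.
// Every thread does an identical lookup; the worst case is the only case.
struct LutEntry { int16_t value; uint8_t bits_consumed; };
__constant__ LutEntry c_lut[16];  // filled in by the host at startup

__device__ int16_t decode_lut(uint32_t next4, uint32_t *pos)
{
    LutEntry e = c_lut[next4];
    *pos += e.bits_consumed;   // advance the bit cursor, branch-free
    return e.value;
}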
Slide 15: What's Next: More to Do
Decompress throughput profiled on a GTX 260 (CUDA toolkit 2.2):
- First attempt: 300 Msamp/sec (less than linear speed-up)
- Latest attempt: 450 Msamp/sec (close to linear speed-up)
- Throughput excludes data-transfer time; it reflects kernel execution time only.
Additional kernel tuning:
- Fix an apparent nvcc bug involving shared-memory arrays of pointers whose pointees all point to shared-memory scalar entities.
- All LUTs currently reside in constant memory; move them to texture memory to reduce register pressure.
- No caching of decompressed samples in shared memory; move the compressed stream into texture memory.
Low-level optimizations:
- Write 64 or 128 decompressed bits at a time when copying back out to device memory.
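A sketch of what the 64-bit-at-a-time store might look like, assuming eight decoded samples are staged in registers; the function name and layout are ours, not from the slides.

// Sketch: writing 64 bits (four 16-bit samples) per store using CUDA's
// built-in short4 vector type, instead of one 16-bit store per sample.
#include <stdint.h>

__device__ void store_samples_vectorized(int16_t *dst, const int16_t s[8])
{
    // dst must be 8-byte aligned for short4 stores.
    short4 v0 = make_short4(s[0], s[1], s[2], s[3]);
    short4 v1 = make_short4(s[4], s[5], s[6], s[7]);
    short4 *out = reinterpret_cast<short4 *>(dst);
    out[0] = v0;  // one 64-bit transaction instead of four 16-bit stores
    out[1] = v1;
    // Packing into int4/longlong2 the same way yields 128-bit stores.
}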
Slide 16: Lessons Learned about CUDA Decompress
- Variable-length code (VLC) decode and decompression are difficult to map efficiently to CUDA because of their bit-serial nature; packetized bitstreams, however, can be parallelized.
- You have to work around the limited amount of shared memory.
- Conditionals cause divergent warps: the SPMD (single-program, multiple-data) model dictates that every thread in a warp must execute the same instruction at the same time.
- You need to think outside the box to remove conditionals.
- Key insight: the data-driven LUT methodology was the key to overcoming conditional code; it significantly reduces the high warp-serialization rate reported by the profiler.
Slide 17: Compress & Decompress for CUDA Apps
GPU precedent: H.264, VC-1, MPEG-2, and WMV9 decode acceleration in the GTX 200 series.
Application areas (from "Getting Started with CUDA," NVISION 08): computer vision, augmented reality, supercomputing, computational finance, computational fluid dynamics, advanced numeric computing, GPU computing in education, AstroGPU, multi-GPU programming, GPU clusters, digital content creation, video and image processing, advanced visualization, computer-aided engineering, molecular dynamics, reality simulation, 3D stereo, scientific visualization, visual simulation, medical imaging, life sciences, energy exploration, film and broadcast, automotive.
APIs: C/C++ with CUDA extensions, DirectCompute, OpenCL, OpenGL, DirectX 11.
Slide 18: Conclusions
- Medical imaging bottlenecks are increasing significantly.
- Compression is performed in FPGAs near the sensors; decompression runs in GPUs, alongside image reconstruction.
- Prism reduces I/O bottlenecks between sensors and GPUs.
- Replacing conditionals with table look-ups yields a near-linear GPU speed-up for variable-length decoding.
- A low-complexity hardware compress & decompress core could significantly reduce CPU-GPU I/O bottlenecks for many CUDA-enabled numerical applications.
Special thanks to Shehrzad Qureshi for his significant contributions to these results.