First In Vivo Medical Images Using Photon-Counting, Real-Time GPU Reconstruction




First In Vivo Medical Images Using Photon-Counting, Real-Time GPU Reconstruction
A.P. Lowell, P. Kahn, J. Ku
25 March 2014

Overview
- Application
- Algorithms
- History and Limitations of Traditional Processors
- GPU Solution


Application: Cardiac Fluoroscopy System

Application: Makes Live Video of a Beating Heart

Application: General Overview
- Used for non-surgical cardiac procedures
  - Assessment of stenosis
  - Angioplasty
  - Stent placement
- Real-time digital X-ray imaging
  - High data throughput and processing
- Multiple equipment racks and custom enclosures across several rooms

Application: Functional Blocks
[Block diagram. Areas: Control Room, Equipment Space, Exam Area, Facility Installations. Labeled components: image displays, display processor, user controls, equipment racks, imaging chassis, I/O, gantry, C-arm, X-ray detector, X-ray tube, heat exchanger, HVPS, PDU, UPS, gantry pedestal, patient table, motion control.]

Application: Background
- Triple Ring Technologies is a contract R&D firm specializing in sensor-based systems
- Project was funded by:
  - NovaRay, Inc.
  - National Institutes of Health
- Clinical work at University of Wisconsin, Madison
- Initial implementation by MultiCoreWare, Inc.

Application: Performance Summary
Real-time tomosynthesis
- Input: continuous sensor images
  - ~123 billion rays/second
  - ~40 Gbps sensor downlink rate
  - 640x320 photon-counting sensor array
  - 10,000 scanned source locations
  - 1.28 μs/snapshot
- Output: live video
  - 32 focal planes of 1000x1000 pixels internally
  - 1 best-focus output image of 1000x1000 pixels
  - 30 frames/second
- ~1.4 trillion mathematical operations per second
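The quoted figures can be combined into a rough per-frame budget. The sketch below is derived arithmetic only, using the numbers on the slide; the function name `budget` and the derived quantities are illustrative and are not results reported by the authors.

```python
# Back-of-envelope budget from the figures quoted on the Performance Summary slide.
# Everything computed here is derived arithmetic, not a measured result.

FRAME_RATE_HZ = 30
FOCAL_PLANES = 32
PLANE_PIXELS = 1000 * 1000
DETECTOR_ELEMENTS = 640 * 320
RAYS_PER_SECOND = 123e9      # "~123 billion rays/second"
OPS_PER_SECOND = 1.4e12      # "~1.4 trillion mathematical operations per second"
SENSOR_LINK_BPS = 40e9       # "~40 Gbps sensor downlink rate"


def budget():
    """Return derived per-frame and per-ray quantities (hypothetical helper)."""
    plane_pixels_per_second = FOCAL_PLANES * PLANE_PIXELS * FRAME_RATE_HZ
    return {
        "detector_elements": DETECTOR_ELEMENTS,
        "frame_period_ms": 1000.0 / FRAME_RATE_HZ,
        "plane_pixels_per_second": plane_pixels_per_second,
        "rays_per_frame": RAYS_PER_SECOND / FRAME_RATE_HZ,
        "ops_per_ray": OPS_PER_SECOND / RAYS_PER_SECOND,
        "ops_per_plane_pixel": OPS_PER_SECOND / plane_pixels_per_second,
        "sensor_bytes_per_frame": SENSOR_LINK_BPS / 8 / FRAME_RATE_HZ,
    }


if __name__ == "__main__":
    for name, value in budget().items():
        print(f"{name}: {value:,.2f}")
```

Under these figures each frame carries about 4.1 billion rays and roughly 167 MB of sensor data, and the stated operation rate works out to about 11 operations per ray.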

Application: Technological Novelty: Scanning-Beam Digital X-ray (SBDX)
- Geometry
  - Traditional X-ray: point X-ray source; large-area detector close to patient
  - SBDX: large-area X-ray source; small-area detector far from patient
- Imaging information
  - Traditional X-ray: flat projection
  - SBDX: 3-D
- Dose
  - Traditional X-ray: acceptable
  - SBDX: 80% to 90% less

Application: Technological Novelty: Reverse Geometry
[Figure: standard geometry compared with reverse geometry]
- Scattered x-rays miss the detector: less noise!
- Multiple source perspectives: 3D tomography


Algorithms: Tomosynthesis: Focal Planes
[Figure labels: D1, D2, D3; High Plane; Focal-Plane; Low Plane; Detector Plane; Source Locations S1, S2, S3]
- Images must be reconstructed
- Within a focal plane:
  - Rays from a set of source/detector combinations converge to the same pixel
  - Result is constructive reinforcement
- Outside the focal plane:
  - Rays from the same set of source/detector combinations diverge into different pixels
  - Result is blurring
- Rate of divergence defines depth-of-field
- Multiple focal planes are required to image the full volume
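The convergence and divergence described above can be reproduced with a one-dimensional toy model. This is a minimal sketch under assumed geometry (source plane at height 0, detector plane at height 100, a single point-like object, unit pixels); the function names and all numbers are hypothetical and do not come from the SBDX system.

```python
from collections import Counter

SOURCE_TO_DETECTOR = 100.0  # hypothetical spacing, arbitrary units


def ray_x_at(z, source_x, detector_x):
    """Lateral position of the source-to-detector ray at height z above the source plane."""
    return source_x + (detector_x - source_x) * z / SOURCE_TO_DETECTOR


def acquire(sources, detectors, object_x, object_z, half_width=0.5):
    """Return {(source, detector): 1} for every ray that passes through a point-like object."""
    hits = {}
    for s in sources:
        for d in detectors:
            if abs(ray_x_at(object_z, s, d) - object_x) < half_width:
                hits[(s, d)] = 1
    return hits


def reconstruct_plane(hits, z):
    """Unfiltered back-projection: add each sample into the pixel its ray crosses at height z."""
    image = Counter()
    for (s, d), value in hits.items():
        image[round(ray_x_at(z, s, d))] += value
    return image


sources = [-20, -10, 0, 10, 20]
detectors = range(-60, 61)
hits = acquire(sources, detectors, object_x=0.0, object_z=50.0)

in_focus = reconstruct_plane(hits, 50.0)   # all five samples land in pixel 0
out_of_focus = reconstruct_plane(hits, 25.0)  # samples spread over five pixels
```

In the plane that contains the object the five samples reinforce in a single pixel; in the plane at height 25 the same samples spread over five pixels with one count each, which is the blurring the slide describes.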

Algorithms: Tomosynthesis: Digital Lens
[Figure labels: Source Plane; Detector Plane; Virtual Lens; Focal-Plane; Virtual Image Plane; points A and B]
- Mapping of rays to image pixels is the virtual equivalent of having a physical lens at the detector plane to bend the rays onto a focal plane
- Changing the bending characteristic of the virtual lens (i.e., the mapping function) creates different focal planes

Algorithms: Tomosynthesis: Digital Lens
[Figure, two panels: "Focal Plane A In-Focus" and "Focal Plane B In-Focus". Each panel shows the Source Plane, Detector Plane, Focal-Plane A, Focal-Plane B, and the Virtual Image Plane with points A and B.]

Application: Tomosynthesis: Focal Plane Example

Algorithms: Reconstruction By Projection
- Geometric projection based on ray-tracing
- Both projection coefficients and extent vary with focal plane

Algorithms: Reconstruction by Projection: Basic Geometry
[Figure labels: Detector Elements; Focal-Plane Pixels; Source Locations]
- Rays from each source location to each detector element intersect the focal plane within some window that spans a (typically) non-integer number of pixels

Algorithms: Tomosynthesis: Basic Geometry
[Figure labels: Detector Elements; Focal-Plane Pixels; Source Locations]
- Windows from adjacent detector elements will (in general) overlap at the boundary pixels
- Overlap is not constant: the projection kernel varies between detector elements
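The window and boundary-pixel overlap described above can be sketched in one dimension. The code assumes a point source, a detector element of finite width, unit pixel pitch, and area-proportional weighting; these choices, the function names, and the numbers are illustrative assumptions, not the projection kernel used in the actual system.

```python
import math


def projection_window(source_x, det_lo, det_hi, z, source_to_detector):
    """Project the edges of one detector element onto the focal plane at height z."""
    t = z / source_to_detector
    return (source_x + (det_lo - source_x) * t,
            source_x + (det_hi - source_x) * t)


def pixel_weights(x_lo, x_hi, pixel_pitch=1.0):
    """Fraction of the window [x_lo, x_hi] that falls in each pixel.

    Pixel k spans [k * pitch, (k + 1) * pitch). Assumes x_hi > x_lo.
    """
    width = x_hi - x_lo
    first = math.floor(x_lo / pixel_pitch)
    last = math.ceil(x_hi / pixel_pitch) - 1
    weights = {}
    for k in range(first, last + 1):
        overlap = min(x_hi, (k + 1) * pixel_pitch) - max(x_lo, k * pixel_pitch)
        if overlap > 0:
            weights[k] = overlap / width
    return weights


# Two adjacent detector elements seen from a source at x = 0, focal plane halfway up.
window_a = projection_window(0.0, 10.0, 13.0, z=50.0, source_to_detector=100.0)
window_b = projection_window(0.0, 13.0, 16.0, z=50.0, source_to_detector=100.0)
weights_a = pixel_weights(*window_a)   # window 5.0 to 6.5: pixels 5 and 6
weights_b = pixel_weights(*window_b)   # window 6.5 to 8.0: pixels 6 and 7
```

Each window spans 1.5 pixels, and pixel 6 receives a contribution from both elements. Because the split within a window depends on where its edges fall relative to the pixel grid, the weights differ from element to element.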

Algorithms: Reconstruction by Projection: Basic Geometry
[Figure labels: Detector Elements; Focal-Plane Pixels; Source Locations]
- Windows from adjacent source locations will overlap
- Multiple detector samples for each reconstructed image pixel

Algorithms: Reconstruction By Projection: Rotated Detector
- Rotation of detector improves sampling as projection advances across the image
- However, a given detector row or column no longer maps consistently onto a pixel row or column: the pixel row indices change with detector column, and vice versa
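The indexing consequence of rotation can be illustrated with a plain 2-D rotation of detector element centers onto a unit pixel grid. The 10 degree angle, the unit scale, and the function names are hypothetical; the slides do not state the actual rotation or mapping.

```python
import math


def detector_to_pixel(col, row, angle_deg, scale=1.0):
    """Map a detector element index to the pixel that contains its rotated center."""
    a = math.radians(angle_deg)
    x = scale * (col * math.cos(a) - row * math.sin(a))
    y = scale * (col * math.sin(a) + row * math.cos(a))
    return math.floor(x), math.floor(y)


def pixel_rows_for_detector_row(row, n_cols, angle_deg):
    """Distinct pixel rows touched by one detector row."""
    return sorted({detector_to_pixel(c, row, angle_deg)[1] for c in range(n_cols)})


aligned = pixel_rows_for_detector_row(row=3, n_cols=16, angle_deg=0.0)
rotated = pixel_rows_for_detector_row(row=3, n_cols=16, angle_deg=10.0)
```

With no rotation the detector row stays in a single pixel row. With the assumed 10 degree rotation the same 16 elements step through several pixel rows, so a kernel can no longer assume a fixed row index per detector row.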

Algorithms: Tomosynthesis = CT? No
- Perspective
  - CT: parallel to rays
  - SBDX: perpendicular to rays
- Sample rate
  - CT: <~500 Msps (high-end)
  - SBDX: 7.7 Gsps
- Response time
  - CT: ASAP
  - SBDX: 30 fps, < 100 ms latency
- Projection geometry
  - CT: irregular; varies with rotation angle
  - SBDX: regular; integer source step-size
- Reconstruction
  - CT: correct geometric distortion; filtered back-projection
  - SBDX: allow geometric distortion; unfiltered back-projection

Algorithms: Plane-Selection
[Figure panels: Single Focal-Plane; Best Focus]

Algorithms: Plane-Selection
- Detect features of interest (things "in-focus") in each focal plane
  - Algorithms may include matched filters, gradient estimation, topological operators, ...
  - Calculate figures-of-merit
- Major impediments
  - High levels of Poisson noise in dark regions
  - Low contrast for small features
- Select which plane to display in the final image on a pixel-by-pixel basis
  - Plane-to-plane comparison over a large number of planes
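A per-pixel selection of this kind can be sketched with a simple gradient-based figure-of-merit. The gradient measure is only one of the options the slide lists and is chosen here for brevity; the function names and the tiny test planes are hypothetical, and the system's actual figure-of-merit is not described in the slides.

```python
def gradient_merit(plane):
    """Per-pixel figure-of-merit: absolute difference to the right and lower neighbors."""
    rows, cols = len(plane), len(plane[0])
    merit = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            right = plane[r][min(c + 1, cols - 1)]   # clamp at the image edge
            below = plane[min(r + 1, rows - 1)][c]
            merit[r][c] = abs(right - plane[r][c]) + abs(below - plane[r][c])
    return merit


def select_best_focus(planes):
    """Return (composite image, plane-index map), taking the highest-merit plane per pixel."""
    merits = [gradient_merit(p) for p in planes]
    rows, cols = len(planes[0]), len(planes[0][0])
    image = [[0.0] * cols for _ in range(rows)]
    index = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            best = max(range(len(planes)), key=lambda k: merits[k][r][c])
            index[r][c] = best
            image[r][c] = planes[best][r][c]
    return image, index


sharp = [[0, 0, 10, 0], [0, 0, 10, 0]]      # a thin, high-contrast feature
blurred = [[2, 3, 4, 3], [2, 3, 4, 3]]      # the same feature out of focus
composite, chosen = select_best_focus([sharp, blurred])
```

Around the feature the sharp plane wins, and the composite keeps its value of 10 there. A raw gradient measure also responds to noise, which is consistent with the Poisson-noise impediment noted above; a practical figure-of-merit would need some form of smoothing or matched filtering.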

Application: Live Image from GPU System

Algorithms: Other Processing
- Artifact removal (per focal plane)
  - Residue of reconstruction methods: pattern noise, gain corrections
- Dynamic range adjustment (per focal plane)
  - Typical image dynamic range is far in excess of display capabilities and of the human visual system
- Noise management
  - Noise is dominated by photon statistics rather than by scatter
- User-applied filters
  - Temporal averaging with motion detection
  - Edge enhancement
  - Contrast enhancement
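As an illustration of the temporal-averaging item, the sketch below uses a recursive (exponential) average that is bypassed wherever the frame-to-frame change exceeds a threshold. This is a common general-purpose approach, swapped in here because the slides do not say which filter the system uses; `alpha`, `motion_threshold`, and the function name are assumptions.

```python
def temporal_filter(previous, current, alpha=0.25, motion_threshold=20.0):
    """Blend the new frame into the running average, except where motion is detected.

    previous: running average from earlier frames (list of rows)
    current:  newly reconstructed frame (list of rows)
    """
    out = []
    for prev_row, cur_row in zip(previous, current):
        row = []
        for p, c in zip(prev_row, cur_row):
            if abs(c - p) > motion_threshold:
                row.append(float(c))             # large change: treat as motion, no averaging
            else:
                row.append(p + alpha * (c - p))  # small change: average to reduce noise
        out.append(row)
    return out


filtered = temporal_filter([[100, 100]], [[104, 200]])
```

The first pixel changes by only 4 counts and is averaged to 101.0; the second jumps by 100 counts, is treated as motion, and passes through as 200.0. Averaging reduces photon noise in static regions at the cost of lag, which is why the motion test matters for a live cardiac image.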

Algorithms: Temporal Constraints
- Thermal loading of x-ray target mandates re-scan of source locations
  - Previously reconstructed pixels must be revisited
  - Requires a large fraction of the final image to remain resident in memory for re-scanning
- Real-time feedback of physical manipulations: hand-eye coordination for the surgeon
  - Imposes a maximum latency requirement of < ~100 ms along with a sustained 30 Hz frame rate
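Both constraints reduce to simple arithmetic. The sketch below expresses the latency limit in frame periods and estimates the memory needed to keep every focal plane resident; the 4 bytes per pixel is an assumption (the slides do not give the accumulator width), and the function names are illustrative.

```python
def latency_in_frame_periods(max_latency_ms, frame_rate_hz):
    """How many frame periods fit inside the latency limit."""
    return max_latency_ms * frame_rate_hz / 1000.0


def resident_bytes(planes, width, height, bytes_per_pixel):
    """Memory needed to keep every focal plane of one frame resident."""
    return planes * width * height * bytes_per_pixel


periods = latency_in_frame_periods(max_latency_ms=100, frame_rate_hz=30)
memory = resident_bytes(planes=32, width=1000, height=1000, bytes_per_pixel=4)
```

A 100 ms limit at 30 Hz equals three frame periods, so the whole pipeline from acquisition to display must finish in under three frames. With the assumed 4-byte accumulators, 32 planes of 1000x1000 pixels occupy 128 MB in total.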


History: Previous Implementations
~10x increase in resolution/calculations per generation
- 1st and 2nd Generations: FPGAs
  - Fully custom parallel pipelines
  - > $15k/focal-plane x 16 focal-planes = > $240k/system
  - Memory-constrained
  - Development and maintenance difficult
- 3rd Generation: MPPA (Ambric/Nethra)
  - 336 processors with local memory and flexible data distribution mesh
  - Obsolete architecture
  - Still used FPGAs for input formatting/post-processing
  - ~$1500/focal-plane x 32 focal-planes = ~$48k/system
  - Proprietary development environment

History: Generation 2
- Blue: FPGAs, 1 focal plane/board
- Green: FPGAs
  - Artifact removal
  - Dynamic range management
- Separate board for plane-selection

History: Generation 3
- Blue: MPPAs, 1 focal plane/chip
- Green: FPGAs
  - Data input/format
  - Artifact removal
  - Dynamic range management
- Same board used for plane-selection

History: Traditional Processors
- Previous attempts to map algorithms to common commercial processors failed
  - DSP
  - Cell
  - GPU
- Limitations:
  - I/O: bandwidth
  - Memory: available resources (buffer results for many focal planes)
  - Memory: cache sizing (fall off the cache)
  - Memory: burst optimization
    - 2-D array access: adjacent accesses in one dimension but not in the other
  - Degree of management required by slower host processors
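The 2-D access problem can be made concrete by counting how many memory bursts a walk across an image touches. The sketch assumes a row-major layout, a 1000-pixel row width (from the focal-plane size quoted earlier), and a hypothetical burst of 32 elements; none of these layout details are given in the slides.

```python
def bursts_touched(coords, width, burst_elems=32):
    """Count distinct memory bursts touched by a list of (row, col) accesses.

    Assumes a row-major array in which element (r, c) lives at index r * width + c.
    """
    return len({(r * width + c) // burst_elems for r, c in coords})


WIDTH = 1000  # pixels per image row

row_walk = [(0, c) for c in range(32)]   # 32 neighbors along a row
col_walk = [(r, 0) for r in range(32)]   # 32 neighbors along a column

along_row = bursts_touched(row_walk, WIDTH)
along_col = bursts_touched(col_walk, WIDTH)
```

Thirty-two neighbors along a row fit in one burst, while thirty-two neighbors along a column each land in a different burst. A projection window that covers a 2-D patch of pixels therefore pays the column-direction cost however the array is oriented.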


GPU Solution: What is our configuration?
- 9x K20
  - ~$850/focal-plane x 32 planes = ~$27k/system
- 1x GTX680 (for managing displays)
- PCIe 2.0 backplane
- Red Hat on Supermicro
- CUDA 5

GPU Solution: Logical Configuration and Data Flow
[Block diagram. Labels: X-ray Detector (framing, fiber); X-ray Source (framing, fiber); Re-scan Aggregator; System Controller (mediation); Ethernet Switch (1000-base-T); Image Reconstruction with eight K20 GPUs on PCIe (multi-cast, RDMA); Supermicro host; one K20 for artifact removal, dynamic range management, and plane-selection; Disk Array (PCIe); GTX680 (HDMI) to display; GigE Vision to external system.]

GPU Solution: Physical Configuration
[Block diagram. Labels: X-ray Detector; X-ray Source; Re-scan Aggregator with x8 (Gen 1) links carrying multi-cast sensor data; Image Reconstruction in two PCIe chassis (Cubix 8), each containing PCIe switches (Gen 2) with x16 links to K20 GPUs, eight K20s in total; x8 (Gen 1) interconnect; Host Computer with µP, PCIe switch (Gen 3), one K20 (artifact removal, dynamic range management, plane-selection), and GTX 680 (display); Disk Array; GigE Vision display; Ethernet Switch; System Controller; 1000-base-T control links.]

GPU Solution: What is new now that allows it to work?
- Gen 2 PCIe interface with multi-cast
  - High-enough bandwidth
  - All planes use the same data set and must receive the same data stream
- GPUDirect or Remote DMA (RDMA)
  - Allows source data streaming directly to GPUs, bypassing the host
- Dynamic Parallelism
  - Decreases latency by allowing management of parallel operations without host intervention
- Significant increase in fast shared memory
- Significant increase in core density
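The bandwidth and multi-cast points can be checked with rough link arithmetic. The sketch uses the standard PCIe per-lane signaling rates and line encodings (2.5 GT/s and 5 GT/s with 8b/10b, 8 GT/s with 128b/130b) and ignores packet and protocol overhead, so the figures are upper bounds. The 40 Gbps stream comes from the performance summary; the count of eight receiving GPUs is read from the configuration diagram and is an assumption here, as are the function names.

```python
# Usable line rate per lane in Gbps after encoding overhead (protocol overhead ignored).
PCIE_LANE_GBPS = {
    1: 2.5 * 8 / 10,      # Gen 1: 2.5 GT/s, 8b/10b
    2: 5.0 * 8 / 10,      # Gen 2: 5.0 GT/s, 8b/10b
    3: 8.0 * 128 / 130,   # Gen 3: 8.0 GT/s, 128b/130b
}


def link_gbps(gen, lanes):
    """Upper bound on one-direction throughput of a PCIe link."""
    return PCIE_LANE_GBPS[gen] * lanes


def source_traffic_gbps(stream_gbps, receivers, multicast):
    """Traffic the sender must emit to deliver the same stream to every receiver."""
    return stream_gbps if multicast else stream_gbps * receivers


gen2_x16 = link_gbps(2, 16)
with_multicast = source_traffic_gbps(40.0, receivers=8, multicast=True)
without_multicast = source_traffic_gbps(40.0, receivers=8, multicast=False)
```

A Gen 2 x16 link tops out at 64 Gbps, so a single 40 Gbps stream fits. Sending a separate copy to each of eight GPUs would need 320 Gbps from the source, which is why multi-cast matters when every plane consumes the same data.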


GPU Solution: What would make it better?
- More shared memory: we are still bandwidth-limited!
- RDMA improvements
  - Bidirectional: we have to get the images out as well as getting the data in
  - Peer-to-peer (GPU-to-GPU) communication/coordination without host intervention
- Better support for real-time operations
  - Timeouts/host waits
  - Support for code executing on streaming multiprocessor
  - CUDA API is optimized for batch operations, not streaming operations
- Better debugging support for multi-GPU systems
  - Ability to isolate reporting to subsets

Reference
- S4363: "Accelerated X-ray Imaging: Real-Time Multi-Plane Image Reconstruction with CUDA" discusses an alternate implementation of the reconstruction algorithm

Thanks To
- Paul Kahn, Jamie Ku, and the rest of the TRT team
- NovaRay, Inc.
- NIH
- University of Wisconsin at Madison
- MultiCoreWare