First In Vivo Medical Images Using Photon-Counting, Real-Time GPU Reconstruction
A.P. Lowell, P. Kahn, J. Ku
25 March 2014
Overview
  Application
  Algorithms
  History and Limitations of Traditional Processors
  GPU Solution
Application General: Cardiac Fluoroscopy System
Application Makes Live Video of a Beating Heart
Application General: Overview
  Used for non-surgical cardiac procedures: assessment of stenosis, angioplasty, stent placement
  Real-time digital X-ray imaging
  High data throughput and processing
  Multiple equipment racks and custom enclosures across several rooms
Application Functional Blocks
  [Block diagram. Areas: Control Room, Equipment Space, Exam Area, Facility Installations. Blocks: Image Displays, Display Processor, User Controls, Equipment Racks, Imaging Chassis, I/O, Gantry, C-Arm, X-ray Detector, X-ray Tube, Heat Exchanger, HVPS, PDU, UPS, Gantry Pedestal, Patient Table, Motion Control]
Application Background
  Triple Ring Technologies is a contract R&D firm specializing in sensor-based systems
  Project was funded by NovaRay, Inc. and the National Institutes of Health
  Clinical work at University of Wisconsin, Madison
  Initial implementation by MultiCoreWare, Inc.
Application Performance Summary
  Real-time tomosynthesis
  Input: continuous sensor images
    ~123 billion rays/second
    ~40 Gbps sensor downlink rate
    640x320 photon-counting sensor array
    10,000 scanned source locations
    1.28 μs/snapshot
  Output: live video
    32 focal planes of 1000x1000 pixels internally
    1 best-focus output image of 1000x1000 pixels
    30 frames/second
  ~1.4 trillion mathematical operations per second
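The quoted figures can be combined into a few derived quantities. The short Python sketch below does only that arithmetic; the derived values (operations per ray, internal pixel rate, rays per internal pixel) are not stated on the slide, and the last one assumes the ray count is spread evenly over all 32 internal planes, which the slide does not confirm.

```python
# Figures quoted on the Performance Summary slide.
RAYS_PER_SECOND = 123e9      # ~123 billion rays/second
OPS_PER_SECOND = 1.4e12      # ~1.4 trillion operations/second
FOCAL_PLANES = 32
PLANE_WIDTH = 1000
PLANE_HEIGHT = 1000
FRAMES_PER_SECOND = 30

# Average arithmetic budget per ray (derived, not stated on the slide).
ops_per_ray = OPS_PER_SECOND / RAYS_PER_SECOND

# Focal-plane pixels produced internally each second, over all planes.
internal_pixels_per_second = (
    FOCAL_PLANES * PLANE_WIDTH * PLANE_HEIGHT * FRAMES_PER_SECOND
)

# Average rays per internal pixel per frame.
# ASSUMPTION: the ray count is shared evenly across all planes.
rays_per_internal_pixel = RAYS_PER_SECOND / internal_pixels_per_second

print(f"{ops_per_ray:.1f} operations per ray")
print(f"{internal_pixels_per_second:,} internal pixels per second")
print(f"{rays_per_internal_pixel:.0f} rays per internal pixel per frame")
```

The per-ray budget of roughly a dozen operations gives a feel for how little work can be spent on each ray if the sustained rate is to be met.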
Application Technological Novelty: Scanning-Beam Digital X-ray (SBDX) Geometry
                        Traditional X-ray                SBDX
  X-ray source          Point                            Large-area
  Detector              Large-area, close to patient     Small-area, far from patient
  Imaging information   Flat projection                  3-D
  Dose                  Acceptable                       80% to 90% less
Application Technological Novelty: Reverse Geometry
  [Figure: Standard Geometry vs. Reverse Geometry]
  Scattered x-rays miss the detector: less noise!
  Multiple source perspectives: 3D tomography
Algorithms Tomosynthesis: Focal Planes
  [Figure labels: Source Locations S1, S2, S3; D1, D2, D3; High Plane, Focal Plane, Low Plane, Detector Plane]
  Images must be reconstructed
  Within a focal plane: rays from a set of source/detector combinations converge to the same pixel, giving constructive reinforcement
  Outside the focal plane: rays from the same set of source/detector combinations diverge into different pixels, and the result is blurring
  Rate of divergence defines depth of field
  Multiple focal planes are required to image the full volume
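The convergence and divergence of rays can be checked with a small 2-D ray-tracing sketch. Everything here is an illustrative assumption rather than the system's actual geometry: the source plane sits at height 0, the detector plane at height 1, and the function names and coordinates are hypothetical.

```python
def crossing(x_source, x_detector, z, detector_z=1.0):
    """Lateral position where the source-to-detector ray crosses height z."""
    return x_source + (x_detector - x_source) * z / detector_z

def detector_hit(x_source, x_point, z_point, detector_z=1.0):
    """Detector position reached by the ray from x_source through a point object."""
    return x_source + (x_point - x_source) * detector_z / z_point

def spread(sources, x_point, z_point, z_plane):
    """Width of the region where rays through the point land in plane z_plane."""
    hits = [
        crossing(s, detector_hit(s, x_point, z_point), z_plane)
        for s in sources
    ]
    return max(hits) - min(hits)

# Hypothetical geometry: three source locations, one point object at height 0.5.
sources = [-0.2, 0.0, 0.2]
in_focus = spread(sources, x_point=0.05, z_point=0.5, z_plane=0.5)
above = spread(sources, x_point=0.05, z_point=0.5, z_plane=0.4)
below = spread(sources, x_point=0.05, z_point=0.5, z_plane=0.6)

print(in_focus, above, below)
```

In the plane that contains the object the spread collapses to (numerically) zero, while in the planes on either side the same rays land at different positions. The growth of that spread with distance from the focal plane is what sets the depth of field.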
Algorithms Tomosynthesis: Digital Lens
  [Figure labels: A, B, Virtual Image Plane, Virtual Lens, Detector Plane, Focal Plane, Source Plane]
  Mapping rays to image pixels is the virtual equivalent of placing a physical lens at the detector plane to bend the rays onto a focal plane
  Changing the bending characteristic of the virtual lens (i.e., the mapping function) creates different focal planes
Algorithms Tomosynthesis: Digital Lens
  [Figure: two panels, each showing the Source Plane, Focal Plane A, Focal Plane B, and the Detector Plane. Left panel: Focal Plane A selected, A in focus on the Virtual Image Plane. Right panel: Focal Plane B selected, B in focus on the Virtual Image Plane]
Application Tomosynthesis: Focal Plane Example
Algorithms Reconstruction By Projection
  Geometric projection based on ray-tracing
  Both the projection coefficients and the projection extent vary with focal plane
Algorithms Reconstruction by Projection: Basic Geometry
  [Figure labels: Detector Elements, Focal-Plane Pixels, Source Locations]
  Rays from each source location to each detector element intersect the focal plane within a window that typically spans a non-integer number of pixels
Algorithms Tomosynthesis: Basic Geometry
  [Figure labels: Detector Elements, Focal-Plane Pixels, Source Locations]
  Windows from adjacent detector elements will (in general) overlap at the boundary pixels
  Overlap is not constant: the projection kernel varies between detector elements
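A 1-D sketch of the window idea: each detector element projects onto an interval of the focal plane measured in pixel units, and each pixel receives a weight equal to the length of interval it shares with that window. The helper name `footprint_weights` and the window endpoints are hypothetical, and area-overlap weighting is one plausible choice of kernel, not necessarily the one used in the system.

```python
import math

def footprint_weights(start, end):
    """Map pixel index -> covered length for a 1-D window [start, end) in pixel units."""
    weights = {}
    for pixel in range(math.floor(start), math.ceil(end)):
        overlap = min(end, pixel + 1) - max(start, pixel)
        if overlap > 0:
            weights[pixel] = overlap
    return weights

# Hypothetical windows for two adjacent detector elements.
window_a = footprint_weights(2.3, 4.9)   # partial, full, partial coverage
window_b = footprint_weights(4.5, 7.1)

# Pixels that receive a contribution from both detector elements.
shared_pixels = sorted(set(window_a) & set(window_b))

print(window_a)
print(window_b)
print(shared_pixels)
```

Because the window endpoints are not integers, the boundary pixels get fractional weights, and the fraction differs from one detector element to the next. That is the sense in which the projection kernel is not constant.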
Algorithms Reconstruction by Projection: Basic Geometry
  [Figure labels: Detector Elements, Focal-Plane Pixels, Source Locations]
  Windows from adjacent source locations will overlap
  Multiple detector samples for each reconstructed image pixel
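One way to picture the overlap between source locations is a 1-D unfiltered back-projection in which every sample is added into the pixel it lands on and the sum is divided by the number of contributions. This is a deliberately simplified sketch: it assumes each detector element covers exactly one pixel and that moving to the next source location shifts the window by an integer number of pixels (`step`). The function name and data are hypothetical.

```python
def backproject(samples_by_source, step, n_pixels):
    """Accumulate detector samples into pixels and average overlapping ones.

    samples_by_source[k][j] is detector sample j for source location k.
    ASSUMPTION: source k maps detector element j onto pixel k * step + j.
    """
    total = [0.0] * n_pixels
    hits = [0] * n_pixels
    for k, samples in enumerate(samples_by_source):
        for j, value in enumerate(samples):
            pixel = k * step + j
            if 0 <= pixel < n_pixels:
                total[pixel] += value
                hits[pixel] += 1
    image = [t / h if h else 0.0 for t, h in zip(total, hits)]
    return image, hits

# Three source locations, four detector elements, windows shifted by two pixels.
samples = [
    [1.0, 2.0, 3.0, 4.0],
    [3.0, 4.0, 5.0, 6.0],
    [5.0, 6.0, 7.0, 8.0],
]
image, hits = backproject(samples, step=2, n_pixels=8)
print(hits)
print(image)
```

Pixels in the middle of the row are covered by two windows and so receive two detector samples each, while the edge pixels receive only one.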
Algorithms Reconstruction By Projection: Rotated Detector
  Rotation of the detector improves sampling as the projection advances across the image
  However, a given detector row or column no longer maps consistently onto a pixel row or column: the pixel row indices change with detector column, and vice versa
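The loss of row/column alignment is easy to see numerically. The sketch below rotates detector element centres about the origin and rounds down to a pixel index. The 10-degree angle, unit pitch, and function name are hypothetical values chosen only for illustration.

```python
import math

def pixel_of(det_row, det_col, angle_deg, pitch=1.0):
    """Pixel (row, col) under a detector element centre after rotating the detector."""
    theta = math.radians(angle_deg)
    x = det_col * pitch
    y = det_row * pitch
    x_rot = x * math.cos(theta) - y * math.sin(theta)
    y_rot = x * math.sin(theta) + y * math.cos(theta)
    return math.floor(y_rot), math.floor(x_rot)

# Pixel rows touched by detector row 0 as the detector column advances.
unrotated_rows = {pixel_of(0, col, 0.0)[0] for col in range(16)}
rotated_rows = {pixel_of(0, col, 10.0)[0] for col in range(16)}

print(sorted(unrotated_rows))
print(sorted(rotated_rows))
```

Without rotation a detector row stays on a single pixel row. With rotation the same detector row walks across several pixel rows, so row-wise memory access patterns can no longer be assumed.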
Algorithms Tomosynthesis = CT? No
                        CT                                        SBDX
  Perspective           Parallel to rays                          Perpendicular to rays
  Sample rate           <~500 Msps (high-end)                     7.7 Gsps
  Response time         ASAP                                      30 fps, < 100 ms latency
  Projection geometry   Irregular, varies with rotation angle     Regular, integer source step-size
  Reconstruction        Correct geometric distortion,             Allow geometric distortion,
                        filtered back-projection                  unfiltered back-projection
Algorithms Plane-Selection
  [Figure: Single Focal Plane vs. Best Focus]
Algorithms Plane-Selection
  Detect features of interest (things in focus) in each focal plane
    Algorithms may include matched filters, gradient estimation, topological operators, ...
  Calculate figures of merit
  Major impediments
    High levels of Poisson noise in dark regions
    Low contrast for small features
  Select which plane to display in the final image on a pixel-by-pixel basis
    Plane-to-plane comparison over a large number of planes
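A minimal sketch of pixel-by-pixel plane selection, using gradient magnitude as the figure of merit. The slide lists gradient estimation as one candidate among several, so this is an illustrative choice rather than the system's method. The data are hypothetical 1-D "planes", and ties go to the lowest plane index.

```python
def gradient_merit(plane):
    """Per-pixel figure of merit: absolute central difference, clamped at the edges."""
    n = len(plane)
    return [
        abs(plane[min(i + 1, n - 1)] - plane[max(i - 1, 0)])
        for i in range(n)
    ]

def select_planes(planes):
    """For every pixel, pick the plane with the highest figure of merit."""
    merits = [gradient_merit(p) for p in planes]
    n_pixels = len(planes[0])
    choice = [
        max(range(len(planes)), key=lambda k: merits[k][i])
        for i in range(n_pixels)
    ]
    image = [planes[choice[i]][i] for i in range(n_pixels)]
    return choice, image

# Two hypothetical planes: one has a sharp edge on the left, the other on the right.
planes = [
    [0, 10, 10, 5, 5, 5, 5, 5],
    [5, 5, 5, 5, 5, 0, 10, 10],
]
choice, image = select_planes(planes)
print(choice)
print(image)
```

A raw gradient like this one responds to noise as readily as to structure, which is why the Poisson noise and low contrast listed above are the main obstacles for any practical figure of merit.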
Application Live Image from GPU system
Algorithms Other Processing
  Artifact removal (per focal plane)
    Residue of reconstruction methods: pattern noise, gain corrections
  Dynamic range adjustment (per focal plane)
    Typical image dynamic range is far in excess of display capabilities and of the human visual system
  Noise management
    Noise is dominated by photon statistics rather than by scatter
  User-applied filters
    Temporal averaging with motion detection
    Edge enhancement
    Contrast enhancement
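Temporal averaging with motion detection can be sketched as a per-pixel recursive average that is bypassed wherever the frame-to-frame change is large. The blend factor, threshold, and function name below are hypothetical; the slide does not say how the system's filter is parameterised.

```python
def temporal_filter(previous, current, alpha=0.25, motion_threshold=20.0):
    """Blend each pixel toward the new frame unless it changed enough to count as motion.

    alpha and motion_threshold are illustrative values, not system settings.
    """
    output = []
    for old, new in zip(previous, current):
        if abs(new - old) > motion_threshold:
            output.append(float(new))                # motion: pass the new value through
        else:
            output.append(old + alpha * (new - old)) # static: average to suppress noise
    return output

previous = [100.0, 100.0, 100.0]
current = [104.0, 96.0, 180.0]   # the last pixel changes sharply
filtered = temporal_filter(previous, current)
print(filtered)
```

Small fluctuations, which are likely photon noise, are smoothed over time, while the pixel that changes sharply is updated immediately so that moving anatomy or devices do not smear.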
Algorithms Temporal Constraints
  Thermal loading of the x-ray target mandates re-scan of source locations
    Previously reconstructed pixels must be revisited
    Requires a large fraction of the final image to remain resident in memory for re-scanning
  Real-time feedback of physical manipulations: hand-eye coordination for the surgeon
    Imposes a maximum latency requirement of < ~100 ms along with a sustained 30 Hz frame rate
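The two timing figures translate directly into a frame-count budget. The sketch below computes it and checks a set of stage latencies against the limit. The stage names and timings are invented for the example and are not measurements from the system.

```python
FRAME_RATE_HZ = 30        # sustained frame rate from the slide
MAX_LATENCY_MS = 100.0    # approximate latency ceiling from the slide

frame_period_ms = 1000.0 / FRAME_RATE_HZ
latency_in_frames = MAX_LATENCY_MS * FRAME_RATE_HZ / 1000.0

def within_budget(stage_latencies_ms, limit_ms=MAX_LATENCY_MS):
    """True if the summed stage latencies stay under the latency limit."""
    return sum(stage_latencies_ms.values()) < limit_ms

# HYPOTHETICAL stage timings, for illustration only.
example_stages = {
    "scan": 33.3,
    "reconstruct": 33.3,
    "post_process": 15.0,
    "display": 16.7,
}

print(f"frame period: {frame_period_ms:.1f} ms")
print(f"latency budget: {latency_in_frames:.0f} frame periods")
print(within_budget(example_stages))
```

At 30 Hz the whole chain from x-ray exposure to displayed pixel has about three frame periods to complete, which leaves little room for buffering between stages.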
History Previous Implementations
  ~10x increase in resolution/calculations per generation
  1st and 2nd generations: FPGAs
    Fully custom parallel pipelines
    >$15k/focal plane x 16 focal planes = >$240k/system
    Memory-constrained
    Development and maintenance difficult
  3rd generation: MPPA (Ambric/Nethra)
    336 processors with local memory and a flexible data-distribution mesh
    Obsolete architecture
    Still used FPGAs for input formatting and post-processing
    ~$1500/focal plane x 32 focal planes = ~$48k/system
    Proprietary development environment
History Generation 2
  [Board photo legend]
  Blue: FPGAs, 1 focal plane/board
  Green: FPGAs for artifact removal and dynamic range management
  Separate board for plane selection
History Generation 3
  [Board photo legend]
  Blue: MPPAs, 1 focal plane/chip
  Green: FPGAs for data input/format, artifact removal, and dynamic range management
  Same board used for plane selection
History Traditional Processors
  Previous attempts to map the algorithms to common commercial processors failed: DSP, Cell, GPU
  Limitations:
    I/O: bandwidth
    Memory: available resources (results must be buffered for many focal planes)
    Memory: cache sizing (the working set falls off the cache)
    Memory: burst optimization (2-D array accesses are adjacent in one dimension but not in the other)
    Degree of management required by slower host processors
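The burst-access limitation follows from how a 2-D array is laid out in linear memory. The sketch below assumes a row-major layout and a 1000-pixel row width (matching the focal-plane size quoted earlier); the actual memory layout used by the earlier processors is not given in the slides.

```python
def linear_index(row, col, width):
    """Offset of element (row, col) in a row-major 2-D array."""
    return row * width + col

WIDTH = 1000  # pixels per row, matching the 1000x1000 focal planes

# Walk four neighbouring elements along a row, then along a column.
row_walk = [linear_index(5, col, WIDTH) for col in range(4)]
col_walk = [linear_index(row, 5, WIDTH) for row in range(4)]

row_strides = [b - a for a, b in zip(row_walk, row_walk[1:])]
col_strides = [b - a for a, b in zip(col_walk, col_walk[1:])]

print(row_strides)
print(col_strides)
```

Neighbours along a row sit next to each other in memory and can be fetched in one burst, while neighbours along a column are a full row apart, so each access touches a different burst or cache line.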
GPU Solution What is our configuration?
  9x K20: ~$850/focal plane x 32 planes = ~$27k/system
  1x GTX680 (for managing displays)
  PCIe 2.0 backplane
  Red Hat on Supermicro
  CUDA 5
GPU Solution Logical Configuration and Data Flow
  [Diagram. Components: X-ray Source and X-ray Detector (fiber links, framing); Re-scan Aggregator; Image Reconstruction (8x K20, fed over PCIe with multi-cast and RDMA); Supermicro host with 1x K20 for artifact removal, dynamic range management, and plane selection; GTX680 driving the display over HDMI; GigE Vision output to an external system; System Controller (mediation); Disk Array; 1000-base-T Ethernet switch]
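With 32 focal planes and eight K20 boards in the image-reconstruction stage, one natural partition is four planes per GPU. The sketch below shows such an assignment. The even, contiguous split and the function name are assumptions, since the slides do not state how planes are actually distributed.

```python
def assign_planes(n_planes, n_gpus):
    """Split plane indices into contiguous, near-equal groups, one group per GPU.

    ASSUMPTION: an even contiguous split; the real mapping is not documented.
    """
    base, extra = divmod(n_planes, n_gpus)
    assignment = []
    start = 0
    for gpu in range(n_gpus):
        count = base + (1 if gpu < extra else 0)
        assignment.append(list(range(start, start + count)))
        start += count
    return assignment

assignment = assign_planes(n_planes=32, n_gpus=8)
for gpu, plane_ids in enumerate(assignment):
    print(f"GPU {gpu}: planes {plane_ids[0]}-{plane_ids[-1]}")
```

Because every plane is reconstructed from the same sensor stream, each GPU in such a split needs the full input data but produces only its own subset of planes.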
GPU Solution Physical Configuration
  [Diagram. Image Reconstruction: two PCIe chassis (Cubix 8) holding eight K20s behind four Gen 2 PCIe switches with x16 links, receiving multi-cast sensor data; the Re-scan Aggregator and the chassis interconnect use x8 (Gen 1) links. Host Computer: microprocessor, one K20, and a GTX 680 behind a Gen 3 PCIe switch with x16 links, handling artifact removal, dynamic range management, plane selection, and display. The X-ray Source, X-ray Detector, Re-scan Aggregator, Disk Array, GigE Vision Display, and System Controller connect to a 1000-base-T Ethernet switch]
GPU Solution What is new now that allows it to work?
  Gen 2 PCIe interface with multi-cast
    High-enough bandwidth
    All planes use the same data set and must receive the same data stream
  GPUDirect or Remote DMA (RDMA)
    Allows source data to stream directly to the GPUs, bypassing the host
  Dynamic Parallelism
    Decreases latency by allowing management of parallel operations without host intervention
  Significant increase in fast shared memory
  Significant increase in core density
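A rough bandwidth comparison shows why multi-cast matters. PCIe 2.0 signals at 5 GT/s per lane with 8b/10b line coding, so an x16 link carries about 64 Gbps per direction before protocol overhead. The sketch assumes the full ~40 Gbps sensor stream must reach each of eight reconstruction GPUs; the actual rate after the re-scan aggregator is not given in the slides.

```python
SENSOR_GBPS = 40.0             # ~40 Gbps sensor downlink (Performance Summary)
RECON_GPUS = 8                 # K20 boards in the image-reconstruction stage
PCIE2_GT_PER_LANE = 5.0        # PCIe 2.0 signalling rate per lane, GT/s
ENCODING_EFFICIENCY = 8 / 10   # 8b/10b line coding
LANES = 16

# Usable bit rate of one x16 Gen 2 link, per direction, before protocol overhead.
link_gbps = PCIE2_GT_PER_LANE * ENCODING_EFFICIENCY * LANES

# ASSUMPTION: every GPU needs the full sensor stream.
unicast_source_gbps = SENSOR_GBPS * RECON_GPUS   # source sends one copy per GPU
multicast_source_gbps = SENSOR_GBPS              # switches replicate a single copy

print(f"x16 Gen 2 link: {link_gbps:.0f} Gbps")
print(f"unicast source load: {unicast_source_gbps:.0f} Gbps")
print(f"multi-cast source load: {multicast_source_gbps:.0f} Gbps")
print(multicast_source_gbps <= link_gbps < unicast_source_gbps)
```

Under these assumptions a single copy of the stream fits within one x16 Gen 2 link, whereas sending eight separate copies from the source would exceed it several times over.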
Application Live Image from GPU system
GPU Solution What would make it better?
  More shared memory: we are still bandwidth-limited!
  RDMA improvements
    Bidirectional: we have to get the images out as well as get the data in
    Peer-to-peer (GPU-to-GPU) communication/coordination without host intervention
  Better support for real-time operations
    Timeouts/host waits
    Support for code executing on the streaming multiprocessor
    CUDA API is optimized for batch operations, not streaming operations
  Better debugging support for multi-GPU systems
    Ability to isolate reporting to subsets
Reference
  S4363, "Accelerated X-ray Imaging: Real-Time Multi-Plane Image Reconstruction with CUDA", discusses an alternate implementation of the reconstruction algorithm
Thanks To
  Paul Kahn, Jamie Ku, and the rest of the TRT team
  NovaRay, Inc.
  NIH
  University of Wisconsin at Madison
  MultiCoreWare