Computation of Mutual Information Metric for Image Registration on Multiple GPUs

Andrew V. Adinetz (1), Markus Axer (2), Marcel Huysegoms (2), Stefan Köhnen (2), Jiri Kraus (3), Dirk Pleiter (1)
(1) JSC, Forschungszentrum Jülich
(2) INM-1, Forschungszentrum Jülich
(3) NVIDIA GmbH
Presented at the HeteroPar'13 workshop of Euro-Par'13

2 Outline
- Brain Image Registration
- Multi-GPU Implementation
  - system memory
  - listupdate
- Performance Evaluation
- Conclusion

3 Preparation of the brain

4 Pushing the limits for a cellular brain model
BigBrain: first high-resolution brain model at microscopic scale
- 7404 histological sections stained for cell bodies
- scanned with a flatbed scanner
- original resolution 10 × 10 × 20 μm³ (11,000 × 13,000 pixels)
- downscaling to 20 μm isotropic
- removal of artifacts
- 1 Terabyte
In cooperation with Alan Evans, McGill, Montreal. Amunts et al. (2013), Science.

8 Image Registration
- Registration = process of image alignment
- ITK Workflow

9 Mutual Information Metric
\[ MI(I_f, I_m) = \sum_{i,j} p(i,j) \log_2 \frac{p(i,j)}{p_f(i)\, p_m(j)} \]
\[ p_f(i) = \sum_{j} p(i,j), \qquad p_m(j) = \sum_{i} p(i,j) \]
- i, j: pixel values
- successful for multi-modal registration

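10 Two Image Cross-Histogram

    /* images and histogram are stored row-major; num_bins (bins per image) is an assumed name */
    for (int y = 0; y < fixed_sz_y; y++)
        for (int x = 0; x < fixed_sz_x; x++) {
            int i = bin(fixed[y * fixed_sz_x + x]);    /* bin of the fixed-image pixel */
            float x1 = transform_x(x, y);              /* position in the moving image */
            float y1 = transform_y(x, y);
            int j = bin(interpolate(moving, x1, y1));  /* bin of the interpolated moving pixel */
            histogram[i * num_bins + j]++;             /* atomic on GPU */
        }

- main computational kernel
- transform can be complex (1000+ parameters)
- GPU implementation: 1 pixel/thread, atomics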

11 Large Data Size
Large-area Polarimeter
- pixel size: 60 × 60 µm
- file size: 30 MB
Polarizing Microscope
- pixel size: 1.6 × 1.6 µm
- file size: 40 GB
Need multiple GPUs!

12 Multi-GPU Mutual Information
Domain decomposition
- distribute fixed and moving images
- histogram contributions summed up
Moving image: how to handle?
- irregular access pattern
Approaches
- System memory replication (sysmem)
- Listupdate (listupdate)
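
The domain decomposition can be sketched as two small helpers: one assigns a contiguous band of fixed-image rows to each GPU, the other sums the per-GPU partial histograms. This is an illustrative sketch only. The slides do not say how the image is split or how the sum is carried out, so row-wise splitting and the names `row_range` and `reduce_histograms` are assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Rows [*begin, *end) of the fixed image owned by `rank` out of `nranks`.
   Leftover rows are handed to the lowest ranks, one each. */
void row_range(size_t rows, size_t nranks, size_t rank,
               size_t *begin, size_t *end)
{
    assert(nranks > 0 && rank < nranks);
    size_t base = rows / nranks;
    size_t extra = rows % nranks;
    *begin = rank * base + (rank < extra ? rank : extra);
    *end = *begin + base + (rank < extra ? 1 : 0);
}

/* Sum nranks partial histograms (each nentries long, stored back to back
   in `partial`) into `total`. */
void reduce_histograms(const unsigned *partial, size_t nranks,
                       size_t nentries, unsigned *total)
{
    for (size_t k = 0; k < nentries; k++)
        total[k] = 0;
    for (size_t r = 0; r < nranks; r++)
        for (size_t k = 0; k < nentries; k++)
            total[k] += partial[r * nentries + k];
}
```

In a distributed run the second helper would typically be replaced by a collective reduction such as `MPI_Allreduce`. The decomposition of the fixed image is the easy part, because each GPU reads only its own rows. The moving image is the hard part, because the transform can send a pixel to any location.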

13 System Memory Replication
Replicate entire moving image in pinned host RAM accessible to GPU
(+) easy to implement
(-) system memory accesses are slower
(-) cannot use texture interpolation
Optimizations
- moving image halo in GPU RAM
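
The halo optimization can be pictured as a per-access decision: if the transformed coordinate falls inside the band of the moving image cached in GPU RAM, read it there, otherwise fall back to the replicated copy in system memory. The sketch below makes several assumptions that the slides do not confirm: the image is split into row bands, the halo is a fraction of the band height, and interpolation is bilinear. The names `slab_t`, `with_halo` and `in_gpu_ram` are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Half-open band of moving-image rows [row_begin, row_end). */
typedef struct {
    long row_begin;
    long row_end;
} slab_t;

/* Extend the owned band by halo_fraction of its height on both sides,
   clamped to the image. */
slab_t with_halo(slab_t owned, double halo_fraction, long image_rows)
{
    assert(owned.row_begin <= owned.row_end);
    long halo = (long)(halo_fraction * (double)(owned.row_end - owned.row_begin));
    slab_t cached = { owned.row_begin - halo, owned.row_end + halo };
    if (cached.row_begin < 0)
        cached.row_begin = 0;
    if (cached.row_end > image_rows)
        cached.row_end = image_rows;
    return cached;
}

/* Returns 1 if both rows needed to interpolate at row coordinate y1
   (rows floor(y1) and floor(y1) + 1) are cached in GPU RAM, else 0. */
int in_gpu_ram(slab_t cached, float y1)
{
    if (y1 < 0.0f)
        return 0;
    long r = (long)y1;  /* equals floor(y1) for non-negative y1 */
    return r >= cached.row_begin && r + 1 < cached.row_end;
}
```

A larger halo moves more accesses from the slow path to the fast path, at the cost of GPU memory.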

14 Listupdate
- On remote access: send message
- On receiving message: compute contributions
Active messaging variant
- buffering
- relies on undocumented features
Listupdate
- chunking
- buffer size bounded
- communication-computation overlap

    typedef struct {
        float movingcoords[2];
        short destrank;
        char fixedbin;
    } message_t;

15 Writeout: Atomics vs Grouping
Atomics
- determine write position using atomics
- warp-aggregated increment
Grouping
- write to per-pixel buffer
- group (compress)
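
The grouping variant amounts to a stream compaction: every pixel has a slot in a per-pixel buffer, only some slots hold a message, and an exclusive prefix sum over the validity flags gives each valid message its position in the packed output. Below is a sequential C sketch of that idea under assumed names (`exclusive_scan`, `group_messages`). On the GPU both steps would run in parallel, and the slides do not show the actual kernel.

```c
#include <assert.h>
#include <stddef.h>

typedef struct {
    float movingcoords[2];
    short destrank;
    char fixedbin;
} message_t;

/* offset[p] = number of valid slots before p. Returns the total number
   of valid slots. */
size_t exclusive_scan(const unsigned char *valid, size_t n, size_t *offset)
{
    size_t sum = 0;
    for (size_t p = 0; p < n; p++) {
        offset[p] = sum;
        sum += valid[p] ? 1 : 0;
    }
    return sum;
}

/* Pack the valid slots of the per-pixel buffer contiguously into `out`,
   preserving pixel order. `offset` is scratch space of length n. */
size_t group_messages(const message_t *per_pixel, const unsigned char *valid,
                      size_t n, size_t *offset, message_t *out)
{
    assert(per_pixel && valid && offset && out);
    size_t count = exclusive_scan(valid, n, offset);
    for (size_t p = 0; p < n; p++)
        if (valid[p])
            out[offset[p]] = per_pixel[p];  /* each write goes to a unique slot */
    return count;
}
```

The atomics variant instead has each thread reserve its output position by incrementing a shared counter. That avoids the per-pixel buffer and the scan, but serializes on the counter unless increments are aggregated per warp.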

16 Chunk Processing and Overlap
[Figure: fixed image with axes x, y and origin (0,0), split into chunks 1 and 2; pipeline stages per chunk: Process chunk, Group, Exchange, Handle messages]

17 Listupdate

    typedef struct {
        float movingcoords[2];
        short destrank;
        char fixedbin;
    } message_t;

(+) computation-communication overlap
(-) hard to implement
(-) chunk processing (or won't fit into buffer)
Optimizations
- buffers: AoS vs. SoA
- atomics vs. grouping
- using multiple streams

18 Benchmark setup
[Figure: Fixed Image and Fixed Image Mask with axes x, y and origin (0,0); the region requiring remote access is marked]

19 Test Hardware
JUDGE: 256-node GPU cluster
- each M2070 node: 2x M2070 (Fermi) GPU, each 6 GB RAM
- 12-core X… CPU, 96 GB RAM
JuHydra: single-node Kepler machine
- 2x K20X (Kepler) GPU, each 6 GB RAM
- 16-core E… CPU, 64 GB RAM

20 Baseline: Full Replication (M2070)
[Plot: runtime in seconds vs. rotation angle; series: 1 GPU, 2 GPUs, 4 GPUs]
ideal scalability

21 Sysmem on Fermi
[Plot: runtime in seconds vs. rotation angle; series: 1 GPU, 2 GPUs, baseline 2 GPUs]

22 Sysmem on Fermi: Explanation
Sysmem accesses    Coalescing
none               good
few                bad
many               bad
most               good

23 Sysmem on Fermi: PCI-E Queries
[Plot: runtime in seconds and number of sysmem queries vs. rotation angle; series: 2 GPUs, baseline 2 GPUs, total sysmem queries]

24 Sysmem: Halo Sizes
[Plot: time (s) vs. angle (degrees); series on 2 K20X: baseline, sysmem, 5% halo, 10% halo, 15% halo, 20% halo, 25% halo]
mostly quantitative, not qualitative difference

25 Listupdate: Multiple Streams
[Plot: time (s) vs. angle (degrees); series on 2 K20X: 1 stream, 2 streams, 3 streams, 4 streams]
4 streams look the best

26 Listupdate: AoS vs SoA, Atomics vs Group

    typedef struct {
        float movingcoords[2];
        char fixedbin;
    } message_t;

[Plot: time (s) vs. angle (degrees); series on 2 K20X: SoA, AoS, compress]
SoA + atomics looks best
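
The AoS and SoA buffer layouts differ only in how the same message fields are arranged in memory. The sketch below shows one possible SoA counterpart of `message_t`. The type `message_soa_t` and the helpers `soa_push` and `soa_get` are hypothetical names introduced for illustration and are not taken from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* AoS element, as on the slide: one struct per message. */
typedef struct {
    float movingcoords[2];
    char fixedbin;
} message_t;

/* SoA buffer (assumed layout): one array per field, so that consecutive
   messages have consecutive addresses within each field. */
typedef struct {
    float *x;         /* movingcoords[0] of every message */
    float *y;         /* movingcoords[1] of every message */
    char *fixedbin;   /* fixed-image bin of every message */
    size_t count;     /* messages currently stored */
    size_t capacity;  /* length of each array */
} message_soa_t;

/* Append one message to the SoA buffer. */
void soa_push(message_soa_t *buf, message_t m)
{
    assert(buf->count < buf->capacity);
    size_t k = buf->count++;
    buf->x[k] = m.movingcoords[0];
    buf->y[k] = m.movingcoords[1];
    buf->fixedbin[k] = m.fixedbin;
}

/* Reassemble message k from the SoA buffer. */
message_t soa_get(const message_soa_t *buf, size_t k)
{
    assert(k < buf->count);
    message_t m = { { buf->x[k], buf->y[k] }, buf->fixedbin[k] };
    return m;
}
```

With SoA, neighbouring GPU threads that each handle one message read neighbouring addresses of the same field, which is the access pattern that coalesces well. With AoS, the same reads are strided by the struct size, including any padding the compiler adds after `fixedbin`.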

27 Sysmem vs. Listupdate: Fermi
[Plot: time (s) vs. angle (degrees); series on 4 M2070: SoA, baseline, sysmem, 25% halo]
on Fermi, sysmem is better

28 Sysmem vs. Listupdate: Kepler (Closeup)
[Plot: time (s) vs. angle (degrees); series on 2 K20X: SoA, baseline, sysmem, 25% halo]
on Kepler, listupdate is better

29 Conclusions
Fermi
- performance limited by atomics
- system memory replication is better
Kepler
- 10x faster than Fermi
- no longer dominated by atomics
- listupdate (atomic, SoA, 4 streams) is better
Future work
- Compression
- Trials on real images

30 Questions?
Contacts: INM-1 at FZJ; NVIDIA Application Lab at FZJ; Andrew V. Adinetz; Jiri Kraus; Dirk Pleiter
