Towards real-time image processing with Hierarchical Hybrid Grids


Towards real-time image processing with Hierarchical Hybrid Grids
International Doctorate Program Summer School, August 2011
Björn Gmeiner, joint work with Harald Köstler and Ulrich Rüde

Contents
1. The HHG Framework
2. Image processing for MRI
3. Real-time processing

The HHG Framework

Combining finite element and multigrid methods
An FE mesh may be unstructured, so which nodes should be removed for coarsening? This is not straightforward. Why not start from the coarse grid instead? That is the Hierarchical Hybrid Grids (HHG) concept.
Benjamin Bergen*: prototype. Tobias Gradl: tuning, extensions, and adaptivity.
* Dissertation in Erlangen, ISC Award in 2005. Currently at Los Alamos National Laboratory.

Properties of the HHG approach
Advantages:
- Multigrid is straightforward
- Very memory efficient
- Massive performance benefits on current computer architectures
- Supports parallelization
- 10^12 unknowns are possible
Limitations:
- A coarse input grid is needed
- Adaptivity (ongoing work by Tobias Gradl)

Two-grid cycle (correction scheme)

HHG primitives (2D example)
Figure: the HHG point classes in 2D — inner points, (macro) vertex points, (macro) edge points, and the ghost points used for communication.

Weak scalability of HHG on Blue Gene/P (Jugene):

  Cores   Struct. Regions           Unknowns    CG   Time
    128             1 536        534 776 319    15   5.64
    256             3 072      1 070 599 167    20   5.66
    512             6 144      2 142 244 863    25   5.69
   1024            12 288      4 286 583 807    30   5.71
   2048            24 576      8 577 357 823    45   5.75
   4096            49 152     17 158 905 855    60   5.92
   8192            98 304     34 326 194 175    70   5.86
  16384           196 608     68 669 157 375    90   5.91
  32768           393 216    137 355 083 775   105   6.17
  65536           786 432    274 743 709 695   115   6.41
 131072         1 572 864    549 554 511 871   145   6.42
 262144         3 145 728  1 099 176 116 223   180   6.81
 294912           294 912    824 365 314 047   110   3.80

Image processing for MRI
1. Denoising by homogeneous diffusion
2. High dynamic range compression

Domain generation (typical size: e.g. 1024³)
1. Static domain partitioning, parallel file reading
2. Find the relevant (information-containing) regions
3. Distribute only the relevant regions equally

1) Denoising by homogeneous diffusion

Image with noise: u₀ = Ru + η, where R is a linear operator incorporating blur (we assume R = Id) and η is additive noise (e.g. white Gaussian noise).

The simplest approach to reduce the noise (better: anisotropic diffusion) is to solve

  u − u₀ = α Δu,

with regularization parameter α > 0. Variational formulation:

  a(u, v) = ∫_Ω α ∇u · ∇v + u v dx,   f(v) = ∫_Ω u₀ v dx

Denoising by homogeneous diffusion (cont.)

  min J(u) := ½ a(u, u) − f(u)
  = min ½ ∫_Ω α ∇u · ∇u + u² dx − ∫_Ω u₀ u dx

  min 2J(u) = min ∫_Ω α ∇u · ∇u + u² − 2u₀u dx
  = min ∫_Ω α ∇u · ∇u + u² − 2u₀u + (u₀)² − (u₀)² dx

Since (u₀)² does not depend on u, this is equivalent to

  min ∫_Ω (u − u₀)² + α |∇u|² dx

2) High dynamic range compression

Steps:
1. Compute the gradient field ∇u₀
2. Manipulate the picture in the gradient domain, i.e. damp large gradients
3. Back transformation: recover u such that ∇u = k(∇u₀)

Real-time processing

Target platforms

Jugene (FZ Jülich):
- 4-way SMP processor with 32-bit PowerPC 450 cores
- 850 MHz clock frequency
- Bandwidth: 13.6 GB/s
- 2 GB main memory

lima (RRZE Erlangen):
- 2 hexa-core Xeon 5650 "Westmere" processors
- 2 660 MHz clock frequency
- Bandwidth: 32 GB/s
- 24 GB main memory

5-point stencil example (Blue Gene/P):

for (int j = 1; j < tsize - 1; ++j) {
  // lex. update (all points)
  for (int i = 1; i < tsize - 1; ++i) {
    u[k*tsize*tsize + j*tsize + i] =
        c[0] * (f[j*tsize + i] +
                c[1] * u[(j+1)*tsize + i] +
                c[2] * u[j*tsize + (i+1)] +
                c[3] * u[j*tsize + (i-1)] +
                c[4] * u[(j-1)*tsize + i]);
  }
}

Disjoint optimization (Blue Gene/P):

double *u2 = u;
for (int j = 1; j < tsize - 1; ++j) {
  // first update (red points only)
  for (int i = 1; i < tsize - 1; i += 2) {
    #pragma disjoint(*u, *f)
    #pragma disjoint(*u, *u2)
    #pragma disjoint(*u2, *f)
    u2[k*tsize*tsize + j*tsize + i] =
        c[0] * (f[j*tsize + i] +
                c[1] * u[(j+1)*tsize + i] +
                c[2] * u[j*tsize + (i+1)] +
                c[3] * u[j*tsize + (i-1)] +
                c[4] * u[(j-1)*tsize + i]);
  }
  // second update (black points only)
}

7-point stencil (Blue Gene/P)
Figure: MStencil/s (0-40) over problem size (0-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

27-point stencil (Blue Gene/P)
Figure: MStencil/s (0-10) over problem size (0-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint, index opt.

Different stencils (Blue Gene/P)
Figure: MStencil/s (0-40) over problem size (0-250) for the 7-point, 15-point, and 27-point stencils.

Strong scaling (Blue Gene/P)
Figure: Strong scaling of HHG on PowerPC 450 cores — time per V-cycle [s] (0.01-10, log scale) over the number of cores (up to 50,000). The test started from 512 cores, solving 2.14·10⁹ DoF.

7-point stencil (1 core per node, Westmere)
Figure: MStencil/s (0-300) over problem size (50-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

7-point stencil (12 cores per node, Westmere)
Figure: MStencil/s (0-300) over problem size (50-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

Next steps / Outlook
- Parallel file reading
- Implementation of varying coefficients
- Nonlinear isotropic and anisotropic diffusion regularizers

Thank you for your attention! Any questions?
The development of HHG was funded by the Elite Network of Bavaria within the International Doctorate Program "Identification, Optimization and Control with Applications in Modern Technologies", and by KONWIHR.