Towards real-time image processing with Hierarchical Hybrid Grids


Towards real-time image processing with Hierarchical Hybrid Grids
International Doctorate Program Summer School, August 2011
Björn Gmeiner, joint work with Harald Köstler and Ulrich Rüde

Contents
1. The HHG Framework
2. Image processing for MRI
3. Real-time processing

The HHG Framework

Combining finite element and multigrid methods
An FE mesh may be unstructured, so which nodes should be removed for coarsening? This is not straightforward. Why not start from the coarse grid instead? That is the Hierarchical Hybrid Grids (HHG) concept.
Benjamin Bergen*: prototype. Tobias Gradl: tuning, extensions, and adaptivity.
* Dissertation in Erlangen, ISC Award in 2005. Currently at Los Alamos National Laboratory.

Properties of the HHG approach
Advantages:
- Multigrid is straightforward
- Very memory efficient
- Massive performance benefits on current computer architectures
- Supports parallelization
- 10^12 unknowns are possible
Limitations:
- A coarse input grid is needed
- Adaptivity (ongoing work by Tobias Gradl)

Two-grid cycle (correction scheme)

HHG primitives (2D example)
Figure: the HHG point classes in 2D — inner points, (macro) vertex points, (macro) edge points, and the ghost points used for communication.

Weak scalability of HHG on Blue Gene/P (Jugene):

  Cores   Struct. Regions           Unknowns    CG   Time
    128             1 536        534 776 319    15   5.64
    256             3 072      1 070 599 167    20   5.66
    512             6 144      2 142 244 863    25   5.69
   1024            12 288      4 286 583 807    30   5.71
   2048            24 576      8 577 357 823    45   5.75
   4096            49 152     17 158 905 855    60   5.92
   8192            98 304     34 326 194 175    70   5.86
  16384           196 608     68 669 157 375    90   5.91
  32768           393 216    137 355 083 775   105   6.17
  65536           786 432    274 743 709 695   115   6.41
 131072         1 572 864    549 554 511 871   145   6.42
 262144         3 145 728  1 099 176 116 223   180   6.81
 294912           294 912    824 365 314 047   110   3.80

Image processing for MRI
1. Denoising by homogeneous diffusion
2. High dynamic range compression

Domain generation (typical size: e.g. 1024³)
1. Static domain partitioning, parallel file reading
2. Find the relevant (information-containing) regions
3. Distribute only the relevant regions equally

1) Denoising by homogeneous diffusion

Image with noise: u₀ = Ru + η, where R is a linear operator incorporating blur (we assume R = Id) and η is additive noise (e.g. white Gaussian noise).

The simplest approach to reduce the noise (better: anisotropic diffusion) is to solve

  u − u₀ = α Δu,

with regularization parameter α > 0. Variational formulation:

  a(u, v) = ∫_Ω α ∇u · ∇v + u v dx,   f(v) = ∫_Ω u₀ v dx

Denoising by homogeneous diffusion (cont.)

  min J(u) := ½ a(u, u) − f(u)
  = min ½ ∫_Ω α ∇u · ∇u + u² dx − ∫_Ω u₀ u dx

  min 2J(u) = min ∫_Ω α ∇u · ∇u + u² − 2u₀u dx
  = min ∫_Ω α ∇u · ∇u + u² − 2u₀u + (u₀)² − (u₀)² dx

Since (u₀)² does not depend on u, this is equivalent to

  min ∫_Ω (u − u₀)² + α |∇u|² dx

2) High dynamic range compression

Steps:
1. Compute the gradient field ∇u₀
2. Manipulate the picture in the gradient domain, i.e. damp large gradients
3. Back transformation: recover u such that ∇u = k(∇u₀)

Real-time processing

Target platforms

Jugene (FZ Jülich):
- 4-way SMP processor with 32-bit PowerPC 450 cores
- 850 MHz clock frequency
- Bandwidth: 13.6 GB/s
- 2 GB main memory

lima (RRZE Erlangen):
- 2 hexa-core Xeon 5650 "Westmere" processors
- 2 660 MHz clock frequency
- Bandwidth: 32 GB/s
- 24 GB main memory

5-point stencil example (Blue Gene/P):

for (int j = 1; j < tsize - 1; ++j) {
  // lex. update (all points)
  for (int i = 1; i < tsize - 1; ++i) {
    u[k*tsize*tsize + j*tsize + i] =
        c[0] * (f[j*tsize + i] +
                c[1] * u[(j+1)*tsize + i] +
                c[2] * u[j*tsize + (i+1)] +
                c[3] * u[j*tsize + (i-1)] +
                c[4] * u[(j-1)*tsize + i]);
  }
}

Disjoint optimization (Blue Gene/P):

double *u2 = u;
for (int j = 1; j < tsize - 1; ++j) {
  // first update (red points only)
  for (int i = 1; i < tsize - 1; i += 2) {
    #pragma disjoint(*u, *f)
    #pragma disjoint(*u, *u2)
    #pragma disjoint(*u2, *f)
    u2[k*tsize*tsize + j*tsize + i] =
        c[0] * (f[j*tsize + i] +
                c[1] * u[(j+1)*tsize + i] +
                c[2] * u[j*tsize + (i+1)] +
                c[3] * u[j*tsize + (i-1)] +
                c[4] * u[(j-1)*tsize + i]);
  }
  // second update (black points only)
}

7-point stencil (Blue Gene/P)
Figure: MStencil/s (0-40) over problem size (0-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

27-point stencil (Blue Gene/P)
Figure: MStencil/s (0-10) over problem size (0-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint, index opt.

Different stencils (Blue Gene/P)
Figure: MStencil/s (0-40) over problem size (0-250) for the 7-point, 15-point, and 27-point stencils.

Strong scaling (Blue Gene/P)
Figure: Strong scaling of HHG on PowerPC 450 cores — time per V-cycle [s] (0.01-10, log scale) over the number of cores (up to 50,000). The test started from 512 cores, solving 2.14·10⁹ DoF.

7-point stencil (1 core per node, Westmere)
Figure: MStencil/s (0-300) over problem size (50-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

7-point stencil (12 cores per node, Westmere)
Figure: MStencil/s (0-300) over problem size (50-250) for lex. Gauss-Seidel; RRB Gauss-Seidel disjoint; RRB Gauss-Seidel disjoint, index opt.

Next steps / Outlook
- Parallel file reading
- Implementation of varying coefficients
- Nonlinear isotropic and anisotropic diffusion regularizers

Thank you for your attention! Any questions?
The development of HHG was funded by the Elite Network of Bavaria within the International Doctorate Program "Identification, Optimization and Control with Applications in Modern Technologies", and by KONWIHR.