Ridgeway Kite: Innovative Technology for Reservoir Engineers. A Massively Parallel Architecture for Reservoir Simulation


Innovative Technology for Reservoir Engineers: A Massively Parallel Architecture for Reservoir Simulation. Garf Bowen, 16th Dec 2013

Summary
- Introduce RKS
- Reservoir Simulation
- HPC goals
- Implementation
- Simple example, results
- Full problem, results and challenges

RKS
- Start-up (April 2013)
- Long history in Reservoir Simulation
- Sister company: NITEC consulting
- Differentiators: massively parallel code, multiple realizations, unconventionals, coupled surface network

Reservoir Simulation
- Finite Volume
- Unstructured (features)
- Implicit: R = M − F = 0 (expanded below)
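For context, a minimal expansion of that residual in the standard fully-implicit finite-volume form; the slide shows only the abbreviated R = M − F = 0, so the terms below are the textbook version, not necessarily RKS's exact discretization:

    R_i = \frac{M_i^{n+1} - M_i^{n}}{\Delta t} - \sum_{j \in \mathrm{nbr}(i)} F_{ij}^{n+1} - q_i^{n+1} = 0

Here M_i is the accumulation in cell i, F_{ij} the flow across the face shared with neighbour j, and q_i the well term; all terms are evaluated at the new time level n+1 and the system is driven to R = 0 by Newton iterations.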

Driving from London to Manchester: do you check the Ferrari, or the traffic jam?
- Lots of code that all needs to go fast
- The challenge is often not to go slow
- Can't just focus on hot spots

HPC goals: "not to go slow"
- Portability: CPU/GPU/Phi (+clusters); want to be future-proof
- Simplification: (massive) parallelization is an opportunity
- Developer efficiency
- Same result on any platform

Shuffle-Calculate Pattern
- Scatter I/O from node zero
- Shuffle
- Calculate (one-to-one)
- Gather output
- Calculations are embarrassingly parallel
- No indirect addressing
- Ability to time each phase separately (see the sketch below)
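To make the pattern concrete, here is a minimal, self-contained sketch in C++; all names (MapEntry, shuffle, calculate) are illustrative assumptions, not XPL's actual API. The shuffle phase performs all indirect addressing up front, so the calculate phase is one-to-one over private slots:

#include <cstdio>
#include <vector>

// A map entry in "serial space": copy cells[src] into slot `slot` of flow `dest`.
struct MapEntry { int src, dest, slot; };

// Shuffle: the only phase with indirect addressing.
void shuffle(const std::vector<double>& cells,
             const std::vector<MapEntry>& map,
             std::vector<double>& slots)
{
    for (const MapEntry& e : map)
        slots[2 * e.dest + e.slot] = cells[e.src];
}

// Calculate: embarrassingly parallel, one-to-one over flows.
void calculate(const std::vector<double>& slots, std::vector<double>& flows)
{
    for (int f = 0; f < (int)flows.size(); ++f)
        flows[f] = slots[2 * f] - slots[2 * f + 1];  // e.g. a potential difference
}

int main()
{
    std::vector<double> cells = {10.0, 7.0, 4.0};     // 3 cells
    std::vector<MapEntry> map = {{0,0,0}, {1,0,1},    // flow 0: cells 0 and 1
                                 {1,1,0}, {2,1,1},    // flow 1: cells 1 and 2
                                 {0,2,0}, {2,2,1}};   // flow 2: cells 0 and 2
    std::vector<double> slots(6), flows(3);
    shuffle(cells, map, slots);                       // scatter/shuffle phase
    calculate(slots, flows);                          // calculate phase
    for (double f : flows) std::printf("%g\n", f);    // gather/output phase
}

Because the map fixes all data movement ahead of time, each phase can be timed separately and the calculate kernel ported unchanged across architectures.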

Example: calculate flows
- One flow involves two cells
- A different flow can involve the same cell
- One cell is involved in multiple flows
- Multiple copies ("slots") per flow
- More flows than cells (see the sketch below)
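Continuing the sketch above, the per-flow slot map can be built mechanically from a flow list; again the names are illustrative assumptions, not the real code:

#include <utility>
#include <vector>

struct MapEntry { int src, dest, slot; };

// Each flow f gets two private slots holding copies of its two cell
// values, so the flow kernel is one-to-one with no indirect addressing.
std::vector<MapEntry> buildFlowMap(const std::vector<std::pair<int,int> >& flows)
{
    std::vector<MapEntry> map;
    for (int f = 0; f < (int)flows.size(); ++f) {
        map.push_back(MapEntry{flows[f].first,  f, 0});  // copy of one cell
        map.push_back(MapEntry{flows[f].second, f, 1});  // copy of the other
    }
    return map;  // 2 * nFlows entries: typically more slots than cells
}

With three cells there can already be three flows (0-1, 1-2, 0-2), i.e. six slots for three cells; each cell that feeds several flows simply appears in several private copies.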

One code kernel, many (independent) calls
- Simplicity returns?
- Split to run MPI-distributed
- Vectorization on the CPU
- Underlying system: XPL
- Takes care of running in different modes, on different architectures
- Code looks serial again

Maps & MPI

    Src   Dest   Slot
    i1    j1     0
    i2    j2     1
    i3    j3     0
    i4    j4     1

- Maps are defined in serial space
- Not recommended
- Invocations: test.exe cpu; test.exe gpu; mpirun -np 16 test.exe (see the sketch below)
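A hedged sketch of how a serial-space map might be split across MPI ranks; owner, localize, and the bucket names are assumptions for illustration, not XPL's real machinery:

#include <vector>

struct MapEntry { int src, dest, slot; };

// Split a serial-space map by ownership. Entries that cross a rank
// boundary become MPI sends/receives; purely local entries stay as
// memory copies. The kernel code never sees this distinction.
void localize(const std::vector<MapEntry>& serialMap,
              const std::vector<int>& owner, int myRank,
              std::vector<MapEntry>& localCopies,
              std::vector<MapEntry>& recvEntries,
              std::vector<MapEntry>& sendEntries)
{
    for (const MapEntry& e : serialMap) {
        if (owner[e.dest] == myRank && owner[e.src] == myRank)
            localCopies.push_back(e);
        else if (owner[e.dest] == myRank)
            recvEntries.push_back(e);   // data arrives from owner[e.src]
        else if (owner[e.src] == myRank)
            sendEntries.push_back(e);   // data goes to owner[e.dest]
    }
}

Because the map itself lives in serial space, the same executable can run as test.exe cpu, test.exe gpu, or under mpirun -np 16 without changing user code.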

Simple Example
- x_i = A_i^{-1} r_i, for all i
- A: an n*n small dense matrix; ~millions of i's
- LU factorization (partial pivoting)

template<typename KP> struct Testinv
{
    __host__ __device__ Testinv(Args* inargs, int index, int N)
    {
        int ia = 0;
        mat<double,KP> a(inargs, ia++, index);
        vec<double,KP> r(inargs, ia++, index);
        vec<double,KP> x(inargs, ia++, index);
        mat<double,KP> w(inargs, ia++, index);
        w = a;         // work copy of A
        w.inv();       // invert via LU (partial pivoting)
        x.zero();
        w.mult(r, x);  // x = A^{-1} r
    }
};

case rks::testkernels::test_inv:
    calc(inargs, gpu<Testinv<KP> >, cpu<Testinv<KP> >,
         omp<Testinv<KP> >, phi<Testinv<KP> >);
    break;

Layout
- Array-of-structures (CPU friendly): consecutive indices 0, 1, 2, ..., 17
- Structure-of-arrays (GPU friendly): strided indices 0, n, 2n, ..., 8n; 1, n+1, 2n+1, ..., 8n+1
- Templated policy <KP>
- Run-time switch
- MPI jobs using both CPU & GPU
- Future proof?
- Prevents cheating: no double* ptr (see the sketch below)
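The slide does not show the policy code itself, so here is a hedged sketch of what a templated layout policy (the <KP> parameter) could look like; AoS, SoA, at, and elem are illustrative names:

// Two index policies with the same interface: kernels written against
// a policy template never touch a raw double*, so the layout can be
// switched without changing kernel code.
struct AoS {  // array-of-structures: an item's fields are contiguous (CPU friendly)
    static int at(int item, int field, int nFields, int /*nItems*/)
    { return item * nFields + field; }
};
struct SoA {  // structure-of-arrays: a field's items are contiguous (GPU friendly)
    static int at(int item, int field, int /*nFields*/, int nItems)
    { return field * nItems + item; }
};

template<typename KP>
double& elem(double* data, int item, int field, int nFields, int nItems)
{ return data[KP::at(item, field, nFields, nItems)]; }

With SoA, consecutive GPU threads (consecutive items) read consecutive addresses, which is what coalesced memory access wants; with AoS, each item's working set stays in one cache line on the CPU.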

Performance
- Scaling by matrix size, 1e6 matrices (10 times): log time (secs) vs log dense matrix size (2, 4, 8), CPU vs GPU
- Log n scaling fits: y = 2.35x + 2.31 (CPU), y = 2.23x + 1.20 (GPU)
- Scaling for the 3*3 case (10 times): log time (secs) vs log number of matrices (5.00E+05 to 5.00E+07), CPU vs GPU

Effect of layout
- GPU: effect of layout. Log time (secs) vs log dense matrix size (2, 4, 8), s-of-a vs a-of-s
- CPU: effect of layout. Same axes, s-of-a vs a-of-s

Now add complexity
- Per-kernel timing comparison between CPU and GPU: total 1243.630 s (CPU) vs 147.960 s (GPU)
- Kernels timed: well (40 calls), jac (40), mass (40), flow (40), flow_ (4640), norm, lin (30), ling (30), lins (30), orth-it (30), precon, pressure

Linear Solver Strategy
- The linear solver is an important communication mechanism, and a challenge in parallel environments, like getting the same results everywhere
- If we can implement a solver in XPL, then we get this for free
- But we're only a small company, and don't really want to be linear solver experts
- Home grown: may not be competitive
- Using Nvidia's AmgX: lose the "same algorithm" property, but it performs

Linear Solver
- Home grown: massively helpful for development; challenged on difficult problems
- AmgX: many options (pre-coded); single GPU working well; MPI is a challenge; the implementation has to fit around it; some solvers missing

Summary & Conclusions
- Shuffle-Calculate pattern: works for us, so far
- Portable: allowing us to exploit the GPU
- Full system: commercial offering next year

Acknowledgements
- Co-authors: Bachar Zineddin & Tommy Miller
- The authors would like to acknowledge that the work presented here made use of the IRIDIS/EMERALD HPC facility provided by the Centre for Innovation
- Nvidia, for AmgX beta access

Questions?

Backup#1: LU code example

// Main elimination loop (Crout LU with partial pivoting)
for (int j = 0; j < m_xdim; j++) {
    // Sum: update column j above the diagonal
    for (int i = 0; i < j; i++) {
        double sum = (*this)(i,j);
        for (int k = 0; k < i; k++) {
            sum = sum - (*this)(i,k) * (*this)(k,j);
        }
        (*this)(i,j) = sum;
    }
    // Max: update the rest of column j and find the pivot row
    double aamax = 0.0;
    int imax = j;
    for (int i = j; i < m_xdim; i++) {
        double sum = (*this)(i,j);
        for (int k = 0; k < j; k++) {
            sum = sum - (*this)(i,k) * (*this)(k,j);
        }
        (*this)(i,j) = sum;
        if (std::fabs(vv[i] * sum) >= aamax) {
            imax = i;
            aamax = std::fabs(vv[i] * sum);
        }
    }
    // Swap: interchange rows j and imax if needed
    if (j != imax) {
        for (int k = 0; k < m_xdim; k++) {
            double dum = (*this)(imax,k);
            (*this)(imax,k) = (*this)(j,k);
            (*this)(j,k) = dum;
        }
        vv[imax] = vv[j];
    }
    // Store the pivot row; guard against a zero pivot
    piv[j] = imax;
    if ((*this)(j,j) == 0.0) { (*this)(j,j) = 1e-20; }
    // Set: scale the sub-diagonal entries of column j
    if (j != m_xdim - 1) {
        double dum = 1.0 / (*this)(j,j);
        for (int i = j + 1; i < m_xdim; i++) {
            (*this)(i,j) = (*this)(i,j) * dum;
        }
    }
} // End LU step

Backup#2: Home Grown Solver

    \begin{pmatrix} A_{ww} & A_{wb} \\ A_{bw} & A_{bb} \end{pmatrix}
    \begin{pmatrix} x_w \\ x_b \end{pmatrix} =
    \begin{pmatrix} R_w \\ R_b \end{pmatrix}

    \begin{pmatrix} A_{ww} & 0 \\ A_{bw} & \tilde{A}_{bb} \end{pmatrix}
    \begin{pmatrix} I & A_{ww}^{-1} A_{wb} \\ 0 & I \end{pmatrix}
    \begin{pmatrix} x_w \\ x_b \end{pmatrix} =
    \begin{pmatrix} R_w \\ R_b \end{pmatrix},
    \qquad \tilde{A}_{bb} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}

Note: (1 - x)^{-1} = 1 + x + x^2 + x^3 + \dots, with x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}
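To connect the Schur complement with the Neumann-series note (a step the slide leaves implicit):

    \tilde{A}_{bb} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb} = (I - x)\, A_{bb},
    \qquad x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1},

so that

    \tilde{A}_{bb}^{-1} = A_{bb}^{-1} (I - x)^{-1}
                        = A_{bb}^{-1} (I + x + x^2 + x^3 + \dots).

Truncating the series lets the solver apply an approximate Schur complement inverse using only solves with A_{bb} and A_{ww}.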