Ridgeway Kite: Innovative Technology for Reservoir Engineers. A Massively Parallel Architecture for Reservoir Simulation


Innovative Technology for Reservoir Engineers: A Massively Parallel Architecture for Reservoir Simulation. Garf Bowen, 16th Dec 2013

Summary
- Introduce RKS
- Reservoir Simulation
- HPC goals
- Implementation
- Simple example, results
- Full problem, results and challenges

RKS
- Start-up (April 2013)
- Long history in Reservoir Simulation
- Sister company: NITEC consulting
- Differentiators: massively parallel code, multiple realizations, unconventionals, coupled surface network

Reservoir Simulation
- Finite Volume
- Unstructured (features)
- Implicit: R = M − F = 0 (expanded below)
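For context, a minimal expansion of that residual in the standard fully-implicit finite-volume form; the slide shows only the abbreviated R = M − F = 0, so the terms below are the textbook version, not necessarily RKS's exact discretization:

    R_i = \frac{M_i^{n+1} - M_i^{n}}{\Delta t} - \sum_{j \in \mathrm{nbr}(i)} F_{ij}^{n+1} - q_i^{n+1} = 0

Here M_i is the accumulation in cell i, F_{ij} the flow across the face shared with neighbour j, and q_i the well term; all terms are evaluated at the new time level n+1 and the system is driven to R = 0 by Newton iterations.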

Driving from London to Manchester: do you check the Ferrari, or the traffic jam?
- Lots of code that all needs to go fast
- The challenge is often not to go slow
- Can't just focus on hot spots

HPC goals: "not to go slow"
- Portability: CPU/GPU/Phi (+clusters); want to be future-proof
- Simplification: (massive) parallelization is an opportunity
- Developer efficiency
- Same result on any platform

Shuffle-Calculate Pattern
- Scatter I/O from node zero
- Shuffle
- Calculate (one-to-one)
- Gather output
- Calculations are embarrassingly parallel
- No indirect addressing
- Ability to time each phase separately (see the sketch below)
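To make the pattern concrete, here is a minimal, self-contained sketch in C++; all names (MapEntry, shuffle, calculate) are illustrative assumptions, not XPL's actual API. The shuffle phase performs all indirect addressing up front, so the calculate phase is one-to-one over private slots:

#include <cstdio>
#include <vector>

// A map entry in "serial space": copy cells[src] into slot `slot` of flow `dest`.
struct MapEntry { int src, dest, slot; };

// Shuffle: the only phase with indirect addressing.
void shuffle(const std::vector<double>& cells,
             const std::vector<MapEntry>& map,
             std::vector<double>& slots)
{
    for (const MapEntry& e : map)
        slots[2 * e.dest + e.slot] = cells[e.src];
}

// Calculate: embarrassingly parallel, one-to-one over flows.
void calculate(const std::vector<double>& slots, std::vector<double>& flows)
{
    for (int f = 0; f < (int)flows.size(); ++f)
        flows[f] = slots[2 * f] - slots[2 * f + 1];  // e.g. a potential difference
}

int main()
{
    std::vector<double> cells = {10.0, 7.0, 4.0};     // 3 cells
    std::vector<MapEntry> map = {{0,0,0}, {1,0,1},    // flow 0: cells 0 and 1
                                 {1,1,0}, {2,1,1},    // flow 1: cells 1 and 2
                                 {0,2,0}, {2,2,1}};   // flow 2: cells 0 and 2
    std::vector<double> slots(6), flows(3);
    shuffle(cells, map, slots);                       // scatter/shuffle phase
    calculate(slots, flows);                          // calculate phase
    for (double f : flows) std::printf("%g\n", f);    // gather/output phase
}

Because the map fixes all data movement ahead of time, each phase can be timed separately and the calculate kernel ported unchanged across architectures.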

Example: calculate flows
- One flow involves two cells
- A different flow can involve the same cell
- One cell is involved in multiple flows
- Multiple copies ("slots") per flow
- More flows than cells (see the sketch below)
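Continuing the sketch above, the per-flow slot map can be built mechanically from a flow list; again the names are illustrative assumptions, not the real code:

#include <utility>
#include <vector>

struct MapEntry { int src, dest, slot; };

// Each flow f gets two private slots holding copies of its two cell
// values, so the flow kernel is one-to-one with no indirect addressing.
std::vector<MapEntry> buildFlowMap(const std::vector<std::pair<int,int> >& flows)
{
    std::vector<MapEntry> map;
    for (int f = 0; f < (int)flows.size(); ++f) {
        map.push_back(MapEntry{flows[f].first,  f, 0});  // copy of one cell
        map.push_back(MapEntry{flows[f].second, f, 1});  // copy of the other
    }
    return map;  // 2 * nFlows entries: typically more slots than cells
}

With three cells there can already be three flows (0-1, 1-2, 0-2), i.e. six slots for three cells; each cell that feeds several flows simply appears in several private copies.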

One code kernel, many (independent) calls
- Simplicity returns?
- Split to run MPI-distributed
- Vectorization on the CPU
- Underlying system: XPL
- Takes care of running in different modes, on different architectures
- Code looks serial again

Maps & MPI

    Src   Dest   Slot
    i1    j1     0
    i2    j2     1
    i3    j3     0
    i4    j4     1

- Maps are defined in serial space
- Not recommended
- Invocations: test.exe cpu; test.exe gpu; mpirun -np 16 test.exe (see the sketch below)
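A hedged sketch of how a serial-space map might be split across MPI ranks; owner, localize, and the bucket names are assumptions for illustration, not XPL's real machinery:

#include <vector>

struct MapEntry { int src, dest, slot; };

// Split a serial-space map by ownership. Entries that cross a rank
// boundary become MPI sends/receives; purely local entries stay as
// memory copies. The kernel code never sees this distinction.
void localize(const std::vector<MapEntry>& serialMap,
              const std::vector<int>& owner, int myRank,
              std::vector<MapEntry>& localCopies,
              std::vector<MapEntry>& recvEntries,
              std::vector<MapEntry>& sendEntries)
{
    for (const MapEntry& e : serialMap) {
        if (owner[e.dest] == myRank && owner[e.src] == myRank)
            localCopies.push_back(e);
        else if (owner[e.dest] == myRank)
            recvEntries.push_back(e);   // data arrives from owner[e.src]
        else if (owner[e.src] == myRank)
            sendEntries.push_back(e);   // data goes to owner[e.dest]
    }
}

Because the map itself lives in serial space, the same executable can run as test.exe cpu, test.exe gpu, or under mpirun -np 16 without changing user code.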

Simple Example
- x_i = A_i^{-1} r_i, for all i
- A: an n*n small dense matrix; ~millions of i's
- LU factorization (partial pivoting)

template<typename KP> struct Testinv
{
    __host__ __device__ Testinv(Args* inargs, int index, int N)
    {
        int ia = 0;
        mat<double,KP> a(inargs, ia++, index);
        vec<double,KP> r(inargs, ia++, index);
        vec<double,KP> x(inargs, ia++, index);
        mat<double,KP> w(inargs, ia++, index);
        w = a;         // work copy of A
        w.inv();       // invert via LU (partial pivoting)
        x.zero();
        w.mult(r, x);  // x = A^{-1} r
    }
};

case rks::testkernels::test_inv:
    calc(inargs, gpu<Testinv<KP> >, cpu<Testinv<KP> >,
         omp<Testinv<KP> >, phi<Testinv<KP> >);
    break;

Layout
- Array-of-structures (CPU friendly): consecutive indices 0, 1, 2, ..., 17
- Structure-of-arrays (GPU friendly): strided indices 0, n, 2n, ..., 8n; 1, n+1, 2n+1, ..., 8n+1
- Templated policy <KP>
- Run-time switch
- MPI jobs using both CPU & GPU
- Future proof?
- Prevents cheating: no double* ptr (see the sketch below)
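The slide does not show the policy code itself, so here is a hedged sketch of what a templated layout policy (the <KP> parameter) could look like; AoS, SoA, at, and elem are illustrative names:

// Two index policies with the same interface: kernels written against
// a policy template never touch a raw double*, so the layout can be
// switched without changing kernel code.
struct AoS {  // array-of-structures: an item's fields are contiguous (CPU friendly)
    static int at(int item, int field, int nFields, int /*nItems*/)
    { return item * nFields + field; }
};
struct SoA {  // structure-of-arrays: a field's items are contiguous (GPU friendly)
    static int at(int item, int field, int /*nFields*/, int nItems)
    { return field * nItems + item; }
};

template<typename KP>
double& elem(double* data, int item, int field, int nFields, int nItems)
{ return data[KP::at(item, field, nFields, nItems)]; }

With SoA, consecutive GPU threads (consecutive items) read consecutive addresses, which is what coalesced memory access wants; with AoS, each item's working set stays in one cache line on the CPU.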

Performance
- Scaling by matrix size, 1e6 matrices (10 times): log time (secs) vs log dense matrix size (2, 4, 8), CPU vs GPU
- Log n scaling fits: y = 2.35x + 2.31 (CPU), y = 2.23x + 1.20 (GPU)
- Scaling for the 3*3 case (10 times): log time (secs) vs log number of matrices (5.00E+05 to 5.00E+07), CPU vs GPU

Effect of layout
- GPU: effect of layout. Log time (secs) vs log dense matrix size (2, 4, 8), s-of-a vs a-of-s
- CPU: effect of layout. Same axes, s-of-a vs a-of-s

Now add complexity
- Per-kernel timing comparison between CPU and GPU: total 1243.630 s (CPU) vs 147.960 s (GPU)
- Kernels timed: well (40 calls), jac (40), mass (40), flow (40), flow_ (4640), norm, lin (30), ling (30), lins (30), orth-it (30), precon, pressure

Linear Solver Strategy
- The linear solver is an important communication mechanism, and a challenge in parallel environments, like getting the same results everywhere
- If we can implement a solver in XPL, then we get this for free
- But we're only a small company, and don't really want to be linear solver experts
- Home grown: may not be competitive
- Using Nvidia's AmgX: lose the "same algorithm" property, but it performs

Linear Solver
- Home grown: massively helpful for development; challenged on difficult problems
- AmgX: many options (pre-coded); single GPU working well; MPI is a challenge; the implementation has to fit around it; some solvers missing

Summary & Conclusions
- Shuffle-Calculate pattern: works for us, so far
- Portable: allowing us to exploit the GPU
- Full system: commercial offering next year

Acknowledgements
- Co-authors: Bachar Zineddin & Tommy Miller
- The authors would like to acknowledge that the work presented here made use of the IRIDIS/EMERALD HPC facility provided by the Centre for Innovation
- Nvidia, for AmgX beta access

Questions?

Backup#1: LU code example

// Main elimination loop (Crout LU with partial pivoting)
for (int j = 0; j < m_xdim; j++) {
    // Sum: update column j above the diagonal
    for (int i = 0; i < j; i++) {
        double sum = (*this)(i,j);
        for (int k = 0; k < i; k++) {
            sum = sum - (*this)(i,k) * (*this)(k,j);
        }
        (*this)(i,j) = sum;
    }
    // Max: update the rest of column j and find the pivot row
    double aamax = 0.0;
    int imax = j;
    for (int i = j; i < m_xdim; i++) {
        double sum = (*this)(i,j);
        for (int k = 0; k < j; k++) {
            sum = sum - (*this)(i,k) * (*this)(k,j);
        }
        (*this)(i,j) = sum;
        if (std::fabs(vv[i] * sum) >= aamax) {
            imax = i;
            aamax = std::fabs(vv[i] * sum);
        }
    }
    // Swap: interchange rows j and imax if needed
    if (j != imax) {
        for (int k = 0; k < m_xdim; k++) {
            double dum = (*this)(imax,k);
            (*this)(imax,k) = (*this)(j,k);
            (*this)(j,k) = dum;
        }
        vv[imax] = vv[j];
    }
    // Store the pivot row; guard against a zero pivot
    piv[j] = imax;
    if ((*this)(j,j) == 0.0) { (*this)(j,j) = 1e-20; }
    // Set: scale the sub-diagonal entries of column j
    if (j != m_xdim - 1) {
        double dum = 1.0 / (*this)(j,j);
        for (int i = j + 1; i < m_xdim; i++) {
            (*this)(i,j) = (*this)(i,j) * dum;
        }
    }
} // End LU step

Backup#2: Home Grown Solver

    \begin{pmatrix} A_{ww} & A_{wb} \\ A_{bw} & A_{bb} \end{pmatrix}
    \begin{pmatrix} x_w \\ x_b \end{pmatrix} =
    \begin{pmatrix} R_w \\ R_b \end{pmatrix}

    \begin{pmatrix} A_{ww} & 0 \\ A_{bw} & \tilde{A}_{bb} \end{pmatrix}
    \begin{pmatrix} I & A_{ww}^{-1} A_{wb} \\ 0 & I \end{pmatrix}
    \begin{pmatrix} x_w \\ x_b \end{pmatrix} =
    \begin{pmatrix} R_w \\ R_b \end{pmatrix},
    \qquad \tilde{A}_{bb} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb}

Note: (1 - x)^{-1} = 1 + x + x^2 + x^3 + \dots, with x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1}
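To connect the Schur complement with the Neumann-series note (a step the slide leaves implicit):

    \tilde{A}_{bb} = A_{bb} - A_{bw} A_{ww}^{-1} A_{wb} = (I - x)\, A_{bb},
    \qquad x = A_{bw} A_{ww}^{-1} A_{wb} A_{bb}^{-1},

so that

    \tilde{A}_{bb}^{-1} = A_{bb}^{-1} (I - x)^{-1}
                        = A_{bb}^{-1} (I + x + x^2 + x^3 + \dots).

Truncating the series lets the solver apply an approximate Schur complement inverse using only solves with A_{bb} and A_{ww}.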