UoB Structured CFD Code
C.B. Allen
April 2013
Contents
- Background to code formulations
- Aspects of the coding structure, data structure, and parallel message-passing approach
- Performance figures, serial and parallel
- Issues to consider for GPU porting; some simple, some not so!
- The code is not itself the research area; it is used primarily as a tool to demonstrate other methods. The main research thrust is universal code- and mesh-independent technology: CFD-CSD coupling, volume and surface control and deformation, optimisation, data transfer, etc.
Introductory Information
- Structured, multiblock
- Third-order upwind spatial stencil (convective), 5 points in each direction
- Multigrid acceleration
- Explicit local time-stepping, steady
- Implicit pseudo-time, unsteady, with explicit local time-stepping within pseudo-time
- Aeroelastic coupled, forced and deforming; meshless CFD-CSD coupling
- Meshless mesh deformation approach
- Non-matching boundaries, with meshless interpolation
- Fortran90 and MPI
Example Applications
Unsteady rotor simulation: forward flight with cyclic pitch variation. 4M and 32M cell, 208-block meshes.
Example Applications
Static aeroelastic simulation: CFD-CSD coupling and mesh deformation via meshless radial basis function approach (details later). Mode 4 demonstration of CFD-CSD coupling. MDO wing static deflection calculation, C_L = 0.65, 0.19.
Example Applications
Domain-element shape parameterisation and mesh deformation, coupled with parallel gradient-based optimisation. Two-bladed rotor in hover, M_tip = 0.8, 63 parameters, minimise torque. C_T = 29.6%.
Mesh Format
- Domain decomposed by blocks
- No global storage; mesh never considered in its entirety.
Grid header file:
    nblocks
    nsym
    version number
    block1filename
    block2filename
    etc.
Each block in a separate file:
    ni nj nk
    x1 y1 z1
    x2 y2 z2
    ...
    iminflag neighbour orientation
    imaxflag neighbour orientation
    jminflag neighbour orientation
    jmaxflag neighbour orientation
    kminflag neighbour orientation
    kmaxflag neighbour orientation
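The header layout above is simple enough to sketch a reader for. This is a hypothetical Python illustration (the real code is Fortran90, and the exact whitespace and field layout of the file may differ):

```python
def parse_grid_header(text):
    """Parse the grid header described above: nblocks, nsym, version
    number, then one block filename per line (illustrative parser)."""
    lines = [ln.strip() for ln in text.splitlines() if ln.strip()]
    nblocks = int(lines[0])
    nsym = int(lines[1])
    version = lines[2]
    block_files = lines[3:3 + nblocks]
    return {"nblocks": nblocks, "nsym": nsym,
            "version": version, "block_files": block_files}

# Example header for a hypothetical 2-block mesh:
header = parse_grid_header("""2
0
1.0
block1.grd
block2.grd
""")
```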
Decomposition
Preprocessor processes the mesh and control file and produces most of the data required:
- Moving-mesh case: surface data processed and connectivity produced (later)
- Aeroelastic case: surface mesh and structural mesh processed and interpolation dependence produced (later)
Preprocessor also writes a grid.dims file.
Solver initialisation routine splits blocks over processes using the sizes in grid.dims:
- Sorts block numbers by size, then a target ncells is used to decide the block split per process
- Currently work is split at block level only; for example, 16 blocks means nprocs <= 16
- GPU port to allow work split at cell level: ideal load balance.
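The block-level split described above (sort blocks by size, then distribute to balance cell counts) can be sketched as a greedy assignment. This is an illustrative Python version of the idea, not the production algorithm:

```python
def split_blocks(block_sizes, nprocs):
    """Greedy block-level load balance: sort blocks by cell count
    (largest first), then give each block to the currently
    least-loaded process. Work is split at block level only, so
    nprocs cannot exceed the number of blocks."""
    assert nprocs <= len(block_sizes), "work split at block level only"
    order = sorted(range(len(block_sizes)), key=lambda b: -block_sizes[b])
    loads = [0] * nprocs          # cells assigned to each process so far
    owner = [None] * len(block_sizes)
    for b in order:
        p = loads.index(min(loads))
        owner[b] = p
        loads[p] += block_sizes[b]
    return owner, loads

# Four blocks of 40, 10, 30, 20 kcells split over two processes:
owner, loads = split_blocks([40, 10, 30, 20], 2)
```

With cell-level decomposition (the GPU target) this balancing problem disappears, since work can be divided exactly.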
Parallelisation and Data Structure
- Written in Fortran90
- Domain decomposed by blocks
- No global storage; mesh never considered in its entirety.
Only a list of block sizes and process owners is built (plus tag data):
    nblocks(nprocs)                       - number of blocks owned by each process
    blocknum(nprocs,nblocks(nprocs))      - block number for each block owned by each process
    numproc(nblock)                       - process owner for each block
    i1tag(nblock) ... k2tag(nblock)       - block boundary flags for each block
    i1nb(nblock) ... k2nb(nblock)         - neighbour block number for each boundary of each block
    i1orient(nblock) ... k2orient(nblock) - orientation of neighbour block for each boundary of each block
Initialisation and Coding Approach
Code developed to minimise communication and storage:
- Master performs all initialisation
- Blocks split over processes and all simulation data processed
- Global integer arrays and flow data broadcast to all processes
- Each process then defines its own local workspace length
- Solution loads and monitoring data processed on each process and sent to master
- Master collects convergence and load data and outputs
- Master also sends (not broadcasts) changed data to slaves
- Every process outputs its own solution data when required, to a unique filename
- No global collection or processing of solution data
Coding and Data Structure
Each process builds its own storage (no global storage).
All data stored in 1D arrays: rho(0:nijknb), x(0:nijknb), etc. NIJKNB is local to each process: the sum over multigrid levels, blocks, and cells.
A (different) 1D pointer to each block and multigrid level for each process: offset(nb,m)

    do nbb=1,nblocks(myid)
      nb=blocknum(myid,nbb)
      do m=1,mglevels
        ioff=offset(nb,m)
        do k=1,nk(nb,m)
          do j=1,nj(nb,m)
            do i=1,ni(nb,m)
              ii=cellid(ioff,i,j,k)
              vol(ii)=vol(ii) + areas(ii)*dxns(ii)*...
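The offset(nb,m)/cellid(ioff,i,j,k) pattern above amounts to packing every block (with its halo layers) into one 1D array. A minimal Python sketch of one possible layout, assuming two halo layers and ignoring the multigrid-level dimension (the real cellid() and storage layout may differ):

```python
def build_offsets(ni, nj, nk, nhalo=2):
    """Starting offset of each block in a single 1-D array, where each
    block is stored with nhalo halo layers in every direction."""
    offsets, total = [], 0
    for b in range(len(ni)):
        offsets.append(total)
        total += (ni[b] + 2*nhalo) * (nj[b] + 2*nhalo) * (nk[b] + 2*nhalo)
    return offsets, total

def cellid(ioff, i, j, k, ni, nj, nhalo=2):
    """Linear index of interior cell (i, j, k), 1-based, in the block
    whose storage starts at ioff; halo cells sit at i=1-nhalo..0 and
    i=ni+1..ni+nhalo (similarly j, k)."""
    nip, njp = ni + 2*nhalo, nj + 2*nhalo
    return (ioff + (i + nhalo - 1)
                 + nip * (j + nhalo - 1)
                 + nip * njp * (k + nhalo - 1))

# Two blocks: 4x4x4 and 2x2x2 interior cells.
offsets, total = build_offsets([4, 2], [4, 2], [4, 2])
```

Consecutive i indices map to consecutive memory locations, which is what makes the innermost i loop stride-1.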
Coding for Scalar Functions
Code originally developed to minimise storage and number of operations: many local scalars and vectors defined and discarded, with many dependencies. Example, cell left and right states (w west, e east of cell i):

    F_w(i) = F+_e(i-1) + F-_w(i)

    do k=1,nk(nb,m)
      do j=1,nj(nb,m)
        do i=0,ni(nb,m)+1
          ii=cellid(ioff,i,j,k)
          srho = limiter(rho(ii-1),rho(ii),rho(ii+1))
          rhow = musclpos(srho,rho(ii-1),rho(ii),rho(ii+1),dsew(ii)...)
          rhoe = musclneg(srho,rho(ii-1),rho(ii),rho(ii+1),dsew(ii)...)
          ... uw, ue, etc.
          IF(not solid surface) THEN
            FE = fluxneg(rhoe,ue,ve,we,ee,ce,pe,dxnw(ii+1),dynw(ii+1),dznw(ii+1)...)
            FW = FE + fluxpos(rhow,uw,vw,ww,ew,cw,pw,dxnw(ii),dynw(ii),dznw(ii)...)
          ELSE
            ...
          ENDIF
          RESW(ii) = FW*areaw(ii)
        enddo
      enddo
    enddo
Coding for Scalar Functions
Updated code is more memory intensive:

    ALLOCATE(rhow(0:ni(nb,m)+1,nj(nb,m),nk(nb,m)))
    ALLOCATE(uw(0:ni(nb,m)+1,nj(nb,m),nk(nb,m)))    etc.
    ALLOCATE(rhoe(0:ni(nb,m)+1,nj(nb,m),nk(nb,m)))
    ALLOCATE(ue(0:ni(nb,m)+1,nj(nb,m),nk(nb,m)))    etc.

    CALL EASTWEST(rhow,rhoe,...rho,u,v,w...,ni(nb,m),nj(nb,m),nk(nb,m))
      -> do i=0,ni(nb,m)+1 cells
      -> overwrite surface values: WALL_CELL()

    do k=1,nk(nb,m)
      do j=1,nj(nb,m)
        do i=1,ni(nb,m)+1
          ii=cellid(ioff,i,j,k)
          RESW(ii)= totalflux(rhow,rhoe,uw,ue,vw,ve...)*areaw(ii)
        enddo
      enddo
    enddo

Latest version of code: 800 MBytes/million cells, double precision.
Parallelisation: Structure and Message Passing
- Code written so that there is no message passing in the main subroutines: resid(), update(), restrict(), etc. Each is called once per block, so is independent of process/block/multigrid level.
- At each block boundary, two layers of halo data are required for the solution vector, for the convective term evaluation. Separate subroutine for the main message passing: boundarysolution(). One layer needed for prolong(), velocitygrads(), dtsmooth().
- All messages packed into 1D temp arrays.
- Code written to allow the same logic for serial and parallel versions; all message passing written so serial and parallel logic are the same.
- All messages sent as soon as available, for efficiency.
- No message ordering; non-blocking MPI calls throughout. mpi_wait() used to ensure message completion. Minimum use of mpi_barrier().
Parallelisation: Structure and Message Passing
At each block face requiring data passing, data are packed into 1D temp arrays: boundXX and boundi/j/kXX, where XX = face number (1-6) and variable number (1-?).
Consider a connected imin face:

    length=2*nj(nb,m)*nk(nb,m)
    do i=1,2
      do k=1,nk(nb,m)
        do j=1,nj(nb,m)
          bound11(i1,nnn)=rho(ii)
        enddo
      enddo
    enddo
    nbn=i1nb(nb)
    nbb=numproc(nbn)
    IF(nbb.ne.myid) THEN
      i_to=nbb-1
      if(i1orient(nb).eq.1) then
        i_tag1=6*nvariables*nbn+1
      elseif(i1orient(nb).eq.2) then
        i_tag1=6*nvariables*nbn+nvariables+1
      etc...
      call mpi_isend(bound11(1,nnn),length,mpi_...,i_to,i_tag1,...)
      i_tag1=6*nvariables*nb+1
      call mpi_irecv(boundi11(1,nnn2),length,mpi_...,i_from,i_tag1,...)
Parallelisation: Structure and Message Passing

    ELSE
      if(i1orient(nb).eq.1) then
        do ij=1,length
          boundi11(ij,blockpointer(nbn))=bound11(ij,nnn)
        enddo
      elseif(i1orient(nb).eq.2) then
        do ij=1,length
          boundi21(ij,blockpointer(nbn))=bound11(ij,nnn)
        enddo
      etc...
    ENDIF

    ! UNPACK
    if(i1orient(nb).eq.1) then
      do i=0,-1,-1
        do k=1,nk(nb,m)
          do j=1,nj(nb,m)
            ! ij counter determined by orientation
            rho(ii)=boundi11(ij,nnn2)
          enddo
        enddo
      enddo
    elseif(i1orient(nb).eq.2) then
      do i=0,-1,-1
        do k=1,nk(nb,m)
          do j=1,nj(nb,m)
            ! ij counter determined by orientation
            rho(ii)=boundi21(ij,nnn2)
          enddo
        enddo
      enddo
    etc.

Adds a small overhead to the scalar code, but nicer logic.
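The tag arithmetic in the snippets above (a base of 6*nvariables per block, plus an offset per face/variable) encodes a unique MPI tag for every (block, face, variable) triple, so no message ordering is needed. A hedged Python sketch of that scheme (the exact offsets in the real code may differ):

```python
def face_tag(block, face, var, nvariables):
    """Unique tag per (destination block, face, variable), following the
    6*nvariables*block + face*nvariables + var pattern suggested by the
    code (face 0-5, var 1..nvariables); illustrative, not the exact
    production formula."""
    assert 0 <= face < 6 and 1 <= var <= nvariables
    return 6 * nvariables * block + face * nvariables + var

# Every triple maps to a distinct tag, e.g. 4 blocks, 6 faces, 5 variables:
tags = {face_tag(b, f, v, 5)
        for b in range(4) for f in range(6) for v in range(1, 6)}
```

Uniqueness is what lets all isend/irecv pairs be posted immediately and matched by tag alone, with a single mpi_wait phase at the end.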
Solution Methods
- van Leer and Roe upwind convective fluxes, plus van Albada limited MUSCL
- 5-stage Runge-Kutta time-stepping; halo values set only once per step, not every stage
- V-cycle multigrid used:
  - Restrict residual, solution, and error vector on all levels; once per cycle (message passing)
  - Standard volume-weighted restriction
  - Single solution iteration on the way down
  - 2m-1 or 3m-1 iterations on the way up, limited at the coarsest level
  - Trilinear interpolation for prolongation; no smoothing
  - Ramped up from min to max to min; 0.7 to 1.0
  - One layer of U's exchanged at boundaries (message passing)
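The van Albada limited MUSCL reconstruction named above can be illustrated in a few lines. This is the textbook uniform-spacing form with an assumed kappa = 1/3; the code's own limiter()/musclpos()/musclneg() routines also carry mesh-spacing arguments (dsew), so treat this only as a sketch of the idea:

```python
def van_albada(dm, dp, eps=1e-12):
    """van Albada limiter for backward (dm) and forward (dp) differences,
    clamped to zero at extrema so reconstruction reverts to first order."""
    return max(0.0, (2.0 * dm * dp + eps) / (dm * dm + dp * dp + eps))

def muscl_left(um1, u, up1, kappa=1.0 / 3.0):
    """Limited left state at the i+1/2 face from cells i-1, i, i+1;
    kappa = 1/3 gives third-order accuracy on smooth data."""
    dm, dp = u - um1, up1 - u
    s = van_albada(dm, dp)
    return u + 0.25 * s * ((1.0 - kappa * s) * dm + (1.0 + kappa * s) * dp)
```

On smooth linear data the limiter is inactive (s = 1) and the face value is recovered exactly; at an extremum s = 0 and the cell average is used.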
Convergence and Monitoring
outprint() routine computes residuals and loads, plus standard deviations. Average flowfield conserved variables and loads are stored each cycle; σ(flow, loads) = standard deviation of these over nsamp cycles.
Example: NACA0012 aerofoil case, 257x129 mesh. Inviscid case, M = 0.5.
COST: 32768 cells, 6-level multigrid, serial computation, gfortran -O3.
Convergence to 10^-8: 199 cycles, 30s. To 10^-10: 261 cycles, 40s.
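The σ-over-nsamp-cycles monitor described above can be sketched as a rolling window. A minimal Python illustration (outprint() in the real code computes several such sigmas, over conserved variables and loads):

```python
from collections import deque
from math import sqrt

class ConvergenceMonitor:
    """Standard deviation of one monitored quantity (e.g. an averaged
    load coefficient) over the last nsamp cycles; a small sigma over the
    window indicates a converged quantity."""
    def __init__(self, nsamp):
        self.samples = deque(maxlen=nsamp)   # keeps only last nsamp values

    def push(self, value):
        self.samples.append(value)

    def sigma(self):
        n = len(self.samples)
        mean = sum(self.samples) / n
        return sqrt(sum((v - mean) ** 2 for v in self.samples) / n)
```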
Parallel Performance
Code scaled at Daresbury, 2004: 996x speed-up on 1024 cores.
Profiling
Typical mesh size for a steady simulation gives the following:

    Serial:                               Parallel:
    resid()               80%             resid()               72%
    outprint()             5%             boundsol()             7%
    timestep()             4%             outprint()             6%
    boundsol()             4%             timestep()             5%
    update()               2%             prolong()              4%
    prolong()              2%             geometric+initialise   3%
    geometric+initialise   2%             update()               2%
    restrictalllevels()    1%             restrictalllevels()    1%
    (calls resid, boundsol)               (calls resid, boundsol)

Consideration for GPUs: once resid() is 20x faster, outprint() becomes the most expensive function!
Issues for GPU Porting
- Message passing harness: single harness for serial/multicore and multicore/GPU
- Multicore approach currently limited to decomposition at block level; GPU to offer decomposition at cell/thread level: ideal load balance
- CPU/GPU load balancing
- Core numerics: no issues at all
- CFD-CSD coupling requires linear system solutions
- Mesh deformation requires linear system solutions
- Meshless interpolation requires expensive searches, complex message passing, AND numerous linear system solutions
Meshless Methods
- Much research performed on mesh-independent methods
- Application areas: CFD-CSD coupling, mesh deformation
- Based on function approximation methods: radial basis functions
- Global n-dimensional volume control methods: dimensions may be (x,y,z) and function displacement; or (Re,M,α,θ,γ) and function C_L, C_D, etc.
- Objective: universal code- and mesh-independent methods.
Meshless Methods
(a) Close surfaces. (b) Wing and beam.
First application area: CFD-CSD coupling. A method is sought to interpolate forces and displacements across the fluid-structure interface that satisfies the following requirements:
- Mesh-connectivity free: code and mesh independent, and perfectly parallel
- Conservation of energy, total force, and moment
- Exact recovery of translation and rotation
- Force and displacement association
- Position of aerodynamic nodes a linear function of the position of the structural nodes.
Meshless Methods: Radial Basis Function Interpolation
Define a coupling matrix, H, that transforms the displacements of the aerodynamic surface nodes according to the displacements of the structural nodes in a linear fashion. Using energy and force conservation it can be shown that

    u_a = H u_s        (1)
    f_s = H^T f_a      (2)

Let f(x) be the original function to be modelled, with f_i the known values at the N control points x_i, i = 1,...,N, where x_i is the n-dimensional position vector of point i. With φ the chosen basis function and ||.|| the Euclidean norm, an interpolation model s has the form

    s(x) = sum_{i=1}^{N} β_i φ(||x - x_i||) + p(x)        (3)

where β_i, i = 1,...,N, are model coefficients and p is an optional polynomial.
These coefficients are found by requiring exact recovery of the original data, s|_X = f, for all points in the training data set X. For example, if the training data are the positions of the structural nodes, exact recovery of the centres gives, using up to linear polynomial terms,

    X_s = C_ss a_x,    Y_s = C_ss a_y,    Z_s = C_ss a_z        (4)

where

    X_s = ( 0, 0, 0, 0, x_s1, ..., x_sN )^T
    a_x = ( γ^x_0, γ^x_x, γ^x_y, γ^x_z, β^x_s1, ..., β^x_sN )^T        (5)
(Analogous definitions hold for Y_s, Z_s and their a vectors.)

    C_ss = | 0    0     0     0     1       1       ...  1       |
           | 0    0     0     0     x_s1    x_s2    ...  x_sN    |
           | 0    0     0     0     y_s1    y_s2    ...  y_sN    |
           | 0    0     0     0     z_s1    z_s2    ...  z_sN    |
           | 1    x_s1  y_s1  z_s1  φ_s1s1  φ_s1s2  ...  φ_s1sN  |
           | ...                                                 |
           | 1    x_sN  y_sN  z_sN  φ_sNs1  φ_sNs2  ...  φ_sNsN  |        (6)

with

    φ_s1s2 = φ(||x_s1 - x_s2||)        (7)

To compute the aerodynamic surface points, equation (3) can be applied point by point: perfectly parallel. Either C_ss^-1 must be computed, or the system solved for the coefficient vectors.
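Equations (4)-(6) can be exercised end to end on a tiny example. This Python sketch assumes a cubic basis φ(r) = r³ (one common choice; the source does not state the code's actual basis function) and uses a small dense Gaussian-elimination solve in place of LAPACK; no point reduction or patching:

```python
from math import sqrt

def dist(p, q):
    return sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def phi(r):
    return r ** 3            # assumed basis function, for illustration

def solve(A, b):
    """Small dense solve by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            fac = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= fac * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def rbf_fit(centres, f):
    """Build the (4+N)x(4+N) matrix of equation (6) and solve
    C_ss a = (0,0,0,0, f_1..f_N)^T for a = (gamma, beta)."""
    N = len(centres)
    n = N + 4
    C = [[0.0] * n for _ in range(n)]
    for j, (x, y, z) in enumerate(centres):     # four side-condition rows
        C[0][4 + j], C[1][4 + j], C[2][4 + j], C[3][4 + j] = 1.0, x, y, z
    for i, ci in enumerate(centres):            # one recovery row per centre
        row = C[4 + i]
        row[0], row[1], row[2], row[3] = 1.0, ci[0], ci[1], ci[2]
        for j, cj in enumerate(centres):
            row[4 + j] = phi(dist(ci, cj))
    return solve(C, [0.0] * 4 + list(f))

def rbf_eval(a, centres, x):
    """Equation (3): s(x) = gamma_0 + gamma.x + sum_i beta_i phi(|x - x_i|).
    This per-point evaluation is the perfectly parallel stage."""
    g0, gx, gy, gz = a[:4]
    s = g0 + gx * x[0] + gy * x[1] + gz * x[2]
    for j, c in enumerate(centres):
        s += a[4 + j] * phi(dist(c, x))
    return s

# Five hypothetical structural nodes carrying a linear displacement field:
centres = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0),
           (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]
f = [1 + 2 * x - y + 3 * z for (x, y, z) in centres]
a = rbf_fit(centres, f)
```

The linear polynomial term means linear fields (translations, and rotations in the small-displacement limit) are reproduced exactly, which is one of the interface requirements listed earlier.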
Meshless Methods
Much work performed to minimise cost:
- CFD-CSD coupling not an issue, as system size is Nstruct x Nstruct, and a set of smaller patches is used.
- Mesh deformation: system size is Nsurface x Nsurface; could be 10^6 x 10^6!
- Efficient point reduction and optimisation scheme developed, using greedy point selection; system size < 1000.
Two stages to mesh deformation: 1) system solution; 2) position vector update.
1) System solved on every process: no comms
2) Mesh points on each process moved independently: no comms
Stage 2) ideal for GPU.
Mesh Deformation: Code Stages

Prescribed motion:

    DO NT=1,NREALTIMESTEPS
      Update surface positions - prescribed
      Solve linear system -> betax(n)
      Solve linear system -> betay(n)
      Solve linear system -> betaz(n)
      Update mesh: X()=X0()+DeltaX()
      Update geometric data, gridspeeds, volumes (GCL)
      DO NIT=1,NMGCYCLES
        Update solution
      ENDDO
    ENDDO

Aeroelastic coupling:

    DO NT=1,NREALTIMESTEPS
      DO NC=1,NCOUPLINGCYCLES
        CFD surface pressures -> structural loads
        Compute new structural positions
        CSD displacements -> CFD surface positions
        Solve linear system -> betax(n)
        Solve linear system -> betay(n)
        Solve linear system -> betaz(n)
        Update mesh: X()=X0()+DeltaX()
        Update geometric data, gridspeeds (0 for static), volumes (GCL; not for static)
        DO NIT=1,NMGCYCLES
          Update solution
        ENDDO
      ENDDO
    ENDDO
Mesh Deformation
Greedy point selection scales as

    N_op ~ sum_{n=1}^{N_sel} (n^3 + n*N_surface)

Volume mesh deformation scales as

    N_op ~ N_sel^3 + N_sel*N_volume

Typical case: N_surface = 7x10^5, N_volume = 8x10^6.
LAPACK CG method for system solution, single process; mesh deformation 5-20% of solver cost.
GPU solver: system solution dominates; need faster system solution.
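The two operation counts above are easy to evaluate directly. A small Python sketch (constants omitted, and N_sel = 1000 is an assumed figure taken from the "system size < 1000" statement earlier):

```python
def greedy_selection_ops(n_sel, n_surface):
    """N_op ~ sum_{n=1}^{N_sel} (n^3 + n*N_surface): at each selection
    step, a dense solve on the current n points plus an error scan over
    all surface points."""
    return sum(n ** 3 + n * n_surface for n in range(1, n_sel + 1))

def volume_deform_ops(n_sel, n_volume):
    """N_op ~ N_sel^3 + N_sel*N_volume: one dense solve on the reduced
    set plus an N_sel-term RBF evaluation at every volume point."""
    return n_sel ** 3 + n_sel * n_volume

# Rough totals for the typical case quoted above:
sel_ops = greedy_selection_ops(1000, 7 * 10 ** 5)
def_ops = volume_deform_ops(1000, 8 * 10 ** 6)
```

Comparing the two terms of each formula shows why the linear system solution dominates once the per-point update (stage 2) has been moved to the GPU.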
Meshless Methods: Non-Matching Boundaries
Discontinuous patch boundary between two mesh blocks (A and B).
Similar RBF interpolation used for high-order data transfer across the boundary.
Third to fifth order of accuracy proven.
NACA0012, Mach 0.5: drag coefficient convergence tested for non-matching meshes.
    129x65  -> 129x21 + 81x21 mesh
    257x129 -> 257x49 + 161x49 mesh
NACA0012 Mach 0.5, drag coefficient convergence.
Non-Matching Boundaries: Nozzle Case
Continuous (upper) and discontinuous, spacing ratio 1.5 (centre) meshes; every fourth point shown. Mach contours, continuous 512x80 mesh, M = 0.8 (lower).
Mach contours and mesh. Continuous and discontinuous meshes. Entropy contours. Continuous and discontinuous meshes.
    Mesh           Spacing Ratio   C_d        Difference (%)
    Continuous     -               0.029971   -
    Discontinuous  1.33            0.029912   0.20
    Discontinuous  1.50            0.029894   0.26
    Discontinuous  2.00            0.029887   0.28
    Discontinuous  3.00            0.029888   0.28
Non-Matching Interfaces: Issues
Preprocessor currently computes cloud lists for each cell adjacent to a tagged interface. For each interface cell, the halo point(s) need a local cloud of control points: ncloud(i,j,k,nb) lists the points in terms of i, j, k, and nb values.
Need to construct φ = [ncloud x ncloud] for each cloud, then solve φ β_ρ = ρ, φ β_E = E, etc., or construct A φ^-1, where A = [2 x ncloud].
Point clouds include points from multiple blocks/processes, and each is a different size!
Proof-of-concept stage, file I/O used: 2D +20%; 3D +50% cost.
PROBLEM 1: Preprocessor performs searches and builds lists on every multigrid level. For an unsteady case this is needed every timestep: need efficient search algorithms.
PROBLEM 2: Complex data communication (not the system solution and update costs).