CFD Implementation with In-Socket FPGA Accelerators
Ivan Gonzalez, UAM Team at DOVRES
FuSim-E Programme Symposium: CFD on Future Architectures
C²A²S²E, DLR Braunschweig, 14th-15th October 2009
Outline
- Project Goal
- FPGA Design Methodology
- In-Socket Accelerators
  - XtremeData XD2000i
- High Level Languages
  - Mitrionics SDK
- Euler 1D Implementation
- Conclusions and Future Work
Project Goal
- The main goal is to reduce aircraft design time by acting at two stages of the design process
  - First, by providing guidelines on how to improve mathematical methods in order to take advantage of parallel hardware
  - Second, by using reconfigurable computing platforms to significantly accelerate the execution of CFD algorithms
FPGA Design Methodology
- This methodology is completely different from that of other acceleration technologies
  - It involves both hardware and software design
  - Knowledge and expertise in HW design, SW programming and HW/SW codesign are a must
- Hardware: design the custom hardware
  - Imagine that Intel developed a chip for your special needs
  - This step is critical, complex and very time-consuming
- Software: program the custom hardware
  - Similar to other acceleration technologies
  - But it requires implementing custom APIs
In-Socket Accelerators
- There are several FPGA-based solutions
  - Reconfigurable computers, PCI-e boards, In-Socket Accelerators (ISAs), etc.
- In-Socket Accelerators: tightly coupling FPGAs with x86 processors
  - The FPGA is located at the same level as the microprocessor
  - Access to host memory
  - Local memory: 8 MB QDRII+ SRAM
  - Custom coprocessors
- Reconfigurable HW: reconfigurable logic, DSP units, internal memory
- XtremeData XD2000i (Intel FSB compatible)
High Level Languages (HLLs)
- Traditional Hardware Description Languages (HDLs) such as VHDL or Verilog
  - Long development cycle
  - Better performance
  - Only for HW experts
  - Poor productivity: P = Performance / Dev. Time
- A new approach is coming to FPGA development: HLLs
  - Minimize the development time
  - Increase productivity (by reducing dev. time)
  - Make the FPGA easy to use for non-expert users
- Some examples:
  - From Matlab: DSPLogic
  - From C: Codeveloper Impulse C
  - New languages: Mitrion SDK (similar to C)
- We are evaluating Mitrion and Impulse
High Level Languages (HLLs): Mitrionics Mitrion-C
- Mitrion-C
  - A language specifically intended for developing applications on FPGAs
  - Single-assignment dataflow language with native support for wide (vector) and deep (pipeline) parallelism
- Mitrion Virtual Processor (MVP)
  - A fine-grained, massively parallel processor
  - Runs software written in the Mitrion-C programming language in FPGAs
  - Has a unique architecture that lets it be adapted to each program it is running in order to maximize performance
  - The MVP performs thousands of operations simultaneously by allocating multiple computational units for each instruction
Euler 1D Implementation: Testbed
- DOVRES-UAM cluster
- Two compute nodes
  - Intel Xeon Quad-core 5408, 2.13 GHz, 1066 MHz FSB
  - One XtremeData ISA
  - One GPU Tesla C1060
  - 32 GB memory
  - Infiniband 4x DDR dual port
Euler 1D Implementation: XtremeData ISA Bandwidth Analysis
- Streaming Transfer Test
  - Studies the bandwidth between the FPGA and the host memory
  - The FPGA moves data from / to host memory doing simultaneous reads and writes
  - Overlapping communication and computation
  - Using Mitrion-C
- A simple loopback is implemented in the FPGA
[Figure: host memory connected to the FPGA]
Euler 1D Implementation: XtremeData ISA Bandwidth Analysis
- Current ISA BW numbers are:
  - 2 GB/s Host to Bridge
  - 1 GB/s Bridge to Host
  - 1 GB/s Bridge to Host and Host to Bridge (simultaneous)
- Future ISA BW numbers are:
  - 3.5 GB/s Host to Bridge
  - 2.5 GB/s Bridge to Host
  - 1.5 GB/s Bridge to Host and Host to Bridge (simultaneous)
- Streaming Transfer Test results
  - Data packets larger than 1 MB
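As a rough illustration of what these figures imply, the sketch below estimates the ideal time to stream a grid from host memory to the FPGA at each quoted Host to Bridge bandwidth. It assumes that 1 GB/s means 10^9 bytes/s, that there is no latency or per-packet overhead, and that each grid point holds three single-precision values. None of these assumptions come from the slides, and `grid_bytes` and `transfer_time_s` are hypothetical helper names.

```python
# Bandwidths quoted on the slide, in GB/s (assumed here to mean 1e9 bytes/s).
CURRENT_BW = {"host_to_bridge": 2.0, "bridge_to_host": 1.0, "both_directions": 1.0}
FUTURE_BW = {"host_to_bridge": 3.5, "bridge_to_host": 2.5, "both_directions": 1.5}


def grid_bytes(n_points, values_per_point=3, bytes_per_value=4):
    """Size of a grid in bytes (3 floats per point is an assumption)."""
    return n_points * values_per_point * bytes_per_value


def transfer_time_s(n_bytes, bw_gb_per_s):
    """Ideal transfer time: bytes over sustained bandwidth, no latency term."""
    return n_bytes / (bw_gb_per_s * 1e9)


if __name__ == "__main__":
    size = grid_bytes(1_000_000)  # hypothetical one-million-point grid
    for label, bw in (("current", CURRENT_BW), ("future", FUTURE_BW)):
        t = transfer_time_s(size, bw["host_to_bridge"])
        print(f"{label}: {size} bytes host to bridge in {t * 1e3:.2f} ms")
```

Real transfers of small packets would be slower than this estimate, since the quoted numbers apply to data packets larger than 1 MB.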
Euler 1D Implementation: Results
- Full implementation of the Euler 1D algorithm
  - FPGA-adapted version
  - One FPGA
  - Single precision (float)
  - Mitrion SDK
- Design time: 2 weeks
- 67% of the FPGA logic is used
- Synthesis through timing analysis took about 2.5 hours

Build log (Sep 14 2009, Euler1D.mitc, Quartus II 8.1.163):
  Started synthesis [10:47]
  Started place&route [11:26]
  FIT reported 67% Logic utilization
  FIT reported 117,356/203,520 (58%) dedicated registers
  FIT reported 667,011/15,040,512 (4%) block memory bits
  FIT reported 132/768 (17%) 18-bit DSP elements
  Creating device programming image [13:12]
  Running timing analysis [13:14]
  Finished [13:19]
  SPR success!
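The 2.5 hour figure and the utilization percentages can be cross-checked from the log itself. The sketch below does that arithmetic, assuming that each stage runs until the next timestamped entry begins (the log does not say so explicitly). The helper names are hypothetical.

```python
from datetime import datetime

# Timestamps copied from the build log on the slide.
LOG = [
    ("synthesis", "10:47"),
    ("place and route", "11:26"),
    ("programming image", "13:12"),
    ("timing analysis", "13:14"),
    ("finished", "13:19"),
]


def minutes_between(start, end):
    """Whole minutes between two same-day HH:MM timestamps."""
    fmt = "%H:%M"
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return int(delta.total_seconds() // 60)


def stage_durations(log):
    """Duration of each stage, assuming it ends when the next entry starts."""
    return {
        name: minutes_between(start, log[i + 1][1])
        for i, (name, start) in enumerate(log[:-1])
    }


def percent(used, total):
    """Utilization rounded to the nearest whole percent."""
    return round(100 * used / total)


if __name__ == "__main__":
    durations = stage_durations(LOG)
    print(durations, "total minutes:", sum(durations.values()))
    print(percent(117_356, 203_520), percent(667_011, 15_040_512), percent(132, 768))
```

Under that assumption place and route dominates the build (106 of 152 minutes), and the three resource ratios round to the 58%, 4% and 17% reported by the fitter.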
Euler 1D Implementation: How does it work?
- Software: one big instruction
  - Euler1D(float *grid, uint grid_size, uint n_iterations)
- Hardware: Euler 1D processing unit
  - The FPGA processes the complete grid in each iteration
  - Supports any grid size (streaming approach)
- Streaming approach
  - Simultaneous reads and writes allow us to overlap communication and computation
  - The bandwidth between host memory and the FPGA is the key to good performance
[Figure: grid data streaming between host memory and the FPGA]
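To make the "one big instruction" concrete, here is a minimal software sketch of a routine with the same shape: it takes a grid and an iteration count and sweeps the complete grid once per iteration. The slides do not state the numerical scheme, so a Lax-Friedrichs update of the conservative variables (density, momentum, total energy) is assumed purely for illustration, along with gamma = 1.4, a fixed dt/dx ratio and fixed end cells. This is not the Mitrion-C code from the talk.

```python
GAMMA = 1.4  # ratio of specific heats for air (assumed, not from the slides)


def flux(u):
    """Euler flux for one conservative state u = (rho, rho*v, E)."""
    rho, mom, energy = u
    v = mom / rho
    p = (GAMMA - 1.0) * (energy - 0.5 * rho * v * v)
    return (mom, mom * v + p, (energy + p) * v)


def euler1d(grid, n_iterations, dt_over_dx=0.1):
    """Advance the grid n_iterations Lax-Friedrichs steps; end cells stay fixed."""
    for _ in range(n_iterations):
        fluxes = [flux(u) for u in grid]
        new_grid = list(grid)
        # One full sweep over the interior points per iteration.
        for i in range(1, len(grid) - 1):
            new_grid[i] = tuple(
                0.5 * (grid[i - 1][k] + grid[i + 1][k])
                - 0.5 * dt_over_dx * (fluxes[i + 1][k] - fluxes[i - 1][k])
                for k in range(3)
            )
        grid = new_grid
    return grid


if __name__ == "__main__":
    # Hypothetical shock-tube start: high pressure on the left, low on the right.
    left, right = (1.0, 0.0, 2.5), (0.125, 0.0, 0.25)
    result = euler1d([left] * 50 + [right] * 50, n_iterations=20)
    print([round(u[0], 3) for u in result[40:60]])
```

Because each new point depends only on its two neighbours from the previous sweep, the grid can be consumed and produced as a stream, which is the property the streaming approach relies on.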
Euler 1D Implementation: Speedup
[Figure: speedup versus grid size (number of points)]
- FPGA clock is 100 MHz
- Low bandwidth
- The Euler 1D PU has only 1 core
Conclusions
- FPGA technology offers promising results for accelerating CFD algorithms
  - It is necessary to increase the bandwidth for small data packets
  - A large speedup (7x) is obtained when computation time is larger than communication time
  - And the FPGA is working at only 100 MHz!
- The current design can be improved
  - More than one Euler 1D processing unit per FPGA
    - This will require using fixed-point arithmetic
  - There is another FPGA available in the current ISA
  - Local memory of the FPGA can be used to store small grids
  - New ISA BW numbers are roughly double the current ones
- A VHDL solution could easily increase performance by 10x over a Mitrion-C implementation
  - But the development time will increase too
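The link between speedup and the ratio of computation to communication time can be illustrated with a simple steady-state model. This is an idealized sketch, not the authors' measurement method: it assumes that with full overlap a streaming iteration costs as much as its slowest stage, and that without overlap the stages add up. All timing values in the example are hypothetical.

```python
def fpga_time(t_in, t_compute, t_out, overlapped=True):
    """Time per iteration on the accelerator under the idealized model."""
    if overlapped:
        # Reads, computation and writes proceed at once: slowest stage wins.
        return max(t_in, t_compute, t_out)
    # No overlap: input transfer, computation and output transfer run in turn.
    return t_in + t_compute + t_out


def speedup(t_cpu, t_in, t_compute, t_out, overlapped=True):
    """Speedup of the accelerator over a CPU run taking t_cpu."""
    return t_cpu / fpga_time(t_in, t_compute, t_out, overlapped)


if __name__ == "__main__":
    # Hypothetical times in arbitrary units, chosen only to show the trend.
    print(speedup(8.0, 1.0, 2.0, 1.0, overlapped=True))   # compute-bound case
    print(speedup(8.0, 1.0, 2.0, 1.0, overlapped=False))  # same, no overlap
    print(speedup(8.0, 4.0, 2.0, 4.0, overlapped=True))   # transfer-bound case
```

In this model, once transfers take longer than computation, extra processing units stop helping and only more bandwidth raises the speedup.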
Future Work
- Continue working on Euler 1D for testing purposes
  - Improve the current software to solve some issues related to overlapping communication and computation
  - Apply the hardware improvements described earlier
    - Two Euler 1D units per FPGA, use the second FPGA, use the local memory of the FPGA, etc.
- Study a multi-node approach
  - Identify how communication between nodes can affect performance
  - The DOVRES-UAM cluster is equipped with DDR Infiniband
- Use a new tool: Codeveloper Impulse-C
  - A different approach from the Mitrion SDK
- Currently working on an Euler 2D version
Questions?