Dataflow Programming with MaCompiler
Lecture Overview Programming DFEs MaCompiler Streaming Kernels Compile and build Java meta-programming 2
Reconfigurable Computing with DFEs Logic Cell (10 5 elements) Xilin Virte-6 FPGA DSP Block IO Block Block RAM DSP Block Block RAM (20TB/s) 3
DFE Acceleration Hardware Solutions MaRack 10U, 20U or 40U MaCard MaNode MaRack PCI-Epress Gen 2 Typical 50W-80W 24-48GB RAM 1U server 4 MAX3 Cards Intel Xeon CPUs 10U, 20U or 40U Rack 4
How could we program it? Schematic entry of circuits Traditional Hardware Description Languages VHDL, Verilog, SystemC.org Object-oriented languages C/C, Python, Java, and related languages Functional languages: e.g. Haskell High level interface: e.g. Mathematica, MatLab Schematic block diagram e.g. Simulink Domain specific languages (DSLs) 5
Level of Abstraction Accelerator Programming Models DSL DSL DSL DSL Higher Level Libraries Higher Level Libraries Fleible Compiler System: MaCompiler Possible applications 6
Start Acceleration Development Flow Original Application Identify code for acceleration and analyze bottlenecks Transform app, architect and model performance Write MaCompiler code Integrate with Host code Simulate NO NO Accelerated Application YES Meets performance goals? Build for Hardware YES Functions correctly? 7
MaCompiler Complete development environment for Maeler DFE accelerator platforms Write MaJ code to describe the dataflow accelerator MaJ is an etension of Java for MaCompiler Eecute the Java generate the accelerator C software on CPUs uses the accelerator class MyAccelerator etends Kernel { public MyAccelerator( ) { DFEVar = io.input("", dfefloat(8, 24)); DFEVar y = io.input( y", dfefloat(8, 24)); DFEVar 2 = * ; DFEVar y2 = y * y; DFEVar result = 2 y2 30; } } io.output( z", result, dfefloat(8, 24)); 8
Application Components CPU Host application SLiC MaelerOS Kernels DFE PCI Epress * Memory Memory Manager 9
Programming with MaCompiler Host Application *.c, *.f90... Rewrite only code to be accelerated MyKernel.maj MyManager.maj User Input MaIDE User Input Compiler maeleros Library MaCompiler SLiC Library Linker MAX File Sim or H/W (.ma) Output Output Eecutable 10
Simple Application Eample CPU Host Code Main Memory Host Code (.c) int*, *y; for (int i =0; i < DATA_SIZE; i) y[i]= [i] * [i] 30; y i i i 30 11
Development Process CPU Host Code MaCompilerRT MaelerOS Main Memory y PCI Epress Memory Manager DFE 30 y 30 Host Code (.c) int*, *y; MyKernel(DATA_SIZE, for (int i =0; i < DATA_SIZE; i) y[i]= [i], * [i] DATA_SIZE*4, 30; y, DATA_SIZE*4); MyManager (.maj) Manager m = new Manager(); Kernel k = new MyKernel(); m.setkernel(k); m.setio( link( ", CPU), link( y", CPU)); m.build(); MyKernel (.maj) DFEVar = io.input("", dfeint(32)); DFEVar result = * 30; io.output("y", result, dfeint(32)); 12
Development Process CPU Host Code MaCompilerRT MaelerOS Main Memory PCI Epress Memory y Manager DFE 30 y 30 Host Code (.c) device = ma_open_device(mafile, "/dev/maeler0"); int*, *y; MyKernel( DATA_SIZE,, DATA_SIZE*4); MyManager (.maj) Manager m = new Manager(); Kernel k = new MyKernel(); m.setkernel(k); m.setio( link( ", CPU), link( y", DRAM_LINEAR1D)); m.build(); MyKernel (.maj) DFEVar = io.input("", dfeint(32)); DFEVar result = * 30; io.output("y", result, dfeint(32)); 13
The Full Kernel public class MyKernel etends Kernel { public MyKernel (KernelParameters parameters) { super(parameters); DFEVar = io.input("", dfeint(32)); } } DFEVar result = * 30; io.output("y", result, dfeint(32)); 30 y 14
Streaming Data through the Kernel 5 4 3 2 1 0 0 0 30 30 y 30 15
Streaming Data through the Kernel 5 4 3 2 1 0 1 1 30 31 y 30 31 16
Streaming Data through the Kernel 5 4 3 2 1 0 2 4 30 34 y 30 31 34 17
Streaming Data through the Kernel 5 4 3 2 1 0 3 9 30 39 y 30 31 34 39 18
Streaming Data through the Kernel 5 4 3 2 1 0 4 16 30 46 y 30 31 34 39 46 19
Streaming Data through the Kernel 5 4 3 2 1 0 5 25 30 55 y 30 31 34 39 46 55 20
Compile, Build and Run Java program generates a MaFile when it runs 1. Compile the Java into.class files 2. Eecute the.class file Builds the dataflow graph in memory Generates the hardware.ma file 3. Link the generated.ma file with your host program 4. Run the host program Host code automatically configures DFE(s) and interacts with them at run-time 21
Java meta-programming You can use the full power of Java to write a program that generates the dataflow graph Java variables can be used as constants in hardware int y; DFEVar ; = y; Hardware variables can not be read in Java! Cannot do: int y; DFEVar ; y = ; Java conditionals and loops choose how to generate hardware not make run-time decisions 22
Dataflow Graph Generation: Simple What dataflow graph is generated? DFEVar = io.input(, type); DFEVar y; y = 1; io.output( y, y, type); 1 y 23
Dataflow Graph Generation: Simple What dataflow graph is generated? DFEVar = io.input(, type); DFEVar y; y = ; io.output( y, y, type); y 24
Dataflow Graph Generation: Variables What s the value of h if we stream in 1? DFEVar h = io.input( h, type); int s = 2; 1 10 s = s 5 h = h 10 7 h = h s; 18 What s the value of s if we stream in 1? DFEVar h = io.input( h, type); int s = 2; s = s 5 h = h 10 s = h s; Compile error. You can t assign a hardware value to a Java int 25
Dataflow Graph Generation: Conditionals What dataflow graph is generated? DFEVar = io.input(, type); int s = 10; DFEVar y; if (s < 100) { y = 1; } else { y = 1; } io.output( y, y, type); What dataflow graph is generated? DFEVar = io.input(, type); DFEVar y; y 1 if ( < 10) { y = 1; } else { y = 1; } io.output( y, y, type); Compile error. You can t use the value of in a Java conditional 26
Conditional Choice in Kernels Compute both values and use a multipleer. = control.mu(select, option0, option1,, optionn) = select? option1 : option0 Ternary-if operator is overloaded DFEVar = io.input(, type); DFEVar y; y = ( > 10)? 1 : 1 > 10 1-1 io.output( y, y, type); y 27
Dataflow Graph Generation: Java Loops What dataflow graph is generated? DFEVar = io.input(, type); DFEVar y = ; for (int i = 1; i <= 3; i) { y = y i; } io.output( y, y, type); 1 2 Can make the loop any size until you run out of space on the chip! Larger loops can be partially unrolled in space and used multiple times in time see Lecture on loops and cyclic graphs y 3 28
29 Real data flow graph as generated by MaCompiler 4866 nodes; 10,000s of stages/cycles
Eercises 1. Write a MaCompiler kernel program that takes three input streams, y and z which are dfeint(32) and computes an output stream p, where: p i ( i y ( i z ( i yi i i ) 2 ) 2 z i ) if if z z i i otherwise i i 2. Draw the dataflow graph generated by the following program: for (int i = 0; i < 6; i) { DFEVar = io.input( i, dfeint(32)); DFEVar y = ; if (i % 3!= 0) { for (int j = 0; j < 3; j) { y = y *j; } } else { y = y * y; } io.output( y i, y, dfeint(32)); } 30