FPGA-based MapReduce Framework for Machine Learning
Bo WANG 1, Yi SHAN 1, Jing YAN 2, Yu WANG 1, Ningyi XU 2, Huazhong YANG 1
1 Department of Electronic Engineering, Tsinghua University, Beijing, China
2 Hardware Computing Group, Microsoft Research Asia
Outline
- Motivation
- Proposed solution: FPGA + MapReduce
- Case study: RankBoost acceleration
- Summary
The Power Barrier
[Figure: power trends motivating the shift to parallel architectures]
Source: Shekhar Borkar, Intel
Cost and Energy are still a Big Issue
Challenges
- General-purpose CPU architecture
  - Memory wall: CPUs are fast, but memory bandwidth lags far behind, so much of the die is spent on cache real estate
  - Power wall: most power goes to non-arithmetic work (out-of-order execution, branch prediction, large caches), and higher frequency means higher leakage power
- Traditional parallel programming
  - Programmers must manage concurrency explicitly
Customized Domain-Specific Computing for Machine Learning
Primary goal of this project
- Automatically exploit the parallelism in machine learning algorithms with 100x performance/power efficiency
A few facts
- We have sufficient computing power for most applications*
- Each user/enterprise needs high computation power for only selected tasks in its domain* (here, machine learning)
- Application-specific integrated circuits (ASICs) can deliver 10,000x+ better power/performance efficiency, but are too expensive to design and manufacture*
- MapReduce is a successful programming framework for ML/DM
Approach
- A "supercomputer in a box" built from reconfigurable hardware: Field Programmable Gate Arrays (FPGAs) plus CPUs
- Parallel hardware programming with a MapReduce framework
* Jason Cong, FPL'09 keynote, "Customizable Domain-Specific Computing"
The Big Picture
[Diagram: machine learning applications are written as a MapReduce description of the algorithm in C/C++, plus user constraints (Programming side); a high-level synthesis tool maps mappers/reducers, a scheduler, data management, and an interconnection network onto "supercomputer in a box" nodes combining CPUs, FPGAs, and memory (Architecture side)]
Field-Programmable Gate Array Defined
- Field-programmable semiconductor device: functionality can be changed after deployment
- Creates arbitrary logic with gate arrays
- Gate arrays: islands of reconfigurable logic in a sea of reconfigurable interconnects
Islands of reconfigurable logic in a sea of reconfigurable interconnects (Altera Stratix)
[Diagram: Stratix logic fabric implementing Y = i0 + i1 + i2*i3]
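As a minimal software sketch (our own model, not the actual Stratix hardware), a 4-input lookup table can be seen as a 16-entry truth table; the slide's expression Y = i0 + i1 + i2*i3 (OR/AND in Boolean notation) becomes:

```cpp
#include <bitset>
#include <cassert>

// Model of a 4-input lookup table (LUT): a 16-entry truth table.
// The FPGA stores these 16 bits in configuration memory; the four
// inputs simply select one entry. Here the table is filled with
// Y = i0 OR i1 OR (i2 AND i3).
struct Lut4 {
    std::bitset<16> table;

    // Pre-compute the truth table from the desired Boolean function.
    Lut4() {
        for (int idx = 0; idx < 16; ++idx) {
            bool i0 = idx & 1, i1 = idx & 2, i2 = idx & 4, i3 = idx & 8;
            table[idx] = i0 || i1 || (i2 && i3);
        }
    }

    // Evaluating the LUT is a single table lookup, regardless of how
    // complex the original expression was.
    bool eval(bool i0, bool i1, bool i2, bool i3) const {
        return table[(i0 ? 1 : 0) | (i1 ? 2 : 0) | (i2 ? 4 : 0) | (i3 ? 8 : 0)];
    }
};
```

This is why arbitrary logic fits the fabric: any 4-input function, however written, occupies exactly one such table.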
Field-Programmable Gate Array Defined (cont.)
- Implement the desired functionality directly in hardware; example: X = 3*Y + 5*Z
- Via Hardware Description Languages (HDLs), or C/C++-to-HDL compilation tools such as AutoPilot (http://www.deepchip.com/items/0482-06.html)
- The CPU runs the application; the FPGA *is* the application
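A hedged illustration of the C-to-HDL flow mentioned above: an HLS tool takes a plain C function like the one below (the function name and types are ours, not from the slides) and synthesizes it as a dedicated datapath rather than instructions on a shared ALU.

```cpp
#include <cassert>

// X = 3*Y + 5*Z. In software this is two multiplies and an add
// executed on a general-purpose ALU; a C-to-HDL compiler can instead
// generate a fixed datapath (two constant multipliers feeding one
// adder) that produces a result every clock cycle.
int mac3_5(int y, int z) {
    return 3 * y + 5 * z;
}
```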
Why Use FPGAs?
- High flexibility
  - Customized logic per application, matched at the bit level
  - Best exploits the parallelism and locality in the application
- High computation density: equivalent to several Pentium cores
- High I/O bandwidth: up to 100s of Gbps
- High internal memory bandwidth: up to 10s of Tbps, with a customized memory hierarchy and no cache misses
- Tracks Moore's Law
- Compared to ASICs: much lower design cost
- Compared to GPUs: bit-level flexibility, lower power
FPGA-based High-Performance Computing
- 10x ~ 10,000x speedups reported
  - Conferences: FCCM, FPGA, FPT, FPL, SC, ICS
  - Domains: scientific computing, machine learning, data mining, graphics, financial computing, ...
- Challenges: ad-hoc solutions; design productivity
Framework: MapReduce
Example: word count over web request logs. The programmer writes only two primitives; the MapReduce runtime provides parallelization, data distribution, fault tolerance, and load balancing.

  Map (input):
    for each word in input
      emit (word, 1)

  Reduce (key, values):
    int sum = 0;
    for each value in values
      sum += value;
    emit (key, sum)
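The two primitives above can be sketched in C++ (a sequential stand-in for the runtime; the container types and function names are our choice):

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

// map(input): emit (word, 1) for each word in the input split.
std::vector<std::pair<std::string, int>> map_words(const std::string& input) {
    std::vector<std::pair<std::string, int>> out;
    std::istringstream iss(input);
    std::string word;
    while (iss >> word) out.push_back({word, 1});
    return out;
}

// reduce(key, values): sum the counts emitted for one word.
// Grouping the intermediate pairs by key, distributing data, balancing
// load, and handling faults are the runtime's job, not the programmer's.
int reduce_counts(const std::vector<int>& values) {
    int sum = 0;
    for (int v : values) sum += v;
    return sum;
}
```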
The Big Picture (revisited)
[Diagram repeated: MapReduce description in C/C++ plus user constraints, compiled by a high-level synthesis tool onto supercomputer-in-a-box nodes of CPUs, FPGAs, and memory]
FPGA MapReduce (FPMR) Framework
[Diagram: a CPU-side <key,value> generator feeds the FPGA over PCIe / HyperTransport; on the FPGA, a data controller moves data between global and local memory, a processor scheduler enables mappers with their parameters, and intermediate <key,value> pairs pass through reducers and a merger; numbered arrows 1-6 trace the data flow]
Major Building Blocks
- Processors (workers) with pre-defined interfaces, plus reducers
- On-chip scheduler: schedules dynamically, monitors worker status, keeps queues to record tasks
- Data access infrastructure
  - Interconnection network: message passing and shared memory
  - Storage hierarchy: global memory, local memory, and register file
  - Data controller: connects the CPU, memories, and workers
Parallelism
- Task-level / data-level parallelism: among mappers/reducers
- Instruction-level parallelism: within each worker
Case Study: RankBoost
- An extension of AdaBoost to ranking problems [Yoav Freund, 2003]
- Learns a ranking function by combining weak learners
- Weak learners are usually represented by decision stumps over features
- Training is slow with large numbers of features and training samples; e.g. for a web search engine it can take weeks to reach an optimal result
Case Study: RankBoost
[Figure: RankBoost training algorithm]
RankBoost: mapper and reducer

  map (int key, pair value):
    // key: feature index fi
    // value: (bin_fi(d), π(d)) for each document d
    for each document d in value:
      hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + π(d)
    EmitIntermediate (fi, hist_fi)

  reduce (int key, array value):
    // key: feature index fi
    // value: histograms hist_fi, fi = 1 ... N_f
    for each histogram hist_fi:
      for i = N_bin - 1 down to 0:
        integral_fi(i) = hist_fi(i) + integral_fi(i+1)
      EmitIntermediate (fi, integral_fi)
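A runnable C++ sketch of the same mapper and reducer (sequential, with our own array types; the bin count and data in the usage note are illustrative):

```cpp
#include <cassert>
#include <vector>

// map: build a weighted histogram over feature bins for one feature fi.
// bins[d] = bin index of document d under feature fi; pi[d] = weight π(d).
std::vector<double> map_hist(const std::vector<int>& bins,
                             const std::vector<double>& pi,
                             int n_bin) {
    std::vector<double> hist(n_bin, 0.0);
    for (std::size_t d = 0; d < bins.size(); ++d)
        hist[bins[d]] += pi[d];
    return hist;
}

// reduce: integral histogram accumulated from the top bin down,
// i.e. integral(i) = hist(i) + integral(i+1).
std::vector<double> reduce_integral(const std::vector<double>& hist) {
    const int n = static_cast<int>(hist.size());
    std::vector<double> integral(n, 0.0);
    double acc = 0.0;
    for (int i = n - 1; i >= 0; --i) {
        acc += hist[i];
        integral[i] = acc;
    }
    return integral;
}
```

For example, three documents with bins {0, 1, 1} and weights {0.5, 0.25, 0.25} give hist = {0.5, 0.5} and integral = {1.0, 0.5}.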
RankBoost on FPMR
- Decide the <key,value> pairs and the number of mappers/reducers
[Diagram: FPMR instantiated for RankBoost — the CPU-side generator emits <bin(d), π(d)> pairs over PCI-E into global memory; the data controller and processor scheduler drive the mappers, whose intermediate <fi, hist_fi(bin)> pairs feed the reducers and merger on the FPGA]
Mapper & Reducer Structure
[Block diagram: the mapper datapath combines shift registers, Bin and Pi FIFOs, address generators, a dual-port hist_f RAM, MUXes, and a floating-point adder to accumulate histograms; the reducer datapath combines a local memory, an address generator, a floating-point adder, a floating-point comparator, and a maximum register]
Target Accelerator
- PCI Express x8 interface (Xilinx Virtex-5 LXT FPGA)
- Altera Stratix II FPGA for computation
- 2x DDR2 modules, 16 GB, 6.25 GB/s; SRAMs
- Designed in the Hardware Computing Group (HCG), MSRA
Experimental Results
- 31.82x total speedup with 146 parallel mappers
- Manual design: 33.5x

  #mapper  #reducer  WL / s  Total / s  Speedup (WL)  Speedup (Total)
  1        1         320.9   321.96     0.33          0.33
  2        1         160.5   161.52     0.65          0.65
  4        1         80.22   81.293     1.30          1.30
  8        1         40.11   41.181     2.60          2.56
  16       1         20.06   21.125     5.20          4.99
  32       1         10.09   11.159     10.33         9.44
  52       1         6.228   7.297      16.74         14.44
  64       1         5.107   6.176      20.42         17.06
  128      1         2.616   3.685      39.87         28.59
  146      1         2.242   3.311      46.52         31.82
  Optimized software (baseline): WL 104.3 s, Total 105.37 s, speedup 1 / 1
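As a worked check of the table, speedup is simply the software baseline time divided by the accelerated time; e.g. for 146 mappers, 104.3 / 2.242 ≈ 46.52 (WL) and 105.37 / 3.311 ≈ 31.82 (total):

```cpp
#include <cassert>
#include <cmath>

// Speedup = baseline (optimized software) time / accelerated time.
// The numbers in the test come from the results table above.
double speedup(double baseline_s, double accelerated_s) {
    return baseline_s / accelerated_s;
}
```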
Speedup Scalability
[Chart: speedup vs. number of mappers (1 to 146), four curves: WL and Total, each with and without CDP]
Resource utilization as mappers scale up:
- With CDP: ALUT 1%-10%, Register 1%-11%
- Without CDP: ALUT 19%-86%, Register 17%-89%
Design Productivity
- Manual design: more than 3 months after the hardware circuit board was ready
- FPGA-based MapReduce: weeks, though data layout and performance tuning still took time
Summary
- Designed building blocks for MapReduce on FPGAs
- Achieved results comparable to a manual design
Future work
- Use C-to-HDL compilers to further increase design productivity
- Build a runtime for multiple machines
- Try more cases to build a tunable library
Thanks!
ningyixu@microsoft.com