FPGA-based MapReduce Framework for Machine Learning

1 FPGA-based MapReduce Framework for Machine Learning
Bo WANG (1), Yi SHAN (1), Jing YAN (2), Yu WANG (1), Ningyi XU (2), Huangzhong YANG (1)
(1) Department of Electronic Engineering, Tsinghua University, Beijing, China
(2) Hardware Computing Group, Microsoft Research Asia

2 Outline
- Motivation
- Proposed solution: FPGA+MapReduce
- Case study: RankBoost acceleration
- Summary

3 The Power Barrier
[Figure not recovered; surviving label: "parallel". Source: Shekhar Borkar, Intel]

4 Cost and Energy are still a Big Issue

5 Challenges
- General-purpose CPU architecture
  - Memory wall: CPUs are too fast; memory bandwidth is too slow; cache real estate
  - Power wall: most power goes to non-arithmetic operations (out-of-order execution, prediction); higher frequency means higher leakage power; large caches
- Traditional parallel programming
  - Need to manage the concurrency explicitly

6 Customized Domain-Specific Computing for Machine Learning
Primary goal of this project
- Automatically exploit the parallelism in machine learning algorithms, with 100x performance/power efficiency
A few facts
- We have sufficient computing power for most applications *
- Each user/enterprise needs high computation power for only selected tasks in its domain * (machine learning)
- Application-specific integrated circuits (ASICs) can lead to 10,000X+ better power-performance efficiency, but are too expensive to design and manufacture *
- MapReduce is a successful programming framework for ML/DM
Approach
- Supercomputer in a box with reconfigurable hardware: Field Programmable Gate Arrays (FPGAs) and CPUs
- Parallel hardware programming with the MapReduce framework
* Jason Cong, FPL09 Keynote, Customizable Domain-Specific Computing

7 The Big Picture
[Diagram; labels: Machine Learning Applications; MapReduce description of the algorithm in C/C++; Reducer; High Level Synthesis Tool; Data Manage; User Constraints; Scheduler; Interconnection Network; Supercomputer in a box (x4), each with CPUs, MEM, and FPGAs; the two halves are labelled Programming and Architecture]

8 Field-Programmable Gate Array Defined
- Field-programmable semiconductor device
  - Change functionality after deployment
- Create arbitrary logic with gate arrays
  - Gate arrays: islands of reconfigurable logic in a sea of reconfigurable interconnects

9 Islands of reconfigurable logic in a sea of reconfigurable interconnects (Altera Stratix)
[Figure; example function shown: Y = i0 + i1 + i2 * i3]

10 Field-Programmable Gate Array Defined
- Field-programmable semiconductor device
  - Change functionality after deployment
- Create arbitrary logic with gate arrays
  - Gate arrays: islands of reconfigurable logic in a sea of reconfigurable interconnects
- Implement desired functionality in hardware
  - Example: X = 3*Y + 5*Z
  - Hardware Description Languages (HDLs)
  - C/C++ to HDL compilation tools: AutoPilot
- The CPU runs the application; the FPGA is the application.
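To make the example X = 3*Y + 5*Z concrete, here is a minimal Python sketch of one way such an expression can be laid out as pure logic: each constant multiplication becomes a shift plus an add, so no general multiplier is needed. The slide does not say how AutoPilot or any other tool actually maps the expression, so the decomposition and the function names below are assumptions for illustration only.

```python
def x_multiply(y: int, z: int) -> int:
    """Reference form, as written on the slide: X = 3*Y + 5*Z."""
    return 3 * y + 5 * z


def x_shift_add(y: int, z: int) -> int:
    """Same function using only shifts and adds (assumed hardware-style mapping)."""
    three_y = (y << 1) + y   # 3*Y = 2*Y + Y
    five_z = (z << 2) + z    # 5*Z = 4*Z + Z
    return three_y + five_z  # final adder
```

In hardware the two shift-add terms could be computed side by side and feed the final adder, which is the kind of bit-level, application-matched parallelism the deck attributes to FPGAs.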

11 Why use FPGA?
- High flexibility
  - Customized logic for the application
  - Match the application at bit level
  - Best utilize parallelism and locality in the application
- High computation density
  - Several Pentium cores
- High I/O bandwidth
  - Up to 100s of Gbps
- High internal memory bandwidth
  - Up to 10s of Tbps
  - Customized memory hierarchy with no cache misses
- Tracks Moore's Law
- Compared to ASIC
  - Much lower design cost
- Compared to GPU
  - Bit-level flexibility
  - Lower power

12 FPGA-based High Performance Computing
- 10X ~ 10,000X speedup reported
  - Conferences: FCCM, FPGA, FPT, FPL, SC, ICS
  - Domains: scientific computing, machine learning, data mining, graphics, financial computing
- Challenges
  - Ad-hoc solutions
  - Design productivity

13 Framework: MapReduce
[Diagram; labels: Web Request Logs, MapReduce, Word Count]
- Programmer: functionality
- MapReduce runtime: parallelization, data distribution, fault tolerance, load balance
Two primitives:
  Map (input)
    for each word in input
      emit (word, 1)
  Reduce (key, values)
    int sum = 0;
    for each value in values
      sum += value;
    emit (key, sum)
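The word-count primitives on this slide can be tried out with a small single-process Python sketch. It only mimics the programming model: the grouping step stands in for the runtime, and nothing here is parallel or fault tolerant. The function names (`map_words`, `reduce_counts`, `run_word_count`) are illustrative and not part of any MapReduce library.

```python
from collections import defaultdict


def map_words(text):
    """Map: emit (word, 1) for each word in the input."""
    for word in text.split():
        yield word, 1


def reduce_counts(key, values):
    """Reduce: sum all values emitted for one key."""
    total = 0
    for value in values:
        total += value
    return key, total


def run_word_count(documents):
    """Stand-in for the runtime: group intermediate pairs by key, then reduce."""
    groups = defaultdict(list)
    for doc in documents:
        for key, value in map_words(doc):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in groups.items())
```

For instance, `run_word_count(["a b a", "b c"])` groups the intermediate pairs into `a: [1, 1]`, `b: [1, 1]`, `c: [1]` before reducing. The programmer writes only the two primitives; everything in `run_word_count` is what the framework is expected to provide.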

14 The Big Picture
[Same diagram as slide 7]

15 FPGA MapReduce (FPMR) Framework
[Block diagram; labels: Global Memory, CPU, <key,value> Generator, PCIe / Hyper-Transport, Data Controller, enable, parameters, Processor Scheduler, Intermediate <key,value>, Reducer, Merger, Local Memory, FPGA]

16 Major Building Blocks
- Processors (workers)
  - Mappers and reducers with pre-defined interfaces
- On-chip scheduler
  - Dynamic scheduling
  - Monitor status
  - Queues to record
- Data access infrastructure
  - Interconnection network: message passing and shared memory
  - Storage hierarchy: global memory, local memory, and register file
  - Data controller: CPU, memories, and workers
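The on-chip scheduler item can be illustrated with a small software simulation of dynamic scheduling. This is only a sketch under assumptions: the slide says the scheduler monitors status and uses queues, but not what the queues hold, so the two queues below (pending tasks and idle workers), the cost model, and the name `schedule` are all hypothetical.

```python
import heapq
from collections import deque


def schedule(task_costs, n_workers):
    """Simulate dynamic scheduling; return ({task: worker}, finish_time)."""
    if n_workers < 1:
        raise ValueError("need at least one worker")
    pending = deque(enumerate(task_costs))  # assumed queue of waiting tasks
    idle = deque(range(n_workers))          # assumed queue of idle workers
    running = []                            # min-heap of (finish_time, worker)
    now = 0
    assignment = {}
    while pending or running:
        # Dispatch as long as a task and an idle worker are both available.
        while pending and idle:
            task, cost = pending.popleft()
            worker = idle.popleft()
            assignment[task] = worker
            heapq.heappush(running, (now + cost, worker))
        # Advance time to the next completion and mark that worker idle.
        now, worker = heapq.heappop(running)
        idle.append(worker)
    return assignment, now
```

With `schedule([4, 1, 1, 1], 2)`, worker 0 keeps the long task while worker 1 picks up the three short ones, and everything finishes at time 4. A fixed round-robin split would instead give worker 0 tasks 0 and 2, for a finish time of 5, which is the load-balance argument for scheduling at run time.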

17 Parallelism
- Task-level/data-level parallelism: among mappers/reducers
- Instruction-level parallelism: within each worker

18 Case study: RankBoost
- An extension of AdaBoost to ranking problems [Yoav Freund, 2003]
- Learns a ranking function by combining weak learners
- Weak learners are usually represented by decision stumps of features
- Slow with a large number of features and training samples
  - E.g. a Web search engine: weeks to get an optimal result

19 Case study: RankBoost

20 RankBoost: mapper and reducer
  map (int key, pair value):
    // key   : feature index fi
    // value : document bin_fi, document π
    for each document d in value:
      hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + π(d)
    EmitIntermediate (fi, hist_fi);

  reduce (int key, array value):
    // key   : feature index fi
    // value : histograms hist_fi, fi = 1 … N_f
    for each histogram hist_fi
      for i = N_bin - 1 to 0
        integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)
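The mapper and reducer on this slide can be written as a short runnable Python sketch. It follows the pseudocode for a single feature per call; the slide's reducer loops over all features. Treating integral_fi(N_bin) as zero is an assumption the slide leaves implicit, and the names `map_feature` and `reduce_feature` are illustrative only.

```python
def map_feature(fi, bins, weights, n_bin):
    """Mapper: accumulate document weights pi(d) into the histogram of feature fi.

    bins[d] is the bin index of document d for this feature,
    weights[d] is pi(d).
    """
    hist = [0.0] * n_bin
    for b, w in zip(bins, weights):
        hist[b] += w
    return fi, hist


def reduce_feature(fi, hist):
    """Reducer: integral histogram, summed from the last bin down to bin 0."""
    n_bin = len(hist)
    integral = [0.0] * (n_bin + 1)  # integral[n_bin] = 0 (assumed boundary value)
    for i in range(n_bin - 1, -1, -1):
        integral[i] = hist[i] + integral[i + 1]
    return fi, integral[:n_bin]
```

Each feature's histogram depends only on that feature's bins and the shared weights, so one `map_feature` call per feature can run on a separate mapper. That independence is what allows a design with many parallel mappers.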

21 RankBoost on FPMR
- Map RankBoost on FPMR
  - Decide <key, value>
  - #mapper/#reducer
[Block diagram; labels: Global Memory bin(d), Global Memory π(d), CPU, <bin(d), π(d)> Generator, PCI-E, Data Controller, enable, parameters, Processor Scheduler, Intermediate <fi, hist_fi(bin)>, Reducer, Merger, Local Memory, FPGA]

22 Mapper & Reducer Structure
[Circuit diagrams; labels: shift registers, local memory (read address, write address, data in, data out), address generators, Bin FIFO, Pi FIFO, dual-port hist_f RAM, multiplexers, floating-point adders, floating-point comparator (ageb), maximum register, reducer]

23 Target Accelerator
- PCI Express x8 interface (Xilinx V5 LXT FPGA)
- Altera Stratix II FPGA
- DDR2 modules x2, 16 GB, 6.25 GB/s
- SRAMs
- Designed in HCG, MSRA

24 Experimental results
- 31.82X speedup with 146 parallel mappers
- Manual design: 33.5x
[Table; columns: #mapper, #reducer, WL / s, Total / s, WL Speedup, Total Speedup; first row: Optimized software; values not recovered]

25 Scalability
[Plot: speedup versus N mappers; series: WL with CDP, Total with CDP, WL w/o CDP, Total w/o CDP]
[Resource utilization; column headings not recovered]
  ALUT      1%   2%   3%   5%  10%
  Register  1%   2%   4%   6%  11%
  ALUT     19%  31%  38%  75%  86%
  Register 17%  32%  39%  81%  89%

26 Design Productivity
- Manual design: more than 3 months after the hardware circuit board was ready
- FPGA-based MapReduce: weeks
  - Data layout and performance tuning took time

27 Summary
- Designed building blocks for MapReduce on FPGA
- Achieved results comparable to manual design
- Future work
  - Use C2HDL compilers to further increase design productivity
  - Build a runtime for multiple machines
  - Try more cases to build a tunable library

28 Thanks!
