FPGA-based MapReduce Framework for Machine Learning




FPGA-based MapReduce Framework for Machine Learning
Bo Wang (1), Yi Shan (1), Jing Yan (2), Yu Wang (1), Ningyi Xu (2), Huazhong Yang (1)
(1) Department of Electronic Engineering, Tsinghua University, Beijing, China
(2) Hardware Computing Group, Microsoft Research Asia

Outline
- Motivation
- Proposed solution: FPGA + MapReduce
- Case study: RankBoost acceleration
- Summary

The Power Barrier
(figure omitted: power trends forcing the shift to parallel architectures; source: Shekhar Borkar, Intel)

Cost and Energy Are Still a Big Issue

Challenges
General-purpose CPU architecture:
- Memory wall: CPUs are too fast, memory bandwidth is too low, and caches consume real estate.
- Power wall: most power goes to non-arithmetic work (out-of-order execution, branch prediction, large caches), and higher frequency means higher leakage power.
Traditional parallel programming:
- Concurrency must be managed explicitly.

Customized Domain-Specific Computing for Machine Learning
Primary goal of this project: automatically exploit the parallelism in machine learning algorithms, with 100x performance/power efficiency.
A few facts:
- We have sufficient computing power for most applications.*
- Each user or enterprise needs high computational power only for selected tasks in its domain* (here, machine learning).
- Application-specific integrated circuits (ASICs) can deliver 10,000x+ better power/performance efficiency, but are too expensive to design and manufacture.*
- MapReduce is a successful programming framework for machine learning and data mining.
Approach: a "supercomputer in a box" built from CPUs plus reconfigurable hardware (field-programmable gate arrays, FPGAs), with the parallel hardware programmed through a MapReduce framework.

* Jason Cong, FPL'09 keynote, "Customizable Domain-Specific Computing"

The Big Picture
(diagram: machine learning applications are written as a MapReduce description of the algorithm in C/C++; a high-level synthesis tool, guided by user constraints, maps the mappers and reducers onto a "supercomputer in a box" of CPUs and FPGAs, with memories, an interconnection network, a scheduler, and a data manager)

Field-Programmable Gate Array, Defined
- A field-programmable semiconductor device: its functionality can be changed after deployment.
- Arbitrary logic is created with gate arrays: islands of reconfigurable logic in a sea of reconfigurable interconnects.

Islands of reconfigurable logic in a sea of reconfigurable interconnects
(figure: Altera Stratix fabric implementing Y = i0 + i1 + i2 * i3)

Field-Programmable Gate Array, Defined (continued)
- Desired functionality is implemented directly in hardware, e.g. X = 3*Y + 5*Z.
- Designs are written in hardware description languages (HDLs), or compiled from C/C++ to HDL with tools such as AutoPilot (http://www.deepchip.com/items/0482-06.html).
- The CPU runs the application; the FPGA is the application.

Why Use FPGAs?
- High flexibility: logic customized to the application, matched at the bit level; best exploits the parallelism and locality in the application.
- High computation density: equivalent to several Pentium-class cores.
- High I/O bandwidth: up to hundreds of Gbps.
- High internal memory bandwidth: up to tens of Tbps, with a customized memory hierarchy and no cache misses.
- Tracks Moore's law.
- Compared to ASICs: much lower design cost.
- Compared to GPUs: bit-level flexibility and lower power.

FPGA-based High-Performance Computing
- Speedups of 10x to 10,000x reported.
- Conferences: FCCM, FPGA, FPT, FPL, SC, ICS.
- Domains: scientific computing, machine learning, data mining, graphics, financial computing, and more.
- Challenges: ad-hoc solutions and design productivity.

Framework: MapReduce
Example: word count over web request logs. The programmer writes two primitives; the MapReduce runtime handles parallelization, data distribution, fault tolerance, and load balancing.

Map (input):
  for each word in input
    emit (word, 1)

Reduce (key, values):
  int sum = 0
  for each value in values
    sum += value
  emit (key, sum)

The Big Picture
(diagram repeated: the MapReduce description in C/C++, the high-level synthesis tool, and the CPU+FPGA "supercomputer in a box")

FPGA MapReduce (FPMR) Framework
(diagram: a <key,value> generator on the CPU writes to global memory; data crosses PCIe / HyperTransport to the FPGA, where a data controller feeds the workers, a processor scheduler dispatches them via enable/parameter signals, intermediate <key,value> pairs flow to the reducers, and a merger collects results through local memory)

Major Building Blocks
- Processors (workers): mappers and reducers with pre-defined interfaces.
- On-chip scheduler: schedules dynamically, monitors worker status, keeps record queues.
- Data access infrastructure:
  - Interconnection network: message passing and shared memory.
  - Storage hierarchy: global memory, local memory, and register file.
  - Data controller: connects the CPU, the memories, and the workers.

Parallelism
- Task-level/data-level parallelism: among mappers and reducers.
- Instruction-level parallelism: within each worker.

Case Study: RankBoost
- An extension of AdaBoost to ranking problems [Freund et al., 2003].
- Learns a ranking function by combining weak learners.
- Weak learners are usually decision stumps over features.
- Slow with large numbers of features and training samples: a web search engine, for example, can take weeks to reach an optimal result.
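For reference, the "combining weak learners" step follows the standard boosting form from Freund et al. (2003): the final ranking function is a weighted sum of the weak learners chosen over T rounds,

```latex
H(x) = \sum_{t=1}^{T} \alpha_t h_t(x)
```

where each $h_t$ is a decision stump over one feature and $\alpha_t$ is its learned weight.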

Case Study: RankBoost (figure omitted)

RankBoost: Mapper and Reducer

map (int key, pair value):
  // key: feature index fi
  // value: document bins bin_fi and document weights π
  for each document d in value:
    hist_fi(bin_fi(d)) = hist_fi(bin_fi(d)) + π(d)
  EmitIntermediate (fi, hist_fi)

reduce (int key, array value):
  // key: feature index fi
  // value: histograms hist_fi, fi = 1 … N_f
  for each histogram hist_fi:
    for i = N_bin - 1 down to 0:
      integral_fi(i) = hist_fi(i) + integral_fi(i+1)
    EmitIntermediate (fi, integral_fi)

RankBoost on FPMR
(diagram: the host decides the <key,value> format, here <bin(d), π(d)>, and the number of mappers and reducers; the generator on the CPU streams pairs from global memory over PCIe through the data controller, the processor scheduler enables and parameterizes the workers, and intermediate <fi, hist_fi(bin)> pairs pass through the merger to the reducers using local memory on the FPGA)

Mapper & Reducer Structure
(diagram: the mapper streams bin and π values through FIFOs and multiplexers into a dual-port RAM holding hist_fi, updated by a floating-point adder behind shift registers and an address generator; the reducer uses an address generator over local memory, a floating-point adder, and a floating-point comparator with a maximum register)

Target Accelerator
- PCI Express x8 interface (Xilinx Virtex-5 LXT FPGA)
- Altera Stratix II FPGA
- 2x DDR2 modules, 16 GB, 6.25 GB/s; SRAMs
- Designed in the Hardware Computing Group, Microsoft Research Asia

Experimental Results
31.82x speedup with 146 parallel mappers (manual design: 33.5x).

#mappers  #reducers  WL / s   Total / s   Speedup (WL)  Speedup (Total)
   1          1      320.9     321.96        0.33           0.33
   2          1      160.5     161.52        0.65           0.65
   4          1       80.22     81.293       1.30           1.30
   8          1       40.11     41.181       2.60           2.56
  16          1       20.06     21.125       5.20           4.99
  32          1       10.09     11.159      10.33           9.44
  52          1        6.228     7.297      16.74          14.44
  64          1        5.107     6.176      20.42          17.06
 128          1        2.616     3.685      39.87          28.59
 146          1        2.242     3.311      46.52          31.82
Optimized software:  104.3     105.37        1              1

Speedup Scalability
(plot omitted: speedup vs. number of mappers, 0 to 300, for WL and Total time, with and without CDP)

Resource utilization at five mapper counts, with vs. without CDP:
- With CDP: ALUTs 1%, 2%, 3%, 5%, 10%; registers 1%, 2%, 4%, 6%, 11%
- Without CDP: ALUTs 19%, 31%, 38%, 75%, 86%; registers 17%, 32%, 39%, 81%, 89%

Design Productivity
- Manual design: more than 3 months after the hardware circuit board was ready.
- FPGA-based MapReduce: weeks, most of it spent on data layout and performance tuning.

Summary
- Designed building blocks for MapReduce on FPGAs.
- Achieved results comparable to a manual design.
Future work:
- Use C-to-HDL compilers to further improve design productivity.
- Build a runtime for multiple machines.
- Try more cases to build a tunable library.

Thanks! ningyixu@microsoft.com