
DESIGN SOLUTION: A CUSTOMER SUCCESS STORY

Hardware Acceleration for Map/Reduce Analysis of Streaming Data Using OpenCL

By Jim Costabile, CEO/Founder, Syncopated Engineering Inc.

The Design Team: Syncopated Engineering is a creative solution provider of software applications and embedded systems for wireless communications, signal processing, and data analytics.

Challenge: Map/Reduce applications, which break a huge static data set into small pieces, map those pieces to independent processors, and then analyze the data on each processor, are well supported by industry-standard frameworks such as Hadoop. But there has been no easy way for developers to move code from a static Map/Reduce environment into an environment in which the data streams through the application. This change often required a complete recoding of the application and the use of hardware accelerators to keep up with the data stream.

Solution: By coding the Map/Reduce functions as OpenCL kernels, developers can move from a static-data model on CPUs to a streaming model using hardware acceleration with virtually no application code changes, exploiting new streaming extensions to the underlying OpenCL platform.

The Project

Typically, analysts and analytic developers pursue the solution to a complex problem by operating on static data to explore, refine, and optimize the analytic prior to deploying it as an operational analytic. This is often called a data-at-rest approach. Often this development and discovery process produces a solution that would provide significant value if it operated on real-time, dynamic data in a streaming fashion: a data-in-motion implementation.

The Design Challenge

The challenge is to migrate these analytics from a static data analysis architecture to a dynamic streaming architecture. Today, the underlying computing infrastructure, analytic frameworks, and programming languages for static and dynamic architectures are different and incompatible.
The analytic developer needs the ability to develop and verify a static data analytic, and then in a simple and straightforward manner migrate this analytic to a streaming analytic architecture. Ideally, this migration process would produce a functionally equivalent analytic capable of operating with high throughput and low latency to keep up with the streaming data rates.

The current state of the art for static data analytics is the Hadoop Map/Reduce software framework. Hadoop Map/Reduce implements the Map/Reduce parallel processing model on large clusters of commodity computers. Map/Reduce jobs typically include a user-defined map function that operates on input key/value pairs to produce intermediate key/value pairs, and a user-defined reduce function that merges and operates on all intermediate values associated with a specific intermediate key to produce a reduced output result. This framework lends itself to data-parallel processing, where the input data is partitioned and processed in parallel by a large number of concurrent, identical map functions followed by a smaller number of concurrent, identical reduce functions.

Hadoop is a static data analytic framework written in Java and deployed across a large cluster of commodity machines. Such a cluster necessarily has a large size, weight, and power (SWAP) footprint. If you need more performance, you have to add more commodity machines, at the expense of an even larger SWAP footprint. If you want to use the same analytic on streaming data, you have two choices: store the data first (effectively turning the data-in-motion scenario into a data-at-rest scenario), or rewrite the analytic in a streaming analytic framework.

The Design Solution

Our solution is an implementation of the map/reduce parallel processing model using the OpenCL heterogeneous parallel programming framework. Analytic developers write custom map and reduce functions as OpenCL kernels and leverage Syncopated Engineering's map/reduce implementation to quickly deploy map/reduce applications as either static or streaming data analytics on commodity CPU architectures, as well as on hardware-accelerated heterogeneous computing architectures using FPGAs.
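The map/reduce contract described above (a map function emitting intermediate key/value pairs, a reduce function merging all values grouped under one intermediate key) can be illustrated with a minimal word-count sketch. This is a hypothetical plain-Python model, not the Hadoop or OpenCL code from the project:

```python
from collections import defaultdict

def map_fn(doc_id, text):
    # Emit an intermediate <term, 1> pair for every word in the document.
    for term in text.lower().split():
        yield term, 1

def reduce_fn(term, counts):
    # Merge all intermediate values associated with one intermediate key.
    return term, sum(counts)

def map_reduce(documents):
    # Shuffle step: group intermediate values by their intermediate key.
    intermediate = defaultdict(list)
    for doc_id, text in documents.items():
        for key, value in map_fn(doc_id, text):
            intermediate[key].append(value)
    return dict(reduce_fn(k, v) for k, v in intermediate.items())

counts = map_reduce({"d1": "fpga opencl fpga", "d2": "opencl"})
# counts == {"fpga": 2, "opencl": 2}
```

In Hadoop the map and reduce calls in this sketch would run as many concurrent, identical tasks across the cluster; in the OpenCL implementation they become mapper and reducer kernels.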
To demonstrate our OpenCL map/reduce framework, we developed a document filtering application that takes a collection of documents as input, determines the term (word) frequency count per document, and then uses topic profiles to score the documents, determining which documents are most relevant to the topic. We used a collection of technical papers as the input document set, and used their abstracts to create specific topic profiles.

Theory of Operation

The document filtering application includes a pre-initialization process that transforms the input document set into a collection of <term, frequency> pairs per document, where the term is a word in the document and the frequency is that word's frequency of occurrence in the document. The topic profiler performs the same transformation on one of the abstracts from the document set. Topic profiles are stored as <term, weight> pairs, where the weight represents the relevance of that term to the profile. The application compares the terms in the document to the terms in the profile and computes a weighted score for each document indicating the relevance of the document to the topic profile.

To improve memory access performance, a Bloom filter is used to efficiently determine whether a document term is in the topic profile. The Bloom filter is initialized by computing hash functions on all terms in the topic profile and using the hash results to set bits in the filter. The application computes the same hash functions on each term in the document and compares the results to the Bloom filter. If the Bloom filter returns a zero, the term is guaranteed not to be in the topic profile, so no memory access is required. If the Bloom filter returns a one, the weight for that term in the topic profile is retrieved. False positives are possible; they result in an extra memory access but do not affect the overall accuracy of the results.

We implemented this document filtering application using a map/reduce parallel processing approach in which the mapper and reducer kernels were implemented in OpenCL and executed on the FPGA. The mapper kernel calculates the hash functions for each term in the input document dataset, checks the results against the Bloom filter, and accesses the profile weights stored in global memory. The reducer kernel uses the document term frequencies and profile weights to calculate the document score.
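The Bloom filter and scoring logic described above can be sketched as follows. This is an illustrative Python model, not the OpenCL kernel code; the filter size and the two hash functions are arbitrary stand-ins for whatever the real implementation uses:

```python
import hashlib

NUM_BITS = 1024

def hashes(term):
    # Two stand-in hash functions derived from MD5 and SHA-1 digests.
    h1 = int(hashlib.md5(term.encode()).hexdigest(), 16) % NUM_BITS
    h2 = int(hashlib.sha1(term.encode()).hexdigest(), 16) % NUM_BITS
    return h1, h2

def build_filter(profile_terms):
    # Initialization: set one bit per hash result for every profile term.
    bits = [0] * NUM_BITS
    for term in profile_terms:
        for h in hashes(term):
            bits[h] = 1
    return bits

def maybe_in_profile(bits, term):
    # Any zero bit guarantees the term is NOT in the profile, so the
    # (expensive) weight lookup in global memory can be skipped.
    # All ones means "probably present"; false positives are possible.
    return all(bits[h] for h in hashes(term))

profile = {"fpga": 0.9, "opencl": 0.8}   # hypothetical <term, weight> pairs
bloom = build_filter(profile)

def score_document(term_freqs):
    # Reducer-style weighted score: sum of term frequency x profile weight.
    return sum(freq * profile[t]
               for t, freq in term_freqs.items()
               if maybe_in_profile(bloom, t) and t in profile)
```

For example, score_document({"fpga": 3, "hadoop": 1}) scores the document at about 2.7: the Bloom check lets the model skip the weight lookup for any term whose bits are not all set, and a false positive merely costs one extra (failed) lookup without changing the score.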
We implemented the map/reduce document filtering application three ways: as a static analytic using an NDRange approach, as a static analytic using a channelized approach, and as a streaming analytic using a channelized approach.

The NDRange implementation partitions the OpenCL kernel into independent, asynchronous work-groups, each consisting of a number of work-items performing the same operation in parallel. In this approach, the host loads the dataset into the device's global memory. The mapper kernel reads data from memory, processes it, and returns the intermediate results to the host. The host then sends the intermediate results back to device global memory so the reducer kernel can process them. The final results produced by the reducer kernel are written back to global memory for transfer to the host. This is the standard approach to writing OpenCL kernels, and it maps very well to the massively parallel SIMD (Single Instruction, Multiple Data) compute-unit architecture typical of modern GPUs. The FPGA architecture also accommodates this SIMD programming approach, but does not mandate it, allowing more flexibility in the parallel processing implementation.

[Figure: Static Data Analytic (NDRange Approach). The host exchanges data with FPGA global memory (DIMM1, DIMM2): the mapper stores interim results and the reducer stores final results, each round trip passing through the host.]

The channelized approach takes advantage of Altera's channel concept to pass data directly between kernels, eliminating costly host/device memory transfers. Channels are a superset of the pipe concept written into the latest OpenCL specification. Altera channels are implemented as FIFOs, allowing direct data transfer between kernels, as well as between kernels and external I/O. In the channelized implementation, the code is written as a simple single-threaded application. Parallelism is exploited through vector data types (data parallelism), as implemented by the developer, and through loop unrolling (programming parallelism), provided primarily by the compiler. The channelized approach is therefore a more straightforward implementation for the developer than the NDRange approach. Additional helper functions (producer and consumer kernels) handle writing to and reading from external memory.

[Figure: Static Data Analytic (Channelized Approach). Producer, mapper, reducer, and consumer kernels on the FPGA are connected by channels; the host and global memory (DIMM1, DIMM2) sit at either end, and only final results are stored.]
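As a rough software model of the channelized data flow, the pipeline can be sketched with queues standing in for Altera's FIFO channels. This is an illustrative Python simulation only; the real producer, mapper, and reducer are OpenCL kernels compiled to FPGA hardware, and the profile here is hypothetical:

```python
import threading
import queue

# Each queue stands in for a FIFO channel between two kernels.
ch_map_in = queue.Queue()
ch_reduce_in = queue.Queue()
ch_out = queue.Queue()

PROFILE = {"fpga": 0.9, "opencl": 0.8}   # hypothetical <term, weight> profile
DONE = object()                           # end-of-stream marker

def producer(term_freqs):
    # Reads <term, frequency> pairs from "memory" and writes them to a channel.
    for pair in term_freqs.items():
        ch_map_in.put(pair)
    ch_map_in.put(DONE)

def mapper():
    # Looks up the profile weight for each term, forwards <freq, weight>.
    while (item := ch_map_in.get()) is not DONE:
        term, freq = item
        ch_reduce_in.put((freq, PROFILE.get(term, 0.0)))
    ch_reduce_in.put(DONE)

def reducer():
    # Accumulates the weighted document score from the incoming channel.
    score = 0.0
    while (item := ch_reduce_in.get()) is not DONE:
        freq, weight = item
        score += freq * weight
    ch_out.put(score)

def run_pipeline(term_freqs):
    # Kernels run concurrently, connected only by channels; no host
    # round trip for intermediate results.
    threads = [threading.Thread(target=t) for t in
               (lambda: producer(term_freqs), mapper, reducer)]
    for t in threads: t.start()
    for t in threads: t.join()
    return ch_out.get()   # consumer: final result back to the host
```

Note how the intermediate <freq, weight> pairs flow straight from mapper to reducer through a channel, which is exactly the host/device memory transfer that the channelized FPGA implementation eliminates.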

Finally, we implemented the document filtering application as a streaming map/reduce analytic, using the channelized approach to stream data directly into the FPGA. The FPGA includes a UDP offload engine that receives and processes the document data set via an external network connection, providing the document <term, frequency> pairs to the rest of the processing kernels. The amount of parallelism achievable with this approach is less than in the previous approaches, because the UDP offload engine within the FPGA reduces the resources available for user-defined kernels. However, this approach eliminates the collection and storage of the dataset on the host, as well as the subsequent transfer of the data set to the device for processing. Only the final results are transferred back to the host for storage, a significant data reduction that improves both data transfer time and required host storage capacity.

[Figure: Streaming Data Analytic (Channelized). A UDP offload engine on the FPGA feeds the producer, mapper, reducer, and consumer kernels; global memory (DIMM1, DIMM2) backs the pipeline, and only final results are stored on the host.]

Results

We conducted performance (execution time) and power consumption benchmark testing of the static data analytic document filtering application, using both the NDRange and channelized implementations. Each implementation was executed on both the Nallatech PCI385N and BittWare S5PH-Q FPGA platforms using their respective High Performance Computing (HPC) Board Support Packages (BSPs). The host was a 1U micro-server with a C2850 processor.

We performed the benchmark testing using two separate methods. The first method conducted multiple iterations over a smaller data set, re-deploying the kernel to the device at the beginning of each iteration. The second method used a much larger data set and deployed the kernel to the device only once, at the beginning of the test. The performance was compared to CPU-only execution using a single-threaded implementation on both the 1U server and a Core i7-3930K (six-core, 12-thread) processor, as well as to the NDRange OpenCL implementation executed on the i7 processor using AMD's OpenCL SDK.

[Figure: Throughput Performance, in terms per second (up to roughly 1.8E+09), for the two workloads (3000 iterations of 330K terms; a single iteration of 21 million terms), comparing the i7 single-threaded and i7 OpenCL implementations with the Nallatech and BittWare platforms, with and without channels.]

[Figure: Power-Normalized Throughput Performance, in terms per Joule (up to roughly 4.5E+07), for the same two workloads, comparing the i7 implementations with the Nallatech and BittWare platforms, with and without channels.]

Our research has clearly shown the benefits of implementing compute-intensive algorithms on FPGAs within the OpenCL framework. Our OpenCL FPGA implementation delivered significant performance improvements over single-threaded CPU implementations, as well as over OpenCL implementations executing on a CPU only. The greatest performance gains occur for very large data sets using the channelized approach, since it reduces the dependency on costly host/device memory transfers. Beyond the raw performance gains, the real potential of this technology lies in its overwhelming power savings: when performance is analyzed in terms of throughput per Joule, FPGAs realize order-of-magnitude gains.

Our research focused on implementing a document filtering application using the Map/Reduce parallel processing model. While this approach aligned well with the technology, we see no reason to conclude that the performance benefits are limited to this type of computing method. In fact, most parallel algorithms would likely benefit from employing FPGAs and OpenCL.

Altera Corporation, 101 Innovation Drive, San Jose, CA 95134 USA. Telephone: (408) 544-7000. www.altera.com
Altera European Headquarters, Holmers Farm Way, High Wycombe, Buckinghamshire HP12 4XF, United Kingdom. Telephone: (44) 1 494 602 000
Altera European Trading Company Ltd., Building 2100, Cork Airport Business Park, Cork, Republic of Ireland. Telephone: +353 21 454 7500
Altera Japan Ltd., Shinjuku i-land Tower 32F, 6-5-1 Nishi Shinjuku, Shinjuku-ku, Tokyo 163-1332, Japan. Telephone: (81) 3 3340 9480. www.altera.co.jp
Altera International Ltd., Unit 11-18, 9/F Millennium City 1, Tower 1, 388 Kwun Tong Road, Kwun Tong, Kowloon, Hong Kong. Telephone: (852) 2945 7000. www.altera.com.cn
Altera Corporation Technology Center, Plot 6, Bayan Lepas Technoplex, Medan Bayan Lepas, 11900 Bayan Lepas, Penang, Malaysia. Telephone: 604 636 6100

© 2015 Altera Corporation. All rights reserved. ALTERA, ARRIA, CYCLONE, ENPIRION, MAX, MEGACORE, NIOS, QUARTUS and STRATIX words and logos are trademarks of Altera Corporation and registered in the U.S. Patent and Trademark Office and are trademarks or registered trademarks in other countries. All other words and logos identified as trademarks or service marks are the property of their respective holders as described at www.altera.com/legal. March 2015. DS-1001