SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA *



Similar documents
SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA

APPM4720/5720: Fast algorithms for big data. Gunnar Martinsson The University of Colorado at Boulder

Reconfigurable Architecture Requirements for Co-Designed Virtual Machines

Secured Embedded Many-Core Accelerator for Big Data Processing

FPGA area allocation for parallel C applications

RevoScaleR Speed and Scalability

MapReduce and Distributed Data Analysis. Sergei Vassilvitskii Google Research

Next Generation Operating Systems

How To Design An Image Processing System On A Chip

Clustering Big Data. Anil K. Jain. (with Radha Chitta and Rong Jin) Department of Computer Science Michigan State University November 29, 2012

Floating Point Fused Add-Subtract and Fused Dot-Product Units

BEHAVIOR BASED CREDIT CARD FRAUD DETECTION USING SUPPORT VECTOR MACHINES

Chapter 07: Instruction Level Parallelism VLIW, Vector, Array and Multithreaded Processors. Lesson 05: Array Processors

Extending the Power of FPGAs. Salil Raje, Xilinx

ACCELERATING COMMERCIAL LINEAR DYNAMIC AND NONLINEAR IMPLICIT FEA SOFTWARE THROUGH HIGH- PERFORMANCE COMPUTING

A General Framework for Tracking Objects in a Multi-Camera Environment

Similarity Search in a Very Large Scale Using Hadoop and HBase

BIG DATA What it is and how to use?

How To Write Security Enhanced Linux On Embedded Systems (Es) On A Microsoft Linux (Amd64) (Amd32) (A Microsoft Microsoft 2.3.2) (For Microsoft) (Or

Inference from sub-nyquist Samples

3 Orthogonal Vectors and Matrices

PARALLEL ALGORITHMS FOR PREDICTIVE MODELLING

Systolic Computing. Fundamentals

Float to Fix conversion

BUILDING A PREDICTIVE MODEL AN EXAMPLE OF A PRODUCT RECOMMENDATION ENGINE

Bag of Pursuits and Neural Gas for Improved Sparse Coding

A numerically adaptive implementation of the simplex method

Establishing Great Software Development Process(es) for Your Organization. By Dale Mayes

REAL-TIME STREAMING ANALYTICS DATA IN, ACTION OUT

Classifying Large Data Sets Using SVMs with Hierarchical Clusters. Presented by :Limou Wang

MONITORING AND DIAGNOSIS OF A MULTI-STAGE MANUFACTURING PROCESS USING BAYESIAN NETWORKS

ON SUITABILITY OF FPGA BASED EVOLVABLE HARDWARE SYSTEMS TO INTEGRATE RECONFIGURABLE CIRCUITS WITH HOST PROCESSING UNIT

Tackling Big Data with MATLAB Adam Filion Application Engineer MathWorks, Inc.

II. RELATED WORK. Sentiment Mining

BSc in Computer Engineering, University of Cyprus

Fast Analytics on Big Data with H20

Seeking Opportunities for Hardware Acceleration in Big Data Analytics

Forschungskolleg Data Analytics Methods and Techniques

MONTE-CARLO SIMULATION OF AMERICAN OPTIONS WITH GPUS. Julien Demouth, NVIDIA

Reconfigurable Low Area Complexity Filter Bank Architecture for Software Defined Radio

An Open-source Framework for Integrating Heterogeneous Resources in Private Clouds

CONQUEST: A Coarse-Grained Algorithm for Constructing Summaries of Distributed Discrete Datasets 1

α = u v. In other words, Orthogonal Projection

Hadoop MapReduce and Spark. Giorgio Pedrazzi, CINECA-SCAI School of Data Analytics and Visualisation Milan, 10/06/2015

CFD Implementation with In-Socket FPGA Accelerators

Eli Levi Eli Levi holds B.Sc.EE from the Technion.Working as field application engineer for Systematics, Specializing in HDL design with MATLAB and

THE NAS KERNEL BENCHMARK PROGRAM

RN-Codings: New Insights and Some Applications

Implementation of Full -Parallelism AES Encryption and Decryption

Video-Rate Stereo Vision on a Reconfigurable Hardware. Ahmad Darabiha Department of Electrical and Computer Engineering University of Toronto

HARDWARE ACCELERATION IN FINANCIAL MARKETS. A step change in speed

Some Computer Organizations and Their Effectiveness. Michael J Flynn. IEEE Transactions on Computers. Vol. c-21, No.

Oracle BI EE Implementation on Netezza. Prepared by SureShot Strategies, Inc.

The Scientific Data Mining Process

Chapter 6. The stacking ensemble approach

Dynamic Load Balancing in Charm++ Abhinav S Bhatele Parallel Programming Lab, UIUC

Supervised Learning (Big Data Analytics)

Using In-Memory Computing to Simplify Big Data Analytics

Cellular Computing on a Linux Cluster

Design and Analysis of Parallel AES Encryption and Decryption Algorithm for Multi Processor Arrays

Development of the Fabry-Perot Spectrometer Application. Kathryn Browne Code 587

International Workshop on Field Programmable Logic and Applications, FPL '99

Networking Virtualization Using FPGAs

Department of Electrical & Computer Engineering

Bayesian Machine Learning (ML): Modeling And Inference in Big Data. Zhuhua Cai Google, Rice University

Department of Physical Medicine & Rehabilitation Northwestern University, Chicago IL USA

Advanced In-Database Analytics

Vinay Parisa 1, Biswajit Mohapatra 2 ;

Unsupervised Data Mining (Clustering)

Conclusions and Future Directions

Parallel AES Encryption with Modified Mix-columns For Many Core Processor Arrays M.S.Arun, V.Saminathan

A Pattern-Based Approach to. Automated Application Performance Analysis

Direct Methods for Solving Linear Systems. Matrix Factorization

Mixed Precision Iterative Refinement Methods Energy Efficiency on Hybrid Hardware Platforms

Scalable Data Analysis in R. Lee E. Edlefsen Chief Scientist UserR! 2011

Supercomputer Center Management Challenges. Branislav Jansík

Comparison of Non-linear Dimensionality Reduction Techniques for Classification with Gene Expression Microarray Data

Numerical Methods I Solving Linear Systems: Sparse Matrices, Iterative Methods and Non-Square Systems

Online Failure Prediction in Cloud Datacenters

PEST - Beyond Basic Model Calibration. Presented by Jon Traum

Image Compression through DCT and Huffman Coding Technique

Data Center and Cloud Computing Market Landscape and Challenges

Load Balancing for Distributed Stream Processing Engines. Muhammad Anis Uddin Nasir EMDC

Oracle9i Data Warehouse Review. Robert F. Edwards Dulcian, Inc.

Principles and characteristics of distributed systems and environments

Fast Matching of Binary Features

Hardware/Software Co-Design of a Java Virtual Machine

SCI Briefing: A Review of the New Hitachi Unified Storage and Hitachi NAS Platform 4000 Series. Silverton Consulting, Inc.

CYBER SCIENCE 2015 AN ANALYSIS OF NETWORK TRAFFIC CLASSIFICATION FOR BOTNET DETECTION

SYMMETRIC EIGENFACES MILI I. SHAH

GEDAE TM - A Graphical Programming and Autocode Generation Tool for Signal Processor Applications

Data Mining Practical Machine Learning Tools and Techniques

Oscillation Monitoring System

Hardware-accelerated Text Analytics

Scalable Developments for Big Data Analytics in Remote Sensing

Nonlinear Iterative Partial Least Squares Method

Massive Cloud Auditing using Data Mining on Hadoop

A Simultaneous Solution for General Linear Equations on a Ring or Hierarchical Cluster

How To Fix A 3 Bit Error In Data From A Data Point To A Bit Code (Data Point) With A Power Source (Data Source) And A Power Cell (Power Source)

Transcription:

SSketch: An Automated Framework for Streaming Sketch-based Analysis of Big Data on FPGA * Bita Darvish Rouhani, Ebrahim Songhori, Azalia Mirhoseini, and Farinaz Koushanfar Department of ECE, Rice University Houston Texas, USA * This work was supported in parts by the Office of Naval Research grant (ONR- N00014-11-1-0885)

Introduction Era of big data Decision making Find patterns Prevent failures Applications Medical Cybersecurity Media/social networks Finance Machine learning and statistical optimizations are main enablers 2

Efficient Data Transformation Data transformation Compact representation of the data collection Exploiting the redundancy present in the dataset An efficient data transformation should simultaneously consider the: Scalability Application Underlying platform constraints 3

Ensemble of Lower Dimensional Structures Many modern massive datasets are either low-rank or lie on a union of lower dimensional subspaces [1] Original dense dataset A m n 200 Singular value decomposition on A 150 100 SVD$ CSS$ - SVD - Suitable basis U m l V l n 50 0-50 Ensemble of lower dimensional structures -100-150 -50-40 -30-20 -10 0 10 20 30 40 50 D m l V l n [1] Mirhoseini et. al, Rankmap: A Platform-aware Framework for Distributed Learning from Dense Datasets, arxiv:1503.08169, 2015 4

SSketch Framework Streaming Sketch-based analysis of big data using FPGA Transforming the big data with dense correlations to an ensemble of lower dimensional subspaces Streaming applications: Limited storage Single pass access to data Data Source API Data sketching unit Data Source API Dictionary learner Data sketching unit Host (SW) Ethernet Interface Control unit Reconfigurable device (HW) Dictionary learner Ethernet Interface Control unit Host (SW) Reconfigurable device (HW) 5

SSketch Framework The transformed data is applicable to a broad set of matrix-based data analysis algorithms: Regularized loss function optimization Power method Image processing De-noising Super-resolution Classification 6

SSketch Methodology Data transformation: minimize ka DVk F subject to kvk 0 apple kn D"R m l,v"r l n Adaptive error-based dictionary learning Single pass access to data 7

SSketch Methodology Data transformation: Block splitting Amenable to FPGA accelerator minimize ka DVk F subject to kvk 0 apple kn D"R m l,v"r l n m A 1 A 2 A 3 m b m D 1 0 0 0 D 2 0 0 0 D 3 l V 1 V 2 V 3 n n l Schematic depiction of blocking SSketch 7

SSketch API Inputs: Stream of input data SSketch Algorithmic parameters Outputs: Dictionary matrix D Block-sparse matrix V SSketch consists of two main components: Dictionary learner unit Data sketching unit 8

Adaptive Dictionary Learning By Streaming the input data, SSketch adaptively: Learns/updates the corresponding dictionary of each block Computes the data sketch Input sample ( a i ) m b A D W (a i )= kd(dt D) 1 D t a i a i k 2 ka i k 2 Normalize a i & add it to D No W (a i ) apple Yes Data Sketching Unit Dictionary Learner Unit 9

Adaptive Dictionary Learning By Streaming the input data, SSketch adaptively: Learns/updates the corresponding dictionary of each block Computes the data sketch Input sample ( a i ) m b A D W (a i )= kd(dt D) 1 D t a i a i k 2 ka i k 2 Normalize a i & add it to D No W (a i ) apple Yes Data Sketching Unit Dictionary Learner Unit 9

Data Sketching Computing the block-sparse matrix V Applying the efficient, and greedy Orthogonal Matching Pursuit (OMP) routine Asynchronous parallel approach via a control unit Data Source API Data sketching unit Dictionary learner Ethernet Interface Control unit Host (SW) Reconfigurable device (HW) 10

Hardware Implementation Xilinx Virtex-6 FPGA ML605 Evaluation IEEE 754 single precision floating point format Resource utilization Virtex-6 resource utilization 11

OMP Routine OMP is mainly consists of three steps: Find best fitting column LS optimization Residual update Use QR decomposition to address the LS optimization 12

Benchmark Datasets Three different datasets: Light Field imaging A sequence of multi-dimensional array of images that are simultaneously captured from slightly different viewpoints Hyper-Spectral imaging A sequence of images generated by hundreds of detectors that capture the information from across the electromagnetic spectrum Synthetic data 13

Evaluation Results Xerr = ka Ãk F kak F, Ã = DV Gerr = kat A ka t Ak F ÃT Ãk F Compression_rate = nnz(d)+nnz(v) nnz(a) Data sketching error decreases as the dictionary size increases 14

Evaluation Results Xerr = ka Ãk F kak F, Ã = DV Gerr = kat A ka t Ak F ÃT Ãk F Compression_rate = nnz(d)+nnz(v) nnz(a) There is a trade-off between the sketch accuracy and the number of non-zeros in the block-sparse matrix V 14

Evaluation Results SSketch total processing time is linear in terms of the number of processed samples Runtime: T SSketch T dictionary learning SSketch runtime is dominated by + T Communication Overhead T FPGA Computation + T FPGA Computation Size of n 1k 5k 10k 20k T SSketch T SSketch (l = 128) (l = 64) 3.635s 2.31s 21.029s 12.01s 43.446s 24.32s 90.761s 48.52s m b = 256, =0.01,and =0.1 15

Conclusion Adaptive hardware-accelerated streaming-based data transformation Scalable streaming-based sketching methodology Amenable to FPGA acceleration Fixed, and low memory footprint User-friendly API Rapid prototyping of an arbitrary matrix-based data analysis Scalable, floating-point implementation of OMP on FPGA Up to 200 folds speed up compared to the software-only realization Less than 4% message passing delay for communication between the processor and accelerator 16

Acknowledgment From left to right: Bita Darvish Rouhani, PhD. student, Rice University (bita@rice.edu) Ebrahim Songhori, PhD. student, Rice University (ebrahim@rice.edu) Azalia Mirhoseini, PhD. student, Rice University (azalia@rice.edu) Farinaz Koushanfar, Associate Professor, Rice University (farinaz@rice.edu) This work was supported in parts by the Office of Naval Research grant (ONR- N00014-11-1-0885) 17