Image Analytics on Big Data in Motion: Implementation of Image Analytics CCL in Apache Kafka and Storm

Lokesh Babu Rao (1), C. Elayaraja (2)
(1) PG Student, Dept. of ECE, Dhaanish Ahmed College of Engineering, Chennai
(2) Head of Department, Dept. of ECE, Dhaanish Ahmed College of Engineering, Chennai

Abstract: Processing continuous streams of image and video data is a big data challenge. Although the amount of data at any instant is comparatively small, processing it requires special resources because the data is in continuous motion. Efficient transfer of video and image data to a centralized processing location is also a potential problem. Implementing Connected Component Labelling (CCL) on Apache Kafka and Apache Storm solves both problems: efficient data transfer and centralized image processing. Apache Kafka is an open-source messaging system used to transfer large amounts of data from several sources via a centralized cluster. Apache Storm is a distributed processing system used to process big data on a cluster. In this paper, a big data stream-processing framework and an implementation of image analytics on top of it are discussed.

Index Terms: Big Data, Apache Kafka, Apache Storm, Connected Component Labelling (CCL)

I. INTRODUCTION

Big data is best defined as data that has volume, variety and velocity. Image analytics on big data applies above all to processing photographs and surveillance video. Extracting information from surveillance video requires real-time stream processing: the video data is continuously in motion and can reach massive proportions over time, and it must be processed promptly to extract the full benefit of surveillance.

About 2.5 quintillion bytes of data are generated every day, and 90% of all data was generated after 2011. Generally, big data is defined by its main characteristics, which grow in three dimensions: volume, velocity and variety. In this setting, the volume of unstructured data is on the scale of petabytes, and it is created in fractions of a second. In the literature, big data in motion is defined as continuous data streams arriving at a high transfer rate. Such big data represents data sets that cannot be analyzed with conventional algorithms on standard hardware platforms. Because typical memory capacities and bandwidth limitations determine the overall throughput, the continuously growing amount of data is processed online and locally on the streamed data. The new scale of volume, velocity and variety requires redesigning a number of infrastructure components and algorithms to support large-volume, complex and growing data.

Scientific data comes in four types: raw data, structured data, published data and data linked to publications. Raw data is generated from observations of and experiments on different phenomena; the biological, climate and life sciences, for instance, generate massive amounts of it. Images and videos have the highest volume among scientific data and are analytically prepared to gain additional value by extracting various properties of the image, such as objects and movements. Processing these large image collections usually requires large amounts of computational power. In addition to the number of images, other factors influence the computational complexity of the processing and analysis tasks: the increasing resolution of images
(e.g. mega-pixel images) and experimental workflows that involve repeated algorithm executions with varying parameters. A solution to this problem is to use computational infrastructures (such as clusters and supercomputers) that can cope with the task. Stream-processing clusters that can also collect the data stream efficiently are the need of the hour for the scientific community [3].

II. BIG DATA PROCESSING

Image processing requires stream data processing for the following reasons:
1. Video processing for surveillance must happen on time.
2. Processing video data after some span of time is hard and memory intensive.

To process image data, the data streaming, i.e. the live transfer of data from the source to the stream-processing cluster, has to be handled efficiently. Apache Kafka addresses the problem of live data transfer with high efficiency.

A. Apache Kafka

Apache Kafka is an open-source message broker project developed by the Apache Software Foundation and written in Scala. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data feeds. The design is heavily influenced by transaction logs.

Figure 1: Kafka architecture

To derive real value from big data, no information loss can be afforded. Apache Kafka is therefore designed with O(1) disk structures that provide constant-time performance even with very large volumes of stored messages, on the order of terabytes.
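
As a concrete illustration, the sketch below publishes an image file to a Kafka topic as a Base64-encoded string using Kafka's Java producer API. The broker address, the topic name image-stream, the file path and the choice of Base64 encoding are illustrative assumptions, not details taken from the paper.

import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Base64;
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class ImageKafkaAgent {
    public static void main(String[] args) throws Exception {
        // Broker address and serializers; both key and value travel as strings.
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        // Read the image from the specified location and convert it to a string
        // (Base64 is one possible encoding; the paper does not name one).
        byte[] imageBytes = Files.readAllBytes(Paths.get("input/image.png"));
        String imageString = Base64.getEncoder().encodeToString(imageBytes);

        // Emit the image string to the Kafka cluster under a specific topic.
        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            producer.send(new ProducerRecord<>("image-stream", imageString));
        }
    }
}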

B. Apache Storm

Apache Storm is a distributed real-time computation system for processing fast, large streams of data. Storm adds reliable real-time data processing capabilities to Apache Hadoop 2.x; running Storm alongside Hadoop helps capture new business opportunities through low-latency dashboards, security alerts and operational enhancements integrated with the other applications in the Hadoop cluster.

Figure 2: Storm overview

Storm is a distributed, reliable, fault-tolerant system for processing streams of data. The work is delegated to different types of components, each responsible for a simple, specific processing task. The input stream of a Storm cluster is handled by a component called a spout. The spout passes the data to a component called a bolt, which transforms it in some way. A bolt either persists the data in some sort of storage or passes it on to another bolt. A Storm cluster can thus be pictured as a chain of bolt components, each making some kind of transformation on the data exposed by the spout.
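
A minimal sketch of such a spout-and-bolt chain is shown below, wiring a Kafka spout from the storm-kafka-client module to a single downstream bolt. The broker address, topic and component names repeat the assumptions from the producer sketch above, and CCLBolt is the hypothetical bolt class sketched in Section IV; none of these identifiers come from the paper itself.

import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.kafka.spout.KafkaSpout;
import org.apache.storm.kafka.spout.KafkaSpoutConfig;
import org.apache.storm.topology.TopologyBuilder;

public class CCLTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // The Kafka spout listens to the topic the Kafka agent publishes to.
        builder.setSpout("kafka-spout", new KafkaSpout<>(
                KafkaSpoutConfig.builder("localhost:9092", "image-stream").build()));

        // The CCL bolt consumes the spout's stream (see Section IV).
        builder.setBolt("ccl-bolt", new CCLBolt()).shuffleGrouping("kafka-spout");

        // Run the topology in-process; a production deployment would submit it
        // to a real cluster with StormSubmitter instead.
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("ccl-topology", new Config(), builder.createTopology());
        Thread.sleep(60_000);   // let the topology run for a minute
        cluster.shutdown();
    }
}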

III. CONNECTED COMPONENT LABELLING (CCL)

To demonstrate the applicability of image analytics to big data stream processing, the CCL algorithm is implemented. Connected component labelling (alternatively connected-component analysis, blob extraction, region labelling, blob discovery or region extraction) is an algorithmic application of graph theory in which subsets of connected components are uniquely labeled based on a given heuristic. Connected-component labelling is not to be confused with segmentation. It is used in computer vision to detect connected regions in binary digital images, although color images and data of higher dimensionality can also be processed. Figure 3 represents the basic CCL algorithm.

Figure 3: CCL algorithm

Connected component labelling scans an image and groups its pixels into components based on pixel connectivity, i.e. all pixels in a connected component share similar pixel intensity values and are in some way connected with each other. Once all groups have been determined, each pixel is labeled with a gray level or a color (color labelling) according to the component it was assigned to. Extracting and labelling the various disjoint and connected components of an image is central to many automated image analysis applications.
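
For reference, a minimal Java sketch of the classic two-pass labelling scheme is given below, assuming the image has already been binarized into a boolean grid and using 4-connectivity with a union-find equivalence table. It illustrates the algorithm only; the paper's own implementation runs inside a Storm bolt on PPM data (Section IV).

public final class ConnectedComponents {

    // Two-pass CCL with 4-connectivity. pixels[y][x] is true for foreground.
    // Returns per-pixel component labels; 0 marks the background.
    public static int[][] label(boolean[][] pixels) {
        int h = pixels.length, w = pixels[0].length;
        int[][] labels = new int[h][w];
        int[] parent = new int[h * w + 1];      // union-find equivalence table
        int next = 1;

        // Pass 1: assign provisional labels and record label equivalences.
        for (int y = 0; y < h; y++) {
            for (int x = 0; x < w; x++) {
                if (!pixels[y][x]) continue;             // background pixel
                int left = (x > 0) ? labels[y][x - 1] : 0;
                int up   = (y > 0) ? labels[y - 1][x] : 0;
                if (left == 0 && up == 0) {              // start a new component
                    parent[next] = next;
                    labels[y][x] = next++;
                } else if (left != 0 && up != 0) {       // two components meet
                    union(parent, left, up);
                    labels[y][x] = Math.min(left, up);
                } else {                                 // extend one neighbor
                    labels[y][x] = Math.max(left, up);
                }
            }
        }
        // Pass 2: rewrite each provisional label to its final representative.
        for (int y = 0; y < h; y++)
            for (int x = 0; x < w; x++)
                if (labels[y][x] != 0) labels[y][x] = find(parent, labels[y][x]);
        return labels;
    }

    private static int find(int[] p, int i) {
        while (p[i] != i) { p[i] = p[p[i]]; i = p[i]; }  // path halving
        return i;
    }

    private static void union(int[] p, int a, int b) {
        p[find(p, a)] = find(p, b);
    }
}

For example, an L-shaped run of foreground pixels receives a single label after the second pass, while an isolated foreground pixel elsewhere in the grid keeps its own label.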

IV. CCL IMPLEMENTATION

The CCL image analytics is implemented in the Storm topology inside CCL bolts. A topology is the graph representation of the stream data processing, with bolts as its nodes; bolts are the processing units of Storm and act as virtual task agents.

The Kafka agent code reads the image from the specified location and converts it into a string. The Kafka agent can collect and stream data from multiple sources. The collected image string is then emitted to the Kafka cluster under a specific Kafka topic; the topic name is the unique stream id applied to the Kafka cluster by setting the configurations. The Kafka cluster receives the message and stores it under the specified topic name.

Figure 4: CCL topology

The topology in Figure 4 contains the Kafka spout, which listens to the specified Kafka topic name. When a message is received into the Kafka cluster, the Kafka spout receives a copy of the message and transmits the data to the CCL bolt connected to it. The CCL bolt implements the CCL algorithm: the image is reconstructed in the bolt from string back into PNG format, the PNG file is converted into Portable Pixmap format (PPM) using the ImageMagick tool, and the CCL algorithm is applied to the PPM file, where multiple passes over the data produce the final PPM output, which is saved in the specified directory. Thus, the image is efficiently transmitted using Apache Kafka and processed live using Apache Storm; a sketch of such a bolt follows below.
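
The sketch below shows what the CCL bolt's receive-decode-convert pipeline could look like. It assumes the Base64 encoding used in the earlier producer sketch, the default "value" tuple field emitted by the storm-kafka-client spout, illustrative file paths, and a hypothetical runCCL helper standing in for the labelling step; the paper does not specify these details.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Base64;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

public class CCLBolt extends BaseBasicBolt {

    @Override
    public void execute(Tuple input, BasicOutputCollector collector) {
        try {
            // Reconstruct the PNG from the string emitted by the Kafka agent.
            byte[] png = Base64.getDecoder().decode(input.getStringByField("value"));
            Path pngFile = Paths.get("work/frame.png");
            Path ppmFile = Paths.get("work/frame.ppm");
            Files.createDirectories(pngFile.getParent());
            Files.write(pngFile, png);

            // Convert PNG to Portable Pixmap (PPM) via ImageMagick's convert tool.
            new ProcessBuilder("convert", pngFile.toString(), ppmFile.toString())
                    .inheritIO().start().waitFor();

            // Apply the CCL algorithm (cf. the two-pass sketch in Section III)
            // and save the labeled output PPM in the specified directory.
            // runCCL(ppmFile, Paths.get("output/labeled.ppm"));  // hypothetical helper
        } catch (IOException | InterruptedException e) {
            throw new RuntimeException(e);
        }
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: results are written to disk, nothing is emitted downstream.
    }
}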

V. IMPLEMENTATION RESULTS

The CCL topology is tested on a single-node Kafka-Storm cluster using a 300 KB image dataset and the standard CCL algorithm. The processing time, including the Kafka transfer time, is measured for the performance calculation. Table 1 shows the processing time of the 300 KB image dataset for different numbers of cores/tasks; going from 4 to 16 tasks yields roughly a 2.9x speedup.

Table 1: Number of cores vs. processing time

  Number of tasks/cores        4          8          16
  Processing time (seconds)    16.0567    10.8952    5.5432

The Kafka-Storm system is compared with the following existing systems: (1) BIGS, a framework for large-scale image processing and analysis over distributed and heterogeneous computing resources, and (2) CUDA, a CUDA-enabled Hadoop cluster for fast distributed image processing.

A. BIGS

The Big Image Data Analysis Toolkit (BIGS) is a software framework for large-scale image processing and analysis over heterogeneous computing resources, such as those available in clouds, grids and computer clusters, or throughout scattered computer resources (desktops, labs) used in an opportunistic manner. Through BIGS, work is split by data partitioning and parameter sweeps, avoiding the need for inter-process communication and therefore enabling loosely coupled computing nodes.

Figure 5: BIGS overview

BIGS adopts an uncommitted resource allocation model: experimenters define their image processing pipelines in a simple configuration file, a schedule of jobs is generated, and workers, as they become available, take over pending jobs as long as their dependencies on other jobs are fulfilled. BIGS workers act autonomously, querying the job schedule to determine which job to take over. This removes the need for a central scheduling node, requiring only that all workers have access to a shared information source [4].

B. CUDA

CUDA, or Compute Unified Device Architecture, is a C-based programming model proposed by NVIDIA for leveraging the parallel computing capabilities of the GPU for general-purpose computation.

Figure 6: CUDA Hadoop processing

The authors of [5] integrate CUDA acceleration into the Hadoop distributed processing framework to create a heterogeneous high-performance image processing system. Hadoop is a MapReduce-based distributed processing framework, frequently used in industry today in areas of big data analysis, particularly text analysis. Graphics processing units (GPUs), on the other hand, are massively parallel platforms with attractive performance-to-price and performance-to-power ratios, used extensively in recent years for accelerating data-parallel computations. The Hadoop architecture for small files is shown in Figure 6 [5].

C. KAFKA-STORM COMPARISON

The approximate processing times of the BIGS and CUDA systems are obtained for different datasets, and their performance on the 300 KB dataset is compared with the Kafka-Storm system.

Figure 7: Comparison plot of processing time vs. number of cores on the 300 KB dataset: Kafka-Storm, CUDA and BIGS

The bar chart in Figure 7 shows the processing time of Kafka-Storm, CUDA and BIGS on the 300 KB dataset for multiple cores/tasks. From this comparison, the Kafka-Storm system's performance is similar to the CUDA system and better than the BIGS system.

VI. FUTURE WORKS

The Kafka-Storm topology is being extended to analyze surveillance video. The system will extract human faces from surveillance video and compare them with faces in a database: a real-time face recognition system for surveillance video. The Kafka agent collects images from various video sources at a desired frequency and transmits the image string data to the Kafka cluster. The entry point of the Storm topology is the spout, which receives the string data from the Kafka cluster.

Figure 8: Real-time face recognition using Kafka-Storm

The image string data is reconstructed into a PNG image and transferred to the image detection bolt. The image detection bolt is implemented using the OpenCV library, which applies Principal Component Analysis using eigenfaces. The faces are separated in this bolt and passed to the face recognition bolt, which is connected to a database containing the images to be recognized and their feature values. The face recognition bolt compares the current faces with the database faces using these feature values, and the matched faces are transferred to the final result bolt.

VII. CONCLUSION

The CCL algorithm is successfully implemented on Apache Kafka and Apache Storm. The performance comparison shows that the Kafka-Storm system is similar to the CUDA system and better than the BIGS system. In addition to image stream processing, the Kafka-Storm system provides image stream data transfer from multiple image sources. Thus, the Kafka-Storm system provides a complete solution for big data image analytics.

REFERENCES

[1] S. M. Najmabadi, M. Klaiber, Z. Wang, Y. Baroud and S. Simon, "Stream Processing of Scientific Big Data on Heterogeneous Platforms: Image Analytics on Big Data in Motion," Institute for Parallel and Distributed Systems.
[2] U. Chandrasekhar, A. Reddy and R. Rath, "A comparative study of enterprise and open source big data analytical tools," Conference on Information Communication Technologies (ICT), IEEE, 2013, pp. 372-377.
[3] T. Eaton and D. deRoos, Understanding Big Data, McGraw-Hill, April 2012.
[4] Raúl Ramos-Pollán, Fabio A. González, Juan C. Caicedo, Angel Cruz-Roa, Jorge E. Camargo, Jorge A. Vanegas, Santiago A. Pérez, Jose David Bermeo, Juan Sebastián Otálora, Paola K. Rozo and John E. Arévalo, "BIGS: A Framework for Large-Scale Image Processing and Analysis Over Distributed and Heterogeneous Computing Resources."
[5] Ranajoy Malakar and Naga Vydyanathan, "A CUDA-enabled Hadoop Cluster for Fast Distributed Image Processing."
[6] P. N. Belhumeur, J. P. Hespanha and D. Kriegman, "Eigenfaces vs. Fisherfaces: recognition using class specific linear projection," IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(7):711-720, July 1997.
[7] L. Cao, "Singular value decomposition applied to digital image processing," http://www.lokminglui.com/caosvdintro.pdf
[8] C. Hsu and C. Chen, "SVD-based projection for face recognition," http://ieeexplore.ieee.org/xpl/articledetails.jsp?arnumber=4374514