
3rd International Conference on Mechatronics and Industrial Informatics (ICMII 2015)

Distributed Data Streams Processing Based on Flume/Kafka/Spark

Wang Jun 1,a,*, Wang Wenhao 2,b, Chen Renfei 3,c
1,2 Henan Xuji Metering Limited Company, Xuchang, Henan Province 461000, China
3 School of Electronic Engineering, Beijing University of Posts and Telecommunications, Beijing 100876, China
a 4948791@qq.com, b tianyumingyue@163.com, c 313410646@qq.com

Keywords: Distributed System, Stream Processing, Flume, Kafka, Spark

Abstract. This paper designs and implements a distributed data stream processing system based on Flume, Kafka and Spark that collects and analyzes data streams and mines business-intelligence information efficiently, reliably and in real time. With Flume's high scalability and reliability, data from multiple sources can be collected accurately and the collection layer extended easily. Kafka's high throughput, scalability and distributed design meet the distribution requirements of massive data. Spark Streaming provides an efficient, fault-tolerant, real-time framework for large-scale stream processing. On this basis, the services and strategy of an enterprise can be improved.

Introduction

In the information age, data shows a tendency of explosive growth. An enterprise produces large volumes of logs in many areas, for example production and transactions, and the scale of this log data has jumped from GB to TB and even PB. Faced with massive logs, conventional data processing frameworks can no longer meet the needs of the enterprise. At the same time, the real-time demands that enterprise business places on data processing are gradually increasing. The throughput and fault tolerance of traditional stream processing frameworks have inherent defects, which makes them unsuitable for the rapidly expanding business needs of industries such as the Internet [1]. The change of business requirements has intensified the demand for innovation in data processing technology.
Current data systems do not adopt a distributed design as a whole: only some individual components are distributed, and there is no integrated distributed scheme. In addition, the reliability of data is not well protected during collection, the forwarding throughput is inadequate under highly concurrent massive logs, and stream processing and system fault tolerance are not well considered. Aiming at this series of problems, a distributed data stream processing system is designed that is distributed, scalable, high-throughput and highly reliable. It can extract target information from the data content accurately and in real time, helping the enterprise take relevant marketing measures, provide users with timely personalized service, and enhance its competitiveness [1].

Architecture Design and Implementation

The system is divided into three modules: the data collection module, the data distribution module and the data analysis module. The data collection module is responsible for collecting logs from different data sources and sending them to the data distribution module. When the data distribution module receives the data, the stream is grouped by topic, and each topic's stream is sent to the data analysis module subscribed to it. The data analysis module then performs the specific analysis for that topic. The three modules are seamlessly connected, and together meet the requirements of distribution, high scalability, high reliability and real-time performance.

2015. The authors - Published by Atlantis Press
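The three-module flow described above can be sketched in a few lines of dependency-free Python. The module functions, topic names and record format below are illustrative assumptions, not the paper's implementation:

```python
# Hypothetical sketch of the collect -> distribute-by-topic -> analyze flow.
from collections import defaultdict

def collect(raw_lines):
    """Data collection module: turn raw log lines into (topic, record) pairs."""
    for line in raw_lines:
        topic, _, payload = line.partition(":")
        yield topic.strip(), payload.strip()

def distribute(records):
    """Data distribution module: group the stream by topic."""
    topics = defaultdict(list)
    for topic, payload in records:
        topics[topic].append(payload)
    return topics

def analyze(topics):
    """Data analysis module: per-topic processing (here, just counting)."""
    return {topic: len(msgs) for topic, msgs in topics.items()}

raw = ["orders: id=1", "clicks: /home", "orders: id=2"]
print(analyze(distribute(collect(raw))))   # {'orders': 2, 'clicks': 1}
```

In the real system each stage runs on its own cluster (Flume, Kafka, Spark); the sketch only shows how responsibility is divided between the stages.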

Fig. 1 System architecture (data sources -> Flume aggregation -> Kafka data distribution -> Spark cluster data analysis -> front end)

Data Collection Module

The data collection module requires a distributed, reliable, highly available framework capable of handling massive data, with support for multi-source collection and centralized storage. Apache Flume is used in many data collection systems [2]. From Flume OG to Flume NG it has gained many advanced features. Flume has a simple and flexible architecture based on streaming data flows, and uses transaction-based delivery to ensure reliability: the source and destination of the data are encapsulated in a transaction, and a transaction is held in the channel until it has been processed, and only then removed. This is how Flume provides a point-to-point reliability mechanism. Flume lets users build multi-level flows with high scalability: the topology of the entire module is composed of one or more agents according to the specific logical needs. In a multi-level flow, the handoff between the sink of the previous agent and the source of the next agent likewise guarantees the reliability of the data. The reliability of Flume is therefore better than that of alternatives such as Scribe. In this paper, the structure of the data collection module is as follows:

Fig. 2 Flume module (collection agents feeding an aggregation agent)

Flume's agent is the basic unit of data flow, composed of three parts: source, channel and sink [3]. Source 1 and Source 2 in Fig. 2 are of the Exec type, which monitors a data file: when lines are appended to the end of the file, they are acquired in real time. The channel provides the function of a message queue; it can cache events in memory and can also persist data. Sink 1, Sink 2 and Source 3 are of the Avro type; data transmission between them is based on Avro events, which enables remote transmission across nodes, with sending and receiving configured by address and port. Sink 3 is a custom type that acts as a Kafka message producer, sending the collected data on to Kafka.
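A collection agent of the kind shown in Fig. 2 might be configured as a standard Flume properties file along the following lines. The agent name, file path, host name and port are placeholders, not taken from the paper:

```properties
# Hypothetical Flume agent: Exec source -> memory channel -> Avro sink
agent1.sources = src1
agent1.channels = ch1
agent1.sinks = sink1

# Exec source: tail a log file and pick up appended lines in real time
agent1.sources.src1.type = exec
agent1.sources.src1.command = tail -F /var/log/app/app.log
agent1.sources.src1.channels = ch1

# Memory channel: buffers events between source and sink
agent1.channels.ch1.type = memory
agent1.channels.ch1.capacity = 10000

# Avro sink: forward events to the aggregation agent on another node
agent1.sinks.sink1.type = avro
agent1.sinks.sink1.hostname = aggregator.example.com
agent1.sinks.sink1.port = 4141
agent1.sinks.sink1.channel = ch1
```

The aggregation agent would mirror this with an Avro source listening on the same port and a custom Kafka-producing sink.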

Data Distribution Module

Existing message queuing systems such as ZeroMQ [3] can handle real-time or near-real-time applications, but the data is normally not written to disk, which may be a problem for Hadoop or Spark applications [4]. Kafka is a publish/subscribe messaging system that supports parallel data loading into Hadoop or Spark, serving not only offline analysis but also real-time processing. Kafka provides message persistence with O(1) time complexity, guaranteeing constant-time performance even for TB-scale data. Kafka is a fully distributed system: brokers, producers and consumers all natively support distribution [5], any number of each can run on a large cluster, and the cluster serves external clients as a whole. Kafka uses ZooKeeper (a distributed application coordination framework) to solve the problem of data management in distributed applications [6]. Kafka runs on the JVM, integrates well with many other frameworks, and has a flexible structure and good scalability, making it a better fit here than the rather heavyweight, Erlang-based RabbitMQ. In this paper, the architecture of the data distribution module is as follows:

Fig. 3 Data distribution module (production by Flume -> distribution by Kafka topics, coordinated by ZooKeeper -> consumption by Spark)

Data Analysis Module

Spark is a highly efficient distributed computing system; thanks to its in-memory characteristics, it is especially good at iterative and interactive data processing [7].
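The log-based model Kafka contributes to this architecture, an append-only log per topic from which each consumer group pulls by its own offset, can be imitated in miniature in plain Python. This is an illustrative sketch only, not the Kafka API:

```python
# Minimal in-memory imitation of Kafka's topic/offset model (illustrative).
class MiniLog:
    def __init__(self):
        self._topics = {}    # topic -> list of messages (the append-only "log")
        self._offsets = {}   # (topic, group) -> next offset to read

    def produce(self, topic, message):
        # Appending to a log is O(1), the property the paper attributes to Kafka
        self._topics.setdefault(topic, []).append(message)

    def consume(self, topic, group):
        # Each consumer group tracks its own read position (offset)
        log = self._topics.get(topic, [])
        pos = self._offsets.get((topic, group), 0)
        batch = log[pos:]
        self._offsets[(topic, group)] = len(log)
        return batch

broker = MiniLog()
broker.produce("orders", "id=1")
broker.produce("orders", "id=2")
print(broker.consume("orders", "spark"))   # ['id=1', 'id=2']
print(broker.consume("orders", "spark"))   # [] -- offset already advanced
```

Because messages stay in the log after being read, a late-starting or recovering consumer can replay from an earlier offset, which is what makes the persisted log friendlier to Spark than purely in-memory queues such as ZeroMQ.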
Spark Streaming is a real-time computing framework built on Spark. Through the rich API provided by Spark and its high-speed in-memory execution engine, it achieves scalable, high-throughput, fault-tolerant processing of real-time data streams. It divides the input data into a DStream (discretized stream); each segment is converted into an RDD (Spark's resilient distributed dataset), RDD transformations produce intermediate results, and, depending on business needs, these intermediate results can be further aggregated or stored to external devices. Fault tolerance is very important for stream computing. Every RDD is an immutable, distributed, recomputable data set that records the deterministic operations of its lineage; as long as the input data is fault-tolerant, any lost or unavailable RDD partition can be recomputed from the original input data by replaying those operations. This guarantees the fault tolerance of Spark Streaming. In this paper, the data processing flow of the data analysis module is as follows:

Fig. 4 Log analysis module (input stream from Kafka -> discretized input batches -> Spark Streaming analysis -> discretized output streams/results)
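The micro-batch model just described can be simulated without Spark at all: cut the stream into small batches, process each batch as an independent dataset, and fold the per-batch results into a running state. Function names and the toy workload are assumptions for illustration:

```python
# Dependency-free simulation of Spark Streaming's micro-batch model.
def micro_batches(stream, batch_size):
    """Discretize an input stream into fixed-size batches (a 'DStream')."""
    batch = []
    for record in stream:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

def process_batch(batch):
    """Per-batch transformation, standing in for an RDD operation."""
    return sum(batch)

def run(stream, batch_size):
    """Fold per-batch results into a running (stateful) total."""
    totals, state = [], 0
    for batch in micro_batches(stream, batch_size):
        state += process_batch(batch)
        totals.append(state)
    return totals

print(run([1, 2, 3, 4, 5], 2))   # [3, 10, 15]
```

In real Spark Streaming the batches are formed by time interval rather than record count, and a lost batch result is recomputed from its lineage instead of being recalculated by hand, but the shape of the computation is the same.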

System Test

Experimental Environment

Tab. 1 The hardware and software environment of the experiment

Node  Hardware   OS                    Role
N1    RAM: 4 GB  ubuntu-12.04.5-amd64  master node, process node
N2    RAM: 2 GB  ubuntu-12.04.5-amd64  backup master node, process node
N3    RAM: 2 GB  ubuntu-12.04.5-amd64  process node

Software: spark-1.3.1-bin-hadoop2.4, kafka_2.10-0.8.2.1, apache-flume-1.5.2-bin

Experimental Data

The experimental data come from the Tianchi data laboratory of Alibaba. The data set contains anonymized users' shopping records over six months. The transaction data is analyzed to obtain the relationship between order quantity and the users' age and gender. In the test, a script transforms the data into highly concurrent data streams that are sent to the system, and the streams are analyzed in real time.

Experimental Conclusion

During the experiment, the data streams were analyzed in real time, and the following results were obtained.

Fig. 5 The age distribution of customer order quantity (order counts over periods T1-T6, bucketed as Unknown, <18, 18-24, 25-29, 30-34, 35-39, 40-49, >=50)

According to Fig. 5, customers are mainly distributed in the range of 18 to 49 years old, and users aged 30 to 34 place the most orders. From the experimental results we can also quickly obtain the proportion of every age group. According to Fig. 6, the order quantity of female customers is significantly larger than that of male customers, and the experimental results likewise give the gender proportions.
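The aggregation behind Fig. 5, bucketing orders by customer age and counting orders per bucket, can be sketched as follows. The record schema is a hypothetical stand-in and the bucket boundaries are taken from the figure's legend; the Tianchi data itself is not reproduced:

```python
# Sketch of the age-distribution analysis (hypothetical record format).
AGE_GROUPS = [(0, 17, "<18"), (18, 24, "18-24"), (25, 29, "25-29"),
              (30, 34, "30-34"), (35, 39, "35-39"), (40, 49, "40-49"),
              (50, 200, ">=50")]

def age_group(age):
    """Map an age to the bucket label used in Fig. 5."""
    for lo, hi, label in AGE_GROUPS:
        if lo <= age <= hi:
            return label
    return "Unknown"

def order_counts_by_age(orders):
    """Count orders per age bucket; records without an age go to Unknown."""
    counts = {}
    for order in orders:
        label = age_group(order.get("age", -1))
        counts[label] = counts.get(label, 0) + 1
    return counts

orders = [{"age": 31}, {"age": 32}, {"age": 22}, {}]
print(order_counts_by_age(orders))   # {'30-34': 2, '18-24': 1, 'Unknown': 1}
```

In the deployed system this logic would run per micro-batch inside Spark Streaming, with the counts updated incrementally as each batch arrives.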

Fig. 6 The gender distribution of customer order quantity (orders for T1-T6: Female 219, 212, 183, 188, 178, 190; Male 105, 100, 111, 105, 117, 112; Unknown 12, 16, 21, 18, 21, 11)

Conclusion

A distributed data stream processing system based on Flume/Kafka/Spark is presented. It can handle large-scale data efficiently and accurately. With Flume's high scalability and reliability, data from multiple sources can be collected accurately and the collection layer extended easily. Kafka's high throughput, scalability and distributed design meet the distribution requirements of massive data. Spark Streaming provides an efficient, fault-tolerant, real-time large-scale stream processing framework that can quickly analyze and compute over large volumes of stream data. The architecture of the system is novel and realizes the overall processing chain of data stream collection, distribution and analysis. According to the results of the analysis, an enterprise can adjust the target population of its advertising and take other relevant measures. The system processes stream data efficiently and accurately, extracts relevant information by analyzing order data, helps the enterprise adjust its marketing strategy, and improves enterprise performance.

References

[1] Jay Kreps. I Heart Logs: Event Data, Stream Processing, and Data Integration. Sebastopol, CA, USA: O'Reilly Media, 2014.
[2] Steve Hoffman. Apache Flume: Distributed Data Collection for Hadoop (2nd Edition). Olton, Birmingham, UK: Packt Publishing Ltd, 2015.
[3] Faruk Akgul. ZeroMQ. Olton, Birmingham, UK: Packt Publishing Ltd, 2013.
[4] Sandeep Karanth. Mastering Hadoop. Olton, Birmingham, UK: Packt Publishing Ltd, 2014.
[5] Nishant Garg. Learning Apache Kafka (2nd Edition). Olton, Birmingham, UK: Packt Publishing Ltd, 2015.
[6] Flavio Junqueira, Benjamin Reed.
ZooKeeper: Distributed Process Coordination. Sebastopol, CA, USA: O'Reilly Media, 2013.
[7] Krishna Sankar, Holden Karau. Fast Data Processing with Spark (2nd Edition). Olton, Birmingham, UK: Packt Publishing Ltd, 2015.