In-memory Distributed Processing Method for Traffic Big Data to Analyze and Share Traffic Events in Real Time among Social Groups




pp. 51-58
http://dx.doi.org/10.14257/ijseia.2016.10.1.06

Dojin Choi 1, Bosung Kim 2, Insu Bae 2 and Seokil Song 1*
1 Department of Computer Engineering, Korea National University of Transportation, Chungju, Chungbuk, Republic of Korea
2 Department of Information Technology Convergence, Korea National University of Transportation, Chungju, Chungbuk, Republic of Korea
mycdj91@ut.ac.kr, jikol2000@ut.ac.kr, gkdrmf23@ut.ac.kr, sisong@ut.ac.kr

Abstract
In this paper, we propose an in-memory distributed processing method that rapidly processes vehicle location and traffic event data using Spark Streaming. The proposed system shares information about surrounding vehicles, pedestrians, and traffic events in real time with drivers who use the WEVING service. In the proposed method, vehicle location and traffic event streams are indexed using the grid indexing technique according to time, and continuous range queries are processed using the index. Also, traffic events transferred in real time are grouped by occurrence time, location, content, and road segment in order to avoid duplicated traffic events. Through experiments, we show that the proposed method is able to deduplicate similar traffic events efficiently.

Keywords: traffic data, big data, social group, spark streaming

1. Introduction
Road congestion has a large negative impact on the community and the environment. It can be reduced in two ways. The first is to increase road capacity, which is very difficult, especially in urban environments. The second is to reduce demand in congested areas by providing information about road status so that drivers change their mode of transport, route, or travel time.
Collecting traffic information with infrastructure sensors such as CCTV cameras and in-road loop sensors has limits. The sensors cannot cover the whole road network unless they are deployed everywhere at enormous cost. Recently, with the rapid increase in vehicles equipped with communication-capable smart devices such as black boxes and smartphones, it has become possible to collect and share traffic information with those devices instead of infrastructure sensors. Various traffic safety and convenience applications and services based on this approach have been developed. WEVING (WE are driving together) is a social driver assistance system [1] with a smartphone application that automatically detects traffic events (e.g., delays, congestion, accidents, and road conditions) using the cameras, GPS (Global Positioning System) receivers, gyroscope sensors, and acceleration sensors mounted in smartphones. The detected traffic events, event images, and current locations are shared with users in social groups via WEVING. If many vehicles use the WEVING service and report their current locations frequently, sharing vehicle locations, traffic events, and images between social groups places a large workload on the WEVING server.

ISSN: 1738-9984 IJSEIA, Copyright © 2016 SERSC

In particular, sharing the locations of surrounding vehicles and traffic events in real time is very important, because processing delays caused by server overload can make the shared locations and events meaningless. In addition, traffic events detected in areas of heavy road traffic are likely to be duplicated. Therefore, a single server that processes data by storing them on hard disks cannot meet the requirements of applications such as WEVING. Meeting them requires in-memory processing, together with managing and analyzing data in parallel across multiple servers, so that data can be stored and searched quickly.

In this study, a method for storing and managing traffic event and vehicle location data is proposed using Spark [2] and Spark Streaming [3], two distributed in-memory processing platforms. Spark Streaming is an in-memory stream processing system that receives messages, logs, or text files in real time and processes them in batches collected over a pre-determined interval (less than 1 sec). Spark Streaming is based on the RDD (Resilient Distributed Datasets) [2] model of Spark. RDDs can recover data held in memory, so data can be processed without being stored on hard disks. Spark Streaming processes the data received in each interval as RDDs, thereby providing fault-tolerant stream processing for applications built on it.

In this paper, we propose a method that rapidly processes vehicle location and traffic event data as stream data using Spark Streaming. The proposed method indexes vehicle location and traffic event streams transferred in real time according to location and time by means of a grid indexing technique. The grid indexing technique used in this paper is a version of the existing grid indexing technique [4] modified to suit Spark Streaming.
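The interval-based batching described above can be illustrated without Spark. The following Python sketch is only an illustration under stated assumptions: `discretize`, the `(timestamp, payload)` record layout, and the 0.5-second interval are hypothetical and are not part of Spark Streaming or of the implementation described in this paper. It groups timestamped records into fixed-length batches, which is the basic idea behind processing a stream as a sequence of small data sets.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Record = Tuple[float, str]  # (timestamp in seconds, payload); hypothetical layout


def discretize(records: Iterable[Record], batch_interval: float) -> Dict[int, List[Record]]:
    """Group timestamped records into consecutive batches of fixed length."""
    batches: Dict[int, List[Record]] = defaultdict(list)
    for ts, payload in records:
        # Batch k holds records with k * interval <= ts < (k + 1) * interval.
        batches[int(ts // batch_interval)].append((ts, payload))
    return dict(sorted(batches.items()))


# Hypothetical messages: three location reports and one traffic event.
stream = [(0.10, "loc:v1"), (0.40, "event:v2"), (0.55, "loc:v3"), (1.20, "loc:v1")]
batches = discretize(stream, batch_interval=0.5)
for batch_id, items in batches.items():
    print(batch_id, [payload for _, payload in items])
```

Each batch can then be processed as one immutable data set, which is what makes recomputation after a failure straightforward.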
Vehicles that use the WEVING service set a region of interest based on the current vehicle location and send it to the server to request information about other vehicles, pedestrians, and traffic events in that region. The WEVING server treats the region of interest sent by a moving vehicle as a CRQ (Continuous Range Query). To process CRQs effectively, this paper proposes a grid indexing-based CRQ technique. In addition, this paper proposes a method that avoids unnecessary traffic event sharing by grouping identical traffic events using the occurrence time, location, and content of the traffic events sent in real time. The traffic event classification method is also based on Spark Streaming.

2. Related Work
The proposed method uses the distributed grid index of [5]. The distributed indexing method of [5] is based on Spark Streaming. It adds transformation and output operators such as bulkload, bulkinsert, splitindex, and search to Spark Streaming in order to index and query vehicles. The input stream consists of the positions of moving objects, which are transmitted periodically. Spark Streaming transforms the input stream into D-Streams. As shown in Figure 1, the input stream is continuously transformed into DS_t, DS_t+1, DS_t+2, and DS_t+3. The bulkload and bulkinsert operators are performed on each D-Stream; bulkload builds a grid index GI_t from the position data in DS_t. The indexing method uses grid techniques to reduce the time for building and updating an index. It does not use lock-based concurrency control: D-Streams in Spark Streaming are immutable, so update operations and search operations are never performed concurrently on the same index.

As shown in Figure 1, an index is updated by the bulkload and bulkinsert operators, and multiple versions of the index at times t, t+1, t+2, and t+3 are kept in main memory, where users can access them. Therefore, users can access GI_t+2 while GI_t+3 is being built.
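The idea of building a fresh, read-only index version per batch while older versions stay readable can be sketched in plain Python. This is a minimal sketch, not the operators of [5]: only the name `bulkload` comes from the text, while `VersionedGridIndex`, `publish`, `search`, the planar `(id, x, y)` position format, and the `keep` parameter are assumptions made for illustration.

```python
from types import MappingProxyType
from typing import Dict, List, Mapping, Set, Tuple

Position = Tuple[str, float, float]           # (object id, x, y); assumed planar units
Cell = Tuple[int, int]                        # (column, row) of a grid cell
Snapshot = Mapping[Cell, Tuple[Position, ...]]


def bulkload(batch: List[Position], cell_size: float) -> Snapshot:
    """Build a read-only grid index from one batch of positions."""
    cells: Dict[Cell, List[Position]] = {}
    for oid, x, y in batch:
        cell = (int(x // cell_size), int(y // cell_size))
        cells.setdefault(cell, []).append((oid, x, y))
    return MappingProxyType({c: tuple(ps) for c, ps in cells.items()})


class VersionedGridIndex:
    """Keeps the newest `keep` snapshots (keep >= 1) so reads never see a half-built index."""

    def __init__(self, cell_size: float, keep: int = 4) -> None:
        self.cell_size = cell_size
        self.keep = keep
        self.versions: Dict[int, Snapshot] = {}

    def publish(self, t: int, batch: List[Position]) -> None:
        # Build the new version completely, then make it visible in one step.
        self.versions[t] = bulkload(batch, self.cell_size)
        for old in sorted(self.versions)[:-self.keep]:
            del self.versions[old]

    def search(self, t: int, x_min: float, y_min: float,
               x_max: float, y_max: float) -> Set[str]:
        """Return ids of objects inside the rectangle in version t."""
        snapshot, s = self.versions[t], self.cell_size
        hits: Set[str] = set()
        for cx in range(int(x_min // s), int(x_max // s) + 1):
            for cy in range(int(y_min // s), int(y_max // s) + 1):
                for oid, x, y in snapshot.get((cx, cy), ()):
                    if x_min <= x <= x_max and y_min <= y <= y_max:
                        hits.add(oid)
        return hits


index = VersionedGridIndex(cell_size=10.0, keep=2)
index.publish(0, [("v1", 1.0, 1.0), ("v2", 25.0, 5.0)])
index.publish(1, [("v1", 12.0, 1.0), ("v2", 26.0, 5.0)])
print(index.search(0, 0.0, 0.0, 10.0, 10.0))
```

Because a snapshot is never modified after it is published, a search on version t needs no lock even while version t+1 is under construction.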

Figure 1. Build Grid Index with Input Stream

3. Proposed In-Memory Traffic Big Data Processing Technique
The server processes received data differently depending on whether they are a client's current location or a traffic event. The server answers each client's CRQ with the surrounding vehicles and traffic events to be displayed on the client screen. The CRQ sent by a client is revised as the vehicle travels, and the server computes the result for the revised CRQ region in real time and sends it to the client. The server looks up the road segment for the received vehicle location from the external map server and stores the vehicle location together with the road segment identifier. At the same time, the vehicle's current location is indexed through the grid index manager, and the continuous query result list is updated through the continuous query processing manager. Similarly, the traffic event manager in the server indexes each received traffic event through the grid index manager and searches the indexed traffic events in order to group similar events. For each group, a representative traffic event is selected and reflected in the CRQ result of each client through the continuous query processing manager. The push manager sends the updated continuous query result to each client, which displays it on the screen.

This procedure is designed and implemented on Spark Streaming. The grid index and the continuous query list are maintained in memory as Spark RDDs. In addition, because RDDs cannot be modified, this study proposes a management method that keeps multiple versions. This allows read and write operations on the grid index and the continuous query result list to proceed without interfering with each other.
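A CRQ whose region follows a moving client could be maintained roughly as follows. This is a hedged sketch rather than the authors' implementation: `CRQProcessor`, `register`, and `update` are hypothetical names, coordinates are assumed to be planar, and the client identifier is assumed to equal the vehicle identifier so that a vehicle is excluded from its own result. Each query is registered in every grid cell its rectangle overlaps, so an incoming position is tested only against the queries listed in its own cell.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple

Rect = Tuple[float, float, float, float]      # (x_min, y_min, x_max, y_max)
Cell = Tuple[int, int]


@dataclass
class ContinuousQuery:
    client_id: str
    mbr: Rect                                  # region of interest of the client
    result: Set[str] = field(default_factory=set)


class CRQProcessor:
    def __init__(self, cell_size: float) -> None:
        self.cell_size = cell_size
        self.queries: Dict[str, ContinuousQuery] = {}
        self.cell_queries: Dict[Cell, Set[str]] = defaultdict(set)

    def _cells(self, mbr: Rect) -> List[Cell]:
        x0, y0, x1, y1 = mbr
        s = self.cell_size
        return [(cx, cy)
                for cx in range(int(x0 // s), int(x1 // s) + 1)
                for cy in range(int(y0 // s), int(y1 // s) + 1)]

    def register(self, client_id: str, mbr: Rect) -> None:
        """Create the client's query, or move it when the vehicle has travelled."""
        old = self.queries.get(client_id)
        if old is not None:
            for cell in self._cells(old.mbr):
                self.cell_queries[cell].discard(client_id)
        self.queries[client_id] = ContinuousQuery(client_id, mbr)
        for cell in self._cells(mbr):
            self.cell_queries[cell].add(client_id)

    def update(self, positions: List[Tuple[str, float, float]]) -> List[str]:
        """Recompute results for one batch; return clients whose result changed."""
        fresh: Dict[str, Set[str]] = {cid: set() for cid in self.queries}
        s = self.cell_size
        for oid, x, y in positions:
            for cid in self.cell_queries.get((int(x // s), int(y // s)), ()):
                x0, y0, x1, y1 = self.queries[cid].mbr
                if oid != cid and x0 <= x <= x1 and y0 <= y <= y1:
                    fresh[cid].add(oid)
        changed = []
        for cid, query in self.queries.items():
            if fresh[cid] != query.result:
                query.result = fresh[cid]
                changed.append(cid)            # these clients would receive a push
        return changed


proc = CRQProcessor(cell_size=10.0)
proc.register("v1", (0.0, 0.0, 15.0, 15.0))
positions = [("v1", 5.0, 5.0), ("v2", 12.0, 3.0), ("v3", 40.0, 40.0)]
changed = proc.update(positions)
print(changed, proc.queries["v1"].result)
```

Returning only the clients whose result changed keeps the number of pushed messages proportional to actual changes rather than to the number of registered queries.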

Figure 2. Architecture of the Proposed In-memory Big Traffic Data Processing System

Figure 3 shows the overall structure of the CRQ processing technique, designed on Spark Streaming, that searches surrounding vehicles according to client location. Messages transferred from clients are collected for a certain time (from less than 1 sec to several seconds) to create an RDD. Preprocessing then transforms this RDD into a location-data-only RDD, which the index generator transforms into the version n grid index RDD.

Figure 3. Architecture of the Continuous Query Processing Method

Figure 4 shows the CRQ processing method based on the grid index. The continuous query processor in Figure 2 registers each CRQ in the grid index cells that it covers. A query can span multiple cells, so a single query can be registered in the query lists of multiple cells. The algorithm to process the CRQ is described in Figure 5.

Figure 4. Continuous Query Processing Method using Grid Index

    Create_CQ(object) {
        old_cq = object.cq
        old_cq.reference -= 1
        if (old_cq.reference == 0) {
            old_cq.mbr = object.mbr
        } else {
            new_CQ = new CQ(object.MBR)
            new_CQ.Reference_Objs += object
        }
    }

    Update_CQ(CQ, objects) {
        for (i <- 0 until objects.length) {
            for (j <- 0 until CQ.length) {
                CQ[j].Contain_Objs = Contain_Check(CQ[j].MBR, objects[i].location)
                refreshmessage(cq.referenceobjs)
            }
        }
    }

Figure 5. Algorithm for CRQ Creation and Update

The CRQ processing technique above is explained in terms of searching vehicles' current locations. The CRQ of each client should search not only moving objects, such as surrounding vehicles and pedestrians, but also traffic events that occur nearby. CRQ processing for traffic events is similar, so a detailed explanation is omitted here.

The traffic events received at the server can be duplicated, because the same traffic event can be detected by several vehicles moving through the same location. Where traffic is heavy, the number of duplicated traffic events is even greater. In this paper, when the same events are detected, they are grouped, and a representative event is selected from the group and shared with users within the social group. The event grouping procedure is shown in Figure 6.
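The grouping of duplicated events could look roughly like the following Python sketch. It rests on several assumptions that the paper does not specify: two events are treated as the same event when they share the event type and road segment and lie within a distance threshold and a time threshold of the group's representative; the representative is simply the earliest event of the group; coordinates are planar; and all names (`TrafficEvent`, `EventGroup`, `group_events`) and threshold values are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class TrafficEvent:
    event_type: str       # e.g. "rain", "snow", "accident", "delay"
    x: float              # planar coordinates (assumption)
    y: float
    timestamp: float      # seconds
    road_segment: str     # identifier obtained from a map service


@dataclass
class EventGroup:
    representative: TrafficEvent
    members: List[TrafficEvent] = field(default_factory=list)


def group_events(events: Iterable[TrafficEvent],
                 max_distance: float,
                 max_time_gap: float) -> List[EventGroup]:
    """Greedily assign each event to a matching group, or start a new one."""
    groups_by_key: Dict[Tuple[str, str], List[EventGroup]] = {}
    for ev in sorted(events, key=lambda e: e.timestamp):
        # Only groups with the same type and road segment are candidates.
        candidates = groups_by_key.setdefault((ev.event_type, ev.road_segment), [])
        for group in candidates:
            rep = group.representative
            near = math.hypot(ev.x - rep.x, ev.y - rep.y) <= max_distance
            recent = ev.timestamp - rep.timestamp <= max_time_gap
            if near and recent:
                group.members.append(ev)
                break
        else:
            candidates.append(EventGroup(representative=ev, members=[ev]))
    return [g for gs in groups_by_key.values() for g in gs]


events = [
    TrafficEvent("accident", 0.0, 0.0, 0.0, "s1"),
    TrafficEvent("accident", 3.0, 4.0, 5.0, "s1"),    # same accident, seen again
    TrafficEvent("accident", 1.0, 1.0, 10.0, "s2"),   # different road segment
    TrafficEvent("rain", 0.0, 0.0, 1.0, "s1"),        # different event type
    TrafficEvent("accident", 0.0, 0.0, 500.0, "s1"),  # too late to be the same
]
groups = group_events(events, max_distance=10.0, max_time_gap=60.0)
print(len(groups), [len(g.members) for g in groups])
```

Only the representative of each group would then be shared, so the number of pushed events depends on the number of groups rather than on the number of raw reports.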

Figure 6. Process of the Traffic Event Grouping

The server verifies whether the received message is an event, and if so adds it to the traffic event index. Traffic events are indexed through the previously described grid index manager. Next, the traffic event index is searched by location, time, and type to find candidate groups for the same event, and finally it is determined whether the candidate events are located on the same road segment. If they are determined to be the same event, the new event is added to the existing event group; otherwise, a new event group is created.

4. Experiments
The event data set used in the experiment was created by modifying the traffic data set provided by [8]. This data set contains 5,000 moving objects that traveled for 20 secs. The moving objects are created randomly within the determined region. The data contain the moving object ID, type (newly created moving object or active moving object), latitude, longitude, and timestamp, as shown in Figure 7.

(a) Sample of Traffic Data Set (b) Sample of Event Data Set
Figure 7. Sample of Traffic and Event Data

Our in-memory distributed traffic big data system is implemented on Spark Streaming. The cluster used in our experiment consists of 8 nodes, each with 2 Intel CPUs, 8 GB of RAM, and a 500 GB HDD. We use the Scala programming language to develop our system. The event data set contains 100,000 virtual event records (5,000 moving objects × 20 timestamps), created by adding an event column to the data shown in Figure 7(a). The created event data set is shown in Figure 7(b); it contains four event types (rain, snow, accident, and delay). Figure 8 shows the event grouping results. The 100,000 event records are grouped into 735 events. The left side of the figure shows the events prior to grouping, and the right side shows the grouped events on the map.

Figure 8. Result of the Event Grouping

5. Conclusion
This paper proposed a method that rapidly processes vehicle location and traffic event data using Spark Streaming in order to share information about surrounding vehicles, pedestrians, and traffic events in real time with drivers who use the WEVING service. In the proposed method, vehicle location and traffic event streams transferred in real time are indexed using the grid indexing technique according to time, and CRQs are processed using the index. Furthermore, traffic events transferred in real time are grouped by occurrence time, location, content, and road segment so that only a representative traffic event is shared. Representative traffic events were extracted in an event grouping experiment, which verified that 100,000 events were grouped into 735 representative events.

Acknowledgments
This research was supported by a grant (14tlrp-c091025-01) from the Transportation & Logistics Research Program (TLRP) funded by the Ministry of Land, Infrastructure and Transport of the Korean government, and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2059342).

References
[1] H. Li, Y. Lee, B. Kim, D. Choi, I. Bae, S. Song, M. Yeo and R. Oh, WEAVING: social driving assistant system, In Proceedings of ISITC, (2014).
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, Spark: cluster computing with working sets, Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing, (2010), pp. 10-10.
[3] M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica, Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters, Proceedings of the 4th USENIX Conference on Hot Topics in Cloud Computing, (2012), pp. 10-10.
[4] X. Xiong, M. F. Mokbel and W. G. Aref, LUGrid: update-tolerant grid-based indexing for moving objects, In Proceedings of MDM, (2006), pp. 13-24.
[5] H. Li, Y. Lee and S. Song, Grid based Distributed In-memory Indexing for Moving Objects, In Proceedings of ISITC, (2014).
[6] https://hadoop.apache.org/.
[7] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, (2008), pp. 107-113.
[8] http://mntg.cs.umn.edu/tg.

Authors
Dojin Choi received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is a master's student in the Computer Engineering Department, Korea National University of Transportation, Republic of Korea. His research interests are database systems, concurrency control, snapshot isolation, and distributed processing systems.

Bosung Kim received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is a master's student in the Information Convergence Department, Korea National University of Transportation, Republic of Korea. His research interests are big data, DNA sequence analysis, and local alignment algorithms.

Insoo Bae received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is a master's student in the Information Convergence Department, Korea National University of Transportation, Republic of Korea. His research interests are file systems, data replication, and operating systems.

Seokil Song received the BS, MS, and PhD degrees in Computer and Communication Engineering from Chungbuk National University, South Korea, in 1998, 2000, and 2003, respectively. He is an Associate Professor in the Computer Engineering Department, Korea National University of Transportation, Republic of Korea. His research interests are database systems, index structures, concurrency control, storage systems, and distributed stream data processing.