pp. 51-58
http://dx.doi.org/10.14257/ijseia.2016.10.1.06
ISSN: 1738-9984 IJSEIA
Copyright © 2016 SERSC

In-memory Distributed Processing Method for Traffic Big Data to Analyze and Share Traffic Events in Real Time among Social Groups

Dojin Choi 1, Bosung Kim 2, Insu Bae 2 and Seokil Song 1*
1 Department of Computer Engineering, Korea National University of Transportation, Chungju, Chungbuk, Republic of Korea
2 Department of Information Technology Convergence, Korea National University of Transportation, Chungju, Chungbuk, Republic of Korea
mycdj91@ut.ac.kr, jikol2000@ut.ac.kr, gkdrmf23@ut.ac.kr, sisong@ut.ac.kr

Abstract

In this paper, we propose an in-memory distributed processing method that can rapidly process vehicle location and traffic event data using Spark Streaming. The proposed system makes it possible to share information about surrounding vehicles, pedestrians, and traffic events in real time with drivers who use the WEVING service. In the proposed method, vehicle location and traffic event streams are indexed by time using the grid indexing technique, and continuous range queries are processed using the index. In addition, traffic events transferred in real time are grouped by occurrence time, location, content, and road segment in order to avoid duplicated traffic events. Through experiments, we show that the proposed method is able to deduplicate similar traffic events efficiently.

Keywords: traffic data, big data, social group, spark streaming

1. Introduction

Road congestion has a strongly negative impact on the community and the environment. It can be reduced in two ways. The first approach is to increase road capacity, which is very difficult, especially in urban environments. The second approach is to reduce demand in congested areas by providing information about road status so that drivers change their mode of transport, their route, or their travel time.

Collecting traffic information with infrastructure sensors such as CCTV cameras and in-road loop sensors has limits: the sensors cannot cover the entire road network unless they are deployed along all of it, at enormous cost. Recently, with the rapid increase in vehicles equipped with communication-capable smart devices such as black boxes or smartphones, it has become possible to collect and share traffic information with those devices instead of infrastructure sensors. Various applications and services for traffic safety and convenience have been developed on this basis.

WEVING (WE are driving together) is a social driver assistance system [1] with a smartphone application that automatically detects traffic events (e.g., delays, congestion, accidents, and road conditions) using the cameras, GPS (Global Positioning System) receivers, gyroscope sensors, and acceleration sensors mounted in smartphones. The detected traffic events, event images, and current locations are shared with users in social groups via WEVING. If many vehicles use the WEVING service and their current locations are detected frequently, sharing vehicle locations, traffic events, and images between social groups creates a large workload for the WEVING server. In particular, sharing the locations of surrounding vehicles and traffic events in real time is very important, because processing delays caused by server overload can make the shared locations and traffic events meaningless. In addition, traffic events detected in areas of heavy road traffic are likely to be duplicated. Therefore, a single server that processes data by storing them on hard disks cannot meet the requirements of applications such as WEVING. Meeting these requirements calls for an in-memory processing method, together with parallel management and analysis of the data across multiple servers, so that data can be stored and searched quickly.

In this study, a method for storing and managing traffic event and vehicle location data is proposed using Spark [2] and Spark Streaming [3], two distributed in-memory processing platforms. Spark Streaming is an in-memory stream processing system that receives messages, logs, or text files in real time and processes them in batches collected over a predetermined interval (less than 1 sec). Spark Streaming is based on the RDD (Resilient Distributed Datasets) [2] model of Spark. RDDs provide recovery for data held in memory, so data can be processed without being stored on hard disks. Spark Streaming processes the data received during each predetermined interval as RDDs, thereby providing fault-tolerant stream processing for applications built on it.

In this paper, we propose a method that uses Spark Streaming to rapidly process vehicle location and traffic event data as stream data. The proposed method indexes the vehicle location and traffic event streams transferred in real time by location and time, using a grid indexing technique. The grid indexing technique used in this paper is a version of an existing grid indexing technique [4], modified to suit Spark Streaming.
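To make indexing by location and time concrete, the following is a minimal sketch of how one position report could be mapped to a grid cell and a time slot. It assumes a uniform grid over a bounding box and fixed-length time slots; `GridSpec`, `cell_of`, `build_index`, and all parameter values are hypothetical names chosen for illustration, and the code is plain Python rather than the Spark Streaming implementation described in this paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple

GridKey = Tuple[int, int, int]  # (row, col, time_slot)


@dataclass(frozen=True)
class GridSpec:
    """Hypothetical uniform grid; none of these parameters come from the paper."""
    min_lat: float   # southern edge of the indexed region
    min_lon: float   # western edge of the indexed region
    cell_deg: float  # side length of one cell, in degrees
    slot_secs: int   # length of one time slot, in seconds

    def cell_of(self, lat: float, lon: float, ts: int) -> GridKey:
        """Map a position report to its (row, col, time_slot) key."""
        row = int((lat - self.min_lat) // self.cell_deg)
        col = int((lon - self.min_lon) // self.cell_deg)
        return (row, col, ts // self.slot_secs)


def build_index(
    spec: GridSpec, reports: Iterable[Tuple[str, float, float, int]]
) -> Dict[GridKey, List[str]]:
    """Bucket (object_id, lat, lon, ts) reports by grid key."""
    index: Dict[GridKey, List[str]] = defaultdict(list)
    for obj_id, lat, lon, ts in reports:
        index[spec.cell_of(lat, lon, ts)].append(obj_id)
    return dict(index)
```

With such a key, a search only needs to visit the buckets whose cells overlap the search region and whose time slot is of interest, instead of scanning every report.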
Vehicles that use the WEVING service set a region of interest based on the current vehicle location and send it to the server to request information about other vehicles, pedestrians, and traffic events in that region. The WEVING server treats the region of interest sent by a moving vehicle as a CRQ (Continuous Range Query). To process CRQs effectively, this paper proposes a grid indexing-based CRQ technique. In addition, this paper proposes a method that avoids unnecessary traffic event sharing by identifying identical traffic events from the occurrence time, location, and content of the traffic events sent in real time. The traffic event classification method is also based on Spark Streaming.

3. Related Work

The proposed method uses the distributed grid index of [5], which is based on Spark Streaming. To index and query vehicles, [5] adds transformation and output operators such as bulkload, bulkinsert, splitindex, and search to Spark Streaming. The input stream consists of the positions of moving objects, which are transmitted periodically. Spark Streaming transforms the input stream into D-Streams. As shown in Figure 1, the input stream is continuously transformed into DS_t, DS_{t+1}, DS_{t+2}, and DS_{t+3}. The bulkload and bulkinsert operators are performed on each D-Stream; bulkload builds a grid index GI_t from the position data in DS_t. The indexing method uses grid techniques to reduce the time needed to build and update an index. It does not use lock-based concurrency control: because D-Streams in Spark Streaming are immutable, update and search operations are never performed concurrently on the same index.

As shown in Figure 1, an index is updated by the bulkload and bulkinsert operators, and multiple versions of the index, for times t, t+1, t+2, and t+3, are kept in main memory and can be accessed by users. Therefore, users can access GI_{t+2} while GI_{t+3} is being built.
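The idea of keeping several read-only index versions, so that readers use a finished version while the next one is built, can be sketched as follows. This is an illustrative single-process Python sketch under stated assumptions, not the operators of [5]: the class `VersionedGridIndex`, the retention window `keep`, and the planar `(x, y)` coordinates are hypothetical, and the method borrows only the name `bulkload`.

```python
from types import MappingProxyType
from typing import Dict, Iterable, List, Mapping, Tuple

Cell = Tuple[int, int]


class VersionedGridIndex:
    """Keeps read-only index versions; a version is published only when complete."""

    def __init__(self, cell_size: float, keep: int = 4):
        self.cell_size = cell_size
        self.keep = keep  # number of versions retained (an assumption)
        self._versions: Dict[int, Mapping[Cell, Tuple[str, ...]]] = {}
        self._latest = -1  # -1 means nothing has been published yet

    def bulkload(self, t: int, positions: Iterable[Tuple[str, float, float]]) -> None:
        """Build version t from one batch of (object_id, x, y), then publish it."""
        cells: Dict[Cell, List[str]] = {}
        for obj_id, x, y in positions:
            key = (int(x // self.cell_size), int(y // self.cell_size))
            cells.setdefault(key, []).append(obj_id)
        frozen = MappingProxyType({k: tuple(v) for k, v in cells.items()})
        self._versions[t] = frozen  # visible to readers only after the build
        self._latest = t
        for old in [v for v in self._versions if v <= t - self.keep]:
            del self._versions[old]  # drop versions outside the retention window

    def latest(self) -> Tuple[int, Mapping[Cell, Tuple[str, ...]]]:
        """Return (version, read-only index) of the newest published version."""
        return self._latest, self._versions[self._latest]
```

A reader that obtained a version from `latest()` can keep using it while a newer version is loaded, because a published version is never modified in place; this is why no lock is needed between searches and updates in this sketch.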
Figure 1. Build Grid Index with Input Stream

2. Proposed In-Memory Traffic Big Data Processing Technique

The server processes the received data differently depending on whether the data are a client's current location or a traffic event. Through a CRQ, the server provides the search result for the surrounding vehicles and traffic events displayed on the client screen. The CRQ sent by the client is revised as the vehicle travels, and the server calculates the result for the revised CRQ region in real time and sends it to the client.

The server obtains the road segment for the received vehicle location from an external map server, and stores the vehicle location and road segment identifier together. At the same time, the vehicle's current location is indexed through the grid index manager, and the continuous query result list is updated through the continuous query processing manager. Similarly, the traffic event manager in the server indexes each received traffic event through the grid index manager and searches the indexed traffic events in order to group similar events. For each group, a representative traffic event is selected and reflected in the CRQ result of each client through the continuous query processing manager. The push manager sends the updated continuous query result to each client so that it can be displayed on the screen.

This procedure is designed and implemented on Spark Streaming. The grid index and the continuous query list are maintained in memory as Spark RDDs. Because RDDs cannot be modified, this study also proposes a management method that keeps multiple versions, which provides concurrency control between read and write operations on the grid index and the continuous query result list.
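As an illustration of how a continuous range query can be evaluated against a grid, the sketch below registers each query rectangle on the cells it overlaps and then checks every incoming location only against the queries registered on its own cell. It is a simplified, single-machine Python sketch: the function names, the `(min_x, min_y, max_x, max_y)` rectangle layout, and the planar coordinates are assumptions made for the example, not the paper's implementation.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Set, Tuple

MBR = Tuple[float, float, float, float]  # (min_x, min_y, max_x, max_y)
Cell = Tuple[int, int]


def cells_overlapping(mbr: MBR, cell: float) -> Iterator[Cell]:
    """Yield every grid cell that the query rectangle touches."""
    min_x, min_y, max_x, max_y = mbr
    for cx in range(int(min_x // cell), int(max_x // cell) + 1):
        for cy in range(int(min_y // cell), int(max_y // cell) + 1):
            yield (cx, cy)


def register_queries(queries: Dict[str, MBR], cell: float) -> Dict[Cell, List[str]]:
    """Build the per-cell query lists: one query may appear in many cells."""
    per_cell: Dict[Cell, List[str]] = defaultdict(list)
    for qid, mbr in queries.items():
        for key in cells_overlapping(mbr, cell):
            per_cell[key].append(qid)
    return per_cell


def contains(mbr: MBR, x: float, y: float) -> bool:
    """Exact containment test, applied after the coarse cell lookup."""
    return mbr[0] <= x <= mbr[2] and mbr[1] <= y <= mbr[3]


def evaluate_batch(
    queries: Dict[str, MBR],
    per_cell: Dict[Cell, List[str]],
    positions: Iterable[Tuple[str, float, float]],
    cell: float,
) -> Dict[str, Set[str]]:
    """Compute, for one batch of (object_id, x, y), the objects inside each query."""
    results: Dict[str, Set[str]] = {qid: set() for qid in queries}
    for obj_id, x, y in positions:
        key = (int(x // cell), int(y // cell))
        for qid in per_cell.get(key, ()):
            if contains(queries[qid], x, y):
                results[qid].add(obj_id)
    return results
```

When a vehicle moves and its region of interest changes, the corresponding entry in `queries` would be replaced and re-registered before the next batch is evaluated.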
Figure 2. Architecture of the Proposed In-memory Big Traffic Data Processing System

Figure 3 shows the overall structure of the CRQ processing technique, designed on Spark Streaming, that searches for surrounding vehicles according to client location. The messages transferred from clients are collected for a certain time (from less than 1 sec to several seconds) to create an RDD. The created RDD is then reduced to a location-data-only RDD through preprocessing, and the index generator transforms it into the version n grid index RDD.

Figure 3. Architecture of the Continuous Query Processing Method

Figure 4 shows the CRQ processing method based on the grid index. The continuous query processor in Figure 2 registers each CRQ in the cells of the created grid index that the query covers. A query can span multiple cells, so a single query can be registered in the query lists of multiple cells. The algorithm to process the CRQ is described in Figure 5.

Figure 4. Continuous Query Processing Method using Grid Index

    Create_CQ(object) {
        old_cq = object.cq
        old_cq.reference -= 1          // the object no longer refers to its old query
        if (old_cq.reference == 0) {
            old_cq.mbr = object.mbr    // no other object refers to it: reuse it
        } else {
            new_CQ = new CQ(object.MBR)
            new_CQ.Reference_Objs += object
        }
    }

    Update_CQ(CQ, objects) {
        for (i <- 0 until objects.length) {
            for (j <- 0 until CQ.length) {
                CQ[j].Contain_Objs = Contain_Check(CQ[j].MBR, objects[i].location)
                refreshmessage(cq.referenceobjs)
            }
        }
    }

Figure 5. Algorithm for CRQ Creation and Update

The CRQ processing technique above is explained in terms of searching for vehicles' current locations. The CRQ of each client should find not only moving objects, such as the surrounding vehicles and pedestrians, but also traffic events that occur nearby. CRQs that search traffic events are processed in a similar way, so the detailed explanation is omitted here.

The traffic events received at the server can be duplicated, because the same traffic event can be detected by several vehicles moving through the same location. The problem is worse in heavy traffic, where the number of duplicated traffic events is large. In this paper, when the same event is detected more than once, the detections are grouped, and a representative event is selected from the group and shared with users within the social group. The event grouping procedure is shown in Figure 6.
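A minimal sketch of such grouping is given below, assuming that two detections describe the same event when they have the same type and road segment and lie within a distance threshold and a time window of the group's representative. The thresholds, the choice of the earliest detection as representative, the planar coordinates, and all names (`Event`, `EventGroup`, `group_events`) are assumptions made for illustration; for brevity the sketch scans existing groups linearly instead of searching a grid index for candidates.

```python
from dataclasses import dataclass, field
from math import hypot
from typing import Iterable, List


@dataclass
class Event:
    event_id: str
    kind: str      # event type, e.g. "rain", "snow", "accident", "delay"
    x: float       # planar coordinates in metres (assumed already projected)
    y: float
    ts: int        # occurrence time in seconds
    segment: str   # road segment identifier


@dataclass
class EventGroup:
    representative: Event
    members: List[Event] = field(default_factory=list)


def same_event(a: Event, b: Event, max_dist: float, max_gap: int) -> bool:
    """Compare type, road segment, time, and location of two detections."""
    return (
        a.kind == b.kind
        and a.segment == b.segment
        and abs(a.ts - b.ts) <= max_gap
        and hypot(a.x - b.x, a.y - b.y) <= max_dist
    )


def group_events(
    events: Iterable[Event], max_dist: float = 50.0, max_gap: int = 60
) -> List[EventGroup]:
    """Assign each event to the first matching group, or open a new group."""
    groups: List[EventGroup] = []
    for ev in sorted(events, key=lambda e: e.ts):
        for group in groups:
            if same_event(group.representative, ev, max_dist, max_gap):
                group.members.append(ev)
                break
        else:
            # No existing group matched: this event becomes a new representative.
            groups.append(EventGroup(representative=ev, members=[ev]))
    return groups
```

Only the `representative` of each group would then be pushed to clients, so a driver sees one accident marker rather than one per reporting vehicle.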
Figure 6. Process of the Traffic Event Grouping

The server verifies whether the received message is of the event type and, if so, adds it to the traffic event index. The traffic event is indexed through the grid index manager described previously. Next, the traffic event index is searched by location, time, and type to find candidate groups for the same event. Finally, the server determines whether the events found are located on the same road segment. If they are determined to be the same event, the new event is added to the existing event group; otherwise, a new event group is created.

3. Experiments

The event data set used in the experiment was created by modifying the traffic data set provided by [8]. This data set contains 5,000 moving objects that traveled for 20 secs. The moving objects are created randomly within a predetermined region. The created data contain the moving object ID, type (newly created moving object or active moving object), latitude, longitude, and timestamp, as shown in Figure 7.

Figure 7. Sample of Traffic and Event Data: (a) Sample of Traffic Data Set, (b) Sample of Event Data Set

Our in-memory distributed traffic big data system is implemented on Spark Streaming. The cluster used in our experiment consists of 8 nodes, each with 2 Intel CPUs, 8 GB of RAM, and a 500 GB HDD. We used the Scala programming language to develop our system. The event data set contains 100,000 virtual event records (5,000 moving objects × 20 timestamps), created by adding an event column to the data shown in Figure 7(a). The created event data set is shown in Figure 7(b), and a total of four event types (rain, snow, accident, and delay) are created. Figure 8 shows the event grouping results. The event data, which consist of 100,000 records, are grouped into 735 events. The left side of the figure shows the events prior to grouping, and the right side shows the grouped events on the map.

Figure 8. Result of the Event Grouping

4. Conclusion

This paper proposed a method that can rapidly process vehicle location and traffic event data using Spark Streaming in order to share information about surrounding vehicles, pedestrians, and traffic events in real time with drivers who use the WEVING service. In the proposed method, vehicle location and traffic event streams transferred in real time are indexed by time using the grid indexing technique, and CRQs are processed using the index. Furthermore, traffic events transferred in real time are grouped by occurrence time, location, content, and road segment so that only a representative traffic event is shared. Representative traffic events were extracted in an event grouping experiment, which showed that 100,000 events were grouped into 735 representative events.

Acknowledgments

This research was supported by a grant (14tlrp-c091025-01) from the Transportation & Logistics Research Program (TLRP) funded by the Ministry of Land, Infrastructure and Transport of the Korean government, and by the Basic Science Research Program through the National Research Foundation of Korea (NRF) funded by the Ministry of Education (NRF-2014R1A1A2059342).
References

[1] H. Li, Y. Lee, B. Kim, D. Choi, I. Bae, S. Song, M. Yeo and R. Oh, WEAVING : social driving assistant system, In Proceedings of ISITC, (2014).
[2] M. Zaharia, M. Chowdhury, M. J. Franklin, S. Shenker and I. Stoica, Spark: cluster computing with working sets, Proceedings of the 2nd USENIX conference on Hot Topics in Cloud Computing, (2010), pp. 10-10.
[3] M. Zaharia, T. Das, H. Li, S. Shenker and I. Stoica, Discretized streams: an efficient and fault-tolerant model for stream processing on large clusters, Proceedings of the 4th USENIX conference on Hot Topics in Cloud Computing, (2012), pp. 10-10.
[4] X. Xiong, M. F. Mokbel and W. G. Aref, Lugrid: update-tolerant grid-based indexing for moving objects, In Proceedings of MDM, (2006), pp. 13-24.
[5] H. Li, Y. Lee and S. Song, Grid based Distributed In-memory Indexing for Moving Objects, In Proceedings of ISITC, (2014).
[6] https://hadoop.apache.org/.
[7] J. Dean and S. Ghemawat, MapReduce: simplified data processing on large clusters, Communications of the ACM, (2008), pp. 107-113.
[8] http://mntg.cs.umn.edu/tg.

Authors

Dojin Choi received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is in the master's course of the Computer Engineering Department, Korea National University of Transportation, Republic of Korea. His research interests are database systems, concurrency control, snapshot isolation and distributed processing systems.

Bosung Kim received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is in the master's course of the Information Convergence Department, Korea National University of Transportation, Republic of Korea. His research interests are big data, DNA sequence analysis and local alignment algorithms.

Insoo Bae received the BS degree in Computer Engineering from Korea National University of Transportation, South Korea, in 2014. He is in the master's course of the Information Convergence Department, Korea National University of Transportation, Republic of Korea. His research interests are file systems, data replication and operating systems.

Seokil Song received the BS, MS and PhD degrees in Computer and Communication from Chungbuk National University, South Korea, in 1998, 2000 and 2003, respectively. He is an Associate Professor in the Computer Engineering Department, Korea National University of Transportation, Republic of Korea. His research interests are database systems, index structures, concurrency control, storage systems and distributed stream data processing.