Final Report. FASTMOD: A Framework for ReAltime Spatio-Temporal Monitoring of MObility Data. Gautam Thakur, Yibin Wang, Kun Li


Abstract:
In this study, we apply large-scale data science techniques to the monitoring, analysis, and modeling of vehicular traffic and user mobility in real time. Today, vehicular congestion is a major challenge to the efficient handling of traffic and transit systems. According to the latest (2010) urban mobility report, vehicular congestion caused 4.8 billion hours of extra travel (~30% extra) and led drivers to purchase an extra 3.9 billion gallons of fuel, at a cost of $115 billion. User mobility, on the other hand, impacts network performance. Because user mobility is sometimes driven by behavioral patterns, it creates a disconnect between wireless LAN (WL) network activity and resource utilization. To provide real-time traffic analysis as well as behavior-driven mobility protocols and models, we use a trace-driven approach. However, the enormity of these traces requires specialized tools and techniques. To address this problem, we propose using the MapReduce framework for efficient processing and analysis. We provide a framework for real-time analysis using Hadoop and give some first-order statistical results and correlations. We believe that our work and the dataset provide a much-needed contribution to the research community for realistic, data-driven design and evaluation of networks.

Introduction:
Real-time monitoring is an essential component of today's dynamic, time-critical systems. It provides instant updates on what is happening in the system in response to performance issues or reported problems. Real-time monitoring includes the analysis of raw data and the generation of statistics that describe the overall state of the system. However, the enormous size and widespread deployment of systems such as campus wireless LANs and vehicular traffic monitoring pose serious challenges to the efficient processing of large data streams.
Any lag in redressing a problem not only affects current performance but also incrementally complicates the resolution process. For example, if a traffic signal malfunctions during rush hour, congestion builds up on the roads with long queues of cars and other vehicles. Other challenges include real-time data sanitization and interpolation. To our surprise, current systems seriously overload themselves in times of need and prove inadequate in terms of scaling and architectural robustness. Existing tools are becoming inadequate for processing such large data sets in real time. We propose solutions to the following questions:
1) How to design a system that takes incoming data streams from sensors deployed at large scale? How to organize the data so that it supports both real-time and archival data processing?

2) How to design a system that can run sophisticated analysis algorithms on large streaming data in real time? (Potential applications include traffic prediction, traffic causality mining, and route suggestion.)

In this project, we provide a MapReduce framework for data-intensive real-time monitoring applications, based on Hadoop and the Hadoop Online Prototype (HOP). It includes: real-time monitoring, data acquisition, and processing from the sensors of wireless networks, planet-scale online traffic web cameras, and potentially many other types of sensors; outlier detection, removal, and integration; data analysis, knowledge discovery, and modeling; and graphical visualization.

The responsibilities of each group member are split as follows:
Gautam: Vehicular data analysis, Hive setup
Yibin: Mobility data analysis, Hadoop setup
Kun: Hadoop setup, Hive setup, HSQL

Background:
MapReduce[1] is a parallel processing framework proposed by Google. Since then, MapReduce has been heavily used at IT companies such as Google, Yahoo, and Facebook. According to a presentation by Dean, Google was running about 3,000 computing jobs per day through MapReduce, representing thousands of machine-days. Among other things, these batch routines analyze the latest Web pages and update Google's indexes. Among the open-source implementations of the MapReduce framework, Hadoop is the most popular. We use Hadoop as our underlying parallel computing framework to run our data cleaning and data analysis jobs. T. Condie et al. proposed a pipelined version of MapReduce[8], implemented as the Hadoop Online Prototype (HOP), which supports real-time applications such as event monitoring and stream processing. Two prototype traffic estimation and prediction systems, developed by MIT and UTX/UMD, are named DynaMIT-R and DYNASMART-X, respectively. However, both are simulation-based systems.
Lin[7] proposed DynaCHINA, a specially built real-time traffic prediction system for China. Singapore's Land Transport Authority, together with IBM, developed a traffic estimation and prediction tool that uses historical traffic data and real-time feeds with flow conditions from several sources to predict congestion levels up to an hour in advance. The pilot results show overall prediction accuracy above 85 percent. The Berkeley Millennium project [9] also targets building a real-time transportation monitoring system. In a general sense, our approach to tracking real-time traffic and mobility data can be applied to ease traffic congestion, city planning, and resource allocation problems. Specifically, the ever-increasing problem of vehicular traffic congestion has become severe around the world. According to the latest (2010) urban mobility report[1], congestion caused urban Americans to travel 4.8 billion hours more and to purchase an extra 3.9 billion gallons of fuel, at a cost of $115 billion. The yearly peak-period delay caused by traffic congestion for the average commuter was 34 hours, and the cost to the average commuter has increased by 230% in two decades[1]. Congestion affects people not only during the peak period but also at other hours: approximately half of the total delay occurs in the midday and overnight.

System Description:

Framework:
In the following figure, we outline the proposed framework. It consists of three components: (i) Real-Time Monitoring and Processing, (ii) Knowledge Discovery, and (iii) Modeling. The Hadoop/HOP implementation involves a multi-stage distributed architecture for each of these components that includes several masters and reducers.

Components:
Monitoring sensors: Monitoring sensors are deployed over a certain area (a city or a campus). They continuously upload sensed data.
Crawler (on mappers): The crawlers are Hadoop mappers that crawl the readings from all monitoring sensors. They download the readings in parallel.
Processing engine: After the crawlers have downloaded the readings into HDFS, the processing engine preprocesses the raw readings. This includes outlier detection, density estimation, and mobility processing. We use the MapReduce framework for all processing tasks.
Hive: A SQL-like, Hadoop-based query system. We run Hive queries over the output of the processing engine. This helps us obtain first-order statistics from the preprocessed data.
HSQL: After gathering the statistics from Hive queries, we propose a hybrid approach for deeper analysis of the data, such as correlation and prediction. These tasks are done using external scripting languages such as R, Matlab, and Python (under the Hadoop framework), since they already have well-established libraries for statistical analysis. We call this Hyper-SQL.

Dynamic query:

Image algorithm:

We aim to estimate the traffic density (d) on roads, considering the vehicles or pedestrians crossing the road. We have a sequence of images captured by webcams. For our problem, we must be able to separate the information we need (e.g., vehicles and pedestrians) from the background image, which normally consists of the road and the surrounding buildings. The main factor that distinguishes vehicles from the background is that vehicles do not remain stationary for long periods of time, whereas the background does. The solution is therefore to apply a form of high-pass filtering over the sequence of images captured by a webcam over time. The high-pass filter removes the stationary part of the images (road, buildings, etc.) and keeps the moving components (mainly vehicles).

To implement such a high-pass filter, we subtract the result of a low-pass filter over a sequence of images from each still image. This is practically equivalent to applying a high-pass filter over the sequence. To obtain the low-pass filtering effect, we run a moving average filter over the time sequence of images obtained from one webcam. The duration of the moving average filter can be adjusted in an ad hoc way. The filter is implemented by averaging the intensity maps of several images within a certain duration: at its output, the intensity of each pixel is the average intensity of the corresponding pixels over the interval. The output of the moving average (low-pass) filter is normally the required background image, i.e., the still part of the image. Therefore, subtracting the output of the low-pass filter from each image gives us the moving components (e.g., vehicles). In the high-pass component of the image, the vehicles stand out from the background.
One could then use regular object detection techniques to identify and count the vehicles in the high-pass filtered image. However, this is computationally expensive and unnecessary. As an alternative, we simply count the number of active pixels (pixels with a value higher than a certain threshold). This is much faster than detecting and counting objects in an image. It is also more effective, for two reasons. First, we are interested in the traffic density (d), i.e., the percentage of the street (road) that is covered by vehicles, as an indicator of how crowded the street is, rather than the number of vehicles. The number of vehicles is not a good indicator of crowdedness, since a long vehicle may introduce more traffic than a small one. Second, our method overcomes the issues that object detection faces under severe congestion: counting active pixels indicates what percentage of the road is covered, no matter how many vehicles are on it.

In many instances, images are duplicated, zero-sized, or corrupted with extraneous bytes (noise). We use semi-supervised learning and hierarchical clustering to address the challenges of outlier detection and removal. The adjoining figure shows the algorithm output.
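The background subtraction and active-pixel counting described above can be sketched as follows. This is only an illustrative sketch, not the report's implementation: the function names, the use of the absolute difference, the 4x4 toy frames, the threshold value, and the assumption that the whole frame is road are all hypothetical choices made for the example.

```python
from typing import List

Image = List[List[float]]  # grayscale intensities, indexed [row][col]


def moving_average_background(frames: List[Image]) -> Image:
    """Low-pass filter: per-pixel mean over all frames in the window."""
    n = len(frames)
    rows, cols = len(frames[0]), len(frames[0][0])
    return [[sum(f[r][c] for f in frames) / n for c in range(cols)]
            for r in range(rows)]


def traffic_density(frame: Image, background: Image, threshold: float) -> float:
    """Fraction of pixels whose difference from the background exceeds the threshold.

    Assumption: the absolute difference is used, and every pixel counts as road.
    """
    active = sum(1
                 for frow, brow in zip(frame, background)
                 for p, b in zip(frow, brow)
                 if abs(p - b) > threshold)
    return active / (len(frame) * len(frame[0]))


def make_frame(vehicle: bool) -> Image:
    """Toy 4x4 frame: uniform road, optionally with a bright 2x2 'vehicle'."""
    frame = [[10.0] * 4 for _ in range(4)]
    if vehicle:
        for r in (1, 2):
            for c in (1, 2):
                frame[r][c] = 210.0
    return frame


# Four empty frames followed by one frame containing a vehicle.
frames = [make_frame(False)] * 4 + [make_frame(True)]
background = moving_average_background(frames)
busy = traffic_density(frames[-1], background, threshold=50.0)
empty = traffic_density(frames[0], background, threshold=50.0)
print(busy, empty)
```

In this toy example the vehicle leaks into the averaged background, so a threshold that is too low would mark the empty frame as active as well. This is one reason the window length and threshold need tuning per camera.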

The data product:
1. A model of a distributed system that is capable of receiving and processing raw data for real-time analysis, and of collaboratively organizing the received data into an archive for historical analysis, using Hadoop/HOP.
2. A multi-user query engine that sends vehicular traffic updates.
a. This can be done by creating a Hadoop-like architecture that distributes queries based on user requests. The queries can cover predictions, current traffic updates, historical data, and inference of future status.
3. A dynamic model for finding optimized routes based on the start and end locations (or any other prediction/optimization algorithm or analysis method that utilizes the incoming data stream).
4. Outlier detection and removal techniques.
a. A set of interconnected, sequenced processes for data processing.
b. A caching model to reduce the number of MapReduce jobs.
5. A visualization system to track vehicular and mobility data in real time.

Method:
In the adjoining figure, we illustrate a step-by-step process for achieving near-real-time traffic monitoring and modeling.
1. Start scripts.
2. Crawlers capture pictures or user mobility instances every few minutes. These crawlers store images in external data storage.
3. The external data storage is the unit that maintains the image archive.
4. To process the images and mobility records in real time, we copy the downloaded raw data into HDFS. Because copying happens record by record and each record is small, we circumvent the issue of long copy times. It takes a few seconds to copy an individual record after the crawler downloads it.
5. HDFS is a single repository, distributed over many disks, that stores the processed data.
6. Next, we run outlier detection and estimation algorithms to extract traffic and mobility information.
7. This information is then stored in Hive, which later provides an interface for SQL-type queries.
8. Along with other scripts, we use HSQL queries to extract information from the Hive DB. These queries are similar to SQL and provide the added benefit of interacting directly with the database.
9. Since HSQL queries alone are not sufficient, we add certain Java-based procedures to obtain information related to hotspots, current traffic updates, and predictions.

Framework:
In this section, we describe our proposed framework, shown in the adjoining figure, which comprises

three parts: (i) Measurements and pre-processing, (ii) Knowledge discovery, and (iii) Modeling and analysis. The measurements and pre-processing component is responsible for capturing imagery snapshots, sanitizing data, and generating a quantifiable value of vehicular traffic, hereafter called the traffic density (d). We store the processed data in Hive for further querying. The knowledge discovery component focuses on applying data mining tools to extract traffic patterns and spatio-temporal information. This activity can help to develop rich mobility scenarios. Next, the modeling and analysis component focuses on characterizing the vehicular traffic densities. It can aid in designing and developing new data-driven vehicular mobility models and simulators. Finally, applications such as visualization can be developed from the analysis of the previous components.

Real Time Monitoring and Processing:
We view the connected global network of webcams as a highly versatile platform, with untapped potential to monitor global trends or changes in the flow of a city, and to provide large-scale data for realistically modeling vehicular, or even human, mobility. We also download and process mobility records from access point controllers deployed on campus. On average, we download 15 gigabytes of imagery data per day from over 2700 traffic web cameras, with an overall dataset of 7.5 terabytes containing around 125 million images. To speed up image processing, we use background subtraction, a technique with low turnaround time. The mobility records from campus are text-based and do not require any special processing. The processed information is then saved in Hive, and the processed images are removed from HDFS.

Knowledge Discovery:
We performed an initial traffic correlation analysis to measure the degree to which the traffic at a camera is linearly associated with itself over 42 days. Traffic congestion shows high correlation (80%) at a 1-2 hour lag.
The correlation decreases significantly, to ~25-30%, at a 4-hour lag.

Modeling and Analysis:
Here, we focus on modeling empirical traffic densities against known theoretical distributions. The objective of this study is to help understand the underlying statistical patterns. We find that traffic at individual cameras can vary a lot, but in general the log-logistic, gamma, and Weibull distributions can capture some of the key features. For the mobility data, we find a normal distribution of user traffic at campus-wide scale, with peaks occurring during the noon hours.

Applications:
The experience gained from the analysis and modeling of traffic densities potentially aids the future design and evaluation of vehicular networks. To aid visualization, we are developing applications that demonstrate traffic conditions on desktop and handheld devices. In the adjoining figure, we show scenarios for vehicular traffic visualization.
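The lagged self-correlation used in the knowledge discovery step can be illustrated with a minimal sketch. The function names and the synthetic 24-hour sinusoidal series are assumptions made for this example. They are not the report's data, and the sketch does not reproduce the 80% and 25-30% figures quoted above.

```python
import math
from typing import Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)


def lagged_autocorrelation(series: Sequence[float], lag: int) -> float:
    """Correlation between the series and itself shifted by `lag` samples."""
    if not 1 <= lag < len(series) - 1:
        raise ValueError("lag must leave at least two overlapping samples")
    return pearson(series[:-lag], series[lag:])


# Hypothetical hourly densities for one camera: a daily cycle over 7 days.
hourly_density = [math.sin(2 * math.pi * h / 24) for h in range(24 * 7)]
r_day = lagged_autocorrelation(hourly_density, 24)       # same hour, next day
r_half_day = lagged_autocorrelation(hourly_density, 12)  # opposite phase
print(round(r_day, 3), round(r_half_day, 3))
```

With one density value per hour, a lag of 1 or 2 corresponds to the 1-2 hour lag discussed above, and a lag of 4 to the 4-hour lag.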

Experiments:

Dataset:
We use two spatio-temporal data sets. The first contains wireless LAN traces of mobile users, and the second contains vehicular images captured from online traffic web cameras. The collective size of the data is well over seven TB.

User Mobility Data:
We collect different types of traces via network switches, including netflows, DHCP logs, and wireless access point (AP) session logs (MAC traps). The wireless session log is collected by each wireless AP or switch port (i.e., an aggregate of APs in a building). The trace includes the start and end events for device associations (when a device visited or left that specific AP), the device's MAC address, the date and time of those events, and the AP (or switch) IP and port numbers. From these we can derive the association history (i.e., the location and time of user association) for all MAC addresses. The DHCP log contains the dynamic IP assignments to MAC addresses: the listed IP is given to the MAC address at the indicated date and time. User mobility is then extracted from the association log (with APs), provided that every AP location is known in advance.

Vehicular Data:
We use online traffic web cameras as the primary source of data collection. These web cameras are installed on highways and at critical traffic signals in the cities under study. At regular time intervals, they capture still pictures of ongoing road traffic and send them as visual feeds to Department of Transportation (DoT) media servers. For this work, we collaborate with 10 cities (DoTs) across the globe. The details of the cities and the data set are given in Table-1. We view the connected global network of these webcams as a highly versatile platform that enables us to visualize the traffic flow of a city and realistically model vehicular, or even human, mobility. We download these images and store them in our media image storage server.
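Deriving the association history from the start and end events in the wireless session log can be sketched as below. The record layout (`Event`, `Session`), the integer timestamps, and the pairing rule are assumptions for illustration. The report does not specify the log format, so this is not the actual parsing code.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class Event(NamedTuple):
    timestamp: int  # hypothetical: seconds since the start of the trace
    mac: str        # device MAC address
    ap: str         # access point (or switch port) identifier
    kind: str       # "start" or "end" of an association


class Session(NamedTuple):
    mac: str
    ap: str
    start: int
    end: int


def association_history(events: Iterable[Event]) -> List[Session]:
    """Pair each start event with the next end event for the same (MAC, AP)."""
    open_sessions: Dict[Tuple[str, str], int] = {}
    sessions: List[Session] = []
    for ev in sorted(events, key=lambda e: e.timestamp):
        key = (ev.mac, ev.ap)
        if ev.kind == "start":
            open_sessions[key] = ev.timestamp
        elif ev.kind == "end" and key in open_sessions:
            sessions.append(Session(ev.mac, ev.ap, open_sessions.pop(key), ev.timestamp))
        # End events without a matching start are ignored in this sketch.
    return sessions


log = [
    Event(100, "aa:bb", "ap-library", "start"),
    Event(160, "aa:bb", "ap-library", "end"),
    Event(200, "aa:bb", "ap-gym", "start"),
    Event(120, "cc:dd", "ap-library", "start"),
    Event(300, "aa:bb", "ap-gym", "end"),
    Event(50, "ee:ff", "ap-gym", "end"),  # orphan end event, dropped
]
history = association_history(log)
for session in history:
    print(session)
```

Mapping each AP identifier to a known building location then turns these sessions into the per-user location history that the mobility analysis relies on.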
Experiment Results:
This project targets a prototype of a real-time streaming data tracking system using Hadoop, specifically for vehicular and mobility data. We therefore present the results of the analysis and of the system built for it.

Mobility tracking results:
We ran two experiments using mobility data to test the tracking capability of the proposed system. For each experiment, we use an animation to showcase the result, as shown in our presentation. We show snapshots of these results here and briefly describe the experiments.

(1) We track the aggregate user movement among all buildings on campus. Recall that, using WiFi log data collected on campus, we can determine user location at building granularity. This kind of tracking can reveal interesting correlations between buildings and help administrators easily pinpoint events in terms of user dynamics over time. In the following figures, the matrix shows the aggregated user movement transitions among buildings. Each row represents one building index, and entry (i, j) is the number of users who transit from building i to building j in the time window being captured. We use a heat map to show the density and the gradual change of user movement over time.
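The building-to-building transition matrix described above can be computed along the following lines. This is a sketch under stated assumptions: the `(timestamp, user, building)` tuple format, the half-open time window, and the choice to ignore repeated associations within the same building are hypothetical, not taken from the report.

```python
from typing import Dict, Iterable, List, Tuple

Visit = Tuple[int, str, int]  # (timestamp, user id, building index)


def transition_matrix(visits: Iterable[Visit], n_buildings: int,
                      t_start: int, t_end: int) -> List[List[int]]:
    """Entry [i][j] counts moves from building i to building j.

    A move is counted when a user's consecutive associations are in different
    buildings and the arrival time falls in the window [t_start, t_end).
    """
    matrix = [[0] * n_buildings for _ in range(n_buildings)]
    last_building: Dict[str, int] = {}
    for timestamp, user, building in sorted(visits):
        previous = last_building.get(user)
        if (previous is not None and previous != building
                and t_start <= timestamp < t_end):
            matrix[previous][building] += 1
        last_building[user] = building
    return matrix


visits = [
    (0, "u1", 0), (10, "u1", 1), (20, "u1", 2),
    (5, "u2", 0), (15, "u2", 1), (100, "u2", 0),  # last move is outside the window
]
matrix = transition_matrix(visits, n_buildings=3, t_start=0, t_end=60)
for row in matrix:
    print(row)
```

Computing one such matrix per time window and rendering each as a heat map yields the sequence of frames used for the animation.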

(2) Google Earth user density tracking
As a complement to the previous user movement tracking, we visualize the number of users in each building at each point in time using a Google Earth animation. We pre-process the mobility trace data together with the coordinates of each building in the Hive database, and then use KML generation code as the reducer function to generate a KML file, which is the input file type for Google Earth. Finally, we can visualize the data in Google Earth in real time. The system also supports query-based filtering, which allows the user to track a specific area, time window, or particular user by specifying a Hive query statement. In our presentation, we showed an animation with a time window from 10 am to 11 am.

Vehicular Data Result:
For vehicular image data, we show two types of results from our tracking system: (1) analysis data that show the traffic density of a specific intersection or road over time, and (2) an animation of the real-time tracking system, with raw image data along with the processed data at each step and a density plot in real time.

(1) Traffic density
The following four figures show examples of different traffic densities captured and analyzed by our tracking system. Note that the vehicular data, being derived from image data, requires more processing power to pre-process. It therefore better demonstrates the real-time tracking capability of our system. From top to bottom, these figures show high traffic, low traffic, random traffic, and rush-hour traffic. The

x-axis shows the day index and the y-axis shows the hour of day. This analysis can be used for more complex real-time analysis, such as the correlation of traffic density over time. The system supports Hive queries to change both the spatial and the temporal dimension of the analysis.

(2) Real-time traffic tracking animation
As we showed in the presentation, our real-time traffic tracking system can display the data at different stages. First, it shows the raw image data captured by the camera; then it shows the images processed by the image analysis algorithm. Finally, it shows the real-time plot of traffic density for the area of interest. The system supports Hive queries to change the area of interest.

Performance:
We test the system performance for (1) copy time, (2) running load, and (3) system load, in comparison with the single-machine case. We use two machines running Hadoop and Hive for all our analysis and

performance benchmarks. We expect to see a greater performance advantage if our system is deployed on more machines. The first figure shows the relationship between data size and the time needed to copy the data from the local file system to HDFS for MapReduce processing. It shows the performance of the batch job that copies crawled data from the local file system to HDFS.

Concluding Remarks:
In this project, we applied the MapReduce technique, using Hadoop, to process and analyze large data. We specifically took two different cases of large data processing: user mobility and vehicular networks. We showed that, using the MapReduce framework, we can achieve near-real-time processing and visualization. In the future, we plan to look into interactive query processing. In this work, we also introduced a novel framework for large-scale monitoring, analysis, and modeling of vehicular traffic and user mobility. We showed how to overcome the challenge of data overload by achieving near-real-time performance. However, we acknowledge that this is a case-specific activity that makes more sense for us, as the sensor data arrives and is processed in real time. Our performance analysis results show that distributed Hadoop not only accelerates the process ... Finally, we believe that our work will help the community use the MapReduce framework for large data analysis in near real time.

References:

1. J. Dean and S. Ghemawat. MapReduce: Simplified data processing on large clusters. Commun. ACM, 51(1).
"IBM and Singapore's Land Transport Authority Pilot Innovative Traffic Prediction Tool". IBM press release.
7. Lin, Y. and H. Song. (2007) DynaCHINA: Specially built real-time traffic prediction system for China. Presented at the 86th Annual Meeting of the Transportation Research Board, Washington, DC.
8. T. Condie, N. Conway, P. Alvaro, J. M. Hellerstein. MapReduce Online.