Big Data Time Series Analysis


Big Data Time Series Analysis Integrating MapReduce and R

Visakh C R
Department of Information and Computer Science
Aalto University
Espoo, Finland
visakh.nair@aalto.fi

Abstract
The explosion of data, especially from sensor devices, calls for new and innovative methods of storage, processing and analysis. In this paper, we look at the integration of big data technologies with a statistical computing framework, R. We tackle various time series analysis use cases and also develop a proof-of-concept frontend application which can interact with the data and perform the computations.

Keywords
time series, machine learning, big data, apache hadoop, mapreduce, apache hive, R

I. INTRODUCTION
We live in the era of data. Web logs, internet media, social networks and sensor devices are generating petabytes of data every day. The advent of Internet-of-Things (IoT) devices has been heralded as a game changer. It has been estimated that there will be more than 25 billion IoT devices by 2020 [1]. These devices generate a huge amount of data, and it is imperative to have a proven and scalable mechanism to store, process and analyze this data.

In this paper, we propose and develop a framework for storing and analyzing sensor data. Sensor data is predominantly a time series. Therefore, we integrate a proven big data platform, Apache Hive, with one of the most powerful open source statistical tools, R. We develop an architecture for the combined system and also discuss how to integrate the technologies seamlessly. We discuss various machine learning use cases which are implemented using the combined power of R and Hive. We also provide a proof-of-concept dashboard developed in Shiny, which enables users to perform various analytical and machine learning computations using a web based frontend.

This paper is organized as follows: Section II introduces the various technologies and frameworks used. Section III details the data, collection methods, data models and pre-processing steps.
In Section IV, we discuss selected use cases and their implementation. Section V details the experiments and the results. We also discuss the dashboard implementation of the use cases using Shiny. Section VI summarizes the project and discusses the takeaways. We also briefly outline the future direction for this project.

II. TOOLS & TECHNOLOGIES
In this section, we discuss the various tools and technologies used in this project. We use Python scripts for collecting the data. The data is then loaded into Apache Hive tables. Hive queries spawn MapReduce jobs; MapReduce forms the core programming model of Apache Hadoop. We will discuss the architecture and implementation of Hive. For implementing the time series analyses and machine learning, we use R and various R libraries. For developing the dashboard, we utilize Shiny.

A. Python
Python is a general purpose, high-level programming language. First released in 1991, Python has been widely accepted in the programming world. It has ranked among the ten most widely used programming languages in the past few years [2]. In this project, we use Python for collecting the sensor data from an HTTP server and generating the source data files needed for loading the Hive tables.

B. Apache Hadoop
Apache Hadoop is an open-source framework that allows the distributed processing of large datasets across clusters of computers. The essence of Hadoop lies in two components: a storage component known as Hadoop Distributed File System (HDFS) and a processing component known as MapReduce [3].

HDFS is a scalable and distributed file system. It stores large files across multiple machines. The data is replicated, thereby providing high reliability.

MapReduce is the programming paradigm which drives Hadoop jobs. A MapReduce job consists of two components: a Map component which performs filtering and sorting, and a Reduce component which performs the summary operation. To elaborate further, the Map component takes an input pair and produces intermediate key-value pairs. These key-value pairs are collected and provided as input to the Reduce component. The Reduce component merges the values that share the same key [4]. Fig. 1 shows a high-level flow of a MapReduce job [5].
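The Map and Reduce components described for Apache Hadoop can be illustrated with a minimal, single-process Python sketch. It is not Hadoop code and not the implementation used in this project; the function names and the sample sensor readings are hypothetical and chosen only to show how key-value pairs flow from the Map step, through grouping by key, into the Reduce step.

```python
from collections import defaultdict


def map_phase(record):
    """Map: turn one input record into (key, value) pairs.

    Here a record is a hypothetical (sensor_id, speed) reading and the
    key is the sensor id.
    """
    sensor_id, speed = record
    yield sensor_id, speed


def shuffle(pairs):
    """Group all intermediate values by key, as the framework would."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: merge all values of one key into a summary (mean speed)."""
    return key, sum(values) / len(values)


def run_job(records):
    """Run map, shuffle and reduce in sequence on an in-memory list."""
    pairs = (pair for record in records for pair in map_phase(record))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Illustrative readings only, not data from the project.
readings = [(4, 60.0), (5, 80.0), (4, 70.0)]
mean_speed_per_sensor = run_job(readings)
```

In a real cluster the Map and Reduce functions run on different machines and the grouping step is performed by the framework; the sketch only mirrors the data flow.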

Fig. 1 MapReduce job flow

C. Apache Hive
Apache Hive is a data warehouse platform built on top of Hadoop. It facilitates the querying of large datasets stored in HDFS. Hive provides a declarative, SQL-like query language called Hive Query Language (HQL), which translates internally to MapReduce programs, thereby providing a much simpler method to query, analyze and process the data [6].

Hive supports User Defined Functions (UDFs), which make it possible to extend Hive and implement custom logic for processing the data. We utilize a few Hive UDFs in this project, for data processing as well as for computing Moving Averages and Moving Standard Deviation. These UDFs are traditionally written in Java, though support for Python has been improving recently.

D. R
R is a programming language predominantly used for statistical computing. It is possible to perform time series analysis, classification, clustering, linear and non-linear modeling and various visualization methods in R [7]. The R platform includes thousands of packages which extend it and add more functionality. We now briefly mention the major packages used in this project.

a) RHive: RHive is used to connect R with Hive and thereby perform distributed computing using Hive queries in R [8]. We connect to Hive tables in R scripts using HiveServer, extract the relevant data and process it using R functions.
b) shiny: Shiny makes it possible to build web applications with R [9]. Shiny forms the core of the web based dashboard that we develop in this project and will be explained in detail later.
c) TSA: This is a repository of time series analysis functions [10].
d) forecast: This package provides functions for univariate time series forecasting [11].
e) tseries: This package provides functionality for time series analysis [12].
f) cluster: This package provides methods for cluster analysis [13].
g) stats: The stats package provides a wide array of statistical functions [14].

E. Shiny
Shiny is a web application framework in R. Shiny makes it easier to develop web applications, especially without any HTML, CSS or JavaScript [9].

In this project, we use Shiny to create various dashboards which help users interact with the time series data and perform analysis and machine learning without tinkering with any code. A selected few of these dashboards are showcased in Section V.

III. DATA & DATA MODELS
As discussed in the introductory section, this project is about analyzing a large amount of time series data from sensors. In this section, we expand upon the type of data (that is, what kind of sensor data), the data collection methods and the data preprocessing steps. We will also look at the data models developed for storing the data in Hive.

A. The Data
A time series can be defined as a sequence of data points measured over a time interval. There is a multitude of sensor devices around us generating different types of data. Since these devices measure various phenomena, they tend to provide a sequence of data values over a time period. In other words, sensor devices produce time series data.

In this project, we utilize data from traffic sensors. A partner company [15] has installed sensors in a five-lane road in a city in Northern Finland. Each lane has its own sensor. Three lanes take traffic towards the South, while two lanes take traffic towards the North. The sensors measure the following values:

a) Sensor ID: Each sensor has been provided with a distinct sensor id. They range from 4 to 8.
b) Timestamp: The timestamp when the vehicle information was captured by the sensor.
c) Speed: The speed of the vehicle in km/h
d) Height: The height of the vehicle in centimeters
e) Length: The length of the vehicle in meters
f) Snow Depth: The amount of snow on the road in millimeters

B. Data Collection
The data from the sensors is accessed through an HTTP interface. The data collection is performed using a Python script.
In order not to overload the server, the script is executed once every hour. Each execution of the script collects the data from the previous hour. The job is scheduled using the cron daemon in Linux. The outline of the script is as follows: The script sends a request to the HTTP server and reads the response using the Python library urllib2. Using another library called BeautifulSoup [16], the relevant data is extracted, cleaned and transformed into a Python list. This data is then written into a CSV file.
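The extract, clean and write steps outlined above can be sketched as follows. This is only an illustration under stated assumptions: the project uses urllib2 and BeautifulSoup, whereas the sketch swaps in the standard-library html.parser module so that it runs without extra packages, it skips the HTTP request entirely, and it assumes (hypothetically) that the server returns the readings as rows of an HTML table. The class name, function name and sample markup are invented for the example.

```python
import csv
import io
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._in_cell = False

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag == "td" and self._row is not None:
            self._in_cell = True
            self._row.append("")

    def handle_data(self, data):
        if self._in_cell:
            self._row[-1] += data

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "tr" and self._row is not None:
            if self._row:  # skip rows without data cells, e.g. header rows
                self.rows.append([cell.strip() for cell in self._row])
            self._row = None


def html_to_csv(html_text):
    """Extract table rows, strip whitespace and return them as CSV text."""
    parser = TableRowParser()
    parser.feed(html_text)
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(parser.rows)
    return buffer.getvalue()


# Hypothetical response body: timestamp, sensor id, speed.
SAMPLE_HTML = (
    "<table>"
    "<tr><th>ts</th><th>sensor</th><th>speed</th></tr>"
    "<tr><td>2015-01-05 08:00:01</td><td> 4 </td><td>62.5</td></tr>"
    "<tr><td>2015-01-05 08:00:07</td><td>5</td><td>48.0</td></tr>"
    "</table>"
)
csv_text = html_to_csv(SAMPLE_HTML)
```

In a scheduled run, the resulting text would be written to a file that is later loaded into the Hive staging table.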

C. Data Model
Apache Hive serves as the repository for storing the traffic sensor data. Hive provides an encapsulation of the data and serves it as a table, which is similar to a database table. We utilize staging tables to load the raw data from the CSV files. The data is transformed using SQL functions and loaded into a final target table. We now provide the data model and the transformations performed.

a) Staging table
To read the CSV file, we utilize a Hive Serializer / Deserializer (SerDe). The Deserializer reads a record and translates it to a Java object. The Serializer takes the Java object and converts it to a format which can be written to HDFS [17].

We also perform type conversions on the various attributes and cast them to the correct data types. In addition to this, we generate a derived column, day of the week. Based on the timestamp, we utilize Hive UDFs to get an integer which specifies the day of the week. Monday to Sunday is represented using 1 to 7.

b) Target table
The data from the staging table is loaded into the target table. The data definition of the relevant columns is provided in Table 1.

Column Name    Data Type
ts             TIMESTAMP
sensor_id      INTEGER
speed          DOUBLE
height         DOUBLE
length         DOUBLE
snow_depth     DOUBLE
day_of_week    INTEGER
Table 1 Data Definition of target table

IV. USE CASES & IMPLEMENTATION
In this section, we discuss the various use cases of time series analysis and machine learning which we implement in this project. We divide them into six categories, namely moving average, moving standard deviation, forecasting, clustering, correlation analysis and dashboards. For each use case, we first provide the mathematical and scientific background and then elaborate on its implementation in this project.

A. Moving Average
Moving average is a statistical calculation that creates a series of averages of subsets of the original time series.
For a series $x_1, x_2, \dots, x_N$, the moving average of order $n$ is computed by taking the arithmetic mean of $n$ subsequent terms of the series. The first element of the moving average is computed by taking the arithmetic mean of the first $n$ elements in the series. The window is then shifted forward, excluding the first element of the series and including the next element, and the mean is computed again. This is performed for the entire series. For the series defined as [18] $X = \{x_1, x_2, \dots, x_N\}$, the moving average can be written as

$$MA_i = \frac{1}{n} \sum_{j=i}^{i+n-1} x_j, \qquad i = 1, \dots, N-n+1 \qquad (1)$$

Moving average is used to smooth the fluctuations in the data. The resulting series helps to identify trends and cycles in the data. Moving average is also known as rolling average or running average.

In this project, we compute the moving average of the speed of the vehicles. From the resulting series, it becomes easier to understand the trends in the speed of vehicles for each lane at different times of the day, on different days of the week or in different months.

Moving average can be computed using R functions. But since R holds the data objects in memory, the size of the data on which we can perform computations is limited by the memory of the system [19]. Since we have the data already in HDFS, we implement a process to compute the moving average using MapReduce. The outline of the process is as follows:

i. Get the attribute name (in this case, the column of the Hive table) and the moving average time window
ii. Hive query extracts the data from HDFS and passes it as a Java object
iii. Based on the time window provided, we compute the averages using a Hive UDF
iv. The computed values are returned to R and then visualized

By using a UDF, we are able to perform the computation using the MapReduce paradigm. The value returned from the UDF is a time series of the moving averages, which can then be plotted or analyzed using various R functions.

B. Moving Standard Deviation
Standard deviation is a measure of the spread of the values in a series. It measures the amount of variation of the values in a dataset [20].
In other words, standard deviation describes how tightly the values are clustered around the mean. Standard deviation is calculated by taking the square root of the average of the squared deviations of the values from their mean.

In this project, we compute the moving standard deviation of the speed of vehicles for various time periods. The objective is to understand how much the speed observations vary around the mean value. It also helps to understand the level of uncertainty in the speed forecasts. Moving standard deviation is also computed using a Java UDF in Hive, similar to the moving average process.
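The windowed computations described for the moving average and the moving standard deviation can be sketched in a few lines of Python. This is an in-memory illustration only; in the project the same calculation is performed by Java UDFs inside Hive, which are not reproduced here. The function names and the sample speeds are hypothetical, and the standard deviation follows the definition given above (square root of the average squared deviation, i.e. the population form).

```python
from statistics import mean, pstdev


def sliding_windows(values, n):
    """Return every run of n consecutive values, shifted one step at a time."""
    if n < 1 or n > len(values):
        raise ValueError("window size must be between 1 and len(values)")
    return [values[i:i + n] for i in range(len(values) - n + 1)]


def moving_average(values, n):
    """Arithmetic mean of each window of n consecutive values."""
    return [mean(window) for window in sliding_windows(values, n)]


def moving_std(values, n):
    """Population standard deviation of each window of n consecutive values."""
    return [pstdev(window) for window in sliding_windows(values, n)]


# Illustrative speeds in km/h, not data from the project.
speeds = [60.0, 62.0, 58.0, 64.0, 66.0]
ma = moving_average(speeds, 3)
msd = moving_std(speeds, 3)
```

A series of N values and a window of n yields N - n + 1 output values, so the five sample speeds with a window of three give three averages and three standard deviations.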

C. Forecasting
Time series forecasting involves the creation of a model to predict future values based on previous values. There are various techniques by which a forecasting model can be created and applied to the data. In this project, we use exponential smoothing and autoregressive integrated moving average (ARIMA). In this section, we provide a summary of these models and how they are implemented in this project. We also provide the details of the statistical tests used to validate the forecast models.

i. Forecast models
We introduce the two forecasting models used, exponential smoothing and ARIMA.

a. Exponential Smoothing
Exponential smoothing is generally used to create a smoothed time series. The concept is similar to moving average, with the difference that in exponential smoothing the weights of the previous values in the time series decrease exponentially over time. For a series $x_t$, the exponential smoothing output $s_t$ can be calculated using the following formula:

$$s_t = \alpha x_t + (1 - \alpha) s_{t-1} \qquad (2)$$

where the smoothing factor, $\alpha$, is between 0 and 1.

The model described by (2) is suitable for forecasting data with no trend or seasonal characteristic. Since we are dealing with traffic sensor data, especially the speed of vehicles, we utilize higher order exponential smoothing, namely Holt-Winters double exponential smoothing and triple exponential smoothing. These methods take the trend and seasonality into account by introducing extra terms in the model [21].

b. ARIMA
Autoregressive Integrated Moving Average (ARIMA) is a generalization of Autoregressive Moving Average (ARMA) models. ARMA is a forecasting model in which both autoregressive (AR) and moving average (MA) methods are applied to the time series. We will first define the AR and MA parts and then define the combined model.

An autoregressive model of order p can be defined as:

$$x_t = c + \sum_{i=1}^{p} \varphi_i x_{t-i} + \varepsilon_t \qquad (3)$$

where $\varphi_1, \dots, \varphi_p$ are the model parameters, $c$ is the constant and $\varepsilon_t$ is the white noise.

A moving average model of order q can be defined as:

$$x_t = \mu + \varepsilon_t + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \qquad (4)$$

where $\theta_1, \dots, \theta_q$ are the model parameters, $\mu$ is the expectation of $x_t$, and $\varepsilon_t$ is the white noise.

From (3) and (4), we can define an ARMA(p,q) model with p autoregressive terms and q moving average terms as:

$$x_t = c + \varepsilon_t + \sum_{i=1}^{p} \varphi_i x_{t-i} + \sum_{i=1}^{q} \theta_i \varepsilon_{t-i} \qquad (5)$$

ARIMA introduces an additional parameter d to the model, which refers to the integrated part and corresponds to the degree of differencing. Thus, an ARIMA model can be applied to a time series even if it is not stationary, which is not the case with ARMA models.

The following steps outline the general implementation of exponential smoothing and ARIMA models in this project.

i. Using HQL queries, we extract the relevant data from the Hive table.
ii. The extracted data is stored as a data frame in R.
iii. We perform time series transformations on the data frame using the TSA R library.
iv. The model is generated using the corresponding R function.
v. Using the forecast R library, we generate the forecasted values as needed.

ii. Statistical testing
We utilize the Ljung-Box test to check whether the autocorrelations of the residuals in the forecast differ from zero. The null and alternate hypotheses are defined as:

Null hypothesis $H_0$: The data are independently distributed.
Alternate hypothesis $H_a$: The data are not independently distributed.

The alternate hypothesis can be rephrased to say that the data exhibit serial correlation [23].

D. Correlation Analysis
Correlation can be defined as the tendency of two values to change together. A correlation value of 0 implies no linear relationship, -1 shows a perfect negative linear relationship and +1 implies a perfect positive linear relationship [24]. We compute the correlation coefficient between various attributes of the time series using standard R functions.

E. Clustering
Clustering is a machine learning technique in which objects which are similar to each other are grouped together. It is an unsupervised learning technique.
There are various methods to perform clustering, out of which we consider two: the centroid model (k-means) and the distribution model (Expectation-Maximization).

a. k-means
The objective in k-means clustering is to partition n objects into k clusters. Each observation is assigned to the cluster that has the closest mean. Though the underlying optimization problem is NP-hard, most implementations utilize heuristic methods which converge to a local optimum. For brevity, we refer the readers to [25] for a detailed explanation of k-means clustering.

b. Expectation-Maximization (EM)

The EM algorithm is an iterative method to find the maximum likelihood estimates of parameters. The expectation step (E) creates a function for the expectation of the log-likelihood, and the maximization step (M) computes the parameters by maximizing the expected log-likelihood computed in the E step. We again refer the readers to [26] for a deeper analysis.

The objective of clustering in this project is to identify whether the vehicles can be divided into separate groups based on length and height. We implement the clustering algorithms using various R packages.

F. Dashboards
The success of a data science project depends on how the models and results can be presented to the end user. In the previous sections, we looked at the technology used and the models and use cases developed. We will now describe how we develop a web based dashboard which allows a user to select the data, choose modeling techniques and visualize the results.

Shiny is a web application framework for R. An application developed in Shiny consists of two components: a user interface script, which controls the layout and appearance, and a server script, which contains the instructions necessary to build the application. Fig. 2 shows the flow of a Shiny application.

1. User selects input attributes and analysis functions
2. Hive query extracts data and processes/aggregates it as needed
3. Hive output is saved as a data frame in R
4. Analysis is performed in R
5. Output is shown to the user using the Shiny interface
Fig. 2 Process flow of a Shiny application

We will now describe the details of the various dashboards created.

i. Moving Average & Moving Standard Deviation
We have already discussed the implementation of the computation of moving average and moving standard deviation. We created an application which serves as an interface to choose various attributes and perform the calculations. The attributes which can be selected are:

a. Date: The user can select a range of dates for which the data needs to be pulled from Hive.
b. Sensor id: The user can choose the particular sensor id for which the data needs to be selected.
c. Day of week: This provides the user with an option to choose data for a particular day of the week alone (e.g., Monday or Wednesday).
d. Order: The order for the moving average or moving standard deviation can be provided.

Once the options are selected, a connection is made to Hive and the requested data is selected. Hive UDFs to compute the moving average or moving standard deviation are executed, which in turn invoke MapReduce jobs. The final result set is saved as a data frame in R, which is then plotted and displayed in the application.

ii. Forecasting
This dashboard serves as an interface for the various forecast methods. As in the moving average dashboard, the user selects the values for various attributes. The model as well as the number of forecast steps is also chosen. Based on the values provided, data from the Hive table is extracted and saved as a data frame in R. Time series computations are performed in R and the final forecast result is displayed in the application.

iii. Clustering
This application provides an interface to choose the date, sensor id and the columns on which the clustering needs to be performed. The number of clusters needed is also provided. As in the other applications, the data from Hive is extracted and saved as a data frame in R. Clustering is performed on this data and the results are displayed in the application.

iv. Speed Comparison
The speed comparison application serves as a tool to visualize the time series, especially the speed. The following values can be provided in the application:

a. Dates: Two different dates can be chosen
b. Sensor id
c. Order: This value defines the smoothing order for the time series

Based on the two dates provided, we select two sets of data from Hive. The data is smoothed based on the order value and then displayed in the application. Comparing the speed using plots helps in understanding the data better.

The purpose of the dashboards is to serve as a proof-of-concept application which shows the seamless integration of HDFS, MapReduce and R. The data is stored in HDFS, MapReduce jobs are invoked to select the relevant data, and R functions are used for processing the data and plotting it.
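The step in which the dashboard turns the selected attributes into a Hive query can be sketched as below. The dashboards in this project are written in R with Shiny and RHive; this Python sketch only illustrates the idea of validating the user's choices before building an HQL string. The table name `traffic_sensor_data` and the function name are hypothetical, while the column names follow Table 1.

```python
from datetime import date, timedelta

TABLE = "traffic_sensor_data"  # hypothetical name of the Hive target table


def build_speed_query(start, end, sensor_id, day_of_week=None):
    """Build an HQL query string for the selected dashboard attributes.

    Inputs are validated before being formatted into the query, so only
    dates and small integers ever reach the query text.
    """
    if not isinstance(start, date) or not isinstance(end, date) or start > end:
        raise ValueError("start and end must be dates with start <= end")
    if sensor_id not in range(4, 9):
        raise ValueError("sensor ids range from 4 to 8")
    upper = end + timedelta(days=1)  # exclusive upper bound on the timestamp
    conditions = [
        "ts >= CAST('{} 00:00:00' AS TIMESTAMP)".format(start.isoformat()),
        "ts < CAST('{} 00:00:00' AS TIMESTAMP)".format(upper.isoformat()),
        "sensor_id = {}".format(int(sensor_id)),
    ]
    if day_of_week is not None:
        if day_of_week not in range(1, 8):
            raise ValueError("day_of_week must be 1 (Monday) to 7 (Sunday)")
        conditions.append("day_of_week = {}".format(int(day_of_week)))
    return "SELECT ts, speed FROM {} WHERE {} ORDER BY ts".format(
        TABLE, " AND ".join(conditions)
    )


# Illustrative selection: one week of Mondays for sensor 4.
query = build_speed_query(date(2015, 1, 5), date(2015, 1, 11), 4, day_of_week=1)
```

The returned string would then be submitted to Hive, and the result set handed to the analysis step.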

V. EXPERIMENTS & RESULTS
So far, we have explained the details of the technology, the various use cases and the implementation steps and strategy followed in this project. In this section, we will first discuss the experimental setup. We will then discuss the results for each of the use cases. Table 2 shows the technology stack and the releases used.

Technology Apache Hadoop Apache Hive R Python 2.7 Release details
Table 2 Technology stack

The configuration details of the server on which the experiments were run are given in Table 3.

OS           Ubuntu LTS 64-bit
CPU Cores    8
Processors   2 x Intel(R) Xeon(R) 2.67GHz
Memory       8 x 8 GB DIMM DDR3
Table 3 Server details

We will now present the results for the various use cases.

A. Moving Average
We compute the moving average for a sensor for a particular date. The moving average computation is handled by a Hive UDF. The results are saved as a data frame in R, which is then visualized. Fig. 3 shows the moving average for sensor 4 and sensor 5 for March 20. The source code is provided in Appendix A.1.

Fig. 3 Moving Average

B. Moving Standard Deviation
Similar to moving average, we compute the moving standard deviation for sensors 4 and 5 for March 20. The results are shown in Fig. 4. The source code is provided in Appendix A.2.

Fig. 4 Moving Standard Deviation

C. Forecasting
In Section IV, we explained the two methods used in forecasting, ARIMA and exponential smoothing. Now, we present the speed forecast results using these two methods. Fig. 5 shows the output from Holt-Winters exponential smoothing. In this figure, the black line represents the actual observed values, and the red line represents the fitted values.

Fig. 5 Holt Winters forecast

We forecast fifty future values using the Holt-Winters and ARIMA models. To understand the feasibility of the model, we look at the residual values (the forecast errors). If there are correlations between forecast errors for successive predictions, then it might be possible to improve the forecasting model using other methods. In Fig. 6 and Fig. 7, we plot the correlogram of the forecast errors for the two methods. From the correlogram, we can see that the autocorrelation at lag 1 is within the significance bounds.
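The chain from fitted values to forecast errors to their autocorrelations can be illustrated with a small Python sketch. It is a simplification under stated assumptions: it uses the simple exponential smoothing recursion of (2) rather than the Holt-Winters and ARIMA models fitted in R for these experiments, it initializes the smoothed series with the first observation, and it computes only the Ljung-Box statistic, not its p-value. All function names and the sample series are hypothetical.

```python
def exponential_smoothing(series, alpha):
    """Apply s_t = alpha * x_t + (1 - alpha) * s_{t-1}, starting at s_0 = x_0."""
    if not 0 < alpha < 1:
        raise ValueError("alpha must lie strictly between 0 and 1")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1 - alpha) * smoothed[-1])
    return smoothed


def one_step_residuals(series, alpha):
    """Forecast errors when s_{t-1} is used as the forecast for x_t."""
    smoothed = exponential_smoothing(series, alpha)
    return [x - s for x, s in zip(series[1:], smoothed[:-1])]


def autocorrelation(values, lag):
    """Sample autocorrelation at the given lag (one bar of a correlogram)."""
    n = len(values)
    centre = sum(values) / n
    denominator = sum((v - centre) ** 2 for v in values)
    numerator = sum(
        (values[t] - centre) * (values[t - lag] - centre) for t in range(lag, n)
    )
    return numerator / denominator


def ljung_box_q(values, max_lag):
    """Ljung-Box statistic Q = n (n + 2) * sum_k r_k^2 / (n - k)."""
    n = len(values)
    return n * (n + 2) * sum(
        autocorrelation(values, k) ** 2 / (n - k) for k in range(1, max_lag + 1)
    )


# Illustrative values only, not speeds from the experiments.
fitted = exponential_smoothing([10.0, 20.0, 30.0], 0.5)
errors = one_step_residuals([10.0, 20.0, 30.0], 0.5)
```

A large Q relative to a chi-squared distribution with max_lag degrees of freedom points towards serially correlated forecast errors; obtaining the p-value requires a chi-squared distribution function, which is left out of this sketch.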

Fig. 6 Correlogram of forecast errors of Holt Winters
Fig. 7 Correlogram of forecast errors of ARIMA

We also perform the Ljung-Box test on the residuals to check whether there is any evidence of non-zero autocorrelations in the residuals. This test statistic has been explained in Section IV. Tables 3 and 4 show the Ljung-Box test results for forecasts using the Holt-Winters and ARIMA models respectively.

Box-Ljung test
data: speed.forecast.hw$residuals
X-squared = , df = 38, p-value < 2.2e-16
Table 3 Box test result for Holt Winters residuals

Box-Ljung test
data: speed.forecast.arima$residuals
X-squared = , df = 38, p-value =
Table 4 Box test result for ARIMA residuals

Based on the test statistic and the p value, we have little evidence of non-zero autocorrelations in the forecast errors. The source code for the forecasting steps performed is provided in Appendix A.3.

D. Correlation
The correlation analysis is performed to understand the relationship between various attributes. We extract the following values from Hive tables for one sensor for one week: speed, length, height, distance between vehicles and snow depth. We perform correlation analysis on this data. Fig. 8 shows the correlation plot between the attributes.

Fig. 8 Correlation analysis of various attributes

From Fig. 8, we can infer that there is a positive correlation between speed and distance between cars and a negative correlation between speed and snow depth. The source code for the correlation analysis is provided in Appendix A.4.

E. Clustering
We perform clustering on the attributes length and height. The objective is to analyze whether the different vehicles that pass through the traffic sensors have any common factors and natural grouping. We conduct our experiments using the k-means and Expectation-Maximization (EM) clustering methods.

Visualizing the clustering results and understanding them is as important as the algorithm used. For k-means, we utilize two visualizations. In the first one, we plot the data points and color code them based on the cluster to which they have been assigned. Fig. 9 shows the plot for the clustering. In the second one, we utilize a discriminant projection plot, in which we plot the cluster results using the discriminant coordinates of the data. Fig. 10 shows the discriminant projection plot.

Fig. 9 k-means clusters
Fig. 10 k-means discriminant plot
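To make the k-means step on length and height concrete, the following is a minimal pure-Python sketch of the usual iterative heuristic (assign each point to the nearest centroid, then recompute the centroids). It is not the R implementation used for the experiments; the naive initialization from the first k points, the function names and the sample vehicles are assumptions made for the illustration.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def kmeans(points, k, iterations=100):
    """Cluster points into k groups; returns (labels, centroids).

    Initialization is deliberately naive (the first k points), which keeps
    the sketch deterministic but can give poor results on real data.
    """
    if k < 1 or k > len(points) or iterations < 1:
        raise ValueError("need 1 <= k <= len(points) and iterations >= 1")
    centroids = [tuple(p) for p in points[:k]]
    for _ in range(iterations):
        # Assignment step: nearest centroid for every point.
        labels = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: mean of the members of each cluster.
        updated = []
        for c in range(k):
            members = [p for p, label in zip(points, labels) if label == c]
            if members:
                updated.append(tuple(sum(d) / len(members) for d in zip(*members)))
            else:
                updated.append(centroids[c])  # keep an empty cluster in place
        if updated == centroids:
            break  # assignments can no longer change
        centroids = updated
    return labels, centroids


# Hypothetical vehicles as (length in meters, height in centimeters).
vehicles = [(4.2, 150), (12.0, 380), (4.5, 145), (16.5, 400), (4.0, 155), (14.0, 390)]
labels, centroids = kmeans(vehicles, 2)
```

Because height is in centimeters and length in meters, height dominates the distance in this sketch; standardizing the attributes before clustering is a common remedy.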

The results from EM clustering are plotted in Fig. 11. The source code for clustering is provided in Appendix A.5.

Fig. 11 EM clusters

F. Dashboards
We developed various applications using Shiny to serve as dashboards for the analysis of the data. Fig. 2 provides a high-level workflow of the dashboard. In this section, we showcase a few of the dashboards developed. Fig. 12 shows the dashboard screen for moving average and moving standard deviation computation. Fig. 13 shows the dashboard for forecasting the speed.

Fig. 12 Shiny dashboard for moving average & standard deviation
Fig. 13 Shiny dashboard for forecasting speed

The source code for the Shiny applications can be found in Appendix A.6.

VI. SUMMARY & CONCLUSION
In this project, we have designed and implemented a method to process a huge volume of time series data and perform time series analysis and machine learning on it. In Section II, we provided the necessary background about the various technologies used. We also outlined the methods by which these technologies were brought together to perform as a single ecosystem for big data machine learning. In Section III, we discussed the data collection and processing strategies and also provided the details of the data model. In Section IV, we discussed various use cases and also described the implementation strategy for each of them. In Section V, we provided the details of the experiments conducted and the results.

There are various paths which this project can take further. One planned direction for future work is to extend this big data time series analysis model to real-time streaming data. Another is to replace MapReduce with Apache Spark and/or Apache Tez. Integrating data from multiple sources is another aspect we would like to look into in future.

REFERENCES
[1] Press Release, "Gartner Says 4.9 Billion Connected "Things" Will Be in Use in 2015", Gartner Inc, November
[2] Tiobe Software, "TIOBE Index for May 2015", May
[3] Apache Software Foundation, Apache Hadoop,
[4] Jeffrey Dean & Sanjay Ghemawat, "MapReduce: simplified data processing on large clusters", OSDI'04 Proceedings of the 6th conference on Symposium on Operating Systems Design & Implementation, Volume 6
[5] Jeremy Kun, "On the Computational Complexity of MapReduce", October
[6] Ashish Thusoo, Joydeep Sen Sarma, Namit Jain, Zheng Shao, Prasad Chakka, Suresh Anthony, Hao Liu, Pete Wyckoff and Raghotham Murthy, "Hive - A Warehousing Solution Over a Map-Reduce Framework", VLDB
[7] "The R project for statistical computing", The R Foundation,
[8] The Comprehensive R Archive Network, "RHive: R and Hive", NexR
[9] The Comprehensive R Archive Network, "shiny: Web Application Framework for R", Winston Chang et al.
[10] The Comprehensive R Archive Network, "TSA: Time Series Analysis", Kung-Sik Chan, Brian Ripley
[11] The Comprehensive R Archive Network, "forecast: Forecasting Functions for Time Series and Linear Models", Rob J Hyndman et al
[12] The Comprehensive R Archive Network, "tseries: Time Series Analysis and Computational Finance", Adrian Trapletti, Kurt Hornik, Blake LeBaron
[13] The Comprehensive R Archive Network, "cluster: Cluster Analysis Extended", Peter Rousseeuw et al
[14] The Comprehensive R Archive Network, "R statistical functions"
[15] Noptel Oy,
[16] Leonard Richardson, Beautiful Soup Documentation,
[17] Dave Lewis, SerDes architectures and applications, DesignCon 2004
[18] Rob J Hyndman, "Moving Averages", November 2009, Unpublished
[19] R Core Team, R installation and administration manual, April 2015
[20] Population Moments, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[21] Exponential Smoothing, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[22] ARMA models, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p43-64
[23] Ljung, G. M. and G. E. P. Box, "On a measure of lack of fit in time series models", Biometrika , p
[24] Correlation, James D Hamilton, "Time Series Analysis", Princeton University Press, 1994, ISBN , p
[25] k-means clustering, Ethem Alpaydin, Introduction to Machine Learning, Second Edition, 2010, ISBN , p
[26] Expectation-Maximization Algorithm, Ethem Alpaydin, Introduction to Machine Learning, Second Edition, 2010, ISBN , p

APPENDIX
To be added in the final report