Power Grid Time Series Data Analysis with Pig on a Hadoop Cluster Compared to Multi-Core Systems

Felix Bach*, Hueseyin K. Çakmak*, Heiko Maass, Uwe Kuehnapfel
Institute for Applied Computer Sciences
Karlsruhe Institute of Technology
Karlsruhe, Germany
{bach, cakmak, maass, kuehnapfel}@kit.edu

* Both authors contributed equally to this paper.

Abstract — In order to understand the dependencies in the power system, we try to derive state information by combining high-rate voltage time series captured at different locations with data analysis at different scales. This may enable large-scale simulation and modeling of the grid. Data captured by our recently introduced Electrical Data Recorders (EDR) and power grid simulation data are stored in the Large Scale Data Facility (LSDF) at the Karlsruhe Institute of Technology (KIT) and are growing rapidly in size. In this article we compare classic sequential multi-threaded time series data processing with distributed processing using Pig on a Hadoop cluster. Furthermore, we present our ideas for a better organization of our raw data and metadata that is indexable, searchable and suitable for big data.

Keywords: power system, time series, data analysis, big data, LSDF, Hadoop, Pig, multicore

I. INTRODUCTION

Knowledge of the power grid system state and its dynamics and dependencies is an important prerequisite for reliable control and understanding of the power system. The worldwide integration of renewable energy sources into the power grid demands a new dynamic grid architecture, the so-called Smart Grid [1], which is proposed to replace centralized grids. For its realization, it is necessary to have appropriate methods for detailed modeling and simulation at large scale, based on the analysis of real measurements. Available consumption data are not sufficient: the coarse time scale of measurements by so-called smart meters, which provide cumulated power value time series with typical frequencies of only one sample per 1 to 15 minutes, may be used for short-term local load forecasts on a statistical level, but it is not sufficient for global fine-grained analysis or for use in physical simulations that could increase the knowledge of dynamics and dependencies in the grid. We therefore designed a system for capturing, archiving and analyzing voltage data using our previously presented high-rate voltage measurement device with sub-second resolution, the EDR (Electrical Data Recorder), and presented first feature extraction and analysis results [2][3]. However, we found that our current approach lacks data scalability and usability. As a consequence, we evaluated new technology for the organization and analysis of our data. We implemented first experimental processing of large amounts of archived data utilizing a Hadoop cluster together with Pig scripts and user-defined Pig functions (UDF) and compared its performance to classical sequential multi-threaded processing on a local multi-core machine, in order to find out which method is appropriate for which data scale and whether our data allows efficient processing in Hadoop.

The paper is organized as follows: In chapter II we give an overview of related work and a problem description. In chapter III we describe our data and the cluster setup. In chapter IV the results of our performance analysis of distributed and local sequential processing are shown. We also discuss data visualization aspects for time series data from the LSDF. In chapter V we present new concepts for improvements.

II. RELATED WORK AND PROBLEM IDENTIFICATION
A. Related work

Different groups have worked on capturing, archiving and analyzing voltage data. However, to our knowledge no such work has previously been done for long-term data analysis with synchronized sub-second sampling, at least in Europe. High-rate network analyzers capture more detailed data than smart meters, but they are currently only used for short-term failure analysis and are not synchronized between different locations. Phasor measurement units (PMUs) are high-speed sensors with the option for synchronous acquisition to enable monitoring of power grid quality. However, PMUs are rarely used in Germany and data from the real network are expensive [1].

The Tennessee Valley Authority (TVA) collects PMU data on behalf of the North American Electric Reliability Corporation (NERC) to help ensure the reliability of the bulk power system in North America [4]. High-voltage electric system busses and transmission lines are sampled at a substation several thousand times a second, and the samples are then reported for collection and aggregation. The entire stream is passed to archiving servers. A real-time data stream is forwarded to a server program hosted by TVA, which passes the data in a standard phasor data protocol (IEEE C37.118-2005) to client visualization tools. Agents may move PMU archive files into a Hadoop cluster or
directly request these data at their remote location for analyses. By the end of 2010, around 40 TB of PMU data had been collected; an additional 500 TB of PMU data are expected within the next 5 years.

The Open Source Phasor Data Concentrator (OpenPDC), administered by the Grid Protection Alliance (GPA) [5], is a project providing a free software system designed to process streaming time series data in real time. According to the homepage, measured data is gathered with GPS time from many hundreds of input sources, time-sorted and provided to user-defined actions. It is scalable and designed to consume all standard PMU input protocols. However, performance statistics are logged only every 10 seconds. Another free phasor data concentrator based on the IEEE C37.118 synchrophasor standard is ipdc [6]. It is more academic than OpenPDC, allows the development and testing of algorithms and applications related to PMU data, and may be used and modified without any restriction. The project Lumberyard [7] provides scalable indexing and low-latency fuzzy pattern searching in time series data. It uses HBase (a column-oriented store on top of Hadoop) and iSAX [8] to achieve scale and indexing/searching, respectively. Lumberyard is based on iSAX and uses the jmotif Java library, which provides symbolic aggregate approximation (SAX) [9] and iSAX for time series data, enabling outlier detection and the search for frequently occurring patterns.

B. Problem description

Our EDR measurements produce massive amounts of data every day, and additional data will arise from power grid simulations. When measuring at the high rate of 25 kHz, each EDR produces ~16 GiB per day. This adds up to a total of 5.7 TiB per year per device, exceeding typical hard disk storage sizes (a rough plausibility check of these figures is given below). As soon as we add many devices or run simulations with virtual EDRs, storing the data on disk drives connected to a single PC is obviously no longer possible, and processing is not efficient. In order to transform our data into useful and searchable scientific information, we need new ways to manage, archive, explore and analyze the resulting data, because classical storage and mining methods were not found to be appropriate for huge amounts of data: As a first step towards understanding the meaning of the data, we wanted to compute some statistical measures for the time series. For small time ranges in the measurement history, we implemented some basic sequential processing of the data. But the larger the size and number of data files that needed to be processed, the less appropriate classic sequential methods proved to be (as presented in chapter IV). To overcome these limits in time ranges and data sizes, both for storage and for processing, we searched for ways to bring computation and data residence together.

1) Technical Requirements

First, we need data storage that does not run out of space like traditional storage on a hard disk and that holds our valuable data fail-safe, e.g. by using redundant storage, to enable maintenance and accessibility. Second, we need ways to access and analyze the data efficiently. It should be possible to flexibly analyze the archived data anytime afterwards for arbitrary time ranges and to quickly search and filter the data using customizable queries. For this, the storage should provide one namespace, so that data distribution is abstracted from users. Also, a graphical frontend for interactive navigation, supported by different visualizations of the data, was needed to provide a data overview.
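The daily and yearly data volumes quoted above can be roughly reproduced from the acquisition parameters described in chapter III (three phases, 16-bit samples, Base64 encoding). The following minimal Java sketch performs this back-of-the-envelope check; the 4/3 factor for the Base64 overhead and the neglect of the per-second XML metadata are simplifying assumptions on our part.

```java
/** Rough plausibility check of the EDR data volume.
 *  Assumptions: 3 phases, 16-bit samples, Base64 overhead of 4/3, XML metadata neglected. */
public class EdrDataVolumeEstimate {
    public static void main(String[] args) {
        double sampleRateHz = 25_000;       // high-rate capture mode
        int phases = 3;                     // three voltage phases
        int bytesPerSample = 2;             // 16-bit raw binary values
        double base64Overhead = 4.0 / 3.0;  // Base64 encodes 3 bytes as 4 characters

        double bytesPerSecond = sampleRateHz * phases * bytesPerSample * base64Overhead;
        double gibPerDay = bytesPerSecond * 86_400 / Math.pow(1024, 3);
        double tibPerYear = gibPerDay * 365 / 1024;

        System.out.printf("~%.1f GiB per day, ~%.1f TiB per year per EDR%n",
                gibPerDay, tibPerYear);     // ~16.1 GiB/day, ~5.7 TiB/year
    }
}
```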
2) Goals of this study

We wanted to find out the scale at which distributed processing outperforms local parallel processing. Since the cluster our data resides on already has a Hadoop installation available, we tried utilizing its Map-Reduce framework to follow the approach of bringing the processing to the data rather than transferring the data to processing units. A convenient way to do this was to write user-defined functions (UDFs) for Apache Pig [10] (see chapter III). By running the same experimental statistical analyses over differently sized test data sets, using this distributed setup on the one hand and more classical parallel processing on the other, we then tried to evaluate and compare both approaches. A good storage and processing approach is the precondition for developing and implementing more sophisticated methods for analysis, classification and clustering of big datasets. This is due to the totally different design patterns and requirements for the archival data layout involved in the two processing approaches.

III. INFRASTRUCTURE AND DATA LAYOUT

A. Hadoop cluster setup

We utilize the Large Scale Data Facility (LSDF), which is currently maintained at the KIT Steinbuch Centre for Computing (SCC) [11]. It is designed to fulfill the requirements of different data-intensive scientific experiments and applications. At the moment, the Hadoop cluster available as part of the LSDF consists of 58 nodes with 464 physical cores in total, each node having 2 sockets with Intel Xeon E5520 CPUs (2.27 GHz, 4 cores), 36 GB of RAM, 2 TB of disk, a 1 GbE network connection, and Scientific Linux 5.5 with Linux kernel 2.6.18 as operating system. The two additional name nodes are identical to the data nodes except for 96 GB of RAM and a 10 GbE network connection. Cloudera's distribution of Hadoop, version CDH3u4, is currently installed, providing different interfaces. We used the Apache Pig API. Pig consists of a high-level data flow description language called Pig Latin and a server running on top of Hadoop that internally uses Map-Reduce.

B. Data Sources and Formats

We have two data sources: collected measurement data produced by our EDRs and simulation data gathered at virtual smart meters in a power grid simulation model.

1) Power Grid Time Series Data Captured by EDRs

The investigation site is the Campus North (CN) of KIT, which is an enclosed research center located near Karlsruhe. One 110 kV incoming transmission line from the local energy supplier as well as one 2 MW block heating station, also used for electrical power generation, provide its
electrical energy. 534 electrical active power consumption smart meters are installed at the KIT Campus North, and data are available over years from a central database [3]. Since February 2012 we additionally conduct continuous voltage time series measurements in the island-like electric network using our EDRs, where three phase voltages are acquired simultaneously at a typical rate of 10 kHz (up to 25 kHz). A GPS pulse-per-second (PPS) signal is captured at the same time for synchronization purposes [2].

Our EDR raw data is first stored as XML files locally on the disk drive of the PC that is recording the time series. These files are then aggregated at a file server, which writes the data to the Hadoop file system (HDFS) using Hadoop's Java API. Each XML file contains the measured voltage data of one minute in 60 one-second blocks for each of the 3 phases, stored as Base64-encoded 16-bit raw binary numbers. Some additional metadata describing the acquisition process and supplemental information is stored together with the raw data in the XML files for each one-second block. This data layout was designed when the first measurements were made, in order to keep the highest accuracy at the lowest data transfer volume. Even if this layout does not seem to fit the idea of distributed processing on a Hadoop cluster, we need a way to process it, having collected daily measurements in this format for nearly one year now. We have already aggregated a large amount of time series data (454,747 files in 172 folders, 1,117,674 MiB) with various recording sample rates. The amount of data increases over time and multiplies with the number of installed EDRs.

2) Power grid simulation data

To understand the effects in a complex power grid we also put effort into its modeling. We developed the new software framework easimov (Electric Grid Analysis, Simulation, Modeling and Visualization) as a collection of multiple modules. The modeling module enables users to interactively create a geo-information based representation of the power grid structure and components. We have access to the data of 534 smart meters at the KIT-CN, which we utilize to simulate the currents based on the load and voltage profiles. We use GridLAB-D [12] as the simulation software. Each virtual smart meter in the simulation model can record voltage and current for each phase once per second; higher sampling rates will be supported in GridLAB-D v3. The output of the recorders is saved to individual CSV files. The file sizes depend on the number of smart meters (n_sm), the number of captured parameters (n_p), e.g. current and voltage, the duration of the simulation (t_sim) and the sampling interval of the output (dt). The total data size also depends on the number of simulation runs (n_sim). The expected data size per model can be calculated as:

datasize_sim = 320,673 bytes × n_sm × n_p × n_sim × (t_sim / 1 h) × (1 s / dt)

The constant value for the bytes was determined for n_sm = n_p = n_sim = 1, t_sim = 1 h and dt = 1 s. For a long-term simulation, e.g. over 1 year, datasize_sim rises by a factor of about 8,765 (the approximate number of hours per year); for a high-sample-rate simulation matching the EDR resolution, datasize_sim rises by a factor of up to 25,000. The KIT-CN model will produce 1.5 TB for one long-term simulation (n_sm = 534, n_p = n_sim = 1, t_sim = 1 y, dt = 1 s).
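To illustrate, the following minimal Java sketch evaluates this estimate for the KIT-CN long-term simulation mentioned above; the method name and the parameter units (t_sim in hours, dt in seconds) are our own conventions for this example.

```java
/** Sketch of the simulation data-size estimate; tSimHours in hours, dtSeconds in seconds. */
public class SimDataSizeEstimate {

    // constant determined for n_sm = n_p = n_sim = 1, t_sim = 1 h, dt = 1 s
    static final double BYTES_PER_UNIT = 320_673;

    static double datasizeSimBytes(int nSm, int nP, int nSim, double tSimHours, double dtSeconds) {
        return BYTES_PER_UNIT * nSm * nP * nSim * tSimHours / dtSeconds;
    }

    public static void main(String[] args) {
        // KIT-CN long-term run: 534 smart meters, 1 parameter, 1 run, 1 year, 1-second output interval
        double bytes = datasizeSimBytes(534, 1, 1, 365.25 * 24, 1.0);
        System.out.printf("expected size: %.2f TB%n", bytes / 1e12); // ~1.5 TB
    }
}
```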
C. Current data organisation

The raw data files are transferred continuously to the LSDF's HDFS storage, where each day is represented by a folder containing all the data files, each holding the information of one minute. The file names allow identification by date, time and the identifier of the measuring GPS device.

IV. PERFORMANCE ANALYSIS OF DATA PROCESSING

In this study we analyze the voltage time series data captured by one EDR device at the KIT-CN for June 2012. We are interested in efficient analysis and the extraction of statistical information. As a first step towards understanding the meaning of the data and evaluating different processing approaches, we wanted to compute some statistical measures for the time series data. In order to test the capabilities of the new technology (Hadoop, Pig) and the limits of PC computing for our data, we implemented the analysis algorithms both in a multicore environment and on the Hadoop cluster. Furthermore, we measured the runtime for each step of the data visualization with direct LSDF access.

A. Comparison of the Processing Time: Multicore vs. Pig

For comparing the performance of the statistical analysis on big data, we developed a Java-based multi-threaded application which was tested on two multicore PCs (PC-1: Intel i5-2500 CPU, 4 cores, 3.3 GHz, 8 GiB RAM, Windows 7 64-bit; PC-2: 2x Intel Xeon CPU, 8 cores, 3.0 GHz, 16 GiB RAM, Windows 7 64-bit). The multi-threaded application operates on measurement data which are stored locally. For testing purposes we concentrated on simple statistical calculations such as the minimum, maximum, average and the standard deviations of minima and maxima for all three phases of the EDR data. Since we are primarily interested in comparing the processing speed, the selected methods may be insufficient for serious electrical data analysis.

The same functionality for data decoding and the calculations was also implemented as a user-defined function (UDF) for use in Pig scripts. UDFs are compiled into Map-Reduce code running distributed on a Hadoop cluster. To be able to write and automatically start Pig Latin scripts from a development PC, the scripts were embedded in Java code. For this purpose, the Pig Java API provides a PigServer class and methods for registering and running scripts. Using this API, a function was implemented that runs a simple Pig script which reads the data XML tags of all XML files in a given HDFS folder and applies the processing of a UDF written in Java to them. The UDF includes all statistical analysis implementations in Java as well as the needed Base64 decoding. The compiled Java class is packed into a jar file, which has to be registered with the Pig server in order to use it in Pig scripts. Processing results are written to HDFS.
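A minimal sketch of this embedding is shown below. The class, alias and path names (StatsUDF, edr-analysis.jar, /edr/2012-06-01) are hypothetical placeholders, and the script assumes the one-second Base64 blocks have already been extracted into a chararray column; the actual loader and statistics code used in our experiments are more involved.

```java
import java.io.IOException;
import java.util.Base64;
import org.apache.pig.EvalFunc;
import org.apache.pig.ExecType;
import org.apache.pig.PigServer;
import org.apache.pig.data.Tuple;
import org.apache.pig.data.TupleFactory;

/** Hypothetical UDF: decodes one Base64-encoded block of 16-bit samples and returns (min, max, avg). */
class StatsUDF extends EvalFunc<Tuple> {
    @Override
    public Tuple exec(Tuple input) throws IOException {
        if (input == null || input.size() == 0 || input.get(0) == null) return null;
        byte[] raw = Base64.getDecoder().decode((String) input.get(0));
        long min = Long.MAX_VALUE, max = Long.MIN_VALUE, sum = 0;
        int n = raw.length / 2;
        for (int i = 0; i < n; i++) {
            // assumption: little-endian signed 16-bit samples
            short v = (short) ((raw[2 * i] & 0xFF) | (raw[2 * i + 1] << 8));
            min = Math.min(min, v);
            max = Math.max(max, v);
            sum += v;
        }
        Tuple out = TupleFactory.getInstance().newTuple(3);
        out.set(0, min);
        out.set(1, max);
        out.set(2, (double) sum / n);
        return out;
    }
}

/** Embeds a Pig Latin script in Java via PigServer, as described in the text. */
public class RunEdrStats {
    public static void main(String[] args) throws IOException {
        PigServer pig = new PigServer(ExecType.MAPREDUCE);
        pig.registerJar("edr-analysis.jar");  // jar containing the compiled UDF
        pig.registerQuery("blocks = LOAD '/edr/2012-06-01' AS (data:chararray);");
        pig.registerQuery("stats = FOREACH blocks GENERATE StatsUDF(data);");
        pig.store("stats", "/edr/results/2012-06-01");  // results are written to HDFS
    }
}
```

The embedding keeps the analysis logic in plain Java while Pig takes care of distributing the work as Map-Reduce jobs over the files in the given HDFS folder.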
We analyzed voltage measurement data for time periods ranging from six hours (~855 MiB, 360 files) up to three weeks, which amounts to about 69.7 GiB in 30,127 data files. The files had a Hadoop replication factor of 2, and an HDFS block size of 64 MB was used. Fig. 1 shows a comparison of the time measured for the calculation of the statistical parameters on the multicore PCs (multi-threaded Java code, 8 threads) and on the Hadoop cluster. From 6.2 GiB onwards, Hadoop cluster processing is superior to multi-threaded data processing. Due to the more advanced configuration of PC-2, its processing times are shorter, but both curves show similar characteristics. We can observe that Pig processing always needs a certain time for the initialization of the Map-Reduce step (creating and distributing jar files); for small data sizes this initialization is the retarding factor.

Figure 1: Runtime comparison Pig/Hadoop cluster vs. multicore PCs

In a further step we examined the variation of the processing time over several runs within the Hadoop cluster. We calculated the average and median values of five runtime measurements for the same data configuration. During the execution we encountered two Hadoop socket warnings (DataStreamer Exception) for some of the tested configurations, which nearly doubled the execution time.

B. Runtime Analysis for local Data Visualization

We developed a visualization module for the easimov software, which scans the EDR raw data on the LSDF and sorts them by the ID of the EDR devices and the captured time intervals. We were interested in analyzing the time needed for direct data retrieval from the LSDF and for the visualization with our current data organization. For testing purposes we selected a 15-minute interval of the voltage time series for July 21st, 2012 and measured the runtime for the data access, the data retrieval, the data processing and the data visualization (Fig. 2).

Figure 2: Runtime evaluation of data processing for LSDF data

The 15-minute example described above needs about 105 seconds from the interactive selection of the time range to the visualization. For this short time range, the data transfer from the LSDF is not the bottleneck, but rather the local processing of the large data volume with Java. We verified this by measuring the time needed for each processing step. The most time-consuming step is the data decoding and the data formatting for the rendering module (93.3%). This is followed by the parsing of the XML files (4.43%) with the SAX parser. A drawback of this processing pipeline is the repeatedly required access to the LSDF and the calculation of the statistics from scratch. Local data processing after a data transfer from the LSDF will not make any sense for the expected size of our measured and simulated data.

V. PROPOSED IMPROVEMENTS

Our experiences have reconfirmed that processing must be carried out near the data. This applies to the data analysis as well as to the preparation of the data visualization. A further improvement will be the creation of metadata, which will structure our data and provide additional semantic information. The local runtime analysis also made clear that the Base64 encoding of our raw data is computationally too intensive to scale with the data growth. We will have to accept increased file sizes and store values in a text-based format.

A. New Metadata structure for indexing and queries

One problem with the present data organization in the LSDF is its flatness (one folder per day) and the lack of metadata at higher data abstraction layers. Computing the total size of our data folder took more than five hours, showing the access problems for unpartitioned data. Currently, metadata is only kept for one-second measurement intervals. It is desirable to also have information for longer time intervals: for example,
metadata for a specific day, month or year could describe the data characteristics for this time interval in a compressed manner and relate them to events and properties in external databases. In order to access the data domain in a top-down manner, a hierarchical metadata representation is proposed. We will explain this based on the statistical functions introduced above. The procedure gives us information on the measured data at the top level (year), which contains value ranges for the minima, maxima, averages, etc. Traversing the tree down to significant data sections will be possible.

In order to build up the hierarchical metadata structure, we analyze the content of the data blocks bottom-up, create metadata descriptions and merge them at the next higher level (time interval). Figure 3 shows the principle: we have data blocks representing metadata and sample data for each second. Based on these data we calculate minima, maxima, averages and the standard deviations of minima and maxima. Based on 60 one-second data blocks we create a minute data block containing the ranges of all characteristics. We save the minima and maxima of these in the minute description file, together with the file names pointing to the measurement data files. For the next hierarchy level, we group the minute metadata files into hour metadata files using the same principle, that is, we outline the value ranges of the characteristics obtained from the minute metadata files. This is repeated for each EDR until we reach the top of the hierarchy, the year metadata file. Currently this is defined as our top level, which is stored separately for every EDR device.

Figure 3: The proposed hierarchical metadata structuring and indexing
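The following Java sketch illustrates the bottom-up merge of value ranges from one hierarchy level into the next. The class and field names (MetaBlock, childFiles) are hypothetical, and the statistics are reduced to minimum and maximum ranges; the actual metadata files additionally carry averages, standard deviations and further descriptive information.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical metadata block holding value ranges for one time interval (second, minute, hour, ...). */
class MetaBlock {
    double minOfMin = Double.POSITIVE_INFINITY;   // smallest minimum seen in the interval
    double maxOfMax = Double.NEGATIVE_INFINITY;   // largest maximum seen in the interval
    List<String> childFiles = new ArrayList<>();  // names of the files this block was built from

    /** Merge one child block (e.g. a minute block) into this block (e.g. an hour block). */
    void merge(MetaBlock child, String childFileName) {
        minOfMin = Math.min(minOfMin, child.minOfMin);
        maxOfMax = Math.max(maxOfMax, child.maxOfMax);
        childFiles.add(childFileName);
    }
}

public class MetadataRollup {
    /** Builds the next hierarchy level from a list of child blocks, e.g. 60 minute blocks -> 1 hour block. */
    static MetaBlock rollUp(List<MetaBlock> children, List<String> childFileNames) {
        MetaBlock parent = new MetaBlock();
        for (int i = 0; i < children.size(); i++) {
            parent.merge(children.get(i), childFileNames.get(i));
        }
        return parent;
    }
}
```

Because each parent block stores only ranges and pointers to its children, queries for a given time range can descend the tree and touch the raw data files only where necessary.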
The collected information on a year basis is available separately for each EDR device and can also be processed per device. This indexing, combined with the data analysis of specific characteristics, can be performed as a regular job for incoming new data. Additionally, we will be able to submit queries for arbitrary time ranges, which can be answered efficiently from the metadata instead of being computed from the raw data.

B. Further Improvements

We will work on automated methods that perform the analysis, indexing and structuring of our collected voltage time series measurements directly on the Hadoop cluster. The data analysis methods will be extended towards advanced state-of-the-art analysis and mining. We plan to convert our measurement data to the SAX representation and make it indexable using the iSAX approach (a minimal sketch of the SAX discretization is given at the end of this subsection). Furthermore, we will concentrate on the compact visual presentation of, and interaction with, statistical analysis data for the proposed new data structure. Promising starting points are the intelligent icons approach [13], VizTree [14] and the methods in [15], which could provide a semantic overview of our datasets and their similarity. We also plan to develop methods for semantic searches, search by data (similarity search), dataset recommenders, and classification as well as clustering using the metadata extracted by our analyses. Further optimizations of the data transfer between EDRs and HDFS storage, such as the aggregation of the XML blocks to fit the HDFS block size of 64 MB, are planned to speed up data processing. The security of the system is another aspect that needs to be dealt with seriously. We will have to provide secure archiving and querying. This may include separation and anonymization of stored personal data, e.g. by using pseudonyms and secure but fast encryption methods.
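As an illustration of the planned SAX conversion, the sketch below shows a minimal textbook SAX discretization (z-normalization, piecewise aggregate approximation, mapping of segment means to symbols via Gaussian breakpoints) for a four-letter alphabet. It is our own simplified example, not the jmotif implementation we intend to use.

```java
/** Minimal SAX discretization sketch: z-normalize, PAA, then map segment means to symbols. */
public class SaxSketch {

    // Gaussian breakpoints for an alphabet of size 4 (standard SAX lookup table)
    private static final double[] BREAKPOINTS = {-0.6745, 0.0, 0.6745};

    static String sax(double[] series, int segments) {
        // 1) z-normalization statistics
        double mean = 0, sq = 0;
        for (double v : series) mean += v;
        mean /= series.length;
        for (double v : series) sq += (v - mean) * (v - mean);
        double std = Math.sqrt(sq / series.length);

        // 2) piecewise aggregate approximation: mean of each normalized segment
        StringBuilder word = new StringBuilder();
        int len = series.length / segments;  // assumes length divisible by the number of segments
        for (int s = 0; s < segments; s++) {
            double paa = 0;
            for (int i = s * len; i < (s + 1) * len; i++) {
                paa += (series[i] - mean) / std;
            }
            paa /= len;

            // 3) map the segment mean to a symbol via the breakpoints
            char symbol = 'a';
            for (double b : BREAKPOINTS) {
                if (paa > b) symbol++;
            }
            word.append(symbol);
        }
        return word.toString();
    }

    public static void main(String[] args) {
        double[] demo = {1, 2, 3, 4, 5, 6, 7, 8};
        System.out.println(sax(demo, 4));  // prints the SAX word "abcd" for this rising series
    }
}
```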
VI. CONCLUSION

We presented our approach for high-rate voltage power network measurements using EDRs, together with the resulting data archiving and analysis problems and methods. We compared multi-core with distributed data processing for simple statistical analyses. For this purpose we implemented the statistical analysis functionality in Java both for the Hadoop cluster, using the Pig API, and for the local environment. We could show that for data sets in our format exceeding ~6.2 GiB, the use of Pig on our Hadoop cluster is superior to classical multicore processing regarding computation speed and scalability. We also introduced a new time-oriented schema for raw time series and metadata organization that enables fast and easy information retrieval.

REFERENCES

[1] J. Depablos, V. Centeno, A. G. Phadke, and M. Ingram, "Comparative testing of synchronized phasor measurement units," in IEEE Power Engineering Society General Meeting, 2004, pp. 948-954, vol. 1.
[2] H. Maass, H. K. Çakmak, W. Suess, A. Quinte, W. Jakob, K. U. Stucky, and U. Kuehnapfel, "Introducing the Electrical Data Recorder as a New Capturing Device for Power Grid Analysis," in AMPS 2012 Proceedings, 2012.
[3] H. Maass, H. K. Çakmak, W. Suess, A. Quinte, W. Jakob, E. Müller, K. Boehm, K. U. Stucky, and U. Kuehnapfel, "Introducing a New Voltage Time Series Approach for Electrical Power Grid Analysis," presented at the 2nd IEEE ENERGYCON Conference & Exhibition / ICT for Energy Symposium, 2012, pp. 143-148.
[4] J. Patterson, "The Smart Grid: Hadoop at the Tennessee Valley Authority (TVA)," Cloudera. [Online]. Available: http://www.cloudera.com/blog/2009/06/smart-grid-hadoop-tennesseevalley-authority-tva/. [Accessed: 15-Aug-2012].
[5] Grid Protection Alliance (OpenPDC, OpenHist., ...). [Online]. Available: http://www.gridprotectionalliance.org/gsdefault.htm. [Accessed: 13-Sep-2012].
[6] ipdc - Free Phasor Data Concentrator. [Online]. Available: http://ipdc.codeplex.com/. [Accessed: 15-Aug-2012].
[7] "Lumberyard: Time Series Indexing at Scale," OSCON 2011, O'Reilly Conferences, July 25-29, 2011, Portland, OR.
[8] J. Shieh and E. Keogh, "iSAX: disk-aware mining and indexing of massive time series datasets," Data Mining and Knowledge Discovery, vol. 19, no. 1, pp. 24-57, 2009.
[9] E. Keogh, J. Lin, and A. Fu, "Hot SAX: Efficiently finding the most unusual time series subsequence," in Data Mining, Fifth IEEE International Conference on, 2005, 8 pp.
[10] A. Gates, Programming Pig, O'Reilly Media, 2011, ISBN-13: 978-1449302641.
[11] A. O. Garcia, S. Bourov, A. Hammad, V. Hartmann, T. Jejkal, J. C. Otte, S. Pfeiffer, T. Schenker, C. Schmidt, P. Neuberger, et al., "Data-intensive analysis for scientific experiments at the Large Scale Data Facility," in Large Data Analysis and Visualization (LDAV), 2011 IEEE Symposium on, 2011, pp. 125-126.
[12] D. P. Chassin, K. Schneider, and C. Gerkensmeyer, "GridLAB-D: An open-source power systems modeling and simulation environment," in Transmission and Distribution Conference and Exposition, 2008. T&D. IEEE/PES, 2008, pp. 1-5.
[13] L. Wei, E. Keogh, X. Xi, and S. Lonardi, "Integrating Lite-Weight but Ubiquitous Data Mining into GUI Operating Systems," Journal of Universal Computer Science, vol. 11, no. 11, pp. 1820-1834, 2005.
[14] VizTree. [Online]. Available: http://www.cs.gmu.edu/~jessica/viztree.htm. [Accessed: 07-Nov-2012].
[15] W. Aigner, S. Miksch, W. Müller, H. Schumann, and C. Tominski, "Visualizing time-oriented data: A systematic view," Computers & Graphics, vol. 31, no. 3, pp. 401-409, June 2007.