Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo
Contents
1. Introduction
2. What & Why Sensor Network
3. Enterprise Sensor Network
4. Conclusion and Future Work
Introduction: Background
- Sensor technology is now widely available at low cost, e.g. weather, CO2, and radiation sensors.
- It is used in many research fields and applications, such as environment monitoring, pollution monitoring, disaster monitoring, agricultural field monitoring, and traffic monitoring.
- Most applications are built around one sensor's specification and one use case, so they are difficult to reuse for other purposes or with different sensors.
- Sharing sensor information among systems is difficult due to a lack of standardization.
- OGC's Sensor Web Enablement (SWE) exists, but it does not focus on the concrete details of application development.
What & Why Sensor Network: Sensor Network
- A sensor network is a group of heterogeneous sensor systems connected through a communication infrastructure to exchange information between sensor stations or sensor nodes.
- All sensor nodes can link or synchronize data with each other or with a main station, so that together they act as a network.
- It is driven by progress in three technologies: sensors, field platforms, and the Internet.
[Figure: Sensors, Platform, and Internet combining into a Sensor Network]
What & Why Sensor Network: What is needed for Sensor Network
What & Why Sensor Network: Sensor Service GRID (SSG), Sensor Middleware
What & Why Sensor Network: Issues in Sensor Network
- How to handle a large sensor network (nodes and data).
- How to scale to a larger size with minimal effort.
- Insufficient processor, I/O, and storage resources at large scale.
- Heterogeneous and vendor-specific sensors are difficult to connect to a sensor network.
- It must operate over any network, even an unstable one.
- Real-time and near-real-time operation.
- It must provide a channel or interface through which third-party applications can connect and use data in the sensor service.
- Standardized interfaces for compatibility with other software.
- Rapid installation and ease of use.
- GIS-enabled visualization.
- Low cost?
Enterprise Sensor Network: The Goal
Design and develop a prototype sensor network system that supports various sensors and any network topology, and that scales easily from small to large deployments with minimal effort and human operation.
Key Features
- Large-scale support with cloud
- Massive data and real-time data processing
- Flexible data communication
- Easy integration, installation, and ease of use
- High-frequency and multi-dimension support
- Open standard and integration support
- Spatial data support
Enterprise Sensor Network: System Overview
Enterprise Sensor Network: Sensor Stations (SOSes)
- An SOS is a sensor station installed and deployed at a field site.
- It handles feeding data from sensors as well as sending data to the cloud service.
- It can be a fixed station or a mobile station with mission support.
- It is a combination of an SOS Service and a Web Server.
- It supports both push and pull data feeding.
- Stations are divided into 3 types based on their features:
  - Rich-node: fully functional, with web UI and 2-way control
  - Dump-node: data feeder only (storage, processing cost)
  - Virtual-node: shared resources, no hardware, more than one node
Enterprise Sensor Network: Sensor Station Design
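One possible reading of "push and pull data feeding" can be sketched as follows. This is a minimal in-memory illustration, not the SOS implementation: the class `SensorStation`, its methods, and the station id `sos-001` are all hypothetical names introduced for this example.

```python
from collections import deque


class SensorStation:
    """Hypothetical station buffer that supports push and pull data feeding."""

    def __init__(self, station_id):
        self.station_id = station_id
        self.buffer = deque()      # readings waiting to be pulled
        self.subscribers = []      # callbacks used for push delivery

    def subscribe(self, callback):
        """Register a consumer that wants readings pushed to it."""
        self.subscribers.append(callback)

    def feed(self, reading):
        """Accept one reading from a sensor: store it, then push it out."""
        record = {"station": self.station_id, **reading}
        self.buffer.append(record)
        for callback in self.subscribers:
            callback(record)

    def pull(self, max_items=10):
        """Let a consumer fetch up to max_items buffered readings on demand."""
        items = []
        while self.buffer and len(items) < max_items:
            items.append(self.buffer.popleft())
        return items


station = SensorStation("sos-001")
pushed = []
station.subscribe(pushed.append)               # push consumer
station.feed({"sensor": "co2", "value": 412.5})
station.feed({"sensor": "temp", "value": 21.3})
pulled = station.pull()                        # pull consumer
```

In push mode the station decides when data moves; in pull mode the consumer does, which may suit a cloud service that polls stations on an unstable link.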
Enterprise Sensor Network: Messaging Service as communication medium
- Enables 2-way control between stations and cloud services
- Supports multiple connectors
- Supports various types of message storage
- Load balancing and cluster support
(Source: ActiveMQ, Apache)
Enterprise Sensor Network: Network of Brokers
Brokers can be linked together to form a network or cluster of brokers. A network of brokers can use various network topologies, such as hub-and-spoke, daisy chain, or mesh.
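The hub-and-spoke topology can be illustrated with a small in-memory model. This sketch does not use the ActiveMQ API: the `Broker` class and `publish` function are hypothetical, and flooding every broker is a simplification, since a real broker network typically forwards messages only where consumers exist.

```python
from collections import deque


class Broker:
    """Hypothetical in-memory broker; not the ActiveMQ API."""

    def __init__(self, name):
        self.name = name
        self.peers = []        # brokers this one forwards to
        self.delivered = []    # messages handed to local consumers

    def link(self, other):
        """Create a two-way link between this broker and another."""
        self.peers.append(other)
        other.peers.append(self)


def publish(origin, message):
    """Flood a message through the network, visiting each broker once."""
    seen = {origin.name}
    queue = deque([origin])
    while queue:
        broker = queue.popleft()
        broker.delivered.append(message)
        for peer in broker.peers:
            if peer.name not in seen:
                seen.add(peer.name)
                queue.append(peer)
    return seen


# Hub-and-spoke: every station broker links only to the hub.
hub = Broker("cloud-hub")
spokes = [Broker(f"station-{i}") for i in range(3)]
for spoke in spokes:
    hub.link(spoke)

reached = publish(spokes[0], {"cmd": "sync"})
```

Changing only the `link` calls turns the same model into a daisy chain or a mesh; the `seen` set keeps a message from looping in topologies that contain cycles.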
Enterprise Sensor Network: Sensor Cloud Service
It is sensor data middleware that provides users with a platform to receive data from remote field sensor networks, including a data interface and virtualization.
Typically characterized by the features:
- High Performance
- Scalability
- Reliability
- Open Architecture
- Spatial Query
- Arbitrary Processing Services
- Web Interface
- Web Services
- Sensor Virtualization
- Proprietary API
- Synchronization Services
- Open Standard API
- 3rd-party App Connectors
- Command Services
- Spatial Database
- Cloud Service (Hadoop/Hive)
Key Technology: What is Hadoop
- An open-source framework, free to use
- Distributed applications for large data
- Parallel processing
- Runs on commodity machines
- Scalable
- Widely adopted: in 2011, Facebook claimed to have the largest Hadoop cluster in the world, with 30 PB of storage and nearly 10,000 nodes.
- Hive is a data warehousing package on Hadoop. It provides a SQL-like language called HiveQL, available via Web GUI and JDBC.
Key Technology: Projects under the Hadoop umbrella
- Common: a set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
- MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS: a distributed filesystem that runs on large clusters of commodity machines.
- Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL, which the runtime engine translates to MapReduce jobs, for querying the data.
- Sqoop: a tool for efficiently moving data between relational databases and HDFS.
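The MapReduce processing model can be shown in a single process. This is a sketch of the model only, not Hadoop code: the record format `station,value` and the function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_reading(line):
    """Map step: parse 'station,value' and emit a (key, value) pair."""
    station, value = line.split(",")
    yield station, float(value)


def reduce_mean(station, values):
    """Reduce step: collapse all values for one key into their mean."""
    return station, sum(values) / len(values)


def run_job(lines):
    """Run map, shuffle (group by key), and reduce in a single process."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_reading(line):
            groups[key].append(value)
    return dict(reduce_mean(k, v) for k, v in groups.items())


records = ["s1,20.0", "s2,30.0", "s1,22.0", "s2,34.0"]
result = run_job(records)
```

On a cluster the map calls run in parallel on different machines and the grouping step moves data across the network; the per-record and per-key functions stay the same, which is what lets the model scale.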
Key Technology: Hadoop main components
Components: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker
- The NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
- DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
- The Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. It takes snapshots of the NameNode, which help minimize downtime and loss of data.
- The JobTracker is the liaison between your application and Hadoop. Once you submit your code to your cluster, the JobTracker determines the execution plan by determining which files to process, assigns nodes to different tasks, and monitors all tasks as they're running.
- TaskTrackers are responsible for executing the individual tasks that the JobTracker assigns, and they manage the execution of those tasks on each slave node.
[Figure: Hadoop main components, with the labels "Store & Process Data", "Keep Metadata & Distribute Job", and "1 PC, add more node". (Source: Lam, 2011)]
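The bookkeeping role of the NameNode can be sketched with a toy model. Everything here is hypothetical: the `NameNode` class, the 4-byte block size, and the round-robin replica placement are simplifications chosen for the example, and real HDFS placement is more involved.

```python
class NameNode:
    """Hypothetical bookkeeper: maps each file to blocks and block locations."""

    def __init__(self, datanodes, block_size, replication=3):
        self.datanodes = datanodes      # name -> dict of block_id -> bytes
        self.block_size = block_size
        self.replication = replication
        self.files = {}                 # filename -> list of block ids
        self.locations = {}             # block id -> list of datanode names

    def put(self, filename, data):
        """Split data into blocks and place replicas round-robin."""
        names = sorted(self.datanodes)
        block_ids = []
        for offset in range(0, len(data), self.block_size):
            index = len(block_ids)
            block_id = f"{filename}#{index}"
            chunk = data[offset:offset + self.block_size]
            targets = [names[(index + r) % len(names)]
                       for r in range(self.replication)]
            for name in targets:
                self.datanodes[name][block_id] = chunk
            self.locations[block_id] = targets
            block_ids.append(block_id)
        self.files[filename] = block_ids

    def get(self, filename, failed=()):
        """Reassemble a file, skipping any datanodes listed as failed."""
        parts = []
        for block_id in self.files[filename]:
            alive = [n for n in self.locations[block_id] if n not in failed]
            parts.append(self.datanodes[alive[0]][block_id])
        return b"".join(parts)


datanodes = {"dn1": {}, "dn2": {}, "dn3": {}, "dn4": {}}
namenode = NameNode(datanodes, block_size=4, replication=2)
namenode.put("readings.csv", b"s1,20.0\ns2,30.0\n")
restored = namenode.get("readings.csv", failed={"dn1"})
```

Because each block has a replica on a second node, the file can still be read when `dn1` is unavailable; the metadata itself lives only in the NameNode, which is why snapshots of it matter.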
Key Technology: Hive
Hive is a data warehousing package built on top of Hadoop. Its target users are data analysts who are comfortable with SQL and who need to do ad hoc queries, summarization, and data analysis on Hadoop-scale data. You interact with Hive by issuing queries in a SQL-like language called HiveQL via Web GUI and JDBC.
Key Technology: HiveQL (Source: White, 2011)
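An ad hoc summarization query of the kind described here might look like the following. SQLite from the Python standard library stands in for Hive so that the example runs anywhere; the table `readings` and its columns are hypothetical. The SELECT statement uses only constructs (WHERE, GROUP BY, COUNT, AVG, ORDER BY) that HiveQL also supports.

```python
import sqlite3

# SQLite stands in for Hive here; table and column names are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (station TEXT, sensor TEXT, value REAL)")
conn.executemany(
    "INSERT INTO readings VALUES (?, ?, ?)",
    [
        ("s1", "co2", 400.0),
        ("s1", "co2", 420.0),
        ("s2", "co2", 390.0),
    ],
)

# An ad hoc summary of the kind an analyst might issue.
query = """
    SELECT station, COUNT(*) AS n, AVG(value) AS mean_value
    FROM readings
    WHERE sensor = 'co2'
    GROUP BY station
    ORDER BY station
"""
rows = conn.execute(query).fetchall()
conn.close()
```

The difference in Hive is in execution, not in what the analyst writes: the query is compiled into MapReduce jobs over files in HDFS rather than run against a local database.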
Key Technology: How Hadoop benefits a Sensor Network
- Scalability: commodity hardware scales easily in many cases. Twenty Hadoop nodes may cost only as much as a single redundant database slave pair.
- Operational concerns: removing as many single-point-of-failure cases as possible is crucial to the smooth operation of a world-class service.
- Data processing speed: many system-wide calculations were simply not possible to perform with a monolithic system.
- Spatial processing and custom functions:
  - Spatial query: find point in polygon
  - Specific custom functions: interpolation, forecasting, model
Spatial Data Processing & Custom Function: Hive with Spatial and Custom Functions
- Use JTS (Java Topology Suite)
  - A pure Java library for spatial functions
  - It can easily be attached to a map/reduce task because Hadoop is a Java-native platform
  - Good performance and open source
- Use custom-developed user-defined functions
  - UDF (User-Defined Function)
  - UDAF (User-Defined Aggregate Function)
  - UDTF (User-Defined Table-Generating Function)
- Create a spatial function such as within using JTS and register it as a UDF; it can then run on Hive, which automatically generates the map/reduce jobs.
- Use the join method and LATERAL VIEW
Spatial Data Processing & Custom Function: Example of Spatial Custom Function
- Uses JTS (Java Topology Suite) and a UDF (user-defined function)
- Identifies the location of a GPS point (Lat, Lng) by searching in shape polygons
- Example: the point (139.702777, 35.694152) resolves to Prefecture: Tokyo, City: Shinjuku-ku, Grid Code: 533944151
- Throughput: 300,000+ points/sec
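The point-in-polygon lookup behind such a function can be sketched without JTS. The code below uses the standard ray-casting test in plain Python as a stand-in for a JTS `within` call; the two rectangular regions are made-up shapes, not real administrative boundaries, and the function names are hypothetical.

```python
def point_in_polygon(x, y, polygon):
    """Ray casting: count edge crossings of a ray going right from (x, y)."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through y?
        if (y1 > y) != (y2 > y):
            # x coordinate where the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def locate(point, regions):
    """Return the name of the first region containing the point, else None."""
    lng, lat = point
    for name, polygon in regions.items():
        if point_in_polygon(lng, lat, polygon):
            return name
    return None


# Hypothetical rectangular regions in (lng, lat); not real boundaries.
regions = {
    "region-a": [(139.0, 35.0), (140.0, 35.0), (140.0, 36.0), (139.0, 36.0)],
    "region-b": [(140.0, 35.0), (141.0, 35.0), (141.0, 36.0), (140.0, 36.0)],
}
found = locate((139.702777, 35.694152), regions)
missing = locate((130.0, 30.0), regions)
```

`locate` scans every region for every point, which is the naive form of the join; a production version would normally add a spatial index or a bounding-box pre-filter before the exact test.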
Performance Comparisons of Spatial Data Processing Techniques for a Large Scale Mobile Phone Dataset
[Figure: App vs. RDBMS vs. Hadoop; reported processing times range from 21 hours to 1 min]
Remark: 1 day of data = 20 million records
Sensor Network with Cloud: Hive and PostgreSQL (programming viewpoint)
[Figure labels: Java Servlet, RMI, Hibernate, Spring, Specific data processing, SQL, Hive (Metastore), MapReduce, Java, PostgreSQL, Hadoop]
Conclusion
- We designed the Enterprise Sensor Network to address current issues in sensor network development, such as handling large numbers of sensor nodes and large volumes of sensor data, real-time data processing, flexible data communication, and easy integration and installation.
- We proposed a Messaging Service and the Hadoop distributed platform as the main technologies to overcome those issues.
- On the sensor station side, we designed the system as services. The web server and the SOS service are separated and communicate with each other via RMI.
Conclusion (continued)
- The SOS service is a combination of several services that handle specific operations: the SOS interface, Command Service, Scheduler Service, Data Synchronization Service, and Data Feeder Service.
- The Data Feeder Service was designed so that custom feeders for vendor-specific sensors can be developed and plugged into the services.
- The combination of Sensor Station, Messaging Service, and Sensor Cloud Service enables the sensor network system to achieve real-time operation, scalability, and robustness.
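The plug-in idea behind the Data Feeder Service could be structured along these lines. This is an illustrative sketch only: the `DataFeeder` contract, the two example vendor formats, and the `DataFeederService` registry are all hypothetical and are not taken from the actual SOS service.

```python
from abc import ABC, abstractmethod


class DataFeeder(ABC):
    """Hypothetical plug-in contract for one vendor-specific sensor format."""

    @abstractmethod
    def parse(self, raw):
        """Convert a raw vendor payload into a common reading dict."""


class CsvFeeder(DataFeeder):
    """Example vendor A: 'sensor_name,value' text lines."""

    def parse(self, raw):
        name, value = raw.strip().split(",")
        return {"sensor": name, "value": float(value)}


class KeyValueFeeder(DataFeeder):
    """Example vendor B: 'sensor=name;value=number' text lines."""

    def parse(self, raw):
        fields = dict(part.split("=") for part in raw.strip().split(";"))
        return {"sensor": fields["sensor"], "value": float(fields["value"])}


class DataFeederService:
    """Keeps a registry of feeders so new formats plug in without changes."""

    def __init__(self):
        self.feeders = {}

    def register(self, vendor, feeder):
        self.feeders[vendor] = feeder

    def ingest(self, vendor, raw):
        return self.feeders[vendor].parse(raw)


service = DataFeederService()
service.register("vendor-a", CsvFeeder())
service.register("vendor-b", KeyValueFeeder())
a = service.ingest("vendor-a", "co2,412.5\n")
b = service.ingest("vendor-b", "sensor=co2;value=412.5")
```

Both vendor formats end up as the same reading structure, so the rest of the station (synchronization, messaging) never needs to know which sensor produced the data.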