Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo




Sensor Network Messaging Service Hive/Hadoop Mr. Apichon Witayangkurn apichon@iis.u-tokyo.ac.jp Department of Civil Engineering The University of Tokyo

Contents
1. Introduction
2. What & Why Sensor Network
3. Enterprise Sensor Network
4. Conclusion and Future Work

Introduction: Background
- Sensor technology is now widespread and available at low cost, e.g. weather, CO2, and radiation sensors.
- It is widely used in research and applications such as environmental monitoring, pollution monitoring, disaster monitoring, agricultural field monitoring, and traffic monitoring.
- Most applications are built around one specific sensor and purpose, so they are difficult to reuse for another purpose or with a different sensor.
- Sharing sensor information among systems is difficult due to a lack of standardization.
- OGC provides Sensor Web Enablement (SWE), but it does not focus on the concrete details of application development.

What & Why Sensor Network: Sensor Network
- A group of heterogeneous sensor systems connected through a communication infrastructure to exchange information between sensor stations or sensor nodes.
- All sensor nodes can link or synchronize data with each other or with a main station, so that together they act as a network.
- It is driven by progress in three technologies: sensors, field platforms, and the Internet.
- Diagram labels: Sensors, Platform, Internet, Sensor Network

What & Why Sensor Network: What is needed for Sensor Network

What & Why Sensor Network: Sensor Service GRID (SSG), Sensor Middleware

What & Why Sensor Network: Issues in Sensor Network
- How to handle a large sensor network (nodes and data).
- How to scale to a larger size with minimal effort.
- Insufficient processor, I/O, and storage resources for large-scale operation.
- Heterogeneous and vendor-specific sensors are difficult to connect to a sensor network.
- It must be able to operate over any network, even an unstable one.
- Real-time and near-real-time operation.
- It must provide a channel or interface for third-party applications to connect to and use the data in the sensor service.
- A standardized interface, for compatibility with other software.
- Rapid installation and ease of use.
- GIS-enabled visualization.
- Low cost??

Enterprise Sensor Network: The Goal
Design and develop a prototype sensor network system that supports various sensors, supports any network topology, and scales easily from small to large with minimal effort and human operation.
Key Features
- Large-scale support with cloud
- Massive data and real-time data processing
- Flexible data communication
- Easy integration, installation, and ease of use
- High-frequency and multi-dimension support
- Open standards and integration support
- Spatial data support

Enterprise Sensor Network: System Overview

Enterprise Sensor Network: Sensor Stations (SOSes)
- An SOS is a sensor station installed and deployed at a field site.
- It handles feeding data from sensors as well as sending data to the cloud service.
- It can be a fixed station or a mobile station with mission support.
- It is a combination of an SOS Service and a Web Server.
- It supports both push and pull data feeding.
- Stations are divided into 3 types based on their features:
  - Rich-node: fully functional, with web UI and 2-way control
  - Dump-node: data feeder only (storage, processing cost)
  - Virtual-node: shared resources, no hardware, more than one node

Enterprise Sensor Network: Sensor Station Design

Enterprise Sensor Network: Messaging Service as communication medium
- Enables 2-way control between stations and cloud services
- Supports multiple connectors
- Supports various types of message storage
- Load balancing and cluster support
(Source: ActiveMQ, Apache)

Enterprise Sensor Network: Network of Brokers
- Brokers can be linked together to form a network or cluster of brokers.
- A network of brokers can use various network topologies, such as hub-and-spoke, daisy chain, or mesh.
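To make the hub-and-spoke topology concrete, here is a minimal sketch in plain Python. It is not the ActiveMQ API: the `Broker` class, the `route` function, and the broker names are all hypothetical, and the breadth-first search only illustrates how a message could hop between linked brokers.

```python
from collections import deque


class Broker:
    """Hypothetical in-memory broker; not the ActiveMQ API."""

    def __init__(self, name):
        self.name = name
        self.peers = []      # directly linked brokers
        self.delivered = []  # messages that reached this broker's consumers

    def link(self, other):
        # Links are bidirectional so control messages can flow both ways.
        self.peers.append(other)
        other.peers.append(self)


def route(source, target_name, message):
    """Breadth-first search for the target broker; return the hop path taken."""
    queue = deque([[source]])
    visited = {source.name}
    while queue:
        path = queue.popleft()
        broker = path[-1]
        if broker.name == target_name:
            broker.delivered.append(message)
            return [b.name for b in path]
        for peer in broker.peers:
            if peer.name not in visited:
                visited.add(peer.name)
                queue.append(path + [peer])
    return None  # target is not reachable from the source


# Hub-and-spoke: two station brokers connected through one cloud hub.
hub = Broker("cloud-hub")
station_a = Broker("station-a")
station_b = Broker("station-b")
hub.link(station_a)
hub.link(station_b)

path = route(station_a, "station-b", {"sensor": "co2", "value": 412})
print(path)
```

A daisy chain or mesh would use the same `link` calls with a different wiring; only the resulting hop path changes.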

Enterprise Sensor Network: Sensor Cloud Service
- It is a sensor data middleware that gives users a platform for receiving data from remote field sensor networks, including a data interface and virtualization.
- Typical characteristics: High Performance, Scalability, Reliability, Open Architecture
- Diagram labels: Spatial Query, Arbitrary Processing Services, Web Interface, Web Services, Sensor Virtualization, Proprietary API, Synchronization Services, Open Standard API, 3rd App Connectors, Command Services, Spatial Database, Cloud Service (Hadoop/Hive)

Key Technology: What is Hadoop
- An open-source framework, free to use
- Built for distributed applications over large data
- Parallel processing
- Runs on commodity machines
- Scalable
- Widely known: in 2011, Facebook claimed to have the largest Hadoop cluster in the world, with 30 PB of storage and nearly 10,000 nodes.
- Hive is a data warehousing package on Hadoop. It provides a SQL-like language called HiveQL, accessible via a Web GUI and JDBC.

Key Technology: Projects under the Hadoop umbrella
- Common: a set of components and interfaces for distributed file systems and general I/O (serialization, Java RPC, persistent data structures).
- MapReduce: a distributed data processing model and execution environment that runs on large clusters of commodity machines.
- HDFS: a distributed filesystem that runs on large clusters of commodity machines.
- Hive: a distributed data warehouse. Hive manages data stored in HDFS and provides a query language based on SQL (which the runtime engine translates to MapReduce jobs) for querying the data.
- Sqoop: a tool for efficiently moving data between relational databases and HDFS.
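The MapReduce processing model can be sketched in a few lines of single-process Python. This is an illustration of the map, shuffle, and reduce steps only, not Hadoop's Java API; the function names and the sensor readings are made up for the example.

```python
from collections import defaultdict


def map_phase(record):
    """Emit (key, value) pairs; here one (sensor_id, reading) pair per record."""
    sensor_id, reading = record
    yield sensor_id, reading


def shuffle(pairs):
    """Group values by key, as the framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Collapse all values for one key into a single result (the mean)."""
    return key, sum(values) / len(values)


def run_job(records):
    pairs = [pair for record in records for pair in map_phase(record)]
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


# Hypothetical readings: (sensor id, temperature)
readings = [("temp-01", 20.0), ("temp-02", 30.0), ("temp-01", 22.0)]
averages = run_job(readings)
print(averages)
```

On a real cluster the map and reduce calls run in parallel on different machines and the shuffle moves data across the network; the data flow is the same.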

Key Technology: Hadoop main components
Components: NameNode, DataNode, Secondary NameNode, JobTracker, TaskTracker
- NameNode is the bookkeeper of HDFS; it keeps track of how your files are broken down into file blocks, which nodes store those blocks, and the overall health of the distributed filesystem.
- DataNodes are the workhorses of the filesystem. They store and retrieve blocks when they are told to (by clients or the NameNode), and they report back to the NameNode periodically with lists of the blocks they are storing.
- Secondary NameNode (SNN) is an assistant daemon for monitoring the state of the cluster HDFS. It takes periodic snapshots of the NameNode's metadata, which help minimize downtime and loss of data.
- JobTracker is the liaison between your application and Hadoop. Once you submit your code to the cluster, the JobTracker determines the execution plan by deciding which files to process, assigns nodes to different tasks, and monitors all tasks as they are running.
- TaskTrackers are responsible for executing the individual tasks that the JobTracker assigns, and they manage the execution of those tasks on each slave node.
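The NameNode/DataNode split described above can be illustrated with a toy model. This is a sketch under simplifying assumptions, not HDFS code: the class names mirror the roles, the block size and replication factor are tiny hypothetical values, placement is plain round-robin, and the `NameNode` here pushes the bytes itself, whereas in HDFS clients exchange block data with DataNodes directly.

```python
BLOCK_SIZE = 4   # bytes per block; tiny on purpose, real HDFS blocks are far larger
REPLICATION = 2  # copies kept of each block (hypothetical value)


class DataNode:
    """Stores block contents and nothing else."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}  # block_id -> bytes


class NameNode:
    """Keeps metadata only: file -> block ids -> DataNode names."""

    def __init__(self, datanodes):
        self.datanodes = datanodes
        self.file_blocks = {}      # filename -> [block_id, ...]
        self.block_locations = {}  # block_id -> [datanode name, ...]

    def write(self, filename, data):
        block_ids = []
        for offset in range(0, len(data), BLOCK_SIZE):
            number = offset // BLOCK_SIZE
            block_id = f"{filename}#{number}"
            # Round-robin placement; real HDFS placement is rack-aware.
            count = len(self.datanodes)
            targets = [self.datanodes[(number + r) % count] for r in range(REPLICATION)]
            for node in targets:
                node.blocks[block_id] = data[offset:offset + BLOCK_SIZE]
            self.block_locations[block_id] = [node.name for node in targets]
            block_ids.append(block_id)
        self.file_blocks[filename] = block_ids

    def read(self, filename):
        by_name = {node.name: node for node in self.datanodes}
        parts = []
        for block_id in self.file_blocks[filename]:
            first_replica = by_name[self.block_locations[block_id][0]]
            parts.append(first_replica.blocks[block_id])
        return b"".join(parts)


nodes = [DataNode("dn1"), DataNode("dn2"), DataNode("dn3")]
namenode = NameNode(nodes)
namenode.write("readings.csv", b"0123456789")
print(namenode.block_locations)
print(namenode.read("readings.csv"))
```

Because every block lives on two DataNodes, losing one node still leaves a readable copy, which is the property the SNN snapshots protect on the metadata side.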

Key Technology: Hadoop main components (figure)
Figure labels: Store & Process Data; Keep Metadata & Distribute Job; 1 PC; add more nodes
(Source: Lam, 2011)

Key Technology: Hive
- Hive is a data warehousing package built on top of Hadoop.
- Its target users remain data analysts who are comfortable with SQL and who need to do ad hoc queries, summarization, and data analysis on Hadoop-scale data.
- You interact with Hive by issuing queries in a SQL-like language called HiveQL via a Web GUI and JDBC.

Key Technology: HiveQL (Source: White, 2011)

Key Technology: How Hadoop benefits Sensor Network
- Scalability: commodity hardware scales easily in many cases. Twenty Hadoop nodes may cost only as much as a single redundant database slave pair.
- Operational concerns: removing as many single-point-of-failure cases as possible is crucial to the smooth operation of a world-class service.
- Data processing speed: many system-wide calculations were simply not possible to perform with a monolithic system.
- Spatial processing and custom functions:
  - Spatial query: find point in polygon
  - Specific custom functions: interpolation, forecasting, modeling

Spatial Data Processing & Custom Function: Hive with Spatial and Custom Functions
- Use JTS (Java Topology Suite)
  - A pure Java library for spatial functions
  - It can be attached easily to a map/reduce task because Hadoop itself runs on Java
  - Good performance and open source
- Use User-Defined Functions for custom development
  - UDF (User-Defined Function)
  - UDAF (User-Defined Aggregate Function)
  - UDTF (User-Defined Table-Generating Function)
- Create a spatial function such as "within" using JTS and register it as a UDF. It can then run on Hive, which generates the map/reduce jobs automatically.
- Use the Join method and Lateral View

Spatial Data Processing & Custom Function: Example of a Spatial Custom Function
- JTS (Java Topology Suite), used through a UDF (User-Defined Function)
- Identify the location of a GPS point (Lat, Lng) by searching in shape polygons
- Example point: 139.702777 35.694152
  - Prefecture: Tokyo
  - City: Shinjuku-ku
  - Grid: Code:533944151
- Throughput: 300,000++ points/sec
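The point-in-polygon lookup behind such a function can be sketched without JTS. The slides use JTS inside a Hive UDF; the code below swaps in a plain ray-casting test in Python purely for illustration. The function names are hypothetical, and the two rectangles are invented stand-ins, not real administrative boundaries.

```python
def point_in_polygon(lng, lat, polygon):
    """Ray casting: count edges crossed by a ray going east from the point."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > lat) != (y2 > lat):
            # Longitude at which the edge crosses that line.
            x_cross = x1 + (lat - y1) * (x2 - x1) / (y2 - y1)
            if lng < x_cross:
                inside = not inside
    return inside


def locate(lng, lat, areas):
    """Return the name of the first area containing the point, or None."""
    for name, polygon in areas.items():
        if point_in_polygon(lng, lat, polygon):
            return name
    return None


# Hypothetical rectangles as (lng, lat) vertices.
areas = {
    "area-A": [(139.68, 35.67), (139.74, 35.67), (139.74, 35.72), (139.68, 35.72)],
    "area-B": [(139.74, 35.67), (139.80, 35.67), (139.80, 35.72), (139.74, 35.72)],
}

print(locate(139.702777, 35.694152, areas))
```

A linear scan over all polygons is fine for a sketch; a production lookup would normally add a spatial index so each point is tested against only a few candidate polygons.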

Performance Comparisons of Spatial Data Processing Techniques for a Large Scale Mobile Phone Dataset
- App vs. RDBMS vs. Hadoop
- 21 hours vs. 1 min!!!
- Remark: 1 day of data = 20 million records
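As a worked check of these figures, the two processing times can be converted to records per second. The slide does not state which approach produced which time, so the sketch below only compares the two durations for the stated 20 million records; the variable names are illustrative.

```python
RECORDS_PER_DAY = 20_000_000  # "1 day of data = 20 million records"


def throughput(records, seconds):
    """Average records processed per second."""
    return records / seconds


slow = throughput(RECORDS_PER_DAY, 21 * 3600)  # 21 hours
fast = throughput(RECORDS_PER_DAY, 60)         # 1 minute
speedup = (21 * 3600) / 60

print(f"{slow:.0f} records/s vs {fast:.0f} records/s, {speedup:.0f}x faster")
```

The faster figure works out to roughly 333,000 records per second, which is in line with the "300,000++ points/sec" quoted for the spatial custom function.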

Sensor Network with Cloud: Hive and PostgreSQL (programming viewpoint)
Diagram labels: Java Servlet, RMI, Hibernate, Spring, Specific data processing, SQL, Hive (Metastore), MapReduce, Java, PostgreSQL, Hadoop

Conclusion
- We designed the Enterprise Sensor Network to address current issues in sensor network development, such as:
  - handling a large number of sensor nodes and a large volume of sensor data
  - real-time data processing
  - flexible data communication
  - easy integration and installation
- We proposed a Messaging Service and the Hadoop distributed platform as the main technologies to overcome those issues.
- On the sensor station side, we designed the system as services. The Web server and the SOS service are separated and communicate with each other via RMI.

Conclusion (continued)
- The SOS service is a combination of several services that handle specific operations: the SOS interface, Command Service, Scheduler Service, Data Synchronization Service, and Data Feeder Service.
- The Data Feeder Service was designed so that a custom feeder can be developed for a vendor-specific sensor and plugged into the services.
- The combination of Sensor Station, Messaging Service, and Sensor Cloud Service enables the sensor network system to achieve real-time operation, scalability, and robustness.
