How To Use Hadoop For GIS



Transcription:

2013 Esri International User Conference, July 8-12, 2013, San Diego, California. Technical Workshop. Big Data: Using ArcGIS with Apache Hadoop. David Kaiser, Erik Hoel. Offering 1330.

In this technical workshop

This presentation is for anyone who uses ArcGIS and is interested in analyzing large amounts of data. We will cover:
- Big Data overview
- The Hadoop platform
- How Esri's GIS Tools for Hadoop enable developers to process spatial data on Hadoop
- How ArcGIS can leverage these custom Hadoop applications for GIS analysis

Big Data Overview

Big Data

Within ArcGIS, geoprocessing was enhanced at 10.1 SP1 to support 64-bit address spaces.
- This is sufficient to handle traditional large GIS datasets
However, this solution may run into problems when confronted with datasets of a size colloquially referred to as Big Data
- Internet-scale datasets

Age of Data Ubiquity

Data is now central to our existence, both for corporations and individuals. Nimble, thin, data-centric apps exploit massive data sets generated by both enterprises and consumers.
- Hardware era: 20-30 years
- Software era: 20-30 years
- Data era: ?

Age of Data Ubiquity

"The Internet has caused a Cambrian explosion of new life forms, new applications, new data-centric APIs; literally thousands and thousands of new APIs are born every month." - Mike Hoskins

Sensor data challenge

Sensors are continuously observing our world.
- With increases in networking technology, we are moving away from pure remote sensing and toward direct sensing, where every sensor reports its own data
One researcher says: "We are instrumenting the universe," referring to what is often called the Internet of Things.

Internet of Things

IoT: uniquely identifiable objects and their virtual representations in an Internet-like structure; the term was proposed by Kevin Ashton (MIT) in 1999.
Your old refrigerator was a dumb refrigerator. Your new refrigerator is a smart refrigerator:
- It's a digital asset aware of itself
- It records time, temperatures, vibrations
- It records the electricity it's consuming
- It's connected to the Internet as an IP device
More realistically, think of commercial jets or valuable equipment. By continuously observing things, we obtain a massive amount of valuable information.

Volume of data - the rise of data anarchy

If all 7 billion people on Earth joined Twitter and continually tweeted for one century, they would generate one zettabyte of data, i.e., a billion terabytes (Hadhazy, 2010). Almost double that amount, 1.8 zettabytes (1.8 × 10^21 bytes), was generated globally in 2011 (and 2.8 ZB in 2012), rising 40-50% per year, with over 40 ZB estimated by 2020.

Example

For every hour that it runs, a commercial jet aircraft can create 20 gigabytes of operations information. For a single journey across the Atlantic Ocean, a two-engine jet can create over 125 GB of data. Multiply that by the more than 100,000 flights flown each day, and you get an understanding of the enormous amount of data that exists
- E.g., 5-10 petabytes/day; 2-4 exabytes/year
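As a rough sanity check of those figures (the average flight length used here is an assumed, illustrative value, not from the slide):

```python
# Rough sanity check of the slide's daily-volume estimate.
GB_PER_HOUR = 20            # from the slide: ~20 GB of operations data per flight hour
FLIGHTS_PER_DAY = 100_000   # from the slide: flights flown each day
AVG_HOURS = 2.5             # assumption: illustrative average flight duration

gb_per_day = GB_PER_HOUR * FLIGHTS_PER_DAY * AVG_HOURS
pb_per_day = gb_per_day / 1_000_000   # 1 PB = 1,000,000 GB (decimal units)
print(pb_per_day)  # 5.0 -> the low end of the slide's 5-10 PB/day estimate
```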

All this data has business value
- Smart sensors: electrical meters
- GPS telemetry: vehicle tracking, smartphone data collectors
- Internet services: e-commerce transactions, social media
- Monitoring sensors: heavy equipment, aircraft, etc.

Value when analyzing data at mass scale

As observations increase in frequency, each individual observation is worth less, while the set of all observations becomes more valuable. One single metric from a jet aircraft is much less useful than the analysis of that metric against the same metric from every known flight of that aircraft over time. Big Data refers to this accumulation and to the analytical processes that use the data for business value.

What is Hadoop?

Racks of Hadoop servers inside the Facebook.com data center

The need for distributed systems

As data volumes have grown, individual computing systems have not scaled equivalently, e.g.:
- Single-node databases
- Single computers
Big data has led to distributed systems:
- Data is stored across a number of servers
- Processing the data takes place on the server where the data is stored

Legacy system architecture (diagram: processors and disks connected over a network; the network is the bottleneck)

Distributed system architecture (diagram: processing elements, each combining storage and compute)

Until Now

Google implemented their enterprise on a distributed network of many nodes, fusing storage and processing into each node. Hadoop is an open source implementation of the framework that Google has built their business around for many years.

Hadoop

An open-source data processing framework that provides a scalable, distributed system for both storage and computation of data.
- It supports running applications on large clusters of commodity hardware
- Hadoop was derived from Google's MapReduce and Google File System (GFS) papers
The platform consists of the kernel, HDFS, and MapReduce.
- Other applications have been built on Hadoop, including Hive, HBase, etc.

MapReduce

A programming model for processing large data sets with a parallel, distributed algorithm on a cluster. A MapReduce program is composed of:
- A Map() procedure that performs filtering and sorting, and
- A Reduce() procedure that performs a summary operation
The MapReduce system marshals the distributed servers, runs the various tasks in parallel, manages all communication and data transfers, and provides redundancy and fault tolerance. MapReduce libraries have been written in many languages; Hadoop is a popular open source implementation.
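As a sketch of the model (plain Python, not any specific Hadoop API), here is a word count expressed as a Map() and Reduce() pair, with an in-process sort standing in for the framework's shuffle/sort phase; all names here are illustrative:

```python
from itertools import groupby

def map_fn(records):
    """Map(): emit a (key, 1) pair for every word in every input record."""
    for record in records:
        for word in record.split():
            yield word, 1

def reduce_fn(pairs):
    """Reduce(): sum the counts for each key; assumes the pairs arrive
    sorted by key, which the framework's shuffle/sort phase guarantees."""
    for key, group in groupby(pairs, key=lambda kv: kv[0]):
        yield key, sum(count for _, count in group)

records = ["big data big", "hadoop data"]
mapped = sorted(map_fn(records))   # stands in for the shuffle/sort phase
counts = dict(reduce_fn(mapped))
print(counts)  # {'big': 2, 'data': 2, 'hadoop': 1}
```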

High Level MapReduce Walk-Through

An instance of a MapReduce program is a job. The job accepts arguments for data inputs and outputs, and combines:
- Functions from the MapReduce framework: splitting large inputs into smaller pieces, reading inputs and writing outputs
- Functions written by the application developer: a Map function that maps input values to keys, and a Reduce function that reduces many keys to one

High Level MapReduce Walk-Through (diagram sequence; inputs and outputs shown as hdfs://path/to/input and hdfs://path/to/output)
- An instance of a MapReduce program is a job
- The job accepts arguments for data inputs and outputs
- The MapReduce framework splits the data space
- Many map() functions are run in parallel
- The framework runs combine to join map() outputs
- The framework performs a shuffle and sort on all data
- The reduce() functions work against the sorted data
- Reducers each write a part of the results to a file (part 1, part 2, ...)
- When the job has finished, ArcGIS can retrieve the results

MapReduce Polygon Count Example (diagram)
- Input splits of polygon labels: Split 1: AK AK OR AK WA; Split 2: WA OR AK; Split 3: WA OR OR OR
- Map: each mapper emits a (state, 1) pair for every label in its split (AK 1, AK 1, OR 1, ...)
- Shuffle/Sort: the pairs are grouped by state across all mappers
- Reduce: Washington WA 3, Alaska AK 4, Oregon OR 5
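The polygon-count flow can be simulated in a few lines of plain Python (a sketch, not Hadoop itself); the split contents below are chosen to be consistent with the slide's final totals of WA 3, AK 4, OR 5, since the transcribed splits do not quite add up:

```python
from collections import defaultdict

# Splits chosen to match the slide's final totals (WA 3, AK 4, OR 5).
splits = [
    ["AK", "AK", "OR", "AK", "WA"],
    ["WA", "OR", "AK"],
    ["WA", "OR", "OR", "OR"],
]

# Map phase: one mapper per split emits (state, 1) pairs.
mapped = [[(state, 1) for state in split] for split in splits]

# Shuffle/sort phase: group all values by key across every mapper's output.
shuffled = defaultdict(list)
for pairs in mapped:
    for state, one in pairs:
        shuffled[state].append(one)

# Reduce phase: sum each key's values.
totals = {state: sum(ones) for state, ones in shuffled.items()}
print(totals)  # {'AK': 4, 'OR': 5, 'WA': 3}
```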

Yahoo! Hadoop Cluster in 2011
- ~10,000 servers in a number of clusters
- Largest cluster is 1,600 nodes
- Nearly 1 petabyte of user data
- Yahoo! runs nearly 10,000 research jobs per month

We didn't bring 10,000 servers to the user conference, but we do have a small Hadoop cluster we can use for demos.

GIS Tools for Hadoop

(Image credit: Fuqing Zhang and Yonghui Weng, Pennsylvania State University; Frank Marks, NOAA; Gregory P. Johnson, Romy Schneider, John Cazes, Karl Schulz, Bill Barth, The University of Texas at Austin)

GIS Tools for Hadoop Hadoop is a data processing system that is designed to store and process large amounts of data The most common Hadoop data processing task is to reduce a large amount of data to a smaller, more manageable amount of data The GIS Tools for Hadoop provide query functions and API methods that enable Hadoop application developers to perform this data reduction process on spatial data

Data Reduction Patterns Need to reduce large volumes of data into manageable datasets that can be processed in the ArcGIS Platform - Filtering - Grouping - Simple binning against grid cells - Aggregation into known spatial areas - More complex patterns if desired

GIS Tools for Hadoop

Developer API to support data reduction workflows
- Released in March 2013 at the Esri Developer Summit
Spatial reduction with full geometry capabilities
- Relational operators (touch, disjoint, ...)
- Topological operators (buffer, union, intersection, ...)
- Accessible via SQL expressions and Java
Small set of GP tools to migrate data and invoke jobs

GIS Tools for Hadoop components:
- Tools and samples: use the open source resources below to solve specific problems
- Spatial Framework for Hadoop: spatial type functions for Hive, a SQL engine on Hadoop (spatial-sdk-hive.jar), plus JSON helper utilities (spatial-sdk-json.jar)
- Geoprocessing Tools for Hadoop (HadoopTools.pyt): geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs
- Geometry API Java (esri-geometry-api.jar): Java geometry library for spatial data processing

DEMONSTRATION: MapReduce Demos

Earthquake Measurement Data

Data format is CSV, already stored in an HDFS file. Each record carries the coordinates, depth, and magnitude:

1964/03/28 06:42:57.00,57.98,-151.6,33.0,5.2,ML,0
1964/03/28 06:43:54.40,58.26,-151.25,4.0,6.1,ML,0
1964/03/28 06:50:52.00,56.9,-151.7,33.0,5.1,ML,0
1964/03/28 06:53:35.90,58.79,-149.54,20.0,5.7,ML,0
1964/03/28 07:09:07.60,59.79,-148.1,13.0,5.3,ML,0
1964/03/28 07:10:22.00,58.83,-149.29,17.0,6.2,ML,0
1964/03/28 07:16:16.70,58.0,-150.7,33.0,5.3,ML,0
1964/03/28 07:24:24.60,59.62,-148.7,20.0,5.1,ML,0
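Parsing one of these records is straightforward; the field names below are assumptions inferred from the slide's annotations (coordinates, depth, magnitude), and the meaning of the last column is not given on the slide, so it is kept as an opaque string:

```python
import csv

# One record from the slide; field layout inferred from the slide's annotations.
record = "1964/03/28 06:42:57.00,57.98,-151.6,33.0,5.2,ML,0"

row = next(csv.reader([record]))
quake = {
    "timestamp": row[0],
    "latitude": float(row[1]),
    "longitude": float(row[2]),
    "depth_km": float(row[3]),
    "magnitude": float(row[4]),
    "mag_type": row[5],
    "extra": row[6],   # meaning not given on the slide
}
print(quake["latitude"], quake["magnitude"])  # 57.98 5.2
```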

DEMONSTRATION: Grouping by Polygons

In this demo, we use ArcGIS to provide the polygonal boundaries of the U.S. states, and the Hadoop application aggregates earthquake data by the state boundaries.
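A minimal sketch of the aggregate-by-polygon idea in plain Python (toy square polygons rather than real state boundaries, and a ray-casting test rather than the Esri Geometry API):

```python
def point_in_polygon(x, y, polygon):
    """Return True if (x, y) is inside the polygon (a list of (x, y) vertices),
    using the standard ray-casting test."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge crosses the horizontal ray at height y
            if x < (x2 - x1) * (y - y1) / (y2 - y1) + x1:
                inside = not inside
    return inside

# Toy square "state" polygons (illustrative, not real boundaries).
states = {
    "A": [(0, 0), (10, 0), (10, 10), (0, 10)],
    "B": [(10, 0), (20, 0), (20, 10), (10, 10)],
}
epicenters = [(3, 4), (12, 5), (15, 2), (25, 25)]  # (x, y) points

counts = {name: 0 for name in states}
for x, y in epicenters:
    for name, poly in states.items():
        if point_in_polygon(x, y, poly):
            counts[name] += 1
            break
print(counts)  # {'A': 1, 'B': 2} - the (25, 25) point falls outside both
```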

DEMONSTRATION: Grouping into Grid Cells

In this demo, we compute arbitrary grid cells, such as a 1 km grid, and aggregate earthquakes by these cells.
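Binning into fixed-size grid cells can be sketched with integer division; the 1000-unit cell size here is an assumption that stands in for "1 km" and presumes the coordinates are already projected to meters:

```python
from collections import Counter

CELL = 1000.0  # assumed cell edge length in projected units (e.g., meters)

def cell_id(x, y, size=CELL):
    """Map a projected coordinate to the (column, row) index of its grid cell."""
    return (int(x // size), int(y // size))

# Illustrative projected point coordinates.
points = [(150.0, 220.0), (980.0, 40.0), (1500.0, 300.0), (1600.0, 900.0)]
counts = Counter(cell_id(x, y) for x, y in points)
print(counts)  # two points land in cell (0, 0) and two in cell (1, 0)
```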

The Future

GIS Tools for Hadoop is a developer story
- Our initial foray into Big Data
Esri will continue forward in this domain
- User stories and technology will be the focus
Swing by the island to chat further. Please fill out the evaluations: offering 1330.

Questions?