Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park




Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS

Apache Hadoop What is Hadoop?
Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware
- Maintained by the Apache Software Foundation
- Assumes that hardware failures are common
Hadoop is primarily used for:
- Distributed storage
- Distributed computation
http://hadoop.apache.org/

Apache Hadoop What is Hadoop?
Historically, development of Hadoop began in 2005 as an open source implementation of a MapReduce framework
- Inspired by Google's MapReduce framework, as published in a 2004 paper by Jeffrey Dean and Sanjay Ghemawat (Google)
- Doug Cutting (Yahoo!) did the initial implementation
Hadoop consists of a distributed file system (HDFS), a scheduler and resource manager, and a MapReduce engine
- MapReduce is a programming model for processing large data sets in parallel on a distributed cluster
- Map() - a procedure that performs filtering and sorting
- Reduce() - a procedure that performs a summary operation
http://hadoop.apache.org/

Apache Hadoop What is Hadoop?
A number of frameworks have been built extending Hadoop, which are also part of Apache
- Cassandra - a scalable multi-master database with no single points of failure
- HBase - a scalable, distributed database that supports structured data storage for large tables
- Hive - a data warehouse infrastructure that provides data summarization and ad hoc querying
- Pig - a high-level data-flow language and execution framework for parallel computation
- ZooKeeper - a high-performance coordination service for distributed applications
http://hadoop.apache.org/

MapReduce High level overview
[Diagram: input splits from hdfs://path/input flow through map() tasks, then combine/partition/sort/shuffle, then reduce() tasks write part files to hdfs://path/output]

Apache Hadoop MapReduce The Word Count Example
Map
1. Each line is split into words
2. Each word is written to the map with the word as the key and a value of 1
Partition/Sort/Shuffle
1. The output of the mapper is sorted and grouped based on the key
2. Each key and its associated values are given to a reducer
Reduce
1. For each key (word) given, sum up the values (counts)
2. Emit the word and its count
[Diagram: lines of "red", "green", and "blue" tokens pass through three mappers, are partitioned/shuffled/sorted by key, and two reducers emit the totals green 3, red 4, blue 5]
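The three phases above can be sketched in plain Python. This only simulates what Hadoop distributes across mapper and reducer tasks; it is not Hadoop API code, and the input lines are illustrative:

```python
from itertools import groupby
from operator import itemgetter

def map_phase(lines):
    # Map: emit (word, 1) for every word on every line
    for line in lines:
        for word in line.split():
            yield (word, 1)

def shuffle_sort(pairs):
    # Partition/sort/shuffle: sort the mapper output and group it by key
    for key, group in groupby(sorted(pairs, key=itemgetter(0)), key=itemgetter(0)):
        yield key, [value for _, value in group]

def reduce_phase(grouped):
    # Reduce: sum the counts for each word and emit (word, total)
    for word, counts in grouped:
        yield word, sum(counts)

lines = ["red red blue red", "green green blue red", "green blue blue blue"]
result = dict(reduce_phase(shuffle_sort(map_phase(lines))))
print(result)  # {'blue': 5, 'green': 3, 'red': 4}
```

In a real cluster each phase runs on many nodes at once, and the shuffle moves data over the network so that all values for one key land on the same reducer.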

Apache Hadoop Hadoop Clusters
[Images: traditional Hadoop clusters and the Dredd cluster]

Adding GIS capabilities to Hadoop

[Diagram: a .jar job submitted to a Hadoop cluster]

Adding GIS Capabilities to Hadoop General approach
Need to reduce large volumes of data into manageable datasets that can be processed in the ArcGIS Platform
- Clipping
- Filtering
- Grouping

Adding GIS Capabilities to Hadoop Spatial data in Hadoop
Spatial data in Hadoop can show up in a number of different formats
Comma delimited, with the location defined in multiple fields:
ONTARIO,34.0544,-117.6058
RANCHO CUCAMONGA,34.1238,-117.5702
REDLANDS,34.0579,-117.1709
RIALTO,34.1136,-117.387
RUNNING SPRINGS,34.2097,-117.1135
Tab delimited, with the location defined in well-known text (WKT):
ONTARIO	POINT (34.0544 -117.6058)
RANCHO CUCAMONGA	POINT (34.1238 -117.5702)
REDLANDS	POINT (34.0579 -117.1709)
RIALTO	POINT (34.1136 -117.387)
RUNNING SPRINGS	POINT (34.2097 -117.1135)
JSON, with Esri's JSON defining the location:
{"attr":{"name":"ONTARIO"},"geometry":{"x":34.05,"y":-117.60}}
{"attr":{"name":"RANCHO"},"geometry":{"x":34.12,"y":-117.57}}
{"attr":{"name":"REDLANDS"},"geometry":{"x":34.05,"y":-117.17}}
{"attr":{"name":"RIALTO"},"geometry":{"x":34.11,"y":-117.38}}
{"attr":{"name":"RUNNING"},"geometry":{"x":34.20,"y":-117.11}}

GIS Tools for Hadoop Esri on GitHub
- GIS Tools for Hadoop (tools, samples) - tools and samples using the open source resources that solve specific problems
- Spatial Framework for Hadoop (hive: spatial-sdk-hive.jar, json: spatial-sdk-json.jar) - Hive user-defined functions for spatial processing and JSON helper utilities
- Geoprocessing Tools for Hadoop (HadoopTools.pyt) - geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs
- Geometry API Java (esri-geometry-api.jar) - Java geometry library for spatial data processing

GIS Tools for Hadoop Java geometry API
Topological operations
- Buffer
- Union
- Convex Hull
- Contains
- ...
In-memory indexing
Accelerated geometries for relationship tests
- Intersects, Contains, ...
Still being maintained on GitHub
https://github.com/esri/geometry-api-java

GIS Tools for Hadoop Java geometry API

OperatorContains opContains = OperatorContains.local();
for (Geometry geometry : someGeometryList) {
    opContains.accelerateGeometry(geometry, sRef,
        GeometryAccelerationDegree.enumMedium);
    for (Point point : somePointList) {
        boolean contains = opContains.execute(geometry, point, sRef, null);
    }
    OperatorContains.deaccelerateGeometry(geometry);
}

GIS Tools for Hadoop Hive spatial functions
Apache Hive supports analysis of large datasets in HDFS using a SQL-like language (HiveQL) while also maintaining full support for MapReduce
- Maintains additional metadata for data stored in Hadoop
- Specifically, a schema definition that maps the original data to rows and columns
- Allows SQL-like interaction with data using the Hive Query Language (HQL)
- Sample Hive table create statement for a simple CSV (next slide)
Hive user-defined functions (UDFs) that wrap geometry API operators
- Modeled on the OGC-compliant ST_Geometry type
https://github.com/esri/spatial-framework-for-hadoop

GIS Tools for Hadoop Hive spatial functions
Defining a table on CSV data with a spatial component:

CREATE TABLE IF NOT EXISTS earthquakes (
    earthquake_date STRING,
    latitude DOUBLE,
    longitude DOUBLE,
    magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Spatial query using the Hive UDFs (ST_Contains checks whether a polygon contains a point; ST_Point constructs a point from longitude and latitude):

SELECT counties.name, count(*) cnt
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape,
                  ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt DESC;

https://github.com/esri/spatial-framework-for-hadoop

GIS Tools for Hadoop Geoprocessing tools
Geoprocessing tools that allow ArcGIS to interact with large data stored in Hadoop
- Copy to HDFS - uploads files to HDFS
- Copy from HDFS - downloads files from HDFS
- Features to JSON - converts a feature class to a JSON file
- JSON to Features - converts a JSON file to a feature class
- Execute Workflow - executes Oozie workflows in Hadoop
https://github.com/esri/geoprocessing-tools-for-hadoop

[Diagram: a feature class is converted with Features to JSON, copied to HDFS, filtered by a job on the Hadoop cluster, copied back from HDFS, and converted with JSON to Features into a result]

DEMO Point in Polygon Demo Mike Park

Aggregate Hotspots
Traditional hotspots and big data
- Each feature is weighted, in part, by the values of its neighbors
- Neighborhood searches in very large datasets can be extremely costly without a spatial index
- The result of such an analysis would have as many features as the original data
Aggregate hotspots
- Features are aggregated and summarized into bins defined by a regular integer grid
- The size of the summarized data is not affected by the size of the original data, only the number of bins
- Hotspots can then be calculated on the summary data
Step 1. Map/Reduce to aggregate points into bins
Step 2. Map/Reduce to calculate global values for bin aggregates
Step 3. Map/Reduce to calculate hotspots using bins (next few slides)
[Diagram: per-bin summaries such as Count, Min, and Max computed over a grid of point values]
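Step 1 (binning) can be sketched as follows. The 100-unit cell size, the sample points, and the summarized fields are illustrative assumptions; in the real tools this runs as map and reduce tasks across the cluster rather than in one process:

```python
from collections import defaultdict

CELL = 100.0  # bin size in map units (illustrative assumption)

def to_bin(x, y):
    # Map step: snap each point to the integer grid cell that contains it
    return (int(x // CELL), int(y // CELL))

def aggregate(points):
    # Reduce step: summarize every bin (count, min, max of a weight field)
    bins = defaultdict(list)
    for x, y, w in points:
        bins[to_bin(x, y)].append(w)
    return {b: {"count": len(ws), "min": min(ws), "max": max(ws)}
            for b, ws in bins.items()}

points = [(12, 7, 2.0), (55, 80, 6.0), (140, 20, 7.0)]
summary = aggregate(points)
# cell (0, 0) holds the first two points; cell (1, 0) holds the third
```

However many input points there are, the output size depends only on how many bins are occupied, which is what makes the later hotspot pass cheap.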

DEMO Aggregate Hotspot Analysis Mike Park

Integrating Hadoop with ArcGIS

Integrating Hadoop with ArcGIS Moving forward
Optimizing data storage
- What's wrong with the current data storage
- Sorting and sharding
Spatial indexing
Data source
Geoprocessing
- Native implementations of key spatial statistical functions

Optimizing Data Storage Distribution of spatial data across nodes in a cluster
[Diagram: hdfs:///path/to/dataset is split into part-1.csv, part-2.csv, and part-3.csv, stored on nodes dredd0-dredd2 and processed on whichever nodes the scheduler assigns]

Point in Polygon in More Detail Using GIS Tools for Hadoop
1. The entire set of polygons is sent to every node
2. Each node builds an in-memory spatial index for quick lookups
3. Every point assigned to that node is bounced off the index to see which polygon contains the point
4. The nodes output their partial counts, which are then combined into a single result
Issues:
- Every record in the dataset had to be processed, but only a subset of the records contribute to the answer
- The memory requirements for the spatial index can be large as the number of polygons increases
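A single-node sketch of steps 1-4, assuming a bounding-box prefilter in place of a real spatial index (the actual tools use the Esri geometry API; the square polygon and test points are made up for illustration):

```python
def point_in_polygon(px, py, polygon):
    # Ray casting: count how many polygon edges a ray going right
    # from the point crosses; an odd count means the point is inside
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > py) != (y2 > py):
            # x coordinate where this edge crosses the ray's y level
            x_cross = x1 + (py - y1) * (x2 - x1) / (y2 - y1)
            if px < x_cross:
                inside = not inside
    return inside

def bbox(polygon):
    xs = [p[0] for p in polygon]
    ys = [p[1] for p in polygon]
    return min(xs), min(ys), max(xs), max(ys)

def count_points(polygons, points):
    # Each node holds all polygons plus a cheap bounding-box prefilter,
    # standing in for the in-memory spatial index of step 2
    boxes = {name: bbox(poly) for name, poly in polygons.items()}
    counts = {name: 0 for name in polygons}
    for px, py in points:
        for name, poly in polygons.items():
            xmin, ymin, xmax, ymax = boxes[name]
            if (xmin <= px <= xmax and ymin <= py <= ymax
                    and point_in_polygon(px, py, poly)):
                counts[name] += 1
    return counts

polygons = {"A": [(0, 0), (10, 0), (10, 10), (0, 10)]}
print(count_points(polygons, [(5, 5), (20, 20)]))  # {'A': 1}
```

The issues noted above show up directly here: every point is tested, and the whole polygon set (with its index) must fit in each node's memory.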

Optimizing Data Storage Ordering and sharding
Raw data in Hadoop is not optimized for spatial queries and analysis
Techniques for optimized data storage:
1. Sort the data in linearized space (e.g., along a space-filling curve)
2. Split the ordered data into equal density regions, known as shards
Shards ensure that the majority of features are co-located on the same machine as their neighbors
- This reduces network utilization when doing neighborhood searches
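One common way to sort data "in linearized space" is a space-filling curve such as the Z-order (Morton) curve; the deck does not say which curve is used, so this is an illustrative sketch, and the 16-bit width is an assumption:

```python
def morton_key(ix, iy, bits=16):
    # Interleave the bits of the integer cell coordinates so that
    # cells that are close in 2-D tend to get close 1-D keys
    key = 0
    for b in range(bits):
        key |= ((ix >> b) & 1) << (2 * b)
        key |= ((iy >> b) & 1) << (2 * b + 1)
    return key

cells = [(2, 0), (1, 1), (0, 0), (0, 1), (1, 0)]
ordered = sorted(cells, key=lambda c: morton_key(*c))
print(ordered)  # [(0, 0), (1, 0), (0, 1), (1, 1), (2, 0)]
```

Splitting this sorted sequence into equal-sized runs then yields shards whose features are mostly spatial neighbors, which is exactly the co-location property the slide describes.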

Hadoop and GIS Distribution of ordered spatial data across nodes in a cluster
[Diagram: spatially ordered part files distributed across nodes dredd0-dredd4]

Spatial Indexing Distributed quadtree
- The quadtree index of a dataset is composed of sub-indexes that are distributed across the cluster
- Each of these sub-indexes points to a shard with a 1-1 cardinality
- Each sub-index is stored on the same computer as the shard that it indexes
[Diagram: shards 0-4, each paired with its own data and sub-index]
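The shard/sub-index pairing can be sketched as a fan-out query. The shard bounds and point contents here are made-up illustrative values, and each flat point list stands in for a real per-shard quadtree:

```python
# Each shard carries its own bounds (a stand-in for its sub-index);
# a query touches only the shards whose bounds intersect the window
shards = {
    0: {"bounds": (0, 0, 50, 50),   "points": [(10, 10), (40, 45)]},
    1: {"bounds": (50, 0, 100, 50), "points": [(60, 20)]},
}

def query(window):
    xmin, ymin, xmax, ymax = window
    hits = []
    for shard in shards.values():
        sx1, sy1, sx2, sy2 = shard["bounds"]
        # Skip shards whose bounds miss the query window entirely
        if sx1 > xmax or sx2 < xmin or sy1 > ymax or sy2 < ymin:
            continue
        hits += [(x, y) for x, y in shard["points"]
                 if xmin <= x <= xmax and ymin <= y <= ymax]
    return hits

print(query((0, 0, 30, 30)))  # [(10, 10)]
```

Because each sub-index lives on the same machine as its shard, the per-shard filtering above runs locally, and only the (small) hits travel over the network.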

Point in Polygon Indexed Points Counting points in polygons using a spatially indexed dataset
- Rather than send every polygon to each node, we only send a subset of the polygons
- Each node queries the index for points that are contained in its polygon subset
- The counts from each node are then combined to produce the final result

DEMO Filtering Areas of Interest with Features Mike Park

Conclusion Miscellaneous clever and insightful statements Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS

Caching I/O Reads
[Diagram: cached reads across nodes dredd0-dredd4; presenter note: "Where should this go?"]