Big Data: Using ArcGIS with Apache Hadoop Erik Hoel and Mike Park
Outline Overview of Hadoop Adding GIS capabilities to Hadoop Integrating Hadoop with ArcGIS
Apache Hadoop
What is Hadoop?
Hadoop is a scalable open source framework for the distributed processing of extremely large data sets on clusters of commodity hardware
- Maintained by the Apache Software Foundation
- Assumes that hardware failures are common
Hadoop is primarily used for:
- Distributed storage
- Distributed computation
http://hadoop.apache.org/
Apache Hadoop
What is Hadoop?
Development of Hadoop began in 2005 as an open source implementation of a MapReduce framework
- Inspired by Google's MapReduce framework, as published in a 2004 paper by Jeffrey Dean and Sanjay Ghemawat (Google Lab)
- Doug Cutting (Yahoo!) did the initial implementation
Hadoop consists of a distributed file system (HDFS), a scheduler and resource manager, and a MapReduce engine
- MapReduce is a programming model for processing large data sets in parallel on a distributed cluster
- Map(): a procedure that performs filtering and sorting
- Reduce(): a procedure that performs a summary operation
http://hadoop.apache.org/
Apache Hadoop
What is Hadoop?
A number of frameworks have been built extending Hadoop, which are also part of Apache
- Cassandra: a scalable multi-master database with no single points of failure
- HBase: a scalable, distributed database that supports structured data storage for large tables
- Hive: a data warehouse infrastructure that provides data summarization and ad hoc querying
- Pig: a high-level data-flow language and execution framework for parallel computation
- ZooKeeper: a high-performance coordination service for distributed applications
http://hadoop.apache.org/
MapReduce
High-level overview
[Diagram: input at hdfs://path/input is split; each split feeds a map() task; map output is combined, partitioned, sorted, and shuffled to reduce() tasks, which write part files to hdfs://path/output]
Apache Hadoop
MapReduce: The Word Count Example
Map
1. Each line is split into words
2. Each word is written to the map output with the word as the key and a value of 1
Partition/Sort/Shuffle
1. The output of the mapper is sorted and grouped based on the key
2. Each key and its associated values are given to a reducer
Reduce
1. For each key (word) given, sum up the values (counts)
2. Emit the word and its count
[Diagram: lines of color words flow through three map tasks; the (word, 1) pairs are partitioned, shuffled, and sorted by key; two reducers emit the totals green 3, red 4, blue 5]
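The three phases above can be illustrated in plain Java. This is an in-memory sketch of the map, shuffle, and reduce steps, not the actual Hadoop Mapper/Reducer API; the class and method names here are invented for the example:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Map phase: each line is split into words; emit a (word, 1) pair per word.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.trim().split("\\s+"))
                .map(w -> Map.entry(w, 1))
                .collect(Collectors.toList());
    }

    // Shuffle phase: group the values by key, as the framework would between
    // map and reduce. A TreeMap mirrors the sort step.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        Map<String, List<Integer>> grouped = new TreeMap<>();
        for (Map.Entry<String, Integer> p : pairs)
            grouped.computeIfAbsent(p.getKey(), k -> new ArrayList<>()).add(p.getValue());
        return grouped;
    }

    // Reduce phase: sum the counts for each word and emit (word, count).
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    static Map<String, Integer> wordCount(List<String> lines) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String line : lines) pairs.addAll(map(line));
        return reduce(shuffle(pairs));
    }
}
```

In a real Hadoop job the map and reduce functions run on different machines and the shuffle happens over the network; the logic per phase is the same.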
Apache Hadoop
Hadoop Clusters
[Photos: traditional Hadoop clusters and the Dredd cluster]
Adding GIS capabilities to Hadoop
[Diagram: a .jar MapReduce job submitted to a Hadoop cluster]
Adding GIS Capabilities to Hadoop
General approach
Need to reduce large volumes of data into manageable datasets that can be processed in the ArcGIS Platform
- Clipping
- Filtering
- Grouping
Adding GIS Capabilities to Hadoop
Spatial data in Hadoop
Spatial data in Hadoop can show up in a number of different formats

Comma delimited, with the location defined in multiple fields:
ONTARIO,34.0544,-117.6058
RANCHO CUCAMONGA,34.1238,-117.5702
REDLANDS,34.0579,-117.1709
RIALTO,34.1136,-117.387
RUNNING SPRINGS,34.2097,-117.1135

Tab delimited, with the location defined in well-known text (WKT):
ONTARIO            POINT(34.0544 -117.6058)
RANCHO CUCAMONGA   POINT(34.1238 -117.5702)
REDLANDS           POINT(34.0579 -117.1709)
RIALTO             POINT(34.1136 -117.387)
RUNNING SPRINGS    POINT(34.2097 -117.1135)

JSON, with Esri's JSON defining the location:
{"attr":{"name":"ONTARIO"},"geometry":{"x":34.05,"y":-117.60}}
{"attr":{"name":"RANCHO"},"geometry":{"x":34.12,"y":-117.57}}
{"attr":{"name":"REDLANDS"},"geometry":{"x":34.05,"y":-117.17}}
{"attr":{"name":"RIALTO"},"geometry":{"x":34.11,"y":-117.38}}
{"attr":{"name":"RUNNING"},"geometry":{"x":34.20,"y":-117.11}}
GIS Tools for Hadoop
Esri on GitHub
- GIS Tools for Hadoop (tools, samples): tools and samples using the open source resources that solve specific problems
- Spatial Framework for Hadoop (hive: spatial-sdk-hive.jar; json: spatial-sdk-json.jar): Hive user-defined functions for spatial processing, plus JSON helper utilities
- Geoprocessing Tools for Hadoop (HadoopTools.pyt): geoprocessing tools that copy to/from Hadoop, convert to/from JSON, and invoke Hadoop jobs
- Geometry API Java (esri-geometry-api.jar): Java geometry library for spatial data processing
GIS Tools for Hadoop
Java geometry API
Topological operations
- Buffer
- Union
- Convex Hull
- Contains
- ...
In-memory indexing
Accelerated geometries for relationship tests
- Intersects, Contains, ...
Still being maintained on GitHub
https://github.com/esri/geometry-api-java
GIS Tools for Hadoop
Java geometry API

OperatorContains opContains = OperatorContains.local();
for (Geometry geometry : someGeometryList) {
    // Accelerate the polygon once, then test each point against it
    opContains.accelerateGeometry(geometry, sref, GeometryAccelerationDegree.enumMedium);
    for (Point point : somePointList) {
        boolean contains = opContains.execute(geometry, point, sref, null);
    }
    OperatorContains.deaccelerateGeometry(geometry);
}
GIS Tools for Hadoop
Hive spatial functions
Apache Hive supports analysis of large datasets in HDFS using a SQL-like language (HiveQL) while also maintaining full support for MapReduce
- Maintains additional metadata for data stored in Hadoop; specifically, a schema definition that maps the original data to rows and columns
- Allows SQL-like interaction with data using the Hive Query Language (HiveQL)
Hive user-defined functions (UDFs) wrap the geometry API operators
- Modeled on the OGC-compliant ST_Geometry type
https://github.com/esri/spatial-framework-for-hadoop
GIS Tools for Hadoop
Hive spatial functions
Defining a table on CSV data with a spatial component:

CREATE TABLE IF NOT EXISTS earthquakes (
  earthquake_date STRING,
  latitude DOUBLE,
  longitude DOUBLE,
  magnitude DOUBLE)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';

Spatial query using the Hive UDFs (ST_Contains checks whether a polygon contains a point; ST_Point constructs a point from longitude and latitude):

SELECT counties.name, count(*) cnt
FROM counties
JOIN earthquakes
WHERE ST_Contains(counties.boundaryshape,
                  ST_Point(earthquakes.longitude, earthquakes.latitude))
GROUP BY counties.name
ORDER BY cnt DESC;

https://github.com/esri/spatial-framework-for-hadoop
GIS Tools for Hadoop
Geoprocessing tools
Geoprocessing tools that allow ArcGIS to interact with large data stored in Hadoop
- Copy to HDFS: uploads files to HDFS
- Copy from HDFS: downloads files from HDFS
- Features to JSON: converts a feature class to a JSON file
- JSON to Features: converts a JSON file to a feature class
- Execute Workflow: executes Oozie workflows in Hadoop
https://github.com/esri/geoprocessing-tools-for-hadoop
[Diagram: Features to JSON converts ArcGIS features to JSON; Copy to HDFS uploads the JSON to the Hadoop cluster, where a filter job runs; Copy from HDFS downloads the JSON output, and JSON to Features converts it back into a feature class result]
DEMO Point in Polygon Demo Mike Park
Aggregate Hotspots
Traditional hotspots and big data
- Each feature is weighted, in part, by the values of its neighbors
- Neighborhood searches in very large datasets can be extremely costly without a spatial index
- The result of such an analysis would have as many features as the original data
Aggregate hotspots
- Features are aggregated and summarized into bins defined by a regular integer grid
- The size of the summarized data is not affected by the size of the original data, only by the number of bins
- Hotspots can then be calculated on the summary data
Step 1: Map/Reduce to aggregate points into bins
Step 2: Map/Reduce to calculate global values for the bin aggregates
Step 3: Map/Reduce to calculate hotspots using the bins
[Diagram: points aggregated into grid bins, each bin carrying count, min, and max statistics]
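Step 1 (binning) can be sketched in memory as follows. The bin statistics (count, min, max) mirror those shown on the slide; the class and method names are hypothetical, and in the real job the cell key would be the map output key with the aggregation done in the reducer:

```java
import java.util.*;

public class BinAggregator {
    // One bin of the regular integer grid: a count plus min/max of a weight value.
    static class Bin {
        int count = 0;
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        void add(double value) {
            count++;
            min = Math.min(min, value);
            max = Math.max(max, value);
        }
    }

    // Map each point (x, y, weight) to an integer grid cell of side binSize,
    // then aggregate per cell. The cell key "col,row" plays the role of the
    // map output key in the Map/Reduce version.
    static Map<String, Bin> aggregate(double[][] points, double binSize) {
        Map<String, Bin> bins = new HashMap<>();
        for (double[] p : points) {
            long col = (long) Math.floor(p[0] / binSize);
            long row = (long) Math.floor(p[1] / binSize);
            bins.computeIfAbsent(col + "," + row, k -> new Bin()).add(p[2]);
        }
        return bins;
    }
}
```

However many input points there are, the output size is bounded by the number of occupied bins, which is what makes the later hotspot step cheap.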
DEMO Aggregate Hotspot Analysis Mike Park
Integrating Hadoop with ArcGIS
Integrating Hadoop with ArcGIS
Moving forward
Optimizing data storage
- What's wrong with the current data storage
- Sorting and sharding
Spatial indexing
Data sources
Geoprocessing
- Native implementations of key spatial statistical functions
Optimizing Data Storage
Distribution of spatial data across nodes in a cluster
[Diagram: hdfs:///path/to/dataset is split into part-1.csv, part-2.csv, and part-3.csv, stored on nodes dredd0, dredd1, and dredd2; a part may be processed on a different node than the one that stores it]
Point in Polygon in More Detail
Using GIS Tools for Hadoop
1. The entire set of polygons is sent to every node
2. Each node builds an in-memory spatial index for quick lookups
3. Every point assigned to that node is bounced off the index to see which polygon contains it
4. The nodes output their partial counts, which are then combined into a single result
Issues:
- Every record in the dataset had to be processed, but only a subset of the records contribute to the answer
- The memory requirements for the spatial index can be large as the number of polygons increases
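A minimal sketch of one node's share of this work, assuming a simple ray-casting containment test in place of the geometry API's accelerated OperatorContains (no spatial index is built here, and all names are invented for the example):

```java
import java.util.*;

public class PointInPolygonCount {
    // Ray-casting point-in-polygon test; a stand-in for the accelerated
    // containment operator in the Esri geometry API.
    static boolean contains(double[][] poly, double x, double y) {
        boolean inside = false;
        for (int i = 0, j = poly.length - 1; i < poly.length; j = i++) {
            double xi = poly[i][0], yi = poly[i][1];
            double xj = poly[j][0], yj = poly[j][1];
            // Toggle when a horizontal ray from (x, y) crosses edge (j, i)
            if ((yi > y) != (yj > y)
                    && x < (xj - xi) * (y - yi) / (yj - yi) + xi)
                inside = !inside;
        }
        return inside;
    }

    // One node's work: every polygon is available locally (step 1); each point
    // assigned to this node is tested (step 3) and a partial count per polygon
    // id is produced (step 4 combines these across nodes).
    static Map<Integer, Integer> partialCounts(Map<Integer, double[][]> polygons,
                                               double[][] points) {
        Map<Integer, Integer> counts = new HashMap<>();
        for (double[] pt : points)
            polygons.forEach((id, poly) -> {
                if (contains(poly, pt[0], pt[1])) counts.merge(id, 1, Integer::sum);
            });
        return counts;
    }
}
```

The nested loop over all polygons for every point is exactly the cost the in-memory index avoids, which is why step 2 matters as the polygon count grows.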
Optimizing Data Storage
Ordering and sharding
Raw data in Hadoop is not optimized for spatial queries and analysis
Techniques for optimized data storage:
1. Sort the data in linearized space
2. Split the ordered data into equal-density regions, known as shards
Shards ensure that the majority of features are co-located on the same machine as their neighbors
- This reduces network utilization when doing neighborhood searches
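One common way to sort data in linearized space is a Z-order (Morton) curve; the slides do not name the specific curve used, so this is an assumed illustration:

```java
public class ZOrder {
    // Interleave the bits of two 16-bit grid coordinates into a Morton key.
    // Sorting records by this key places spatial neighbors near each other in
    // the linear order, so splitting the sorted run into equal-count shards
    // tends to keep neighboring features on the same machine.
    static long mortonKey(int col, int row) {
        long key = 0;
        for (int b = 0; b < 16; b++) {
            key |= ((long) (col >> b & 1)) << (2 * b);      // even bits from col
            key |= ((long) (row >> b & 1)) << (2 * b + 1);  // odd bits from row
        }
        return key;
    }
}
```

For example, the four cells of a 2x2 grid get keys 0, 1, 2, 3 in the Z-shaped visiting order, so all four stay contiguous in the sorted file.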
Hadoop and GIS
Distribution of ordered spatial data across nodes in a cluster
[Diagram: hdfs:///path/to/dataset split into ordered parts part-1, part-2, and part-3, distributed across nodes dredd0 through dredd4]
Spatial Indexing
Distributed quadtree
- The quadtree index of a dataset is composed of sub-indexes that are distributed across the cluster
- Each of these sub-indexes points to a shard with 1:1 cardinality
- Each sub-index is stored on the same computer as the shard that it indexes
[Diagram: shards 0 through 4 of the data, each paired with its local sub-index]
Point in Polygon with Indexed Points
Counting points in polygons using a spatially indexed dataset
- Rather than sending every polygon to each node, we send only a subset of the polygons
- Each node queries the index for points that are contained in its polygon subset
- The partial results from each node are then combined to produce the final result
DEMO Filtering Areas of Interest with Features Mike Park
Conclusion
Miscellaneous clever and insightful statements
- Overview of Hadoop
- Adding GIS capabilities to Hadoop
- Integrating Hadoop with ArcGIS
Caching I/O Reads
[Diagram: cached reads distributed across nodes dredd0 through dredd4]