High Performance Spatial Queries and Analytics for Spatial Big Data. Fusheng Wang. Department of Biomedical Informatics Emory University

Similar documents

How To Write A Mapreduce Based Query Engine

Hadoop GIS: A High Performance Spatial Data Warehousing System over MapReduce

Hadoop-GIS: A High Performance Spatial Query System for Analytical Medical Imaging with MapReduce

Hadoop. MPDL-Frühstück 9. Dezember 2013 MPDL INTERN

A REVIEW PAPER ON THE HADOOP DISTRIBUTED FILE SYSTEM

How To Use Hadoop For Gis

Big Data and Analytics: Getting Started with ArcGIS. Mike Park Erik Hoel

Big Data and Analytics: A Conceptual Overview. Mike Park Erik Hoel

Apache Hadoop FileSystem and its Usage in Facebook

SOLVING REAL AND BIG (DATA) PROBLEMS USING HADOOP. Eva Andreasson Cloudera

Analytics in the Cloud. Peter Sirota, GM Elastic MapReduce

An Oracle White Paper November Leveraging Massively Parallel Processing in an Oracle Environment for Big Data Analytics

Hadoop & its Usage at Facebook

Big Data on Microsoft Platform

Constructing a Data Lake: Hadoop and Oracle Database United!

Architectural patterns for building real time applications with Apache HBase. Andrew Purtell Committer and PMC, Apache HBase

Large scale processing using Hadoop. Ján Vaňo

Similarity Search in a Very Large Scale Using Hadoop and HBase

Conjugating data mood and tenses: Simple past, infinite present, fast continuous, simpler imperative, conditional future perfect

Advanced Big Data Analytics with R and Hadoop

Using distributed technologies to analyze Big Data

Introduction to Hadoop HDFS and Ecosystems. Slides credits: Cloudera Academic Partners Program & Prof. De Liu, MSBA 6330 Harvesting Big Data

Big Fast Data Hadoop acceleration with Flash. June 2013

Big Data: Using ArcGIS with Apache Hadoop. Erik Hoel and Mike Park

CSE-E5430 Scalable Cloud Computing Lecture 2

Architectures for Big Data Analytics A database perspective

Petabyte Scale Data at Facebook. Dhruba Borthakur, Engineer at Facebook, SIGMOD, New York, June 2013

Hadoop IST 734 SS CHUNG

Hadoop and its Usage at Facebook. Dhruba Borthakur June 22 rd, 2009

HadoopTM Analytics DDN

Hadoop implementation of MapReduce computational model. Ján Vaňo

Daniel J. Adabi. Workshop presentation by Lukas Probst

Hadoop. Sunday, November 25, 12

NoSQL for SQL Professionals William McKnight

A Comparison of Approaches to Large-Scale Data Analysis

Hadoop-BAM and SeqPig

Hadoop s Entry into the Traditional Analytical DBMS Market. Daniel Abadi Yale University August 3 rd, 2010

Integrating Apache Spark with an Enterprise Data Warehouse

Managing Big Data with Hadoop & Vertica. A look at integration between the Cloudera distribution for Hadoop and the Vertica Analytic Database

Design and Evolution of the Apache Hadoop File System(HDFS)

AGENDA. What is BIG DATA? What is Hadoop? Why Microsoft? The Microsoft BIG DATA story. Our BIG DATA Roadmap. Hadoop PDW

Oracle Big Data Spatial and Graph

BIG DATA What it is and how to use?

Beyond Web Application Log Analysis using Apache TM Hadoop. A Whitepaper by Orzota, Inc.

RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems CLOUD COMPUTING GROUP - LITAO DENG

HadoopRDF : A Scalable RDF Data Analysis System

Big Data With Hadoop

Big Data Processing with Google s MapReduce. Alexandru Costan

Real Time Big Data Processing

DATA MINING WITH HADOOP AND HIVE Introduction to Architecture

Hadoop Ecosystem B Y R A H I M A.

Big Data & QlikView. Democratizing Big Data Analytics. David Freriks Principal Solution Architect

In-Memory BigData. Summer 2012, Technology Overview

GeoGrid Project and Experiences with Hadoop

An Industrial Perspective on the Hadoop Ecosystem. Eldar Khalilov Pavel Valov

HADOOP ON ORACLE ZFS STORAGE A TECHNICAL OVERVIEW

Driving IBM BigInsights Performance Over GPFS Using InfiniBand+RDMA

Oracle8i Spatial: Experiences with Extensible Databases

Scaling Out With Apache Spark. DTL Meeting Slides based on

Virtualizing Apache Hadoop. June, 2012

How To Scale Out Of A Nosql Database

Lecture 10 - Functional programming: Hadoop and MapReduce

ebook Utilizing MapReduce to address Big Data Enterprise Needs Leveraging Big Data to shorten drug development cycles in Pharmaceutical industry.

CS2510 Computer Operating Systems

CS2510 Computer Operating Systems

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

Collaborative Big Data Analytics. Copyright 2012 EMC Corporation. All rights reserved.

Best Practices for Hadoop Data Analysis with Tableau

Spatial Data Analysis

From Spark to Ignition:

Energy Efficient MapReduce

I/O Considerations in Big Data Analytics

Big Data and Data Science: Behind the Buzz Words

Open source software framework designed for storage and processing of large scale data on clusters of commodity hardware

Data-intensive HPC: opportunities and challenges. Patrick Valduriez

Big Data for Investment Research Management

INTRODUCTION TO APACHE HADOOP MATTHIAS BRÄGER CERN GS-ASE

How To Use Big Data For Telco (For A Telco)

Native Connectivity to Big Data Sources in MSTR 10

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Cloud Computing at Google. Architecture

How Companies are! Using Spark

Enabling Multi-pipeline Data Transfer in HDFS for Big Data Applications

Apache Ignite TM (Incubating) - In- Memory Data Fabric Fast Data Meets Open Source

Big Data Data-intensive Computing Methods, Tools, and Applications (CMSC 34900)

Hadoop SNS. renren.com. Saturday, December 3, 11

An Oracle White Paper June High Performance Connectors for Load and Access of Data from Hadoop to Oracle Database

Hadoop & its Usage at Facebook

Big Data Research in the AMPLab: BDAS and Beyond

Unified Big Data Processing with Apache Spark. Matei

Massive Cloud Auditing using Data Mining on Hadoop

Big Graph Processing: Some Background

UPS battery remote monitoring system in cloud computing

Big Data Storage Options for Hadoop Sam Fineberg, HP Storage

Transcription:

High Performance Spatial Queries and Analytics for Spatial Big Data Fusheng Wang Department of Biomedical Informatics Emory University

Introduction Spatial Big Data Geo-crowdsourcing:OpenStreetMap Remote Sensing 2D Digital Pathology 3D Digital Pathology

Introduction Pathology Analytical Imaging Glass Slides Scanning Whole Slide Images Image Analysis Provide rich information about morphological and functional characteristics of biological systems, have tremendous potential for understanding diseases and supporting diagnosis e.g.: http://www.openpais.org/portal

Introduction Example: Distinguishing Characteristics in Glioblastoma Clinical data Molecular data [JAMIA 12]

Introduction GIS Centric Queries POINT CONTAINMENT WINDOW SPATIAL JOIN NEAREST NEIGHBOR DENSITY Need: Manage, Query and Analyze Spatial Big Data

Introduction Spatial Queries and Analytics Feature based descriptive queries Feature based filtering or feature aggregation Spatial relationship based queries Spatial join (two- or multi-), window, point-in-polygon Polygon overlay or spatial cross-matching Distance based queries Nearest neighbors Spatial analytics Find spatial clusters, hotspots, and anomalies Spatial relationship modeling, e.g., geographically weighted regression model (GWR)

Challenges Requirements and Challenges Requirements: fast query response, and scalable and cost-effective architecture Explosion of derived data 10 5 x10 5 pixels per image 1 million objects per image Hundreds to thousands of images per study OSM has two billion nodes, 600K contributors High computational complexity Multi-dimensional Spatial queries involve heavy duty geometric computations

Traditional Approach Traditional Approach: Parallel SDBMS Shared nothing architecture through partitioning to increase I/O bandwidth via parallel data access Extended from ORDBMS with spatial data types and access methods Partitioning: even distribution of data and colocation e.g: PAIS (500 images, 1TB, 30 partitions) [JPI 12, JPI 13]

Traditional Approach Advantages of Parallel SDBMS Comprehensive model and expressive query language for modeling and querying spatial data Example spatial join query: Scale out is possible

Traditional Approach Limitations of Parallel SDBMS The Cancer Genomics Atlas (TCGA): 14,000 whole slide images, 30TB of results Two months to get data loaded 4 days to do a spatial join with 30 partitions Partitioning based scale out is possible but is very difficult and expensive to scale to many nodes DBMS not optimized for computational intensive operations Data loading is a major bottleneck Lack load balancing Can we provide a solution with the advantages of RDBMS but is much more scalable and cost effective?

Hadoop-GIS Goal: High Performance Spatial Queries and Analytics MapReduce provides a highly scalable and cost effective framework for processing massive data Map step divides input into small problems by keys Reduce step collects answers and combine them HDFS for fault tolerance and efficiency Spatial queries and analytics are intrinsically complex and difficult to fit into the model Hybrid CPU/GPU systems commonly available, but the capacity is often underutilized There is a major step required on providing new spatial querying and analytical methods to run on such architectures

Hadoop-GIS Our Approach: Hadoop-GIS A general framework to support high performance spatial queries and analytics for spatial big data on MapReduce and CPU-GPU hybrid platforms Spatial data processing methods and pipelines with spatial partition level parallelism running on MapReduce Multi-level indexing methods to accelerate spatial data processing Query normalization methods for partitioning effect Declarative spatial queries and translation into MapReduce operations Utilize GPU to parallelize spatial operations and integrate them into MapReduce [VLDB 12, GIS 12, GIS 13, VLDB 13]

Spatial Query Processing with MapReduce Multi-level Spatial Indexing Tile indexing for managing tiles: filtering tasks in mapping stage Region based spatial indexing for grouping multiple neighboring tiles into a HDFS file: filtering data On demand indexing for intra-tile object queries Tile Tile Region HDFS

Hadoop-GIS A General Framework of Spatial Data Processing in MapReduce 1. Spatial partitioning 2. Data staging and global indexing in HDFS A. Block based filtering with region indexes B. foreach tile in input_region do (in parallel) a. Tile based filtering based on tile indexes b. On-demand indexing for objects in the tile c. Tile based spatial query processing C. Boundary-crossing object handling D. Post-query processing E. Result storage on HDFS

Spatial Query Processing with MapReduce Example: Spatial Join in MapReduce: Data Staging and Indexing Image Tile Boundary files Partitioning HDFS Algorithm results Data Staging& Indexing record(imageid, tileid, boundary) Result of Algorithm 1 Result of Algorithm 2

Spatial Query Processing with MapReduce Example: Spatial Join in MapReduce: Query Processing Image REDUCE : (one query reducer per tile) Tile engine per title Spatial Join MAP: group objects by tiles(keys) Algorithm results Boundary files Partitioning Data Staging& Indexing Data Scan Data Scan Data Scan HDFS record(imageid, tileid, boundary) Result of Algorithm 1 Result of Algorithm 2

Spatial Query Processing with MapReduce Example: Spatial Join in MapReduce: Result Normalization REDUCE MAP Data Scan Data Scan Data Scan

Spatial Query Processing with MapReduce Real-Time Spatial Query Engine (RESQUE) Effective querying methods that can run in parallel in distributed computing environments Spatial join, multi-way join, containment, nearest neighbor, and can be extended Geometric computation library (GEOS) On-demand indexing based query processing Mismatch between large block based storage and random index access Example: Two-way Spatial Join Create Bulk indexes R * -Tree on demand in memory or local files Indexing building overhead is a very small fraction Boundary File1 Building Bulk R * -Tree Building R*-Tree File1 Spatial Join Algorithm R*-Tree File2 Geometry Refinement Spatial Measure Result File Boundary File2 Real-time spatial querying engine (RESQUE)

Spatial Partitioning Spatial Partitioning Effective partitioning is critical for task parallelization and load balancing: data skew Criteria: balanced distribution, granularity, overlapping Methods: top-down: recursive slicing; bottom-up: R- Tree packing Parallezation with MapReduce

Spatial Partitioning Boundary-Crossing Object Handling S q T S T p p s p t q + Multi-assignment, single-join : replicate objects on boundaries to multiple tiles at partitioning Normalization methods are provided for each query type to correct answers Spatial join: removal of duplicates Voronoi diagram: reconstruction of diagrams on the boundaries

Spatial Query Processing with MapReduce Nearest Neighbor Query Processing Workflow with MapReduce e.g.: for each cell find the closest blood vessel and return distance to that blood vessel Access methods can vary (R*-Tree, Voronoi, etc..) Map Reduce Aggregation

Time (seconds) Performance System Performance: Hadoop-GIS vs Parallel Spatial SDBMS 2500 Hadoop-GIS DBMS-X 2000 1500 1000 500 0 5 10 20 30 Processing Units Spatial Join (boundary objects ignored)

Time (seconds) Time (seconds) Performance System Performance: Scalability 2400 2200 2000 1800 1600 1400 1200 1000 800 600 OSM Spatial Join 42GB 400 0 20 40 60 80 100 120 140 160 180 200 Processing Unit 5000 4000 3000 2000 1000 Imaging Cross-Matching 1X (18 images) 3X (54 images) 5X (90 images) 10X (180 images) 1.3TB 44GB 0 0 20 40 60 80 100 120 140 160 180 200 Processing Unit Nearest Neighbor 42GB

Software Integration with Apache Hive Integrating declarative query languages with MapReduce is a major trend Hive, Pig/Latin, Scope, Impala, Shark, Ysmart Hive is a data warehouse infrastructure built on top of Hadoop, with a SQL like query language (QL) Hive provides major aggregation operations, and support user defined functions and data compression No spatial query support Goal: Integrate spatial data processing into Hive

2012 2013 2014 Software Hadoop-GIS Architecture

GPU Accelerated Spatial Queries GPU Accelerated Spatial Operations Spatial Join Profiling GPU Computing Massively parallel, hundreds to thousands of cores Cheap and highly available Programmable: CUDA, OpenCL Goal: Exploit massive GPU parallelism and maximize data parallelism to accelerate spatial queries

GPU Accelerated Spatial Queries PixelBox Algorithms for Spatial Cross- Matching Monte-Carlo approach PixelBox p q Speed up on cross-matching for one image on a single GPU (512 cores): 120X [VLDB 12]

Ongoing Work Ongoing Work (NSF CAREER) Create a high performance software system for spatial queries and analytics of spatial big data on MapReduce and CPU-GPU hybrid platforms Promote the use of the created open source software to support problem solving in multiple disciplines, and educate the next generation workforce in big data Spatial and location based services (with Pitney Bowes) 3D Imaging GIS (with Leeds, Emory) Social media spatial analytics (twitter data) Complex spatial analytics: spatial clustering, regression Hybrid GPU-CPU processing

Thank you! https://github.com/emoryuniversity/libhadoopgis