On the Efficient Evaluation of Array Joins




On the Efficient Evaluation of Array Joins
Big Data in the Geosciences Workshop, IEEE Big Data, Santa Clara, US, 2015-oct-29
Peter Baumann, Vlad Merticariu
Jacobs University, rasdaman GmbH
{p.baumann,v.merticariu}@jacobs-university.de
[co-funded by EU through H2020 EarthServer-2]
[gamingfeeds.com]

Data Homogenization With OGC Standards
[Figure labels: sensor feeds; coverage server; sensor, image (timeseries), simulation, statistics data]

Web Coverage Service (WCS)
OGC Coverages unifying regular & irregular grids, point clouds, meshes
- OGC Coverage Implementation Schema
WCS Core: access to spatio-temporal coverages & subsets
- subset = trim | slice
WCS Extensions: optional functionality facets
- Scaling, CRS transformation, ...
Large, growing implementation basis: rasdaman, GDAL, QGIS, OpenLayers, OPeNDAP, MapServer, GeoServer, NASA WorldWind, EOxServer; Pyxis, ERDAS, ArcGIS,...

OGC Web Coverage Processing Service (WCPS)
= high-level spatio-temporal geo analytics language std
[JacobsU, FhG; NASA; data courtesy BGS, ESA]
"From MODIS scenes M1, M2, M3: difference between red & nir, as TIFF"
- but only those where nir exceeds 127 somewhere, within LandMask
    for $c in ( M1, M2, M3 ),
        $lm in ( LandMask )
    where some( $c.nir > 127 and $lm )
    return encode( $c.red - $c.nir, "image/tiff" )
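
To make the semantics of the query above concrete, here is a minimal plain-Python sketch. It is not WCPS and not rasdaman code: the function name `band_difference`, the dict layout of a scene, and the toy pixel values are all assumptions made for illustration, and the final TIFF encoding step is left out.

```python
def band_difference(scenes, land_mask):
    """Emulate: for $c in scenes, $lm in LandMask
                where some($c.nir > 127 and $lm)
                return $c.red - $c.nir
    Scenes are dicts of 2-D lists (hypothetical layout); encoding is omitted.
    """
    results = {}
    for name, scene in scenes.items():
        red, nir = scene["red"], scene["nir"]
        # "some(...)": true if at least one land pixel has nir > 127
        hit = any(
            n > 127 and m
            for nir_row, mask_row in zip(nir, land_mask)
            for n, m in zip(nir_row, mask_row)
        )
        if hit:
            # cell-wise difference over the whole scene
            results[name] = [
                [r - n for r, n in zip(red_row, nir_row)]
                for red_row, nir_row in zip(red, nir)
            ]
    return results


land_mask = [[True, False], [False, False]]
scenes = {
    # nir exceeds 127 on a land pixel: scene is kept
    "M1": {"red": [[200, 10], [20, 30]], "nir": [[150, 5], [5, 5]]},
    # nir exceeds 127 only off land: scene is dropped
    "M2": {"red": [[50, 60], [70, 80]], "nir": [[100, 200], [10, 10]]},
}
out = band_difference(scenes, land_mask)
print(out)
```

Note that the `where` clause only selects which scenes are returned; the difference itself is computed over every cell of a selected scene.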

EarthServer: Datacubes At Your Fingertips
Agile Analytics on Earth & Planetary datacubes
- rasdaman + NASA WorldWind
- Rigorously standards-based: OGC WMS + WCS + WCPS
- 100s of TB online now, next: 1+ Petabyte per cube
Intercontinental initiative, 3+3 years: EU + US + AUS
www.earthserver.eu

Agile Array Analytics: rasdaman
"raster data manager": SQL + n-d arrays
Scalable parallel tile streaming architecture
- Joins!
Supports R, QGIS, OpenLayers, MapServer, GDAL, EOxServer, Pyxis, ERDAS, ArcGIS,...
Blueprint for OGC WCPS, ISO Array SQL stds
[rasdaman visitors]

Array Partitioning
Goal: faster loading by adapting storage units to access patterns
Approach: split n-d array into n-d partitions ("tiles")
Tiling classification based on degree of alignment [ICDE 1999]
- all implemented in rasdaman
Also known as "chunking" [Sarawagi, Stonebraker, DeWitt,...]
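
The simplest scheme in that classification is regular, aligned tiling. The sketch below shows one way to enumerate such tiles; `regular_tiles` and the representation of a tile as a tuple of half-open `(lo, hi)` intervals are assumptions for illustration, not rasdaman's storage layout.

```python
from itertools import product


def regular_tiles(shape, tile_shape):
    """Split the n-d domain [0, shape) into aligned tiles.

    Each tile is a tuple of half-open (lo, hi) intervals, one per dimension.
    Tiles on the upper border are clipped to the domain.
    """
    per_axis = []
    for extent, step in zip(shape, tile_shape):
        per_axis.append(
            [(lo, min(lo + step, extent)) for lo in range(0, extent, step)]
        )
    # Cartesian product of the per-axis intervals gives the n-d tiles
    return list(product(*per_axis))


tiles = regular_tiles((10, 6), (4, 3))
print(len(tiles))
print(tiles[0], tiles[-1])
```

Irregular partitionings can use the same box representation; only the way the boxes are chosen changes.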

Why Irregular Partitioning? [Centrella et al: scidacreviews.org] [OpenStreetMap]

Array Join: What Happens?
MODIS red band with cloud mask applied: $Modis.red * $CloudMask
Case 1: same tile shapes, same position
- Easy
Case 2: same tile shapes, different position
- Overlaps, not so easy
Case 3: different tile shapes
- Worse
Case 4: different dimensions
- Gimme a break!

Array Join: Problem Statement
Goal: minimize tile reads when evaluating A op B in the face of some arbitrary, independent partitioning of A and B
Example: $Modis.red * $CloudMask

Bipartite Traversal Graphs
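
A minimal sketch of how such a graph could be built, assuming one node per tile of A, one node per tile of B, and an edge wherever a tile of A overlaps a tile of B. The helper names and the brute-force pairwise test are illustrative assumptions; a real system would presumably use a spatial index instead.

```python
def overlaps(t, u):
    """True if two n-d boxes (tuples of half-open (lo, hi) intervals) intersect."""
    return all(lo1 < hi2 and lo2 < hi1 for (lo1, hi1), (lo2, hi2) in zip(t, u))


def bipartite_edges(tiles_a, tiles_b):
    """Edges (i, j): tile i of A overlaps tile j of B (brute force, all pairs)."""
    return [
        (i, j)
        for i, ta in enumerate(tiles_a)
        for j, tb in enumerate(tiles_b)
        if overlaps(ta, tb)
    ]


# 1-D example over [0, 12): A is tiled in steps of 4, B in steps of 3
tiles_a = [((0, 4),), ((4, 8),), ((8, 12),)]
tiles_b = [((0, 3),), ((3, 6),), ((6, 9),), ((9, 12),)]
edges = bipartite_edges(tiles_a, tiles_b)
print(edges)
```

Every edge is a pair of tiles that must be in memory together at some point while the join is evaluated.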

Finding Complete Paths
Leonhard Euler: complete, minimal path for even-degree nodes
[Figure: Pregel river, Königsberg; Wikipedia]
Carl Hierholzer: not even degree? Add aux edges!

Finding Shortest Paths
[Figure label: Start / end!]
<B4,A3,B2,A1,B1,B3,A3,B5,A4,B2,A2>
<B4,A3,B3,A1,B1,A2,B2,A4,B5,A3,B2,A1>
Assumption: can hold 1 red, 1 blue tile
How to find shortest path? See paper!
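
To see why the traversal order matters under that buffer assumption, the sketch below counts tile reads for a given order of tile pairs. It is not the algorithm from the paper: `count_reads`, the cost model (one buffer slot per array, a read whenever the needed tile is not the one held), and the small chain-shaped graph are all assumptions for illustration.

```python
def count_reads(edge_order):
    """Count tile reads when tile pairs are processed in the given order.

    Assumed model: memory holds exactly one tile of A and one tile of B;
    a tile is read whenever it is not the one currently held.
    """
    held_a = held_b = None
    reads = 0
    for a, b in edge_order:
        if a != held_a:
            held_a = a
            reads += 1
        if b != held_b:
            held_b = b
            reads += 1
    return reads


# Hypothetical overlap graph forming a chain B1-A1-B2-A2-B3-A3-B4
path_order = [("A1", "B1"), ("A1", "B2"), ("A2", "B2"),
              ("A2", "B3"), ("A3", "B3"), ("A3", "B4")]
# Same pairs, but in an order that never reuses the held tiles
poor_order = [("A1", "B1"), ("A2", "B2"), ("A3", "B3"),
              ("A1", "B2"), ("A2", "B3"), ("A3", "B4")]

print(count_reads(path_order), count_reads(poor_order))
```

Following the graph as a path reads each of the 7 tiles once, while the second order needs 12 reads for the same 6 tile pairs.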

Complexity?
Best case: full alignment
- E_A + E_B
Worst case: all tiles of A x all tiles of B
- E_A * E_B
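
The two extremes can be reproduced on a toy example. The sketch below reads the two quantities on the slide as the tile counts of A and B, which is an assumption since the transcript does not define them; the helper names and the 4 x 4 domain are likewise illustrative.

```python
def overlaps(t, u):
    """True if two n-d boxes (tuples of half-open (lo, hi) intervals) intersect."""
    return all(lo1 < hi2 and lo2 < hi1 for (lo1, hi1), (lo2, hi2) in zip(t, u))


def overlapping_pairs(tiles_a, tiles_b):
    """Number of (A tile, B tile) pairs that share at least one cell."""
    return sum(overlaps(ta, tb) for ta in tiles_a for tb in tiles_b)


# 4 x 4 domain, partitioned into row strips or column strips
rows = [((r, r + 1), (0, 4)) for r in range(4)]
cols = [((0, 4), (c, c + 1)) for c in range(4)]

aligned = overlapping_pairs(rows, rows)  # A and B both tiled by rows
crossed = overlapping_pairs(rows, cols)  # A tiled by rows, B by columns

print(aligned, len(rows) + len(rows))    # pairs, and tiles to read once each
print(crossed, len(rows) * len(cols))    # every A tile meets every B tile
```

With aligned strips each tile has exactly one partner, so reading every tile once (4 + 4) suffices; with crossed strips all 4 * 4 pairs overlap.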

What else?
We have: tile traversal with minimum duplicate tile reads
Variation 1: buffer size to avoid dup reads?
- query cost estimation
Variation 2: for buffer size given, how much improvement?
Variation 3: parallelize disconnected subgraph
Variation 4: how many tiles to ship between nodes for optimal parallelization?
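
Variation 3 relies on the fact that disconnected parts of the overlap graph share no tiles, so they can be evaluated independently. A minimal sketch of finding those parts with a depth-first search follows; the function name and the example graph are assumptions, and how work is then assigned to workers is not covered.

```python
from collections import defaultdict


def components(edges):
    """Group the nodes of an undirected graph into connected components."""
    adj = defaultdict(set)
    for a, b in edges:
        adj[a].add(b)
        adj[b].add(a)

    seen = set()
    result = []
    for start in adj:
        if start in seen:
            continue
        # iterative depth-first search from an unvisited node
        stack = [start]
        comp = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            comp.add(node)
            stack.extend(adj[node] - seen)
        result.append(comp)
    return result


# Hypothetical overlap graph: A3/B3 overlap only each other
edges = [("A1", "B1"), ("A1", "B2"), ("A2", "B2"), ("A3", "B3")]
parts = components(edges)
print(parts)
```

Each component could be handed to a separate worker without any tile being needed by two workers.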

Wrap-Up
Array Database queries offer new Big Data service level, including datacube fusion
- Standardization: ISO Array SQL, OGC WCPS
Need efficient algorithms
- graph-based Array Join for arbitrary partitioning
EarthServer: Petascale Datacube Analytics
- rasdaman
[rasdaman screenshots]

Datacube Research @ Jacobs U
Large-Scale Scientific Information Systems research group
Flexible, scalable n-d array services
www.jacobs-university.de/lsis
Main results:
- pioneer Array DBMS, rasdaman
- standardization: OGC Big Geo Data (also ISO, INSPIRE, W3C), ISO Array SQL
Hiring PhD students, PostDocs