THE STATE OF GEO BIG DATA IN OPEN SOURCE. Rob Emanuele





Who am I?
- Open source geospatial developer working with big geo data
- Developer at Azavea in Philadelphia, US
- Maintainer of the GeoTrellis project

GEO BIG DATA FROM A DEVELOPER'S PERSPECTIVE

Frank Warmerdam PHOTO CREDIT: IAN TURTON

Frank Warmerdam
- Inventor of GDAL
- Founding director of OSGeo
- Worked at Google on geospatial systems using MapReduce

PlanetLabs

PlanetLabs
- Processes over 100,000 scenes per day
- 3-5 meter resolution

PlanetLabs - Pipeline
- Position spatially
- Apply geometric corrections
- Apply radiometric corrections
- Apply cloud masking
- Ortho-rectify

PlanetLabs - Pipeline
- GDAL
- GRASS
- OSSIM
- OpenCV

BIG DATA IS ABOUT ORCHESTRATION

MapReduce (Hadoop)
- Inflexible: everything must be a MapReduce job
- Running locally is painful
- Debugging is painful
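
To make the "everything must be a MapReduce job" constraint concrete, here is a minimal single-process sketch of the map, shuffle and reduce phases. It is not Hadoop code; the scene records, tile names and the `mapper`/`reducer` functions are hypothetical and exist only for illustration.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Apply the mapper to every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(groups, reducer):
    """Collapse each key's list of values into a single result."""
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical input: (tile name, cloud cover percent) for each scene.
scenes = [("tile_a", 10), ("tile_b", 80), ("tile_a", 30), ("tile_b", 60)]


def mapper(record):
    tile, cloud_cover = record
    yield tile, cloud_cover


def reducer(tile, cloud_covers):
    return sum(cloud_covers) / len(cloud_covers)


result = reduce_phase(shuffle(map_phase(scenes, mapper)), reducer)
print(result)
```

Any computation that does not fit this key/value shape has to be forced into it or split across several chained jobs, which is the inflexibility the slide refers to.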

PlanetLabs - JobServer
- PostgreSQL database for job management
- PostGIS for storing indexed imagery metadata
- Tasks are orchestrated by machines receiving imagery and the next stage in the pipeline.
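
A database-backed job queue of this kind can be sketched as follows. This is only an illustration under stated assumptions: the talk does not show the real schema, so the `jobs` table, the stage names and every function here are hypothetical, and SQLite (from the Python standard library) stands in for PostgreSQL so the example runs without a server.

```python
import sqlite3

# Stage names loosely follow the pipeline slide; they are illustrative only.
STAGES = ["position", "geometric", "radiometric", "cloud_mask", "orthorectify"]


def make_db():
    """Create an in-memory database with a hypothetical jobs table."""
    db = sqlite3.connect(":memory:")
    db.execute(
        "CREATE TABLE jobs ("
        "id INTEGER PRIMARY KEY, scene TEXT, stage INTEGER, status TEXT)"
    )
    return db


def submit(db, scene):
    """Queue a scene at the first pipeline stage."""
    db.execute(
        "INSERT INTO jobs (scene, stage, status) VALUES (?, 0, 'pending')",
        (scene,),
    )
    db.commit()


def claim(db):
    """Mark the oldest pending job as running and return it, or None."""
    row = db.execute(
        "SELECT id, scene, stage FROM jobs "
        "WHERE status = 'pending' ORDER BY id LIMIT 1"
    ).fetchone()
    if row is None:
        return None
    db.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    db.commit()
    return row


def complete(db, job_id, stage):
    """Re-queue the job for the next stage, or mark it done after the last."""
    if stage + 1 < len(STAGES):
        db.execute(
            "UPDATE jobs SET stage = ?, status = 'pending' WHERE id = ?",
            (stage + 1, job_id),
        )
    else:
        db.execute("UPDATE jobs SET status = 'done' WHERE id = ?", (job_id,))
    db.commit()


def run_worker(db):
    """Drain the queue, returning the (scene, stage name) steps performed."""
    log = []
    job = claim(db)
    while job is not None:
        job_id, scene, stage = job
        log.append((scene, STAGES[stage]))  # real processing would go here
        complete(db, job_id, stage)
        job = claim(db)
    return log


db = make_db()
for scene in ["scene_1", "scene_2"]:
    submit(db, scene)
log = run_worker(db)
print(log[:3])
```

This sketch uses a single connection and one worker. With many concurrent workers on PostgreSQL, the claim step would need row locking (for example `SELECT ... FOR UPDATE SKIP LOCKED`) so that two workers cannot take the same job, and every claim and completion is still a round trip to one database.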

PlanetLabs - JobServer
- Allows pipeline operations to be written with C++/Python tooling
- Running batch is very similar to running local
- Easier to debug

PlanetLabs - JobServer
- 2000+ workers hitting the database causes slowness
- Postgres/PostGIS is amazing, robust and very fast, but has its limits.
- It has sharding capabilities for horizontal scalability, but I haven't seen them used in geospatial (is anyone using this?)

Horizontal vs Vertical Scalability

MIXING HORIZONTAL AND VERTICAL SCALING IS GOING TO CAUSE PAIN.

PlanetLabs - JobServer
- Managing resource allocation is difficult
- Fault tolerance is hard
- Advanced orchestration, like complex prioritization and task specification, is a tough, non-geo problem to be solving.

ORCHESTRATION IS HARD

BIG DATA IS ABOUT DEPLOYMENT

DEPLOYMENT IS HARD

DEPLOYMENT IS HARD; CLOUD PROVIDERS HELP

Cloud Providers
- Amazon Web Services (AWS)
- Google Cloud Platform
- OpenStack (e.g. RackSpace)

AWS
- A set of services for running software on the cloud
- Many services: SQS, CloudFormation, ECS, EFS, EBS, SWF, Elastic Beanstalk, DynamoDB, Redshift

AWS - EC2
- Virtual machines available in a variety of hardware specs and operating systems
- Spot Instances are cheap!
- Open source tooling for devops

AWS - S3
- Object store
- High availability, distributed access
- Can share publicly or based on authentication

Landsat 8 on AWS
- Landsat 8 images are published to a public S3 bucket
- Over 85 TB worth of imagery
- https://aws.amazon.com/public-data-sets/landsat/

NASA NEX on AWS
- Downscaled Climate Projections (NEX-DCP30)
- Global Daily Downscaled Projections (NEX-GDDP)
- MOD13Q1 (Vegetation Indices 16-Day L3 Global 250m)
- Landsat GLS (Global Land Survey)
- https://aws.amazon.com/nasa/nex/

Downscaled Climate Projections
- Monthly temperature and precipitation data over the contiguous US
- Historical from 1950-2006
- 33 models, 4 RCP scenarios from 2006-2099
- 8190 NetCDF files
- Over 5 TB of data

Local Climate Impact Assessment Modeling
- Funded by the US Department of Energy
- Azavea in cooperation with the Nature Conservancy
- Goal: to make climate model data useful to local regional planners

Hadoop

Matei Zaharia

Apache Spark
- Open sourced in 2010 under a BSD license
- Formerly maintained by UC Berkeley's AMPLab
- Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0
- Graduated to a top-level Apache project in 2014

Apache Spark
- A distributed computation engine
- An API that lets you work with distributed data as a collection
- Language bindings for use with Java, Python, and R

GeoTrellis
- A Scala library for geospatial data types and operations
- Enables Spark with raster capabilities
- Storage and bounded retrievals from HDFS, Accumulo, and S3

Accumulo
- BigTable clone (columnar database)
- Records stored on HDFS
- Lexicographically sorted table index
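
The practical value of a lexicographically sorted index is that a bounded retrieval becomes a contiguous range scan. The sketch below shows the idea with a toy in-memory table; `SortedTable` and the `tile/NNNN` key format are hypothetical and are not the Accumulo API.

```python
import bisect


class SortedTable:
    """Toy stand-in for a table whose keys are kept in sorted order."""

    def __init__(self):
        self._keys = []    # always sorted lexicographically
        self._values = {}

    def put(self, key, value):
        if key not in self._values:
            bisect.insort(self._keys, key)
        self._values[key] = value

    def scan(self, start, end):
        """Return (key, value) pairs with start <= key < end, in key order."""
        lo = bisect.bisect_left(self._keys, start)
        hi = bisect.bisect_left(self._keys, end)
        return [(key, self._values[key]) for key in self._keys[lo:hi]]


table = SortedTable()
for key in ["tile/0003", "tile/0001", "tile/0010", "tile/0002"]:
    table.put(key, "raster bytes for " + key)

# Only the matching slice of the sorted key list is touched.
hits = table.scan("tile/0002", "tile/0010")
print([key for key, _ in hits])
```

The keys are zero-padded so that string order agrees with numeric order; without padding, "tile/10" would sort before "tile/2".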

Space Filling Curves

Space Filling Curves github.com/locationtech/sfcurve
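
A space filling curve maps a two-dimensional cell position to a single number, so that spatial data can be stored under a one-dimensional sorted key. The sketch below implements one common curve, the Z-order (Morton) curve, by interleaving the bits of the column and row. It is a plain Python illustration, not the sfcurve API; the function names and the 16-bit default are assumptions.

```python
def z_index(x, y, bits=16):
    """Interleave the bits of x and y into a Z-order (Morton) index."""
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)       # x bits go to even positions
        z |= ((y >> i) & 1) << (2 * i + 1)   # y bits go to odd positions
    return z


def z_decode(z, bits=16):
    """Recover (x, y) from a Z-order index produced by z_index."""
    x = y = 0
    for i in range(bits):
        x |= ((z >> (2 * i)) & 1) << i
        y |= ((z >> (2 * i + 1)) & 1) << i
    return x, y


# Sort the cells of a 4x4 grid by their position along the curve.
cells = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(cells, key=lambda cell: z_index(*cell))
print(ordered[:4])
```

The first four cells along the curve form the 2x2 block at the origin, which shows why cells that are close in space tend to receive nearby index values. The locality is not perfect: the curve makes long jumps at quadrant boundaries, so a bounding-box query usually decomposes into several index ranges.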

Other projects using SFCurve
- GeoMesa
- GeoWave

Zonal Summaries
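
A zonal summary aggregates the cells of a value raster within each zone, for example the mean temperature per county. The sketch below assumes the zones have already been rasterized onto the same grid as the values; the grids, the NoData value and the `zonal_mean` function are hypothetical, and this is not GeoTrellis code.

```python
from collections import defaultdict

NODATA = -9999.0  # assumed NoData marker for this example


def zonal_mean(values, zones, nodata=NODATA):
    """Return {zone id: mean of the valid cells that fall in that zone}."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value == nodata:
                continue  # skip cells with no measurement
            sums[zone] += value
            counts[zone] += 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


values = [
    [10.0, 20.0, 35.0],
    [40.0, 50.0, NODATA],
]
zones = [
    [1, 1, 2],
    [1, 1, 2],
]

means = zonal_mean(values, zones)
print(means)
```

Because the result is built from per-zone sums and counts, the same computation can be run on each tile of a large raster independently and the partial sums and counts merged afterwards, which is what makes it suitable for a distributed engine.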

Benchmark Results
- Yearly average, 2006 to 2100
- Single layer, 439.5 GB uncompressed
- 40 m3.xlarge instances (estimated $2.00 USD per hour on spot market)

Summary
- Big data is about orchestration.
- Big data is about deployment.
- The state of geo big data is the state of big data, with work towards enabling geospatial data types.
- Use Apache Spark!
- Spatial indexing of distributed data is a hot topic.

LET'S DEVELOP AND USE THE BEST TOOLS POSSIBLE

THANK YOU
- @lossyrob
- gitter.im/geotrellis/geotrellis
- github.com/geotrellis/geotrellis
- remanuele@azavea.com