THE STATE OF GEO BIG DATA IN OPEN SOURCE Rob Emanuele
Who am I? open source geospatial developer working with big geo data. developer at Azavea in Philadelphia, US. maintainer of the GeoTrellis project.
GEOBIGDATA FROM A DEVELOPER'S PERSPECTIVE
Frank Warmerdam PHOTO CREDIT: IAN TURTON
Frank Warmerdam Inventor of GDAL Founding director of OSGeo Worked at Google on geospatial systems using MapReduce
PlanetLabs
PlanetLabs Processes over 100,000 scenes per day 3-5 meter resolution
PlanetLabs - Pipeline position spatially apply geometric corrections apply radiometric corrections apply cloud masking ortho-rectify
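The pipeline above is an ordered chain of stages, each consuming the previous stage's output. A minimal Python sketch of that shape follows. The stage functions are hypothetical placeholders that only record their own name; they are not PlanetLabs code, and real stages would call tools such as GDAL or OpenCV.

```python
from functools import reduce


def make_stage(name):
    """Build a hypothetical placeholder stage that appends its name to the scene history."""
    def stage(scene):
        return {**scene, "history": scene["history"] + [name]}
    stage.__name__ = name
    return stage


# Stage order follows the slide; the names themselves are illustrative.
PIPELINE = [
    make_stage("position_spatially"),
    make_stage("geometric_correction"),
    make_stage("radiometric_correction"),
    make_stage("cloud_mask"),
    make_stage("orthorectify"),
]


def run_pipeline(scene, stages=PIPELINE):
    """Apply each stage in order, feeding the output of one into the next."""
    return reduce(lambda acc, stage: stage(acc), stages, scene)


result = run_pipeline({"id": "scene-001", "history": []})
print(result["history"])
```

Keeping each stage a plain function is one way to make a batch run look the same as a local run: the same list of stages can be applied to one scene on a laptop or to many scenes by a worker pool.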
PlanetLabs - Pipeline GDAL GRASS OSSIM OpenCV
BIG DATA IS ABOUT ORCHESTRATION
MapReduce (Hadoop) Inflexible: Everything must be a MapReduce job Running locally is painful Debugging is painful
PlanetLabs - JobServer PostgreSQL database for job management PostGIS for storing indexed imagery metadata Tasks are orchestrated by machines receiving imagery and the next stage in the pipeline.
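A database-backed job queue of this kind can be sketched in a few lines. The example below is an assumption-laden illustration, not the PlanetLabs JobServer: it uses an in-memory SQLite database as a stand-in for PostgreSQL so it runs anywhere, and the table layout, column names, and `claim_job` helper are all hypothetical.

```python
import sqlite3

# In-memory SQLite stands in for the PostgreSQL job database (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE jobs (
        id     INTEGER PRIMARY KEY,
        scene  TEXT NOT NULL,
        stage  TEXT NOT NULL,
        status TEXT NOT NULL DEFAULT 'pending',
        worker TEXT
    )
    """
)
conn.executemany(
    "INSERT INTO jobs (scene, stage) VALUES (?, ?)",
    [("scene-001", "cloud_mask"), ("scene-002", "cloud_mask")],
)
conn.commit()


def claim_job(conn, worker):
    """Mark the oldest pending job as running for this worker; return it or None."""
    with conn:  # commits on success, rolls back on error
        row = conn.execute(
            "SELECT id, scene, stage FROM jobs "
            "WHERE status = 'pending' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        updated = conn.execute(
            "UPDATE jobs SET status = 'running', worker = ? "
            "WHERE id = ? AND status = 'pending'",
            (worker, row[0]),
        )
        # If another worker claimed the row first, the guarded UPDATE matches nothing.
        return row if updated.rowcount == 1 else None


first = claim_job(conn, "worker-a")
second = claim_job(conn, "worker-b")
third = claim_job(conn, "worker-c")
print(first, second, third)
```

Every worker polling and updating the same table is also where contention comes from once the worker count grows, which is the scaling limit this design runs into.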
PlanetLabs - JobServer Allows pipeline operations to be written with C++/python tooling Running batch is very similar to running local Easier to debug
PlanetLabs - JobServer 2000+ workers hitting database causes slowness Postgres/PostGIS is amazing, robust and very fast, but has its limits. It has sharding capabilities for horizontal scalability, but I haven't seen it used in geospatial (is anyone using this?)
Horizontal vs Vertical Scalability
MIXING HORIZONTAL AND VERTICAL SCALING IS GOING TO CAUSE PAIN.
PlanetLabs - JobServer Managing resource allocation is difficult Fault tolerance is hard Advanced orchestration like complex prioritization and task specification are tough, non-geo problems to be solving.
ORCHESTRATION IS HARD
BIG DATA IS ABOUT DEPLOYMENT
DEPLOYMENT IS HARD
DEPLOYMENT IS HARD; CLOUD PROVIDERS HELP
Cloud Providers Amazon Web Services (AWS) Google Cloud Platform OpenStack (e.g. RackSpace)
AWS A set of services for running software on the cloud Many services. SQS, CloudFormation, ECS, EFS, EBS, SWF, Elastic Beanstalk, DynamoDB, Redshift
AWS - EC2 Virtual machines that run a variety of hardware specs and operating systems. Spot Instances are cheap! Open source tooling for devops
AWS - S3 Object store High availability, distributed access Can share publicly or based on authentication
Landsat 8 on AWS Landsat 8 images are published to a public S3 bucket Over 85 TB worth of imagery https://aws.amazon.com/public-data-sets/landsat/
NASA NEX on AWS Downscaled Climate Projections (NEX-DCP30) Global Daily Downscaled Projections (NEX-GDDP) MOD13Q1 (Vegetation Indices 16-Day L3 Global 250m) Landsat GLS (Global Land Survey) https://aws.amazon.com/nasa/nex/
Downscaled Climate Projections Monthly temperature and precipitation data over contiguous US Historical from 1950-2006 33 models, 4 RCP scenarios from 2006-2099 8190 NetCDF files Over 5 TB of data
Local Climate Impact Assessment Modeling Funded by US Department of Energy Azavea in cooperation with Nature Conservancy Goal to make climate model data useful to local regional planners
Hadoop
Matei Zaharia
Apache Spark Open sourced in 2010 under BSD license Formerly maintained by UC Berkeley's AMPLab Donated to the Apache Software Foundation in 2013 and relicensed as Apache 2.0 Graduated to a top-level Apache project in 2014
Apache Spark a distributed computation engine. An API that lets you work with distributed data as a collection. Language bindings for use with Java, Python, and R.
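To make "distributed data as a collection" concrete, here is a toy, single-process sketch in plain Python. It is not Spark's API: the `PartitionedCollection` class and its methods are hypothetical stand-ins that only mimic the idea of mapping over partitions and then merging per-key partial results. The sample readings are made up for illustration.

```python
class PartitionedCollection:
    """Toy stand-in for a distributed collection: a list of in-memory partitions."""

    def __init__(self, partitions):
        self.partitions = [list(p) for p in partitions]

    def map(self, fn):
        """Apply fn to every element, partition by partition."""
        return PartitionedCollection([[fn(x) for x in p] for p in self.partitions])

    def reduce_by_key(self, fn):
        """Combine values per key inside each partition, then merge across partitions."""
        partials = []
        for p in self.partitions:
            local = {}
            for key, value in p:
                local[key] = fn(local[key], value) if key in local else value
            partials.append(local)
        merged = {}
        for local in partials:
            for key, value in local.items():
                merged[key] = fn(merged[key], value) if key in merged else value
        return PartitionedCollection([sorted(merged.items())])

    def collect(self):
        """Gather all elements into one local list."""
        return [x for p in self.partitions for x in p]


# Made-up (year, temperature) readings spread over two partitions.
readings = PartitionedCollection(
    [
        [(2006, 10.0), (2006, 14.0)],
        [(2007, 11.0), (2006, 12.0), (2007, 15.0)],
    ]
)

# A mean is not directly reducible, so carry (sum, count) pairs instead.
sums = readings.map(lambda kv: (kv[0], (kv[1], 1))).reduce_by_key(
    lambda a, b: (a[0] + b[0], a[1] + b[1])
)
yearly_avg = {year: total / count for year, (total, count) in sums.collect()}
print(yearly_avg)
```

The calling code never mentions which partition a record lives in, which is the property that makes the collection-style API pleasant to work with.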
GeoTrellis a Scala library for geospatial data types and operations. enables Spark with raster capabilities. storage and bounded retrievals from HDFS, Accumulo, and S3
Accumulo BigTable clone (columnar database) Records stored on HDFS Lexicographically sorted table index
Space Filling Curves
Space Filling Curves github.com/locationtech/sfcurve
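A space filling curve turns a 2D grid position into a single number, so that a lexicographically sorted index keeps nearby cells close together. The sketch below shows a Z-order (Morton) curve in plain Python. It is an illustration of the general idea, not the sfcurve API; the function names and the fixed-width hex key format are assumptions.

```python
def z_index(x, y, bits=16):
    """Interleave the bits of x and y: x fills even bit positions, y fills odd ones."""
    limit = 1 << bits
    if not (0 <= x < limit and 0 <= y < limit):
        raise ValueError("x and y must fit in the given number of bits")
    z = 0
    for i in range(bits):
        z |= ((x >> i) & 1) << (2 * i)
        z |= ((y >> i) & 1) << (2 * i + 1)
    return z


def row_key(x, y, bits=16):
    """Zero-padded hex key, so lexicographic order equals numeric Z order."""
    width = (2 * bits + 3) // 4  # hex digits needed for 2 * bits bits
    return format(z_index(x, y, bits), "0{}x".format(width))


# Visit a 4x4 grid in Z order by sorting cells on their keys.
cells = [(x, y) for y in range(4) for x in range(4)]
ordered = sorted(cells, key=lambda c: row_key(*c))
print(ordered[:4])  # the first 2x2 block of the grid
```

Because each 2x2 block (and each 4x4 block, and so on) occupies one contiguous key range, a bounding-box query can be answered with a small number of range scans over the sorted index.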
Other projects using SFCurve GeoMesa GeoWave
Zonal Summaries
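A zonal summary aggregates raster cell values per zone, for example the mean temperature inside each polygon. The sketch below is a minimal pure-Python version under stated assumptions: the zones are already rasterized onto the same grid as the values, NoData cells are represented as `None`, and the `zonal_mean` function and sample grids are hypothetical, not GeoTrellis code.

```python
def zonal_mean(values, zones):
    """Return {zone_id: mean of valid cells}; cells with value None are skipped."""
    sums, counts = {}, {}
    for value_row, zone_row in zip(values, zones):
        for value, zone in zip(value_row, zone_row):
            if value is None or zone is None:
                continue  # NoData cell or cell outside every zone
            sums[zone] = sums.get(zone, 0.0) + value
            counts[zone] = counts.get(zone, 0) + 1
    return {zone: sums[zone] / counts[zone] for zone in sums}


# Made-up 3x3 value raster and a matching raster of zone ids.
values = [
    [1.0, 2.0, 4.0],
    [3.0, None, 6.0],
    [5.0, 7.0, 8.0],
]
zones = [
    ["a", "a", "b"],
    ["a", "b", "b"],
    ["c", "c", "b"],
]

means = zonal_mean(values, zones)
print(means)
```

Keeping running sums and counts per zone means partial results from separate tiles can be added together before dividing, which is what lets the same summary run across a distributed raster.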
Benchmark Results Yearly Average, 2006 to 2100 Single Layer, 439.5 GB uncompressed 40 m3.xlarge instances (estimated $2.00 USD per hour on spot market)
Summary big data is about orchestration. big data is about deployment. the state of geo big data is the state of big data, with work towards enabling geospatial data types. use Apache Spark! spatial indexing of distributed data is a hot topic.
LET'S DEVELOP AND USE THE BEST TOOLS POSSIBLE
THANK YOU @lossyrob gitter.im/geotrellis/geotrellis github.com/geotrellis/geotrellis remanuele@azavea.com