Big Data Infrastructures for Processing Sentinel Data
Wolfgang Wagner
Department of Geodesy and Geoinformation, Technische Universität Wien
Earth Observation Data Centre for Water Resources Monitoring

What is Big Data? Big Data, Big Hype?
Steve Dodson (2014): "An intrusion of privacy"
A successful business model of a few big, primarily American enterprises
Sven Schade (2015) describes the Big Data era as a situation in which the volume, variety, velocity and veracity (3+1 Vs) of available data sets and streams challenge current management and processing capabilities
Schade, S. (2015) Big Data breaking barriers - first steps on a long trail, ISPRS Archives, XL-7/W3, 691-697.
Infrastructures for Processing Big Data
Google Council Bluffs data center (http://www.google.com/about/datacenters/gallery/#/all/2)

Sentinel Programme
A fleet of European Earth observation satellites for environmental monitoring
Sentinel-1: A Game Changer
C-band SAR satellite in continuation of ERS-1/2 and ENVISAT
High spatio-temporal coverage
Spatial resolution 20-80 m
Temporal resolution < 3 days over Europe and Canada with 2 satellites
Excellent data quality
Highly dynamic land surface processes can be captured
Impact on water management, health and other applications could be high if the challenges in the ground segment can be overcome
Sentinel-1 image of Upper Austria taken on 13/04/2015
Solar panel and SAR antenna of Sentinel-1, launched 3 April 2014. Image was acquired by the satellite's onboard camera. (ESA)
Sentinel-1 Data Volume
From Byte to PetaByte: 1 Byte, 1 KiloByte, 1 MegaByte, 1 GigaByte, 1 TeraByte, 1 PetaByte
Speed of Data Transmission
Download of 500 Gigabyte (≈ daily Sentinel-1 data volume over land): wireless with 7 Mbit/s, or landline with 1 Gbit/s
Download of 1 Petabyte (≈ 7 years of Sentinel-1 data over land): landline with 1 Gbit/s

Speed of Data Processing
Assumed processing speed of Sentinel-1 data with one computer/node: ~4 Mbit/s
Processing of 500 Gigabyte (≈ daily Sentinel-1 data over land): 1 computer
Processing of 1 Petabyte (≈ 7 years of Sentinel-1 data over land): 1 computer, 100 nodes, or 1000 nodes
One needs supercomputers for processing Sentinel data!
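The durations implied by these rates were shown graphically on the slide; the following minimal Python sketch reproduces the arithmetic (pure unit conversion, no assumptions beyond the rates stated above):

```python
def transfer_time(volume_bytes, rate_bits_per_s):
    """Time in seconds to move (or process) a data volume at a given rate."""
    return volume_bytes * 8 / rate_bits_per_s

GB, PB = 1e9, 1e15
HOUR, DAY = 3600.0, 86400.0

# Download of 500 GB (daily Sentinel-1 volume over land)
print(transfer_time(500 * GB, 7e6) / DAY)    # wireless, 7 Mbit/s -> ~6.6 days
print(transfer_time(500 * GB, 1e9) / HOUR)   # landline, 1 Gbit/s -> ~1.1 hours

# Download of 1 PB (7 years of Sentinel-1 data over land)
print(transfer_time(1 * PB, 1e9) / DAY)      # landline, 1 Gbit/s -> ~92.6 days

# Processing at the assumed ~4 Mbit/s per node
print(transfer_time(500 * GB, 4e6) / DAY)          # 1 computer  -> ~11.6 days
print(transfer_time(1 * PB, 4e6) / DAY)            # 1 computer  -> ~63 years
print(transfer_time(1 * PB, 4e6) / DAY / 1000)     # 1000 nodes  -> ~23 days
```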
Approaching Technological Frontiers?
Information and communications technology (ICT) has improved dramatically over the past decades
Moore's law, which states that the number of transistors in a dense integrated circuit doubles approximately every two years, still holds
But there are physical limits to every technology! e.g. for any thermodynamic cycle operating between temperatures T_H and T_C, none can exceed the efficiency of a Carnot cycle: η = 1 − T_C/T_H
Increasingly we face challenges related to data volume, bandwidth and I/O, and algorithmic complexity

Earth Observation Ground Segment: Past
Earth Observation Ground Segment: Present
Earth Observation Ground Segment: Future
A New Paradigm for Earth Observation
Reasons:
Fast growing volume and increasing variety of EO data
Increasing complexity of algorithms with increasing resolution
Higher scientific standards: algorithms must be validated with big data sets and competing algorithms; algorithm ensembles needed
Solution: Bring users and their software to the data
Consequence: Need for cooperation & specialisation

An Opportunity for New Business Models
Business model of Munich-based company CloudEO (http://www.cloudeo-ag.com/how-it-works)
Big Data Infrastructures for the Sentinels
Private Sector
Google Earth Engine
Amazon Web Services: offers Landsat data (complete from 2015 onwards) to its cloud users
Helix Nebula Science Cloud etc.: consortium of European ICT providers teaming up with ESA, CERN, etc.
Public Sector
Initiatives triggered mainly by national space programmes
THEIA Land Data Centre (France)
Climate, Environment and Monitoring from Space (CEMS) (UK)
OPUS/Copernicus Centre (Germany)
European Space Agency etc.: Thematic Exploitation Platforms, Mission Exploitation Platforms

Google Earth Engine
Premier platform for the scientific analysis of high-resolution imagery
Combines the strength of an ICT giant with expertise in earth observation (team of > 100 programmers)
Rolled out on three Google data centres (US, Europe, Asia)
Access through JavaScript or Python API
Programming in "Googlish", i.e. code can only run on Google Earth Engine
Image-oriented data structure, including image pyramids for interactive analysis
Commercial applications are not free
Data download possible (original and processed data)
Data holdings: Landsat: complete archive; MODIS: many geophysical variables; Sentinel-1: already about 10,000 scenes; Sentinel-2: will likely follow soon
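As a minimal illustration of such an API query, the following Python sketch counts Sentinel-1 scenes over a point of interest. The collection ID 'COPERNICUS/S1_GRD' and the filter fields follow the public Earth Engine catalogue; the point coordinates and date range are illustrative, not taken from the talk:

```python
import ee

ee.Initialize()  # requires an authenticated Earth Engine account

# Sentinel-1 ground-range-detected scenes over an illustrative point
aoi = ee.Geometry.Point(14.3, 48.3)
s1 = (ee.ImageCollection('COPERNICUS/S1_GRD')
      .filterBounds(aoi)
      .filterDate('2014-10-01', '2016-09-01')
      .filter(ee.Filter.listContains('transmitterReceiverPolarisation', 'VV')))

print('Sentinel-1 scenes over AOI:', s1.size().getInfo())

# Mean VV backscatter composite; the computation runs on Google's servers,
# only the reduced result is transferred back to the client
mean_vv = s1.select('VV').mean()
```

This illustrates the "bring users to the data" paradigm: the code describes the computation, which is executed inside Google's data centres rather than on the user's machine.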
Snapshot of Google Earth Engine interface showing Sentinel-1 data holdings as of 4/9/2016 (https://ee-api.appspot.com)

Earth Observation Data Centre (EODC)
Founded in May 2014 as a Public-Private Partnership
Mission: EODC works together with its partners from science, the public and the private sectors in order to foster the use of EO data for monitoring of water and land
EODC acts as a community facilitator: joint developments, cloud infrastructure, operational data services, Open Source software
EODC works towards a federation of data centres
EODC Cooperation Network
Work is done within the communities: Infrastructure, Sentinel-1, Sentinel-2
Already 13 cooperation partners from 6 countries: Austria, Australia, Czech Republic, Italy, France, The Netherlands

EODC Infrastructure in Vienna
Virtual Machines (VMs)
Supercomputer VSC-3: rank 85 of the world's most powerful computers (11/2014)
24/7 Operations & Rolling Archive
Petabyte-Scale Disk Storage
Tape Storage
EODC Status
Operations started in June 2015 after a one-year development phase
Operational data reception and processing by ZAMG
Computer cluster operated by EODC: Virtual Machines via OpenStack Cloud Services (see the provisioning sketch below)
Supercomputer VSC-3 operated by TU Wien
Data and platform services diagram: PaaS, User VMs, Repositories, Community File Repository, VSC-3 Login Node, NORA, Router, Job Scheduler, High Availability, Continuous Integration, Various Inspection Tools, Web Conferencing, Development Collaboration, Community Building

Sentinel-1 Data Availability @ EODC
Sentinel-1 data are currently available ~2.5 hours after their processing time and 6.25 hours after acquisition time (median values for August 2015)
54,888 acquisitions with 39.65 TB (>1.5 times our 10-year ENVISAT ASAR archive)
Ramp-up of Sentinel-1 acquisition scenario to full operational status
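As a rough illustration of the OpenStack-based user VMs, this Python sketch provisions a processing VM with the openstacksdk client. All names (the cloud profile 'eodc', image, flavor, network, key pair) are hypothetical placeholders, not the actual EODC configuration:

```python
import openstack

# Connect using credentials from a clouds.yaml profile (profile name assumed)
conn = openstack.connect(cloud='eodc')

# Boot a processing VM; image, flavor, network and key names are placeholders
server = conn.create_server(
    name='s1-processing-vm',
    image='ubuntu-16.04',
    flavor='m1.large',
    network='private',
    key_name='my-key',
    wait=True,        # block until the VM is active
    auto_ip=True,     # attach a floating IP so the user can log in
)
print(server.name, server.status)
```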
Supercomputing Experiment
Vienna Scientific Cluster 3: high-performance computing (HPC) system with 2020 nodes
Each node has 2 Intel Xeon E5-2650v2 processors (2.6 GHz) and 64 GBytes of RAM
Job management via the Simple Linux Utility for Resource Management (SLURM)
Experiment: geocoding of 624 Sentinel-1 images from Austria, Sudan and Zambia with the Sentinel-1 Toolbox
Each image is about 1 GByte in size
Serial processing with one processor would take about two weeks
Approach: parallel processing on 312 nodes, with 2 images launched simultaneously on each computing node (see the sketch below)
Results: processing was completed within 45 min (without queuing)
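A minimal sketch of how such a run could be scripted as a SLURM job array: the 624 images are split into 312 pairs, and each array task processes one pair concurrently on its node. The archive path, the graph file, and the use of the Sentinel-1 Toolbox command-line runner `gpt` are assumptions for illustration, not the actual experiment scripts:

```python
import subprocess
from pathlib import Path

IMAGE_DIR = Path("/eodc/sentinel1")   # assumed archive location
PAIR_DIR = Path("pairs")
PAIR_DIR.mkdir(exist_ok=True)

# Split the 624 images into 312 pairs, one pair per node
images = sorted(IMAGE_DIR.glob("*.zip"))
pairs = [images[i:i + 2] for i in range(0, len(images), 2)]
for i, pair in enumerate(pairs):
    (PAIR_DIR / f"pair_{i}.txt").write_text("\n".join(map(str, pair)))

# One SLURM array task per pair; both images of a pair run concurrently via
# 'gpt', the Sentinel-1 Toolbox graph processing tool (graph file assumed)
batch = f"""#!/bin/bash
#SBATCH --job-name=s1-geocode
#SBATCH --nodes=1
#SBATCH --array=0-{len(pairs) - 1}
while read img; do
    gpt geocode_graph.xml -Pinput="$img" &
done < pairs/pair_${{SLURM_ARRAY_TASK_ID}}.txt
wait
"""
Path("geocode.sbatch").write_text(batch)
subprocess.run(["sbatch", "geocode.sbatch"], check=True)
```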
Conclusions
Big Data is a broad term for data sets so large or complex that traditional data processing applications are inadequate
Earth Observation is entering the Big Data era
Big Data infrastructures for processing Sentinel data are being developed along two main lines:
Deploying EO-specific services on general-purpose cloud computing environments
Building new, or expanding existing, dedicated EO data centres

Acknowledgements
My colleagues at TU Wien and EODC: Christian Briese, Vahid Naeimi, Bernhard Bauer-Marschallinger, Christoph Paulik, Alena Dostalova, Stefano Elefante, Thomas Mistelbauer, Hans Thüminger, and Andreas Roncat
Austrian Space Application Programme: Projects 844350 Prepare4EODC and 88001 WetMon
European Space Agency: Contract No. 4000107319/12/I-BG EODC Water Study