Big Data in OpenTopography Vishu Nandigam San Diego Supercomputer Center NSF Big Data in Education Workshop Jan 28-29 2015
Presentation Overview: Lidar and OpenTopography; Data and Workflow; Cyberinfrastructure; Data Growth and Challenges; Data Insights; Research and Development
LIDAR: LIght Detection And Ranging (aka airborne laser swath mapping). Billions of accurate distance measurements with a scanning laser rangefinder + GPS + Inertial Measurement Unit (IMU). Point cloud (x,y,z coordinates) = fundamental LIDAR data product. Image: David Haddad, AGS
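To make the point cloud idea concrete, here is a minimal sketch of turning (x, y, z) returns into a raster surface by binning them into square cells and keeping the highest elevation per cell. The function name `grid_max_z`, the sample coordinates, and the maximum-per-cell rule are assumptions for illustration; the slides do not say how OpenTopography grids its data.

```python
from collections import defaultdict
from math import floor


def grid_max_z(points, cell_size):
    """Bin (x, y, z) points into square cells and keep the highest z per cell.

    Returns a dict mapping (col, row) cell indices to the maximum elevation.
    This is an illustrative rule, not a documented OpenTopography method.
    """
    if cell_size <= 0:
        raise ValueError("cell_size must be positive")
    cells = defaultdict(lambda: float("-inf"))
    for x, y, z in points:
        key = (floor(x / cell_size), floor(y / cell_size))
        if z > cells[key]:
            cells[key] = z
    return dict(cells)


# Hypothetical sample returns (coordinates in metres).
sample_points = [
    (0.2, 0.3, 101.5),
    (0.8, 0.9, 103.0),
    (1.4, 0.1, 99.8),
    (1.6, 1.7, 110.2),
]
surface = grid_max_z(sample_points, cell_size=1.0)
```

Keeping the maximum per cell approximates a top-of-canopy surface; keeping the minimum instead would lean toward a bare-earth surface, although real ground classification is considerably more involved.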
Canopy Height (ft)
OpenTopography NSF Earth Science Facility: 3-year support in 2009. Renewed in 2012 (Award No. 1226353 & 1225810 EAR/IF). CI and Science Collaboration: SDSC, ASU & UNAVCO. Related research efforts: NASA ROSES: Extend to satellite-based LIDAR (waveform data); NSF SI2 CyberGIS: OpenTopo as an exemplar of cyber GIS; NSF CluE: Investigate computer science issues in big data. Partnerships with state and local agencies to support data hosting and processing capabilities.
OpenTopography Facility Overview
OpenTopography Data. Large user community with variable needs and levels of sophistication. Goal: maximize access to data to achieve greatest scientific impact. Big data: treat data as an asset that can be used and reused. Co-locate data with on-demand processing.
Data Workflow
1. Original source data from collector (e.g. NCALM)
2. Extract relevant data products
3. Source data is archived in Chronopolis digital preservation network
   1. UCSD Library (Library of Congress)
   2. Three geographically distributed copies of the data
4. Extracted data products go through QA/QC
5. Data transformation and optimization
   1. Error correction
   2. Projection conversion
6. Update metadata: ISO 19115 (Data)
7. Generate additional derived products (e.g. GE hillshades)
Data available via OT
Catalog Service for the Web / DOI (8 & 9 of the Data Workflow). CSW Catalog: ESRI Geoportal Server, ISO 19115 (Data). CZO, CyberGIS, Thomson Reuters Web of Science. DOI: http://dx.doi.org/10.5069/G9ST7MR9 EZID is a service of UC Curation Center of the CDL. Protocol: 'CSW' Profile: 'ArcGIS Server Geoportal Extension
Current Data Holdings. Lidar Point Cloud Datasets: 770 billion+ lidar returns. Each return has additional attributes. On-demand processing capabilities. SRTM, Raster (multiple layers), e.g. Sonoma: several intensity products, canopy height, bare earth, hydro-enforced bare earth, canopy top, etc. 30+ TB of on-demand online data (excluding big custom job runs, original archived data, pre-processing data and user-generated products).
OT CyberInfrastructure [Diagram] Clients: OpenTopography Portal, Other Clients. Application Tier: CSW, PC Data Access Service, SRTM Data Access Service, PC Data Processing Service, SRTM Data Processing Service, DOI. Data and Processing: Point Cloud Data, SRTM Dataset. Infrastructure Tier: Gordon Nodes, SDSC Cloud; OT Resources, XSEDE Resources, SDSC Resources.
OT Challenges (Keeping pace with data and user growth and advancing science!)
LIDAR Point Cloud Data Growth [Chart: series "OpenTopography", seasonal axis from Summer 2005 to Spring 2014; vertical axis 0 to 160, units not given]
Sensor Hardware Technology: Rapid Evolution of Laser Scanner Technology. Data is now collected on multiple channels (different wavelengths), including full waveform data. Early datasets were collected with scanners operating at less than 33 kHz. Current systems collect data at ~900 kHz. Greater Resolution = Larger Data Volumes. Image: RIEGL USA
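A rough back-of-the-envelope calculation shows what the jump in pulse rate means for data volume. This is a sketch under stated assumptions, none of which come from the slides: one recorded return per pulse, the scanner firing continuously for an hour, and 28 bytes per point (the size of a LAS Point Data Record Format 1 record). Because the early scanners ran at less than 33 kHz, the first figure is an upper bound.

```python
SECONDS_PER_HOUR = 3600


def points_per_hour(pulse_rate_hz, returns_per_pulse=1.0):
    """Number of recorded points in one hour of continuous scanning."""
    return pulse_rate_hz * returns_per_pulse * SECONDS_PER_HOUR


def raw_volume_gb(pulse_rate_hz, bytes_per_point=28, returns_per_pulse=1.0):
    """Approximate uncompressed point volume per hour, in gigabytes (10**9 bytes).

    bytes_per_point=28 is an assumption (LAS point record format 1).
    """
    return points_per_hour(pulse_rate_hz, returns_per_pulse) * bytes_per_point / 1e9


early_gb = raw_volume_gb(33_000)      # early scanners, upper bound
current_gb = raw_volume_gb(900_000)   # current systems, approximate
growth_factor = current_gb / early_gb

print(f"early:   {early_gb:.2f} GB/hour")
print(f"current: {current_gb:.2f} GB/hour")
print(f"growth:  {growth_factor:.1f}x")
```

Multiple returns per pulse, extra channels, and full waveform capture would each push the per-hour volume higher than this single-return estimate.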
Diverse and complex datasets support: discrete return lidar; full waveform lidar; optical imagery (R,G,B orthophotography); hyperspectral imagery. NEON hyperspectral imagery captures light reflected across the electromagnetic spectrum in a total of 426 bands of information. Image: Nathan Leisso, NEON AOP
User Growth [Chart: OT registered user growth, Aug-08 to Nov-14; vertical axis 0 to 7000] CyberGIS, NASA/UNAVCO communities. Service Level Access.
Pluggable Services Infrastructure. Methods for scientific data processing are evolving. Users are demanding more processing services. Pluggable services infrastructure: OT development sandbox; assist researchers with their code; deployment of the algorithm as a service; update processing workflow and UI. Increase in user-generated derived products!
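One common way to build a pluggable services layer is a registry that maps service names to processing functions, so a researcher's algorithm can be added without touching the dispatch code. The sketch below illustrates that general pattern only; the names `register_service`, `run_service`, and the two sample algorithms are hypothetical and are not taken from the OpenTopography codebase.

```python
# Hypothetical registry: service name -> callable taking a list of (x, y, z) points.
_SERVICES = {}


def register_service(name):
    """Decorator that plugs a processing function into the registry."""
    def decorator(func):
        if name in _SERVICES:
            raise ValueError(f"service already registered: {name}")
        _SERVICES[name] = func
        return func
    return decorator


@register_service("mean_elevation")
def mean_elevation(points):
    """Average z value of the supplied points."""
    return sum(z for _, _, z in points) / len(points)


@register_service("elevation_range")
def elevation_range(points):
    """Difference between the highest and lowest z value."""
    zs = [z for _, _, z in points]
    return max(zs) - min(zs)


def available_services():
    """Names a portal UI could list for the user, in alphabetical order."""
    return sorted(_SERVICES)


def run_service(name, points):
    """Look up a registered service by name and run it on the points."""
    if name not in _SERVICES:
        raise KeyError(f"unknown service: {name}")
    return _SERVICES[name](points)


demo_points = [(0.0, 0.0, 10.0), (1.0, 1.0, 20.0), (2.0, 2.0, 30.0)]
print(available_services())
print(run_service("mean_elevation", demo_points))
```

With this shape, deploying a new algorithm as a service amounts to adding one decorated function, and the workflow and UI can discover it through `available_services()`.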
Data Insights What can we learn from: 40,000+ custom PC jobs 1.2 trillion lidar points processed 15,000+ custom raster jobs (past year)
Data Access Patterns Source: http://www.businessinsider.com/chart-of-the-day-the-lifecycle-of-a-youtube-video-2010-5
Access Patterns in LIDAR Scientific Datasets: B4 San Andreas Fault [Chart: vertical axis 0 to 160; axis labels not recovered]
Access Patterns in LIDAR Scientific Datasets: Event-based datasets, El Mayor-Cucapah Earthquake [Chart: vertical axis 0 to 140; axis labels not recovered]
Access Patterns in LIDAR Scientific Datasets: Event-based datasets, Haiti Earthquake [Chart: vertical axis 0 to 160; axis labels not recovered]
Data Usage Analytics: Northern San Andreas Fault. Social networking with data. Recommendation system.
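A recommendation system over usage data can start from something as simple as co-occurrence counting: datasets accessed by the same users are suggested together. The sketch below shows that generic technique, not the method used by OpenTopography. The dataset names come from these slides, but the user histories are invented for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(user_datasets):
    """Count how often each pair of datasets is accessed by the same user.

    user_datasets maps a user id to the set of dataset names that user accessed.
    """
    co = {}
    for datasets in user_datasets.values():
        for a, b in combinations(sorted(datasets), 2):
            co.setdefault(a, Counter())[b] += 1
            co.setdefault(b, Counter())[a] += 1
    return co


def recommend(dataset, co, top_n=3):
    """Datasets most often co-accessed with `dataset`; ties broken by name."""
    ranked = sorted(co.get(dataset, Counter()).items(),
                    key=lambda item: (-item[1], item[0]))
    return [name for name, _ in ranked[:top_n]]


# Hypothetical access histories (not real usage data).
history = {
    "user_a": {"Northern San Andreas Fault", "B4 San Andreas Fault"},
    "user_b": {"Northern San Andreas Fault", "B4 San Andreas Fault",
               "El Mayor-Cucapah Earthquake"},
    "user_c": {"Northern San Andreas Fault", "Haiti Earthquake"},
}
co = build_cooccurrence(history)
suggestions = recommend("Northern San Andreas Fault", co)
print(suggestions)
```

Raw counts favor popular datasets; a production system would likely normalize by dataset popularity or weight recent activity more heavily.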
Understanding Traffic Patterns: impact of dataset releases and workshops on traffic [Chart: Jan-13 to Dec-13; Page Visits axis 0 to 14000; Registered Users axis 0 to 200; series: New Users, Registered users, Daily visits; annotation: SRTM Global Dataset Released]
Tiered storage based on Data Access: activity-based data ranking and tiered cloud-integrated storage. SSD: new and existing most active data. Regular Disk: between hot and cool. Cloud: least active.
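The tiering rule on this slide can be sketched as a ranking problem: sort datasets by recent access count, give the top ranks to SSD, the next ranks to regular disk, and send the rest to cloud storage, with newly released datasets always starting on SSD. The slot counts, dataset names, and access counts below are hypothetical; the slides do not give the actual thresholds or ranking metric.

```python
def assign_tiers(access_counts, new_datasets=frozenset(), ssd_slots=1, disk_slots=2):
    """Assign each dataset to 'ssd', 'disk', or 'cloud' by access rank.

    access_counts maps dataset name -> number of accesses in the ranking window.
    Datasets in new_datasets go to 'ssd' regardless of their access count.
    Slot counts are illustrative assumptions, not values from the slides.
    """
    tiers = {name: "ssd" for name in new_datasets}
    # Rank existing datasets from most to least active; break ties by name.
    existing = sorted(
        (name for name in access_counts if name not in new_datasets),
        key=lambda name: (-access_counts[name], name),
    )
    for rank, name in enumerate(existing):
        if rank < ssd_slots:
            tiers[name] = "ssd"
        elif rank < ssd_slots + disk_slots:
            tiers[name] = "disk"
        else:
            tiers[name] = "cloud"
    return tiers


# Hypothetical access counts for the ranking window.
counts = {
    "dataset_a": 140,
    "dataset_b": 95,
    "dataset_c": 40,
    "dataset_d": 12,
    "dataset_e": 3,
    "dataset_new": 0,
}
placement = assign_tiers(counts, new_datasets={"dataset_new"})
for name in sorted(placement):
    print(name, placement[name])
```

Re-running the assignment periodically lets an event-driven spike in interest promote a dataset back to a faster tier.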
Cloud Computing: Cost effectiveness & feasibility of data science facilities on the cloud. Microsoft Azure for Research Award: Integration of cloud-based on-demand geospatial processing services into community earth science data facilities.
Leveraging HPC: Dedicated Gordon nodes via XSEDE (democratization of supercomputing resources). GEO Data and Cyberinfrastructure Imperative: Harness the Power of Computing and Computational Infrastructure. - GEO Priorities and Frontiers: 2015-2020
Summary: OpenTopography is a modern, agile data facility. Cyberinfrastructure driven by science use cases. CI and science collaboration. Big Data needs to be usable - the community wants not only access to data but also tools for processing these data. The concept of OT can be used as a template for other large data facilities.
Thank You! White River, IN. Credit: Indiana Geological Survey/The State of Indiana viswanat@sdsc.edu info@opentopography.org @OpenTopography