EcoTrends Cyber-infrastructure Development

1 Long Term Ecological Research Network Office
EcoTrends Cyber-infrastructure Development
Mark Servilla, LTER Network Office
LTER Information Managers Annual Meeting
San Jose, California, 2-5 August 2007

2 Building Blocks to Success
- EcoTrends: NIS module
- PASTA: NIS Module Framework
- Metacat/EML: metadata and data management
- PostgreSQL/Java/Tomcat: PostgreSQL RDBMS; Java Servlet, JSP, and R programming
- Community: support for data collection, documentation, and accessibility

3 PASTA Architecture
- Existing LTER metadata infrastructure (Metacat and EML)
- Metadata describing derived data, including data provenance and data versioning; expand on community provenance research
- Standard interfaces to support various web portals (e.g., Trends, GEOSS, GEON, NEON, WATERS) and web service APIs
- Source data cache available to all workflow engines
- Data loading for synthetic processing based on events (e.g., new data, metadata change)
- Support for multiple scientific workflow engines (e.g., R script, Kepler, Chimera, D2K)
- Metadata and derived data products; metadata as EML
Diagram components: Source A, Source B, Source C; EML; Metacat; Harvester; EML.xml; Metadata; Parser; Loader; Dataset Registry; Cache; Workflow Engine; Derived Data; Web API (HTML, SOAP)
Diagram legend: Site data/metadata; Existing infrastructure; New infrastructure; Pluggable work flows; Derived data management; User interfaces

4 EcoTrends Development 2007
- EcoTrends development realm
Diagram components: Source A, Source B, Source C; EML; Metacat; Harvester; EML.xml; Metadata; Parser; Loader; Dataset Registry; Cache; Workflow Engine; Derived Data; Web API (HTML, SOAP)

5 Development Process
- Editorial and technical committees, and LNO
- Use-case
- Technical committee, NISAC, and LNO
- Requirements
- Project Plan
- Editorial and technical committees, and LNO
- Milestones
- Iterative solutions: Coding, Testing, Release

6 Major Milestones
- EML generation
- Derived data loading
- Website presentation/integration
- Data discovery and presentation
  - Browse (by site, by topic/sub-topic)
  - Search (simple keyword, advanced)
  - Result (result set display, dataset display, plot display)
- Data exploration
  - Graphing (single and multiple datasets)
  - Aggregation (temporal)
  - Download (data and metadata)
- Site auditing/DAS
  - Web page, data access, and plot auditing
  - Use-statistics and data access policy conformance

7 EML Generation
- Step 1: Core Metadata. Define core metadata (e.g., contact information) that is repeated in all EML documents.
- Step 2: File Name Parsing. Parse the derived data file names for site/station, variable, unit, and timescale metadata.
- Step 3: Derived Data Analysis. Analyze derived data for temporal coverage and data value bounds.
- Step 4: R Script Analysis and Inclusion. Include in the methods section of the EML the R script used to generate derived data, along with any annotation associated with a specific derived data product.
- Step 5: Manual Documentation. Include both non-automated metadata and tacit knowledge metadata in the EML.
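Step 3 can be illustrated with a minimal sketch. It assumes each derived record has already been read into a (start date, end date, observation) tuple; the function name `summarize` and the sample values are hypothetical and are not taken from the EcoTrends code base.

```python
from datetime import date


def summarize(records):
    """Return temporal coverage and value bounds for derived records.

    records: iterable of (start_date, end_date, obs) tuples; obs may be None
    when an observation is missing.
    """
    records = list(records)
    if not records:
        raise ValueError("no records to summarize")
    # Missing observations are ignored when computing the value bounds.
    values = [obs for _, _, obs in records if obs is not None]
    return {
        "begin": min(start for start, _, _ in records),
        "end": max(end for _, end, _ in records),
        "minimum": min(values) if values else None,
        "maximum": max(values) if values else None,
    }


# Illustrative yearly records (made-up values).
sample = [
    (date(2001, 1, 1), date(2001, 12, 31), 12.5),
    (date(2002, 1, 1), date(2002, 12, 31), None),
    (date(2003, 1, 1), date(2003, 12, 31), 9.0),
]
summary = summarize(sample)
```

The resulting dictionary holds the kind of values that would be written into the temporal coverage and attribute bounds of the generated EML.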

8 Derived Data Loading
Parse data and load relational database
Record level attributes:
  PRIMARY_KEY :: INTEGER
  START_DATE :: DATESTAMP
  END_DATE :: DATESTAMP
  OBS :: FLOAT
  N_EXPECTED :: INTEGER
  S_DEV :: FLOAT
  S_ERR :: FLOAT
  PROP_MISSING :: FLOAT
  PROP_QUESTIONABLE :: FLOAT
  PROP_ESTIMATED :: FLOAT
  PROP_TRACE :: FLOAT
  PROP_INVALID :: FLOAT
  COMMENT :: TEXT
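The record-level attributes above map naturally onto a single relational table. The sketch below uses Python's built-in `sqlite3` module as a stand-in for PostgreSQL so that it runs anywhere. The table name `derived_record`, the storage of dates as ISO text, and the sample row are assumptions made for illustration.

```python
import sqlite3

# One column per record-level attribute listed on the slide.
DDL = """
CREATE TABLE derived_record (
    primary_key       INTEGER PRIMARY KEY,
    start_date        TEXT,   -- ISO date string (assumption)
    end_date          TEXT,   -- ISO date string (assumption)
    obs               REAL,
    n_expected        INTEGER,
    s_dev             REAL,
    s_err             REAL,
    prop_missing      REAL,
    prop_questionable REAL,
    prop_estimated    REAL,
    prop_trace        REAL,
    prop_invalid      REAL,
    comment           TEXT
)
"""

conn = sqlite3.connect(":memory:")
conn.execute(DDL)

# A made-up record, loaded with a parameterized insert.
row = (1, "2006-01-01", "2006-12-31", 245.3, 365, 12.1, 0.63,
       0.02, 0.0, 0.01, 0.0, 0.0, "example row")
conn.execute(
    "INSERT INTO derived_record VALUES (?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ?)",
    row,
)
loaded = conn.execute(
    "SELECT obs, n_expected, comment FROM derived_record"
).fetchone()
```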

9 Website Presentation
- Initial design and development
- EcoTrends editorial committee
- Electric Sage Designs, LLC
- Laura Downey, Usability Engineer, SEEK Project

10 Website Integration
- Stage 1: Apache, PHP, CSS, JavaScript, and MySQL
- Refactor: refactor original website to reflect consistency and modularity; modify CSS for application specific design (e.g., table layout)
- Stage 2: Apache, PHP, CSS, JavaScript, and MySQL
- Refactor: convert all PHP functionality to equivalent Java Server Page (JSP); integrate Metacat based content
- Stage 3: Tomcat, Servlet, JSP, CSS, JavaScript, and Metacat

11 Data Discovery and Presentation

12 Data Discovery and Presentation

13 Data Discovery and Presentation

14 Data Discovery and Presentation

15 Data Exploration
- Graphing (single and multiple datasets)
- Aggregation (temporal)
- Download (data and metadata)

16 Site Auditing/DAS
- Web page auditing
- Data access auditing
- Plot auditing
- Use-statistics and data access policy conformance

17 Parting shot

18 PASTA Architecture
- Existing LTER metadata infrastructure (Metacat and EML)
- Metadata describing derived data, including data provenance and data versioning; expand on community provenance research
- Source data cache available to all workflow engines
- Data loading for synthetic processing based on events (e.g., new data, metadata change)
- Standard interfaces to support various web portals (e.g., Trends, GEOSS, GEON, NEON, WATERS) and web service APIs
- Support for multiple scientific workflow engines (e.g., R script, Kepler, Chimera, D2K)
- Metadata and derived data products; metadata as EML
Diagram components: Source A, Source B, Source C; EML; Metacat; Harvester; EML.xml; Metadata; Parser; Loader; Dataset Registry; Cache; Workflow Engine; Derived Data; Web API (HTML, SOAP)

19 PASTA Application Stack
Axis: (temporal-spatial-organizational) Scale
Layers, from top to bottom:
- Network-level Synthesis
- Network interface: Web API - Portal
- Standardized data products: Derived Database
- Metadata Harvest
- Data transformation and integration: Workflow Engine
- Site-level data archive: Cache Database
- Dataset identification and loading: Registry/Parser/Loader
- Existing EML Harvesting: EML, Metacat, and Harvester
- Site Data and EML Metadata

20 Generalized Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

21 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

22 LTER Site Data Collection
- Time-series data
  - Physical environment (e.g., climate, ...)
  - Human population and economy
  - Biogeochemistry
  - Biotic structure
- Data/metadata
  - Relational Database
  - Spreadsheet
  - Text file
  - HTML/XML

23 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

24 EML, Metacat, and the Harvester
- Existing LTER investment in technology
- EML Package ID: knb-lter-site.xx.yy
  knb-lter-sev knb-lter-sev knb-lter-sev
- Metacat stores the XML of EML; new revisions take precedence, and old revisions are deprecated but not deleted
- Harvester is a time-based update process that pulls site EML and inserts it into Metacat
Diagram components: EML; Source A, Source B, Source C; Metacat; Harvester
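The revision rule on this slide (new revisions take precedence, old ones are kept) can be sketched as follows. The sketch assumes a package ID has the form scope.identifier.revision, as in knb-lter-site.xx.yy; the helper names and the numeric suffixes on the sample IDs are hypothetical.

```python
from typing import NamedTuple


class PackageId(NamedTuple):
    scope: str
    identifier: int
    revision: int


def parse_package_id(text):
    """Split 'scope.identifier.revision' on its last two dots."""
    scope, identifier, revision = text.rsplit(".", 2)
    return PackageId(scope, int(identifier), int(revision))


def latest_revisions(package_ids):
    """Return the highest revision seen for each (scope, identifier)."""
    latest = {}
    for pid in map(parse_package_id, package_ids):
        key = (pid.scope, pid.identifier)
        if key not in latest or pid.revision > latest[key].revision:
            latest[key] = pid
    return latest


# Illustrative IDs; the numbers are made up.
current = latest_revisions([
    "knb-lter-sev.1.1",
    "knb-lter-sev.1.3",
    "knb-lter-sev.2.1",
    "knb-lter-sev.1.2",
])
```

Older revisions are simply not selected here; nothing is deleted, which mirrors the deprecate-but-keep behavior described above.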

25 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

26 EML Loader/Parser
- Dataset registry identifies Trends data in Metacat
- New revisions assert a new data load
- The EML parser/loader*:
  - Translates the site EML into the RDBMS DDL
  - Creates a new DB table in the primary database based on the revision
  - Loads the new data into the primary database
  - Triggers the workflow to continue
*Collaboration with NCEAS/SEEK
Diagram components: EML; Source A, Source B, Source C; Metacat; Harvester; Parser; Loader; Cache; Dataset Registry
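The translation of site EML into RDBMS DDL can be sketched under simplifying assumptions: the attribute names and storage types have already been extracted from the EML document, and the type mapping, table name, and function name below are hypothetical rather than those of the actual parser/loader.

```python
# Hypothetical mapping from EML storage types to PostgreSQL column types.
TYPE_MAP = {
    "dateTime": "TIMESTAMP",
    "float": "DOUBLE PRECISION",
    "integer": "INTEGER",
    "string": "TEXT",
}


def build_ddl(table, attributes):
    """Build a CREATE TABLE statement from (name, storage_type) pairs.

    Unknown storage types fall back to TEXT.
    """
    columns = ",\n".join(
        f"    {name} {TYPE_MAP.get(kind, 'TEXT')}" for name, kind in attributes
    )
    return f"CREATE TABLE {table} (\n{columns}\n);"


# Illustrative table name and attributes.
ddl = build_ddl(
    "mcm_wind_r1",
    [("date_time", "dateTime"), ("wdir", "float"), ("wspd", "float")],
)
```

Generating one table per revision keeps each snapshot of the site data separate, at the cost of more tables in the cache database.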

27 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

28 Workflow Data Transformation
- Cache database stores site data in native site schema, based on snap-shot version
- Workflow Engine:
  - reads native schema
  - performs transformation/integration
  - writes to global schema
  - produces EML metadata
- Derived database stores derived data in consistent global schema
Diagram components: Metadata; Cache; Workflow Engine; Derived Data

29 Site to Global Schema Mapping
Site schema: MCM Canada Glacier Wind (mapping triggered by data load)
  date_time: Timestamp of observation, 15 min interval
  wdir: Wind direction (azimuth)
  wdirstd: Standard deviation of wind direction
  wspd: Wind speed, meters/second
  wspdmax: Maximum wind speed, meters/second
  wpsdmin: Minimum wind speed, meters/second
Global schema:
  Wind direction (knb-eco-trends.1.1): Timestamp (daily), value
  Wind direction std dev (knb-eco-trends.2.1): Timestamp (daily), value
  Wind speed max (knb-eco-trends.5.1): Timestamp (daily), value
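As an illustration of one such mapping, the sketch below reduces 15-minute wspdmax observations to the daily (timestamp, value) shape of the Wind speed max product. The slide does not state the aggregation rule; taking the daily maximum is an assumption, and the function name and sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime


def daily_max(observations):
    """Reduce (timestamp, value) pairs to one maximum value per calendar day.

    Observations with a value of None are skipped.
    """
    by_day = defaultdict(list)
    for stamp, value in observations:
        if value is not None:
            by_day[stamp.date()].append(value)
    return {day: max(values) for day, values in sorted(by_day.items())}


# Made-up 15-minute readings in meters/second.
readings = [
    (datetime(2007, 1, 1, 0, 0), 3.2),
    (datetime(2007, 1, 1, 0, 15), 5.1),
    (datetime(2007, 1, 1, 0, 30), None),
    (datetime(2007, 1, 2, 0, 0), 2.4),
]
daily = daily_max(readings)
```

Wind direction would need a different treatment, since azimuth values are circular and a plain arithmetic mean is not appropriate for them.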

30 Global Schema
Table name: knb_eco_trends_1_1
  scope: knb_eco_trends
  identifier: 1
  revision: 1
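The naming convention shown here can be sketched as a small function that turns a package ID such as knb-eco-trends.1.1 into the table name knb_eco_trends_1_1. The function name is hypothetical, and replacing every non-alphanumeric character in the scope with an underscore is an assumption about how the rule generalizes.

```python
import re


def global_table_name(package_id):
    """Derive a table name from a 'scope.identifier.revision' package ID."""
    scope, identifier, revision = package_id.rsplit(".", 2)
    # Hyphens and other punctuation are not valid in unquoted SQL identifiers.
    safe_scope = re.sub(r"[^A-Za-z0-9]", "_", scope)
    return "_".join([safe_scope, identifier, revision])


name = global_table_name("knb-eco-trends.1.1")
```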

31 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

32 EML for derived data
- EML metadata is generated for the derived data and inserted into Metacat
- Derived data is now accessible through all Metacat user interfaces
Diagram components: EML; Metacat; Harvester; EML.xml; Metadata; Workflow Engine; Derived Data

33 Decomposed Workflow
1. Sites collect and document time-series observation data (e.g., climate, social-economics, ...)
2. Sites update EML with a new revision indicating new data
3. EML is harvested into Metacat
4. EML Loader/Parser loads new/updated dataset into cache database
5. Workflow Engine transforms cache data into derived data
6. Transformed data is stored in derived database
7. EML is generated for derived data and is stored in Metacat
8. Derived data is made available through web portal

34 Web API
- Store Front provides API to derived data products in secondary DB
- HTML today; Web service tomorrow
- Issues:
  - Authentication
  - Authorization
  - Provenance
  - Quality
- Interactive Plots (beta site location)
Diagram components: EML; Metacat; Harvester; EML.xml; Metadata; Derived Data; Web API (HTML, SOAP)

35 Parting shot