Cloud based platforms for Linking and Managing Geodata Dimitris Kotzinos ETIS Lab. (ENSEA/UCP/CNRS UMR 8051) & Department of Computer Science, University of Cergy - Pontoise
21st Century Science Theory Theory is augmented and explored through Computa8on Data IntegraOon & Hypotheses are discovered in Linking Data and drive Theory Data Storage Data Mining (Cloud) CompuOng Computa8ons generate Data Theory suggests hypotheses that are verified through Experiment Data Experiments generate Data Computation Computa8ons inform the design of Experiments Experiment & Observation 3
Key Challenges of Data Intensive Science(s) Large volumes High-throughput Relating and Linking Heterogeneity Complexity 4
Inspired GeoData Cloud Services The project wants to demonstrate that a Cloud infrastructure can be used by public organisa7ons to provide more efficient, scalable and flexible services for crea7ng, sharing and dissemina7ng spa7al environmental data
ID- Card of the Project Who? 5 Geological Surveys bringing in 6 inioal Use Cases (datasets and applicaoons) Ground Water Management Geo- Hazards: Landslides, Earthquakes GeoPublicaOon and Web Mapping made easy 3 ICT organizaoons bringing key- experose Cloud CompuOng GIS SemanOc Web and Linked Data So\ware architecture and integraoon EC Support
Why InGeoCloudS Some common characteris7cs about Geo- Applica7ons Usage The Digi;zed Earth: avalanche of data generated by various insotuoons all over Europe in numerous domains: Geography, Earth observaoon, Geology, Public AdministraOon, Private Geo- Companies Data intensity scenarios: storage, distribuoon, etc. Compu;ng- intensity scenarios: geo- processing on demand, procurement and release of resources Access- intensity scenarios: World- wide and/or concurrent access to informaoon in case of parocular events
Why InGeoCloudS Considering INSPIRE?
Why considering the Cloud? Promises of the Cloud Easy procurement (on- demand, self- service) TCO SimplificaOon and ReducOon Power supply, physical space, hardware, cables, network devices «Unlimited» provisioning of resources (up to your wallet depth) Rapidity versus classical ordering processes Technical staff now available for tasks with more added- value Security, Availability, Quality of Service, Performance Very performant (virtual) hardware available Broad / Ubiquitous network access Up- to- date so\ware pieces, OS Possible Scale- economy through pooling and share with peers (consolidaong different types of resources such as Database servers, applicaoon servers ) FacilitaOon for up- taking standards
ID- Card of the Project Achievements Fundamental scalable/elas;c services for data management: Database Server, File Server, Linked Data Store Geo Data publica;on services (SaaS mode) An API available publicly: RESTful Web Services upon a loose- coupled architecture Data providers data and applica;ons in the cloud Generic Services, Integra;on Portal and Management Tools
Pilot DescripOon A Portal
Pilot DescripOon Integrated Geo Applica7ons
the <anything>aas Trend IaaS Elements Servers Network Data storage AdministraOon APIs DaaS Elements GSOM LOD APIs Data storage, import, sync Meta- data descripoon, PaaS Elements AuthenOcaOon Roles implementaoon Workspaces, IntegraOon Portal RESTful APIs Basic Mapping Framework SaaS Elements Geo applicaoons (partner Use Cases), Geo PublicaOon, Catalogues Management, Opensearch Queries Geo- Processing
Pilot2 DescripOon The Whole Picture
Linked Data Management
Problems to tackle Data integraoon Data retrieval in a uniform way Data export to other formats November 25, 2014 16
The Web of data becomes a reality by connecong so far isolated islands or data silos such as Enterprise silos (Disparate DBMS Engines & Development Frameworks) Social Media silos (Social Networking Services, Discussion Forums, Blogs, Wikis etc.) ScienOfic silos (in biology, physics, earth sciences, etc.) Linked Open Data (LOD) is a way of publishing data on the (SemanOc) Web that: Encourages reuse Promotes its (real & potenoal) inter- connectedness Enables network effects to add value to data Informa;on Integra;on November 25, 2014 Linked Open Geodata Source: flickr hmp://www.flickr.com/photos/docsearls/5500714140/ 17
Extensible Applica;on Pool A P I Rel DB VisualizaOon Data IntegraOon Layer Rel DB Query Engine Excel files Linked Open Data as Service CollaboraOon Query Update Import Export Linked Data External Resources Cross Data sets' Querying XML files November 25, 2014 AbstracOon layer for data access abstract the applica7ons from the specific setup of the data management service (such as local vs. remote, federa7on, and distribu7on) Beyond Data Access Enabling automaoon of discovery, composioon, and use of datasets Data Markets Online VisualizaOon Services Data Publishing SoluOons Data Aggregators BI / AnalyOcs as a Service 18
Available Datasets ScienOfic data described in different formats and nooons from diverse areas ground waters, boreholes, chemical analyses, etc. GEUS, EKBAA, BRGM earthquakes, recordings, staoons, etc. EPPO Landslides, suscepobility maps, etc. GEO- ZS, EKBAA Purpose: integrate, describe and query such heterogeneous data in a uniform way November 25, 2014 19
Work Approach CreaOon of a Conceptual Model to integrate and cover all the themaoc fields Map the source relaoonal data into RDF data compliant to the Conceptual Model Provide unique URIs for data Rely on a scalable RDF Triple Store to enforce the mappings and enable the storage and query of the RDF data Provide an API for the RDF data management November 25, 2014 20
Models, Mappings and Integra;on Providers Data results transform GSOM mappings query tsv/csv xml json GeoJSON KML Shapefile. export xml trig trix n3 formats Inspire compliant 21 November 25, 2014
Conceptual Model Our model: needs to be a generic, extendable and interoperable, which relies not only to geographic data Integrates different datasets created from sources w.r.t. their semanocs developed for the documentaoon of scienofic observaoon measurement and sampling An event- centric model that captures complex contexts where Ome, place and parocipants are associated. Capable to describe complex rela;onships to integrate or connect rich informaoon November 25, 2014 22
GEUS data - > Conceptual Model Borehole and Loca;on E55.Type Depth P2F.has_type E54.Dimension DRILLDEPTH P90F.has_value E60.Number E42.Iden;fier BOREHOLENO E41.Appela;on BOREHOLENAME P1F.is_idenOfied_by S16.Borehole P43F.has_ Dimension O9.contains_or_ confines S15.Acquifer_Concept INTAKE P1F.is_idenOfied_ by E42.Iden;fier INTAKENO O7.consists_of S17.Borehole_Collar P87F.is_idenOfied_by E47.Spa;al_Coordinates XUTM E47.Spa;al_Coordinates YUTM P2F.has_type E55.Type EPSG:32632 E47.Spa;al_Coordinates LATITUDE E47.Spa;al_Coordinates ELEVATION E47.Spa;al_Coordinates LONGITUDE P2F.has_type E55.Type EPSG:4326 P87F.is_idenOfied_by P87F.is_idenOfied_by E47.Spa;al_Coordinates XUTM32EUREF89 E47.Spa;al_Coordinates YUTM32EUREF89 E47.Spa;al_Coordinates ZDVR90 P2F.has_type P2F.has_type E55.Type UTM32EUREF89 E55.Type DVR90 November 25, 2014 23
Data Mappings to Conceptual Model The data providers relaoonal data were informally mapped into our Conceptual Model Such mappings can be formally expressed using R2RML R2RML is a W3C RecommendaOon hmp://www.w3.org/tr/r2rml/ Enables the automaoc creaoon and synchronizaoon of the Linked Data w.r.t. the RelaOonal Data Some mappings have already been expressed using R2RML November 25, 2014 24
Conceptual Model and INSPIRE INSPIRE (Infrastructure for SpaOal InformaOon in Europe) covers 34 SpaOal Data Themes. It uses ISO 19100 series standards; oriented to environmental and spa;al data The Generic conceptual schema is not event- oriented (especially the schema that uses the observaoon- measurement specificaoon) Our data and INSPIRE Data are not INSPIRE- compliant but there is a specific INSPIRE- to- model mapping that enables making queries and presenong accordingly the results Exported Data are made INSPIRE- compliant. November 25, 2014 25
Why do not use INSPIRE directly? INSPIRE is not made for informaoon integraoon INSPIRE lacks semanoc clarity of concepts in various places, i.e. An observa7on is an act that results in the es8ma8on of the value of a feature property Confusion of the acoon (event) aka observaoon with the result The observa8on itself is also a feature, since it has proper8es and iden8ty. ObservaOon is the value of a feature property and also a feature November 25, 2014 26
INSPIRE schemas How many observations definitions (structures) INSPIRE includes? 1) ISO 19156 Observations and Measurements/OM Observation 2) GML Observation 3) Observation OGC-WaterML:Observation Process, MeasurementTimeseries `and TimeseriesObservation: What is their relation with the other observation schemas(iso 19156)? 1) ISO 19156 2) GML 3) WaterML
Conceptual Model and INSPIRE November 25, 2014 28
Data Exports Query results viewed as: RDF/XML JSON GML KML Shapefiles TransformaOon services to user provided conceptualizaoons Export to INSPIRE compliant XML formats November 25, 2014 29
Linked Data Storage - Virtuoso Virtuoso Universal Service is a hybrid of DB engine & middleware combining the func. of a RDBMS, ODBMS, and a RDFStore Supports various Internet & Web protocols: HTTP(s), WebDav, SOAP, UDDI, SPARQL Implemented various data access APIs: ODBC, JDBC, OLE DB, ADO. NET Provides a Web- based DB AdministraOon UI Provides an open- source version Can create a clustered- based system of RDFStores + a non- free version available in the Cloud November 25, 2014 30
Reasons for selecoon: Scalable to handle millions of triples + exhibits a very good performance Offers API for RDF data management Supports SPARQL and SPARUL Supports 2- D features and offers spaoal operators Supports R2RML and provides the respecove relaoonal- to- RDF data mapping mechanism Offers a cloud- based version Scalability / ElasOcity Linked Data Storage - Virtuoso Planned at least 2 instances of ElasOc LD Storage to be available in the plaxorm to handle the query load (cross- provider queries are more demanding) The elasoc capabilioes of Amazon Cloud are exploited to increase/decrease the number of instances when the load surpasses or goes below a specific threshold November 25, 2014 31
Linked Data Storage and Querying Generated Linked Data for ground waters (EKBAA, BRGM, GEUS), earthquakes, landslides Successfully inserted ~2B triples into Virtuoso Created specific queries For each partner s dataset Combining similar nooons from cross provider datasets RelaOve small execuoon Omes depending on the number of fetched results + query complexity At the moment using only 1 instance in the Amazon Cloud Created the R2RML mappings for ground water datasets November 25, 2014 32
Conceptual Data Integra;on & Linking API Methods for: Querying data/metadata via SPARQL (both provider- specific & cross- provider queries are supported) ImporOng data/metadata in various formats (both blocking and non- blocking depending on the size of meta- data to be imported) Direct through respecove API method Data provider should be responsible of synchronizing his/her relaoonal and RDF data Indirect through the API method for the definioon of relaoonal- to- RDF mappings Apart from the definioon of mappings, the data provider does not need to perform any kind of synchronizaoon, as the system will be responsible for performing such a synchronizaoon on his/her behalf ExporOng data/metadata in various formats (inline in the response or available in a specific URL) UpdaOng data/metadata via SPARUL InserOng/Defining relaoonal- to- RDF data mappings via R2RML language November 25, 2014 33
Example Cross query over datasets Find the boreholes and their locaoons in all datasets if ground water samples where taken from them a\er the date 01/01/2005. select?bhole?date?laotude?longitude where { { select?bhole?date?laotude?longitude where {?st sci:o5.removed?sample; } crm:p4f.has_ome- span?date; sci:o4.sampled_at?place; crm:p2f.has_type "GRWATER_SAMPLING".?bhole sci:o9.contains_or_confines?place; a sci:s16.borehole; sci:o7.consists_of?bcollar.?bcollar crm:p87f.is_idenofied_by?laotude; crm:p87f.is_idenofied_by?longitude. filter (regex(?laotude, "LaOtude")). filter (regex(?longitude, "Longitude")). filter(?date > "2005-01- 01"^^xsd:date). } limit 100 Cont d November 25, 2014 UNION { select?bhole?date?laotude?longitude where {?anal crm:p2f.has_type "QUALITY_MEASUREMENT"; crm:p4f.has_ome- span?date; sci:o4.sampled_at?bhole;?bhole sci:o7.consists_of?bcollar.?bcollar crm:p87f.is_idenofied_by?laotude; crm:p87f.is_idenofied_by?longitude. filter (regex(?laotude, "LaOtude")). filter (regex(?longitude, "Longitude")). filter(?date > "2005-01- 01"^^xsd:date). } limit 100 } UNION { select?bhole?date?laotude?longitude where {?st crm:p4f.has_ome- span?date; sci:o4.sampled_at?bhole.?bhole a sci:s16.borehole; sci:o7.consists_of?bcollar.?bcollar crm:p87f.is_idenofied_by?laotude; crm:p87f.is_idenofied_by?longitude. filter (regex(?laotude, "LaOtude")). filter (regex(?longitude, "Longitude")). filter(?date > "2005-01-01"^^xsd:date). } limit 100 34
Example November 25, 2014 35
Querying Linked Data Store Geoprocessing implemented as a WPS service: refers to ordinary kriging interpolaoon WPS is using ElasOcWebServer and uses data: Given by the user / Fetched from InGeoCloudS RDF Triplestore
Data IntegraOon & Linking Prototype
Prototype Show benefits of LD technology Link in a meaningful and straighxorward way internal and external data obtained from web services: automaocally by the prototype & manually via the UI Develop a complete applicaoon which exhibits query, visualizaoon & linking funcoonality Indicate that such applicaoon can be easily built by exploiong the Linked Data Management API as well as web- based technologies (HTML and javascript)
Prototype Overview Available at: hmp://portal.ingeoclouds.eu/sitools/client- user/ DataProviderToolkit/project- index.html Two themes are considered: Earthquake management Landslide management External sources exploited: Earthquakes: Earthquake Data Portal ( hmp://www.seismicportal.eu/servicesdoc/search- event.html) Weather Info: Weather Underground (www.wunderground.com) Points of Interest: LinkedGeoData portal ( hmp://linkedgeodata.org/sparql) External services drawn either through SPARQL Queries (case of points of interest) or calls in REST- based web services (case of earthquakes and weather informaoon)
Technical Details HTML, & javascript used for the implementaoon plus parocular methods of the Linked Data Management API (ldquery, earthquakes, landslides, ldupdate) Google Maps API (javascript- based) exploited for map visualizaoon and loading of KML results External services drawn either through SPARQL Queries (case of points of interest) or calls in REST- based web services (case of earthquakes and weather informaoon)
Main Func;onality VisualizaOon of web form & map User fills in the fields for a parocular theme Default values are provided User is corrected when wrong values are given User presses the Submit bumon Depending on user choice on output format, either the map is updated with markers denoong the different types of (related) data/features or a specific textarea is completed with the resulong XML or JSON file By selecong two markers, the user can then link them by pressing the Link Resources bumon: Link properoes supported: rfds:seealso (see addioonal info), owl:sameas (markers represent equivalent informaoon), and sci:o21_triggers (one has triggered the other, e.g., severe weather condioons a landslide)
Concluding remarks
1 billion triples in the RDF triplestore published as Linked Open Data (LOD) Volume Datasets from 3 different scienofic fields Velocity Ground water (informaoon and chemical analyses) (slow change) Landslides (fast change) Earthquakes (real- Ome measurements) Datasets from 4 different countries Denmark (RelaOonal Data) France (RelaOonal Data) ID- Card of the Project (Big?) Data and the 5 Vs Greece (text/xls data, XML data) Slovenia (RelaOonal Data) Variety Veracity X Value?
Monitoring and Costs More Big Data? A key needs in order to master one s Cloud- based system. Two main levels in InGeoCloudS that help in bemer managing the cloud: ApplicaOons (analyocs about usage) Resources (supervision of indicators, alerts and problems management)
Challenges related to Big Data and the Cloud Scien;fic Challenges Big Data storage No naove RDF triplestore that exploits the Cloud capabilioes Data security / privacy / control also a technical issue, especially when we deal with big data that you cannot monitor precisely Querying Big Data Fast querying for big data (many Omes queries are simple but volume is big) Support for geospaoal queries (Fast) Indexing for (RDF) geospaoal data UpdaOng Big Data Consistency? (CAP theorem, eventually consistent databases) GeospaOal support in NoSQL databases? Exchanging Big Data Data providers want to keep their infrastructures & synchronize data between them &the cloud! Issues on synchronizaoon &exchange of big data Monitoring and Management on the Cloud More big data to manage?
Challenges related to Big Data and the Cloud Poli;cal Challenges User adopoon Users are happy with what they have and they paoently wait unol the problem surfaces Users with weak or no infrastructure are most recepove to turn to Cloud and/or Linked Data paradigms than others User see the cloud as a plaxorm with more resources (memory, storage, processing) but they soll want their applicaoons to run there unchanged Security, privacy and trust Public vs. Private Clouds Publicly owned vs. Company- owned Clouds
Acknowledgements FORTH- ICS : Yannis Roussakis, Kyriakos KriOkos, Athina Kritsotaki, MarOn Doerr AKKA Technologies: Benoit Baurens EKBAA: Makis, Atzemoglou, Elias Grinias 49
November 25, 2014 50