Lift your data: hands-on session




Duration: 40 min

Foreword

Publishing data as linked data involves several steps: converting the initial data into RDF, polishing the URIs, possibly finding a commonly used vocabulary that you can reuse in your dataset, and publishing the data, i.e. making sure that each URI used to identify an object in your dataset is dereferenceable (an HTTP URL that leads to an actual description of the object). Beyond that, you will want to provide SPARQL access to the data (the equivalent of a WFS service in the Linked Data world) and to interconnect your instances with other instances published on the Web.

For some of these steps you can find code on the Web; for others you would have to write something yourself. To make your URIs dereferenceable, you can tune the configuration file of an HTTPD Web server. Setting up SPARQL access is a bit more complex. This is why this tutorial uses an existing open source software package, designed and implemented in the context of the Datalift research project (funded by the French national research agency, ANR, http://datalift.org). This software gathers modules that perform these steps. It has been designed by the industrial partner of the consortium, Atos, as an open platform, so that several modules can be available for each step and new modules can be added. Some of them are operational, whereas others are still under development. It will be released as open source at the end of the project (autumn 2013).

We expect two categories of participants in this workshop. If you are quite familiar with implementation work, we suggest you install the platform on your computer (see the dedicated directory in the tutorial material), follow the steps to lift the sample datasets, and then adapt the process to your own data. The technical description of the software is in the suggested readings directory of the tutorial material. If you do not feel so familiar with this, we suggest you use the existing server installed by Atos. Adapting the lifting process to your own data might take a bit longer, but we will offer assistance. Please show us your data so that we can adapt the configuration file used by the conversion software and load it on the server.
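As an illustration of the HTTPD tuning mentioned above, here is a minimal Apache configuration sketch for making URIs dereferenceable through content negotiation. The host name, URI patterns and file locations are assumptions for the example, not part of the Datalift platform.

    # Minimal sketch, assuming resource URIs of the form http://example.org/id/{name}
    # and pre-generated HTML and RDF/XML descriptions on disk (requires mod_rewrite).
    # All names below are illustrative; adapt them to your own URI scheme.
    <VirtualHost *:80>
        ServerName example.org
        RewriteEngine On

        # Clients asking for RDF are redirected (303 See Other) to the RDF document
        RewriteCond %{HTTP_ACCEPT} application/rdf\+xml
        RewriteRule ^/id/(.+)$ /data/$1.rdf [R=303,L]

        # Other clients are redirected to the HTML description page
        RewriteRule ^/id/(.+)$ /page/$1.html [R=303,L]
    </VirtualHost>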

Part A: lifting data

Principles

The workspace interface allows you to design projects. You can specify data sources that will be loaded by the server, then select and apply operations (modules) on these sources. This yields new data on which you can apply further operations. To lift data, you need to create a project or use an existing one. A project is an environment in which you can add sources, i.e. specify the location of the initial data that the server will fetch and work on, and in which you can select and trigger operations on these sources. The resulting data are in turn stored as sources of the project.

Step 0: create the project

To create a project, use the "workspace" interface.

Remote server: http://datalift.si.fr.atosorigin.com/datalift/project (login: datalift, password: test)
Your installation: http://localhost:9091/datalift/project

On the workspace, click on "New project" in the left column (fig. 1 and fig. 2). The name of the project is used in the working URIs, but these can be modified before publishing, in step 3. If you use the remote server, make sure to use a distinctive name for your project so that each of you works on their own project for the rest of the tutorial.

fig. 1

fig. 2

Step 1: specify sources

In the workspace view, select the newly created project, then click on the Sources tab and on the "+" sign at the bottom left of the Sources tab (fig. 3) to add a new source to the list of sources used in your project.

fig. 3

In the next window, select the format of the input data source: CSV, XML, GML, RDF, SHP, any DBMS for which a JDBC driver is available, or any SPARQL endpoint. Once you have selected a format, an interface helps you specify the location of your data so that the server can fetch it. In this tutorial, first use the SHP source loader to load DEPARTEMENT.shp (fig. 4), then the CSV source loader to load ADRESSES.csv (fig. 5).

fig. 4

fig. 5

Some sources can then be visualised by clicking on the source name (others not yet): CSV, RDF, XML and GML files can be visualised by clicking on their name in the Sources tab. You may also delete sources using the bin icon or modify them using the pen icon. Make sure you select the source by clicking on its rectangle, not on its name, before selecting the bin or pen icon.

Step 2: convert into RDF

To proceed to the conversion, go back to the Description tab. Different conversion modules are available depending on the kind of source.

Step 2.a: conversion of the SHP source to GML

In the case of a shapefile, you will apply two modules: first SHP to GML mapping, then GML to RDF mapping. Note that the SHP to GML conversion also changes the projection to WGS84, which is widely used in Linked Data.

fig. 6

Here we only have an SHP source, so the only module associated with it is the SHP to GML mapping (fig. 6). Click on the module; there is only one possible SHP file to select; click on Submit. The newly created GML file now appears in the list of sources, and if you click on it you can visualise it:

<?xml version="1.0" encoding="utf-8"?>
<ogr:FeatureCollection
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://ogr.maptools.org/ DEPARTEMENT_wgs84.xsd"
    xmlns:ogr="http://ogr.maptools.org/"
    xmlns:gml="http://www.opengis.net/gml">
  <gml:boundedBy>
    <gml:Box>
      <gml:coord><gml:X>-5.139017285433222</gml:X><gml:Y>41.36275742645721</gml:Y></gml:coord>
      <gml:coord><gml:X>9.559823318166366</gml:X><gml:Y>51.08939669830915</gml:Y></gml:coord>
    </gml:Box>
  </gml:boundedBy>
  <gml:featureMember>
    <ogr:DEPARTEMENT_wgs84 fid="DEPARTEMENT_wgs84.0">

      <ogr:geometryProperty><gml:Polygon><gml:outerBoundaryIs><gml:LinearRing><gml:coordinates>5.831226413621034,45.938459578293212 5.822166170149367,45.93026097668983 5.829400764539029,45.913987107834956 5.826316643217439,45.903692922700827 5.815154073062683 etc.

The next step is to apply a GML to RDF mapping to this GML file. To do so, go to the Description tab, where the GML to RDF mapping module is now available. Click on it; there is only one possible GML file to use for the module; click on Submit. The newly created RDF file appears in the Sources tab. You can click on it to display the generated RDF data (fig. 7).

fig. 7

WARNING: the GML to RDF mapping uses a configuration file that is specific to the GML schema used in the GML file. It is stored in {$DATALIFT_HOME}/storage/public/project/{PROJECT_NAME}/ and its name must be the same as the name of the GML file, with a .conf suffix. In the tutorial material you can see what it looks like: DEPARTEMENT_wgs84.conf. We also include an article from the author of the original source code (which we have reused in the platform) that explains the parameters of the configuration file; it is the file USGSReport.pdf in the Suggestedreadings directory.

Step 2.b: conversion of CSV to RDF

In the Description tab, select the module Direct mapping CSV to RDF. A Data type mapping section allows you to specify the datatype of each field (fig. 8).

fig. 8

An RDF source is created that has the same name as the input source, suffixed with (RDF #1). This name is generated automatically; feel free to change it.
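To give an idea of what a direct CSV to RDF mapping typically produces, here is a minimal sketch in Turtle for one hypothetical row of ADRESSES.csv. The namespace, column names, values and datatypes are assumptions for the example, not the actual output of the module.

    @prefix xsd: <http://www.w3.org/2001/XMLSchema#> .
    # Hypothetical base namespace and column names; a direct mapping usually
    # creates one resource per row and one property per column.
    @prefix ex:  <http://localhost:9091/project/tutorial/source/adresses#> .

    ex:row1 ex:adresse     "12 rue de la Paix" ;
            ex:code_postal "75002"^^xsd:string ;
            ex:latitude    "48.869"^^xsd:double ;
            ex:longitude   "2.331"^^xsd:double .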

Step 3: change the URIs

You may need to change the URIs in the generated RDF data, for example to replace the working namespace derived from the project name with the namespace under which the data will be published. To do so, go to the Description tab and select the RDF URI translation module (fig. 9).

fig. 9

A new source is created that has the same name as the previous one, with (RDF #i) at the end. This name is generated automatically; feel free to change it.

Step 4: transform the schema

To change the schema, you can use the RDF to RDF transformation (CONSTRUCT) module, as shown in fig. 10; a sketch of this kind of query is given after Step 6 below.

fig. 10

Step 5: publish the data

Once you have RDF sources, two new modules are available in the Description tab: Data publishing to public RDF store and RDF data export. The first module copies RDF data from the internal store to the public store. When publishing RDF data, the named graph URI is not visible as part of the data but acts as a container for the RDF triples, allowing you to manipulate them as a set (e.g. to delete or replace them). The URIs of your objects remain unchanged: if the server receives a request for one of these URIs, it will be able to serve the description of the object identified by that URI. The second module allows you to download the data locally, for example to upload them into another, remote RDF store.

Step 6: interconnection

Now that your data are published, it is worthwhile to match your instances with other instances and to publish the links.
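As an illustration of the kind of transformation Step 4 refers to, here is a minimal SPARQL CONSTRUCT sketch that maps column-based properties to a shared vocabulary. The source namespace reuses the hypothetical one from the CSV example above; the target vocabulary choice is an assumption, not the configuration actually used by the module.

    PREFIX ex:  <http://localhost:9091/project/tutorial/source/adresses#>   # hypothetical working namespace
    PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>

    # Re-express the latitude/longitude properties with the W3C WGS84 vocabulary.
    CONSTRUCT {
      ?s geo:lat  ?lat ;
         geo:long ?long .
    }
    WHERE {
      ?s ex:latitude  ?lat ;
         ex:longitude ?long .
    }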

Part B: querying the lifted data

The data can be queried by sending:

1) HTTP requests to dereference the URIs (URLs in this case); the server returns a representation of the object after content negotiation based on the request headers: HTML, RDF/XML, Turtle or N3.

2) SPARQL queries to the SPARQL endpoint web service (see fig. 11).

Some pointers on the SPARQL language: http://www.w3.org/TR/rdf-sparql-query/

You can also experiment with the HTML files in the tutorial material to display Linked Data on cartographic portals.

fig. 11
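As a starting point for experimenting with the SPARQL endpoint, here is a minimal query sketch that lists the classes used in the published data, a quick way to check that the lifted triples are visible in the public store. The endpoint location is an assumption; it depends on your installation.

    # Send this to the platform's SPARQL endpoint (the exact URL depends on your setup).
    # List the classes used in the published data.
    SELECT DISTINCT ?class
    WHERE {
      ?s a ?class .
    }
    LIMIT 20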