Best practices for Linked Data
Asunción Gómez-Pérez
Facultad de Informática, Universidad Politécnica de Madrid
Avda. Montepríncipe s/n, 28660 Boadilla del Monte, Madrid
http://www.oeg-upm.net
asun@fi.upm.es
Phone: 34.91.3367417, Fax: 34.91.3524819
Acknowledgements: M. Poveda, V. Rodríguez-Doncel, D. Vila
BabeLData: TIN2010-17550
Asunción Gómez-Pérez, W3C @ Spain 2013, Madrid, 18th December
Linked Data: why is it important?
Facilitate data integration:
- From heterogeneous sources
- In different formats
- At different granularity
- In different languages
- From different countries
Slide adapted from "5min Introduction to Linked Data" by Olaf Hartig
Data Integration
[Figure: facts about Cervantes and El Quijote spread across six databases (BD): BNE, VIAF, AEMET, IGN, Prisa and DBpedia. Node labels: El Quijote / Don Quixote, M. Cervantes, Alcalá de Henares, 1605, 1960, Tapas guide, Siglo de Oro, Temperature 20º. Edge labels: Author / creator, Year of publication, birthplace, located in, Translated into Hebrew, Same as.]
Foundations
- RDF(S) models
- Unique identifiers: a URI identifies or names a resource
- Equivalence links to other datasets (Same As)
- Data navigation
Example from the diagram:
- Person: http://iflastandards.info/ns/fr/frbr/frbrer/c1005
- Work: http://iflastandards.info/ns/fr/frbr/frbrer/c1001
- Cervantes (http://datos.bne.es/resource/xx1718747) is a Person
- El Quijote (http://datos.bne.es/resource/xx3383563) is a Work
- Cervantes is creator of El Quijote
- Cervantes Same As http://viaf.org/viaf/17220427
- Cervantes Same As http://dbpedia.org/resource/miguel_de_cervantes
http://www.w3.org/designissues/linkeddata.html
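The Same As links and the "data navigation" idea on the Foundations slide can be sketched in a few lines of Python. This is only an illustration: the triples are plain tuples rather than RDF, the predicate strings "same as" and "is creator of" are stand-ins for the real properties, and the function names are hypothetical.

```python
from collections import deque

# URIs taken from the slide
BNE_CERVANTES = "http://datos.bne.es/resource/xx1718747"
BNE_QUIJOTE = "http://datos.bne.es/resource/xx3383563"
VIAF_CERVANTES = "http://viaf.org/viaf/17220427"
DBPEDIA_CERVANTES = "http://dbpedia.org/resource/miguel_de_cervantes"

# Predicate strings are illustrative stand-ins, not real property URIs
TRIPLES = [
    (BNE_CERVANTES, "is creator of", BNE_QUIJOTE),
    (BNE_CERVANTES, "same as", VIAF_CERVANTES),
    (BNE_CERVANTES, "same as", DBPEDIA_CERVANTES),
]


def equivalent_uris(start, triples):
    """Collect every URI reachable from `start` over 'same as' links, in both directions."""
    neighbours = {}
    for subject, predicate, obj in triples:
        if predicate == "same as":
            neighbours.setdefault(subject, set()).add(obj)
            neighbours.setdefault(obj, set()).add(subject)
    seen = {start}
    queue = deque([start])
    while queue:
        uri = queue.popleft()
        for other in neighbours.get(uri, ()):
            if other not in seen:
                seen.add(other)
                queue.append(other)
    return seen


def works_created_by(uri, triples):
    """Find works whose creator is `uri` or any URI equivalent to it."""
    same = equivalent_uris(uri, triples)
    return {o for s, p, o in triples if p == "is creator of" and s in same}


# Starting from the DBpedia URI, the BNE record for El Quijote is reachable
print(works_created_by(DBPEDIA_CERVANTES, TRIPLES))
```

Treating the links as symmetric is a design choice of this sketch; it lets a client that only knows the DBpedia or VIAF identifier reach facts published by BNE.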
The model (Ontology) and the data for humans
Ontology
- Classes: Work, Person, Place, Library, Language, Year
- Properties: is creator of, birthplace, translation, publication date, located at, has subject
Data
- Instances: El Quijote, Cervantes, Alcalá de Henares, BNE, Catalán, 1960, Vida de Cervantes
- Properties: is creator of, birthplace, translation, publication date, located in, has subject
The model and the data for machines
Ontology
- Work: http://iflastandards.info/ns/fr/frbr/frbrer/c1001
- Person: http://iflastandards.info/ns/fr/frbr/frbrer/c1005
- Other class labels: Language, Year, Library
- Other class URIs: http://iflastandards.info/ns/fr/frbr/frbrer/c1002, http://geo.linkeddata.es/ontology/municipio, http://xmlns.com/foaf/0.1/organization
- Properties: translation, is creator of, publication date, birthplace, located in, has subject
Data
- Don Quijote de la Mancha: http://datos.bne.es/resource/xx3383563
- Cervantes Saavedra, Miguel de: http://datos.bne.es/resource/xx1718747
- Other labels: Catalán, 1960, BNE, Vida de Miguel de Cervantes Saavedra
- Other URIs: http://datos.bne.es/resource/xx1924295, http://geo.linkeddata.es/resource/alcalá de Henares, http://datos.bne.es/#, http://datos.bne.es/resource/bimo0002045496
- Properties: translation, publication date, is author of, birthplace, has subject, located in
Linked Data is to be processed by machines
The generation process
Providers, Domains, Sources, Languages
The Linked Data Generation Process
Activities: Specification, Modelling, Generation, Linking, Publication, Exploitation, Data Curation
There is no One-Size-Fits-All Formula
A lot of data in many domains
- Music
- On-line activities
- E-Gov
- Cross-domains
- Publications
- Geographic
- Life Sciences
I want to use Linked Open Data
- Who generated the LD dataset?
- When was the LD dataset created?
- How was the LD dataset created?
- Is this the latest version of the LD dataset?
- Is the license information clearly stated in the LD dataset?
- How are LD licenses offered?
- Is the LD dataset monolingual or multilingual?
LOD observations
How does the LD generation process influence the use of the data by third parties?
- Vocabularies
- Licenses
- Language
- Provenance
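A data consumer's questions about licenses, language and provenance can be turned into a simple metadata check. The sketch below is one possible reading, not a method from the talk: the mapping from questions to the Dublin Core and PROV-O terms is an assumption, the dataset description is a list of plain tuples, and the creator value `:BNE` is hypothetical.

```python
# Assumed mapping from consumer questions to metadata predicates
REQUIRED = {
    "who generated it": "dcterms:creator",
    "when it was created": "dcterms:created",
    "how it was created": "prov:wasGeneratedBy",
    "license": "dcterms:license",
    "language": "dcterms:language",
}


def missing_metadata(dataset, triples):
    """Return the questions that the dataset description leaves unanswered."""
    present = {p for s, p, o in triples if s == dataset}
    return sorted(q for q, predicate in REQUIRED.items() if predicate not in present)


# Illustrative description of a dataset named :bne
description = [
    (":bne", "dcterms:language", "http://id.loc.gov/vocabulary/iso639-1/es"),
    (":bne", "dcterms:creator", ":BNE"),  # hypothetical value
]

print(missing_metadata(":bne", description))
```

A publisher could run such a check before release; an empty result means every question in the checklist has at least one answer in the metadata.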
How to prevent GIGO (garbage in, garbage out)
[Figure labels: GARBAGE, PROCESS]
Vocabularies
Cervantes at the data level
[Figure: five URIs that all name something called "Cervantes", some connected by Same as links]
- http://www.server1.org/resource/cervantes
- http://d-nb.info/gnd/11851993x
- http://datos.bne.es/resource/xx1718747
- http://www.server2.es/resource/cervantes
- http://geo.linkeddata.es/page/resource/municipio/cervantes
Attached values: Author: D. Quijote; Date of Birth: 1547; Phone: 914 296 093; #People: 1547; Size: 276,4 km²
Cervantes and a bit of semantics
[Figure: the same five URIs, each now given an rdf:type]
- http://www.server1.org/resource/cervantes
- http://d-nb.info/gnd/11851993x
- http://datos.bne.es/resource/xx1718747
- http://www.server2.es/resource/cervantes
- http://geo.linkeddata.es/page/resource/municipio/cervantes
rdf:type values: Person, Restaurant, Street, Municipality
Cervantes (Person): Same as link between the person URIs; Author: D. Quijote; Date of Birth: 1547
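Once each "Cervantes" URI carries an rdf:type, a client can tell the writer apart from the restaurant, the street and the municipality. The following sketch shows the idea with plain tuples. The type assignments are my reading of the flattened diagram: in particular, which of the two placeholder servers hosts the restaurant and which the street is an assumption.

```python
RDF_TYPE = "rdf:type"

TRIPLES = [
    ("http://d-nb.info/gnd/11851993x", RDF_TYPE, "Person"),
    ("http://datos.bne.es/resource/xx1718747", RDF_TYPE, "Person"),
    ("http://geo.linkeddata.es/page/resource/municipio/cervantes", RDF_TYPE, "Municipality"),
    # Assumption: the slide does not make clear which server hosts which resource
    ("http://www.server1.org/resource/cervantes", RDF_TYPE, "Restaurant"),
    ("http://www.server2.es/resource/cervantes", RDF_TYPE, "Street"),
]


def resources_of_type(cls, triples):
    """Return the URIs whose rdf:type is `cls`, sorted for stable output."""
    return sorted(s for s, p, o in triples if p == RDF_TYPE and o == cls)


# Only the two person records remain when asking for Cervantes the writer
print(resources_of_type("Person", TRIPLES))
```

Without the type statements, all five URIs would match a search for "Cervantes" equally well.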
Cervantes in FOAF
[Figure: the FOAF vocabulary and an instance of it]
- Classes: owl:Thing, foaf:Agent, foaf:Group, foaf:Organization, foaf:Person, foaf:Document, foaf:Image
- Properties: foaf:mbox, foaf:firstName, foaf:surname, foaf:birthday, foaf:img, foaf:knows, foaf:depiction, foaf:publications, foaf:homepage
- Instance: bibliothek:cervantes (instance of foaf:Person)
- Instance properties: foaf:firstName, foaf:surname, foaf:birthday, foaf:img, foaf:publications, foaf:depiction, foaf:homepage
- Instance values: Miguel de Cervantes Saavedra; 29-09; http://www.bibliothekberlin/ /images/quixote.tif; http://.../authors/cervantes.png; http://www.bibliothekberlin.com/.../3-538-06892-5
License Information
How Open is the Open Linked Data Cloud?
LOD observations: Licenses
An example: the British National Bibliography
License Information is not up to date
Metadata information without license information
License information provided as XML
Linked Data Rights pattern http://oeg-dev.dia.fi.upm.es/licensius/static/ldr/
Language
Rationale: LOD is dominated by the English language
[Figure panels: 2007, 2009, 2013]
Questions:
1. Searching for resources in a particular language
2. Distribution of natural languages across RDF datasets?
3. Usage of language tags to indicate the natural language of RDF literals?
   1. Distribution of usage of language tags
   2. Distribution of literals tagged as English vs other languages
   3. Distribution of literals tagged in languages other than English
Example of multilingual library resource
Ernest Hemingway and El viejo y el mar (MARC 21 records)
The dataset publisher does not tag the language of the content of different fields
Multilingualism and the Linked Data Process

How to represent language information for datasets?

    # VoID description
    :bne a void:Dataset ;
        dcterms:language <http://id.loc.gov/vocabulary/iso639-1/es> .

    # DCAT description
    :bne a dcat:Dataset ;
        dcterms:language <http://id.loc.gov/vocabulary/iso639-1/es> .

How to represent language information in Linked Data?

Traditional annotation properties for most cases:

    dbpedia:miguel_de_cervantes rdfs:label "Miguel de Cervantes"@es ,
        "ミゲル デ セルバンテス"@ja ,
        " "@ko .

Richer models for more demanding applications:

    # LEMON
    isbd:t1001 lemon:isReferenceOf [ lemon:isSenseOf :cartographic ] .
    :cartographic a lemon:LexicalEntry ;
        lemon:form [ lemon:writtenRep "cartográfico"@es ;
                     isocat:grammaticalGender isocat:masculine ] ;
        lemon:form [ lemon:writtenRep "cartográfica"@es ;
                     isocat:grammaticalGender isocat:feminine ] .
    isocat:grammaticalGender rdfs:subPropertyOf lemon:property .
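Language-tagged labels only pay off if consumers select on the tag. A minimal sketch of that selection, assuming the labels have already been read into (text, language tag) pairs; the function name and the fallback policy are illustrative choices, not something prescribed by the slide.

```python
# Labels as (text, language tag) pairs, taken from the rdfs:label example
LABELS = [
    ("Miguel de Cervantes", "es"),
    ("ミゲル デ セルバンテス", "ja"),
]


def label_for(labels, preferred, fallback="es"):
    """Return the first label matching the preferred languages, then the fallback, else None."""
    by_lang = {lang: text for text, lang in labels}
    for lang in (*preferred, fallback):
        if lang in by_lang:
            return by_lang[lang]
    return None


print(label_for(LABELS, ["ja"]))        # a Japanese reader gets the Japanese label
print(label_for(LABELS, ["ko", "en"]))  # no Korean or English label, so the fallback is used
```

If the publisher had left the literals untagged, the lookup above would have nothing to select on, which is the problem the multilingual library example illustrates.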
Implementation of the recording of data and metadata provenance
- Resource provenance (DC): File.txt has creator John, creation date 12-2-1900, rights GPL
- Generation process (PROV-O @W3C): the Revision Process used File.txt; Filev1.txt was generated by the Revision Process
- PROVENANCE Model (RDF(S)), kept in an RDF Store
Conclusions
The use of linked data by third parties will be influenced by:
- The use of curated data
- The use of widely known vocabularies
- License metadata in RDF
- Language metadata in RDF
- Provenance metadata in RDF
Thanks for your attention!
Asunción Gómez-Pérez
Guidelines for Multilingual Linked Data. WIMS 2013, Madrid, 12-14 June
There is no One-Size-Fits-All Formula
Datasets (table columns): BNE, IGN, AEMET, PRISA, INE
Phases (table rows), with the vocabularies and tools listed across the datasets:
- Modeling: DC, hydrontology, Wgs84, time, SSN ontology, SIOC, Scovo, Data cube
- RDF generation: MARiMbA, geometry2rdf, NOR2O, CSV parser, CSV parser, NOR2O
- Links generation: DNB, VIAF, LIBRIS, DBPEDIA, Silk, Silk, Silk, DBPEDIA, DBPEDIA, Geolinkeddata.es, Geonames, Geolinkeddata.es, NOR2O, Geolinkeddata.es
- Publication: Pubby, sitemap4rdf
- Exploitation: map4rdf, SPARQL
http://oa.upm.es/14465/1/2.formulald.pdf
The multilingual Web of Data: Current state
All charts report January 2012, June 2012 and December 2012, in that order.
1. Number of monolingual and multilingual datasets
   Series: Monolingual datasets, Multilingual datasets
   Values: 349, 635, 676 and 1,906, 2,201, 1,984
2. Current usage of language tagging capabilities in RDF
   RDF literals with language tag: 2,567,324; 3,154,779; 3,365,930
   RDF literals without language tag: 10,250,936; 10,594,338; 12,272,806
3. English tags versus other languages' tags
   Series: RDF literals with English tag, RDF literals with other language tag
   Values: 431,660; 403,714; 557,785 and 2,135,664; 2,751,065; 2,808,145
4. Evolution of top-10 languages
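The figures on this slide allow a quick consistency check and a derived ratio. The two series of chart 3 (English tag and other language tag) add up exactly to one of the series of chart 2, which identifies that series as the literals carrying a language tag. The sketch below verifies this and computes the tagged share per measurement date; variable and function names are illustrative.

```python
DATES = ["January 2012", "June 2012", "December 2012"]

# Chart 2: the two series of RDF literal counts
chart2_smaller = [2_567_324, 3_154_779, 3_365_930]
chart2_larger = [10_250_936, 10_594_338, 12_272_806]

# Chart 3: the two series of language-tagged literals (English vs other)
chart3_first = [431_660, 403_714, 557_785]
chart3_second = [2_135_664, 2_751_065, 2_808_145]

# English + other = all tagged literals, so the sum must match a chart 2 series
tagged = [a + b for a, b in zip(chart3_first, chart3_second)]
assert tagged == chart2_smaller
untagged = chart2_larger


def tagged_share(with_tag, without_tag):
    """Fraction of RDF literals that carry a language tag."""
    return with_tag / (with_tag + without_tag)


shares = [tagged_share(t, u) for t, u in zip(tagged, untagged)]
for date, share in zip(DATES, shares):
    print(f"{date}: {share:.1%} of literals carry a language tag")
```

At all three dates the share stays between a fifth and a quarter, so most literals in the sample carried no language information.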