COLINDA - Conference Linked Data

Similar documents
COLINDA: Modeling, Representing and Using Scientific Events in the Web of Data

Towards the Integration of a Research Group Website into the Web of Data

City Data Pipeline. A System for Making Open Data Useful for Cities. stefan.bischof@tuwien.ac.at

Revealing Trends and Insights in Online Hiring Market Using Linking Open Data Cloud: Active Hiring a Use Case Study

Visual Analysis of Statistical Data on Maps using Linked Open Data

Linked Open Data Infrastructure for Public Sector Information: Example from Serbia

TECHNICAL Reports. Discovering Links for Metadata Enrichment on Computer Science Papers. Johann Schaible, Philipp Mayr

Facilitating Scientometrics in Learning Analytics and Educational Data Mining - the LAK Dataset

Mining the Web of Linked Data with RapidMiner

Linked Statistical Data Analysis

The FAO Geopolitical Ontology: a reference for country-based information

Semantic Interoperability

LDIF - Linked Data Integration Framework

An Ontology Model for Organizing Information Resources Sharing on Personal Web

LINKED OPEN DRUG DATA FROM THE HEALTH INSURANCE FUND OF MACEDONIA

Data-Gov Wiki: Towards Linked Government Data

CitationBase: A social tagging management portal for references

Semantic Modeling with RDF. DBTech ExtWorkshop on Database Modeling and Semantic Modeling Lili Aunimo

Linked Open Data A Way to Extract Knowledge from Global Datastores

LinkZoo: A linked data platform for collaborative management of heterogeneous resources

Publishing Linked Data Requires More than Just Using a Tool

María Elena Alvarado gnoss.com* Susana López-Sola gnoss.com*

QASM: a Q&A Social Media System Based on Social Semantics

Open Data Integration Using SPARQL and SPIN

DataBridges: data integration for digital cities

Supporting Change-Aware Semantic Web Services

How To Create A Web Of Knowledge From Data And Content In A Web Browser (Web)

Lightweight Data Integration using the WebComposition Data Grid Service

LinksTo A Web2.0 System that Utilises Linked Data Principles to Link Related Resources Together

RDF Dataset Management Framework for Data.go.th

The Ontology and Architecture for an Academic Social Network

Reason-able View of Linked Data for Cultural Heritage

Training Management System for Aircraft Engineering: indexing and retrieval of Corporate Learning Object

GetLOD - Linked Open Data and Spatial Data Infrastructures

Federated Query Processing over Linked Data

Towards a Sales Assistant using a Product Knowledge Graph

A three-level data publishing portal

Annotea and Semantic Web Supported Collaboration

Benchmarking the Performance of Storage Systems that expose SPARQL Endpoints

Linked Data Interface, Semantics and a T-Box Triple Store for Microsoft SharePoint

Semantically Enhanced Web Personalization Approaches and Techniques

Annotation: An Approach for Building Semantic Web Library

ON DEMAND ACCESS TO BIG DATA. Peter Haase fluid Operations AG

Real-Time Data Access Using Restful Framework for Multi-Platform Data Warehouse Environment

How To Write A Drupal Rdf Plugin For A Site Administrator To Write An Html Oracle Website In A Blog Post In A Flashdrupal.Org Blog Post

COMBINING AND EASING THE ACCESS OF THE ESWC SEMANTIC WEB DATA

A Semantic web approach for e-learning platforms

Semantic Search in Portals using Ontologies

Using Open Source software and Open data to support Clinical Trial Protocol design

STAR Semantic Technologies for Archaeological Resources.

GR4PHP: A Programming API for Consuming E-Commerce Data from the Semantic Web

Portal Version 1 - User Manual

How semantic technology can help you do more with production data. Doing more with production data

Serendipity a platform to discover and visualize Open OER Data from OpenCourseWare repositories Abstract Keywords Introduction

RDF y SPARQL: Dos componentes básicos para la Web de datos

A Semantic Web Application The Resilience Knowledge Base and RKBExplorer

DBpedia German: Extensions and Applications

SmartLink: a Web-based editor and search environment for Linked Services

Context Capture in Software Development

DRUM Distributed Transactional Building Information Management

Detection and Elimination of Duplicate Data from Semantic Web Queries

Data Consumer's Guide. Aggregating and consuming data from profiles March, 2015

Exploiting Linked Data For Building Web Applications

DISCOVERING RESUME INFORMATION USING LINKED DATA

RDF graph Model and Data Retrival

Recipes for Semantic Web Dog Food. The ESWC and ISWC Metadata Projects

SemWeB Semantic Web Browser Improving Browsing Experience with Semantic and Personalized Information and Hyperlinks

Building Semantic Content Management Framework

Lift your data hands on session

Scope. Cognescent SBI Semantic Business Intelligence

ELIS Multimedia Lab. Linked Open Data. Sam Coppens MMLab IBBT - UGent

- a Humanities Asset Management System. Georg Vogeler & Martina Semlak

DATA MANAGEMENT PLAN DELIVERABLE NUMBER RESPONSIBLE AUTHOR. Co- funded by the Horizon 2020 Framework Programme of the European Union

LiDDM: A Data Mining System for Linked Data

ON DEMAND ACCESS TO BIG DATA THROUGH SEMANTIC TECHNOLOGIES. Peter Haase fluid Operations AG

Achille Felicetti" VAST-LAB, PIN S.c.R.L., Università degli Studi di Firenze!

An Introduction to Linked Data

LINKED DATA EXPERIENCE AT MACMILLAN Building discovery services for scientific and scholarly content on top of a semantic data model

ONTOLOGY-BASED APPROACH TO DEVELOPMENT OF ADJUSTABLE KNOWLEDGE INTERNET PORTAL FOR SUPPORT OF RESEARCH ACTIVITIY

DATA MODEL FOR STORAGE AND RETRIEVAL OF LEGISLATIVE DOCUMENTS IN DIGITAL LIBRARIES USING LINKED DATA

LIDS and Network Marketing Services

Hubble: Linked Data Hub for Clinical Decision Support

Semantic Exploration of Archived Product Lifecycle Metadata under Schema and Instance Evolution

A Visual Exploration Workflow as Enabler for the Exploitation of Linked Open Data

Integrating FLOSS repositories on the Web

SPC BOARD (COMMISSIONE DI COORDINAMENTO SPC) AN OVERVIEW OF THE ITALIAN GUIDELINES FOR SEMANTIC INTEROPERABILITY THROUGH LINKED OPEN DATA

We have big data, but we need big knowledge

Querying DBpedia Using HIVE-QL

E6895 Advanced Big Data Analytics Lecture 4:! Data Store

AGRIS: an RDF-aware system in the agricultural domain

Sustainable Development with Geospatial Information Leveraging the Data and Technology Revolution

Visualizing Big Data. Activity 1: Volume, Variety, Velocity

A semantic scheduler architecture for federated hybrid clouds

Data Quality Mining: Employing Classifiers for Assuring consistent Datasets

Semantic Monitoring of Personal Web Activity to Support the Management of Trust and Privacy

Interactive exploration of complex relational data sets in a web interface

Semantics and Ontology of Logistic Cloud Services*

Implementing an RDF/OWL Ontology on Henry the III Fine Rolls

CloudStack Metering Working with the Usage Data. Tariq Iqbal Senior

The Open University s repository of research publications and other research outputs

Transcription:

Undefined 1 (0) 1 5 1 IOS Press COLINDA - Conference Linked Data Editor(s): Name Surname, University, Country Solicited review(s): Name Surname, University, Country Open review(s): Name Surname, University, Country Selver Softic a, Laurens De Vocht b, Erik Mannens b and Rik Van de Walle b a Social Learning, Graz University of Technology, Münzgrabenstrasse 35A, 8010, Graz, Austria E-mail: selver.softic@tugraz.at b Multimedialab, Ghent University - iminds, Sint-Pietersnieuwstraat 25, 9000, Ghent, Belgium E-mail: laurens.devocht,erik.mannens,rik.vandewalle@ugent.be Abstract. We introduce a new LOD (Linked Open Data) Cloud member COLINDA (COnference LInked DAta) which exposes information about scientific events like conferences and workshops for the period from 2007 up to 2011. COLINDA includes also time and venue information of the scientific events which is interlinked to the GeoNames Linked Data set. The main sources of COLINDA are WikiCfP and Eventseer. COLINDA holds information about conferences from all over the world and contains information about 6000 scientific events generating around 140000 triples. More then 25000 new conferences are to come. This paper provides an introduction on the conference linked dataset,and demonstrates its applicability for adoption of web 2.0 into science also known as Research 2.0 or Science 2.0 Keywords: Linked Data, COLINDA, scientific events, Linked Science, Research 2.0 1. Introduction We present our current work on the COLINDA 1 dataset which contains information about scientific events (conferences, workshops) from all over the world. Data published currently in COLINDA is extracted from the dumps of WikiCfP 2 and data has been extracted via JSON interface from Eventseer 3. Data from Eventseer still waits to be published due to the still pending publication permission approval. Some parts of WikiCfP data are still in processing stage, which means that several thousands of additional conferences will be added consequently. COLINDA includes data of about 6000 conferences. Around 30000 additional conferences are planned to follow till the end of the year. In section 2 we describe the nature and origins of data and purpose of creation of the content included in COLINDA. Additionally we provide 1 Available at: http://data.colinda.org/ 2 http://www.wikicfp.com 3 http://eventseer.net information about the availability and accessibility regarding this data. Section 3 is reserved for the details about the dataset creation process, while the Section 4 is aiming at outlining the usage of COLINDA. Finally in Section 5 we draw conclusions, discuss the limitations and report on our future work. 2. COLINDA Dataset 2.1. Data Source and Coverage The main data sources of COLINDA are WikiCfP and Eventseer. Those are two very popular online Web 2.0 pages containing data about calls for papers, locations and topics of conferences as well as concrete call for papers description. Such pages can be considered as scientific event announcement pages editable by the users with archiving character. In order to add some event a user has to be registered and as soon he or she enters an event, a review process by internal editors is started in order to validate the entry qual- 0000-0000/0-1900/$00.00 c 0 IOS Press and the authors. All rights reserved

2 S.Softic et al. / COLINDA - Conference Linked Data ity. This process is more strict by Eventseer then by WikiCfP. Data for these pages is provided by the scientific community users involved into organisation of such events. WikiCfP also provides annual data dumps for previous years in XML format. Currently WikiCfP contains data about around 30.000 conferences with around 100.000 registered researchers. Eventseer contains according the latest infromation 4 information about around 21000 events and serves more then 1 million users. Scientific events from both pages date from 2002 up to now also including the future events in the next year. Listing 1 shows a simple entry from an WikiCfP data dump that was used to create instances from COLINDA, while listing 2 represent the JSON source from Eventseer. Listing 2: Sample entry from Eventseer JSON interface. "totalrecords": 1, "records": [ "Event": "<a href= /e/19285/ > 13th IEEE/ACM international symposium on cluster, cloud and grid computing (CCGRID 2013) </a>", "City": "Delft", "Country": "Netherlands", "Date": "10 Nov 2012", "NextDeadline": "22 Nov 2012", "EndDate": "16 May 2013", "StartDate": "13 May 2013" ] Listing 1: Sample entry from WikiCfP data dump. <row> <field name="eventid">11426</field> <field name="createdate">2010 09 13 07:22:18</field> <field name="fullname">the 10th International Semantic Web Conference</field> <field name="handle">iswc</field> <field name="year">2011</field> <field name="location">koblenz, Germany</field> <field name="begindate">2011 10 22</field> <field name="finishdate">2011 10 27</field> <field name="presubdate">2011 06 16</field> <field name="submitdate">2011 06 23</field> <field name="notifydate">2011 08 08</field> <field name="cameradate">2011 08 28</field> <field name="weblink">http://iswc2011.semanticweb.org</field> <field name="info">cfp text...</field> </row> Current exported data covers generally two domains. The first domain describes the Conference as basic scientific event with a start date, location, description, label and link to the event. The Location is then in interlinking process resolved using the GeoNames 5 data set. Each location contains reference to the city, country and coordinates of the location. 2.2. Purpose of Creation The intention behind COLINDA was initially to provide tag based identification system for scientific events in the manner of the "5-star" quality Open Data 6. Users in social microblogs like Twitter 7 are often using so called "hash tags" to describe an event they are attending. E.g. ISWC (International Semantic Web Conference) 2012 is often referred as "iswc12" or "iswc2012". Also the DBLP linked data set uses this kind of notation to reference the event where a publication belongs to 8. More generally COLINDA is meant to be event driven connection data set for scientific LOD (Linked Open Data) Cloud datasets and to support in this way the efforts of Linked Science 9 initiative as well to be used as mining reference for creation of semantically driven microblog data Mesh Ups for Research 2.0 as it will be shown on simple example in section 4. Research 2.0 also known as Science 2.0 is a initiative in research community to adapt Web 2.0 for the needs of scientific community. Currently several efforts in this direction are running or has been done as e.g. DBLP data set containing data about publications or ResearchGate 10 a social network for scientists just to mention some of them. Events specific for the research community seem to be very intuitive base for connecting to the context since researches with the same interests seem to track and visit similar events. Further appliance of COLINDA would be also to connect information about the scientific events with other scientific data sets of relevance e.g. information about publications, citations, journal informations etc. 4 http://eventseer.net/data/ 5 http://www.geonames.org/ 6 http://5stardata.info/ 7 http://www.twitter.com/ 8 for "iswc2012" : http://dblp.l3s.de/d2r/page/publications/conf/iswc/2012 9 http://linkedscience.org/ 10 http://www.researchgate.net/

S.Softic et al. / COLINDA - Conference Linked Data 3 2.3. Licensing and Availability Table 1 Harmonised COLINDA instances minimal properties set. WikiCfP is a Semantic Wiki and supports "creative commons" Attribution-ShareAlike 3.0 License 11 which was also beside the page popularity decisive by choosing the data source for COLINDA. Eventseer however is run and developed by Abiody AS and Thomas Brox Røst, which implies an explicit request for permission to re-use the data we collected. Currently published data is available at http://data.colinda.org. 3. Linked Data creation process Concept Conference optional Location Property label title description date link location placename city country longitude latitude The data creation process contains basically following steps: Extraction - preprocessing and harmonisation of data sources (Subsection 3.1) Ontology choice - concept coverage (Subsection 3.2) Triplification - creating RDF data triples (Subsection 3.3) Interlinking - connection to other Linked Data sets (Subsection 3.4) Additionally information about data characteristics and URI design are described in subsections 3.5 and 3.6. 3.1. Extraction Since COLINDA consumes data from different sources a minimal set of properties that describe a Conference concept for a single RDF instance has to be defined. All properties from source data sets has to be mapped to this normalized set in order to harmonize the federated data for import. Location concept related to conference event as such is considered as optional enrichment treated in the interlinking process. This decision was made because of the fact that all conference descriptions do not explicitly include the venue information. The quality of source data depends on the users that provide the information. Thus such data sources implicitly exclude assumption of completeness.table 1 represents the minimal set of properties a Conference and Location instance should include. 11 http://creativecommons.org/licenses/by-sa/3.0/ The Extraction process includes steps of preprocessing XML and JSON inputs from WikiCfP and Eventseer into the series of values saved into Comma Separated Value (CSV) form as result. During the preprocessing cycle data fields like e.g. date are normalised to uniform representation in order to provide easier processable input for triplification step which converts the extracted values into RDF formated instances. 3.2. Ontology Choice Representation of scientific events was already elaborated in previous research work [1]. Minimal field set defined in table 1 for RDF instance generation fitted very well to the existing ontologies. This is the reason why we have chosen the SWRC Ontology [1] and basic RDFS Schema 12 as established vocabularies to describe Conference instances. Same approach was applicable for Location concept. Geographical property set was easily mappable into combination of GeoNames 13 and Basic Geo (WGS84) Vocabulary 14. How completely supported mapped model with interlinking looks like can be seen in figure 1, where a single complete and interlinked instance of Conference is depicted. How properties fit to the vocabulary properties can be seen in table 2. 3.3. Triplification Triplification process uses as input in extraction, harmonisation and preprocessing step generated CSV 12 http://www.w3.org/tr/rdf-schema/ 13 http://www.geonames.org/ontology/ 14 http://www.w3.org/2003/01/geo/wgs84_pos#

4 S.Softic et al. / COLINDA - Conference Linked Data Fig. 1. Sample interlinked Conference RDF instance. Table 2 COLINDA concept to ontology model mapping (note: geonames - GeoNames Ontology, geo - W3C GEO Vocabulary, swrc - SWRC Ontology). Concept/Property Conference label title description date link location Location placename city country longitude latitude RDF Class/Property swrc:conference rdfs:label swrc:eventtitle swrc:description swrc:startdate owl:sameas swrc:location geo:spatialthing geonames:p geonames:name geonames:countryname geo:long geo:lat output. Input generated in this way represents tabular set of values compatible with properties from table 1. This CSV file is then imported into appropriate MySQL database table that holds COLINDA data. From this place data is conference wise generated as single RDF instance using the vocabulary properties defined in table 2 and each RDF instance of a single conference svent is recallable via REST (Representational State Transfer) calls as described in subsection 3.5. In order to provide the whole data via SPARQL endpoint batch process synchronises the data from MySQL database table into the ARC2 15 RDF triple store running on the same database server. 3.4. Interlinking In order to provide 5-star data and led by the design issues described in [2], we used swrc:location as 15 https://github.com/semsol/arc2/ interlinking property in order to interlink the location data with GeoNames. The interlinking process uses curl 16 requests against GeoNames query service to resolve geographical information and retrieve coordinates. Although usually owl:sameas is used to interlink to other data set we used this property to resolve the connection to the conference web page and since swrc:location seems regarding the GeoNames to be more appropriate choice. How this connection looks like can be seen in the sample depicted in figure 1. Due the fact that conference information entered by users does not always include location data, we generates some data statistics about how many of overall instances are linked to GeoNames as well some other interesting facts which are presented in subsection 3.6. 3.5. URI Design and Data Set Publication There are generally two ways to access instances of COLINDA. First way is using the direct access to single RDF instances via URIs designed as follows: http://data.colinda.org/conference?id=id http://data.colinda.org/resource/conference/id The id represent the COLINDA s internal id that starts by integer value of 1 up to the number of current instances which is around 6000. All responses are in RDF/XML notation. The REST like access is realised suing the REST PHP Framework Restler 2.0 version 17. The second access possibility is via the SPARQL endpoint that is accessible at: http://data.colinda.org/endpoint.php. This endpoint supports up to 250000 result triples per query and retrieves results in various widely supported formats like JSON, RDF/XML, XML, TSV etc. How 16 http://curl.haxx.se/ 17 http://luracast.com/products/restler/

S.Softic et al. / COLINDA - Conference Linked Data 5 Table 3 Conference counts by country in COLINDA. Year Count 2007 416 2008 2141 2009 2665 2010 768 2011 13 the endpoint can be queried is described by simple example in listings 3. Listing 3: Sample SPARQL query for retrieval of conference meta fields. PREFIX swrc: <http://swrc.ontoware.org/ontology#> PREFIX rdfs: <http://www.w3.org/2000/01/rdf schema#> SELECT?x rdfs:label "LREC2008" OPTIONAL?x swrc:description?des; swrc:keywords?key; swrc:startdate?st; swrc:location?loc. Table 4 Top 5 conference counts by country in COLINDA. Country Count China 258 Germany 248 Italy 229 France 203 Spain 162 Japan 110 nection to GeoNames is present for the conference instance. In order to approve the interlinking quality COLINDA was analysed upon this property. All Conference RDF instances which included the location property has been set to value 1 while others to degree 0. Out of this data a normed histogram over the whole range of instances in COLINDA has been generated. Histogram for this analysis can be seen in figure 2, showing very high degree of interlinked instances. Further executable samples of SPARQL queries can be found at: http://data.colinda.org/endpoint.html. Additional information about COLINDA and the RDF data dumps is also accessible via the CKAN Registry of LOD Cloud: http://datahub.io/dataset/colinda. Fig. 2. Percentage of instances including the swrc:location property. 3.6. Data Characteristics In order to offer a short overview about conferences we created a table of counts of conference instances per year as well an overview over conference occurence with respect to the venue presented in the tables 3 and 4. As it can be seen in table 3 currently most conferences date from 2008 and 2009 since those dumps from WIkiCfP has been imported completely yet. The top 5 conference count cover in sum only 1/6 part of the whole locations contained. Thus table 4 reveals us that the location dissemination of conferences tends to vary strongly inside COLINDA. A full resolution of interlinking is provided only when the location property swrc:location is included in instance making triples, which means only in this cases a con- 4. Usage Based upon example of "Researcher Affinity Browser" [3] we want to demonstrate a possible appliance case for COLINDA linked data set."researcher Affinity Browser" has been developed in the realm of our research as semantically driven microblog data Mesh Up for the needs of Research 2.0. COLINDA was used as mining source for the facetted distinction of scientist Twitter profiles based upon conferences they visited as special affinity criteria. Subsection 4.1 offers a short functional overview over the application as well the role that data from COLINDA plays within. Ad-

6 S.Softic et al. / COLINDA - Conference Linked Data equate demo video showing the "Researcher Affinity Browser" in action can be also viewed online 18. 4.1. Researcher Affinity Browser Fig. 3. "Researcher Affinity Browser Application" snapshot. The "Researcher Affinity Browser Application" [3] is depicted in figure 3. At the beginning it retrieves a list of relevant users. Those results represent a current snapshot which means that every time users produce new tweets on Twitter, the analysis result evolves with it. The relevance is measured according to the number of common conceptual affinities.different affinity facets are displayed on the left. Users can explore three types of affinities: conferences, tags and mentions. Activation of a certain affinity filters the list of matching persons. There is the result table that displays detailed information about each person and how many affinities are shared. Further there is a map view and an affinity plot synchronized with the result table. The affinity plot visualizes in a quick overview affinity correspondence between the analyzed profile and other profiles in the system. 5. Discussion and Future Work As outlined in subsection 4.1 COLINDA has been already used as mining source for generation and enhancement of Researcher Affinity Browser [3] a semantically driven microblog data Mesh Up interface for Research 2.0. However a shortcoming of current state of the COLINDA is still low number of triplified conference instances that counts currently about 6000 which produces around 140000 triples. We are aim- 18 http://www.youtube.com/watch?v=a25drp3mv8w ing at publishing the rest of the conference data (additional 25000 up to 30000 conferences) we extracted as soon we get the permissions and finish the preprocessing. The interlinking process with curl requests is very time consuming, therefore we will use GeoNames dump instead. Further we want to interlink via conference labels our conference entries to the DBLP (Digital Bibliography and Library Project) 19 Linked Data Set. DBLP also provides dumps which will make the process more easiers as it was in the case of GeoNames. Extending the REST interface to allow retrieval of triples of single conferences in the manner: http://url/resource/conference/label/year will be included in the next steps. We also want to implement an Lookup 20 and Spotlight 21 service comparable to the DBPedia and a facetted search interface like the one provided by the DBLP. In order to approve the quality of COLINDA an evaluation against Linked Data Integration Benchmark (LODIB) 22 will be done. Acknowledgement The research activities that have been described in this paper were funded by Ghent University, the Social Learning Department at Graz University of Technology, iminds (an independent research institute founded by the Flemish government to stimulate ICT innovation), the Institute for the Promotion of Innovation by Science and Technology in Flanders (IWT), the Fund for Scientific Research-Flanders (FWO-Flanders), and the European Union. References [1] Y. Sure,S. Bloehdorn, P. Haase, J. Hartmann and D. Oberle,The SWRC ontology - Semantic Web for research communities, in: Proceedings of the 12th Portuguese Conference on Artificial Intelligence - Progress in Artificial Intelligence (EPIA 2005), 2005, pp. 218 231. [2] T. Berners-Lee, Linked Data - Design Issues, http://www.w3.org/designissues/linkeddata.htm, W3C, 2006. [3] L. De Vocht, S. Softic, M. Ebner, and H. Mühlburger, Semantically driven social data aggregation interfaces for Research 2.0, in: Proceedings of the 11th International Conference on Knowledge Management and Knowledge Technologies, New York, NY, USA, ACM, 2011, pp.43:1 43:9. 19 http://dblp.l3s.de/ 20 http://lookup.dbpedia.org 21 https://github.com/dbpedia-spotlight/ 22 http://lodib.wbsg.de/