DBpedia German: Extensions and Applications
Alexandru-Aurelian Todor, FU-Berlin
Innovationsforum Semantic Media Web, 7 October 2014
Overview
  Why DBpedia?
  New Developments in DBpedia German
  Problems in DBpedia
Why DBpedia?
  Knowledge bases are the core of intelligent web applications:
    Gazetteerless NER
    Question answering engines
    Document enrichment
    Relation extraction
    Event detection
  Large web companies are developing their own alternatives:
    Google Knowledge Graph/Freebase
    Microsoft Satori KB
    Wikidata
    Yahoo Knowledge Graph
    IBM Watson API, Wolfram Alpha, etc.
DBpedia-LOD Cloud
  [Figure: DBpedia in the LOD Cloud]
International DBpedia Chapters
  http://wiki.dbpedia.org/internationalization/chapters
  Goal: provide additional resources for extraction, access, and services, including language-specific endpoints, services, and extensions
  The German chapter addresses the German language
  URL: http://de.dbpedia.org/
DBpedia German: What We Offer
  DBpedia German Data Dumps: http://de.dbpedia.org/downloads/
  DBpedia German SPARQL Endpoint: http://de.dbpedia.org/sparql
  DBpedia German Spotlight: http://de.dbpedia.org/spotlight/demo
  DBpedia German Live: http://live.de.dbpedia.org/sparql
  DBpedia German Live Changesets: http://live.de.dbpedia.org/changesets
  Improved interlinking data, e.g. the Linked Hypernyms Dataset (hypernyms extracted from the first sentences of articles)
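As a hedged illustration of how the SPARQL endpoint listed above might be queried, the following Python sketch builds a request URL using the `query` parameter of the standard SPARQL protocol. The `format` parameter and the sample query are assumptions made for this example. The network call is kept in a separate function that is not invoked here.

```python
import json
import urllib.parse
import urllib.request

ENDPOINT = "http://de.dbpedia.org/sparql"  # endpoint URL taken from the slide


def build_query_url(query, endpoint=ENDPOINT):
    """Return a GET URL that carries the SPARQL query in the 'query' parameter."""
    params = {
        "query": query,
        # Assumption: the endpoint honours a 'format' parameter for JSON results.
        "format": "application/sparql-results+json",
    }
    return endpoint + "?" + urllib.parse.urlencode(params)


def run_query(query, endpoint=ENDPOINT, timeout=30):
    """Send the query and return the parsed JSON result (needs network access)."""
    url = build_query_url(query, endpoint)
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


# Illustrative query: fetch any five triples.
SAMPLE_QUERY = "SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 5"
url = build_query_url(SAMPLE_QUERY)
```

Calling `run_query(SAMPLE_QUERY)` would send the request; whether the endpoint is reachable and which result formats it supports should be checked against the live service.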
DBpedia German Statistics
  Property            Value
  Triples             146 Million
  Classes             206
  Entities            4.3 Million
  Distinct Subjects   7.6 Million
  Properties          15609
  Distinct Objects    36.7 Million

  Category            Improvement
  Mappings            >30%
  Missing Labels      >300%
DBpedia German Infrastructure
  Main server (1):
    CPU: 2x Hexacore Xeon Ivy Bridge => 24 vcores
    Memory: 256 GB RAM
    SSD: ~1 TB RAID 5 array
    HDD: ~10 TB RAID 5 array
  Secondary servers (2):
    CPU: Quadcore Xeon Sandy Bridge => 16 vcores
    Memory: 32 GB x2
    HDD: 1 TB 10000 RPM x2
Problems in DBpedia
  Ontology & Missing Data
    Missing Labels
    Missing Types
  Editing Capabilities
  Administration
Missing Labels
  Why are missing labels a problem?
    People don't understand the ontology
    New classes and properties are created needlessly
    Not a true multilingual ontology

  Language   Missing Class Labels   Missing Property Labels
  German     147                    1530
  French     280                    2534
  Spanish    706                    2747
  Italian    496                    2712
  Polish     678                    2709
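Counts like those in the table can in principle be obtained by checking each ontology class or property for a label in the target language. The sketch below does this over a small in-memory sample. The function name `missing_labels` and the sample labels are invented for illustration and are not taken from the DBpedia ontology dump.

```python
def missing_labels(terms, lang):
    """Return the sorted names of terms with no non-empty label in `lang`.

    `terms` maps a term name to a dict of {language code: label}.
    """
    return sorted(
        name
        for name, labels in terms.items()
        if not labels.get(lang, "").strip()
    )


# Hypothetical sample data, for illustration only.
sample_terms = {
    "dbo:Actor": {"en": "actor", "de": "Schauspieler"},
    "dbo:Politician": {"en": "politician"},          # no German label at all
    "dbo:Settlement": {"en": "settlement", "de": ""},  # empty German label
}

missing_de = missing_labels(sample_terms, "de")
missing_en = missing_labels(sample_terms, "en")
```

Here both an absent entry and an empty string count as missing, which is a design choice of this sketch rather than a documented rule of the mappings wiki.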
Missing Labels
  How do we address the problem?
    Automatically translate labels using translation services
    Present translation suggestions to editors in a batch mode
    Allow editors to edit and commit multiple translations at the same time
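The three steps above (translate, review in a batch, commit together) could be modelled roughly as follows. This is a minimal sketch and not MissingBot's actual interface: the `Suggestion` class, both functions, and the lookup table standing in for a translation service are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Suggestion:
    term: str
    source_label: str
    suggested_label: str
    accepted: bool = False  # set to True by an editor during review


def suggest_translations(missing, translate):
    """Create one suggestion per term that lacks a label.

    `missing` maps term -> English label; `translate` is any callable
    taking a string and returning its translation.
    """
    return [Suggestion(term, label, translate(label)) for term, label in missing.items()]


def commit_batch(suggestions):
    """Collect all accepted suggestions into one batch of {term: label}."""
    return {s.term: s.suggested_label for s in suggestions if s.accepted}


# Stand-in for a real translation service (hypothetical lookup table).
fake_dictionary = {"politician": "Politiker", "settlement": "Siedlung"}

batch = suggest_translations(
    {"dbo:Politician": "politician", "dbo:Settlement": "settlement"},
    fake_dictionary.get,
)
batch[0].accepted = True  # the editor approves only the first suggestion
committed = commit_batch(batch)
```

Keeping the review flag on each suggestion lets an editor approve or skip items individually while still committing everything accepted in a single step.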
Missing Labels: MissingBot
  Bot framework for editing the Mappings Wiki
  REST service for communicating with the Mappings Wiki and other applications
  Plugins for the Mappings Wiki for reviewing added information
  https://github.com/dbpedia/missingbot
MissingBot Label Translations
  [Figure: MissingBot label translations]
Missing Types
  Why are missing types a problem?
    rdf:type statements are the main way we query a KB
    Without precise type information there is no easy way to ask:
      Which CDU politicians were born in Berlin?
      List all capitals in Europe
      List all actors (Schauspieler) in Berlin
    Without precise type information NER annotations are imprecise
      You can't filter out or select specific entities
      Example: annotate only politicians, or only software companies, in a text document
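To make the role of rdf:type concrete, the sketch below builds a SPARQL query that selects entities of a given ontology class and narrows them with additional triple patterns. The helper `typed_query` is hypothetical, and the property name `birthPlace` and the resource IRI for Berlin are assumptions about how the data is modelled.

```python
PREFIXES = """PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX dbo: <http://dbpedia.org/ontology/>
"""


def typed_query(class_name, constraints=(), limit=100):
    """Build a SELECT query for entities of one DBpedia ontology class.

    `constraints` is a sequence of (property, object IRI) pairs, each added
    as an extra triple pattern on the selected entity.
    """
    lines = [f"  ?entity rdf:type dbo:{class_name} ."]
    for prop, obj in constraints:
        lines.append(f"  ?entity dbo:{prop} <{obj}> .")
    body = "\n".join(lines)
    return f"{PREFIXES}SELECT DISTINCT ?entity WHERE {{\n{body}\n}} LIMIT {limit}"


# Politicians born in Berlin (property name and IRI are assumptions).
query = typed_query(
    "Politician",
    [("birthPlace", "http://de.dbpedia.org/resource/Berlin")],
)
```

If an entity lacks the `dbo:Politician` type statement, the first triple pattern excludes it from the result even when the birthplace matches, which is the failure mode the slide describes.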
Missing Types
  Solution: Linked Hypernyms Dataset (LHD)
    Cooperation with the Prague University of Economics
    http://ner.vse.cz/datasets/linkedhypernyms/
    Extract type information from hypernyms
    Significant improvement over the instance-types dataset

                     DBpedia Instance Types   LHD 1.0   LHD 2.0
  Nr. of resources   910834                   893120    795415
  New resources      N/A                      495924    403475
  Improvement        N/A                      52.5%     44.3%
LHD Examples from the German DBpedia
  http://de.dbpedia.org/resource/brad_pitt
    DBpedia types: wikidata:q5, owl:thing, schemaorg:person, wikidata:q215627, dul:agent, dul:naturalperson, dbo:agent, dbo:person
    LHD 1.0 type:  dbo:actor
    LHD 2.0 types: dbo:actor
  http://de.dbpedia.org/resource/tom_hanks
    DBpedia types: same
    LHD 1.0 type:  dbo:actor
    LHD 2.0 types: dbo:actor
  http://de.dbpedia.org/resource/wladimir_wladimirowitsch_putin
    DBpedia types: same
    LHD 1.0 type:  dbo:politician
    LHD 2.0 types: dbo:politician
  http://de.dbpedia.org/resource/barack_obama
    DBpedia types: same
    LHD 1.0 type:  dbo:politician
    LHD 2.0 types: dbo:politician
  http://de.dbpedia.org/resource/berlin
    DBpedia types: schemaorg:place, odp:location, dbo:place, wikidata:q532, opengis:_feature
    LHD 1.0 type:  http://dbpedia.org/resource/capital_city
    LHD 2.0 types: dbo:place
  http://de.dbpedia.org/resource/leipzig
    DBpedia types: same + dbo:populatedplace, dbo:settlement
    LHD 1.0 type:  http://de.dbpedia.org/page/großstadt
    LHD 2.0 types: dbo:place
Missing Ontology Editing Capabilities
  Why are missing ontology editing capabilities a problem?
    No good overview of the ontology
    No efficient way to rename or reorganize classes
    No efficient way to align the ontology with other ontologies
Missing Ontology Editing Capabilities: Web Protégé Integration
  How to solve the ontology editing problem?
    Use an advanced collaborative ontology editor
    Solve the compatibility problem by integrating the editor into the existing framework
    Solve authentication and synchronization problems
  Architecture: [figure]
Missing Editing Capabilities
  [Figure]
Why Administration Is a Problem
  Configuring the different DBpedia services is a very complex task
    DBpedia Static: configuring the abstract extraction, generating datasets, and importing them into Virtuoso
    DBpedia Live: creating a SyncWiki, configuring the live extraction and an endpoint for the streaming updates
    DBpedia Spotlight: configuring a Hadoop cluster for dataset generation and then configuring a REST service
    DBpedia Lookup: generating the index for the lookup service
  Debugging: problems are very specific to a configuration; there is no way to inspect specific issues without replicating the environment
Addressing the Administration Problem
  Container virtualisation
    Package the different DBpedia services in Docker containers
    Share containers together with the configuration
  Docker
    Build once, run everywhere
    Filesystem-level versioning
    Small containers
    Easy deployment
  Docker Hub
    Share containers
    Push and pull
    https://registry.hub.docker.com/repos/alexandru/
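The pull-and-run workflow described above can be sketched as follows. The Python helper only assembles the `docker pull` and `docker run` command lines and does not execute them. The image name, container name, and port numbers are hypothetical placeholders, not the names of the published containers.

```python
import shlex


def docker_commands(image, name, host_port, container_port):
    """Return the `docker pull` and `docker run` commands as argument lists."""
    pull = ["docker", "pull", image]
    run = [
        "docker", "run",
        "-d",                                   # run detached, in the background
        "--name", name,                         # name of the container
        "-p", f"{host_port}:{container_port}",  # publish container port on host
        image,
    ]
    return pull, run


# Hypothetical image name and ports, for illustration only.
pull_cmd, run_cmd = docker_commands(
    "example/dbpedia-spotlight-de", "spotlight-de", 2222, 80
)
printable = " && ".join(shlex.join(cmd) for cmd in (pull_cmd, run_cmd))
```

Passing the argument lists to `subprocess.run` would execute them on a machine with Docker installed; because the configuration ships inside the image, the same two commands would reproduce the service on any host.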
Addressing the Administration Problem: DBpedia+Docker
  [Figure: Dockerized DBpedia, with components DBpedia Spotlight, Static Endpoint, Live Endpoint, Static Extraction, Live Extraction, DBpedia SyncWiki]
Thank You! http://www.corporate-smart-content.de