Semantic Lifting of Unstructured Data Based on NLP Inference of Annotations 1

Ivo Marinchev

Abstract: The paper introduces an approach to semantic lifting of unstructured data with the help of natural language processing (NLP) technologies. Our approach processes text fragments with NLP tools in order to tag some of the natural language words and phrases with semantic annotations. These inferred annotations are then lifted to the ontology level in the form of ontology instances, which become preliminary automatic annotations of the target text fragments and can later be optionally confirmed and refined by domain experts.

Key words: Semantic Web, Natural Language Processing, Ontologies, Instance Data, Semantic Lifting, OWL, RDFS, RDF.

INTRODUCTION

In the field of software engineering, structured data is data that can be described with a formal data model. A data model explicitly determines the structure of the data. Data models describe structured data for storage in data management systems such as relational databases. In our previous paper [1] we presented an approach to semantic lifting of structured data based on transformation rules created from the annotations of the data model with a given set of ontologies.

Unfortunately, data models do not describe unstructured data such as word processing documents, email messages, pictures, digital audio, digital video, and untyped data fields (TEXT, CLOB, BLOB, etc.) of relational databases. In practice, most of the information on the Web today is unstructured, and the proliferation of digital media and social network services (such as the well-known Facebook, Google+, and Twitter) makes unstructured data overwhelmingly widespread. Hence the need for tools and technologies for processing unstructured data is constantly increasing.

In this paper we introduce an extension of our semantic lifting approach [1] to semi-structured and unstructured data.
We apply natural language processing (NLP) tools and technologies to the text fragments in order to tag some of the natural language words and phrases with semantic annotations. These inferred annotations are then lifted to the ontology level in the form of ontology instances, which become preliminary automatic annotations of the target text fragments and can later be optionally confirmed and refined by domain experts.

1 This work was partially supported by Research Project No. D-002-189 funded by the National Science Fund of Bulgaria.

BASE ONTOLOGY AND SPECIALIZED ONTOLOGIES

The data managed by computers is never completely standalone. There is always metadata (annotations) that describes the raw data, even at the most
general level that people use to organize, manage, and search it. Formally, these general annotations about the data can be represented as an ontology that we call the base ontology. For the examples in this paper we use data from the digital library Virtual Encyclopaedia of Bulgarian Iconography [3]; all existing annotations of iconographical objects in the original system are collected in the base ontology that we use throughout our project. In fact, the base ontology captures the pre-existing formal data model that any information system has, even when that model was built unintentionally.

For the unstructured part of the information we need a set of specialized ontologies that describe the domains that the natural language texts refer to. These specialized ontologies are created by domain experts and are usually narrow in scope but rich in detail, in the form of ontology instances that are concrete representations of the ontology concepts of a given domain. In our case we use specialized ontologies for the technologies of iconographic objects and for the descriptions of iconographic scenes. These ontologies have ontology classes (for example, Primer) and ontology instances that represent different instances of those classes (for example, the different types of primer found in the real world: oil primer, gum primer, emulsion primer, polymeric primer, combined primer, etc.). In our approach, ontology instances are the main tool for annotating natural language texts, as they correspond to the concrete words and phrases that people use.

The last detail that we want to emphasize is that the base ontology and the specialized ontologies have to be linked so that the graph of semantic classes (concepts) is traversable and allows the creation of semantic queries (in our case SPARQL [6]) that use concepts from different ontologies. Our solution to this problem is to link all specialized ontologies to the base ontology with the help of equivalent classes (in OWL terms, owl:equivalentClass).
So the base ontology becomes the main linking point of the whole semantic network.

GENERAL ALGORITHM

In this section we describe the general algorithm used in our approach to semantic lifting of unstructured data. The details of the individual steps are given in the following sections, together with examples for every step taken from the data [3] used in our project [5].

The input of our algorithm is as follows:
- The base ontology.
- The list of all data properties from the base ontology that contain natural language text for annotation.
- The specialized ontologies related to the annotated texts.
- A web service for semantic annotation of natural language texts given a set of ontologies; in our case we use the CLaRK system [4]. The details of the natural language annotation process that we employ are given in [7]. That work is also part of the SINUS project [5], but it was created as a completely separate module and is outside the scope of this paper. The algorithm described here does not depend on the NLP module used and can work with other modules if they are available.
- A semantic lifting service.
- A semantic repository service where the semantic annotations are stored; in our case we use the OWLIM [2] semantic repository.

Output: Semantic annotations inferred from the natural language texts, semantically lifted to the form of unique ontology instances.

Basic algorithm:
1. Create a document with the natural language texts for semantic annotation.
2. Annotate the text document from the previous step with the CLaRK system.
3. Lift the annotations to the semantic level.

The details of these 3 steps, accompanied by concrete examples, are presented in the following sections.

EXTRACTING TEXTS FOR NLP ANNOTATIONS

This step uses the base ontology and the list of data properties that contain natural language texts to extract the text fragments into a separate XML document, which is sent to the semantic annotation NLP service. The general steps are the following:
1. Load the original non-semantic data.
2. Use the semantic lifting service [1] to lift any structured part of the information. This step is required for the creation of the unique identifiers [1] of the ontology instances.
3. Extract all specified data properties and insert them into the XML document that was agreed upon as the interface between the lifting service and the NLP annotation service (the CLaRK system).

Below we present an example of the iconographic object with unique identifier 31 from [3], with the data properties OWLDataProperty_iconographicalTechnique_has_Description and OWLDataProperty_baseMaterial_has_Description:

<OWLClass_IconographicalTechnique велатури. Приложена е и техниката пробастър върху позлатата и върху ореола има гравировки. Лаковото покритие е нанесено тънко и равномерно.</OWLDataProperty_iconographicalTechnique_has_Description>
<OWLClass_SolidMaterial rdf:about="#owlclass_solidmaterial_a7db65f2b4989e34c402d219b137cbb9">
  <OWLDataProperty_baseMaterial_has_Description>Две дъски от иглолистна дървесина мура. Гипсов грунд, нанесен тънко и равномерно.</OWLDataProperty_baseMaterial_has_Description>
</OWLClass_SolidMaterial>

ANNOTATING NATURAL LANGUAGE TEXTS

The XML document created in the previous step is sent to the NLP semantic annotation service, in our case the CLaRK system [4, 7]. The inferred annotations are added to the original tags of the XML document inside a special <annotation /> tag.
Every annotation is represented by a <property ... /> tag within the annotation tag. Each property tag contains a class from the specialized ontologies (domain attribute), the name of the object property corresponding to the selected class (uri attribute), and an ontology instance (rvalue attribute). These tags represent the formalized semantics of the corresponding natural language text fragment. The property tag looks like the example below:

<property domain="owlclass_primer" rvalue="owlindividual_fillerplaster" uri="owlobjectproperty_primer_has_filler" />

In this example the ontology class is OWLClass_Primer, the name of its object property is OWLObjectProperty_primer_has_Filler, and the ontology instance is OWLIndividual_FillerPlaster.

The result of the annotation service applied to the XML document presented in the previous section is shown below:
<OWLClass_IconographicalTechnique xmlns="" велатури. Приложена е и техниката " пробастър " върху позлатата и върху ореола има гравировки. Лаковото покритие е нанесено тънко и равномерно.
    <annotation>
      <property domain="owlclass_gilding" rvalue="owlindividual_typeofgilding01" uri="owlobjectproperty_gilding_has_type"/>
      <property domain="owlclass_lacquering" rvalue="owlindividual_thicknessoflacquering04" uri="owlobjectproperty_lacquering_has_thickness"/>
      <property domain="owlclass_lacquering" rvalue="owlindividual_evennessoflacquering02" uri="owlobjectproperty_lacquering_has_evenness"/>
    </annotation>
  </OWLDataProperty_iconographicalTechnique_has_Description>
<OWLClass_SolidMaterial xmlns="" rdf:about="#owlclass_solidmaterial_a7db65f2b4989e34c402d219b137cbb9">
  <OWLDataProperty_baseMaterial_has_Description>Две дъски от иглолистна дървесина - мура. Гипсов грунд, нанесен тънко и равномерно.
    <annotation>
      <property domain="owlclass_primer" rvalue="owlindividual_fillerplaster" uri="owlobjectproperty_primer_has_filler"/>
      <property domain="owlclass_primer" rvalue="owlindividual_thicknessofprimer01" uri="owlobjectproperty_primer_has_thickness"/>
      <property domain="owlclass_primer" rvalue="owlindividual_evennessofprimer01" uri="owlobjectproperty_primer_has_evenness"/>
    </annotation>
  </OWLDataProperty_baseMaterial_has_Description>
</OWLClass_SolidMaterial>

LIFTING ANNOTATIONS TO SEMANTIC LEVEL

In this step all annotations are lifted to the semantic level, given the base ontology and the specialized ontologies. The annotated XML document contains object properties and ontology instances from the specialized ontologies. For example, in the previous step we obtained the following annotations (the content of the <annotation> element):

<OWLClass_IconographicalTechnique xmlns="" велатури. Приложена е и техниката " пробастър " върху позлатата и върху ореола има гравировки. Лаковото покритие е нанесено тънко и равномерно.
    <annotation>
      <property domain="owlclass_gilding" rvalue="owlindividual_typeofgilding01" uri="owlobjectproperty_gilding_has_type"/>
      <property domain="owlclass_lacquering" rvalue="owlindividual_thicknessoflacquering04" uri="owlobjectproperty_lacquering_has_thickness"/>
      <property domain="owlclass_lacquering" rvalue="owlindividual_evennessoflacquering02" uri="owlobjectproperty_lacquering_has_evenness"/>
    </annotation>
  </OWLDataProperty_iconographicalTechnique_has_Description>

The lifting algorithm is as follows:
1. Using the specialized ontology, we find all paths from the classes represented by their ontology instances in the annotations to the root classes of the specialized ontology. The root classes are all classes of the specialized ontology that are in an equivalent-class relation with a class of the base ontology. The edges of the paths between classes (in fact, between their instances) are object properties of the specialized ontology.
2. For all intermediate classes (excluding the root classes), ontology instances are created in the same way as in the algorithm for lifting structured data [1], and all such instances are assigned unique identifiers [1] in the form of hash values (for example, MD5) based on the instance property values.
3. The lifting results are stored in the semantic repository. The root element of the semantic graph is the ontology instance that corresponds to the class of the base ontology (not the one of the equivalent class in the specialized ontology). The latter is needed to make the link between the instances of both ontologies explicit.

Because these annotations are created automatically, we call them preliminary: in practical applications they usually need to be reviewed and adjusted by a domain expert to avoid errors or discrepancies. That is why our implementation marks these annotations in a special way, so that one can determine which annotations are preliminary and which have been confirmed by a human expert. Preliminary annotations are denoted with the suffix _P at the end of the name of the corresponding object property of the root class; for example, the object property OWLObjectProperty_primer_has_Filler becomes the preliminary annotation OWLObjectProperty_primer_has_Filler_P. When the annotation is confirmed by a domain expert, the suffix is removed.

Below we present the example used throughout the paper with the final semantic annotations applied, in the format used for storage in our semantic repository.
<OWLClass_Primer rdf:about="#owlclass_primer_9b2141e798c294d87f0e1f2d59cdd5e0">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing" />
  <OWLObjectProperty_primer_has_Filler_P rdf:resource="owlindividual_fillerplaster" />
  <OWLObjectProperty_primer_has_Thickness_P rdf:resource="owlindividual_thicknessofprimer01" />
  <OWLObjectProperty_primer_has_Evenness_P rdf:resource="owlindividual_evennessofprimer01" />
</OWLClass_Primer>
<OWLClass_Lacquering rdf:about="#owlclass_lacquering_40e7680df9765729429bbf0ac1f95c65">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing" />
  <OWLObjectProperty_lacquering_has_Thickness_P rdf:resource="owlindividual_thicknessoflacquering04" />
  <OWLObjectProperty_lacquering_has_Evenness_P rdf:resource="owlindividual_evennessoflacquering02" />
</OWLClass_Lacquering>
<OWLClass_Gilding rdf:about="#owlclass_gilding_df1d6429adcbadc255cd101e875e4fb8">
  <rdf:type rdf:resource="http://www.w3.org/2002/07/owl#Thing" />
  <OWLObjectProperty_gilding_has_Type_P rdf:resource="owlindividual_typeofgilding01" />
</OWLClass_Gilding>
<OWLClass_IconographicalTechnique xmlns="" велатури. Приложена е и техниката " пробастър " върху позлатата и върху ореола има гравировки. Лаковото покритие е нанесено тънко и равномерно.</OWLDataProperty_iconographicalTechnique_has_Description>
  <ObjectProperty_iTechnique_uses_Gilding rdf:resource="owlclass_gilding_df1d6429adcbadc255cd101e875e4fb8" />
  <ObjectProperty_iTechnique_uses_Lacquering rdf:resource="owlclass_lacquering_40e7680df9765729429bbf0ac1f95c65" />
<OWLClass_SolidMaterial xmlns="" rdf:about="#owlclass_solidmaterial_a7db65f2b4989e34c402d219b137cbb9">
  <OWLDataProperty_iconographicalObject_has_URI>31</OWLDataProperty_iconographicalObject_has_URI>
  <OWLDataProperty_baseMaterial_has_Description>Две дъски от иглолистна дървесина - мура. Гипсов грунд, нанесен тънко и равномерно.</OWLDataProperty_baseMaterial_has_Description>
  <ObjectProperty_base_has_Component rdf:resource="owlclass_primer_9b2141e798c294d87f0e1f2d59cdd5e0" />
</OWLClass_SolidMaterial>
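
The core of the lifting step can be illustrated with a short Python sketch. It is a minimal illustration under stated assumptions, not the implementation used in the project: the function names (parse_annotation, instance_id, lift, confirm) are hypothetical, and the hashing scheme (MD5 over the class name and the sorted property/value pairs) is only one possible reading of "hash values based on instance property values", so the identifiers it produces will not match those in the listing above. The sketch covers grouping the annotation properties by class, deriving a unique identifier, and marking properties as preliminary with the _P suffix; the path search to the root classes and the storage in the semantic repository are left out.

```python
import hashlib
import xml.etree.ElementTree as ET
from collections import defaultdict

PRELIMINARY_SUFFIX = "_P"  # suffix used in the paper for unconfirmed annotations

# Annotation fragment copied from the CLaRK output shown earlier in the paper.
ANNOTATION = """<annotation>
  <property domain="owlclass_primer" rvalue="owlindividual_fillerplaster" uri="owlobjectproperty_primer_has_filler"/>
  <property domain="owlclass_primer" rvalue="owlindividual_thicknessofprimer01" uri="owlobjectproperty_primer_has_thickness"/>
  <property domain="owlclass_primer" rvalue="owlindividual_evennessofprimer01" uri="owlobjectproperty_primer_has_evenness"/>
</annotation>"""


def parse_annotation(annotation_xml):
    """Group (object property, ontology instance) pairs by ontology class."""
    root = ET.fromstring(annotation_xml)
    grouped = defaultdict(list)
    for prop in root.iter("property"):
        grouped[prop.get("domain")].append((prop.get("uri"), prop.get("rvalue")))
    return dict(grouped)


def instance_id(class_name, properties):
    """Hypothetical identifier scheme: class name plus an MD5 digest of the
    class name and its sorted property/value pairs (an assumption; the
    scheme actually used is described in [1])."""
    payload = class_name + "|" + "|".join(f"{u}={v}" for u, v in sorted(properties))
    return class_name + "_" + hashlib.md5(payload.encode("utf-8")).hexdigest()


def lift(annotation_xml):
    """Create one preliminary instance per annotated class.

    Returns {instance identifier: {property name with _P suffix: instance}}.
    A class that repeats the same property keeps only the last value here.
    """
    instances = {}
    for class_name, props in parse_annotation(annotation_xml).items():
        instances[instance_id(class_name, props)] = {
            uri + PRELIMINARY_SUFFIX: value for uri, value in props
        }
    return instances


def confirm(instance, prop_name):
    """Expert confirmation: remove the _P suffix from one property in place."""
    instance[prop_name] = instance.pop(prop_name + PRELIMINARY_SUFFIX)


if __name__ == "__main__":
    for identifier, props in lift(ANNOTATION).items():
        print(identifier)
        for name, value in props.items():
            print("  ", name, "->", value)
```

Because the identifier is derived only from the class name and the property values, lifting the same annotation twice yields the same instance identifier, which is what makes the generated instances unique in the repository. Calling confirm on a property models the review by a domain expert described above.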

CONCLUSIONS AND FUTURE WORK

In this paper we presented an approach to semantic lifting of unstructured data. As a practical application, we applied it to unstructured information from the database of the digital library Virtual Encyclopaedia of Bulgarian Iconography [3]. All the work was done within the scope of the SINUS project [5]. Our objective was to enrich the data with additional annotations inferred from the natural language descriptions that accompany the iconographic objects. These annotations will be used for building and executing semantic queries against the ontology instances in order to infer information that cannot be obtained from the original digital library. Our work is also a practical example of upgrading an existing legacy system to the Semantic Web level.

REFERENCES

[1] Marinchev I. Lifting and Lowering the Data from Digital Library "Virtual Encyclopedia of Bulgarian Iconography". Proc. of the 12th International Conference on Computer Systems and Technologies CompSysTech 2011, Vienna, Austria, June 16-17, 2011, ACM ISBN: 978-1-4503-0917-2, pp. 179-184.
[2] OWLIM family of semantic repositories. http://www.ontotext.com/owlim/
[3] Pavlova-Draganova L., V. Georgiev, L. Draganov. Virtual Encyclopaedia of Bulgarian Iconography. Information Technologies and Knowledge, vol. 1 (2007), 3, pp. 267-271.
[4] Simov K., Z. Peev, M. Kouylekov, A. Simov, M. Dimitrov, A. Kiryakov. 2001. CLaRK - an XML-based System for Corpora Development. In: Proc. of the Corpus Linguistics 2001, pp. 548-560.
[5] SINUS Project: Semantic Technologies for Web Services and Technology Enhanced Learning. http://sinus.iinf.bas.bg/
[6] SPARQL query language. http://www.w3.org/TR/rdf-sparql-query
[7] Staykova K., Agre G., Simov K., Osenova P. Language Technology Support for Semantic Annotation of Iconographic Descriptions. In: Proceedings of the International Workshop "Language Technologies for Digital Humanities and Cultural Heritage", 16 Sept. 2011, Hisar, Bulgaria, ISBN 978-954-452-019-9, pp. 51-57.

ABOUT THE AUTHOR

Assoc. Prof. Ivo Marinchev, PhD, Department of Technologies for Knowledge Management and Processing, Institute of Information and Communication Technologies, Bulgarian Academy of Sciences, Phone: (+359 2) 870 75 86, E-mail: ivo@iinf.bas.bg.