Automatic Metadata Generation in an Archaeological Digital Library: Semantic Annotation of Grey Literature

Andreas Vlachidis, Ceri Binding, Keith May, and Douglas Tudhope

Andreas Vlachidis, Ceri Binding, Douglas Tudhope: Hypermedia Research Unit, University of Glamorgan, UK (avlachid@glam.ac.uk, cbinding@glam.ac.uk, dstudhope@glam.ac.uk)
Keith May: English Heritage, UK (keith.may@english-heritage.org.uk)

A. Przepiórkowski et al. (Eds.): Computational Linguistics, SCI 458, Springer-Verlag Berlin Heidelberg 2013

Abstract. This paper discusses the automatic generation of rich metadata from excavation reports from the Archaeological Data Service library of grey literature (OASIS). The work is part of the STAR project, in collaboration with English Heritage. An extension of the CIDOC CRM ontology for the archaeological domain acts as a core ontology. Rich metadata is automatically extracted from grey literature, directed by the CRM, via a three-phase process of semantic enrichment employing the GATE toolkit augmented with bespoke rules and knowledge resources. The paper demonstrates the potential of combining knowledge based resources (ontologies and thesauri) in information extraction, and techniques for delivering the automatically extracted metadata as XML annotations coupled with the grey literature reports and as RDF graphs decoupled from content. Examples from two consuming applications are discussed: the Andronikos web portal, which serves the annotated XML files for visual inspection, and the STAR project research demonstrator, which offers unified search across archaeological excavation data and grey literature via the core ontology CRM-EH.

Keywords: Automatic Metadata Generation, CIDOC CRM, Digital Archaeology, Digital Library, GATE, Knowledge Organization Systems, Information Extraction, Semantic Annotation, Semantic Search, SKOS.

1 Introduction

In archaeology today, we see digital libraries of grey literature reports and of excavation datasets but they are not meaningfully connected. This paper discusses
the automatic generation of metadata that makes possible semantic search of grey literature connected with diverse archaeological datasets. Such metadata carry definitions that enable information retrieval and cross searching on the semantic level via ontological and terminological models and references. Thus they are described here as rich metadata.

The work forms part of the broader Semantic Technologies for Archaeological Resources (STAR) project, which employs the CIDOC CRM (International Council of Museums Conceptual Reference Model) and its archaeological extension, CRM-EH, as a core ontology providing a contextual framework between different types of information sources and disparate datasets [15, 18]. In collaboration with English Heritage (EH), the STAR project has developed methods for linking digital archive databases, vocabularies and unpublished excavation reports for purposes of semantic cross search. The CRM-EH is necessary for expressing the semantics and complexities of relationships between the data elements and annotations, which underlie semantically defined archaeological user queries. The project has also employed knowledge resources, such as EH domain glossaries and thesauri, expressed as SKOS vocabularies [12]. These knowledge resources assist semantically defined queries and NLP information extraction from excavation reports.

An extract from the OASIS (Online AccesS to the Index of archaeological investigations) grey literature library, provided by the Archaeological Data Service, forms the STAR free text corpus. The term grey literature is used to describe documents and source materials that cannot be found through the conventional means of publication; many excavation reports exist in this format. The OASIS project is a joint effort of UK archaeology research groups, institutions, and organisations, coordinated by the University of York [13].
It aims to improve the communication of fieldwork results to the wider archaeological community.

This paper reports on an investigation of automatic methods for generating rich metadata that connects concepts via the ontological arrangements of the CIDOC CRM and CRM-EH and via the terminological SKOS definition of adopted vocabulary (English Heritage thesauri). The CIDOC CRM is an international standard (ISO 21127:2006) semantic framework, aiming to promote shared understanding of cultural heritage information [8, 10]. The CRM is capable of mapping any type of cultural heritage information, as published by museums, libraries and archives [2]. Extensibility is an important aspect; the CRM can be specialized when it is required by a domain. A finer granularity of detail can be expressed for domain purposes while still retaining interoperability at the core CRM level.

A particular extension of CRM that addresses the needs for semantic interoperability in the archaeology domain is the English Heritage (EH) extension. Based on the archaeological notion of context, modeled as a subclass of CRM Place, the CRM-EH ontology describes entities and relationships relating to a series of archaeological events, both in the past and present (i.e. during excavation) [7].

Previously, the role of the CRM for rich semantic annotations of text documents has been explored via an intellectual, non-automatic process, which has the potential for producing very fine grained annotation of specific, important documents, for example as part of the detailed Text Encoding Initiative markup [14]. Inevitably, this process will be resource intensive over a large corpus. Thus, automation of semantic annotation is highly desirable for advancing semantic interoperability over larger corpora.

The rest of this paper discusses the efforts of an automatic semantic annotation and rich metadata delivery of archaeological grey literature documents with respect to standard ontological and terminological definitions. The metadata extraction consists of a three-phase process of semantic enrichment, employing the General Architecture for Text Engineering (GATE) toolkit [9]. The initial phase pre-processes the grey literature and vocabulary resources. The second phase identifies archaeological domain concepts in context. The final phase transforms GATE annotations to two different representations for consuming applications. The first representation is as XML coupled with the grey literature report; this permits colour coded inline viewing of the annotations and also semantic metadata ranked by frequency of occurrence. The second representation is as RDF triples for semantic applications. In both cases, the extracted metadata are expressed as CRM entities, qualified by SKOS archaeological vocabulary concepts.

Two web applications use the resultant semantic annotations. The Andronikos web portal delivers the annotated XML files as hypertext documents for visual inspection of the information extraction results. The STAR research demonstrator offers unified searching of both data and grey literature in terms of the core ontology.
Examples of the semantically enriched grey literature from both applications are discussed in this paper, together with results from a preliminary evaluation exercise.

2 The Process of Semantic Enrichment

Information Extraction (IE) is a particular NLP technique relevant to the semantic enrichment of free text documents by extracting specific information snippets suitable for further manipulation [16, 6]. These semantic annotations in context enrich documents, enabling access on the basis of a conceptual structure, providing smooth traversal between unstructured text and conceptual models [5]. In addition, they can aid the integration of heterogeneous data sources by exploiting a conceptual structure and allowing users to search across resources for entities and relations, instead of words. Users can search for the term 'Paris' and a semantic annotation mechanism can relate the term with the abstract concept of 'city', while providing a link to the term 'France', which relates to the abstract concept 'country'. Employing a different conceptual model, the same term 'Paris' can be related with the concept of 'mythical hero' linked with the city of Troy from Homer's epic poem, the Iliad.

In this paper the discussion of the process of Semantic Enrichment is divided into the three broad phases of the IE pipeline, each subdivided into various subtasks (pipelines). The initial pre-processing phase prepares the OASIS corpus and
knowledge resources. The section on the second main IE phase highlights some details of the pipeline used for the annotation of the textual resources with conceptual and terminological references. The last section discusses the techniques employed in the final phase that constructs RDF representations of the metadata for the semantically enriched documents. The discussion begins by introducing the underlying Language Engineering architecture and Knowledge Organization System resources that contribute to the process of semantic enrichment.

2.1 Underlying Knowledge Organization System Resources

Gazetteers are sets of lists, sometimes containing the names of entities such as cities, days of the week, etc. In GATE, gazetteer listings are used to find occurrences of terms in free text, and often support named entity recognition tasks. GATE gazetteers are not flat, which means that list entries can carry attributes which in turn can be invoked by Java Annotation Pattern Engine (JAPE) rules for the construction of sophisticated patterns. For example, a gazetteer containing the names of European cities can be enhanced with an attribute denoting the country of each city listed in the gazetteer resource. A JAPE rule at a later stage can then exploit that attribute to target term matching only at UK cities, or only at UK and Greek cities.

In the semantic enrichment work described in this paper, gazetteers proved useful, being used to accommodate the wide range of Knowledge Organization System (KOS) resources made available to the STAR project by English Heritage. One type of KOS used in the semantic enrichment process was the thesaurus, with its various semantic relationships. In STAR, the various EH thesauri and glossaries were represented in SKOS, allowing unique identifiers (URIs) for a concept and possible links between concepts (skos:exactMatch, skos:closeMatch) [12].
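The attribute-bearing gazetteer described above can be illustrated outside GATE. The following is a minimal Python sketch and not GATE or JAPE code: the entry list, the attribute name country and the function find_lookups are all hypothetical, chosen only to mirror the UK and Greek cities example.

```python
import re

# Hypothetical gazetteer: each entry carries attributes, in the way GATE
# gazetteer entries can. The cities and the "country" attribute are
# illustrative only.
GAZETTEER = {
    "cardiff": {"country": "UK"},
    "york": {"country": "UK"},
    "athens": {"country": "Greece"},
    "paris": {"country": "France"},
}


def find_lookups(text, allowed_countries):
    """Return (term, start, end) for gazetteer terms whose country is allowed."""
    lookups = []
    for match in re.finditer(r"[A-Za-z]+", text):
        entry = GAZETTEER.get(match.group().lower())
        if entry and entry["country"] in allowed_countries:
            lookups.append((match.group(), match.start(), match.end()))
    return lookups


sample = "Finds were sent from York to Athens and then to Paris."
uk_only = find_lookups(sample, {"UK"})
uk_and_greek = find_lookups(sample, {"UK", "Greece"})
```

The same term list serves both queries; only the attribute filter changes, which is the role the gazetteer attributes play for the JAPE rules.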
Four English Heritage thesauri and five glossary resources are incorporated in the process of semantic enrichment for identifying occurrences of various conceptual entities. The thesauri are the Archaeological Objects, Monument Types, Building Materials and the experimental Time-line Thesaurus [11]. The glossary resources are the Simple Names for Deposits and Cuts, Find Type Index, Material Index, Small Finds and the Bulk Find Material glossary. All the KOS resources had been previously expressed in SKOS format for the purposes of the STAR project [3, 4]. The glossary resources contain a small set of concepts which are highly relevant to the domain of archaeology. On the other hand, the thesauri resources contain a large set of concepts which relate to the general cultural and heritage domain. Initial entity recognition experiments exploited only the available glossary and Time-line Thesaurus vocabulary. This limited the semantic enrichment to a small set of highly relevant concepts in the context of archaeology. Therefore, new GATE techniques needed to be developed in order to allow the possibility of selectively exploiting the wider context of the EH thesauri. This makes possible richer terminology availability for the semantic enrichment process.

Exploiting the whole range of thesaurus concepts would expand the process of enrichment to concepts that are not very relevant to the archaeology domain. Therefore an optimum range is needed. A solution is to use concepts that come from those areas of thesaurus structures potentially useful to the enrichment process. Overlapping concepts between the (highly relevant) glossaries and thesauri can serve as appropriate entry points to the thesaurus structures. Thesaurus semantic relationships can then be exploited for expanding from the entry points across thesauri areas relevant to the task of archaeological semantic enrichment. In order to enable this semantic expansion within GATE, JAPE rules must be able to exploit thesaurus narrower and broader semantic relationships. Therefore, transformation of SKOS thesauri to GATE gazetteers allows the translation of thesauri properties to gazetteer attributes, enabling JAPE rules to exploit semantic relationships between gazetteer terms, as described in the following section.

2.2 Transforming KOS to GATE Gazetteers

The transformation of SKOS thesauri and glossary resources to GATE gazetteers is achieved via the use of XSLT templates. The templates exploit SKOS properties for adding attributes to gazetteer terms, which can be used by JAPE rules. It is important that the rules are capable of traversing a thesaurus hierarchy, in order to produce matches that achieve a semantic expansion beyond the limits of a single semantic level. Since JAPE is essentially a pattern matching rule engine, some work was required to enable the semantic expansion of SKOS concepts (with their unique identifiers) within GATE at the lookup stage of the information extraction pipeline. Consider the following case: Container (by function) > Food and Drink Serving Container > Drink Serving Container > Jug > Knight Jug.
A JAPE rule that used only the narrower concept property would be able to semantically expand only on immediately narrower concepts. Therefore, a gazetteer attribute was built during the transformation from SKOS to GATE gazetteer to reflect the path of unique SKOS identifiers from the concept to the top of its hierarchy. For example:

    KnightJug@skosConcept=149773@path=/101601/101204/ /
    Jug@skosConcept=101601@path=/101204/101340/
    Drink Serving Container@skosConcept=101204@path=/101340/
    Food and Drink Serving Container@skosConcept=101340@path=/
    Container@skosConcept=101023@path=/

A simple JAPE rule can then exploit the above gazetteer attributes by matching all terms that contain a skosconcept and a path attribute of a particular reference; for example matches all concepts in the gazetteer within the container hierarchy. The XSLT transformation also takes into account SKOS alternative concept labels (thesaurus non-preferred terms) and makes them available as
gazetteer entries that have the same skosconcept and path attributes as their preferred label counterparts. In addition, during the transformation, particular glossary concepts are given an extra attribute (skos:exactmatch) to accommodate the previously defined mapping between a glossary and a thesaurus. For example, the concept Hearth of the glossary Simple Names for Deposits and Cuts is mapped to the concept Hearth of the thesaurus Monument Types class Archaeological Feature:

    hearth@skosconcept=ehg003.37@skosexactmatch=70374

This allows the potential for JAPE rules to optionally expand the glossary concept Hearth to associated concepts within the Monuments thesaurus, depending on the overall context.

2.3 Main Knowledge-Based Information Extraction Phase

The main IE process annotates grey literature documents with conceptual and terminological references. This process is carried out by the OPTIMA pipeline (figure 1) developed for the project (Object, Place, TIme and Material). These four concepts were considered key metadata elements for the purposes of the project's concern with archaeological excavations and are the focus of the pipeline. The IE process uses a large number of JAPE patterns and utilises the gazetteer resources created by the pre-processing phase. We term this particular approach Knowledge Based Information Extraction, since the combination of core ontology and knowledge resources plays a major role in the process of semantic enrichment. This drives the IE task by using JAPE patterns which exploit and produce semantic relationships. This section gives an overview of the main functionality of the pipeline.

Fig. 1. The Information Extraction pipeline developed in GATE. Bespoke JAPE rules for this project are shown in grey boxes.

The first stage of the main IE pipeline invokes the gazetteer resources and generates the initial lookup annotations. The various knowledge resources that participate in the pipeline are capable of identifying lookups that fall within the four concept categories mentioned above. For example, the Archaeological Objects thesaurus contains concepts of physical objects, the Building Materials thesaurus contains concepts of materials, the Timeline thesaurus contains time appellations, such as periods, while glossaries are a source of more specific archaeological terminology, such as the Simple Names for Deposits and Cuts, which covers archaeological contexts. There are cases where a term can be found within two different knowledge resources, thus potentially having two different conceptual references (i.e. multiple senses). For example, the term brick can refer to a material or to a physical object, as can terms such as glass, stone, iron, gold, etc. in the particular domain practice of archaeology. Therefore the first stage of the pipeline generates two types of lookup annotation, single sense and multiple sense. Multiple sense lookup annotations are disambiguated at later stages.

The second stage of the pipeline validates the lookup annotations and aligns annotations to CRM entities. Annotations that are not part of noun phrases and annotations that are part of headings, table of contents and single worded phrases are suppressed for purposes of the current project. It is important to validate lookup annotations, especially those that are not part of noun phrases, because gazetteer matches are invoked via a morphological analyser and matches are created on the root of words. This technique allows matches within a broader orthographical context, including singular and plural forms of matches, but also generates matches for verb senses that have to be suppressed by the validation stage. In addition, during this stage the pipeline performs negation detection over lookup annotations. The next stage of the pipeline disambiguates multiple sense lookup annotations.
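The distinction drawn above between single sense and multiple sense lookups can be sketched in a few lines of Python. This is only an illustration of the idea, not the OPTIMA implementation: the miniature resources, the sense labels and the function names are hypothetical, and the sketch matches exact lower-cased tokens rather than morphological roots.

```python
from collections import defaultdict

# Hypothetical miniature knowledge resources: resource name -> (sense label,
# set of terms). The real resources are the EH thesauri and glossaries.
RESOURCES = {
    "Archaeological Objects": ("Physical Object", {"brick", "coin", "jug"}),
    "Building Materials": ("Material", {"brick", "stone", "glass"}),
    "Timeline": ("Time Appellation", {"roman", "medieval"}),
}


def build_sense_index(resources):
    """Map each term to the set of senses under which it is listed."""
    index = defaultdict(set)
    for sense, terms in resources.values():
        for term in terms:
            index[term].add(sense)
    return index


def lookup(tokens, index):
    """Annotate tokens as single sense or multiple sense lookups."""
    annotations = []
    for position, token in enumerate(tokens):
        senses = index.get(token.lower())
        if senses:
            kind = "single" if len(senses) == 1 else "multiple"
            annotations.append((position, token, kind, sorted(senses)))
    return annotations


index = build_sense_index(RESOURCES)
result = lookup("Roman brick and a coin".split(), index)
```

Here brick is listed in two resources and is therefore flagged as a multiple sense lookup, to be resolved by a later disambiguation step.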
The disambiguation technique is based on JAPE patterns that examine word pairs and Part-of-Speech input. Lookup annotations that cannot be disambiguated maintain all their possible senses, leaving the judgement to the end user.

The pipeline is capable of exploiting the semantic relationships of the knowledge resources that are accommodated as gazetteer resources. The pipeline can be configured in one of three different modes of semantic expansion developed for purposes of the research within GATE: Synonym, Hyponym, and Hypernym expansion. Synonym expansion utilises the glossary resources and expands on the synonyms of glossary terms available in the thesauri resources. Hyponym expansion is similar to Synonym expansion in utilising the glossary terms and their synonyms but also traverses the hierarchy of the thesaurus structures to include (transitively) all narrower concepts available for the glossary. The Hypernym expansion mode includes the above two modes of expansion and also exploits the broader concept relationships within the thesauri structures. In terms of volume, the last mode of expansion will expand the semantic enrichment task to include the largest set of concepts, while the first mode of expansion will include the smallest and the most precise set of concepts.

For example, for the glossary concept enclosure, Synonym expansion will include the concept garth, in addition to enclosure. Hyponym expansion will also include the concepts curvilinear enclosure, ditched enclosure, rectilinear enclosure, etc. and their narrower concepts, such as oval enclosure, double ditched enclosure, polygonal enclosure, etc. Hypernym expansion will also include all Monument by Form concepts, such as arch, boundary, barrier, ditch and their narrower concepts, such as fence, hedge boundary ditch, etc. Clearly, there are recall/precision trade-offs associated with the different expansion modes and these are a topic for investigation in the forthcoming evaluation work.

2.4 Transformation of Semantic Annotations to RDF Triples

The last phase of the semantic enrichment is the transformation of the semantic annotations produced in the previous phase to RDF triples. For this purpose, CRM-EH semantic annotations are exported initially from the GATE environment as XML documents. The GATE exporter produces annotations in the form of XML tags, which are coupled with the associated content. The exported XML tags carry properties produced during the IE process, such as skos:concept and skos:exactmatch unique terminological identifiers, gateid unique annotation identifiers and a note for capturing the context surrounding a semantic annotation. In addition, each grey literature document has a unique name that constitutes a unique identification of the file within the OASIS corpus. This unique file name is used in conjunction with the unique gateid property of each annotation to create a corpus wide unique identifier for each individual annotation.

In addition, the SKOS concepts assigned to the annotations are associated with underlying CRM Types, using the relationship (is_represented_by) modeling the association asserted for data items (mapped to CRM) and SKOS concepts [18]. For example, a semantic annotation of the CRM-EH class Context can be associated with the SKOS concept pit. This supports cross search between data and grey literature in terms of CRM and SKOS.

The transformation from XML files to RDF triples required the development of bespoke techniques based on the XML Document Object Model, using the scripting language PHP for building the transformation templates.
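The construction of a corpus wide identifier from a file name and a gateid, described above, might look as follows. This is a sketch in Python rather than the project's PHP: the namespace URI, the element names, the sample fragment, the "#" separator and the function annotation_ids are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace URI for the gate: prefix; the real one is not given
# in the text.
GATE_NS = "http://example.org/gate#"

# A small coupled-XML fragment in the style of the exported annotations.
# Element names and gateid values are placeholders.
SAMPLE = (
    '<doc xmlns:gate="http://example.org/gate#">'
    '<Context gate:gateid="11">pit</Context> contained '
    '<ContextFind gate:gateid="12">pottery</ContextFind>'
    "</doc>"
)


def annotation_ids(xml_text, file_name):
    """Pair each annotated element with a corpus wide identifier.

    The identifier joins the document's unique file name with the
    annotation's gateid; the "#" separator is an assumption.
    """
    root = ET.fromstring(xml_text)
    ids = []
    for element in root.iter():
        gate_id = element.get("{%s}gateid" % GATE_NS)
        if gate_id is not None:
            ids.append((element.tag, "%s#%s" % (file_name, gate_id)))
    return ids


ids = annotation_ids(SAMPLE, "report-0001")
```

Because gateid values are only unique within one document, prefixing them with the file name is what keeps identifiers distinct once annotations from many reports are merged.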
The decision to use a server-side scripting language like PHP was driven by two main requirements: a MySQL database for retrieving the unique file name of each document, and a visual interface for parameterisation of the transformations, allowing easy selection of the semantic annotations that participate in the transformation. PHP allowed rapid development and proved robust with a large set of documents. The final RDF documents are decoupled from the content. However, as explained above, each annotation resource is tied to a corpus wide unique identifier.

The following example presents the details of the RDF transformation. Consider the following rich phrase: "pits were uniformly filled with large quantities of pottery". The phrase can be modelled by a CRM-EH ContextFindDepositionEvent, connecting the find, pottery, with the context, pit. The coupled XML output would have the following structure (the note property is omitted and URIs truncated for simplicity):

    <EHE1004.ContextFindDepositionEvent gate:gateid="281105">
      <EHE0007.Context gate:gateid="281155" skos:concept="#ehg003.55">pits</EHE0007.Context>
      were uniformly filled with large quantities of
      <EHE0009.ContextFind gate:gateid="281158" skos="#ehg027.2">pottery</EHE0009.ContextFind>
    </EHE1004.ContextFindDepositionEvent>

The RDF transformation would have the following structure:

    <crmeh:ehe1004.contextfinddepositionevent rdf:about="…">
      <dc:source rdf:resource="…" />
      <dc:source rdf:resource="…" />
      <crm:p2f.has_type rdf:resource="…" />
      <crm:p3f.has_note>
        <crm:e62.string>
          <rdf:value>pits were uniformly filled with large quantities of pottery</rdf:value>
        </crm:e62.string>
      </crm:p3f.has_note>
      <crm:p26f.moved_to rdf:resource="…" />
      <crm:p25f.moved rdf:resource=" base#suff " />
    </crmeh:ehe1004.contextfinddepositionevent>

3 Example Uses of Rich Metadata

The automatically produced metadata are utilised by two web applications, the STAR research demonstrator and the Andronikos portal [17, 1]. The STAR demonstrator uses the decoupled RDF files to support cross searching between grey literature documents and disparate datasets [3], in terms of the core CRM-EH conceptual model. A SPARQL engine supports the semantic search capabilities of the demonstrator, while an interactive interface hides the underlying model complexity and offers search (and browsing) for Samples, Finds, Contexts or interpretive Groups with their properties and relationships. On the other hand, the Andronikos portal uses the coupled XML files for constructing and delivering the semantic annotations in an easy to follow, human readable format. While the portal was developed for project purposes to assist visual inspection of the information extraction outcomes, it is seen as indicative of potential digital library applications where access to the semantically enriched text is desired.

The metadata take account of lexical ambiguities such as polysemy (same word having multiple meanings). For example, find all archaeological Contexts of type cut, where the term cut is ambiguous.
The semantic enrichment mechanism manages to disambiguate the verb from the noun form and to reveal phrases which make use of cut as relevant to archaeological use, e.g. "levelling layers sealed the base of a brick wall cut into layer", or "It measured 0.3m in diameter and 0.2m deep with a circular cut", and avoids the annotation of non-archaeological uses of cut, such as "although the current 'cut-off' channel is now 500m".

In addition, the metadata can take account of a form of polysemous ambiguity. For example, the word brick in an archaeology usage can refer to a material or to a physical object. In the phrase "yellowish-brown sandy deposit containing frequent unbonded brick", the term refers to a physical object, whereas in the phrase "A layer of small brick tiles forming the street paving", the term refers to a material.

The STAR Demonstrator makes use of the rich metadata for semantic search, building on CRM and SKOS unique identifiers. For example, searches are possible of the form: Context of type X containing Find of type Y. The two different extracts of screendumps in Figure 2 show a Context of type hearth containing Context Find of type coin, together with a Context Find of type Animal Remains within a Context of type pit.

Fig. 2 The STAR demonstrator search of semantic metadata

Fig. 3 Search Results from the STAR demonstrator

The cross-search capability of the STAR demonstrator retrieves results from both datasets and grey literature reports (a variety of datasets for the hearth query). As seen above, the searches returned results for different annotation types (Contexts, Finds) and from different resources (grey literature resources commence with a hash bar in this demonstrator interface). For example (Figure 3), the first search retrieves from Grey Literature #archaeol a Hearth containing a coin; the original text was "It differs from the other coin finds, however, in that it was associated with a hearth". Similarly the second search retrieves from Grey Literature an animal bone within a context of type pit; the original text was "the test pit produced a range of artefactual material which included animal bone (medium/large ungulate)". The semantic enrichment makes it possible for the STAR demonstrator to overcome lexical boundaries and retrieve synonymous terms, as evident in the example of Animal Remains where the term Animal bone is retrieved.

The Andronikos web portal uses XML outputs of rich metadata for generating and linking HTML pages, which accommodate semantic annotations of grey literature documents. The annotations are divided into three abstractions: (i) pre-processing annotations, such as Headings, Table of Contents and Summary; (ii) single CRM annotations such as Physical Object, Place, Time and Material; (iii) CRM-EH archaeology specific annotations of rich phrases. Therefore, it is possible to optionally expose particular document abstractions according to different application strategies. Thus, in certain cases, the Summary sections (an example is given in Figure 4) might be targeted (or prioritised) for retrieval as being strongly representative of a grey literature report. Alternatively, the most frequently appearing CRM entities (see Figure 5) in a report might be considered a useful entry strategy. Yet again, a cross search might be interested in any occurrences within grey literature reports of highly specific, rich CRM annotations. Andronikos also makes available links to the XML and RDF versions of grey literature documents, which can be downloaded and further transformed or manipulated.

Fig. 4 Summary Section of Grey literature, Andronikos web-portal

Fig. 5 Frequent CRM entities (Time Appellation left side, Physical Object right side) in a grey literature report, Andronikos web-portal

The summary section of the report (Figure 4) discusses evidence that relates to the Roman period, while the CRM overview (Figure 5) shows the most frequently used SKOS concepts in the report for the CRM entities, Time Appellation and Physical Object, in this case Roman and pottery. Both CRM and SKOS annotations are produced by the IE pipeline.
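The frequency overview of SKOS concepts per CRM entity described above can be approximated with a simple count. The sketch below is in Python and is not the Andronikos implementation; the annotation list and the function rank_concepts are hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical annotations as (CRM entity, SKOS concept label) pairs.
ANNOTATIONS = [
    ("Time Appellation", "Roman"),
    ("Physical Object", "pottery"),
    ("Time Appellation", "Roman"),
    ("Physical Object", "coin"),
    ("Physical Object", "pottery"),
    ("Time Appellation", "Medieval"),
    ("Physical Object", "pottery"),
]


def rank_concepts(annotations):
    """Return, per CRM entity, its concepts ordered by descending frequency."""
    counts = defaultdict(Counter)
    for entity, concept in annotations:
        counts[entity][concept] += 1
    return {entity: counter.most_common() for entity, counter in counts.items()}


ranking = rank_concepts(ANNOTATIONS)
```

The first item of each ranked list corresponds to the kind of headline concept shown in such an overview, here Roman for Time Appellation and pottery for Physical Object.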

4 Evaluation

This section discusses the results of a preliminary evaluation exercise which aims to inform an ongoing larger scale gold standard evaluation. The main objective of the preliminary evaluation is to examine the effectiveness of the manual annotation instructions for supporting the annotation task, as well as to explore the capacity of GATE modules for supporting the evaluation and the inter-annotator agreement (IAA) analysis tasks. IAA analysis is performed in order to reveal differences between annotators and to refine the instructions of manual annotation before proceeding to the main evaluation phase.

The evaluation used a small corpus of 10 summary extracts. Considering the logistic constraints of the manual annotation task, summary extracts originating from archaeological excavation and evaluation reports were considered to be the best available option. They are potentially useful targets of semantic enrichment work, as discussed in the previous section. The manual annotation instructions were written to reflect the end-user aims of the evaluation, hiding complex and unnecessary ontological details. Annotators were instructed to annotate at the level of archaeological concepts rather than identifying more abstract ontological entities. The instructions directed the task of manual annotation at the concepts of archaeological place, archaeological finds, the material of archaeological finds and time appellations.

Three annotators from the target archaeology domain were employed for the pilot evaluation task: a senior archaeologist (SA), a commercial archaeologist (CA) and a research student (RS) of archaeology. The evaluation corpus of the ten summary extracts (5 excavation and 5 evaluation reports), containing in total 2898 words, was made available in MS Word format. The annotators were instructed to use particular highlight colours and underline tools in order to produce their annotations.
Although the annotation task could have used the GATE Ontology Annotation Tool (OAT), Microsoft Word was preferred, since the manual annotators had no previous experience of working with GATE but were fluent in Word. The resulting manual annotation sets were systematically transferred to GATE as CRM and CRM-EH oriented annotations using the OAT. Upon completion of the annotation transfer, the differences between the individual manual annotation sets were analysed using the IAA module of GATE. The module was configured to report the inter-annotator agreement score in terms of Precision, Recall and F-Measure metrics. The metrics were reported in both Average and Lenient modes. The Average mode treats partial matches as half matches, whereas the Lenient mode treats partial matches as full matches. An example of a partial match might be flints instead of worked flints for the phrase worked flints found in several pits.

The overall IAA F-Measure score for the three annotation sets was 58% in the Average mode of reporting, increasing to 68% in the Lenient mode. In the Lenient mode, individual differences in annotation borders are not counted, since partial matches are considered full matches. The increase of 10 percentage points from Average to Lenient mode is evidence of disagreement between annotators on the scope of annotation boundaries. The following tables present the IAA scores for the three different annotation sets in terms of overall agreement including all entity types and also agreement on some individual entities.

Table 1 IAA scores for the three different annotation sets, reported on Average and Lenient mode. SA: Senior Archaeologist, CA: Commercial Archaeologist, RS: Research Student.
Columns: Precision (Av., Le.), Recall (Av., Le.), F-Measure (Av., Le.)
Rows: SA-CA-RS, SA-CA, SA-RS, CA-RS

Table 2 IAA scores of individual entities reported on Average and Lenient mode. The scores are reported on the IAA score of the three Annotation Sets SA-CA-RS.
Columns: Precision (Av., Le.), Recall (Av., Le.), F-Measure (Av., Le.)
Rows: E19.Physical_Object, E49.TimeAppellation, E53.Place, E57.Material

The data of Table 1 reveal fairly low IAA scores, especially when results are reported in the Average mode, which is not untypical in some application domains. The Lenient mode provides improved scores as expected, since disagreement on the annotation boundaries is not taken into account. The best performing F-Measure is between the Senior Archaeologist and the Research Student, scoring 71% in the Average mode and 86% in the Lenient mode. On the other hand, the IAA scores involving the commercial archaeologist (CA) are lower, due largely to different interpretations of occurrences of the Time Appellation and Place entities. For example, the CA tended to annotate the term phase (as in phase 1, phase 2, etc.) as Time Appellation, whereas the other annotators considered it outside the intended scope. The CA also annotated the term trench as Place. Trenches are usually produced during excavation and so might not be considered places associated with archaeological events in the past. This distinction was not made clear in the instructions.
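The Average and Lenient scoring described above can be illustrated with a small sketch. It follows the description in the text (a partial match counts as half a match in Average mode and as a full match in Lenient mode) and is not GATE's own implementation. The `Span` type, the helper names and the overlap rule used to detect partial matches are assumptions made for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    """A typed annotation covering characters [start, end) of a text."""
    start: int
    end: int
    label: str


def span_of(text, phrase, label):
    """Hypothetical helper: annotate the first occurrence of phrase."""
    start = text.index(phrase)
    return Span(start, start + len(phrase), label)


def compare(key, response):
    """Return counts of (exact, partial, missing, spurious) annotations.

    Assumption: a partial match has the same label and overlapping,
    but not identical, offsets.
    """
    unmatched = list(response)
    exact = partial = missing = 0
    remaining_key = []
    for k in key:  # first pass: identical spans
        if k in unmatched:
            unmatched.remove(k)
            exact += 1
        else:
            remaining_key.append(k)
    for k in remaining_key:  # second pass: overlapping spans
        hit = next((r for r in unmatched
                    if r.label == k.label
                    and r.start < k.end and k.start < r.end), None)
        if hit is None:
            missing += 1
        else:
            unmatched.remove(hit)
            partial += 1
    return exact, partial, missing, len(unmatched)


def scores(exact, partial, missing, spurious, partial_weight):
    """Precision, Recall and F-Measure with weighted partial matches.

    partial_weight = 0.5 mirrors the Average mode described in the text,
    partial_weight = 1.0 mirrors the Lenient mode.
    """
    matched = exact + partial_weight * partial
    n_response = exact + partial + spurious
    n_key = exact + partial + missing
    precision = matched / n_response if n_response else 0.0
    recall = matched / n_key if n_key else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


text = "worked flints found in several pits"
key = [span_of(text, "worked flints", "E19.Physical_Object"),
       span_of(text, "pits", "E53.Place")]
response = [span_of(text, "flints", "E19.Physical_Object"),
            span_of(text, "pits", "E53.Place")]

counts = compare(key, response)
print("average:", scores(*counts, partial_weight=0.5))
print("lenient:", scores(*counts, partial_weight=1.0))
```

In this toy case one annotator marks flints where the other marks worked flints, so the pair is a partial match: all three metrics are 0.75 in Average mode and 1.0 in Lenient mode. A gap between the two modes therefore points to boundary disagreement rather than disagreement about which entities are present.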

Examining Table 2 reveals that the agreement on the E57.Material entity is significantly lower in both Average and Lenient modes of reporting. This suggests that material poses particular problems of ambiguity or recognition. There is also a 20% difference between Average and Lenient mode on the F-Measure score of the E19.Physical_Object entity, indicative of the different annotation boundaries used by annotators.

The performance of the pipeline was also evaluated in terms of Precision and Recall against the three individual annotation sets from the exercise, using the GATE corpus benchmark utility. The overall scores are shown in Table 3. Although these are only preliminary results, they tend to support the potential use of thesaurus expansion in IE to improve vocabulary matching.

Table 3 Precision, Recall and F-Measure scores
Columns: Synonym, Hyponym, Hypernym
Rows: SA (Precision, Recall, F-Measure), CA (Precision, Recall, F-Measure), RS (Precision, Recall, F-Measure)

Taking the use of Synonyms (a concept will have various synonym terms) as a base point, we see that both Precision and Recall are generally slightly improved by Hyponym expansion, which adds the narrower concepts in the thesaurus (with their terms) to the starting concept. By a small margin, however, the best Precision score (75%) is delivered by the Synonym mode against the input of the senior archaeologist. On the other hand, Hypernym expansion, which brings in broader concepts as well as narrower ones, tends to decrease Precision but improve Recall by a greater amount. The best Recall score (79%) is delivered by the Hyponym mode against the senior archaeologist input. The results show a slight improvement in F-Measure rates over all annotators for both Hyponym and Hypernym expansion modes. In addition, the results show that the system produces better F-Measure scores in the Hypernym expansion mode, due to the improvement in Recall rates. However, these are very preliminary results.
Subsequent, larger scale evaluation will examine the use of thesaurus expansion further and will consider whether the F-Measure scores continue to show a similar pattern.

The preliminary evaluation also revealed a range of terms which were selected by annotators but were not matched by the IE pipeline. Example terms include Bowl, Castle, Cemetery, Flake, Fort, Garden, Hollow, House, Inhumation, Knife, Nail, Plough, Pond, Playground, Ridge and Furrow, Ring, Settlement, Shed, Tool, Weapon. In subsequent discussion with archaeological domain experts, it was judged that many of these terms were valid. Some of these terms were in fact present in the thesaurus resources but had no appropriate bridging entry point from the archaeology glossaries. Some way of widening the terminology resources to take account of these terms would increase the eventual recall of the IE process.
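The three expansion modes compared above can be sketched over a toy thesaurus. This is an illustrative sketch only: the concepts, labels and function names below are hypothetical and are not taken from the English Heritage thesauri or from the STAR pipeline. It assumes that Hyponym mode adds all transitively narrower concepts, that Hypernym mode additionally adds the directly broader concepts, and it uses simple case-insensitive whole-word matching in place of the GATE gazetteer and rules.

```python
import re

# Hypothetical SKOS-like thesaurus: concept -> labels, broader, narrower.
THESAURUS = {
    "feature": {"labels": ["feature"],
                "broader": [], "narrower": ["pit", "ditch"]},
    "pit": {"labels": ["pit", "pits"],
            "broader": ["feature"], "narrower": ["storage pit"]},
    "storage pit": {"labels": ["storage pit", "grain pit"],
                    "broader": ["pit"], "narrower": []},
    "ditch": {"labels": ["ditch"],
              "broader": ["feature"], "narrower": []},
}

MODES = ("synonym", "hyponym", "hypernym")


def narrower_closure(concept):
    """Return every concept reachable through 'narrower' links."""
    seen = set()
    stack = list(THESAURUS[concept]["narrower"])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(THESAURUS[current]["narrower"])
    return seen


def expand(concept, mode):
    """Return the set of terms used for matching in the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown expansion mode: {mode}")
    concepts = {concept}
    if mode in ("hyponym", "hypernym"):
        concepts |= narrower_closure(concept)
    if mode == "hypernym":
        # Assumption: only the directly broader concepts are added.
        concepts |= set(THESAURUS[concept]["broader"])
    return {label for c in concepts for label in THESAURUS[c]["labels"]}


def match(text, terms):
    """Return the terms that occur in text as whole words."""
    return {term for term in terms
            if re.search(r"\b" + re.escape(term) + r"\b", text,
                         flags=re.IGNORECASE)}


sample = "A grain pit cut an earlier feature."
for mode in MODES:
    print(mode, sorted(match(sample, expand("pit", mode))))
```

On the sample sentence, Synonym mode finds only pit, Hyponym mode also finds grain pit, and Hypernym mode additionally finds feature. Each wider mode can raise recall, while the broader terms are more likely to produce matches an annotator would not accept, which is the trade-off between recall and precision discussed in the text.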

Based on the results of the pilot evaluation, the manual annotation instructions should be refined to improve the manual annotation task. For example, the instructions for the annotation of material entities should make clear that the annotation task should be focused on materials that have an archaeological interest and are associated with physical objects, rather than materials in general. Similarly, the instructions for the annotation of place entities should be clarified, particularly for rich phrases. The instructions should highlight that entity annotation spans can contain both conjunct and adjectival moderators when applicable. In addition, annotation examples should be included for each annotation type targeted by the task in order to ease comprehension. These should attempt to clarify the cases of phrasal annotation and how boundaries should be treated. It was also seen that annotators tended individually to miss different valid annotations in the rather laborious task. The experience suggests the need for a final resolution phase that resolves differences between annotators before reaching a gold standard. An ongoing larger scale gold standard evaluation is informed by the results of this evaluation exercise.

5 Conclusions

The discussion has revealed the viability of automatic generation of rich metadata for enabling semantic search of grey literature connected with archaeological datasets. The methods of Information Extraction, driven by the core ontology CIDOC CRM and its extension CRM-EH, in combination with SKOS resources, were central to the process of automatic metadata generation. A large scale evaluation exercise is in progress to evaluate the information extraction performance in general, and its handling of lexical ambiguities and the accuracy of rich phrase annotation in particular.
Specific contributions of the work reported here include the potential of combining knowledge based resources (including ontologies and thesauri) in information extraction, techniques for using SKOS and CRM resources within GATE based information extraction, and techniques for automatic rich metadata generation and expression as coupled XML and also as RDF triples for different consuming applications, including semantic cross search over excavation datasets and archaeological grey literature reports. In addition, the current evaluation results demonstrate the ability of the system to perform in different modes of semantic expansion that exploit vocabulary resources, which may allow the system to be tailored to favour recall or precision.

In general, the current study demonstrates the capability of CRM based methods to drive automatic generation of rich metadata in domain specific digital libraries. Such metadata can be expressed in interoperable formats such as XML and RDF graphs, which can be exploited by digital library systems to enable cross-search functionality between disparate resources. Work is underway investigating generalisation of the methods to related areas in the cultural heritage domain.

Acknowledgements. The STAR project was supported by the Arts and Humanities Research Council [grant number AH/D001528/1]. Thanks are due to the Archaeology Data Service for provision of the OASIS corpus and to Phil Carlisle (English Heritage) for providing domain thesauri.



More information

Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy)

Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Building the Multilingual Web of Data: A Hands-on tutorial (ISWC 2014, Riva del Garda - Italy) Multilingual Word Sense Disambiguation and Entity Linking on the Web based on BabelNet Roberto Navigli, Tiziano

More information

Integrating data from The Perseus Project and Arachne using the CIDOC CRM An Examination from a Software Developer s Perspective

Integrating data from The Perseus Project and Arachne using the CIDOC CRM An Examination from a Software Developer s Perspective Integrating data from The Perseus Project and Arachne using the CIDOC CRM An Examination from a Software Developer s Perspective Robert Kummer, Perseus Project at Tufts University and Research Archive

More information

Selecting a Taxonomy Management Tool. Wendi Pohs InfoClear Consulting #SLATaxo

Selecting a Taxonomy Management Tool. Wendi Pohs InfoClear Consulting #SLATaxo Selecting a Taxonomy Management Tool Wendi Pohs InfoClear Consulting #SLATaxo InfoClear Consulting What do we do? Content Analytics Strategy and Implementation, including: Taxonomy/Ontology development

More information

Interactive Dynamic Information Extraction

Interactive Dynamic Information Extraction Interactive Dynamic Information Extraction Kathrin Eichler, Holmer Hemsen, Markus Löckelt, Günter Neumann, and Norbert Reithinger Deutsches Forschungszentrum für Künstliche Intelligenz - DFKI, 66123 Saarbrücken

More information
