Automatic Metadata Generation in an Archaeological Digital Library: Semantic Annotation of Grey Literature


Andreas Vlachidis, Ceri Binding, Keith May, and Douglas Tudhope

Andreas Vlachidis, Ceri Binding, Douglas Tudhope
Hypermedia Research Unit, University of Glamorgan, UK
avlachid@glam.ac.uk, cbinding@glam.ac.uk, dstudhope@glam.ac.uk

Keith May
English Heritage, UK
keith.may@english-heritage.org.uk

A. Przepiórkowski et al. (Eds.): Computational Linguistics, SCI 458, pp , DOI: / _10 Springer-Verlag Berlin Heidelberg 2013

Abstract. This paper discusses the automatic generation of rich metadata from excavation reports from the Archaeological Data Service library of grey literature (OASIS). The work is part of the STAR project, in collaboration with English Heritage. An extension of the CIDOC CRM ontology for the archaeological domain acts as a core ontology. Rich metadata is automatically extracted from grey literature, directed by the CRM, via a three-phase process of semantic enrichment employing the GATE toolkit augmented with bespoke rules and knowledge resources. The paper demonstrates the potential of combining knowledge-based resources (ontologies and thesauri) in information extraction, and techniques for delivering the automatically extracted metadata as XML annotations coupled with the grey literature reports and as RDF graphs decoupled from content. Examples from two consuming applications are discussed: the Andronikos web portal, which serves the annotated XML files for visual inspection, and the STAR project research demonstrator, which offers unified search across archaeological excavation data and grey literature via the core ontology CRM-EH.

Keywords: Automatic Metadata Generation, CIDOC CRM, Digital Archaeology, Digital Library, GATE, Knowledge Organization Systems, Information Extraction, Semantic Annotation, Semantic Search, SKOS.

1 Introduction

In archaeology today, we see digital libraries of grey literature reports and of excavation datasets, but they are not meaningfully connected. This paper discusses the automatic generation of metadata that makes possible semantic search of grey literature connected with diverse archaeological datasets. Such metadata carry definitions that enable information retrieval and cross searching on the semantic level via ontological and terminological models and references, and are therefore described here as rich metadata. The work forms part of the broader Semantic Technologies for Archaeological Resources (STAR) project, which employs the CIDOC CRM (International Council of Museums Conceptual Reference Model) and its archaeological extension, CRM-EH, as a core ontology providing a contextual framework between different types of information sources and disparate datasets [15, 18].

In collaboration with English Heritage (EH), the STAR project has developed methods for linking digital archive databases, vocabularies and unpublished excavation reports for purposes of semantic cross search. The CRM-EH is necessary for expressing the semantics and complexities of relationships between the data elements and annotations, which underlie semantically defined archaeological user queries. The project has also employed knowledge resources, such as EH domain glossaries and thesauri, expressed as SKOS vocabularies [12]. These knowledge resources assist semantically defined queries and NLP information extraction from excavation reports.

An extract from the OASIS (Online AccesS to the Index of archaeological investigations) grey literature library, provided by the Archaeological Data Service, forms the STAR free text corpus. The term grey literature is used to describe documents and source materials that cannot be found through the conventional means of publication; many excavation reports exist in this format. The OASIS project is a joint effort of UK archaeology research groups, institutions, and organisations, coordinated by the University of York [13].
It aims to improve the communication of fieldwork results to the wider archaeological community.

This paper reports on an investigation of automatic methods for generating rich metadata that connects concepts via the ontological arrangements of the CIDOC CRM and CRM-EH and via the terminological SKOS definition of adopted vocabulary (English Heritage thesauri). The CIDOC CRM is an international standard (ISO 21127:2006) semantic framework, aiming to promote shared understanding of cultural heritage information [8, 10]. The CRM is capable of mapping any type of cultural heritage information, as published by museums, libraries and archives [2]. Extensibility is an important aspect: the CRM can be specialised when a domain requires it, so that a finer granularity of detail can be expressed for domain purposes while still retaining interoperability at the core CRM level. A particular extension of the CRM that addresses the need for semantic interoperability in the archaeology domain is the English Heritage (EH) extension. Based on the archaeological notion of context, modelled as a subclass of CRM Place, the CRM-EH ontology describes entities and relationships relating to a series of archaeological events, both in the past and in the present (i.e. during excavation) [7].

Previously, the role of the CRM for rich semantic annotations of text documents has been explored via an intellectual, non-automatic process, which has the potential for producing very fine-grained annotation of specific, important documents, for example as part of detailed Text Encoding Initiative markup [14]. Inevitably, this process will be resource intensive over a large corpus. Thus, automation of semantic annotation is highly desirable for advancing semantic interoperability over larger corpora.

The rest of this paper discusses automatic semantic annotation and rich metadata delivery for archaeological grey literature documents with respect to standard ontological and terminological definitions. The metadata extraction consists of a three-phase process of semantic enrichment, employing the General Architecture for Text Engineering (GATE) toolkit [9]. The initial phase pre-processes the grey literature and vocabulary resources. The second phase identifies archaeological domain concepts in context. The final phase transforms GATE annotations to two different representations for consuming applications. The first representation is XML coupled with the grey literature report; this permits colour-coded inline viewing of the annotations and also semantic metadata ranked by frequency of occurrence. The second representation is RDF triples for semantic applications. In both cases, the extracted metadata are expressed as CRM entities, qualified by SKOS archaeological vocabulary concepts.

Two web applications use the resultant semantic annotations. The Andronikos web portal delivers the annotated XML files as hypertext documents for visual inspection of the information extraction results. The STAR research demonstrator offers unified searching of both data and grey literature in terms of the core ontology.
Examples of the semantically enriched grey literature from both applications are discussed in this paper, together with results from a preliminary evaluation exercise.

2 The Process of Semantic Enrichment

Information Extraction (IE) is a particular NLP technique relevant to the semantic enrichment of free text documents by extracting specific information snippets suitable for further manipulation [16, 6]. These semantic annotations in context enrich documents, enabling access on the basis of a conceptual structure and providing smooth traversal between unstructured text and conceptual models [5]. In addition, they can aid the integration of heterogeneous data sources by exploiting a conceptual structure and allowing users to search across resources for entities and relations, instead of words. Users can search for the term 'Paris' and a semantic annotation mechanism can relate the term to the abstract concept of 'city', while providing a link to the term 'France', which relates to the abstract concept 'country'. Employing a different conceptual model, the same term 'Paris' can be related to the concept of 'mythical hero' linked with the city of Troy from Homer's epic poem, the Iliad.

In this paper the discussion of the process of Semantic Enrichment is divided into the three broad phases of the IE pipeline, each subdivided into various subtasks (pipelines). The initial pre-processing phase prepares the OASIS corpus and knowledge resources. The section on the second main IE phase highlights some details of the pipeline used for the annotation of the textual resources with conceptual and terminological references. The last section discusses the techniques employed in the final phase, which constructs RDF representations of the metadata for the semantically enriched documents. The discussion begins by introducing the underlying Language Engineering architecture and Knowledge Organization System resources that contribute to the process of semantic enrichment.

2.1 Underlying Knowledge Organization System Resources

Gazetteers are sets of lists, typically containing the names of entities such as cities, days of the week, etc. In GATE, gazetteer listings are used to find occurrences of terms in free text, and often support named entity recognition tasks. GATE gazetteers are not flat: list entries can carry attributes, which in turn can be invoked by Java Annotation Pattern Engine (JAPE) rules for the construction of sophisticated patterns. For example, a gazetteer containing the names of European cities can be enhanced with an attribute denoting the country of each city listed in the gazetteer resource. A JAPE rule at a later stage can then exploit that attribute to restrict term matching to UK cities only, or to UK and Greek cities only.

In the semantic enrichment work described in this paper, gazetteers proved useful for accommodating the wide range of Knowledge Organization System (KOS) resources made available to the STAR project by English Heritage. One type of KOS used in the semantic enrichment process was the thesaurus, with its various semantic relationships. In STAR, the various EH thesauri and glossaries were represented in SKOS, allowing unique identifiers (URIs) for a concept and possible links between concepts (skos:exactMatch, skos:closeMatch) [12].
Four English Heritage thesauri and five glossary resources are incorporated in the process of semantic enrichment for identifying occurrences of various conceptual entities. The thesauri are the Archaeological Objects, Monument Types, Building Materials and the experimental Time-line Thesaurus [11]. The glossary resources are the Simple Names for Deposits and Cuts, Find Type Index, Material Index, Small Finds and the Bulk Find Material glossary. All the KOS resources had been previously expressed in SKOS format for the purposes of the STAR project [3, 4]. The glossary resources contain a small set of concepts which are highly relevant to the domain of archaeology. By contrast, the thesauri contain a large set of concepts which relate to the general cultural heritage domain.

Initial entity recognition experiments exploited only the available glossary and Time-line Thesaurus vocabulary. This limited the semantic enrichment to a small set of highly relevant concepts in the context of archaeology. New GATE techniques therefore needed to be developed in order to exploit selectively the wider context of the EH thesauri and so make a richer terminology available to the semantic enrichment process.

Exploiting the whole range of thesaurus concepts would expand the process of enrichment to concepts that are not very relevant to the archaeology domain. Therefore an optimum range is needed. A solution is to use concepts that come from those areas of the thesaurus structures that are potentially useful to the enrichment process. Overlapping concepts between the (highly relevant) glossaries and the thesauri can serve as appropriate entry points to the thesaurus structures. Thesaurus semantic relationships can then be exploited for expanding from the entry points across thesaurus areas relevant to the task of archaeological semantic enrichment.

In order to enable this semantic expansion within GATE, JAPE rules must be able to exploit thesaurus narrower and broader semantic relationships. Transforming SKOS thesauri to GATE gazetteers allows thesaurus properties to be translated to gazetteer attributes, enabling JAPE rules to exploit semantic relationships between gazetteer terms, as described in the following section.

2.2 Transforming KOS to GATE Gazetteers

The transformation of SKOS thesauri and glossary resources to GATE gazetteers is achieved via the use of XSLT templates. The templates exploit SKOS properties for adding attributes to gazetteer terms, which can be used by JAPE rules. It is important that the rules are capable of traversing a thesaurus hierarchy, in order to produce matches that achieve a semantic expansion beyond the limits of a single semantic level. Since JAPE is essentially a pattern matching rule engine, some work was required to enable the semantic expansion of SKOS concepts (with their unique identifiers) within GATE at the lookup stage of the information extraction pipeline. Consider the following case: Container (by function) > Food and Drink Serving Container > Drink Serving Container > Jug > Knight Jug.
A JAPE rule that used only the narrower concept property would be able to semantically expand only on immediately narrower concepts. Therefore, a gazetteer attribute was built during the transformation from SKOS to GATE gazetteer to reflect the path of unique SKOS identifiers from the concept to the top of its hierarchy. For example:

    KnightJug@skosConcept=149773@path=/101601/101204/ /
    Jug@skosConcept=101601@path=/101204/101340/
    Drink Serving Container@skosConcept=101204@path=/101340/
    Food and Drink Serving Container@skosConcept=101340@path=/
    Container@skosConcept=101023@path=/

A simple JAPE rule can then exploit the above gazetteer attributes by matching all terms that contain a skosconcept and a path attribute of a particular reference; for example, a rule keyed on the Container identifier matches all concepts in the gazetteer within the container hierarchy. The XSLT transformation also takes into account SKOS alternative concept labels (thesaurus non-preferred terms) and makes them available as gazetteer entries that have the same skosconcept and path attributes as their preferred label counterparts.

In addition, during the transformation, particular glossary concepts are given an extra attribute (skos:exactMatch) to accommodate the previously defined mapping between a glossary and a thesaurus. For example, the concept Hearth of the glossary Simple Names for Deposits and Cuts is mapped to the concept Hearth of the thesaurus Monument Types, class Archaeological Feature:

    hearth@skosconcept=ehg003.37@skosexactmatch=70374

This allows JAPE rules to optionally expand the glossary concept Hearth to associated concepts within the Monuments thesaurus, depending on the overall context.

2.3 Main Knowledge-Based Information Extraction Phase

The main IE process annotates grey literature documents with conceptual and terminological references. This process is carried out by the OPTIMA pipeline (Figure 1) developed for the project (Object, Place, TIme and Material). These four concepts were considered key metadata elements for the purposes of the project's concern with archaeological excavations and are the focus of the pipeline. The IE process uses a large number of JAPE patterns and utilises the gazetteer resources created by the pre-processing phase. We term this particular approach Knowledge Based Information Extraction, since the combination of core ontology and knowledge resources plays a major role in the process of semantic enrichment. This drives the IE task by using JAPE patterns which exploit and produce semantic relationships. This section gives an overview of the main functionality of the pipeline.

Fig. 1. The Information Extraction pipeline developed in GATE. Bespoke JAPE rules for this project are shown in grey boxes.

The first stage of the main IE pipeline invokes the gazetteer resources and generates the initial lookup annotations. The various knowledge resources that participate in the pipeline are capable of identifying lookups that fall within the four concept categories mentioned above. For example, the Archaeological Objects thesaurus contains concepts of physical objects, the Building Materials thesaurus contains concepts of materials, the Timeline thesaurus contains time appellations, such as periods, while glossaries are a source of more specific archaeological terminology, such as the Simple Names for Deposits and Cuts, which covers archaeological contexts. There are cases where a term can be found within two different knowledge resources, thus potentially having two different conceptual references (i.e. multiple senses). For example, in the particular domain practice of archaeology the term brick can refer to a material or to a physical object, as can terms such as glass, stone, iron, gold, etc. Therefore the first stage of the pipeline generates two types of lookup annotation, single sense and multiple sense. Multiple sense lookup annotations are disambiguated at later stages.

The second stage of the pipeline validates the lookup annotations and aligns annotations to CRM entities. Annotations that are not part of noun phrases, and annotations that are part of headings, tables of contents and single-worded phrases, are suppressed for purposes of the current project. It is important to validate lookup annotations, especially those that are not part of noun phrases, because gazetteer matches are invoked via a morphological analyser and matches are created on the root of words. This technique allows matches within a broader orthographical context, including singular and plural forms, but also generates matches for verb senses, which have to be suppressed by the validation stage. In addition, during this stage the pipeline performs negation detection over lookup annotations. The next stage of the pipeline disambiguates multiple sense lookup annotations.
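The path attribute introduced in Section 2.2, on which the lookup stage above relies, can be illustrated with a minimal Python sketch. It is not the project's XSLT or JAPE code: the entry format mirrors the listing in the text, the names `GazetteerEntry`, `parse_entry` and `within_hierarchy` are hypothetical, and the complete paths in `LINES` are an assumption derived from the stated Container hierarchy, since the listing in the text appears truncated.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GazetteerEntry:
    term: str
    concept: str  # SKOS concept identifier of the term
    path: tuple   # identifiers of all broader concepts, nearest first


def parse_entry(line: str) -> GazetteerEntry:
    """Parse 'term@skosConcept=ID@path=/A/B/' into a GazetteerEntry."""
    term, *features = line.split("@")
    attrs = dict(feature.split("=", 1) for feature in features)
    path = tuple(part for part in attrs.get("path", "").split("/") if part)
    return GazetteerEntry(term.strip(), attrs["skosConcept"], path)


def within_hierarchy(entries, concept_id):
    """Terms whose concept is concept_id or sits anywhere beneath it."""
    return [
        entry.term
        for entry in entries
        if entry.concept == concept_id or concept_id in entry.path
    ]


# Hypothetical listing: identifiers follow the text, full paths are assumed.
LINES = [
    "Knight Jug@skosConcept=149773@path=/101601/101204/101340/101023/",
    "Jug@skosConcept=101601@path=/101204/101340/101023/",
    "Drink Serving Container@skosConcept=101204@path=/101340/101023/",
    "Food and Drink Serving Container@skosConcept=101340@path=/101023/",
    "Container@skosConcept=101023@path=/",
]
ENTRIES = [parse_entry(line) for line in LINES]

if __name__ == "__main__":
    # Everything at or below Drink Serving Container (101204).
    print(within_hierarchy(ENTRIES, "101204"))
```

Because each entry stores its whole ancestor path, a single membership test reaches concepts at any depth, which is the reason the text gives for not relying on the immediate narrower property alone.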
The disambiguation technique is based on JAPE patterns that examine word pairs and Part-of-Speech input. Lookup annotations that cannot be disambiguated maintain all their possible senses, leaving the judgement to the end user.

The pipeline is capable of exploiting the semantic relationships of the knowledge resources that are accommodated as gazetteer resources. It can be configured in one of three different modes of semantic expansion developed within GATE for purposes of the research: Synonym, Hyponym, and Hypernym expansion. Synonym expansion utilises the glossary resources and expands on the synonyms of glossary terms available in the thesauri resources. Hyponym expansion is similar to Synonym expansion in utilising the glossary terms and their synonyms, but also traverses the hierarchy of the thesaurus structures to include (transitively) all narrower concepts available for the glossary. The Hypernym expansion mode includes the above two modes of expansion and also exploits the broader concept relationships within the thesauri structures. In terms of volume, the last mode will expand the semantic enrichment task to include the largest set of concepts, while the first mode will include the smallest and most precise set of concepts.

For example, for the glossary concept enclosure, Synonym expansion will include the concept garth in addition to enclosure. Hyponym expansion will also include the concepts curvilinear enclosure, ditched enclosure, rectilinear enclosure, etc. and their narrower concepts, such as oval enclosure, double ditched enclosure, polygonal enclosure, etc. Hypernym expansion will also include all Monument by Form concepts, such as arch, boundary, barrier, ditch and their narrower concepts, such as fence, hedge boundary ditch, etc. Clearly, there are recall/precision trade-offs associated with the different expansion modes and these are a topic for investigation in the forthcoming evaluation work.

2.4 Transformation of Semantic Annotations to RDF Triples

The last phase of the semantic enrichment is the transformation of the semantic annotations produced in the previous phase to RDF triples. For this purpose, CRM-EH semantic annotations are exported initially from the GATE environment as XML documents. The GATE exporter produces annotations in the form of XML tags, which are coupled with the associated content. The exported XML tags carry properties produced during the IE process, such as skos:concept and skos:exactMatch unique terminological identifiers, gateid unique annotation identifiers, and a note for capturing the context surrounding a semantic annotation. In addition, each grey literature document has a unique name that uniquely identifies the file within the OASIS corpus. This unique file name is used in conjunction with the unique gateid property of each annotation to create a corpus-wide unique identifier for each individual annotation.

In addition, the SKOS concepts assigned to the annotations are associated with underlying CRM Types, using the relationship (is_represented_by) modelling the association asserted for data items (mapped to CRM) and SKOS concepts [18]. For example, a semantic annotation of the CRM-EH class Context can be associated with the SKOS concept pit. This supports cross search between data and grey literature in terms of CRM and SKOS.

The transformation from XML files to RDF triples required the development of bespoke techniques based on the XML Document Object Model, using the scripting language PHP for building the transformation templates.
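As a rough illustration of this step, the following Python sketch walks a coupled XML fragment and emits one corpus-wide identifier and a few triples per annotation. It is a sketch under stated assumptions, not the authors' PHP templates: the namespace URIs, the base URI, the file name `report-001`, the sample fragment and the predicate labels are all hypothetical, and triples are plain tuples rather than serialised RDF.

```python
import xml.etree.ElementTree as ET

# Hypothetical namespace and base URIs, chosen only for this example.
GATE = "http://example.org/gate#"
SKOS = "http://example.org/skos#"
BASE = "http://example.org/annotations/"

# Hypothetical coupled XML fragment: annotation tags wrap the report text.
SAMPLE = f"""<doc xmlns:gate="{GATE}" xmlns:skos="{SKOS}">
<EHE0007.Context gate:gateid="281155" skos:concept="#ehg003.55">pits</EHE0007.Context>
were uniformly filled with large quantities of
<EHE0009.ContextFind gate:gateid="281158" skos:concept="#ehg027.2">pottery</EHE0009.ContextFind>
</doc>"""


def annotation_triples(xml_text, file_name):
    """Return (subject, predicate, object) tuples for each annotated element."""
    root = ET.fromstring(xml_text)
    triples = []
    for elem in root.iter():
        gate_id = elem.get(f"{{{GATE}}}gateid")
        if gate_id is None:
            continue  # not an annotation element
        # Corpus-wide identifier: unique file name plus per-document gateid.
        subject = f"{BASE}{file_name}#{gate_id}"
        triples.append((subject, "rdf:type", f"crmeh:{elem.tag}"))
        concept = elem.get(f"{{{SKOS}}}concept")
        if concept:
            # Assumed predicate label linking the annotation to its SKOS concept.
            triples.append((subject, "crm:P2F.has_type", concept))
    return triples


if __name__ == "__main__":
    for triple in annotation_triples(SAMPLE, "report-001"):
        print(triple)
```

Combining the file name with the gateid keeps identifiers unique even when two reports reuse the same gateid, which is what allows the RDF to be decoupled from the document content.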
The decision to use a server-side scripting language like PHP was driven by two main requirements: a MySQL database for retrieving the unique file name of each document, and a visual interface for parameterising the transformations, allowing easy selection of the semantic annotations that participate in the transformation. PHP allowed rapid development and proved robust with a large set of documents. The final RDF documents are decoupled from the content. However, as explained above, each annotation resource is tied to a corpus-wide unique identifier.

The following example presents the details of the RDF transformation. Consider the rich phrase pits were uniformly filled with large quantities of pottery. The phrase can be modelled by a CRM-EH ContextFindDepositionEvent, connecting the find, pottery, with the context, pit. The coupled XML output would have the following structure (the note property is omitted and URIs truncated for simplicity):

    <EHE1004.ContextFindDepositionEvent gate:gateid="281105">
      <EHE0007.Context gate:gateid="281155" skos:concept="#ehg003.55">pits</EHE0007.Context>
      were uniformly filled with large quantities of
      <EHE0009.ContextFind gate:gateid="281158" skos:concept="#ehg027.2">pottery</EHE0009.ContextFind>
    </EHE1004.ContextFindDepositionEvent>

The RDF transformation would have the following structure:

    <crmeh:EHE1004.ContextFindDepositionEvent rdf:about="...">
      <dc:source rdf:resource="..." />
      <dc:source rdf:resource="..." />
      <crm:P2F.has_type rdf:resource="..." />
      <crm:P3F.has_note>
        <crm:E62.String>
          <rdf:value>pits were uniformly filled with large quantities of pottery</rdf:value>
        </crm:E62.String>
      </crm:P3F.has_note>
      <crm:P26F.moved_to rdf:resource="..." />
      <crm:P25F.moved rdf:resource="... base#suff ..." />
    </crmeh:EHE1004.ContextFindDepositionEvent>

3 Example Uses of Rich Metadata

The automatically produced metadata are utilised by two web applications, the STAR research demonstrator and the Andronikos portal [17, 1]. The STAR demonstrator uses the decoupled RDF files to support cross searching between grey literature documents and disparate datasets [3], in terms of the core CRM-EH conceptual model. A SPARQL engine supports the semantic search capabilities of the demonstrator, while an interactive interface hides the underlying model complexity and offers search (and browsing) for Samples, Finds, Contexts or interpretive Groups with their properties and relationships. The Andronikos portal, on the other hand, uses the coupled XML files for constructing and delivering the semantic annotations in an easy-to-follow, human-readable format. While the portal was developed for project purposes to assist visual inspection of the information extraction outcomes, it is seen as indicative of potential digital library applications where access to the semantically enriched text is desired.

The metadata take account of lexical ambiguities such as polysemy (the same word having multiple meanings). For example, find all archaeological Contexts of type cut, where the term cut is ambiguous.
The semantic enrichment mechanism manages to disambiguate the verb from the noun form and to reveal phrases which use cut in its archaeological sense, e.g. levelling layers sealed the base of a brick wall cut into layer, or It measured 0.3m in diameter and 0.2m deep with a circular cut. It avoids the annotation of non-archaeological uses of cut, such as although the current 'cut-off' channel is now 500m. In addition, the metadata can take account of a form of polysemous ambiguity. For example, the word brick in archaeological usage can refer to a material or to a physical object. In the phrase yellowish-brown sandy deposit containing frequent unbonded brick, the term refers to a physical object, whereas in the phrase A layer of small brick tiles forming the street paving, the term refers to a material.

The STAR Demonstrator makes use of the rich metadata for semantic search, building on CRM and SKOS unique identifiers. For example, searches are possible of the form: Context of type X containing Find of type Y. The two different extracts of screendumps in Figure 2 show a Context of type hearth containing a Context Find of type coin, together with a Context Find of type Animal Remains within a Context of type pit.

Fig. 2 The STAR demonstrator search of semantic metadata

Fig. 3 Search Results from the STAR demonstrator

The cross-search capability of the STAR demonstrator retrieves results from both datasets and grey literature reports (a variety of datasets for the hearth query). As seen above, the searches returned results for different annotation types (Contexts, Finds) and from different resources (grey literature resources commence with a hash in this demonstrator interface). For example (Figure 3), the first search retrieves from Grey Literature #archaeol a Hearth containing a coin; the original text was It differs from the other coin finds, however, in that it was associated with a hearth. Similarly, the second search retrieves from Grey Literature an animal bone within a context of type pit; the original text was the test pit produced a range of artefactual material which included animal bone (medium/large ungulate). The semantic enrichment makes it possible for the STAR demonstrator to overcome lexical boundaries and retrieve synonymous terms, as evident in the example of Animal Remains, where the term Animal bone is retrieved.

The Andronikos web portal uses XML outputs of rich metadata for generating and linking HTML pages, which accommodate semantic annotations of grey literature documents. The annotations are divided into three abstractions: (i) pre-processing annotations, such as Headings, Table of Contents and Summary; (ii) single CRM annotations, such as Physical Object, Place, Time and Material; (iii) CRM-EH archaeology-specific annotations of rich phrases. It is therefore possible to expose particular document abstractions selectively, according to different application strategies. In certain cases, the Summary sections (an example is given in Figure 4) might be targeted (or prioritised) for retrieval as being strongly representative of a grey literature report. Alternatively, the most frequently appearing CRM entities in a report (see Figure 5) might be considered a useful entry strategy. Yet again, a cross search might be interested in any occurrences within grey literature reports of highly specific, rich CRM annotations. Andronikos also makes available links to the XML and RDF versions of grey literature documents, which can be downloaded and further transformed or manipulated.

Fig. 4 Summary Section of Grey literature, Andronikos web-portal

Fig. 5 Frequent CRM entities (Time Appellation left side, Physical Object right side) in a grey literature report, Andronikos web-portal

The summary section of the report (Figure 4) discusses evidence that relates to the Roman period, while the CRM overview (Figure 5) shows the most frequently used SKOS concepts in the report for the CRM entities Time Appellation and Physical Object, in this case Roman and pottery. Both CRM and SKOS annotations are produced by the IE pipeline.
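The frequency overview described for Figure 5 amounts to counting SKOS concepts per CRM entity within one report. A minimal sketch of that counting step follows, assuming the annotations have already been read from the coupled XML into (entity, concept) pairs; the function name, the lower-casing of labels and the sample data are illustrative assumptions, not part of the Andronikos implementation.

```python
from collections import Counter, defaultdict


def most_frequent_concepts(annotations, top_n=3):
    """Rank concept labels by frequency, separately for each CRM entity."""
    counts = defaultdict(Counter)
    for entity, concept in annotations:
        # Fold case so that 'Roman' and 'roman' count as one concept label.
        counts[entity][concept.lower()] += 1
    return {entity: counter.most_common(top_n) for entity, counter in counts.items()}


# Hypothetical annotations from a single report.
SAMPLE_ANNOTATIONS = [
    ("E49.TimeAppellation", "Roman"),
    ("E49.TimeAppellation", "roman"),
    ("E49.TimeAppellation", "Medieval"),
    ("E19.Physical_Object", "pottery"),
    ("E19.Physical_Object", "Pottery"),
    ("E19.Physical_Object", "coin"),
]

if __name__ == "__main__":
    print(most_frequent_concepts(SAMPLE_ANNOTATIONS, top_n=1))
```

A production version would more plausibly count SKOS concept identifiers than surface labels, so that synonymous terms mapped to the same concept are grouped together.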

4 Evaluation

This section discusses the results of a preliminary evaluation exercise which aims to inform an ongoing larger-scale gold standard evaluation. The main objective of the preliminary evaluation is to examine the effectiveness of the manual annotation instructions for supporting the annotation task, as well as to explore the capacity of GATE modules for supporting the evaluation and the inter-annotator agreement (IAA) analysis tasks. IAA analysis is performed in order to reveal differences between annotators and to refine the manual annotation instructions before proceeding to the main evaluation phase.

The evaluation used a small corpus of 10 summary extracts. Considering the logistic constraints of the manual annotation task, summary extracts originating from archaeological excavation and evaluation reports were considered to be the best available option. They are potentially useful targets of semantic enrichment work, as discussed in the previous section. The manual annotation instructions were written to reflect the end-user aims of the evaluation, hiding complex and unnecessary ontological details. Annotators were instructed to annotate at the level of archaeological concepts rather than identifying more abstract ontological entities. The instructions directed the task of manual annotation at the concepts of archaeological place, archaeological finds, the material of archaeological finds and time appellations.

Three annotators from the target archaeology domain took part in the pilot evaluation task: a senior archaeologist (SA), a commercial archaeologist (CA) and a research student (RS) of archaeology. The evaluation corpus of ten summary extracts (5 excavation and 5 evaluation reports), containing 2898 words in total, was made available in MS Word format. The annotators were instructed to use particular highlight colours and underline tools in order to produce their annotations.
Although the annotation task could have used the GATE Ontology Annotation Tool (OAT), Microsoft Word was preferred, since the manual annotators had no previous experience of working with GATE but were fluent in Word. The resulting manual annotation sets were systematically transferred to GATE as CRM and CRM-EH oriented annotations using OAT. Upon completion of the annotation transfer, the differences between the individual manual annotation sets were analysed using the IAA module of GATE. The module was configured to report the inter-annotator agreement score in terms of Precision, Recall and F-Measure metrics. The metrics were reported in both Average and Lenient mode. The Average mode treats partial matches as half matches, whereas the Lenient mode treats partial matches as full matches. An example of a partial match might be "flints" instead of "worked flints" for the phrase "worked flints found in several pits".

The overall IAA F-Measure score for the three annotation sets was 58% in the Average mode of reporting, rising to 68% in the Lenient mode. In the Lenient mode, individual differences in annotation boundaries are not taken into account, since partial matches are considered full matches. The increase of 10% from Average to Lenient mode is evidence of the disagreement between annotators on the scope of annotation boundaries. The following tables present the IAA scores for the three annotation sets, both as overall agreement across all entity types and as agreement on individual entity types.

Table 1 IAA scores for the three different annotation sets, reported on Average and Lenient mode. SA: Senior Archaeologist, CA: Commercial Archaeologist, RS: Research Student.
Columns: Precision (Av., Le.), Recall (Av., Le.), F-Measure (Av., Le.)
Rows: SA-CA-RS, SA-CA, SA-RS, CA-RS

Table 2 IAA scores of individual entities reported on Average and Lenient mode. The scores are reported on the IAA score of the three Annotation Sets SA-CA-RS.
Columns: Precision (Av., Le.), Recall (Av., Le.), F-Measure (Av., Le.)
Rows: E19.Physical_Object, E49.TimeAppellation, E53.Place, E57.Material

The data of Table 1 reveal fairly low IAA scores, especially when results are reported in the Average mode, which is not untypical in some application domains. The Lenient mode provides improved scores, as expected, since disagreement on the annotation boundaries is not taken into account. The best F-Measure is between the Senior Archaeologist and the Research Student, scoring 71% in the Average mode and 86% in the Lenient mode. On the other hand, the IAA scores involving the commercial archaeologist (CA) are lower, largely due to a different interpretation of occurrences of the Time Appellation and Place entities. For example, the CA tended to annotate the term "phase" (as in "phase 1", "phase 2", etc.) as Time Appellation, whereas the other annotators considered it outside the intended scope. The CA also annotated the term "trench" as Place. Trenches are usually produced during excavation and so might not be considered places associated with archaeological events in the past. This distinction was not made clear in the instructions.
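The difference between the Average and Lenient modes of reporting can be made concrete with a small calculation. The following Python sketch is an illustration only and is not the GATE IAA module: it assumes annotations are (start, end, type) character spans, uses a simple greedy matching of overlapping spans of the same type, and applies a hypothetical `partial_weight` parameter of 0.5 (Average) or 1.0 (Lenient). The two annotation sets are invented for the "worked flints found in several pits" example, including the assignment of "pits" to E53.Place.

```python
from typing import List, Tuple

# An annotation is (start offset, end offset, entity type). Illustrative assumption.
Span = Tuple[int, int, str]

AVERAGE = 0.5  # a partial match counts as half a match
LENIENT = 1.0  # a partial match counts as a full match


def count_matches(key: List[Span], response: List[Span]) -> Tuple[int, int, int, int]:
    """Return (correct, partial, missing, spurious) using simple greedy matching."""
    unmatched_key = list(key)
    leftovers = []
    correct = 0
    # First pass: exact matches (same boundaries and same type).
    for ann in response:
        if ann in unmatched_key:
            unmatched_key.remove(ann)
            correct += 1
        else:
            leftovers.append(ann)
    # Second pass: overlapping spans of the same type count as partial matches.
    partial = 0
    spurious = 0
    for start, end, etype in leftovers:
        overlap = next(
            (k for k in unmatched_key if k[2] == etype and k[0] < end and start < k[1]),
            None,
        )
        if overlap is None:
            spurious += 1
        else:
            unmatched_key.remove(overlap)
            partial += 1
    missing = len(unmatched_key)
    return correct, partial, missing, spurious


def prf(key: List[Span], response: List[Span], partial_weight: float) -> Tuple[float, float, float]:
    """Precision, Recall and F-Measure with partial matches weighted by partial_weight."""
    correct, partial, missing, spurious = count_matches(key, response)
    matched = correct + partial_weight * partial
    p_den = correct + partial + spurious
    r_den = correct + partial + missing
    precision = matched / p_den if p_den else 0.0
    recall = matched / r_den if r_den else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Invented example on the text "worked flints found in several pits".
annotator_a = [(0, 13, "E19.Physical_Object"), (31, 35, "E53.Place")]  # "worked flints", "pits"
annotator_b = [(7, 13, "E19.Physical_Object"), (31, 35, "E53.Place")]  # "flints", "pits"

average_scores = prf(annotator_a, annotator_b, AVERAGE)
lenient_scores = prf(annotator_a, annotator_b, LENIENT)
```

In this toy case the two annotators agree exactly on one span and only partially on the other, so the gap between `average_scores` and `lenient_scores` comes entirely from the boundary disagreement, which is the effect discussed above.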

Examining Table 2 reveals that agreement on the E57.Material entity is significantly lower in both Average and Lenient modes of reporting. This suggests that material poses particular problems of ambiguity or recognition. Also, there is a 20% difference between Average and Lenient mode on the F-Measure score of the E19.Physical_Object entity, indicative of the different annotation boundaries used by annotators.

Performance of the pipeline was also evaluated in terms of Recall and Precision against the three individual annotation sets from the exercise, using the GATE corpus benchmark utility. The overall scores are shown in Table 3. Although these are only preliminary results, they tend to support the potential use of thesaurus expansion in IE to improve vocabulary matching.

Table 3 Precision, Recall and F-Measure scores
Columns: Synonym, Hyponym, Hypernym
Rows: Precision, Recall and F-Measure for each of SA, CA and RS

Taking the use of Synonyms (a concept will have various synonym terms) as a base point, we see that both Precision and Recall are generally slightly improved by Hyponym expansion, which adds the narrower concepts in the thesaurus (with their terms) to the starting concept. By a small margin, however, the best precision score (75%) is delivered by the Synonym mode against the input of the senior archaeologist. On the other hand, Hypernym expansion, which brings in broader concepts in addition to narrower ones, tends to decrease precision but improve recall by a greater amount. The best recall score (79%) is delivered by the Hyponym mode against the senior archaeologist input. The results show a slight improvement in F-Measure rates over all annotators for both Hyponym and Hypernym expansion modes. In addition, the results show that the system produces better F-Measure scores when performing in the Hypernym expansion mode, due to the improvement in recall rates. However, these are very preliminary results. Subsequent larger scale evaluation will examine the use of thesaurus expansion further and will consider whether the F-Measure scores continue to show a similar pattern.

The preliminary evaluation also revealed a range of terms which were selected by annotators but were not matched by the IE pipeline. Example terms include Bowl, Castle, Cemetery, Flake, Fort, Garden, Hollow, House, Inhumation, Knife, Nail, Plough, Pond, Playground, Ridge and Furrow, Ring, Settlement, Shed, Tool, Weapon. In subsequent discussion with archaeological domain experts, it was judged that many of these terms were valid. Some of these terms were in fact present in the thesaurus resources but had no appropriate bridging entry point from the archaeology glossaries. Some way of widening the terminology resources to take account of these terms would increase the eventual recall of the IE process.
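The three expansion modes compared above can be illustrated with a toy thesaurus. The Python sketch below is a hedged illustration and not the GATE pipeline used in the study: the concepts and labels are invented rather than taken from the English Heritage thesauri, the `expand` function and its mode names are hypothetical, and it assumes that Hypernym mode adds only the directly broader concepts on top of the Hyponym set.

```python
from typing import Dict, List, Set

# Invented toy thesaurus: each concept lists its labels (preferred term plus
# synonyms) and its directly broader concepts, in the spirit of SKOS.
THESAURUS: Dict[str, Dict[str, List[str]]] = {
    "container": {"labels": ["container", "vessel"], "broader": []},
    "bowl": {"labels": ["bowl"], "broader": ["container"]},
    "jar": {"labels": ["jar", "pot"], "broader": ["container"]},
    "storage jar": {"labels": ["storage jar"], "broader": ["jar"]},
}

MODES = ("synonym", "hyponym", "hypernym")


def narrower(concept: str) -> List[str]:
    """Concepts that name the given concept as directly broader."""
    return sorted(c for c, entry in THESAURUS.items() if concept in entry["broader"])


def descendants(concept: str) -> Set[str]:
    """All transitively narrower concepts of the given concept."""
    found: Set[str] = set()
    stack = [concept]
    while stack:
        for child in narrower(stack.pop()):
            if child not in found:
                found.add(child)
                stack.append(child)
    return found


def expand(concept: str, mode: str) -> Set[str]:
    """Return the set of terms used for vocabulary matching in the given mode."""
    if mode not in MODES:
        raise ValueError(f"unknown expansion mode: {mode}")
    concepts = {concept}
    if mode in ("hyponym", "hypernym"):
        concepts |= descendants(concept)  # narrower concepts
    if mode == "hypernym":
        concepts |= set(THESAURUS[concept]["broader"])  # directly broader concepts (assumption)
    return {label for c in concepts for label in THESAURUS[c]["labels"]}


synonym_terms = expand("jar", "synonym")
hyponym_terms = expand("jar", "hyponym")
hypernym_terms = expand("jar", "hypernym")
```

Each mode yields a superset of the previous one, which is consistent with the reported tendency for wider expansion to raise recall while risking lower precision.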

Based on the results of the pilot evaluation, the manual annotation instructions should be refined to improve the manual annotation task. For example, the instructions for the annotation of material entities should make clear that the task is focused on materials that have archaeological interest and are associated with physical objects, rather than materials in general. Similarly, the instructions for the annotation of place entities should be clarified, particularly for rich phrases. The instructions should highlight that entity annotation spans can contain both conjunct and adjectival moderators when applicable. In addition, annotation examples should be included for each annotation type targeted by the task in order to ease comprehension. These should attempt to clarify the cases of phrasal annotation and how boundaries should be treated. It was also seen that individual annotators tended to miss different valid annotations in the rather laborious task. The experience suggests the need for a final resolution phase that resolves differences between annotators before reaching a gold standard. An ongoing larger scale gold standard evaluation is informed by the results of this evaluation exercise.

5 Conclusions

The discussion has revealed the viability of automatic generation of rich metadata for enabling semantic search of grey literature connected with archaeological datasets. The methods of Information Extraction, driven by the core ontology CIDOC CRM and its extension CRM-EH, in combination with SKOS resources, were central to the process of automatic metadata generation. A large scale evaluation exercise is in progress to evaluate information extraction performance in general and, in particular, the handling of lexical ambiguities and the accuracy of rich phrase annotation.
Specific contributions of the work reported here include: the potential of combining knowledge based resources (including ontologies and thesauri) in information extraction; techniques for using SKOS and CRM resources within GATE based information extraction; and techniques for automatic rich metadata generation and its expression as coupled XML and as RDF triples for different consuming applications, including semantic cross search over excavation datasets and archaeological grey literature reports. In addition, the current evaluation results demonstrate the ability of the system to perform in different modes of semantic expansion that exploit vocabulary resources, which may allow the system to be tailored to favour recall or precision.

In general, the current study demonstrates the capability of CRM based methods to drive automatic generation of rich metadata in domain specific digital libraries. Such metadata can be expressed in interoperable formats such as XML and RDF graphs, which can be exploited by digital library systems to enable cross-search functionality between disparate resources. Work is underway investigating generalisation of the methods to related areas in the cultural heritage domain.

Acknowledgements. The STAR project was supported by the Arts and Humanities Research Council [grant number AH/D001528/1]. Thanks are due to the Archaeology Data Service for provision of the OASIS corpus and to Phil Carlisle (English Heritage) for providing domain thesauri.
