Towards semantic interoperability of cultural information systems: making ontologies work

Master's thesis (Magisterarbeit) submitted to the Philosophische Fakultät of the Universität zu Köln by Robert Kummer
Contents

1 Introduction 2
2 Establishing digital scholarship 5
2.1 Functional requirements of digital scholarship 5
2.2 Implementing digital scholarship 10
2.3 The interoperability challenge 13
3 A web of linked cultural heritage data 17
3.1 Conceptual and technical requirements 17
3.2 Identification and representation of resources 22
3.3 Semantic Web tools 24
4 Standards for semantic interoperability 26
4.1 Managing archaeological objects 27
4.2 Linking to bibliographic information 30
4.3 Linking to other forms of knowledge organisation 34
5 Dealing with heterogeneity 37
5.1 Levels of heterogeneity 37
5.2 Heterogeneity on the schema level 38
5.2.1 Uniform representation of data models 38
5.2.2 Mapping data models 39
5.3 Heterogeneity on the entity level 48
5.3.1 Data extraction and data quality problems 48
5.3.2 Entity identification and record linkage 50
5.4 Implementing an overall mapping workflow 53
6 Knowledge visualization for the Semantic Web 56
6.1 Paradigms for visualizing linked data 56
6.2 Faceted browsing using Longwell 57
7 Conclusion 61
Chapter 1

Introduction

Recently, new terms have emerged to describe an IT and social infrastructure that should facilitate seamless digital scholarly work, usually referred to as Cyberinfrastructure, a term coined by the US National Science Foundation. Many endeavors have been made to build such a Cyberinfrastructure. Most of them share the same objective with only slight variations: because existing data sources are fragmented and spread all over the world, some scientific questions cannot be answered today. The main objective therefore has to be to identify, describe and implement elements of an infrastructure that enable scholars to better exploit digital resources [28, 45, 59]. This infrastructure will provide unified access to data sources and offer services that add value to the underlying cultural heritage content.

One step towards an integrated Cyberinfrastructure for cultural heritage is to syntactically bring data objects together and to semantically mediate between different data models. State-of-the-art research suggests establishing metadata harvesting in addition to crafting software agents that are aware of ontologies. Conceptual reference models like the CIDOC CRM help to mediate between different data models and provide a blueprint for building software that understands cultural heritage data [10]. But semantic integration (processing data meaningfully) and syntactic integration (bringing data to a common place) are just one step towards seamless interoperability of cultural heritage information systems. New ideas originating from Semantic Web research and well-established concepts from the world of (digital) libraries may contribute important ideas for a digital work environment for scientists.

However, today, many different information systems with different methodical approaches can be found in the field of historical cultural research; each one is designed according to a specific scientific question and perspective, using specialized terminology and a certain national language. This could be seen as a rather productive situation, but the experience of using information systems for historical cultural research could be greatly enhanced by creating a common platform for information retrieval.
In a joint effort, two parties from classics and archaeology intend to formulate a research program for achieving the goals mentioned above. These parties are the Perseus Project and Arachne [9, 8, 20]. The Perseus Project is a digital library currently hosted at Tufts University. It provides humanities resources in digital form with a focus on Classics, but also early modern and even more recent material. Arachne is the central database for archaeological objects of the German Archaeological Institute (DAI) and the Research Archive for Ancient Sculpture (FA) at the University of Cologne [11, 21]. DAI and FA joined their efforts in developing Arachne as a free tool for archaeological internet research.

The goal of this thesis is to document the course of this project, i.e. the efforts to gain first experience with building a system that syntactically and semantically integrates data in an international and therefore multilingual environment. It also reports on issues that were encountered during the project and reflects on possible ways to resolve them. It turns out that conceptually mapping data models is not the greatest challenge; rather, it is extracting data with appropriate quality and identifying multiple digital surrogates that refer to the same entity in a multilingual environment. The project is designed to contribute to the said efforts to establish a digital infrastructure for scientific research in the cultural heritage area.

Thus, first, to place the project in its greater context, a model of digital scholarship is crafted and the functional requirements are discussed that will have to be implemented during the process of software development. Second, the peculiarities of sharing data between Perseus and Arachne are introduced: how the collections complement each other and where the mutual benefits lie. Third, state-of-the-art concepts and tools are discussed that help with integrating heterogeneous data from multiple sources; most of them originate from current Semantic Web research. Fourth, the reader is given a closer look at standards for digitally representing cultural heritage data, commonly known as (Networked) Knowledge Organisation Systems and Services (NKOS). 1 Within the context of this project, the CIDOC CRM was used as a common data model for sharing metadata. The main part discusses forms of heterogeneity that were encountered while implementing a mapping workflow. The main issues are explained and possible ways to resolve the problems are suggested. Finally, paradigms for visually presenting integrated data objects to users are explored. Longwell, a Semantic Web browser, was used to index and display the data that was mapped to the CIDOC CRM.

1 NKOS discusses the requirements for enabling knowledge organization systems as network services (http://nkos.slis.kent.edu/).
With only one person doing the analysis of both data models, the mapping, and the implementation of fundamental software tools, the overall software architecture had to remain rather lean. Therefore, high-level programming languages were avoided; most of the presented workflow relies on shell scripts, most of them based on the basic tools that come with the UNIX operating system, and on style-sheet processing for data extraction and mapping. The mapping was implemented using regular expressions and XSLT style-sheets, and its results are documented in a simple text file that can be found in the Appendix. Although the infrastructure discussed is suitable for all areas of scientific research, this thesis focuses on the cultural heritage area.
Chapter 2

Establishing digital scholarship

This section aims at sounding out the intellectual requirements that facilitate a scholarly workflow as defined above. While software development is still a very young discipline compared to examining ancient Greek and Roman texts, it has always been intended to better support certain tasks that a person needs to perform. Therefore, functional requirements are defined first, and software developers then build a specific software tool around these agreed and formulated requirements. This involves a lot of communication between domain experts and software developers before the first line of code can be written. In the larger context of this project, functional requirements can be deduced from the traditional scientific workflow, especially within the subjects that deal with culture and history.

In the following section, first, a model of digital scholarship is used to help identify and describe the tasks that could be supported by proper integration of cultural heritage information systems. Second, interoperability on the level of data objects presupposes that interoperability of ideas and concepts is also established on more abstract levels. Therefore, several related ideas are discussed and evaluated. Finally, these concepts are applied to the particular project that this thesis reports on.

2.1 Functional requirements of digital scholarship

Large amounts of cultural heritage information have already been migrated to various digital media during the last years. Additionally, the importance of peer-reviewed Open Access material is more and more recognized within the scientific community. Consequently, a lot of work goes into reflecting on the architecture that currently facilitates scholarly communication and how it could be transformed in reaction to new opportunities that arise in a digital environment.
One core argument is that new knowledge which has been discovered with the help of taxpayers' money should not be handed to large publishing houses so that libraries have to pay for it again while prices become prohibitive. Making scientific results available at every subsequent intellectual processing stage is a first important step. Adding digital services to the data that is publicly available on the World Wide Web would add even more value to the underlying content. But which services do scientists need for research in the cultural heritage area?

Figure 2.1 introduces a layered logical model of digital scholarship that transcends the components of the aforementioned Cyberinfrastructure. The uppermost layer suggests the need to distinguish objects of the perceptible world and their digital surrogates in one or more digital library collections. These surrogates consist of critical editions of ancient literary texts, archaeological surveys of individual sites or even catalogues of physical artifacts. Scientists create these surrogates as a result of their everyday work, by digitizing material that has been published in traditional form or, in the future, by directly publishing in digital form. Beneath the primary sources, digital libraries should also host secondary sources like reference works that capture the results of a longer research process, and also monographs and research papers that develop new and original ideas. In a digital library, secondary sources should be linked to primary research material to facilitate advanced services.

Figure 2.1: Model of digital scholarship.
The model differentiates between three further layers. While pursuing scientific research, scholars need to refer to texts, parts of texts, archaeological objects, and abstract things. If a digital library provides a stable and unambiguous identifier for each relevant instance, a scientist could use this identifier to refer to, for example, an archaeological object. This reference would be more accurate than in traditional scholarly works. In combination with a resolving service, the identifier could be used to obtain one or more digital representations of, for example, a passage of an ancient text. A digital representation could be a scanned image or the result of OCR. This is reflected by the third layer of the model.

However, in certain cases scholars do not want to refer to an instance but to a specific entity, let's say to one of the smaller Alexandrias that were built to honor Alexander the Great, not to the one in Egypt. By grouping instances that have been identified as referring to the same entity in the perceptible world, scholars can refer not only to all digital surrogates that have been digitized so far but also to the one entity that they stand for. The model therefore emphasizes two layers between surrogates of primary sources within the digital collection and secondary sources. Secondary sources refer both to named entities that are derived from grouping instances and to the instances themselves that represent the object in the real world. A long-term project objective will be to populate both layers with metadata about instances and entities that conform to the CIDOC CRM and other standards. The third layer represents the world of quotations and the fourth layer represents the world of authority documents. This view suggests that annotating objects, together with referring to instances and entities, are fundamental functions of digital scholarship, especially in the humanities, since arguments have to be connected to their evidence in primary sources. The latter is reflected by the bottom layer of the model [1].

After defining the functional requirements that could leverage digital scientific research, software components have to be described as parts of a logical architecture that is able to provide services meeting those requirements. Snow et al. have described a layered logical model that assists archaeological research, consisting of storage management, a web service interface layer and portal software [56]. They also state that in the absence of a new generation of cybertools, archaeological research will remain impoverished. Archaeology concentrates on exploring the evolution of culture, growth in population and the interaction of cultures. Research in these areas depends on finding meaningful links between different findings. This in turn depends on being able to access distributed data sources hosting heterogeneous data objects. Because nowadays data is mainly held in separate silos administered by individuals, museums and governmental institutions, finding those connections is difficult.
Both classification and terminology vary, and especially GIS databases are composed of records that have been accumulated on paper. In addition, there is a voluminous amount of unpublished gray literature with images, maps and photographs embedded. But the problems do not only lie in access; internally, data is represented differently. They further state that, due to political boundaries, archaeology will remain a mosaic of provincial efforts in the future as well. This is one of the main motivations to build an integrated framework with customizable access points to methods and data that would help to overcome the current state of fragmentation.

Against that background, interoperability is not only a technical goal but also a social project. Sharing design strategies can promote effective cooperation both on the level of human collaboration and on the level of electronic interaction. Especially because archaeological research deals with cultural heritage data, sustainability has to be established. Therefore, all host institutions should remain in control of their data. Digital libraries and the services offered should be made publicly available, so that researchers and whole organizations can store their data there.

Figure 2.2 introduces a possible logical infrastructure consisting of data providers and service providers that process the data to offer advanced services on the raw data objects. Additionally, authority naming services will contribute information that can be exploited by software components or by the end user. Both Perseus and Arachne would form repositories that expose well-curated data objects and exhaustive metadata to the web community, possibly by using institutional repository software that will be introduced later. The repository software should implement a protocol such as OAI-PMH that is suitable for the dissemination of huge amounts of data. IRs often also offer advanced services for scalability and durability of the data objects. Authority naming services will provide specialized structured information on entities of the Greco-Roman world that cannot be covered by gazetteer services like the Getty Thesaurus of Geographic Names [35]. These services host knowledge that has been created by scientists at all times and can be used by them to unambiguously refer to a specific entity. They should be rich in variants and languages to help with information retrieval, entity identification and translation of metadata.

The figure also demonstrates how one service provider (indexing) can become a data provider for a second service provider (search and image browsing). Service providers harvest data from institutional repositories to offer advanced services for that data. These could be either services that process large sets of data objects, like statistical analysis and indexing, or services that focus on single data objects to deliver representations like images in multiple formats. The figure shows an indexing service that consults authority naming services to perform entity identification for data merging.
Figure 2.2: Overall system architecture.

A second service obtains processed data from the indexing service to offer searching and browsing facilities. End users are equipped with pieces of software commonly called agents. The term agent refers to a very broad class of software that performs complex tasks. A software agent could be a web browser or more specialized software that can be influenced directly by a user. Either the tool (by configuration) or the user (for example by typing the address of a web page into the browser address field) has knowledge of the service provider and knows how to connect to and use the service. The agent can also run at a remote site, controlled by the user with a browser. All these pieces of software offer useful services to the user: compilations of images of data objects, information about unambiguously identified entities, or metadata of the data objects themselves. From a logical perspective it does not matter where the services live and where the data is stored, as long as they are scalable, reliable, and accessible.

Lately a new buzzword has emerged: distributed or grid computing. Although the term is used for a lot of things, it describes some requirements that are valuable for interoperability. Commonly the term refers to different forms of distributed computing, a method of digital information processing that uses a logical layer to run different parts of a computer program simultaneously and distributed across machines to gain performance.
However, processing such cultural heritage data will lead to scalability issues. A grid infrastructure could help to exploit the resources of many separate computers that are connected by a network. A grid should be able to solve large-scale computing problems by virtualizing resources, using a logical layer that mediates between resource consumers and resource providers. For example, large numbers of distributed physical hard drives could be logically connected to one large volume to host huge amounts of image data. Additionally, and absolutely transparent to the user, this disk array could be plugged into a preservation system. This system would assure that all data objects are stored redundantly and will be preserved over time.

The infrastructure described would be suitable for building high-level services that manage complex workflows without having to accept multiple media discontinuities (in German: Medienbrüche). A new form of work environment could support scientists by offering a tool that supports a complex workflow, from targeted information search to compiling and arranging thoughts and ideas into argumentation chains and on to online publishing. This agent would be able to use a set of services to support the key workflow steps. The German TextGrid project is one of the larger efforts to achieve this goal, focusing on the field of literary studies [28].
conceptually belong together. But to be able to do this, all resources need to be publicly available to a certain community. The concept of an Institutional Repository is a step in this direction. Arachne currently approaches this problem by creating a sitemap that helps search engine bots to find objects that are buried within the architecture. 1 In a digital age, the primary function of institutions hosting cultural heritage material is to publicly offer data and services to their audience. Digital library is an emerging term that describes a set of software that can fulfill this task. The term digital library has been used in many different ways in the past. Digital libraries hold collections of digital objects and provide means to rapidly access material in digital form. Additionally, the digital form facilitates new services on that data. While traditional libraries focused on the document as the most granular item needing to be accessed, digital libraries can also focus on the content itself. The content either is digitally created or digitized by for example scanning and applying OCR software. 2 A digital library has at its core some sort of institutional repository software like Fedora or DSpace [12, 3, 58]. Institutional repository software provides methods for collecting, preserving and disseminating the intellectual output of an institution, particularly research institutions. Institutional repositories also help to achieve interoperability of resources from institutions by providing programming interfaces that help with disseminating and federating items of the collections. They can also be used for implementing common services associated with digital libraries. Since 2006 the Mellon Foundation has been funding an initiative that will develop specifications allowing distributed repositories to share digital objects [33]. In this context digital objects are considered as units of scholarly communication as opposed to the traditional definition. Traditionally, a scientific publication in printed form is one unit of scholarly discourse. Fedora is an institutional repository that aims at building the foundation for digital libraries [12, 40, 3, 14]. Although the models that are developed by this initiative seem to be very ambitious, they point in a direction that is productive for the further development of digital scholarship in classics and archaeology. From this point of view, in archaeology, a set of metadata and images about an archaeological find can be considered as a unit of scholarly communication. This bundle could be aligned with scientific annotations, leading away from traditional scholarly publication in this domain. 1 Google offers a set of tools for webmasters that facilitate indexing of contents that are dynamically created at https:// www.google.com/ webmasters/ tools/ docs/ de/ about.html. 2 One remarkable project for the cultural heritage area is the OCRopus project that aims at covering pluggable layout and character recognition as well as statistical language modeling and multilingualism. In a later project phase, OCRopus wants to be able to recognize handwritten documents. More information can be found at http:// code.google.com/ p/ ocropus/. 11
Fedora stands for Flexible Extensible Digital Object Repository Architecture. Modern digital libraries are supposed to host a large variety of heterogeneous digital objects. During the life-cycle of a digital object, a number of management tasks like data creation, organization and dissemination have to be carried out. Fedora tries to reduce costs by providing a set of features that standardize these management tasks. According to the Fedora digital object model, a unit of information consists of one or more data streams. Each data stream could be another representation of a text or an image in different resolutions. Metadata that is associated with a digital object is stored as a separate data stream; multiple metadata formats, images and other data can be associated with one object using this mechanism. Fine-grained access-control policies for the management and access interfaces provide a security architecture. Internally, all data objects together with their data streams are serialized as XML files on a hard disk. This better supports complex tasks associated, for example, with digital preservation. Fedora therefore is one approach to provide a technical foundation for digital library software.

Fedora also implements a couple of features that are interesting for providing Semantic Web services. Any type of relation that is expressed within the metadata of an object is indexed and can be queried using Semantic Web query languages like SPARQL. 3 All data streams of digital objects can be associated with behavior for dynamic content delivery (for example image manipulation services or metadata crosswalks). Additionally, the management and access APIs (REST and SOAP) facilitate integration into different application environments. Furthermore, each digital object is associated with a unique URI during the ingesting process, and a history of all modifications is stored together with the digital object. This enables references to a specific version of a digital object.

Fedora supports dissemination of all data streams, including metadata that is associated with any digital object of the managed collection, by implementing the OAI Protocol for Metadata Harvesting. This is a protocol developed by the Open Archives Initiative and used to collect metadata descriptions of resources for (indexing) services that need to use metadata from many sources [39, 13]. Rooted in the e-print community and well known in the context of Open Access, 4 the OAI Protocol for Metadata Harvesting is based on a client-server architecture. Harvesting clients request data from repositories called Data Providers. Service Providers can then use this data to offer advanced services like indexing or other forms of advanced organization on that data. The metadata to be transported over a network can be in any format that can be serialized as XML and on which a certain community has agreed. Unqualified Dublin Core always has to be attached in order to facilitate a basic layer of interoperability.

3 SPARQL is a W3C recommendation published at http://www.w3.org/TR/rdf-sparql-query/.
4 The Budapest Open Access Initiative (http://www.soros.org/openaccess/) is recognized as the first visible development of Open Access.
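For illustration, OAI-PMH requests are plain HTTP requests with a handful of standardized verbs. The following is a minimal sketch of a harvesting session using curl; the repository base URL is hypothetical, while the verbs and parameters are part of the protocol.

# Hypothetical OAI-PMH endpoint; real repositories publish their own base URL.
BASE="http://repository.example.org/oai"

# Identify returns basic information about the repository.
curl "$BASE?verb=Identify"

# ListRecords harvests records as unqualified Dublin Core (oai_dc),
# the one metadata format every OAI-PMH repository has to support.
curl "$BASE?verb=ListRecords&metadataPrefix=oai_dc"

# Large result sets are delivered in batches; the response carries a
# resumptionToken that the harvester sends back to fetch the next batch.
curl "$BASE?verb=ListRecords&resumptionToken=SOME_TOKEN"

A harvesting client simply repeats the last request until no resumptionToken is returned, which makes the protocol easy to implement on top of existing web infrastructure.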
OAI-PMH claims to be one enabling infrastructure element for supporting new forms of scholarly communication.

Digital libraries should provide multiple access methods to their collections as well as advanced services on the hosted content. For federated digital libraries, two fundamental paradigms for searching exist: distributed searching and searching an index of previously harvested metadata. Both ways of dealing with federated information systems face fundamental problems, on the server as well as on the client side. However, a harvesting approach is more appropriate for the purposes of this project. To exploit the full power of the CIDOC CRM, resources from different places have to be linked extensively. The processing steps that are required for this task can be performed much better if data objects are accumulated in one place.

Distributed searching involves a software component that is aware of a set of associated databases. Search criteria are encoded using a standardized client-server query protocol such as Z39.50. 5 These information systems translate the query to an internal format and modify the results so that they conform with the standard. Then they are sent back to the querying component, which merges the results. This approach delegates the indexing work to each connected database. Thus, computing efforts for index generation and searching are distributed. Since the search results have to be transferred back to the issuing query service, the network traffic during searching is higher. Control over how the index is created and how search results are weighted remains with each federated database system.

Searching of metadata that was previously harvested is basically implemented by the OAI Protocol for Metadata Harvesting. In this scenario, a service provider harvests data from multiple associated data providers in advance and then builds a local index. This approach bears the disadvantage that all indexing work has to be done locally. But since the harvesting is done in advance, there is less network traffic involved while the actual query is performed. In fact, in this case indexing can be an ongoing process. This approach does not delegate indexing work to federated library institutions, and full control over how the index is technically created remains with the querying software system.

2.3 The interoperability challenge

Cultural heritage databases use the specialized terminology of their respective domain of research in a certain national language. Moreover, terminology and standards used may vary even within a single domain.

5 Meanwhile, the Z39.50 standard is 20 years old and currently maintained by the Library of Congress (http://www.loc.gov/z3950/agency/).
Given the fact that the information in each of the databases is of interest to a large community of people, efforts have to be made to overcome the current problems with data integration that are caused by the described heterogeneity. Against this background it seems reasonable that the CIDOC CRM, delivering a set of standardized terms and properties, could serve as a basis for overcoming this heterogeneity. Many projects have started experimenting with the CIDOC CRM for describing cultural heritage data in general and archaeological data in particular [18]. Integration of different cultural heritage vocabularies and descriptive systems is an ongoing research challenge in the course of projects like BRICKS, EPOCH, SCULPTEUR, and IUGO. 6 But currently only a few implementations exist that try to bridge the gap between more than one language and several data models at the same time. To overcome the lack of experience with implementing the CIDOC CRM as an intellectual concept and as a software system, Perseus and Arachne want to establish a robust implementation of a mapping workflow in the long term. This thesis reports on launching this collaboration by creating a prototype to sound out the mapping of both databases to a shared metadata format.

Together, Perseus and Arachne host hundreds of texts, thousands of art objects, bibliographic records and large lists of named entities, especially of places and people [8, 9, 20]. Both project partners expect many benefits from integrating their collections by using open standards. Arachne hosts data about approximately 100,000 objects of antiquity and, in addition, over 200,000 images of these objects in a connected image repository. 7 The Perseus Project comprises 6,000 well-described art and archaeology objects and additionally 36,000 images, but also approximately eight million words of Greek and Latin text as TEI code [34]. First, the integration of records on art and archaeology would provide a larger source of information to users, accessible through a common and multilingual interface. Second, to facilitate serious digital scholarly research, advanced services regarding those collections should be provided. Users may be interested in browsing passages of Pausanias' Description of Greece (a text that is part of the Perseus digital collection) that refer to objects in Arachne. Or they may want to consult, for example, Smith's Dictionary of Greek and Roman Antiquities, which is accessible online at Perseus, to rapidly acquire more information about a specific record in Arachne, with just one or two mouse clicks.

6 The BRICKS project (http://www.brickscommunity.org/) uses the CIDOC CRM for a software component that manages archaeological finds. EPOCH (http://www.epoch-net.org/) wants to develop a tool that maps from other metadata standards to the CIDOC CRM. The already completed SCULPTEUR project (http://sculpteur.it-innovation.soton.ac.uk/auth/login.jsp) used the CIDOC CRM as internal data model for data integration among several European institutions. IUGO (http://iugo.ilrt.bris.ac.uk/) exploits Semantic Web tools to help locating informally related content of conferences.
7 August 2007; the numbers are constantly growing, since digitization projects are ongoing.
In a nutshell, data integration in this context means linking the Greek and Latin collections in Perseus to the Greco-Roman material in Arachne.

Galuzzi points out that traditionally museums present art objects with only little context and according to specific curatorial decisions [25]. In the course of changing from analogue to digital media formats, he sees a chance to break with the traditional ways of documentation and information. One challenge of introducing reference models and ontologies like the CIDOC CRM is the re-contextualization of those objects by connecting them to other art objects of the same or a different kind, such as ancient texts. This approach makes it possible to lay emphasis on conceptual similarities among objects of classics and archaeology, and it not only allows the user to find conceptually related objects, but also to navigate from one object to another by means of qualified links. The aim, therefore, must not be to imitate traditional forms of documentation in digital form, but to find new paradigms of data processing and presentation.

Arachne and Perseus host unique but conceptually related data objects that could be linked meaningfully. However, currently data is technically processed in completely different ways within each database; each institution has designed its own software that can deal with the respective specialized data model. Both databases process data of a certain heterogeneity: sculptures, vases and entire buildings with their hierarchical arrangement, and of course large amounts of textual data. It does not seem reasonable or feasible to change the internal data models of all participating database systems. Therefore, an abstract mapping agent that can be configured to match each internal data model would certainly be a more rational approach. This mapping agent would have to be aware of both database schemas to be able to translate data to a shared vocabulary of terms with a certain structure. It has been argued that the belief in easily building such a mapping agent is naïve [57]. Therefore, one goal of the project was to estimate the feasibility of an abstract but adaptable mapping component.

But how should this mapping agent be designed? In software technology, flexibility is often described with regard to modularity, adaptability and maintainability. It is interesting that all three claims deal with the reduction of complexity. All become especially problematic when dealing with information systems hosting and processing cultural heritage data. In this context, information systems have to cope with rather complex and non-uniform, sometimes incomplete, sets of data. In addition, in cultural heritage research, functional requirements have a tendency to evolve rapidly while information systems are used by historians. As the understanding of the subject increases, new questions and requirements arise. A flexible information system must therefore be able to advance at the same pace as scientific methodology develops. This should be considered in the design phase already, and,
since databases change, mapping components need to reflect this [4].
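To make this concrete: in the prototype described later, the adaptable part of such a mapping agent is essentially a declarative style-sheet, so adapting the agent to another database means exchanging the style-sheet, not the agent logic. The following is a minimal sketch of the invocation, assuming the xsltproc command line tool is available; all file names are hypothetical.

# A minimal sketch of a configurable mapping step; file names are
# hypothetical. Each style-sheet encodes the correspondence between
# one internal schema and the CIDOC CRM, so swapping the style-sheet
# adapts the agent to a different source database.
xsltproc --output arachne-crm.rdf arachne-to-crm.xsl arachne-export.xml
xsltproc --output perseus-crm.rdf perseus-to-crm.xsl perseus-export.xml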
Chapter 3

A web of linked cultural heritage data

The issues described so far are saturated with concepts and ideas that are currently discussed under the notion of the Semantic Web. Having discussed the intellectual requirements of digital scholarship and presented means for their implementation, this section identifies and describes state-of-the-art developments relating to the current World Wide Web that are meant to facilitate new and better ways of scholarly communication. Although often criticized, current Semantic Web research articulates new and interesting ideas on how to deal with data that is to be published on the known internet. Also, a fruitful discussion is emerging on how to identify, describe, and retrieve Web resources in future interoperability environments. These means of identifying and retrieving resources could be the glue that ties distributed data together. However, the Semantic Web still lacks bigger integrated software solutions and is mostly tool-based today. Some of these tools greatly helped with sounding out the usability of the Semantic Web for the interoperability prototype that was developed during the course of the project.

In this section, first, the foundations of Semantic Web technology are described on the basis of the traditional, and admittedly imprecise, Semantic Web layer cake. Then, to emphasize the importance of identification and representation, the latest information on this topic is presented and discussed. Finally, those Semantic Web tools are introduced that were used during the project.

3.1 Conceptual and technical requirements

"In so far as the scientist critically judges, alters, or rejects his own inspiration, one could also regard our methodological analysis as a rational reconstruction of the corresponding thought processes." [49]
The World Wide Web Consortium defines the term Semantic Web in a surprisingly simple way: The Semantic Web is a web of data. 1 Even more surprising is that this describes a condition that we do not have in the humanities today: a web of data. Why? A great deal of cultural heritage data that is currently accessible on the Internet is controlled by software written for small and specialized audiences and tailored to a specific purpose. Furthermore, archaeological data is currently collected at a low level of granularity, as sets of documents. Today's cultural heritage web can therefore be described as a web of linked documents, not as a web of linked data. The Web of Data, as the Semantic Web should consequently be called, describes all activities aimed at overcoming today's unsatisfactory state.

For that purpose, formal languages and software components need to be developed that deal with two aspects of data integration. Syntactic data integration physically combines data from different data sources by accumulating data objects in one place, for example a central database. Semantic integration builds upon this foundation by assuring that the data is interpreted and processed in a consistent way, namely as intended by the originator of the data. By this means, data from different sources can be combined and queried better than before.

Scientists who want to solve a scientific problem need a phase of creative thinking to collect ideas and materials that contribute to resolving the research problem. Thus, they need to juggle a lot of information at a time in their minds to exhaustively study all aspects of the issue. This is what Semantic Web technology is designed for: it is supposed to knock down the boundaries between different silos of information. 2 The Semantic Web thus aims at allowing scientists to connect information in a seamless and networked way without the need to translate and transform between multiple media formats.

Figure 3.1 shows the components of the Semantic Web. 3 This comparison captures the notion that there are several levels, each of which builds upon a lower one. The Unicode standard (ISO/IEC Standard 10646) reserves a distinct number for each letter (more generally: character), independent of the platform (operating system), language or program that uses Unicode. 4 Major IT companies have accepted Unicode, and other standards such as XML or Java support it. The concept of the Semantic Web builds upon Unicode characters for expressing strings.

1 This quote was taken from the basic introductory material about the Semantic Web that can be found at http://www.w3.org/2001/sw/.
2 Tim Berners-Lee expressed this idea in an interview that was published at http://www.businessweek.com/technology/content/apr2007/tc20070409_961951.htm.
3 The image was taken from http://www.w3.org/2001/09/06-ecdl/swlevels.gif.
4 A basic introduction on Unicode can be retrieved at http://www.unicode.org/standard/WhatIsUnicode.html.
Figure 3.1: Semantic Web layer cake.

In order to interact with resources on the internet, the Uniform Resource Identifier (URI) was introduced. A URI is a string of Unicode characters that unambiguously names or identifies material or abstract things of the real world, provided that there is a digital surrogate available. URIs can be divided into two subcategories, Uniform Resource Names (URN) and Uniform Resource Locators (URL). While URLs are URIs that provide some additional information on how the reference to a resource can be resolved to an actual object, URNs only provide a unique name for a resource, without information about where an agent can get a representation such as an image. An example of the latter are DOIs (the Digital Object Identifier System). 5 For DOIs, the DOI website itself provides a resolver that does not directly deliver HTML but redirects to a URL that can be resolved to an HTML page. However, many URLs simply are URIs that have a well-known resolver mechanism, the global Domain Name System. This system is so well established that it seems to be totally transparent.

Metadata is data about data that is used to facilitate the understanding, use, and management of data. In the context of a digital library, a data object could be a digitized text. Metadata for this text would include, for example, information about the author, the publisher, or the number of pages. The Extensible Markup Language (XML), with its hierarchical structure, can be used to attach such data about data objects to the objects themselves. XML defines a basic syntax that can be used to structure documents on the Web. 6

5 If a user types in the DOI 10.1007/978-0-387-34347-1_6 at http://www.doi.org/, it will be resolved to the URL http://www.springerlink.com/content/h3800073756x7872/. This in turn delivers an HTML page with more information on the paper and a few more browsing facilities.
But XML does not provide any means to make assertions about the semantics of a document or its parts. XML Schema is a language that constrains the structure of an XML document and augments the XML standard with additional typing facilities. 7 It depends on the context whether data is considered self-contained or data about data. One could imagine cases where metadata itself is the object of research. In this event, metadata about metadata would be absolutely valuable.

The lower layers of the model basically deal with questions of syntax, while the higher layers are concerned with interpreting the meaning of data. The term semantics has been used for a lot of things and has never been well defined. Moreover, there is no agreement on how the term semantics relates to the concept of the Semantic Web. As mentioned earlier, the Semantic Web community lately prefers the term Web of Data over Semantic Web. It can be said that the notion of semantics itself refers to the meaning that is expressed in some form of representation of information, for example natural or formal language (metadata). Uschold states that the notion of real-world semantics as defined by Ouksel and Sheth best captures the role of semantics in the orbit of the Semantic Web [60, 47]. According to this definition, objects within a model are mapped onto the perceptible world. Uschold then introduces a semantic continuum. According to his model, information can be encoded at different levels of detail, ranging from implicit, over explicit and informal, and explicit and formal for human processing, to explicit and formal for machine processing. Although the far right end of this continuum has not been reached today, there is a lot of value in encoding meaning explicitly and formally for human processing. This helps software developers to write software that is able to process a certain kind of shared data. In the end, the objective will be to build software that dynamically and autonomously resolves the meaning of encountered data objects by concept reasoning.

To explicitly make assertions about the semantics of a data object, a hierarchical markup language is insufficient. That is where higher standards like RDF and the notion of ontologies come in. Gruber defines a formal ontology as an artifact of a construction that was designed for a specific purpose and is evaluated against objective design criteria [29]. The meaning of ontology is controversially discussed in the artificial intelligence field, because at the same time it has a long tradition in philosophical discourse, where it alludes to the notion of existence. It has often been confused with epistemology, which refers to knowledge and the theory of cognition. In the context of knowledge sharing and reuse, an ontology can be defined as a specification of a conceptualization. Thus, an ontology is a description (like a formal specification of a computer program) of the concepts and relations within a domain that an agent (again, a computer program) or a set of agents can evaluate to process data.

6 XML is a markup standard derived from SGML (ISO 8879). More information about XML can be found at http://www.w3.org/XML/.
7 XML Schema is a W3C standard and has been published at http://www.w3.org/TR/2004/REC-xmlschema-1-20041028/structures.html.
By restricting the vocabulary to express what is the case in a specific domain, ontologies facilitate interoperability between multiple pieces of software. 8

The Resource Description Framework (RDF) mentioned above is another language; it defines a simple data model to describe resources and the relations that can exist between them. RDF provides trivial semantic concepts like objects and relations and can be expressed in XML but also in other notations like Notation3 [30]. In RDF, information is represented as triples. A triple is an assertion that comprises a subject, a predicate, and an object. RDF Schema builds on top of RDF by providing a vocabulary to group objects into classes and to constrain the relations that may exist between class instances. Thus, RDFS is to RDF what XML Schema is to XML. It augments the semantics of RDF by hierarchical generalization and the definition of properties. It has enough semantic power to describe simple ontologies [5]. Since CIDOC CRM version 4.2 has been published as RDFS, both Perseus and Arachne data were exported to RDF and evaluated against the published RDFS document [31]. OWL is a language that reaches beyond the abilities of RDFS, for example by defining further language elements to describe relations between classes ("disjunctive"), restricting cardinalities ("exactly one"), equality, richer typing of properties, features of properties ("symmetry"), and enumerated classes [42].

The concept of the Semantic Web includes three additional layers that have not been addressed so far: Logic, Proof, and Trust. These upper layers deal with advanced concepts that are irrelevant for the description of the CIDOC CRM. Therefore, they will not be dealt with further in this thesis.

It has been argued that the Semantic Web endeavor is too expensive, that nobody would be willing or even able to produce enough content to create enough uptake. Shadbolt et al. explain that uptake is about reaching the point where serendipitous reuse of data, your own and others', becomes possible [54]. They carry on by saying that, today, most projects lack this viral uptake. In most cases there is no stable URI for objects, so the predicted revolution has not taken place yet. There is a need for small communities that have a pressing need for new technology. Could the cultural heritage sector be such a community? Viral uptake would create a network effect. In information technology, the term network effect was coined by Metcalfe, the founder of Ethernet [43]. 9

8 Pidcock tries to clarify the distinction between a vocabulary, a taxonomy, a thesaurus, an ontology, and a meta-model at http://www.metamodel.com/article.php?story=20030115211223271.
9 For more information on applying Metcalfe's law to the Semantic Web refer to http://blogs.sun.com/bblfish/entry/rdf_and_metcalf_s_law.
He argued that the cost of a network is proportional to the number of network cards installed, but that the value of the network is proportional to the square of the number of users. These users can share access to expensive resources like storage. Transferred to the linked data idea, users could share access to metadata about a uniquely identified resource that has already been annotated by others. A critical mass has to be reached to make the system useful for all users, because the value obtained from the infrastructure has to be greater than or equal to the price paid for establishing the building blocks of the overall system. A reasonable strategy could be to build a system that delivers value to users even without exploiting network effects. As the number of users increases, the system becomes more valuable to everybody. The scalability of such solutions can be almost infinitely enhanced by introducing a peer-to-peer principle instead of hosting all data as a monolithic block on one server. But it is certain that by sharing unique identifiers everybody can add metadata to a specific entity and share it with the community.

3.2 Identification and representation of resources

Currently, every archaeologist can access the Arachne database to conduct research and to choose from a vast amount of information. It is also possible to cite Arachne as a source by mentioning the unique Arachne serial number, together with some information that disambiguates the serial number within Arachne, for example distinguishing buildings from topographic entities. This enables the reader of a certain publication to write down the serial number and direct his browser to the Arachne website. After logging in, he can use the serial number to access the same information that his predecessor got some time ago. This is one method to reconstruct the methodical approach that was used to compile the results in a publication. This traditional approach has a couple of shortcomings and seems complicated and time-consuming.

To be able to talk about a specific subject area that has an internet representation, each object on the Web should be identified by a stable URI. This URI can then be used to reference the entity, for annotation purposes for instance, or to resolve a digital representation of this resource (in Fedora terms, a data stream). Many webservers also support content negotiation. By exploiting this functionality, a software agent can state its preference regarding the representation of a Web resource. The webserver can then deliver one or more representations in HTML, a machine-readable representation in RDF/XML, or a couple of images for the resource. By using the traditional HTTP URL schema for naming a web resource, most Web-enabled programs will be able to rapidly retrieve a representation of the resource.
An archaeological object in Arachne could, for example, be named by the URL http://arachne.org/object/30014. By exploiting the mechanism of content negotiation, a software agent could retrieve an RDF/XML representation and discover that there are multiple images connected to this resource. As a second step, the agent retrieves one image by dereferencing the URL http://arachne.org/images/482199 and indicating that compressed JPEG is the preferred format. A user agent would express this preference by including the string Accept: image/jpeg; q=1.0, application/rdf+xml; q=0.5, text/html; q=0.1 in the header of the request; the Apache HTTP server [23], for example, supports this form of content negotiation. 10 By transmitting this string together with the request, the user agent expresses that for this request it prefers an image over a representation in RDF/XML. The remaining option, if all else fails, is a representation in HTML.

Listing 3.1 demonstrates the process of content negotiation that can direct a client to select the appropriate representation of a specific Web resource. In this particular example, the client tries to retrieve the URL http://dbpedia.org/resource/Berlin and indicates that it prefers an HTML page as the result. The server responds with a 303 message and provides another URL, which most browsers automatically re-retrieve to display the according HTML page. This process is transparent to the user. 11

Listing 3.2 shows the client requesting the HTML page that it was redirected to. After the header information, the HTML code is attached at line 34.

10 Apache supports content negotiation according to the HTTP/1.1 standard. More information on Apache content negotiation can be found at http://httpd.apache.org/docs/2.3/content-negotiation.html.
11 This example is inspired by the document How to publish Linked Data on the Web at http://sites.wiwiss.fu-berlin.de/suhl/bizer/pub/LinkedDataTutorial/.

Listing 3.1: The client requests an HTML representation.
1 Krabat:rokummer$ telnet dbpedia.org 80
2 Trying 160.45.137.85...
3 Connected to dbpedia.org.
4 Escape character is '^]'.
5 GET /resource/Berlin HTTP/1.1
6 Host: dbpedia.org
7 Accept: text/html
8
9 HTTP/1.1 303 See Other
10 Date: Tue, 14 Aug 2007 12:05:12 GMT
11 Server: Apache-Coyote/1.1
12 Location: http://dbpedia.org/page/Berlin
13 Content-Length: 0
14 Content-Type: text/plain
15
16 Connection closed by foreign host.

Listing 3.2: The client retrieves the HTML representation.
17 Krabat:rokummer$ telnet dbpedia.org 80
18 Trying 160.45.137.85...
19 Connected to dbpedia.org.
20 Escape character is '^]'.
21 GET /page/Berlin HTTP/1.1
22 Host: dbpedia.org
23 Accept: text/html
24
25 HTTP/1.1 200 OK
26 Date: Wed, 15 Aug 2007 13:36:36 GMT
27 Server: Apache-Coyote/1.1
28 Cache-Control: no-cache
29 Pragma: no-cache
30 Content-Type: text/html; charset=UTF-8
31 Transfer-Encoding: chunked
32
33 5b4
34 <html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
35 <head>

Listing 3.3 shows the client indicating that RDF/XML is the preferred representation. The server again responds with a 303 redirect, but this time to a URL that points to RDF/XML data.

Listing 3.3: The client requests an RDF representation.
36 Krabat:rokummer$ telnet dbpedia.org 80
37 Trying 160.45.137.85...
38 Connected to dbpedia.org.
39 Escape character is '^]'.
40 GET /resource/Berlin HTTP/1.1
41 Host: dbpedia.org
42 Accept: application/rdf+xml
43
44 HTTP/1.1 303 See Other
45 Date: Tue, 14 Aug 2007 12:05:50 GMT
46 Server: Apache-Coyote/1.1
47 Location: http://dbpedia.openlinksw.com:8890/sparql?default-graph-uri=http%3a%2f%2fdbpedia.org&query=describe+%3chttp%3a%2f%2fdbpedia.org%2fresource%2fberlin%3e
48 Content-Length: 0
49 Content-Type: text/plain
50
51 Connection closed by foreign host.

An alternative to addressing resources with HTTP URLs is to use a generic URI and to provide a service that resolves this URI to an appropriate representation. While the use of URLs exploits existing technology, using generic URIs entails building resolving services. This is complex and cost-intensive, but useful in some domains. A sample URL that resolves a URI is http://some.resolver.org/resolve?uri=arachne:objekt:4711&type=application/rdf+xml. Here, the content negotiation part is visibly encoded within the URL. There will be a more in-depth description of this mechanism in section 4.2 on page 32.

3.3 Semantic Web tools

Even if there is enough data represented in a way that can be easily exchanged and shared, there is still the need for software that is able to process that data. These software components are the so-called agents. They serve to process Semantic Web data and to provide communication channels to resolve problems collaboratively, with one or more agents for each task. Many tools are evolving in the field of the Semantic Web. In fact, the number of tools that are supposed to deal with the technologies described has grown so fast that the W3C could not cope with the upsurge and decided to create a community-driven portal to keep track of the domain. 12
domain. 12 Since most of these toolkits deal with and depend on RDF, we decided to choose RDF for implementing the mapping. Unfortunately, most of these tools come with little or no documentation, and there is little experience on how they deal with large amounts of data.

Shopping agents are degenerate examples of Semantic Web applications. On behalf of their users, they fulfill the fundamental task of comparing prices from disparate and heterogeneous but semantically related sources. They are degenerate because usually none of the sources has published its vocabulary. Shopping agents therefore usually need to scrape the information from multiple HTML pages. This results in additional work for software developers, since they always have to deal with individual data models. There is no format that everybody has agreed on, and a lot of semantics have to be hardwired within the agent software. Each time one of the participating vendors changes the appearance of the web page, the agent software needs to be adapted.

Throughout the project, multiple tools served to provide a better understanding of Semantic Web concepts and methods. The following describes the software components used. Protégé was helpful for approaching modeling techniques of ontologies, including the CIDOC CRM. The user gets an impression of what the RDF markup could look like if it was produced by an automated mapping algorithm. Strengths and drawbacks of different modeling approaches became visible after manually creating data objects in the CIDOC CRM schema. 13

The next tool, called Jena, is a Java framework that supports the development of Semantic Web applications. It provides a programming environment for RDF, RDFS and OWL and embodies a rule-based inference engine. Jena is Open Source, a result of the development efforts of the HP Labs Semantic Web Programme. There are a couple of frameworks available for Java and other programming languages, but Jena, with currently 11 developers and 24,600 downloads, appears to be one of the more active projects within the Open Source community. 14

Eyeball is a part of the Jena framework that checks RDF models for common problems; within the project it is used to check the CIDOC CRM markup before it is further processed by other software components. It checks for unknown predicates and classes, bad namespaces, and ill-formed URIs, amongst other things. The Redland RDF Libraries provide a couple of command-line tools that were useful to count triples and to reformat RDF code. In this particular case, they were used to count the triples that were generated during the mapping efforts. 15

12 The W3C maintains a wiki-style list of Semantic Web tools at http://esw.w3.org/topic/SemanticWebTools.
13 http://protege.stanford.edu/.
14 http://jena.sourceforge.net/.
15 http://librdf.org/.
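To give a first impression of Jena's programming model, the following minimal sketch creates a small RDF model and serializes it as RDF/XML. It is purely illustrative and not taken from the project code; it assumes the Jena 2.x package layout (com.hp.hpl.jena) and uses the CIDOC CRM RDFS namespace that also appears in the mappings of chapter 5.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;

public class JenaSketch {
    // Namespace of the CIDOC CRM RDFS definition (version 4.2).
    static final String CRM = "http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#";

    public static void main(String[] args) {
        // An in-memory RDF graph; Jena also offers persistent models.
        Model model = ModelFactory.createDefaultModel();
        Property isIdentifiedBy = model.createProperty(CRM, "P47F.is_identified_by");

        // A museum object and one of its identifiers, both addressed by URI.
        Resource artifact = model.createResource("http://perseus.tufts.edu/artifact/New_York_30.11.3");
        Resource identifier = model.createResource("http://perseus.tufts.edu/identifiers/New_York_30.11.33");
        artifact.addProperty(isIdentifiedBy, identifier);

        // Serialize the statement as RDF/XML to standard output.
        model.write(System.out, "RDF/XML");
    }
}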
Chapter 4

Standards for semantic interoperability

Many cultural historians are happy to conduct scientific research without having to think about formalized and shared conceptualizations. Developing formalized ontologies for easier exchange of knowledge involves more time and effort than doing things intuitively. The issue of building awareness of the advantages that ensue from using standards for digital representation of cultural heritage data still needs to be addressed. Formalizing knowledge with standardized systems not only allows it to be transferred and displayed over network connections, but also to be enriched with annotations and behaviors like searching and browsing. However, Semantic Web concepts are not yet understood and accepted within the cultural heritage area, which currently limits the CRM's potential.

Common conceptual models like the CIDOC CRM can be used in many ways. Guarino categorizes different uses of ontologies by a temporal and a structural dimension [52]. Thus, ontologies can be used at development time and at run time. At development time, ontologies can serve as a common language for software developers and domain experts. In this scenario, an ontology helps to model domain concepts as software components. By using standard vocabularies, the software usually achieves a better rate of interoperability. Information systems that are ontology-aware use ontologies at runtime. Some software agents recognize data that they encounter as being encoded according to a certain ontology. From a structural point of view, an ontology can be used at different levels of an application program or even interfuse the whole information system: the database component, the application component, and the user interface.

Due to the respective focus of each project partner, the project concentrates on material objects, ancient Greek and Latin texts, and the contexts that these can be linked to. To establish interoperability, multiple standards have to collaborate to cover the needs of a specific domain. While the CIDOC CRM was developed
to represent information about objects, especially those managed by museums, a new version of the Functional Requirements for Bibliographic Records (FRBR), FRBRoo, is being developed as an ontology aligned to the CIDOC CRM [17]. As an entity-relationship model, FRBR provides the means to accurately describe bibliographic information in a digital world. FRBRoo provides the means to express the IFLA FRBR data model with the same mechanisms and notations provided by the CIDOC CRM. The CIDOC CRM and FRBR harmonization, especially when extended with the Canonical Text Services protocol [50], will allow collections to integrate complex textual materials with extensive metadata about objects. The following sections introduce these standards.

The CIDOC CRM itself heavily relies on other forms of shared infrastructure and standards. Gazetteers, other domain-specific naming authorities, and controlled vocabularies provide the means for referencing and describing things and objects that form the context of material and textual objects. These registries still have to be developed and published so that a wide audience will be able to use these vocabularies by referencing entities and contributing to the content. Furthermore, service registries will hook up all participating data providers and play a major role in data discovery.

4.1 Managing archaeological objects

We have the vision of a global semantic network model, a fusion of relevant knowledge from all museum sources, abstracted from their context of creation and units of documentation under a common conceptual model. The network should, however, not replace the qualities of good scholarly text. Rather it should maintain links to related primary textual sources to enable their discovery under relevant criteria [15].

Many standards have emerged that facilitate the representation of cultural heritage data, like the Getty Categories for the Description of Works of Art or the Art Museum Image Consortium that operated until 2005 [2, 7]. In 2006, the CIDOC Conceptual Reference Model became the official standard ISO 21127:2006. The CIDOC CRM comprises definitions arranged as a structured vocabulary that were developed over a period of ten years by the CIDOC Documentation Standards Group. This group falls within the International Committee for Documentation (ICOM-CIDOC) of the International Council of Museums (ICOM). The CIDOC CRM provides a blueprint to describe cultural heritage and museum information. Therefore, the CIDOC CRM will have a major role within the integration efforts of this project. It can help to analyze the data structures of the participating information systems and to identify common information content.
Technically speaking, the CIDOC CRM is a hierarchy of 84 classes defining concepts that are commonly referred to in museum documentation practice. Each class describes a set of objects that share common features. 141 so-called properties define semantic relations between these conceptual classes. Thus, the CRM builds a foundation for semantic interoperability in the cultural heritage area [10]. Figure 4.1 shows a schematic overview of the most important concepts and the relations that can exist between them, according to the model. 1 By adopting these concepts of formal semantics, the CIDOC CRM is well prepared to play a role in the development of the Semantic Web.

Figure 4.1: Conceptual overview of the CIDOC CRM.

The CIDOC CRM does not intend to prescribe how a certain community should document objects, even though it could serve as a guideline for good documentation practice. The goal is to facilitate read-only integration of data, materially or virtually. While creating the CIDOC CRM, two design choices were made to further enhance and facilitate data integration and to keep the whole vocabulary at a manageable size. First, as the result of a pragmatic approach to ontology design, the CIDOC CRM is property-centric. By providing a large set of properties, richer semantics can be expressed than by using fine-grained hierarchies of classes as thesauri would do. Classes were thus only introduced to form the domains and ranges for properties.

1 The figure follows [16].
While an attribute is applicable to only one class instance, a relation always concerns two instances. Thus, the CIDOC CRM helps with modeling objects within their context instead of attaching isolated attributes. Second, it has been argued that explicitly including events in ontologies results in models that facilitate better integration of cultural contents [15]. Thus, the CIDOC CRM proposes events that tie objects and their contexts together. Figure 4.1 demonstrates how events link physical things, conceptual objects, places, timeframes, and actors. It goes without saying that a data structure that conforms to this paradigm is more difficult to create than a flat attachment of values to a data object.

An ancient sculpture, for example, would be modeled as an instance of the class E24 Physical Man-Made Thing, a class that comprises all persistent physical items that are purposely created by human activity [10]. It came into existence through an activity that in turn is an instance of the class E12 Production, a class that comprises activities designed to, and succeeding in, creating one or more new items. Both instances are connected by the property P108B was produced by, a property that identifies the Physical Man-Made Thing that came into existence as a result of a Production event. Data from different sources that follow this scheme can be processed more consistently, even if different sources deliver contradictory information. Unlike Dublin Core, the CIDOC CRM focuses on the cultural heritage domain and adds a class and property hierarchy to its vocabulary definitions. Additionally, attribute assignments can be linked to events so that the same attribute can be assigned twice with different values as a result of different measurement events, a situation that is common when dealing with soft historical data. Assigning database objects to well-defined classes also facilitates searching for common objects that originate from different data sources.

If certain communities find that class concepts like E24 are too broad in scope, more detailed classes can be agreed on, for example, in order to distinguish vases and buildings that both fall within the category E24. This is usually done by exploiting the extension mechanism of the CIDOC CRM. Certainly, a simple mapping of these concepts to E24 Physical Man-Made Thing would be unsatisfactory because information would be lost. Therefore, the CRM offers means to refine its high-level concepts by using the class E55 Type. This class can be used to attach a thesaurus-like hierarchy of terms to the standard data model. Because extensions through E55 Type are community-specific and not covered by the standard CIDOC CRM, they have to be documented and published as authority documents. Only when this has been done is seamless and automatic processing of the data assured.

The CRM offers two mechanisms to create more granularity for describing museum objects. One approach would be to define subclasses of the built-in
CIDOC CRM classes, stating for example that A1 Sculpture is a E24 Physical Man-Made Thing or that A2 Building is a E24 Physical Man-Made Thing. Interestingly, the same can be done with properties. The other mechanism is to use a Type hierarchy that can be constructed by using the class E55 Type. The class E55 Type is treated as universal and specific at the same time. This bears the advantage that a type can be discussed as an element of scholarly discourse (E83 Type Creation P135 created type E55 Type). But in some situations this approach seems to be too complicated, and the creation of subclasses or the usage of publicly available and more specialized ontologies seems more feasible. The latter approach has the advantage that those ontologies are already published and often well documented. Defining subclasses and exploiting the type hierarchy has the disadvantage that both extension mechanisms are not covered by the standard, so that other information systems cannot exploit them out of the box. Hand-crafted extensions have to be documented and published so that others can easily retrieve the information and build their software accordingly. In any case, the CIDOC CRM becomes more powerful if it is used in connection with other ontologies like SKOS for attaching thesauri, which will be looked at in more detail below.

All properties of the CRM have definite domains and ranges that belong to the vocabulary itself. The CRM offers classes for describing people, places, and bibliographic entities. This makes it seem as if the ontology claimed authority not only for describing museum objects but also for covering most of their contexts. However, it does not seem to be useful to treat the CRM as an all-in-one device suitable for each and every purpose. Additionally, the CRM is an upper-level ontology and therefore cannot and does not intend to cover the peculiarities of each cultural heritage domain. Although it does provide an extension mechanism, variations and specializations have to be documented and published (preferably in a formal language). For each object, a specific unambiguous URI needs to be assigned. This information does not include hints on how that URI could be resolved into a human- or machine-readable representation. For example, a URI of an image does not include the information how to decode and display it. One could argue that this does not belong to the scope of the CRM and needs to be addressed on other layers like content negotiation as described above.

4.2 Linking to bibliographic information

The CIDOC CRM mainly concentrates on describing material cultural heritage and museum objects. But the value of this information source can be increased by linking the material objects to other sources of information like gazetteers or bigger bibliographic databases. Information about archaeological objects in Arachne, for
example, is commonly drawn from publications. Thus, many objects are connected to bibliographic information and to other forms of structured vocabularies like material descriptions. The following paragraphs focus on standards for describing bibliographic objects, especially ancient Greek and Latin texts, and on how they could be linked by using the CRM vocabulary and structure.

FRBR is a conceptual entity-relationship model developed and maintained by the International Federation of Library Associations and Institutions [32]. In the model, entities are classified into products of intellectual endeavor (group 1), entities holding custodianship of such products (group 2), and subjects (group 3), with a strong emphasis on the first group. Within this first group, entities are classified as Works, Expressions, Manifestations, and Items. Works are defined as specific products of intellectual effort (Moby Dick), Expressions form realizations of this intellectual effort (a German translation of Moby Dick), Manifestations the physical embodiment of an expression of a work (the btb edition of the German translation of Moby Dick), and Items form a single exemplar of a manifestation (my copy of the btb edition of the German translation of Moby Dick). The second group comprises concepts like person and corporate body which hold custodianship of entities belonging to the first group. The third group consists of concepts, objects, events, and places that appear as subjects of the first two groups. All entities can be connected by defining relationships that assist the user in navigating the web of information that is formed by a bibliography, catalogue, or bibliographic database. Finally, four user tasks are defined: find, identify, select, and obtain. Information systems should implement them as behavior to enable users to perform any of them on any entity or relationship.

FRBR is a constitutional data model that enables digital libraries to better provide the most basic functions to their user community. Users need to be able to identify multiple instantiations of primary texts. The Perseus Project, for example, needed to know precisely how many editions, translations, and commentaries of canonical works such as the Iliad or Suetonius' Lives of the Caesars are in the collections. The object hierarchy of Work, Expression and Manifestation served to encode the Iliad as a general work, its multiple translations and editions, treated as subclasses of Expressions, and multiple instantiations of these publications, for example page images and OCR or XML transcriptions, represented as Manifestations. On this basis, services can be built that help fulfill the required user tasks. Perseus successfully proved FRBR to be useful for its collections by implementing a FRBR catalog of its bibliographic assets [44].

FRBR was developed by librarians to support traditional user tasks in the world of digital libraries. Within this standard, the most granular layer of representing intellectual efforts is the Item, which means a single copy, a book that one can hold in one's hands. FRBR alone, however, is inadequate for classical studies. For decades,
scholars have been using elaborate citation schemes with unique identifiers to refer to particular chunks of text. A citation like Il. 3.44 means book 3, line 44 of Homer's Iliad. Canonical citation schemes like this generally point to the same text passage in different editions or translations of a specific ancient Greek or Roman text. This way of referring to a specific text passage was developed centuries ago and has been handed down as a useful instrument until today.

For these purposes the Canonical Text Services (CTS) have been developed [50]. They comprise a protocol and interface to facilitate sophisticated referencing to and resolving of text passages. CTS implements a subset of the FRBR hierarchy and adds some extensions both upwards and downwards. Figure 4.2 demonstrates that, upwards, Textgroups can group a set of Work objects. Downwards, a citation mechanism has been added that is not necessary in the library domain but vital for classical studies.

Figure 4.2: Canonical Text Services compared to FRBR.

Unlike archaeological finds, ancient texts form a relatively constant set of documents. Therefore, assigning URIs within a global namespace has immediate network benefits. Archaeological finds are produced in large quantities all over the world, and therefore each object should be equipped with a local URI by adding namespace information. If there is a canon of well-known monuments that scientists regularly refer to, a global URI should be assigned to them as well, perhaps
automatically by using entity identification systems. The URI urn:greeklit:tlg0012.tlg001:1.10, for example, is a reference to line 10 of book 1 of the Iliad. This URI can be resolved by a resolving service like http://katoptron.holycross.edu/texttools/textbrowser/index.html?service=fucts&urn=urn:greeklit:tlg0012.tlg001:1.10. The resolver accepts two parameters: service=fucts selects the Furman University resolving service (there is more than one), and urn=urn:greeklit:tlg0012.tlg001:1.10 specifies the URI to resolve.

As mentioned above, whilst the CIDOC CRM was developed to formalize information about objects, especially those managed by museums, a new version of FRBR, FRBRoo, provides the means to express the IFLA FRBR data model with the same mechanisms and notations provided by the CIDOC CRM. For the purposes of this project this is a major breakthrough. It provides third parties with an integrated data model for textual as well as art and archaeological collections, both of which have undergone several years of development. Figure 4.3 shows how database records can be linked to ancient texts.

Figure 4.3: Linking CIDOC CRM and FRBRoo.
4.3 Linking to other forms of knowledge organisation

According to the above, the value of digital surrogates of archaeological objects and bibliographic items can be enhanced by establishing qualified and machine-actionable links. These links can be exploited by software agents to provide services such as navigation, searching, and the like. This section deals with broadening the scope by adding other contexts that are relevant to archaeological data and ancient Greek and Latin texts. For decades, libraries have linked bibliographic data with knowledge organized as thesauri or other forms of structured vocabularies. These forms of knowledge organization could also be useful in providing contextual knowledge for archaeological data.

The amount of information that is available on the Web grows exponentially through users contributing structured and unstructured content. Although some people would find it helpful if this content was published as PDF documents, this adheres to the traditional document-based approach of publishing and is not beneficial to the idea behind the Semantic Web. The data that is to be published most likely contains some content that follows certain structural principles to express knowledge about a specific subject or domain. Formalized systems of knowledge organization like controlled vocabularies, taxonomies, or thesauri can encapsulate that structure and should therefore be used to publish this content using Semantic Web languages. Vocabularies include Dublin Core for simple cross-domain information resource description, Friend of a Friend (FOAF) for machine-readable personal profiles, Description Of A Project (DOAP) for describing open-source projects, and the Simple Knowledge Organization System (SKOS), which is described in greater detail below [62]. 2

Within this family of formal languages, one appears to be particularly helpful for the purposes of the project: the Simple Knowledge Organisation System (SKOS). The SKOS language is defined using the terms of RDF and RDFS because its main purpose is to facilitate publishing any type of structured vocabulary on the Web, including thesauri, classification schemes, taxonomies, and subject-heading systems. Additionally, SKOS provides means for multilingual labeling of resources. This could facilitate publishing the concept schemes mentioned above in more than one language. Figure 4.4 shows how the ZENON thesaurus that provides access to bibliographic information could be linked to archaeological objects that are stored in Arachne. The German Archaeological Institute maintains three bilingual thesauri (English and German) in total to provide additional access methods to bibliographic assets.

2 For further information about FOAF and DOAP, refer to http://www.foaf-project.org/ and http://usefulinc.com/doap/.
By harmonising the bibliographic databases of ZENON, Arachne, and Perseus, the bilingual thesauri could be exploited as additional multilingual access methods for the integrated metadata of Perseus and Arachne. Access methods that proved to be useful for one system can also be useful for other systems within the same domain. The current project would be enhanced by multilingual translations, and SKOS could be a more elaborate language to express the ZENON thesaurus.

Figure 4.4: From A to Z, Arachne and ZENON.

The approach described above would also contribute to building large networked systems of organized and structured knowledge. Structured vocabularies that have been carefully compiled in the course of many research projects should not be isolated in individual software systems. This holds true not only for gazetteer systems but also for all forms of structured vocabularies such as lists of material descriptions of archaeological objects. The Arachne database contains a total of 800 different descriptions for materials, categorized in a hierarchical manner, which could each be equipped with a unique identifier. This information could also be exploited by other research projects. If a large section of the community used unique identifiers that are provided with each object, tangible and beneficial network effects would ensue. The linked data approach enables one to look up each
Web resource, thereby making useful data available to the general public. Current projects that intend to contribute to the linked data paradigm are DBpedia, DBLP, and GeoNames. 3

3 The DBpedia project draws on structured information from Wikipedia to make it available for browsing, semantic searching and harvesting (http://dbpedia.org/). The long-standing DBLP also provides linked data about bibliographic publications in the information technology area (http://www4.wiwiss.fu-berlin.de/dblp/). GeoNames is an ambitious gazetteer project that provides multilingual RDF for each geographic entity that it hosts (http://www.geonames.org/).
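To make the idea of publishing such vocabularies with SKOS more concrete, the following sketch expresses a single material description as a bilingual SKOS concept using Jena. It is purely illustrative: the concept URI and the labels are invented for this example and are not taken from the actual Arachne material thesaurus.

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Property;
import com.hp.hpl.jena.rdf.model.Resource;
import com.hp.hpl.jena.vocabulary.RDF;

public class SkosMaterialSketch {
    static final String SKOS = "http://www.w3.org/2004/02/skos/core#";

    public static void main(String[] args) {
        Model model = ModelFactory.createDefaultModel();
        Property prefLabel = model.createProperty(SKOS, "prefLabel");
        Resource conceptClass = model.createResource(SKOS + "Concept");

        // A hypothetical identifier for one of Arachne's material descriptions.
        Resource marble = model.createResource("http://arachne.org/material/marmor");
        marble.addProperty(RDF.type, conceptClass);
        // One preferred label per language makes the vocabulary multilingual.
        marble.addProperty(prefLabel, "Marmor", "de");
        marble.addProperty(prefLabel, "marble", "en");

        model.write(System.out, "RDF/XML");
    }
}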
Chapter 5

Dealing with heterogeneity

Information integration needs to occur on several levels. Besides the syntactical integration that deals with physically accumulating data objects in one place, there are at least two more levels of semantic integration. The CIDOC CRM is designed to provide assistance in finding conceptual agreement among multiple cultural heritage information systems. But to use the classes and properties provided by the CIDOC CRM, the source data needs to fulfill certain quality criteria. Although all requisite steps cannot be distinguished clearly, during the process of data integration heterogeneity occurs at several levels, namely at the schema and the instance level. This chapter takes a look at different forms of heterogeneity in these areas and at how to deal with them.

5.1 Levels of heterogeneity

Most cultural heritage databases have, due to a strong commercial influence, been built upon relational data models managed by a corresponding database management system. However, relational databases are not suitable for rich semantic modeling and force software developers entrusted with this task to switch to other components of the overall information system. This invariably leads to an undocumented distribution of composition algorithms among several software components. If one component of the software system is changed, this in turn changes the meaning of the managed data objects themselves. This is one factor that needs to be considered when mapping internal data to a shared metadata format [37].

On top of the internal data model (storage and internal representation of factual knowledge), the application logic (first layer of interpretation) retrieves and recombines data for the graphical user interface (second layer of interpretation). The layouts that display the information to the user in multiple views and the user's implicit knowledge (third layer of interpretation)
about the information system, including implicit conventions, have to be taken into account. The implementation of the CIDOC CRM impacts all levels, and an abstract mapping component needs to be aware of all of them to preserve the meaning they carry. It is questionable, at the very least, whether the complex and highly structured data contained in one information system (context I) can be transferred to another information system (context II) without loss of information unless the process of composition is preserved. Most current databases in the humanities field do not meet such efforts halfway because their proprietary semantic modeling does not rely on standards. In future projects, it might be advisable to build awareness of shared data models like the CIDOC CRM into cultural heritage database systems. Since it appears to be too complex to map the whole data structure to a shared data model, it is important to identify those parts of the data model that are most important and valuable for integration. It is by no means practicable to map each detail of each data model; a reasonable level of detail has to be determined.

5.2 Heterogeneity on the schema level

Database management systems provide tools to impose constraints on managed data objects. This prevents the data from becoming inconsistent. Relational databases, for example, provide tables that contain records to model a tiny cutout of the world in a digital environment. Database records can be linked by defining relations between tables. Additionally, certain products provide tools to enforce the ACID paradigm (atomicity, consistency, isolation, durability). Therefore, by looking at the database schema, a human being or a software program can get an impression of how the data is structured. Unfortunately, data objects often contain additional formal and even informal structure that is not covered by the database schema. This section explores these forms of formal and informal heterogeneity on the database schema level and how to deal with them in the context of information integration.

5.2.1 Uniform representation of data models

To make data integration efforts more consistent and streamlined, a uniform syntax for the representation of different data models needed to be found. Therefore, the data models of Perseus and Arachne were exported to XML. The Extensible Markup Language provides a syntax that is able to represent heterogeneous data models while simultaneously providing uniform methods to process the exported dataset. This data can then be processed by using Extensible Stylesheet Language Transformations (XSLT), a language that can be used to reorganize XML documents.
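As a minimal sketch of how such a transformation can be driven programmatically, the following hypothetical Java fragment uses the standard JAXP API. The file names are placeholders, and an XSLT 2.0 capable processor such as Saxon would have to be registered as the default TransformerFactory, since the style-sheets developed below use XSLT 2.0 functions like replace().

import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

public class RunMapping {
    public static void main(String[] args) throws Exception {
        TransformerFactory factory = TransformerFactory.newInstance();
        // Compile the mapping style-sheet (placeholder file name).
        Transformer transformer = factory.newTransformer(new StreamSource("mapping.xsl"));
        // Apply it to the exported XML and write the resulting RDF/XML.
        transformer.transform(new StreamSource("artifact.xml"), new StreamResult("artifact.rdf"));
    }
}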
Neel Smith at the College of the Holy Cross, in collaboration with the Center for Hellenic Studies, is developing a network service that can expose different data structures as well as data objects as XML code, called the Collection service [55]. A Collection service exposes sets of objects to network discovery and searches. Perseus already exposes its dataset using the Collection service. Since the service could not handle the large amount of data hosted by Arachne, the MySQL Query Browser was used to export Arachne's data objects to XML. 1 In the future, however, implementations of both Perseus and Arachne could expose their data with this service. Listing 5.1 shows the output for one specific data object of the Perseus database. The 102-field data model has been reduced to some significant fields.

Listing 5.1: The Collection XML code.
 1 <?xml version="1.0" encoding="utf-8"?>
 2 <QueryCollection>
 3   <request>
 4     <CollectionID>Artifact</CollectionID>
 5     <QueryCollectionXPath>/Artifact[id=
 6     2389]</QueryCollectionXPath>
 7     <ConfigFile>
 8     /usr/local/tomcat/webapps/collservice/buildServiceConfig.xml</ConfigFile>
 9   </request>
10   <results>
11     <SculptureArtifact id="2389">
12       <authorityName>New York 30.11.33</authorityName>
13       <name>New York 30.11.3</name>
14       <type>sculpture</type>
15       <style>High Classical</style>
16       <formStyleDescription>&lt;P&gt;The stele is crowned by a
17       broad epistyle supporting a shallow, plain
18       pediment.&lt;/P&gt;&lt;P&gt;Frel attributed this fine piece,
19       and another in the Kerameikos (&lt;rs
20       type="sculpture"&gt;Athens, Kerameikos P 1130&lt;/rs&gt;) to
21       the work of the same sculptor, perhaps his so-called Dexileos
22       sculptor. Clairmont agrees that the this piece and that in
23       the Kerameikos should be attributed to the same hand, and
24       this theory gains support from Herbert's observations of the
25       stylistic similarities between the two monuments and their
26       inscriptions.</formStyleDescription>
27       <title>Fragmentary stele of woman</title>
28       <sculptureType>stele, relief decorated</sculptureType>
29     </SculptureArtifact>
30   </results>
31 </QueryCollection>

5.2.2 Mapping data models

To gain a better understanding of the vocabulary definition and the structure used by the CIDOC CRM, experimental mappings of Perseus' and Arachne's archaeological artifacts have been compiled. The following section reports on the overall methodology and workflow and on the issues and challenges that were identified during this process. Once a unified representation of data models has been found, the structure of the internal data model can be transformed into the new structure. This transformation process results in a different data model while preserving the semantics. At the same time, however, the new structure needs to conform to a data model that all parties have agreed on.

1 http://www.mysql.de/products/tools/query-browser/.
In the scope of the project described, the XML code needs to be processed in a way that results in a markup that is compatible with the CIDOC CRM. Mapping internal data to common conceptualizations is a challenge that many cultural heritage projects currently face. Consequently, multiple research projects have started to study the feasibility of abstract mapping software that can be adapted to the needs of certain databases. Within the EPOCH network, a mapping tool called the Archive Mapper for Archaeology (AMA) is being developed. 2 The AMA tool is meant to enable mapping other well-known standards to the CIDOC CRM. Unfortunately, this approach presupposes that the internal data model of a cultural heritage database follows a certain standard, which will certainly not be the case in most institutions. Another open-source framework for building digital libraries, Building Resources for Integrated Cultural Knowledge Services (BRICKS), includes an Archaeological Pillar with the implementation of the Finds Identifier. 3 This software includes a mapping tool that is based on XSLT. Both mapping tools function well if the databases either can deliver data objects of a certain quality or adhere to a certain standard.

As mentioned above, the CIDOC CRM defines a data model that predominantly focuses on events. However, the current documentation practice of Perseus and Arachne is not geared to explicitly record information about events. Nevertheless, whenever data is recorded about archaeological objects, this at least implicitly entails various events. So for each attribute that was assigned to a specific data object there is, at the very least, the assignment event itself. For each date of creation that is attached to a data object, there must have been a creation or production event. Current documentation practice ignores these events but implicitly records information about them that needs to be extracted.

Kondylakis et al. introduced a mapping language for information integration [36]. It claims to cover the most frequent occurrences of heterogeneity and introduces a specific formalism that can be visualized. Figure 5.1 shows the application of this language in the context of the current project. This mapping language comprises the introduction of intermediate nodes, contraction and extraction of compounds, nesting formerly parallel structures, re-using instances for different mappings, and performing conditional mapping. The latter addresses cases where the mapping of one field depends on the value of another field.

The first rule of Figure 5.1 is rather straightforward but demonstrates how mapping is performed. Each record of the Perseus art and archaeology table is mapped to the CRM concept E24 Physical Man-Made Thing, the field name authorityName maps to the property P47 is identified by, and the field value itself finally maps to the class E42 Object Identifier.

2 http://www.epoch-net.org/index.php?option=com_content&task=view&id=74&Itemid=120.
3 There are no extensive publications about this particular mapping tool, but a brief introduction can be found at http://dev.brickscommunity.org/Archaeological_Sites.
Figure 5.1: Graphical representation of the mapping process.
This is an example of a simple one-to-one mapping operation. The second mapping rule models the period in which the archaeological artifact was crafted. In CRM terms, this involves a production event, and consequently, mapping rules two and three show how an E12 Production event is introduced to express the creation date of an artifact. Finally, rule four explains how style information can be mapped; this rule uses the CRM class E17 Type Assignment as an intermediate node and then attaches further information. This is an example of changing a parallel structure to a nested one.

This mapping language is supposed to be applied to data of a certain quality. Thus, the mapping language is not even remotely able to deal with all the special characteristics of the involved data models; it covers frequent but simple mapping problems with high-quality data sources. For the mapping project, these peculiarities will be dealt with in the section on data quality.

These rules stem from a thorough analysis of the Perseus data model. Listing 5.2 illustrates what a semi-formal documentation of this analysis process might look like. The first step involved finding a set of fields that together need to be mapped to a different set of fields with a certain structure. Then, to help identify the meaning of a particular field, some representative sample values were extracted and documented. In instances where a database field had several hundred values, a representative sample was taken. Then, after consulting the CRM definition document and matching the vocabulary definitions, a first mapping proposal was made and elaborated iteratively. Finally, the overall process was commented on, and problems were documented for reconsideration in the next mapping iteration. Parentheses represent a constant value to be inserted and curly brackets represent the value of a specific database field.

Listing 5.2: Semi-formal mapping documentation.
 1 Affected fields:
 2 "style", "formStyleDescription"
 3
 4 Sample values "style":
 5 "Archaistic", "Early Hellenistic", "High Classical", "High Classical]"
 6
 7 Scope note "style":
 8 The epoch to which the style of the described artifact belongs.
 9
10 Sample values "formStyleDescription":
11 "<P>A descendant of the riders of the Parthenon frieze.</P>", "<P>Beazley notes that the style is East Greek.</P>", "<P>Eclectic work, with a late Hellenistic male body, and a feminine head type</P>"
12
13 Scope note "formStyleDescription":
14 Provides more full-text information on the style assignment.
15
16 Proposed mapping:
17 P41B.was_classified_by
18   E17.Type_Assignment ("Style assignment of ") {authorityName}
19     P42F.assigned
20       E55.Type {style}
21         P2F.has_type
22           E55.Type ("style")
23     P3F.has_note
24       E62.String {formStyleDescription}
25
26 Other notes: The field "formStyleDescription" sometimes contains invalid XML data. Maybe a heuristic data cleaning tool could solve this problem. Additionally, the field "style" contains misspelled words and additional characters.

Figure 5.2 shows a UML diagram of the Perseus data model and Figure 5.3 an entity-relationship diagram of the Arachne data model. The figures demonstrate how different modeling approaches result in very different data models. While Perseus preferred a clean model with just a few tables based on inheritance, Arachne focused on tight and explicit contextualization of archaeological objects, forsaking the clarity of the data model. The latter approach is based on the assumption that, for archaeological objects, the information does not lie in the metadata alone but also in the qualified links to other objects.

Figure 5.2: The Perseus art and archaeology database UML diagram.

As previously mentioned, the Perseus data model relies heavily on inheritance. Therefore, as a test case, we decided to start mapping those database fields that are relevant for all objects (this refers to the class AtomicArtifact). While making the mapping, the problems that arose for specific fields were enumerated. These included semantic dependence, semi-structured and unstructured content, and dirty data. In general, Perseus' and Arachne's fields are designed for human viewing, i.e. not for machine processing. This results in low granularity of database fields and poses a challenge to extracting more granular data objects.

The processing instructions of XSLT were deemed suitable to cover the cases of heterogeneity that appeared within the Perseus data model. Thus, XSL Transformations were used to implement the mapping rules that were elaborated and documented as set out in Listing 5.2. To comply with current developments in the Semantic Web field, the data was transformed to RDF/XML. This will enable most Semantic Web tools to process the data, as shown in section 6.2.
Figure 5.3: The Arachne entity-relationship diagram.

Listing 5.3 shows a cutout of the XSLT style-sheet that was used for the mapping. The RDF wrapper element is inserted at line 9. Since all objects stored in the database were mapped to E24 Physical Man-Made Thing, this element is inserted in line 10. After that, further templates are called (lines 18 and 19).

Listing 5.3: Mapping implementation as XSLT style-sheet.
 1 <?xml version="1.0" encoding="ISO-8859-1"?>
 2 <xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
 3     xmlns:dc="http://purl.org/dc/elements/1.1/"
 4     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 5     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 6     xmlns:crm="http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#" version="2.0" xml:lang="en">
 7   <xsl:output encoding="UTF-8" indent="yes"/>
 8   <xsl:template match="QueryCollection">
 9     <rdf:RDF>
10       <crm:E24.Physical_Man-Made_Thing>
11         <xsl:attribute name="rdf:about">
12           <xsl:text>http://perseus.tufts.edu/artifact/</xsl:text>
13           <xsl:value-of select="replace(//name, '[&quot;;\[\]\+&lt;&gt;]', '_')"/>
14         </xsl:attribute>
15         <xsl:variable name="artifactID" select="//SculptureArtifact/@id"/>
16         <xsl:variable name="queryLink"
17           select="concat('http://134.95.113.200:8080/exist/xquery/artifact2img.xql?artifact=', $artifactID)"/>
18         <xsl:apply-templates select="//style"/>
19         <xsl:apply-templates select="document($queryLink)"/>
20       </crm:E24.Physical_Man-Made_Thing>
21     </rdf:RDF>
22   </xsl:template>
23   <xsl:template match="style">
24     <xsl:if test="string-length()">
25       <crm:P41B.was_classified_by>
26         <crm:E17.Type_Assignment>
27           <xsl:attribute name="rdf:about">
28             <xsl:text>http://perseus.tufts.edu/assessment/</xsl:text>
29             <xsl:value-of
30               select="replace(//authorityName, '[&quot;;\[\]\+&lt;&gt;]', '_')"/>
31           </xsl:attribute>
32           <dc:title>
33             <xsl:text>Style assignment of </xsl:text>
34             <xsl:value-of select="//title"/>
35           </dc:title>
36           <crm:P42F.assigned>
37             <crm:E55.Type>
38               <xsl:attribute name="rdf:about">
39                 <xsl:text>http://perseus.tufts.edu/styleType/</xsl:text>
40                 <xsl:value-of select="replace(., '[&quot;;\[\]\+&lt;&gt;]', '_')"/>
41               </xsl:attribute>
42               <dc:title>
43                 <xsl:value-of select="."/>
44               </dc:title>
45               <crm:P2F.has_type>
46                 <crm:E55.Type>
47                   <xsl:attribute name="rdf:about">
48                     <xsl:text>http://perseus.tufts.edu/styleType</xsl:text>
49                   </xsl:attribute>
50                   <dc:title>
51                     <xsl:text>style</xsl:text>
52                   </dc:title>
53                 </crm:E55.Type>
54               </crm:P2F.has_type>
55             </crm:E55.Type>
56           </crm:P42F.assigned>
57           <xsl:apply-templates select="//formStyleDescription"/>
58         </crm:E17.Type_Assignment>
59       </crm:P41B.was_classified_by>
60       <crm:P2F.has_type>
61         <xsl:attribute name="rdf:resource">
62           <xsl:text>http://perseus.tufts.edu/styleType/</xsl:text>
63           <xsl:value-of select="replace(., '[&quot;;\[\]\+&lt;&gt;]', '_')"/>
64         </xsl:attribute>
65       </crm:P2F.has_type>
66     </xsl:if>
67   </xsl:template>
68   <xsl:template match="ref">
69     <xsl:if test="string-length()">
70       <crm:P138B.has_representation>
71         <crm:E36.Visual_Item>
72           <xsl:attribute name="rdf:about">
73             <xsl:text>http://repository01.lib.tufts.edu:8080/fedora/get/tufts:perseus.image.</xsl:text>
74             <xsl:value-of select="."/>
75             <xsl:text>/Thumbnail.png</xsl:text>
76           </xsl:attribute>
77           <dc:title>
78             <xsl:value-of select="."/>
79           </dc:title>
80           <dc:type>
81             <xsl:text>image/jpeg</xsl:text>
82           </dc:type>
83         </crm:E36.Visual_Item>
84       </crm:P138B.has_representation>
85     </xsl:if>
86   </xsl:template>
87 </xsl:stylesheet>

As mentioned above, just like most cultural heritage databases, Perseus and Arachne do not explicitly record information about events in their data models. Therefore, implicit events had to be extracted to make use of all concepts that the CRM has to offer. The template applied in line 18 evaluates the Perseus field style and connects an E55 Type to an E17 Type Assignment in lines 23ff.
In line 57, another template is triggered that evaluates the Perseus field formStyleDescription to further specify the style assignment. This is the implementation of mapping a parallel structure to a nested one, which clarifies the relationships among the fields.
According to Semantic Web principles, a Uniform Resource Identifier needs to be attached to each Web resource, or the resource needs to be defined as anonymous (a blank node). Whenever the mapping process created a node that might need to be referred to, a URI was assigned, even to events; otherwise, an anonymous node was created. Lines 11ff demonstrate the construction of unique and unambiguous URIs for a Web resource. In this example, the string http://perseus.tufts.edu/artifact/ has been concatenated with the value of the field authorityName. A unique namespace identifying the Perseus art and artifact database has been attached to a unique identifier that points to a specific artifact. As a consequence, this forms a URI that is unique worldwide. The replace function in line 13 has been included to guarantee that the URI does not become malformed due to forbidden characters. An HTTP URL has been chosen to facilitate providing a representation of each object in multiple data formats. However, this has not been implemented yet. Additionally, lines 32ff demonstrate providing a human-readable string along with the URI that is meant for machine processing. Many software tools are aware of Dublin Core tags and can figure out the portion of information that was meant for display purposes, including the Longwell browser, which is used within the project at hand to display mapped data objects.

The Perseus project hosts images of art and archaeology objects within a Fedora repository and also maintains an index to resolve the object identification number to one or more images that depict the same object. For mapping purposes, this index has been ingested into an eXist database. Lines 16 and 17 construct a URL to an XQuery service that is called at line 19. The result document of this query is then evaluated by the template at line 68.

Listing 5.4 shows the result of the mapping process as RDF/XML that can be validated against the RDFS definition document of the CIDOC CRM. Each mapped object has been equipped with a unique identifier and, additionally, for human readability, a Dublin Core tag has been attached. The same has been done for events such as the one defined in line 34. Because different attribute assignments could result in different conclusions about the style of an artifact, this approach keeps such assignments distinct. This helps with merging information during the process of data integration: contradicting metadata about an object can coexist.

Listing 5.4: The artifact as RDF/XML that conforms to the CIDOC CRM.
 1 <?xml version="1.0" encoding="utf-8"?>
 2 <rdf:RDF xmlns:dc="http://purl.org/dc/elements/1.1/"
 3     xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
 4     xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
 5     xmlns:crm="http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#">
 6   <crm:E24.Physical_Man-Made_Thing rdf:about="http://perseus.tufts.edu/artifact/New_York_30.11.3">
 7     <crm:P47F.is_identified_by>
 8       <crm:E42.Object_Identifier rdf:about="http://perseus.tufts.edu/identifiers/2389">
 9         <dc:title>2389</dc:title>
10       </crm:E42.Object_Identifier>
11     </crm:P47F.is_identified_by>
12 <crm:p47f. i s i d e n t i f i e d b y> 13 <crm:e42. O b j e c t I d e n t i f i e r r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / i d e n t i f i e r s / N e w _ Y o r k _ 3 0. 1 1. 3 3 "> 14 < d c : t i t l e>new York 3 0. 1 1. 3 3</ d c : t i t l e> 15 </ crm:e42. O b j e c t I d e n t i f i e r> 16 </ crm:p47f. i s i d e n t i f i e d b y> 17 <crm:p48f. h a s p r e f e r r e d i d e n t i f i e r r d f : r e s o u r c e=" h t t p : / / p e r s e u s. t u f t s. e d u / i d e n t i f i e r s / N e w _ Y o r k _ 3 0. 1 1. 3 3 " /> 18 <crm:p102f. h a s t i t l e> 19 <crm:e35. T i t l e r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / t i t l e / F r a g m e n t a r y _ s t e l e _ o f _ w o m a n "> 20 < d c : t i t l e>fragmentary s t e l e o f woman</ d c : t i t l e> 21 </ crm:e35. T i t l e> 22 </ crm:p102f. h a s t i t l e> 23 <crm:p2f. h a s t y p e> 24 <crm:e55. Type r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / a r t i f a c t T y p e / S c u l p t u r e "> 25 < d c : t i t l e>s c u l p t u r e</ d c : t i t l e> 26 <crm:p2f. h a s t y p e> 27 <crm:e55. Type r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / a r t i f a c t T y p e "> 28 < d c : t i t l e>type o f a r t i f a c t</ d c : t i t l e> 29 </ crm:e55. Type> 30 </ crm:p2f. h a s t y p e> 31 </ crm:e55. Type> 32 </ crm:p2f. h a s t y p e> 33 <crm:p41b. w a s c l a s s i f i e d b y> 34 <crm:e17. Type Assignment r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / a s s e s s m e n t / N e w _ Y o r k _ 3 0. 1 1. 3 3 "> 35 < d c : t i t l e>s t y l e a s s i g n m e n t o f New York 3 0. 1 1. 3 3</ d c : t i t l e> 36 <crm:p42f. a s s i g n e d> 37 <crm:e55. Type r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / s t y l e T y p e / H i g h _ C l a s s i c a l "> 38 < d c : t i t l e>high C l a s s i c a l</ d c : t i t l e> 39 <crm:p2f. h a s t y p e> 40 <crm:e55. Type r d f : a b o u t=" h t t p : / / p e r s e u s. t u f t s. e d u / s t y l e T y p e "> 41 < d c : t i t l e>s t y l e</ d c : t i t l e> 42 </ crm:e55. Type> 43 </ crm:p2f. h a s t y p e> 44 </ crm:e55. Type> 45 </crm:p42f. a s s i g n e d> 46 <crm:p3f. h a s n o t e>the s t e l e i s crowned by a broad e p i s t y l e 47 s u p p o r t i n g a s h a l l o w, p l a i n pediment. F r e l a t t r i b u t e d t h i s 48 f i n e piece, and another in the Kerameikos ( Athens, 49 Kerameikos P 1130) to the work o f the same s c u l p t o r, 50 perhaps h i s so c a l l e d D e x i l e o s s c u l p t o r. Clairmont a g r e e s 51 that the t h i s piece and that in the Kerameikos should be 52 a t t r i b u t e d to the same hand, and t h i s t h e o r y g a i n s s u p p o r t 53 from H e r b e r t s o b s e r v a t i o n s o f the s t y l i s t i c s i m i l a r i t i e s 54 between the two monuments and t h e i r 55 i n s c r i p t i o n s.</ crm:p3f. h a s n o t e> 56 </ crm:e17. Type Assignment> 57 </ crm:p41b. w a s c l a s s i f i e d b y> 58 <crm:p2f. h a s t y p e r d f : r e s o u r c e=" h t t p : / / p e r s e u s. t u f t s. e d u / s t y l e T y p e / H i g h _ C l a s s i c a l " /> 59 <crm:p138b. h a s r e p r e s e n t a t i o n> 60 <crm:e36. V i s u a l I t e m r d f : a b o u t=" h t t p : / / r e p o s i t o r y 0 1. l i b. t u f t s. e d u : 8 0 8 0 / f e d o r a / g e t / t u f t s : p e r s e u s. i m a g e. 7 9 8 8 7 / T h u m b n a i l. 
61         <dc:title>79887</dc:title>
62         <dc:type>image/jpeg</dc:type>
63       </crm:E36.Visual_Item>
64     </crm:P138B.has_representation>
65   </crm:E24.Physical_Man-Made_Thing>
66 </rdf:RDF>

In some cases, the CIDOC CRM insists on assigning identifiers to identifiers. It provides a class called E42 Object Identifier that is meant for attaching additional identifiers such as museum inventory numbers to archaeological objects. But each identifier needs its own URI along with the additional identifier string to be attached. This leads to an inflation of identifiers that is typical for the Semantic Web. However, the generated markup can now be transported to a physical place where it can be used by many Semantic Web toolkits for display or more advanced processing purposes, for example a triple store system with a faceted browser component such as Longwell.
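As an illustration of this last step, the following hypothetical fragment (the file name is a placeholder) loads such a generated RDF/XML document into a Jena model and counts its triples, mirroring the check that was performed with the Redland command-line tools.

import java.io.FileInputStream;
import java.io.InputStream;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;

public class CountTriples {
    public static void main(String[] args) throws Exception {
        Model model = ModelFactory.createDefaultModel();
        // Parse the RDF/XML produced by the mapping style-sheet.
        InputStream in = new FileInputStream("artifact.rdf");
        model.read(in, null); // the second argument is the base URI, not needed here
        in.close();
        // Each statement in the model corresponds to one RDF triple.
        System.out.println("Triples: " + model.size());
    }
}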
5.3 Heterogeneity on the entity level

Mapping data to a common schema usually involves not only multiple database schemas but also multiple conventions for naming the things that are stored in the databases. It has been mentioned before that all databases use their individual means to internally represent data objects, and usually use several processing levels to mediate between internal representation and end-user experience. For mapping purposes, the best case is a normalized database with granular data objects. This makes it possible to map each data object to a shared data model one-to-one. Unfortunately, many cultural heritage databases, including both Perseus and Arachne, do not provide normalized data of sufficient quality. Therefore, additional intermediate steps need to be introduced to raise the data quality to the level the mapping requires. This is done either by simply dropping data or by cleaning it through pre-processing.

5.3.1 Data extraction and data quality problems

Data cleaning is concerned with correcting anomalies of different data sources, especially in the context of information integration. Within the integration project of Perseus and Arachne, data is considered of high quality if it is suitable for mapping to the CIDOC CRM; then the quality of each data object can satisfy its purpose. But data has been entered into the databases manually, while only a few constraints had been formulated. Therefore, database records certainly contain inconsistent data. To address this issue, Galhardas et al. introduced AJAX, an extensible tool for data cleaning [24]. In the course of their research, common problems with extracting data from legacy databases have been identified, such as schema-level and instance-level quality problems.

Some issues on the schema level are commonly avoided by database management systems. Wrong data types and missing data, for example, can be avoided by introducing strict types and declaring database fields as NOT NULL. However, issues that cannot easily be avoided by database management systems are those that deal with the meaning of the database content. Someone, for example, could store a term that is too general for the scope of a specific database field, or that does not belong in a certain field at all. On the instance level, there are problems that either concern single records or involve multiple records. Someone could, for example, use a dummy value to outsmart the NOT NULL constraint, or misspell values. Duplicated or contradicting records result in inconsistent data that is difficult to find and browse. Another example is the usage of inconsistent units for measurements.

In the course of the project, different problems were encountered that prohibited a one-to-one mapping. Some of them have been classified as quality problems; others have been resolved by introducing simple pre-processing steps. On the
On the schema level, for example, fields with overlapping meaning that needed to be merged were found. There were also database fields that did not use the NULL value consistently. On the instance level, values with inconsistent spelling and invalid structure were found (bibliographic entries with nonuniform or missing separators). Listing 5.2 shows that the field style, for example, contains misspelled words.
Two fundamental approaches have been chosen to get a set of clean database records for mapping. First, some of the identified inconsistencies could be resolved; in such cases an algorithm could fix the problem, for example by using regular expressions. Second, if the first step failed or turned out to be too complex and therefore too expensive, the respective chunk of data was ignored and did not take part in the mapping process. In this case the data that resides within the database should be corrected manually at a later stage.
The following deals with more examples that could not be mapped one-to-one. First, some fields are involved in semantic dependencies: the field startmod qualifies the field startdata (a terminus post quem) by stating that the date is an estimation. In many cases such functional dependencies cause the same field to be mapped to a different CIDOC CRM concept. Conditional mapping has to be applied here and can be implemented by using the XSLT <xsl:if test="expression"> tag (see the sketch at the end of this subsection). Second, some fields have structured content: the field sourcesused carries an internal structure that is not covered by the database schema. TEI markup was used to enumerate multiple bibliographic entries. Although this results in a database field with internal structure, the information can still be extracted by processing the TEI tags with an XML parser. Unfortunately, for some records, the XML markup that formed the internal structure was defective. The current mapping implementation ignores those cases; future implementations should try to fix such markup by using, for example, heuristics. Third, some fields have unstructured content: the field subjectdescription contains full text that includes references to modern and ancient people. It contains valuable information that cannot be easily mapped. It has been suggested by the CIDOC CRM community that authority lists for people and other named entities could help to draw more information from unstructured full-text descriptions [26].
Figure 5.4 demonstrates how bibliographic entities have been extracted and mapped to fields of an intermediate data model. Within the scope of the project prototype, this is done by using regular expressions, but in future versions an XML parser should recognize the structure and do the mapping automatically.
Most of the issues described above concern informal or formal structure within database fields. Since this structure is not described by the database model, the legacy software resolves it at levels above the database component. A software developer designing a mapping component needs to understand these algorithms to produce an acceptable mapping.
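A minimal sketch of such a conditional mapping in XSLT follows. The field names startmod and startdata are taken from the example above, but the test value 'estimated' and the two CRM time-span properties chosen here are illustrative placeholders, not the project's actual mapping rules:

    <!-- within a template that processes one record of the intermediate data model -->
    <xsl:choose>
      <!-- if startmod marks the date as an estimation, use the weaker property -->
      <xsl:when test="startmod = 'estimated'">
        <crm:p82.at_some_time_within>
          <xsl:value-of select="startdata"/>
        </crm:p82.at_some_time_within>
      </xsl:when>
      <!-- otherwise the date is taken as stated -->
      <xsl:otherwise>
        <crm:p81.ongoing_throughout>
          <xsl:value-of select="startdata"/>
        </crm:p81.ongoing_throughout>
      </xsl:otherwise>
    </xsl:choose>

The same database field thus ends up under different CRM properties depending on the value of its qualifying sibling field.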
Figure 5.4: Demonstration of the mapping workflow with a trivial pre-processing step.
In future versions, the mapping component should be able to process various kinds of commonly appearing heterogeneity. Some databases store lists of values delimited by special characters; regular expressions could be configured and used to extract such data with regular structure that is not covered by the database schema. XML markup within database fields is generally used to express structural coherence; it should be extracted by plugging in XML parsers. Furthermore, cultural heritage databases extensively use unstructured free text to attach relevant information that does not fit in other fields. Text parsers that can exploit authority naming services and that are aware of structured vocabularies could be used to extract relevant information from such fields. Internally, cultural heritage databases should invest in reorganizing and structuring their own data models and in enhancing data quality to better facilitate the re-use of data. If we had both a suite of mapping tools and databases with explicit formal structure, mapping would be a lot easier.
5.3.2 Entity Identification and record linkage
While the complexity of internal data models poses difficulties for conversion to the CIDOC CRM, there are even deeper challenges. For a number of objects the collections of Perseus and Arachne overlap. With regard to those, there is a need for data analysis and fusion. Beyond the challenge of integrating heterogeneous data models and establishing a certain data quality lies the problem of heterogeneous data records. There is a need to identify semantically corresponding records, that is, two or more digital surrogates referring to the same entity in the world. This problem is usually referred to as entity identification or record linkage and has been described as a difficult and resource-consuming challenge [63].
Figure 5.5 shows two database records that have been mapped to the CIDOC CRM, one from Perseus and the other from Arachne. These two records are surrogates of the same archaeological object in the real world. The example demonstrates different forms of heterogeneity on the entity level.
First, the problem of language becomes apparent. For describing the exact same object, Perseus uses bust while Arachne uses Portraitkopf, closely related English and German words. It all boils down to the fact that the same things are named differently because of allowed variations that result from different languages or accepted customs within a domain. Machine translation currently focuses on full-text data, but many hours of human effort went into structuring data by entering information into databases. This could be exploited directly to help translate metadata into other languages and thus establish better access to information. Second, there is a need to match H 44 cm in Arachne with H.0433m in Perseus, two comparable but not equivalent figures for the height of the bust. Third, there are different spellings of the placename Aricia / Ariccia in each record. Both records provide additional information that a gazetteer system could use to resolve both variants into a unique identifier like tgn,7007011 in the Getty Thesaurus of Geographic Names. Fourth, a system also should be able to establish that Boschung 1993 and D. Boschung, Die Bildnisse des Augustus, Das römische Herrscherbild I 2 (1993) 146 Nr. 80 Taf. 119, 3 most probably refer to the same bibliographic item.
Figure 5.5: Approaches to entity identification.
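To make the measurement case concrete: a simple, purely algorithmic normalization would convert both values to metres, giving 0.44 m for Arachne's H 44 cm and 0.433 m for Perseus' H.0433m. The two figures differ by 7 mm, or roughly 1.6 per cent, so a matcher with a small relative tolerance could treat them as compatible evidence that both records describe the same object, even though the strings themselves share hardly any characters.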
The name Augustus is the same in German and English, but none of the data presented in Figure 5.5 unambiguously indicates that this is Augustus, Emperor of Rome, 63 B.C.-14 A.D. with the unique identifier lccn:n79-33006. These are common problems that most cultural heritage databases have to put up with. By either using algorithm-based approaches (for measurable units) or the help of authorities (for names of places or people), entities could be identified by matching their properties, as shown in red color in Figure 5.5. If a globally established identifier like a museum catalog number is already available, the process is straightforward; names and bibliographic references are more complicated, though.
Scholars have done entity identification and record linkage for hundreds of years, for example in prosopographic research; they have created lists of names and attached thesauri or indices, a valuable resource for research in the field of cultural history. If a scientist extracts a name from a text, s/he wants to identify exactly the historical person that is referred to. This work has been done by humans for ages, and today we have the opportunity to do it in a machine-actionable way. The structured information has to be managed in a way that makes it publicly available, together with functions like searching and browsing. Authority naming services are pieces of software capable of fulfilling this task. Authority naming lists contain a sufficient amount of information to establish a given author (entity) as unique while excluding information that, although perhaps interesting, does not contribute to this objective. The Functional Requirements and Numbering of Authority Records (FRANAR) build upon the Functional Requirements for Bibliographic Records (FRBR) and define several user tasks associated with authority services. According to this document, an authority system should support the user in identifying an entity, that is, in distinguishing between two or more entities with similar characteristics. Furthermore, it should be able to contextualize an entity by providing relationships between entities; these could be relationships between two or more persons, or between a person and a corporate body, etc. [48].
It has already been said that record linkage is about finding database records (digital surrogates) that refer to the same entity in the non-digital world. Such records often do not bear enough data to be explicitly linked to a unique real-world entity and cannot be matched by trivial string comparison. The task of record linkage results in linked data, i.e. data that is somehow marked as belonging together, for example by assigning a common identifier. In historical research, record linkage was popular in the 1980s, when computers could help to study data sets like census records or parish registers. Today, it could contribute to linking large heterogeneous sets of data for Semantic Web purposes. Several approaches to record linkage have been developed since then, ranging from rule-based approaches to probabilistic methods like Naive Bayes classifiers.
However, semi-automatic normalization of data to a common format still is an important task, certainly resulting in higher accuracy of machine-driven record linkage.
The above leads to identifying and collaborating with established naming authorities, and to enriching established vocabularies with more granular entities from the Greco-Roman world. The German Archaeological Institute, for example, hosts a vast amount of granular information about place names that is not yet exploited and that should be made public. Once published as a web-accessible resource, these entities can be connected to already established authorities like TGN or GeoNames, which currently cannot deliver information with appropriate granularity for the study of the ancient Greco-Roman world.
5.4 Implementing an overall mapping workflow
After introducing the tesserae that form the mapping process, this section concentrates on how they are tied together. Figure 5.6 points out how an overall mapping workflow should be implemented to contribute to a system that establishes interoperability. Although the model tries to divide the mapping process into distinct steps, some steps cannot be clearly separated and need to interact; entity linking, for example, requires indexing.
Figure 5.6: The overall interoperability workflow.
First, both data models had to be represented in a uniform way for further processing. Perseus data has been exported to XML by using the Collection services. Since the Collection service could currently not handle the amount of data that is generated during the export of all of Arachne's data, the MySQL Query Browser was used to export Arachne's object, literature, and images tables. This export created more than 80,000 files, one for each data object, which had to be distributed across a hierarchical directory structure. Thus, there are definitely scalability issues in the mapping phase already. The export step resulted in a one-to-one XML representation of the data models.
The next step aimed at cleaning the resulting data-set. For building the mapping prototype, the Unix sed command was used with various regular expressions for extracting bibliographic entities from different fields. Some XML code for bibliographic entries was not valid and had to be dropped until a tool is at hand that uses heuristics to fix broken XML markup. Since Arachne maintains its own bibliographic database, extraction of bibliographic entities was easier on that side. The end result is an intermediate data model that can be processed more easily. In future versions, it would be good to experiment with professional data cleaning tools to extract more data from fields with informal internal structure. This step again resulted in an XML representation, but this time as an intermediate data model.
According to the mapping documentation, an XSLT style-sheet has been crafted that implements the mapping rules described. By processing each XML file with this style-sheet, the intermediate data model has been mapped to RDF/XML conforming to the CIDOC CRM. Additionally, the Eyeball tool described earlier was used to validate the resulting RDF code against the published CRM RDF definition file. This mapping step also involved assigning unique identifiers, in accordance with RFC 3986, to each material or conceptual object that was created during the mapping process. 4
Having mapped each database record to a single RDF/XML file, the data has been prepared for merging. At this point in the process, all data objects have been cleaned for proper record linkage. The current implementation relies on a simple mechanism that copies the resulting files to a common directory. Thereafter the RDF information has been merged by ingesting the files into the Longwell browser. Longwell ingests all RDF files and uses the Lucene search engine to connect objects that bear the same identifiers. 5 This mechanism is useful because it accumulates everything that has been said about a specific entity, even if the information is distributed among different physical files. Currently, this is the only form of record linkage that has been achieved; the prototype does not do multilingual entity identification, since the infrastructure that would make this step feasible is still missing.
Longwell was also used to visually present the results of the mapping process for debugging purposes. In this case, both indexing and presentation were achieved by ingesting the data into the Longwell browser software. The next section gives a more in-depth introduction on how Longwell has been configured to display cultural heritage data objects.
4 The full text of the Request for Comments can be found at http://tools.ietf.org/html/rfc3986.
5 http://lucene.apache.org/java/docs/.
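The steps described so far could be tied together in a small driver script. The following is a minimal sketch under the assumption that the cleaning rules live in a sed script and the mapping in an XSLT style-sheet; all file and directory names are hypothetical:

    #!/bin/sh
    # For every exported record: clean with sed, map with XSLT, collect for ingestion.
    for f in export/*.xml; do
      base=$(basename "$f" .xml)
      # pre-processing: regular expressions that extract bibliographic entities etc.
      sed -f cleanup.sed "$f" > "intermediate/$base.xml"
      # schema mapping: intermediate data model to CIDOC CRM RDF/XML
      xsltproc mapping.xsl "intermediate/$base.xml" > "rdf/$base.rdf"
    done
    # merge step: copy all RDF files to the directory that Longwell ingests
    cp rdf/*.rdf longwell-ingest/

Even such a trivial driver makes the pipeline repeatable, which matters when mapping rules are refined iteratively.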
The mapping workflow presented was chosen to gain experience with the application of Semantic Web concepts to cultural heritage data and to explore the issues that are connected with it. The overall mapping process definitely needs more automation by implementing means to publish and harvest, index and present the data. Once this automation has been established, further steps need to be introduced. These include multilingual record linkage and the interaction with authority naming services for better linking of data objects and accumulation of multilingual metadata. This, in turn, would better facilitate services like cross-language information retrieval in very specialized domains like classics and archaeology. On the conceptual level, the mapping should be enhanced iteratively by including more database fields and extracting more information.
Chapter 6
Knowledge visualization for the Semantic Web
The visualization of Semantic Web data poses an interesting challenge to software developers. Data structures of almost unlimited complexity need to be presented to users who usually are not aware of the underlying concepts of information representation. The CIDOC CRM, for example, promotes modeling cultural heritage data around events. It has been argued that this approach is necessary because it facilitates better data integration. Although this method of describing data may be useful and logical, users will probably not immediately agree that it is necessary. This assumption is backed by the observation of current documentation practice, where events obviously are not felt to be needed and are therefore not explicitly documented.
This chapter deals with exploring means to process and visualize data that resulted from prior information integration. First, a survey of paradigms for visualizing Semantic Web data is undertaken. Then, the Longwell browser that was used to index and display the RDF/XML is introduced. Longwell has also been useful for exploring scalability issues with Semantic Web data.
6.1 Paradigms for visualizing linked data
When it comes to presenting data to the user, let's say a scientist engaged in cultural heritage research, a fundamental conflict has to be resolved. Maeda states that simplicity is about subtracting the obvious, and adding the meaningful [41]. RDF, however, facilitates the formulation of amazingly complex data models in which huge amounts of interlinked data objects can reside. How, then, do we extract what is meaningful and useful for the end user? Visualization in information technology has always aimed at explicitly pointing to coherence that, without applying smart algorithms, would remain implicit.
Geroimenko et al. pioneered the area of visualization of Semantic Web data. They propose the extensive use of SVG and X3D to implement different visualization paradigms. 1 The identified application fields range from creating distributed user interfaces, through illustrating complex networks (for example citation networks), to sophisticated models for knowledge visualization using dynamic SVG charts. One topic particularly interesting for archaeological research is the use of SVG and XSLT to display geo-referenced data on interactive maps [27]. Different user communities need to figure out the processing and visualization paradigms they require to create useful data presentations.
The most straightforward approach to presenting Semantic Web data would be to generate a textual presentation of web resources that may be linked by the well-known HTTP link mechanism. By using a simple XSLT transformation, an RDF/XML document can be converted to an HTML file including links to other data objects (a minimal sketch is given at the end of this section). This strategy is pursued by browsers like Disco or the well-known Tabulator. 2 But there are more sophisticated approaches to displaying RDF data. Since RDF is based on the idea that all data can internally be represented as a graph, there is an almost unlimited number of ways to visualize the data. Robertson created several examples of visualizing cultural heritage data within the scope of his Historical Event Markup and Linking Project (HEML). Historical events can be displayed either on a map, emphasizing the spatial element of an event, or on a timeline, emphasizing the temporal dimension. This particular example is interesting because the HEML language can easily be translated into CIDOC CRM and vice versa [53]. Other projects display data objects as nodes of a graph to emphasize the relation an object has to its surrounding contexts.
1 SVG is maintained at the W3C (http://www.w3.org/TR/SVG/). The X3D standard is defined at http://www.web3d.org/x3d/specifications/x3d_specification.html.
2 A representation of Berlin in Disco as HTML can be retrieved at http://dbpedia.org/page/Berlin, the same resource as RDF/XML at http://dbpedia.org/data/Berlin. The Tabulator browser is available at http://www.w3.org/2005/ajar/tab.
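As a minimal sketch of the first, straightforward approach, consider the following XSLT 1.0 style-sheet. It is illustrative only: the Dublin Core title and the generic handling of rdf:resource references are assumptions about the data, not a general-purpose RDF renderer.

    <?xml version="1.0"?>
    <xsl:stylesheet version="1.0"
        xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
        xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
        xmlns:dc="http://purl.org/dc/elements/1.1/">
      <xsl:output method="html"/>
      <!-- render every resource as an HTML section; references become plain links -->
      <xsl:template match="/rdf:RDF">
        <html><body>
          <xsl:for-each select="*">
            <h2><xsl:value-of select="@rdf:about"/></h2>
            <p><xsl:value-of select="dc:title"/></p>
            <xsl:for-each select="*[@rdf:resource]">
              <a href="{@rdf:resource}"><xsl:value-of select="local-name()"/></a>
            </xsl:for-each>
          </xsl:for-each>
        </body></html>
      </xsl:template>
    </xsl:stylesheet>

Each rdf:resource attribute turns into an ordinary hyperlink, so following links between data objects works exactly as on the conventional web.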
6.2 Faceted browsing using Longwell
The main objective of the project at hand has been to map two specialized data models to a data model that conforms to the CIDOC Conceptual Reference Model. This data can be shared and processed within multiple contexts. However, this does not by itself answer for the existence of tools that can process the data in a way that is meaningful and contributes to solving a specific scientific problem. One fundamental step towards this objective has been visualizing the data as soon as possible, to get a better impression of how Semantic Web tools could deal with the data to be ingested.
Longwell is a Semantic Web browser using the faceted browsing paradigm explored by the Flamenco Project at Berkeley [46, 19]. This paradigm assigns to each item a number of category terms drawn from one or more facets. A facet is a set of categories: archaeological artefacts could, for example, be classified under a facet material with categories such as marble and bronze. Unfortunately, Flamenco has its own proprietary data model and mark-up format; therefore, it cannot ingest RDF metadata that is published on the World Wide Web.
Longwell is a web-based Semantic Web browser and runs on a standalone basis or within the context of a Java servlet container like Apache Tomcat [22]. Longwell is highly configurable in how it presents data to the user. The Fresnel 3 display vocabulary can be used to change the appearance of items that are displayed within the browser [61]. Currently most RDF browsers rely on their individual methods to approach two issues: selecting what information of an RDF graph will be displayed, and how the data will be formatted. Fresnel can be used to facilitate concept-oriented browsing by explicitly displaying links to related objects. Listing 6.1 shows an abbreviated example of how the Fresnel language was used to tailor the output.
Listing 6.1: Fresnel configuration code in Notation3 (N3).

    @prefix fresnel: <http://www.w3.org/2004/09/fresnel#> .
    @prefix rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix facets: <http://simile.mit.edu/2006/01/ontologies/fresnel-facets#> .
    @prefix crm: <http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#> .
    @prefix : <#> .

    :facets a facets:FacetSet ;
        facets:types facets:allTypes ;
        facets:facets ( rdf:type ) .

    :cidocFacets rdf:type facets:FacetSet ;
        facets:types ( crm:E24.Physical_Man-Made_Thing ) ;
        facets:facets (
            crm:P67B.is_referred_to_by
            crm:P53F.has_former_or_current_location
            crm:P44F.has_condition
            crm:P46B.forms_part_of
            crm:P103F.was_intended_for
            crm:P45F.consists_of
        ) .

    :cidocObjectLens rdf:type fresnel:Lens ;
        fresnel:purpose fresnel:defaultLens ;
        fresnel:classLensDomain crm:E24.Physical_Man-Made_Thing ;
        fresnel:showProperties (
            crm:P3F.has_note
            crm:P103F.was_intended_for
            crm:P53F.has_former_or_current_location
            crm:P44F.has_condition
            crm:P45F.consists_of
            crm:P67B.is_referred_to_by
            crm:P46B.forms_part_of
            crm:P138B.has_representation
        ) ;
        fresnel:group :gr .

    :cidocObjectImageFormat rdf:type fresnel:Format ;
        fresnel:propertyFormatDomain crm:P138B.has_representation ;
        fresnel:value fresnel:image ;
        fresnel:label "Images" ;
        fresnel:group :gr .

    :gr rdf:type fresnel:Group ;
        fresnel:label "CIDOC CRM standard group" ;
        fresnel:stylesheetLink <http://pentheus.perseus.tufts.edu/crm.css> .

3 The term Fresnel refers to the French physicist Augustin-Jean Fresnel who constructed a special type of faceted lens for lighthouses.
Figures 6.1 and 6.2 exemplify how the Fresnel language can be used to change the appearance of data objects within the browser, including the display of images.
Figure 6.1: The Longwell Semantic Web browser, unconfigured.
Longwell is an example of using the underlying data model to control the user interface component of an application. In this case the CIDOC CRM is used both for internal information representation and for external user interface generation. If the underlying data model is changed or extended, the graphical user interface component automatically reflects these changes without any additional effort. Kushro and Tjoa found that Longwell, compared to other browsing and visualization tools, is one of the more scalable tools [38]. According to their experiments, Longwell is able to handle more than 500,000 triples. Currently, about 40 fields and one link to the picture database are mapped to RDF/XML for each of Perseus' 6,000 database records. This results in about 401,000 RDF triples, an amount of data that has been indexed within a couple of hours; the data-set could be browsed with good performance. For Arachne, however, ten fields of the main object table with links to geographic entities and bibliographic information have been mapped, resulting in about 2,402,000 triples for about 60,000 archaeological objects, 6,000 bibliographic entries, and 5,000 records with place information. This amount of data could not be ingested into the Longwell browser; the ingestion process was stopped after 109 hours of computing time on a Mac Pro (3.0 GHz
Quad-Core Intel Xeon 5300, 2 GB main memory).
Figure 6.2: The Longwell Semantic Web browser, configured with Fresnel.
Performance experiments with a native in-memory store turned out to be promising, but there are other alternatives that should be explored as well. A larger integration project for archaeological data would easily reach a magnitude of more than 30 million RDF triples. Portwin and Parvatikar state that the Jena API scaled up to 200 million triples during their project [51]. 4 This amount of triples is enough for a small cultural domain but not for huge amounts of data worldwide.
Unfortunately, Longwell does not support any inferencing on the underlying ontology. Even when the CIDOC CRM definitions were ingested together with Perseus' and Arachne's metadata, no links from data objects to their defining classes were discovered and indexed. Longwell thus completely ignores the concepts of generalization and inheritance. For example, Longwell does not allow displaying all persistent physical items and non-material products of human activity by selecting E71 Man-Made Stuff, the class under which they are subsumed; the user rather has to formulate a concatenated query that includes both classes. This prevents users from exploiting some of the most fundamental advantages of ontologies and thesauri.
4 At http://www.mkbergman.com/?p=227, M. K. Bergmann states that 250 million triples currently is the high-water mark for Semantic Web data.
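A small Notation3 sketch makes the missing inference step explicit. Class names follow the CIDOC CRM v4.2 RDFS naming used in Listing 6.1; the object URI is hypothetical:

    @prefix rdf:  <http://www.w3.org/1999/02/22-rdf-syntax-ns#> .
    @prefix rdfs: <http://www.w3.org/2000/01/rdf-schema#> .
    @prefix crm:  <http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs#> .

    # asserted in the mapped data and in the ingested CRM definitions
    <http://example.org/object/4711> rdf:type crm:E24.Physical_Man-Made_Thing .
    crm:E24.Physical_Man-Made_Thing rdfs:subClassOf crm:E71.Man-Made_Stuff .

    # an RDFS-aware store would additionally entail
    #   <http://example.org/object/4711> rdf:type crm:E71.Man-Made_Stuff .
    # Longwell only indexes the asserted triples, so a facet on E71 misses the object.

Without this entailment, every query along the class hierarchy has to be spelled out by hand.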
Chapter 7
Conclusion
After evaluating functional requirements of digital scholarship, some building blocks of a future Cyberinfrastructure have been introduced. A distinction has been made that conceptually separates instances from entities. To conduct serious research, scientists need to refer to instances within primary sources to give evidence for their argumentation. They also need to make unambiguous assertions about, for example, historical places and persons. Thus, there is a need for a system that enables scientists to refer to specific entities. A complex software architecture including authority naming services and institutional repositories that build upon Semantic Web concepts could provide the functionality needed. Additionally, standards that facilitate networked knowledge organization systems have been examined. To better understand the different conceptual and physical elements of the overall architecture, a basic mapping workflow has been established, ranging from data extraction, through cleaning and mapping, to visual presentation in the Longwell Semantic Web browser. Most current data models cannot instantly deliver their data in a way that can be processed for Semantic Web purposes. Common problems comprise dirty and unstructured data that cannot be easily extracted. Additionally, many Semantic Web concepts are still not well understood and are complicated to implement using state-of-the-art web-server technology.
The Perseus art and archaeology database contains approximately 6,000 data objects. Each object is described by a subset of altogether 102 database fields. Some of these fields are administrative and only used internally, so that 94 fields qualify for mapping. Since the database hosts a high diversity of objects ranging from coins to buildings, only 34 fields were found to be relevant for all data objects. Therefore, the mapping experiment started with mapping those fields to the CIDOC CRM. Three fields contained structured bibliographic entities that could easily be extracted by trivial pre-processing. Four fields, however, contained mostly unstructured text with valuable information about places and people that has not been extracted. The mapping workflow certainly needs better automation
and options for plugging in data quality and cleaning tools. The compilation of further and better mapping rules as well as pre-processing components will be an iterative and ongoing endeavor. The experience gained with mapping Perseus data will help to better map the more than 100,000 data objects of Arachne. As a test case, the most important fields of three central Arachne tables (objekt, literatur, ort) have also been mapped to the CIDOC CRM.
Some problems with extracting data from both databases originate from fields with an implicit internal structure. For bibliographic information, the structure could be automatically discovered and items extracted. Because of poor data quality, some information had to be dropped and could not be mapped to the CIDOC CRM. The application of tools that can fix common data quality problems would lead to a more comprehensive mapping result. A couple of tools are freely available in the public domain, and commercial solutions also exist. But most problems require domain-specific knowledge and could probably be handled better by specialized software. However, investing in internal data cleaning and re-organization of data models would help with mapping cultural heritage databases to the CIDOC CRM.
Introduction of multilingual record-linkage tools could assist with automatically linking data objects that belong together. Perseus and Arachne have slightly overlapping collections. In this context, digital surrogates that refer to the same entity should be linked. This would result in accumulating multilingual metadata for these objects. Even if cross-language information retrieval tools are introduced, all metadata should internally be available in a single language, for example English.
Record linkage is dependent on entity identification. If two bundles of metadata can be identified as referring to the same entity, the records can be linked as belonging together. The objective has to be not only to identify that a specific string refers to a person or a place, but also which specific place or person it refers to. This will be carried out by assigning a common global identifier. The overall aim is to automatically link data objects that conceptually belong together. Entity identification will become more powerful if it is done with the assistance of authority naming services. Hooking text parsers up to these services could extract information about people and places from full-text descriptions. However, in the course of the project, record linkage could only be established on a very basic level; future research should concentrate on this area. Advanced record-linkage applications seem promising for contributing linked Semantic Web data.
For the time being, Perseus data has been published for harvesting at http://athena.perseus.tufts.edu/collection in three different representations (RDF/XML, HTML and Collection service XML). This data-set should be ingested into institutional repository software that is able to handle huge amounts of data.
Each data object should be equipped with a persistent identifier. Suitable repository software could be the Fedora institutional repository software or a simple triple store with an additional publishing component. Fedora bears the advantage of delivering many data management tools and facilitating long-term preservation. For publishing the data to a large audience, Fedora implements the OAI Protocol for Metadata Harvesting. Large repositories that facilitate discovery of RDF data are emerging. 1
The development of new and flexible ways of knowledge representation will enable historical cultural scientists to refer to, access, and manage vast amounts of densely linked data objects as surrogates for existing cultural heritage objects. The most obvious benefit of putting granular data online is providing rapid and economic access not only to documents but also to granular metadata and large knowledge organization systems. To realize this vision, scientists will need to encode their documentation in a way that can be processed by machines and reused by other scientists. Moreover, the vast amount of material that has been published traditionally, in print or even handwritten, should be digitized in a way that contributes to the linked data idea. Named entity identification systems in addition to full-text parsers could be adapted to this task.
Although the Semantic Web is obviously an emerging field, current frameworks and browsers leave many issues unaddressed. Each tool is targeted at a certain display paradigm and provides only limited scalability, being suitable for research in the lab but not for a large production environment. Because Longwell separates data and display, it provides a promising paradigm for future research, provided that it is able to overcome its current scalability issues. Frameworks like the Jena API also show promise concerning scalability. However, the individual communities need to choose a suitable display paradigm that provides useful access and contributes to their research objectives.
All tools that have been described so far could also be applied to publications that exist in digital form, either to link archaeological objects and ancient texts to secondary sources or to automatically create data objects in bulk. A fruitful area of research surely will be the development of tools that provide Intelligent Information Access for digital libraries. These are technologies that make use of human knowledge or human-like intelligence to provide effective and efficient access to large, distributed, heterogeneous and multilingual (and at this time mainly [but not only] text-based) information resources and to satisfy users' information needs [6].
1 http://pingthesemanticweb.com/ is a service which acts as a concentrator for multiple contributing repositories.
Bibliography
[1] A. Babeu, D. Bamman, G. Crane, R. Kummer, and G. Weaver. Named entity identification and cyberinfrastructure. In Proceedings of the 11th European Conference on Research and Advanced Technology for Digital Libraries (ECDL 2007), to appear, pages 259–270. Springer Verlag, September 2007.
[2] M. Baca and P. Harpring. Categories for the Description of Works of Art. http://www.getty.edu/research/conducting_research/standards/cdwa/, August 2006.
[3] J. Bekaert, X. Liu, H. Van de Sompel, C. Lagoze, S. Payette, and S. Warner. Pathways core: a data model for cross-repository services. In JCDL '06: Proceedings of the 6th ACM/IEEE-CS joint conference on Digital Libraries, page 368, New York, NY, USA, 2006. ACM Press.
[4] O. Boonstra, L. Breure, and P. Doorn. Past, present and future of historical information science. Historical Social Research / Historische Sozialforschung, 29(2):4–131, 2004.
[5] Dan Brickley and R. V. Guha. RDF Vocabulary Description Language 1.0: RDF Schema. http://www.w3.org/TR/rdf-schema/, February 2004.
[6] J. Chen, F. Li, and C. Xuan. A preliminary analysis of the use of resources in intelligent information access research. In Proceedings of the 69th Annual Meeting of the American Society for Information Science and Technology (ASIST), volume 43, 2006.
[7] Art Museum Image Consortium. AMICO Data Specification. http://www.amico.org/AMICOlibrary/dataspec.html, 2004.
[8] G. Crane, D. Bamman, L. Cerrato, A. Jones, D. Mimno, A. Packel, D. Sculley, and G. Weaver. Beyond digital incunabula: Modeling the next generation of digital libraries. In Proceedings of the 10th European Conference on Research and Advanced Technology for Digital Libraries
(ECDL 2006), volume 4172 of Lecture Notes in Computer Science. Springer, 2006.
[9] G. Crane, C. E. Wulfman, L. M. Cerrato, A. Mahoney, T. L. Milbank, D. Mimno, J. A. Rydberg-Cox, D. A. Smith, and C. York. Towards a cultural heritage digital library. In Proceedings of the 3rd ACM/IEEE-CS Joint Conference on Digital Libraries, JCDL 2003, pages 75–86, Houston, TX, June 2003.
[10] N. Crofts, M. Dörr, T. Gill, S. Stead, and M. Stiff. Definition of the CIDOC object-oriented conceptual reference model. Technical report, The CIDOC CRM Special Interest Group, 2005.
[11] DAI. Deutsches Archäologisches Institut. http://www.dainst.org, August 2007.
[12] H. Van de Sompel, C. Lagoze, J. Bekaert, X. Liu, S. Payette, and S. Warner. An Interoperable Fabric for Scholarly Value Chains. D-Lib Magazine, 12(10), October 2006.
[13] H. Van de Sompel, M. L. Nelson, C. Lagoze, and S. Warner. Resource Harvesting within the OAI-PMH Framework. D-Lib Magazine, 10(12), 2004.
[14] H. Van de Sompel, S. Payette, J. Erickson, C. Lagoze, and S. Warner. Rethinking Scholarly Communication: Building the System that Scholars Deserve. D-Lib Magazine, 10(9), September 2004.
[15] M. Dörr. The CIDOC conceptual reference module [sic!]: An ontological approach to semantic interoperability of metadata. AI Magazine, 24(3):75–92, 2003.
[16] M. Dörr. The CIDOC CRM, a Standard for the Integration of Cultural Information. http://cidoc.ics.forth.gr/docs/crm_for_gothenburg.ppt, November 2005.
[17] M. Dörr and P. LeBoeuf. FRBR object-oriented definition and mapping to FRBR-ER. http://cidoc.ics.forth.gr/docs/frbr_oo/frbr_docs/FRBR_oo_V0.8.1c.pdf, May 2007.
[18] EPOCH. A Survey of Documentation Standards in the Archaeological and Museum Community. http://hdl.handle.net/2313/91, October 2006.
[19] Flamenco. The Flamenco Search Interface Project. http://flamenco.berkeley.edu/index.html, May 2007.
[20] R. Förtsch. ARACHNE - Datenbank und kulturelle Archive des Forschungsarchivs für Antike Plastik Köln und des Deutschen Archäologischen Instituts. http://arachne.uni-koeln.de/inhalt_text.html, August 2007.
[21] R. Förtsch. Forschungsarchiv für Antike Plastik. http://www.klassarchaeologie.uni-koeln.de/abteilungen/mar/forber.htm, August 2007.
[22] The Apache Software Foundation. Apache Tomcat. http://tomcat.apache.org/, May 2007.
[23] The Apache Software Foundation. The Apache HTTP Server Project. http://httpd.apache.org/, August 2007.
[24] H. Galhardas, D. Florescu, D. Shasha, and E. Simon. AJAX: An Extensible Data Cleaning Tool. In SIGMOD '00: Proceedings of the 2000 ACM SIGMOD international conference on Management of data, page 590, New York, NY, USA, 2000. ACM Press.
[25] P. Galuzzi. The virtual museum of the future. In Semantic Web for scientific and cultural organisations: results of some early experiments, June 2003.
[26] M. Genereux and F. Niccolucci. Extraction and mapping of CIDOC-CRM encodings from texts and other digital formats. In The 7th International Symposium on Virtual Reality, Archaeology and Cultural Heritage (VAST), Nicosia, Cyprus, 2006.
[27] V. Geroimenko and C. Chen. Visualizing Information Using SVG and X3D: XML-Based Technologies for the XML-Based Web. Springer, London et al., 2nd edition, 2004.
[28] P. Gietz, A. Aschenbrenner, S. Büdenbender, F. Jannidis, M. W. Küster, C. Ludwig, W. Pempe, T. Vitt, W. Wegstein, and A. Zielinski. TextGrid and eHumanities. In E-SCIENCE '06: Proceedings of the Second IEEE International Conference on e-Science and Grid Computing, pages 133–141, Washington, DC, USA, 2006. IEEE Computer Society.
[29] T. R. Gruber. Towards Principles for the Design of Ontologies Used for Knowledge Sharing. In N. Guarino and R. Poli, editors, Formal Ontology in Conceptual Analysis and Knowledge Representation, Deventer, The Netherlands, 1993. Kluwer Academic Publishers.
[30] I. Herman, R. Swick, and D. Brickley. Resource Description Framework (RDF) / W3C Semantic Web Activity. http://www.w3.org/RDF/, January 2007.
[31] ICS-FORTH. Partial Definition of the CIDOC Conceptual Reference Model version 4.2 in RDF. http://cidoc.ics.forth.gr/rdfs/cidoc_v4.2.rdfs, June 2005.
[32] IFLA Study Group on the Functional Requirements for Bibliographic Records. Functional Requirements for Bibliographic Records: Final Report, volume 19 of UBCIM Publications - New Series. K. G. Saur, München, 1998.
[33] Open Archives Initiative. Open Archives Initiative Object Reuse and Exchange. http://www.openarchives.org/ore/, August 2007.
[34] The Text Encoding Initiative. TEI: Yesterday's information tomorrow. http://www.tei-c.org/, August 2007.
[35] Getty Institute. The Getty Thesaurus of Geographic Names Online. http://www.getty.edu/research/conducting_research/vocabularies/tgn/index.html, August 2007.
[36] H. Kondylakis, M. Dörr, and D. Plexousakis. Mapping Language for Information Integration. Technical report, ICS-FORTH, December 2006.
[37] R. Kummer. Integrating Data from The Perseus Project and Arachne using the CIDOC CRM: An Examination from a Software Developer's Perspective. In Exploring the Limits of Global Models for Integration and Use of Historical and Scientific Information - ICS-FORTH Workshop, Heraklion, Crete, October 2006. ICS-FORTH.
[38] S. Kushro and A. Tjoa. Fulfilling the Needs of a Metadata Creator and Analyst: An Investigation of RDF Browsing and Visualization Tools. Canadian Semantic Web, pages 81–101, 2006.
[39] C. Lagoze and H. Van de Sompel. The Open Archives Initiative: Building a Low-Barrier Interoperability Framework. In ACM/IEEE Joint Conference on Digital Libraries, pages 54–62, 2001.
[40] C. Lagoze, S. Payette, E. Shin, and C. Wilper. Fedora: An Architecture for Complex Objects and their Relationships. http://arxiv.org/abs/cs.dl/0501012, August 2005.
[41] J. Maeda. The Laws of Simplicity (Simplicity: Design, Technology, Business, Life). The MIT Press, August 2006.
[42] D. L. McGuinness and F. van Harmelen. OWL Web Ontology Language Overview. http://www.w3.org/TR/owl-features/, February 2004.
[43] B. Metcalfe. Metcalfe's Law: A Network Becomes More Valuable as it Reaches More Users. InfoWorld, October 1995.
[44] D. Mimno, G. Crane, and A. Jones. Hierarchical catalog records: Implementing a FRBR catalog. D-Lib Magazine, 11(10), 2005.
[45] American Council of Learned Societies. Our Cultural Commonwealth: The final report of the ACLS Commission on Cyberinfrastructure for the Humanities and Social Sciences. http://www.acls.org/cyberinfrastructure/, December 2006.
[46] Massachusetts Institute of Technology. Longwell. http://simile.mit.edu/wiki/Longwell, August 2007.
[47] A. M. Ouksel and A. P. Sheth. Semantic Interoperability in Global Information Systems: A Brief Introduction to the Research Area and the Special Section. SIGMOD Record, 28(1):5–12, 1999.
[48] G. Patton. FRANAR: A Conceptual Model for Authority Data. Cataloging & Classification Quarterly, 38(3/4):91–104, November 2004.
[49] K. Popper. Logik der Forschung. Springer, 1935.
[50] D. Porter, W. Du Casse, J. W. Jaromczyk, N. Moore, R. Scaife, and J. Mitchell. Creating CTS Collections. In Digital Humanities, pages 269–274, 2006.
[51] K. Portwin and P. Parvatikar. Scaling Jena in a commercial environment: The Ingenta MetaStore Project. In 2006 Jena User Conference, 2006.
[52] N. Guarino. Formal Ontology and Information Systems. In N. Guarino, editor, Proceedings of the 1st International Conference on Formal Ontology in Information Systems (FOIS '98), 1998.
[53] B. Robertson. The Historical Event Markup and Linking Project. http://www.heml.org/heml-cocoon/, August 2007.
[54] N. Shadbolt, T. Berners-Lee, and W. Hall. The Semantic Web Revisited. IEEE Intelligent Systems, 21(3):96–101, 2006.
[55] N. Smith. Collection services. http://chs75.harvard.edu/projects/diginc/techpub/collections, August 2007.
[56] D. R. Snow, M. Gehagan, C. L. Giles, K. G. Hirth, G. R. Milner, P. Mitra, and J. Z. Wang. Cybertools and Archaeology. Science, 311(5763):958–959, February 2006.
[57] R. Stein, J. Gottschewski, R. Heuchert, A. Ermert, M. Hagedorn-Saupe, H.-J. Hansen, C. Saro, R. Scheffel, and G. Schulte-Dornberg. Das CIDOC Conceptual Reference Model: Eine Hilfe für den Datenaustausch? Bericht der AG Datenaustausch, Fachgruppe Dokumentation im Deutschen Museumsbund. Mitteilungen und Berichte aus dem Institut für Museumskunde, 2005.
[58] R. Tansley, M. Bass, D. Stuve, M. Branschofsky, D. Chudnov, G. McClellan, and M. Smith. The DSpace Institutional Digital Repository System: Current Functionality. In JCDL '03: Proceedings of the 3rd ACM/IEEE-CS joint conference on Digital libraries, pages 87–97, Washington, DC, USA, 2003. IEEE Computer Society.
[59] Research Councils UK. About the UK e-Science Programme. http://www.rcuk.ac.uk/escience/default.htm, August 2007.
[60] M. Uschold. Where Are the Semantics in the Semantic Web? AI Magazine, 24(3):25–36, 2003.
[61] W3C. Fresnel Display Vocabulary for RDF. http://www.w3.org/2005/04/fresnel-info/, November 2006.
[62] W3C. Simple Knowledge Organisation Systems (SKOS) Home Page. http://www.w3.org/2004/02/skos/, June 2007.
[63] H. Zhao and S. Ram. Entity identification for heterogeneous database integration: a multiple classifier system approach and empirical evaluation. Information Systems, 30(2):119–132, 2005.
Appendix
A DVD is attached to this thesis, containing the directories below.
/collections/ contains all mapped artifacts of the Perseus and Arachne art and artifact databases.
/docs/ contains the complete mapping documentation draft and this thesis as PDF documents.
/style-sheets/ contains the mapping implementations as style-sheets.
/scripts/ contains the shell scripts that were used.
A running Longwell browser has been installed at http://athena.perseus.tufts.edu:8080/.