Knowledge-Based Persistent Archives
SDSC TR: Knowledge-Based Persistent Archives

Reagan W. Moore
San Diego Supercomputer Center

Sponsored by the National Archives and Records Administration and the Advanced Research Projects Agency/ITO, Intelligent Metacomputing Testbed, ARPA Order D570, issued by ESC/ENS under contract F C-0020.

January 18, 2001
San Diego Supercomputer Center Technical Report
Copyright 2001, The Regents of the University of California
The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the Advanced Research Projects Agency or the U.S. Government.
Knowledge-based Persistent Archives

Reagan W. Moore
San Diego Supercomputer Center, La Jolla, CA

Abstract

The preservation of digital information for long periods of time is becoming feasible through the integration of archival storage technology from supercomputer centers, information models from the digital library community, and preservation models from the archivists' community. The supercomputer centers provide the technology needed to store the immense amounts of digital data that are being created, while the digital library community provides the mechanisms to define the context needed to interpret the data. The coordination of these technologies with preservation and management policies defines the infrastructure for a collection-based persistent archive [1]. This report discusses the use of knowledge representations to augment collection-based persistent archives.

1. Introduction

Supercomputer centers, digital libraries, and archival storage communities have common persistent archival storage requirements. Each of these communities is building software infrastructure to organize and store large collections of data. An emerging common requirement is the ability to maintain data collections for long periods of time. The challenge is to maintain the ability to discover, access, and display digital objects that are stored within the archive while the technology used to manage the archive evolves. We originally implemented a collection-based persistent archive [1] in which a description of the collection is stored along with the data. The approach focused on the development of infrastructure-independent representations for the information content of the collection, interoperability mechanisms to support migration of the collection onto new software and hardware systems, and the use of a standard tagging language to annotate the information content.
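As a concrete illustration of the tagging approach, a digital object can be wrapped with its collection-defined attributes in a single self-describing structure. This is a minimal sketch only; the element names and attribute set below are hypothetical, not the report's actual DTD.

```python
import base64
import xml.etree.ElementTree as ET

def wrap_digital_object(data: bytes, attributes: dict) -> str:
    """Wrap the original bytes of a digital object, together with the
    attributes defined as relevant for the collection, in a single
    infrastructure-independent XML structure (illustrative tag names)."""
    obj = ET.Element("digital-object")
    meta = ET.SubElement(obj, "attributes")
    for name, value in attributes.items():
        attr = ET.SubElement(meta, "attribute", name=name)
        attr.text = str(value)
    # The payload is base64-encoded so arbitrary bytes survive as text.
    payload = ET.SubElement(obj, "data", encoding="base64")
    payload.text = base64.b64encode(data).decode("ascii")
    return ET.tostring(obj, encoding="unicode")

xml_form = wrap_digital_object(
    b"original record bytes",
    {"title": "Sample record", "creator": "NARA", "date": "1998-07-04"},
)
```

Because the wrapped form is plain tagged text, it can be parsed back on any future platform that understands the tagging language, independent of the archive software that stored it.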
The process used to ingest a collection, transform it into an infrastructure-independent form, and recreate the collection on new technology is shown schematically in Figure 1.
Figure 1. Persistent Collection Process

Two phases are emphasized: the archiving of the collection, and the retrieval or instantiation of the collection onto new technology. The diagram shows the multiple steps that are necessary to preserve digital objects through time. The steps form a cycle that can be used for migrating data collections onto new infrastructure as technology evolves. The technology changes can occur at the system level, where archive, file, compute, and database software evolves, or at the information-model level, where formats, programming languages, and practices change. The ultimate goal is to maintain not only the bits associated with the original data, but also the context that permits the data to be interpreted. We rely on the use of collections to define the context to associate with digital data. Each digital object is maintained as a tagged structure that includes the original bytes of data, as well as attributes that have been defined as relevant for the data collection. A collection-based persistent archive is therefore one in which the organization of the collection is archived simultaneously with the digital objects that comprise the collection. A persistent collection requires the ability to dynamically recreate the collection on new technology. Scalable archival storage systems are used to ensure that sufficient resources are available for continual migration of digital objects to new media. The software systems that interpret the infrastructure-independent representation for the collections are based upon generic digital library systems, and are migrated explicitly to new platforms. In this system, the original representation of the digital objects and of the collections does
not change. The maintenance of the persistent archive is then achieved through application of archivist policies that govern the rate of migration of the objects and the collection instantiation software.

2. Knowledge-based Archives

The preservation of the context to associate with digital objects is the dominant issue for knowledge-based persistent archives. The context is traditionally defined through specification of attributes that are associated with each digital object. The context is also defined through the implied relationships that exist between the attributes, and the preferred organization of the attributes in user interfaces for viewing the data collection. Management of the collection context is made difficult by the rapid change of technology. Software systems used to manage collections are changing on a five- to ten-year time scale. Of greater concern is that the information tagging languages used to annotate digital objects are also changing. The persistent archiving of a collection must therefore also handle the evolution of the information markup language. We have characterized persistent archives in prior publications [1,2] as collection-based repositories. We now recognize the need to broaden the archive characterization to knowledge-based repositories. Not only the information content, but also the processing steps used to accession the collection must be preserved. Conceptually, one can view the accessioning process as the equivalent of the process needed to instantiate the collection on new technology. If the accessioning process can be captured in an infrastructure-independent representation, the same process can be used to manage the migration of the collection to new markup languages, archival data repositories, information repositories, and knowledge repositories.
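One way to capture an accessioning process in an infrastructure-independent form is to record the workflow itself as data, an ordered list of named transformations, so the same steps can be replayed later on new technology. This is a sketch under the assumption that each step is a pure text transformation; the step names are illustrative, not the project's actual process.

```python
# Each accessioning step is a named, self-contained transformation.
def strip_whitespace(text: str) -> str:
    """Normalize runs of whitespace in the ingested record."""
    return " ".join(text.split())

def tag_record(text: str) -> str:
    """Wrap the normalized text in a (hypothetical) record tag."""
    return f"<record>{text}</record>"

# The workflow is archived as data: an ordered list of step names.
# Migrating the collection later means re-running the same list,
# possibly with new implementations bound to the same names.
WORKFLOW = ["strip_whitespace", "tag_record"]
STEPS = {"strip_whitespace": strip_whitespace, "tag_record": tag_record}

def replay(workflow, raw: str) -> str:
    """Re-apply the archived accessioning steps to a raw record."""
    for name in workflow:
        raw = STEPS[name](raw)
    return raw

archived = replay(WORKFLOW, "  A   digital   record  ")
```

Because the workflow description is separate from the code that implements each step, a future migration can rebind the step names to new software while preserving the recorded process.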
The archival description of a collection then must include not only contextual information about the digital objects, but also knowledge about the relationships used to derive the contextual information. The architecture that is needed to implement a knowledge-based persistent archive is shown in Figure 2.
Figure 2. Knowledge-based Persistent Archive (rows: Knowledge, Information, Data; columns: Ingest, Manage, Access)

The three columns represent the technologies needed to manage the ingestion process, manage the persistent archive, and manage the access environment. The three rows represent the infrastructure needed to manage knowledge, information, and data. Knowledge is represented as relationships between domain concepts. Information is represented as attributes about digital objects within the collection. The digital objects are images of the reality described by the domain concepts. Ingestion corresponds to the steps of knowledge mining/tagging, information mining/tagging, and digital object organization/storage. Persistent archive management requires infrastructure to store the digital objects (archives), information repositories to hold the metadata (databases), and knowledge repositories to organize the relationships (logic systems). The access environment provides mechanisms to query the collection at the data level through feature extraction, at the information level through database queries, and at the knowledge level through domain concepts. Just as the data management infrastructure is intended to provide access without having to know data object names, the knowledge access infrastructure is intended to provide access without having to know the explicit metadata attribute names used to organize the collection database.
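Knowledge-level access of this kind can be sketched with a topic-map-like table that maps domain concepts onto the schema attribute names actually used in the collection database, so a user can pose a concept-level query without knowing the attributes. All concept, attribute, and record names here are hypothetical.

```python
# Topic-map-like mapping from domain concepts to schema attribute names.
CONCEPT_MAP = {
    "author": ["creator", "submitted_by"],
    "subject": ["topic_code", "keywords"],
}

# A toy information repository: records keyed by schema attribute names.
RECORDS = [
    {"creator": "R. Moore", "topic_code": "archives"},
    {"submitted_by": "C. Baru", "topic_code": "mediation"},
]

def concept_query(concept: str, value: str):
    """Translate a knowledge-level (concept) query into the
    attribute-level queries the collection schema understands."""
    attributes = CONCEPT_MAP.get(concept, [])
    return [r for r in RECORDS
            if any(r.get(a) == value for a in attributes)]

hits = concept_query("author", "C. Baru")
```

When the collection migrates to a new schema, only the concept map needs updating; concept-level queries issued by users remain unchanged.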
The knowledge-based persistent archive requires software infrastructure to support interoperability between different implementations of ingestion, management, and access infrastructure components. This is shown in Figure 3. Between ingest platforms and management repositories, standards are needed to define consistent tagging mechanisms for knowledge (the XML Topic Map DTD [3], or XTM DTD), for information (XML DTDs [4]), and for data organization (logical folders and physical containers). Between management repositories and access platforms, standard query languages are needed for knowledge-based access (a knowledge query language or rule manipulation language), attribute-based access (the EMCAT SQL generator or MIX mediator [5]), and feature-based access (application of procedures within a computational grid). Between the knowledge and information environments, a standard representation is needed to map from concepts to attributes, such as topic maps or model-based access systems. Between the information and data storage environments, a data handling system is needed to map from attributes to storage locations, such as the SDSC Storage Resource Broker [6].

Figure 3. Persistent Archive Interfaces (the Figure 2 architecture annotated with the interface standards: XTM DTD, XML DTD, MCAT/HDF, KQL rules, EMCAT/MIX, and grids)
Persistence is achieved through the infrastructure middleware (shown in Figure 3 as the blue grid) that links accession platforms, management repositories, and access platforms. The same middleware is needed to support grid environments (such as computation on distributed data collections) and digital library environments (such as curricula support in the National Science, Mathematics, Engineering, and Technology Education Digital Library, NSDL). This architecture has been proposed to both the Grid Forum and the NSDL, and may be the architecture that integrates knowledge management activities from these communities with the persistent archive community.

2.1 Archive Accessioning Process:

Of interest is the emerging need for knowledge management, as well as information management and data management, when ingesting collections. When we look at collections, we see multiple interfaces where knowledge is required to adequately describe relationships inherent within the collection. We have been looking at the preservation of relationships that are needed to describe:

- implied knowledge (interpretation of fields)
- structural knowledge (topology associated with digital line graphs)
- domain knowledge (relationships between domain concepts)
- procedural knowledge (workflow creation steps for digital objects)
- presentation knowledge (support for knowledge-based queries)

One way to accomplish the goal of knowledge-based access is to use the ISO Topic Maps standard to maintain mappings between domain concepts and the attribute names used in the collection schema. It is interesting to note that relationships are implicit between each of the nine infrastructure components defined in Figure 2. The relationships either define rules that can be applied to the collection, or quantify associations that can be made between collection elements.
Examples are:

Relationships that quantify rules:
- Rules for defining collection attributes
- Rules for organizing attributes into a schema
- Rules for feature extraction
- Rules governing data set creation

Relationships that quantify associations:
- Organization of concepts into topic maps
- Ontology mapping between concept maps
- Mapping of concepts to collection attributes
- Mapping of concepts to feature extraction rules
- Mapping between attributes and data fields (semantics)
- Semantic mapping between collections
- Mapping between attributes and storage
- Mapping between attributes and features
- Clustering of data into containers

The relationships can be separated into four broad classes:

Semantic/logical relationships. Relationships can be defined to map from the concepts used to describe the collection to the attribute tags used to annotate the collection. Semantic relationships can also be defined between the domain-specific concepts as knowledge bases or semantic maps.

Procedural/temporal relationships. The transformations that are applied to the collection to create the archival form constitute a workflow that represents the ingestion process. The temporal order and explicit transformations can be represented as a set of states through which the collection is processed.

Structural/spatial relationships. The internal organization of digital objects within the collection can be represented as a structural ordering of the tagged elements. The representation of the structure can be expressed using the same types of characterization as needed for spatially tagged data.

Functional relationships. For scientific applications, analysis algorithms are needed to identify features that might be associated with a digital object. The expression of the relationship between a named feature and its presence within a digital object will require the ability to archive mathematical expressions.

In the ingestion process, a major challenge has been the need to differentiate between artifacts and implied knowledge. Essentially, the steps of refining the description of a collection by including more attributes must be integrated with the identification of anomalies. To make progress, we apply the concepts of occurrence tagging and closure to the archived collections. Occurrence tagging is the explicit annotation of the location of each tagged attribute along with the associated value.
This provides a representation that captures all of the information content without imposing constraints on permissible attribute values. Closure is the analysis of the occurrences to identify both completeness and consistency. Completeness is evaluated by verifying that all attributes are populated and that the information content is fully annotated. Consistency checks that all attribute values fall within defined ranges. Consistency can be checked by construction of inverse indexes that point to all occurrences of each attribute value. It is necessary to iterate between knowledge extraction and attribute mining. We illustrate this through application of the ingestion process shown in Figure 4.
- Define a representation of the concepts inherent within the collection.
- Build a concept map that identifies all of the possible attributes to associate with each concept.
- Tag the collection to identify attributes for each of the possible fields.
- Restructure the concept map to eliminate unused fields, specialize classes, rearrange class attributes, etc.
- Mine the collection to identify differences between bill versions, identify missing attributes, identify implicit attributes, and identify invalid data (such as duplicated pages).

Figure 4. Ingestion Process (knowledge generation, information generation, attribute selection, attribute tagging, occurrence tagging, inverse indexing, closure, view management, and data organization, applied from the accession template to the collection)

At one time, the hope was to be able to ingest a collection in a single pass. Based upon the above steps, at least three analyses are needed: to mine knowledge, to mine information, and to organize the data. Depending upon the number of iterations used to refine the concept space, additional passes through the data may be necessary. It is still an area of debate whether it will be possible, in general, to differentiate between concept map refinement and error analysis. These steps will have to be done jointly for most collections.
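The occurrence tagging and closure analysis used during ingestion can be sketched as follows. This is a minimal illustration with a made-up record layout, not the project's implementation: occurrences are (occurrence, attribute, value) triples, completeness checks that every attribute is populated, and an inverse index over attribute values exposes anomalies.

```python
from collections import defaultdict

# A toy slice of a tagged collection (hypothetical attributes).
records = [
    {"bill_id": "HR-101", "session": "105"},
    {"bill_id": "HR-102", "session": "105"},
    {"bill_id": "HR-103", "session": "1O5"},  # anomaly: letter O, not zero
]

# Occurrence tagging: annotate the location of every tagged attribute
# together with its value, as (occurrence, attribute, value) triples.
occurrences = [(i, attr, val)
               for i, rec in enumerate(records)
               for attr, val in rec.items()]

# Closure, part 1 -- completeness: every record populates every attribute.
all_attrs = {attr for _, attr, _ in occurrences}
complete = all(all_attrs <= rec.keys() for rec in records)

# Closure, part 2 -- consistency: an inverse index from each attribute
# value back to its occurrences makes out-of-range values easy to find.
inverse = defaultdict(list)
for occ, attr, val in occurrences:
    inverse[(attr, val)].append(occ)

bad = [(attr, val) for (attr, val) in inverse
       if attr == "session" and not val.isdigit()]
```

The inverse index also supports the iteration between knowledge extraction and attribute mining: each anomalous value points back to the exact occurrences that must be re-examined or re-tagged.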
Note that once the data has been wrapped into XML, all integrity checking, knowledge mining, derivation of a "consolidated version", etc., can be seen as (albeit very elaborate) queries against an XML collection. The interesting research issue is to find out how well XML query languages (including the UCSD/SDSC XMAS system) are able to express the analysis queries. Especially for integrity checking, logic-based XML query languages seem to be a good choice for an ingestion environment.

2.2 Archival Representation of Collections:

One of the results of the analysis of the collections provided by NARA was the realization that multiple views of a collection may need to be archived. Typical views include:

- Original form as submitted
- XML-tagged form
- Occurrence representation (occurrence, attribute, value)
- Knowledge-based representation (recreation of the original form from the occurrence representation). This view can be thought of as the noise-free representation of the original collection, based upon the knowledge and information content that was created during the accessioning process. This view can be designed to include white space and all anomalies if desired.
- Consolidated representation (elimination of all duplicated information)

By archiving descriptions of the processing steps needed to go between each of these views, one can guarantee that the same processing steps could be applied in the future to re-instantiate the collection on new technology, including new information and knowledge representations.

3. Relationships between NARA and other Agency projects:

There is a strong synergy between the development of persistent archive infrastructure for NARA, digital library development for NSF, and data grid development for DOE, NASA, and NLM. All of these research areas require the ability to manage knowledge, information, and data objects.
What has become apparent is that even though the requirements driving the infrastructure development for each agency are different, a uniform architecture is emerging that meets all agency requirements. The architecture shown in Figure 3 provides:

- Validation mechanism for the common data management architecture
- Validation mechanism for the differentiation between knowledge, information, and data, and the choice of representation standards
- Integration vehicle for tying together persistent archives with grid environments
- Integration vehicle for tying together grid environments with digital libraries
- Integration vehicle for tying together digital libraries with persistent archives

It is interesting to note the multiple projects that are building upon the architecture that is being developed in the NARA collaboration:

- NSF Digital Library Initiative, Phase 2
- NSF National SMET Education Digital Library
- NSF NPACI data grid for neuroscience brain image federation
- NASA Information Power Grid distributed data processing
- DOE ASCI Data Visualization Corridor remote data processing
- DOE Particle Physics Data Grid object replication
- NLM Digital Embryo Project data grid for image processing and storage
- NARA Persistent Archive

It is also interesting to note the iterative technology development cycle that links all of these projects. An original DARPA project developed the data handling capabilities as part of the Distributed Object Computation Testbed. The NASA IPG integrated the data handling technology with computational grid technology (common security environments). The NSF NPACI project integrated information management with data handling to support digital libraries. The DOE PPDG then applied the technology to support replica management across heterogeneous systems. And the NARA project applied the technology to manage migration of collections across evolving infrastructure technology.

Acknowledgements:

This research has been sponsored by the National Archives and Records Administration and Advanced Research Projects Agency/ITO, "Intelligent Metacomputing Testbed", ARPA Order No.
D570, issued by ESC/ENS under Contract #F C-0020, and by the Data Intensive Computing thrust area of the National Science Foundation project ASC National Partnership for Advanced Computational Infrastructure. The research topics have been investigated by the following members of the Data Intensive Computing Environment Group at the San Diego Supercomputer Center: Richard Marciano, Bertram Ludaescher, Ilya Zaslavsky, Amarnath Gupta, and Chaitan Baru.
References:

[1] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, "Collection-Based Persistent Digital Archives - Part 1," D-Lib Magazine, March 2000.
[2] Moore, R., C. Baru, A. Rajasekar, B. Ludascher, R. Marciano, M. Wan, W. Schroeder, and A. Gupta, "Collection-Based Persistent Digital Archives - Part 2," D-Lib Magazine, April 2000.
[3] ISO/IEC FCD Topic Maps.
[4] Extensible Markup Language (XML).
[5] Baru, C., V. Chu, A. Gupta, B. Ludäscher, R. Marciano, Y. Papakonstantinou, and P. Velikhov, "XML-Based Information Mediation for Digital Libraries," ACM Conference on Digital Libraries, Berkeley, CA, exhibition program.
[6] Baru, C., R. Moore, A. Rajasekar, and M. Wan, "The SDSC Storage Resource Broker," Proc. CASCON'98 Conference, Nov. 30 - Dec. 3, 1998, Toronto, Canada.
More informationDataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure. Arcot (RAJA) Rajasekar DICE/SDSC/UCSD
DataGrids 2.0 irods - A Second Generation Data Cyberinfrastructure Arcot (RAJA) Rajasekar DICE/SDSC/UCSD What is SRB? First Generation Data Grid middleware developed at the San Diego Supercomputer Center
More informationJOURNAL OF OBJECT TECHNOLOGY
JOURNAL OF OBJECT TECHNOLOGY Online at www.jot.fm. Published by ETH Zurich, Chair of Software Engineering JOT, 2008 Vol. 7, No. 8, November-December 2008 What s Your Information Agenda? Mahesh H. Dodani,
More informationThe Service Availability Forum Specification for High Availability Middleware
The Availability Forum Specification for High Availability Middleware Timo Jokiaho, Fred Herrmann, Dave Penkler, Manfred Reitenspiess, Louise Moser Availability Forum Timo.Jokiaho@nokia.com, Frederic.Herrmann@sun.com,
More informationInformation Services for Smart Grids
Smart Grid and Renewable Energy, 2009, 8 12 Published Online September 2009 (http://www.scirp.org/journal/sgre/). ABSTRACT Interconnected and integrated electrical power systems, by their very dynamic
More informationInfosys GRADIENT. Enabling Enterprise Data Virtualization. Keywords. Grid, Enterprise Data Integration, EII Introduction
Infosys GRADIENT Enabling Enterprise Data Virtualization Keywords Grid, Enterprise Data Integration, EII Introduction A new generation of business applications is emerging to support customer service,
More informationSemantic Exploration of Archived Product Lifecycle Metadata under Schema and Instance Evolution
Semantic Exploration of Archived Lifecycle Metadata under Schema and Instance Evolution Jörg Brunsmann Faculty of Mathematics and Computer Science, University of Hagen, D-58097 Hagen, Germany joerg.brunsmann@fernuni-hagen.de
More informationReport on the Dagstuhl Seminar Data Quality on the Web
Report on the Dagstuhl Seminar Data Quality on the Web Michael Gertz M. Tamer Özsu Gunter Saake Kai-Uwe Sattler U of California at Davis, U.S.A. U of Waterloo, Canada U of Magdeburg, Germany TU Ilmenau,
More informationLuc Declerck AUL, Technology Services Declan Fleming Director, Information Technology Department
Luc Declerck AUL, Technology Services Declan Fleming Director, Information Technology Department What is cyberinfrastructure? Outline Examples of cyberinfrastructure t Why is this relevant to Libraries?
More informationMetadata Hierarchy in Integrated Geoscientific Database for Regional Mineral Prospecting
Metadata Hierarchy in Integrated Geoscientific Database for Regional Mineral Prospecting MA Xiaogang WANG Xinqing WU Chonglong JU Feng ABSTRACT: One of the core developments in geomathematics in now days
More informationTheme 6: Enterprise Knowledge Management Using Knowledge Orchestration Agency
Theme 6: Enterprise Knowledge Management Using Knowledge Orchestration Agency Abstract Distributed knowledge management, intelligent software agents and XML based knowledge representation are three research
More informationService Oriented Architecture
Service Oriented Architecture Charlie Abela Department of Artificial Intelligence charlie.abela@um.edu.mt Last Lecture Web Ontology Language Problems? CSA 3210 Service Oriented Architecture 2 Lecture Outline
More informationDA-NRW: a distributed architecture for long-term preservation
DA-NRW: a distributed architecture for long-term preservation Manfred Thaller manfred.thaller@uni-koeln.de, Sebastian Cuy sebastian.cuy@uni-koeln.de, Jens Peters jens.peters@uni-koeln.de, Daniel de Oliveira
More informationTransparency and Efficiency in Grid Computing for Big Data
Transparency and Efficiency in Grid Computing for Big Data Paul L. Bergstein Dept. of Computer and Information Science University of Massachusetts Dartmouth Dartmouth, MA pbergstein@umassd.edu Abstract
More informationDesign and Implementation of a Semantic Web Solution for Real-time Reservoir Management
Design and Implementation of a Semantic Web Solution for Real-time Reservoir Management Ram Soma 2, Amol Bakshi 1, Kanwal Gupta 3, Will Da Sie 2, Viktor Prasanna 1 1 University of Southern California,
More informationDistributed Database for Environmental Data Integration
Distributed Database for Environmental Data Integration A. Amato', V. Di Lecce2, and V. Piuri 3 II Engineering Faculty of Politecnico di Bari - Italy 2 DIASS, Politecnico di Bari, Italy 3Dept Information
More informationRUP Design. Purpose of Analysis & Design. Analysis & Design Workflow. Define Candidate Architecture. Create Initial Architecture Sketch
RUP Design RUP Artifacts and Deliverables RUP Purpose of Analysis & Design To transform the requirements into a design of the system to-be. To evolve a robust architecture for the system. To adapt the
More informationIntegrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM)
Integrating XML Data Sources using RDF/S Schemas: The ICS-FORTH Semantic Web Integration Middleware (SWIM) Extended Abstract Ioanna Koffina 1, Giorgos Serfiotis 1, Vassilis Christophides 1, Val Tannen
More informationIn ediscovery and Litigation Support Repositories MPeterson, June 2009
XAM PRESENTATION (extensible TITLE Access GOES Method) HERE In ediscovery and Litigation Support Repositories MPeterson, June 2009 Contents XAM Introduction XAM Value Propositions XAM Use Cases Digital
More informationTalend Metadata Manager. Reduce Risk and Friction in your Information Supply Chain
Talend Metadata Manager Reduce Risk and Friction in your Information Supply Chain Talend Metadata Manager Talend Metadata Manager provides a comprehensive set of capabilities for all facets of metadata
More informationBusiness Intelligence: Recent Experiences in Canada
Business Intelligence: Recent Experiences in Canada Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies 2 Business Intelligence
More informationCiteSeer x in the Cloud
Published in the 2nd USENIX Workshop on Hot Topics in Cloud Computing 2010 CiteSeer x in the Cloud Pradeep B. Teregowda Pennsylvania State University C. Lee Giles Pennsylvania State University Bhuvan Urgaonkar
More informationAmit Sheth & Ajith Ranabahu, 2010. Presented by Mohammad Hossein Danesh
Amit Sheth & Ajith Ranabahu, 2010 Presented by Mohammad Hossein Danesh 1 Agenda Introduction to Cloud Computing Research Motivation Semantic Modeling Can Help Use of DSLs Solution Conclusion 2 3 Motivation
More informationDatabases in Organizations
The following is an excerpt from a draft chapter of a new enterprise architecture text book that is currently under development entitled Enterprise Architecture: Principles and Practice by Brian Cameron
More informationA View Integration Approach to Dynamic Composition of Web Services
A View Integration Approach to Dynamic Composition of Web Services Snehal Thakkar, Craig A. Knoblock, and José Luis Ambite University of Southern California/ Information Sciences Institute 4676 Admiralty
More informationReverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms
Reverse Engineering of Relational Databases to Ontologies: An Approach Based on an Analysis of HTML Forms Irina Astrova 1, Bela Stantic 2 1 Tallinn University of Technology, Ehitajate tee 5, 19086 Tallinn,
More informationService Cloud for information retrieval from multiple origins
Service Cloud for information retrieval from multiple origins Authors: Marisa R. De Giusti, CICPBA (Comisión de Investigaciones Científicas de la provincia de Buenos Aires), PrEBi, National University
More informationIntegrating Relational Database Schemas using a Standardized Dictionary
Integrating Relational Database Schemas using a Standardized Dictionary Ramon Lawrence Advanced Database Systems Laboratory University of Manitoba Winnipeg, Manitoba, Canada umlawren@cs.umanitoba.ca Ken
More informationCollaborative SRB Data Federations
WHITE PAPER Collaborative SRB Data Federations A Unified View for Heterogeneous High-Performance Computing INTRODUCTION This paper describes Storage Resource Broker (SRB): its architecture and capabilities
More informationLong Term Knowledge Retention and Preservation
Long Term Knowledge Retention and Preservation Aziz Bouras University of Lyon, DISP Laboratory France abdelaziz.bouras@univ-lyon2.fr Recent years: How should digital 3D data and multimedia information
More informationFiltering the Web to Feed Data Warehouses
Witold Abramowicz, Pawel Kalczynski and Krzysztof We^cel Filtering the Web to Feed Data Warehouses Springer Table of Contents CHAPTER 1 INTRODUCTION 1 1.1 Information Systems 1 1.2 Information Filtering
More informationData Integration Hub for a Hybrid Paper Search
Data Integration Hub for a Hybrid Paper Search Jungkee Kim 1,2, Geoffrey Fox 2, and Seong-Joon Yoo 3 1 Department of Computer Science, Florida State University, Tallahassee FL 32306, U.S.A., jungkkim@cs.fsu.edu,
More informationKnowledge Management in Heterogeneous Data Warehouse Environments
Management in Heterogeneous Data Warehouse Environments Larry Kerschberg Co-Director, E-Center for E-Business, Department of Information and Software Engineering, George Mason University, MSN 4A4, 4400
More informationKnowledge-based Expressive Technologies within Cloud Computing Environments
Knowledge-based Expressive Technologies within Cloud Computing Environments Sergey V. Kovalchuk, Pavel A. Smirnov, Konstantin V. Knyazkov, Alexander S. Zagarskikh, Alexander V. Boukhanovsky 1 Abstract.
More information1 What Are Web Services?
Oracle Fusion Middleware Introducing Web Services 11g Release 1 (11.1.1) E14294-04 January 2011 This document provides an overview of Web services in Oracle Fusion Middleware 11g. Sections include: What
More informationChapter 11 Mining Databases on the Web
Chapter 11 Mining bases on the Web INTRODUCTION While Chapters 9 and 10 provided an overview of Web data mining, this chapter discusses aspects of mining the databases on the Web. Essentially, we use the
More information1 What Are Web Services?
Oracle Fusion Middleware Introducing Web Services 11g Release 1 (11.1.1.6) E14294-06 November 2011 This document provides an overview of Web services in Oracle Fusion Middleware 11g. Sections include:
More informationMiddleware support for the Internet of Things
Middleware support for the Internet of Things Karl Aberer, Manfred Hauswirth, Ali Salehi School of Computer and Communication Sciences Ecole Polytechnique Fédérale de Lausanne (EPFL) CH-1015 Lausanne,
More informationAutonomy for SOHO Ground Operations
From: FLAIRS-01 Proceedings. Copyright 2001, AAAI (www.aaai.org). All rights reserved. Autonomy for SOHO Ground Operations Walt Truszkowski, NASA Goddard Space Flight Center (GSFC) Walt.Truszkowski@gsfc.nasa.gov
More informationWHITE PAPER DATA GOVERNANCE ENTERPRISE MODEL MANAGEMENT
WHITE PAPER DATA GOVERNANCE ENTERPRISE MODEL MANAGEMENT CONTENTS 1. THE NEED FOR DATA GOVERNANCE... 2 2. DATA GOVERNANCE... 2 2.1. Definition... 2 2.2. Responsibilities... 3 3. ACTIVITIES... 6 4. THE
More informationWeb Service Based Data Management for Grid Applications
Web Service Based Data Management for Grid Applications T. Boehm Zuse-Institute Berlin (ZIB), Berlin, Germany Abstract Web Services play an important role in providing an interface between end user applications
More informationEnabling the Big Data Commons through indexing of data and their interactions
biomedical and healthcare Data Discovery Index Ecosystem Enabling the Big Data Commons through indexing of and their interactions 2 nd BD2K all-hands meeting Bethesda 11/12/15 Aims 1. Help users find accessible
More informationReference Architecture, Requirements, Gaps, Roles
Reference Architecture, Requirements, Gaps, Roles The contents of this document are an excerpt from the brainstorming document M0014. The purpose is to show how a detailed Big Data Reference Architecture
More informationOracle Data Miner (Extension of SQL Developer 4.0)
An Oracle White Paper September 2013 Oracle Data Miner (Extension of SQL Developer 4.0) Integrate Oracle R Enterprise Mining Algorithms into a workflow using the SQL Query node Denny Wong Oracle Data Mining
More informationECS 165A: Introduction to Database Systems
ECS 165A: Introduction to Database Systems Todd J. Green based on material and slides by Michael Gertz and Bertram Ludäscher Winter 2011 Dept. of Computer Science UC Davis ECS-165A WQ 11 1 1. Introduction
More informationBringing Business Objects into ETL Technology
Bringing Business Objects into ETL Technology Jing Shan Ryan Wisnesky Phay Lau Eugene Kawamoto Huong Morris Sriram Srinivasn Hui Liao 1. Northeastern University, jshan@ccs.neu.edu 2. Stanford University,
More informationInternational Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 ISSN 2229-5518
International Journal of Scientific & Engineering Research, Volume 5, Issue 4, April-2014 442 Over viewing issues of data mining with highlights of data warehousing Rushabh H. Baldaniya, Prof H.J.Baldaniya,
More informationSemantic Search in Portals using Ontologies
Semantic Search in Portals using Ontologies Wallace Anacleto Pinheiro Ana Maria de C. Moura Military Institute of Engineering - IME/RJ Department of Computer Engineering - Rio de Janeiro - Brazil [awallace,anamoura]@de9.ime.eb.br
More informationBUSINESS VALUE OF SEMANTIC TECHNOLOGY
BUSINESS VALUE OF SEMANTIC TECHNOLOGY Preliminary Findings Industry Advisory Council Emerging Technology (ET) SIG Information Sharing & Collaboration Committee July 15, 2005 Mills Davis Managing Director
More informationMETA DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING
META DATA QUALITY CONTROL ARCHITECTURE IN DATA WAREHOUSING Ramesh Babu Palepu 1, Dr K V Sambasiva Rao 2 Dept of IT, Amrita Sai Institute of Science & Technology 1 MVR College of Engineering 2 asistithod@gmail.com
More informationImplementing Ontology-based Information Sharing in Product Lifecycle Management
Implementing Ontology-based Information Sharing in Product Lifecycle Management Dillon McKenzie-Veal, Nathan W. Hartman, and John Springer College of Technology, Purdue University, West Lafayette, Indiana
More informationIO Informatics The Sentient Suite
IO Informatics The Sentient Suite Our software, The Sentient Suite, allows a user to assemble, view, analyze and search very disparate information in a common environment. The disparate data can be numeric
More informationWorkflow Requirements (Dec. 12, 2006)
1 Functional Requirements Workflow Requirements (Dec. 12, 2006) 1.1 Designing Workflow Templates The workflow design system should provide means for designing (modeling) workflow templates in graphical
More informationOWL based XML Data Integration
OWL based XML Data Integration Manjula Shenoy K Manipal University CSE MIT Manipal, India K.C.Shet, PhD. N.I.T.K. CSE, Suratkal Karnataka, India U. Dinesh Acharya, PhD. ManipalUniversity CSE MIT, Manipal,
More informationLightweight Data Integration using the WebComposition Data Grid Service
Lightweight Data Integration using the WebComposition Data Grid Service Ralph Sommermeier 1, Andreas Heil 2, Martin Gaedke 1 1 Chemnitz University of Technology, Faculty of Computer Science, Distributed
More information2 Associating Facts with Time
TEMPORAL DATABASES Richard Thomas Snodgrass A temporal database (see Temporal Database) contains time-varying data. Time is an important aspect of all real-world phenomena. Events occur at specific points
More informationData Quality in Information Integration and Business Intelligence
Data Quality in Information Integration and Business Intelligence Leopoldo Bertossi Carleton University School of Computer Science Ottawa, Canada : Faculty Fellow of the IBM Center for Advanced Studies
More informationA Model-based Software Architecture for XML Data and Metadata Integration in Data Warehouse Systems
Proceedings of the Postgraduate Annual Research Seminar 2005 68 A Model-based Software Architecture for XML and Metadata Integration in Warehouse Systems Abstract Wan Mohd Haffiz Mohd Nasir, Shamsul Sahibuddin
More information