CDS Invenio - a software solution for National Repository of Grey Literature




Tomáš Müller
National Technical Library, Prague, Czech Republic
tomas.muller@techlib.cz

Third Seminar on Providing Access to Grey Literature, December 8, 2010

Abstract: Retrieving, preserving and managing digital grey literature documents and their metadata requires a sophisticated software system. CDS Invenio is the software solution chosen by the National Repository of Grey Literature (NRGL) both for the central repository, which collects grey literature and makes it accessible, and for the cooperating organizations that produce grey literature. Its comprehensiveness, flexibility and open-source nature allow it to meet nearly all the requirements NRGL has for such a system.

Contribution:

The grey literature

There are many definitions of grey literature. One of the best known was formulated in Luxembourg in 1997 and expanded in New York in 2004. It says that grey literature is "information produced on all levels of government, academics, business and industry in electronic and print formats not controlled by commercial publishing i.e. where publishing is not the primary activity of the producing body" [1]. Under the term grey literature we can therefore imagine reports (e.g. annual reports, research reports, final project reports, ...), conference materials (e.g. posters, presentations, papers, ...), theses (e.g. bachelor theses, master theses, rigorous theses, ...) and even common correspondence such as letters, e-mails, blog posts and so on. Some of these documents contain very valuable information, which is why the topic is so widely discussed nowadays.

The NRGL project

The good news is that grey literature probably contains a lot of useful information. The bad news is that finding one particular grey literature document we are interested in is very difficult.
These documents are usually scattered across local repositories and there is no way to search all of them at once, which is why NRGL was created. The main purpose of NRGL is to create a central national repository of grey literature that collects metadata and full texts from local repositories and eases access to the actual documents. When all the data are in one place, it is much easier to find what we want. NRGL looks for organizations that produce potentially valuable grey literature (such as colleges, scientific institutes and libraries) and that are interested in sharing it. Metadata usually contain some sensitive information, so a contract with the institution is necessary. The institution may then insert its records directly into NRGL, or the metadata are harvested from its local repository into NRGL.

1 Luxembourg, 1997 - Expanded in New York, 2004. Available from WWW: <http://www.greynet.org/index.html>.

CDS Invenio [2]

System overview

CDS Invenio is a document management system developed by CERN, the European Organization for Nuclear Research, based in Geneva, Switzerland. It is a free and open-source solution for library systems and repositories. The architecture of the system is modular and most parts are written in the Python programming language, so it is not difficult to extend when necessary [3].

The performance of the system is well optimized. The most demanding outputs are cached to minimize communication with the database, which increases the speed of the system. Invenio also uses various indexes; once they have been created, searching is very fast.

All these features contribute to user friendliness, but they also make the administration of the system much more difficult. In practice it is not possible to run the system without knowledge of Python or at least some broadly similar language. It is also important to keep in mind that most configuration changes do not take effect immediately because of caching. For an instant result the cache has to be refreshed manually.

The biggest drawback of the system is that the current version is still 0.99.1, so some parts of the system are still under development, the documentation is rather brief, and parts of it are outdated.
A stable version 0.99.3 and a development version 1.0 RC 0 were released recently, but we have made many modifications to our installation, so we plan to wait for the stable 1.0 release before porting all of them.

The system is administered both through a web browser and through the command line of the computer on which Invenio runs. Web administration is usually used to set up the system, while the command line is used mostly to run various tasks.

2 PEJŠOVÁ, Petra. Národní úložiště šedé literatury (NUŠL). Čtenář : měsíčník pro knihovny. 2010, vol. 62, no. 5, pp. 176-180. Available from WWW: <http://ctenar.svkkl.cz/clanky/2010-roc-62/05-2010.htm>.
3 CDS Invenio [online]. 2010 [cit. 2010-12-05]. About Invenio. Available from WWW: <http://invenio-software.org/>.
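The effect of caching on configuration changes, as described in the system overview, can be illustrated with a small Python sketch. This is not Invenio code: the `config` dictionary and the `render_header` function are hypothetical, and the standard library's `functools.lru_cache` merely stands in for whatever caching Invenio uses internally. It only shows why a changed setting stays invisible until the cache is refreshed.

```python
from functools import lru_cache

# Hypothetical configuration value; not an actual Invenio setting.
config = {"site_name": "NRGL"}


@lru_cache(maxsize=None)
def render_header(page):
    """Stand-in for an expensive output that is computed once and then cached."""
    return f"{config['site_name']} :: {page}"


first = render_header("search")      # computed and stored in the cache
config["site_name"] = "National Repository of Grey Literature"
stale = render_header("search")      # served from the cache, old setting still visible
render_header.cache_clear()          # manual cache refresh
fresh = render_header("search")      # recomputed, new setting now visible
```

The second call returns the old header even though the setting has changed, which mirrors the behaviour an administrator sees after editing the configuration without refreshing the cache.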

Figure 1: User web interface of Invenio in NRGL

Invenio in NRGL is installed on a virtual machine, which can be exported and provided to organizations that want their own repository and do not have one. Using the exported virtual machine saves them from installing and configuring Invenio, which is a hard and tedious job. The virtual environment is currently provided by VirtualBox [4].

Data structure

Records are stored internally in the MARC 21 format. Each record consists of several fields, each identified by a three-digit number and two indicators. Each indicator is a single digit or a blank. Every field has one or more subfields, each identified by a single letter or digit. The meaning of the individual fields and subfields is documented.

The records in NRGL are divided in two ways: by document type and by institution. A unit of this division is called a collection. The document types are the following [5]:

4 VirtualBox [online]. 2010 [cit. 2010-12-05]. About VirtualBox. Available from WWW: <http://www.virtualbox.org/>.
5 Typologie dokumentu NUŠL [online]. 2010 [cit. 2010-12-05]. Národní úložiště šedé literatury. Available from WWW: <http://nusl.techlib.cz/index.php/typologie_dokumentu>.

• Theses
  o Habilitation theses
  o PhD theses
  o Rigorous theses
  o Master theses
  o Bachelor theses
• Reports
  o Survey reports
  o Grant reports
  o Final report of the project
  o Interim report of the project
  o Statistical reports
  o Technical reports
  o Research reports
  o Annual reports
• Copyrighted Writings
  o Preprints
  o Papers
• Trade Literature
  o Product Catalogues
  o Guides
• Conference Materials
  o Posters
  o Presentations
  o Proceedings
  o Programs
  o Articles
• Study Materials
  o Course Synopses
  o Exam Questions
  o Teaching Transcripts

A record is assigned to a collection indirectly, through a logical field. Logical fields are defined in the BibIndex module: we select the field(s) and/or subfield(s) we want to index and give the index a name. For example, we may create a logical field called "collection" for 980__a (field 980, both indicators blank, subfield a). From then on we can search the records by the values in field 980__a. Next we may define the collection Preprints by the logical field "collection" and the value "preprints" (the correct syntax is "collection:preprints"). All records that have the value "preprints" in field 980__a (the match is case insensitive and ignores accents) will then appear in the Preprints collection. In the same way we can choose any field to search by or to assign records to collections by.

Data acquisition

Metadata and full texts can be delivered into CDS Invenio in three ways: submission through a web form, submission by e-mail, and harvesting from another repository. NRGL uses only web form submission and harvesting.

Setting up submission through a web form consists of two parts: first, creating the form or forms (a form can have several pages), and second, constructing a sequence of functions and their arguments that will process the data obtained through the form.

First we have to define all the elements we want to use in the form. The elements are ordinary form objects such as text inputs, select boxes and so on, each with further settings such as the element name and description. Then we arrange these elements into a form, add labels, and set whether each element is optional or mandatory. Once the form is complete and can be filled in, we create the sequence of functions that will do things like creating the system number, renaming the submitted files and moving them to storage, creating the actual record, and uploading it. The result should be a record in MARCXML format, which can be uploaded. We can even write our own functions. We can define several document types and subtypes, and each type can have a separate form while the elements stay the same. Apart from submitting a new record, these forms can also be used, for example, to edit a record or to add a file to an existing record.

Records are harvested from another repository via the OAI-PMH protocol [6]. The remote repository sends records in an XML format (mostly Dublin Core (DC) or MARCXML) over HTTP, in batches of usually 100-1000 records. These records must be converted into MARCXML before they can be uploaded to the system. The most convenient way to convert them is XSLT. Converting a large number of records (e.g. thousands) can be difficult, particularly when the harvested data are not consistent. The most important point is that all tags used for assigning records to collections must be created exactly right, which can sometimes be really hard.

CDS Invenio can also play the role of data provider. We can specify a set of data that will be exposed for harvesting by another repository, again using the OAI-PMH protocol.
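As a rough illustration of the harvesting step and of why the collection tags matter, the following Python sketch parses one batch of an OAI-PMH ListRecords response, reads field 980 (blank indicators), subfield a, from each MARCXML record, and matches the values against collection definitions in a case- and accent-insensitive way. It is a sketch under stated assumptions, not Invenio's own harvester: the function names, the sample data, the base URL and the "marcxml" metadata prefix are hypothetical (a real repository advertises its prefixes itself), and the records are assumed to arrive already in MARCXML, so the XSLT conversion is not shown.

```python
import unicodedata
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "marc": "http://www.loc.gov/MARC21/slim",
}


def request_url(base_url, token=None, metadata_prefix="marcxml"):
    """Build a ListRecords request; a resumption token replaces the other arguments."""
    if token:
        args = {"verb": "ListRecords", "resumptionToken": token}
    else:
        args = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
    return base_url + "?" + urlencode(args)


def normalize(value):
    """Strip accents and case so that 'Préprints' and 'preprints' compare equal."""
    decomposed = unicodedata.normalize("NFKD", value)
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return stripped.casefold().strip()


def parse_batch(xml_text):
    """Return ([(identifier, [980 $a values])], resumption token or None)."""
    root = ET.fromstring(xml_text)
    records = []
    for rec in root.iterfind(".//oai:record", NS):
        identifier = rec.findtext("oai:header/oai:identifier", default="", namespaces=NS)
        values = [
            sub.text or ""
            for field in rec.iterfind(".//marc:datafield[@tag='980']", NS)
            # Keep only fields whose two indicators are both blank.
            if not field.get("ind1", " ").strip() and not field.get("ind2", " ").strip()
            for sub in field.iterfind("marc:subfield[@code='a']", NS)
        ]
        records.append((identifier, values))
    token = root.findtext(".//oai:resumptionToken", default="", namespaces=NS).strip()
    return records, token or None


def collections_for(values, definitions):
    """Names of the collections whose defining value matches any 980 $a value."""
    wanted = {normalize(v) for v in values}
    return [name for name, value in definitions.items() if normalize(value) in wanted]


# Hypothetical one-record batch with a token announcing a further batch.
SAMPLE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <record xmlns="http://www.loc.gov/MARC21/slim">
          <datafield tag="980" ind1=" " ind2=" ">
            <subfield code="a">PREPRINTS</subfield>
          </datafield>
        </record>
      </metadata>
    </record>
    <resumptionToken>batch-2</resumptionToken>
  </ListRecords>
</OAI-PMH>"""

records, token = parse_batch(SAMPLE)
```

In a real harvest, `request_url` would be called again with the returned token until no token comes back. A record whose 980 subfield a value is misspelled during conversion simply matches no collection definition, which is why these tags have to be produced exactly right.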
NRGL will use the data provider role to join international grey literature projects such as Open Grey.

6 The Open Archives [online]. 2008 [cit. 2010-12-06]. The Open Archives Initiative Protocol for Metadata Harvesting. Available from WWW: <http://www.openarchives.org/oai/openarchivesprotocol.html>.

Other utilities

CDS Invenio offers a great number of features, so only those widely used by NRGL are mentioned here.

Security in CDS Invenio is handled by a classic role-based model. The basic element of the security module is an action with its parameters. Actions are grouped into roles. Roles are assigned to users either directly, on their user accounts, or indirectly, through firewall-like settings. Firewall-like role assignment is a very powerful feature that allows us to give a user a role based on, for example, the user's IP address. This makes it possible, for instance, to restrict access to some collections or full texts to certain networks only.

The BibRank module can compute special indexes such as a citation index or word similarity. This enables searching for similar records, which can be very useful, e.g. for research work.

Many periodic tasks need to run in Invenio, such as indexing, harvesting, cache refresh, cleanup and many more. Invenio runs its own task scheduler (the BibSched module), which is used for every task in the system. When someone submits a new record, the upload is put in the scheduler's queue; it is not processed immediately. Many changes in Invenio therefore appear only when their turn in the scheduler comes, not right away. This may seem a little odd to users, but it helps to maintain optimal performance.

Enhancements

Even a system as complex as Invenio cannot offer everything, and some things had to be made to measure.

When converting harvested data we cannot know the system numbers of the records in advance, because the number is assigned only when a record is uploaded. Yet we need it to create the record identifier. This problem was solved by a post-processing task that finds all records without an identifier and fills it in. A similar problem concerns the upload date and the modification date of a record. These two values are stored in the database, but we need them in MARC as well, so a similar task was created that keeps these two data fields up to date.

Another missing feature was the ability to export and import records, so this was created as well. The export script dumps the part of the database that contains the record data and copies all full texts into a backup directory. The import script does exactly the opposite: it loads the dumped part of the database and copies the full texts back into place.

An automatic indexing tool is currently being developed in Invenio. It will analyze the full text and suggest keywords based on its content. This feature should help a lot with document description.

Conclusion

So far we have a fully operational digital repository with 3 harvested repositories (the Academy of Science, the University of Economics and our institutional repository, altogether about 42000 records) and about 10 manually inserted records. We have created a manual for Invenio installation, collection management and WebSubmit templates, and we are working on an FAQ in which we want to describe some important tasks that are hard to understand from the official documentation.