Studi Trent. Sci. Nat., 84 (2009)
Museo Tridentino di Scienze Naturali, Trento 2009

Adding content to content: a generic annotation system for biodiversity data

Anton Güntsch 1*, Walter G. Berendsohn 1, Pepé Ciardelli 1, Andrea Hahn 2, Wolf-Henning Kusber 1 & Jinling Li 3

1 Botanischer Garten und Botanisches Museum Berlin-Dahlem, Freie Universität Berlin, Königin-Luise-Str. 6-8, Berlin, Germany
2 Global Biodiversity Information Facility (GBIF) Secretariat, Universitetsparken 15, 2100 Copenhagen, Denmark
3 Northwestern University, Department of Industrial Engineering & Management Sciences, 2145 Sheridan Road, Evanston, IL 60208, USA
* Corresponding author: BiodiversityInformatics[at]bgbm.org

SUMMARY - Adding content to content: a generic annotation system for biodiversity data - Biodiversity information networks such as GBIF and BioCASE provide access to a rapidly growing number of collection records, ranging from simple occurrence information to high-resolution images. Today, about 150 million observation and collection records are available at a global level. Although the technology for providing and retrieving primary biodiversity data is reasonably mature, the development of advanced techniques for feedback using annotations has been neglected. We have developed and implemented a web-based annotation system which fills this gap. Rather than sending the annotation as a message to the collection holder, the system allows for adding, changing and removing information in a copy of the collection record. The modified record is then stored on a public annotation server together with all previous versions. This allows collection holders to compare the different versions and decide whether a given annotation will be fed back into the collection database. The system works with any GBIF-compliant information network.
RIASSUNTO - Adding content to content: a system for managing annotations on biodiversity data - Biodiversity information networks such as GBIF and BioCASE provide access to a rapidly growing number of records on catalogued specimens belonging to various collections; the information they contain ranges from simple occurrences in a geographic area to high-resolution images. Today, about 150 million observations and records referring to collected specimens are available at a global level. Although the technology for providing and retrieving primary biodiversity data has evolved to a reasonable level of functionality, the development of advanced feedback techniques that make use of annotations has been neglected. We have therefore designed and implemented a web-based annotation system that fills this gap. Instead of sending the annotation to the collection curator as a message, the system allows information to be added, changed and removed in a copy of the catalogued collection record. The modified record is then saved on a public annotation server together with all previous versions. This allows collection curators to compare the different versions and to decide whether to accept a given annotation into the collection database. The system can operate with any GBIF-compatible information network.

Key words: biodiversity informatics, collection networking, annotation software
Parole chiave: applied biodiversity informatics, networks for access to collections, annotation software

1. Introduction

Traditionally, biological collection objects consist of representative material collected at a single place at a certain point in time. For example, an object may be mounted on cardboard, with a label containing information provided by the collector.
Over time, a variety of other information is added in the form of annotations representing corrections, confirmations and results of scientific analyses (Fig. 1). Annotations increase the value of biological collections and ensure that scientists working in collections have access to the scientific results of previous studies associated with the objects of interest (Perkins 2002). Electronic collection information systems now provide access to many millions of objects via common Internet portals, but have also interrupted the traditional flow of annotation information. A convincing method for adding information to the electronic surrogate of a collection object, and for feeding the information back to the collection holder, has yet to be developed.

Fig. 1 - Image of an annotated herbarium type specimen collected by Humboldt in Colombia (BGBM 2008).

The EU 6th Framework project SYNTHESYS (http://www.synthesys.info/index.htm) has tackled the problem and developed a structured annotation system for the European collection information network BioCASE (Biological Collection Access Service for Europe). BioCASE uses the technical infrastructure of GBIF (Global Biodiversity Information Facility) to provide access to biological objects collected or observed in Europe (Güntsch et al. 2007). This article gives an overview of the annotation information flow implemented with the SYNTHESYS annotation system and outlines its benefits and limitations.

Paper presented at the 2nd Central European Diatom Meeting (CEDiatoM2), Museo Tridentino di Scienze Naturali, Trento, June 12-15, 2008.
2. The SYNTHESYS annotation workflow

The possibility of comparing the annotations for a biological collection object with its original label information is central to both the traditional annotation system and its electronic surrogate in biodiversity information networks. The SYNTHESYS annotation system was therefore built around a central server for handling versions of documents. Rather than developing this server ourselves, we decided to use the existing open-source revision control system Subversion (http://subversion.tigris.org/), which is primarily intended for handling documents in the context of software development, but is also well suited to version control of collection records. In addition to the Subversion document repository, the annotation server connects to an authentication database containing the registered users of the system and their passwords (Fig. 2). The present architecture performs no further checks on the trustworthiness of authenticated users; such checks should be considered in future developments of the system.

Because the system is based on the GBIF technical infrastructure, the collection data themselves remain with the collection holders in their own database systems. Network accessibility is realized by means of standard provider software implementing either of the two protocols for networking biodiversity data, BioCASe or DiGIR, together with the XML (http://www.w3.org/xml/) data schemas ABCD (Berendsohn 2003) or DarwinCore (Anonymous 2005). Thanks to the provider software, collection data are always retrievable in a standardized format, no matter how they are stored in the collection-holding institution. Once a user has changed a record using the annotation system, the changed record is stored in the Subversion repository as a new version of this collection record, together with all previous versions.
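The versioning idea described above can be illustrated with a minimal in-memory sketch. It is not the SYNTHESYS implementation, which delegates storage to Subversion; the class `RecordHistory`, its methods, and the simplified XML snippets are assumptions made purely for illustration. The sketch shows an annotated copy being stored as a new version next to the original, and the two versions being compared with a standard diff.

```python
import difflib
from dataclasses import dataclass, field


@dataclass
class RecordHistory:
    """Hypothetical stand-in for one versioned collection record."""
    versions: list = field(default_factory=list)

    def commit(self, xml_text: str) -> int:
        """Store a new version and return its 1-based version number."""
        self.versions.append(xml_text)
        return len(self.versions)

    def diff(self, old: int, new: int) -> list:
        """Return unified-diff lines between two stored versions."""
        a = self.versions[old - 1].splitlines()
        b = self.versions[new - 1].splitlines()
        return list(difflib.unified_diff(a, b, f"v{old}", f"v{new}", lineterm=""))


# Simplified, made-up record content (not a real ABCD or DarwinCore document).
original = "<Unit>\n  <Country>Columbia</Country>\n</Unit>"
annotated = "<Unit>\n  <Country>Colombia</Country>\n</Unit>"

history = RecordHistory()
v1 = history.commit(original)   # version as delivered by the provider
v2 = history.commit(annotated)  # annotated copy, stored as a new version
changes = history.diff(v1, v2)
print("\n".join(changes))
```

Keeping every version rather than overwriting the record is what lets a collection holder inspect exactly which lines an annotator changed before deciding what to do with the annotation.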
The collection holder then receives a message stating that one of their records has been changed, together with the URLs of both the original and the changed records on the server. Numerous freely available software tools support Subversion and can then be used to analyze the changes. These include diff tools to compare documents, as well as tools to graphically display the version history of a record (Fig. 3). Based on these data, collection holders are able to make an informed decision about whether to ignore an annotation or to import the modified record into their database.

Fig. 2 - Information flow in the SYNTHESYS annotation system.

Fig. 3 - Using standard software to visualize the annotation history of a specimen.

The software interface between collection portals and the annotation system is simply the GBIF triple-ID, consisting of institution, collection identifier, and an identifier of the particular collection record. The system can therefore be plugged into different data portals with comparatively little effort. In addition, because it is based on the transfer and storage of XML documents, it is generic with respect to the collection data schema used.

3. User interface

The annotation system has a plain HTML interface which can be used with any World Wide Web browser. It is simply linked to each collection record (details view) in the BioCASE portal (http://search.biocase.org/europe/). First-time users are prompted to register using a spam-safe registration form. Once the user has logged in successfully, the annotation form opens with two fields: one for editing the collection record and one for entering a free-text comment (Fig. 4). The collection record is edited directly in its XML representation (ABCD or one of the variants of DarwinCore). When a modified record is submitted, an XML validation algorithm checks whether it is well-formed and valid.
If it is not, the system returns a list of potential problems to the user. Occasionally, users may be unable to express their annotation in the terminology provided by the collection record itself. In these cases, the comment field can be used and the collection record remains unchanged.

Fig. 4 - Editing a collection record on the web.

4. Discussion and conclusions

Decoupling the annotation process from the physical collection object and migrating it to an electronic web-based platform is highly complex. The SYNTHESYS annotation system is a first and promising step on the way to an entirely virtualized annotation environment. Based on well-defined XML formats and corresponding validation software, it is open enough to handle a variety of existing data formats as well as formats still in development. However, editing raw XML documents is only feasible for experienced users. Future developments of the system will therefore focus on semi-automatic form generation based on XML schemas.

Another weakness of the system is that annotations can only be linked to single collection records. There is therefore no way to express a statement about a set of collection objects (for example, "latitude/longitude values have been swapped in all your Italian records"). To achieve this, the software interface to the annotation system has to be extended from single triple-IDs to multiple triple-IDs, and a form in the collection portal will be needed to support the application of the annotation system to more than one collection object.

We believe that electronic web-based annotation systems have two substantial benefits: they can help collection holders and users to communicate better with one another, and they can speed up the workflows of both. However, such systems will not replace the taxonomic expertise necessary to make informed decisions about the validity of annotations within collections. In fact, they will likely require greater expertise on the collection side if, as expected, they cause an increase in the number of annotations.
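The extension from single to multiple triple-IDs discussed above could, for instance, take the following shape. This is only a sketch of one possible design and not part of the existing system: the names `TripleId`, `SetAnnotation` and `annotate_matching`, the record fields, and the sample identifiers are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TripleId:
    """GBIF triple-ID: institution, collection and record identifier."""
    institution: str
    collection: str
    record: str


@dataclass(frozen=True)
class SetAnnotation:
    """Hypothetical annotation that targets several records at once."""
    targets: tuple
    comment: str


def annotate_matching(records, predicate, comment):
    """Build one annotation covering every record accepted by predicate."""
    targets = tuple(tid for tid, fields in records.items() if predicate(fields))
    return SetAnnotation(targets=targets, comment=comment)


# Made-up records keyed by triple-ID; the field names are illustrative only.
records = {
    TripleId("INST-A", "Herbarium", "0001"): {"country": "Italy"},
    TripleId("INST-A", "Herbarium", "0002"): {"country": "Germany"},
    TripleId("INST-A", "Herbarium", "0003"): {"country": "Italy"},
}

annotation = annotate_matching(
    records,
    lambda fields: fields["country"] == "Italy",
    "Latitude/longitude values have been swapped in these records.",
)
print([target.record for target in annotation.targets])
```

Because the triple-ID remains the only link between portal and annotation, a design along these lines would leave the existing single-record interface untouched: a single-record annotation is simply the case with one target.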
Acknowledgements

This work was funded by the Network Activity D of the European Union 6th Framework project SYNTHESYS. Pilot studies were carried out under the GBIF-D programme funded by the German Federal Ministry of Education and Research.

References

Anonymous, 2005. The DarwinCore wiki site. Available from WebHome [cited ].

Berendsohn W.G. (ed.), 2003. ABCD, Access to Biological Collection Data. Available from twiki/bin/view/abcd/ [cited ].

Güntsch A., Kusber W.-H., Döring M., Ciardelli P. & Berendsohn W.G., 2007. Common access to distributed biodiversity information. In: Kusber W.-H. & Jahn R. (eds), Proceedings of the 1st Central European Diatom Meeting. Berlin: [doi: /cediatom.109]

Perkins K.D., 2002. Annotation of Herbarium Specimens: Recommendations. Available from edu/herbarium/anno/ [cited ].

Accepted for publication: 6 October 2008