DC, MODS and CERIF-XML A Tale of Two Cultures Ed Simons Radboud University Nijmegen, NL.
Some personal data Ed Simons Workplace: Information Centre (UCI) of Radboud University. UCI takes care of all IT services for RU and is also the managing host of SURFnet, the NL university network. Project leader of software development projects. Initiator and project leader of METIS, the CRIS of all NL universities and the NL Royal Academy of Sciences. Last few years: international IT projects within the framework of development cooperation (Africa). Board member of euroCRIS.
Structure of the Presentation 1. Comparison of the 3 formats. 2. Why XML? 3. Towards another solution for exposure of and access to research information?
Part 1: Comparison of the formats
Comparison of the formats DC: Dublin Core. MODS: Metadata Object Description Schema. It often goes together with DIDL (Digital Item Declaration Language), so you often see DIDL/MODS mentioned. In these cases MODS is a metadata record in the DIDL container describing the bibliographic metadata of the publication, whereas other parts of DIDL contain the metadata of the object files of the publication (location, file size, MIME type, etc.). CERIF-XML: XML transformation of the CERIF data model.
Comparison of the formats The following documents give a representation of the same article in the 3 XML formats:
Title: On the relations between ISE and structure in some RE(Mg)SiAlO(N) glasses
Author(s): Dauce R (Dauce, R.) [1], Keding R (Keding, R.) [2], Sangleboeuf JC (Sangleboeuf, J-C.) [1]
Source: JOURNAL OF MATERIALS SCIENCE
Abbreviation: J MATER SCI
Volume: 43
Issue: 22
Pages: 7239-7246
Published: NOV 2008
JCR Impact factor: 1.081
Times Cited: 0
References: 47
Abstract: Six oxide and oxynitride glasses were synthesized in the Y-Mg-Si-Al-O-N, Nd-Mg-Si-Al-O-N and La-Mg-Si-Al-O-N systems. As already known, nitrogen introduction increases the T-g, packing factor and mechanical properties of the glasses. Cationic substitution also has an influence on the glasses' behavior, particularly in terms of sensitivity to indentation load/size effect (ISE). The structure of the yttrium-containing glasses was investigated by means of Al-27 and Si-29 MAS-NMR. Al is found to occur for 2/3 as a network former and for 1/3 as a modifier.
Language: English
Reprint Address: Dauce, R (reprint author), Univ Rennes 1, CNRS, LARMAUR, FRE 2717, F-35042 Rennes, France
Addresses: 1. Univ Rennes 1, CNRS, LARMAUR, FRE 2717, F-35042 Rennes, France 2. Univ Aalborg, Aalborg, Denmark
Corresponding author e-mail address: rachel.dauce@gmail.com
Publisher: SPRINGER, 233 SPRING ST, NEW YORK, NY 10013 USA
KeyWords: AL-O-N; EARTH ALUMINOSILICATE GLASSES; OXYNITRIDE GLASSES; MAS-NMR; FLOPPY MODES; INDENTATION; SYSTEM; MICROHARDNESS; RAMAN; DIFFRACTION
Subject Category: Materials Science, Multidisciplinary
IDS Number: 373FZ
ISSN: 0022-2461 (Print) 1573-4803 (Online)
DOI: 10.1007/s10853-008-2851-3
Full text in Institutional Repository (post print): http://vbn.aau.dk/ws/fbspretrieve/16588792/fulltext.pdf
Dublin Core DC is too simple: of limited use because of its lack of detail and granularity. E.g.: no separate elements for volume, issue and page; not possible to describe in the same DC record the item of which a publication (e.g. a book chapter) is a part; not possible to indicate the exact role of a creator or contributor; etc.
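To make the granularity problem concrete, here is a minimal Python sketch that writes the example article as a flat record of simple Dublin Core elements. The `record` wrapper element, the `build_dc_record` helper and the choice of `dc:source` for the citation string are assumptions made for illustration, not a prescribed mapping. The point is that journal, volume, issue and pages have nowhere to go except one free-text string.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"
ET.register_namespace("dc", DC_NS)


def build_dc_record(fields):
    """Build a flat record of simple Dublin Core elements from (name, value) pairs."""
    record = ET.Element("record")  # hypothetical wrapper element, for illustration only
    for name, value in fields:
        ET.SubElement(record, f"{{{DC_NS}}}{name}").text = value
    return record


fields = [
    ("title", "On the relations between ISE and structure in some RE(Mg)SiAlO(N) glasses"),
    # Every author is just a "creator": there is no place to state a role.
    ("creator", "Dauce, R."),
    ("creator", "Keding, R."),
    ("creator", "Sangleboeuf, J-C."),
    # Journal, volume, issue and pages collapse into a single free-text string.
    ("source", "Journal of Materials Science 43 (22), 7239-7246"),
    ("date", "2008-11"),
    ("identifier", "doi:10.1007/s10853-008-2851-3"),
]

record = build_dc_record(fields)
dc_xml = ET.tostring(record, encoding="unicode")
print(dc_xml)
```

A harvester that wants the volume or the page range has to parse the `dc:source` string again, which is exactly the loss of detail described above.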
Dublin Core DC reflects the traditional library culture: it is the electronic version of the old library card. DC possibly also reflects a political aspect or culture. The OAI community needed a format that was easy to implement everywhere at short notice. In a way, they did not have time to wait until a more suitable, robust solution was worked out. DC and DC-based harvesting are indeed a success, but in which sense: the success of the tool, or the success of optimally supplying research information?
MODS Solves the shortcomings of DC. More detailed format and good handling of semantics, e.g.: possibility to express the roles of authors/persons; possibility to use established classification schemes (controlled vocabularies) by means of the authority attribute.
<role>
  <roleTerm authority="marcrelator"...>aut</roleTerm>
</role>
MODS Describe in the same record the item of which a publication, e.g. a book chapter, is a part.
<titleInfo>
  <title>The provisions of the Corpus Juris on community fraud</title>
  <subTitle>a Belgian and Dutch perspective</subTitle>
</titleInfo>
<relatedItem type="host">
  <titleInfo>
    <title>Das Corpus Juris als Grundlage eines europaeischen Strafrechts : Europaeisches Kolloquium, Trier, 4.-6. Maerz 1999</title>
  </titleInfo>
</relatedItem>
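As a rough illustration of what this extra structure buys a consumer, the following Python sketch reads a small MODS record and pulls out the chapter title, the host title and the author role as separate values. The record is a cut-down assumption based on the snippets above: the name "Example, A." is a placeholder, the host title is shortened, and `type="code"` on `roleTerm` is added for the example.

```python
import xml.etree.ElementTree as ET

MODS = {"m": "http://www.loc.gov/mods/v3"}

# Cut-down record for illustration; "Example, A." is a placeholder name.
mods_xml = """
<mods xmlns="http://www.loc.gov/mods/v3">
  <titleInfo>
    <title>The provisions of the Corpus Juris on community fraud</title>
    <subTitle>a Belgian and Dutch perspective</subTitle>
  </titleInfo>
  <name>
    <namePart>Example, A.</namePart>
    <role><roleTerm authority="marcrelator" type="code">aut</roleTerm></role>
  </name>
  <relatedItem type="host">
    <titleInfo>
      <title>Das Corpus Juris als Grundlage eines europaeischen Strafrechts</title>
    </titleInfo>
  </relatedItem>
</mods>
"""

root = ET.fromstring(mods_xml)

# The chapter's own title is a direct child of the record.
chapter_title = root.find("m:titleInfo/m:title", MODS).text

# The host item (the book) is described inside the same record.
host_title = root.find("m:relatedItem[@type='host']/m:titleInfo/m:title", MODS).text

# Each name carries an explicit role from a controlled vocabulary.
roles = [
    (name.find("m:namePart", MODS).text, name.find("m:role/m:roleTerm", MODS).text)
    for name in root.findall("m:name", MODS)
]

print(chapter_title)
print(host_title)
print(roles)
```

Unlike the flat DC record, no string parsing is needed to separate the chapter from the book it appears in.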
MODS Still, MODS heavily reflects the library culture and its vision of research information. It is a rich metadata set that adequately describes the bibliographical aspects of a publication. But adequately and optimally exposing research information involves more than just bibliographical aspects and more than just publications, e.g. contextual research metadata (such as the research project the publication results from).
CERIF-XML Describe in the same record the item of which a publication, e.g. an article or a book chapter, is a part.
<cfResPublTitle>
  <cfResPublId>arttitle4778</cfResPublId>
  <cfTitle cfLangCode="en" cfTrans="o">On the relations between ISE and structure in some RE(Mg)SiAlO(N) glasses</cfTitle>
</cfResPublTitle>
<cfResPublTitle>
  <cfResPublId>journaltitle345</cfResPublId>
  <cfTitle cfLangCode="en" cfTrans="o">JOURNAL OF MATERIALS SCIENCE</cfTitle>
</cfResPublTitle>
<cfResPubl_ResPubl>
  <cfResPublId1>arttitle4778</cfResPublId1>
  <cfResPublId2>journaltitle345</cfResPublId2>
  <cfClassId>is article in</cfClassId>
  <cfClassSchemeId>cfresultpublication-resultpublication</cfClassSchemeId>
  <cfStartDate>2001-01-01T12:00:00-05:00</cfStartDate>
  <cfEndDate>2001-01-01</cfEndDate>
</cfResPubl_ResPubl>
CERIF-XML The link between an author and the publication is made in the same, uniform way.
<cfPers_ResPubl>
  <cfPersId>daucer</cfPersId>
  <cfResPublId>arttitle4778</cfResPublId>
  <cfClassId>is author of</cfClassId>
  <cfClassSchemeId>cfperson-resultpublicationroles</cfClassSchemeId>
  <cfStartDate>2001-01-01T12:00:00-05:00</cfStartDate>
  <cfEndDate>2001-01-01T12:00:00-05:00</cfEndDate>
  <cfCopyright></cfCopyright>
</cfPers_ResPubl>
CERIF-XML Very strong point of CERIF-XML: all relations between the entities of a publication are expressed in exactly the same, uniform way: authors to publication, article to journal, chapter to book, editors to book, etc. Secondly: all these relations are at the same time semantically described (role of a person, type of relation between publications, etc.), again in a uniform, standardized way.
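The uniformity can be sketched in a few lines of Python: one generic helper builds every kind of link entity, and one generic reader interprets them all. The helper names `link_entity` and `describe` are hypothetical, the element names follow the camelCase spelling assumed for CERIF-XML, and the date elements are left out to keep the sketch short.

```python
import xml.etree.ElementTree as ET

CLASS_TAGS = ("cfClassId", "cfClassSchemeId")


def link_entity(tag, id_fields, class_id, scheme_id):
    """Build a CERIF-style link element: two identifiers plus a classified role."""
    link = ET.Element(tag)
    for name, value in id_fields:
        ET.SubElement(link, name).text = value
    ET.SubElement(link, "cfClassId").text = class_id
    ET.SubElement(link, "cfClassSchemeId").text = scheme_id
    return link


def describe(link):
    """Read any link element the same way: source, role, target."""
    ids = [child.text for child in link if child.tag not in CLASS_TAGS]
    return f"{ids[0]} --[{link.findtext('cfClassId')}]--> {ids[1]}"


# Author to publication.
author_link = link_entity(
    "cfPers_ResPubl",
    [("cfPersId", "daucer"), ("cfResPublId", "arttitle4778")],
    "is author of",
    "cfperson-resultpublicationroles",
)

# Article to journal: same shape, only the tag names and the role differ.
journal_link = link_entity(
    "cfResPubl_ResPubl",
    [("cfResPublId1", "arttitle4778"), ("cfResPublId2", "journaltitle345")],
    "is article in",
    "cfresultpublication-resultpublication",
)

for link in (author_link, journal_link):
    print(describe(link))
```

Adding a new relation type (say, editor to book) would need a new classification term but no new code path, which is the practical benefit of the uniform pattern.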
CERIF-XML Reflects very strongly the relational database culture or way of thinking: it mirrors the relational CERIF model in the XML world. Too fragmented: too many schemas and namespaces (e.g. more than 10 different schemas to express an article). This could lead to performance issues. Difficult to communicate to non-experts or people not familiar with relational thinking.
CERIF-XML Need to combine the schemas into a limited number, corresponding to the major research objects, e.g. one schema for a publication. The CERIF task group is aware of this and is already working on it.
CERIF-XML Extensive set of metadata, not limited to bibliographic or publication metadata but encompassing all aspects of research information (including the bibliographical metadata as expressed by MODS). A strong point of CERIF is that it is a uniform, standardized MODEL which allows easy extension or addition of research metadata (rather than a given, fixed list of metadata).
Part 2: Why XML?
Why XML? We all seem to uncritically embrace XML as the obvious format for exposing research information in the international context. Everything has to be prepared to work with the XML-based architecture and technologies: OAI-PMH, SOA... Result: we all copy, transform and store data twice (e.g. from our CRIS repositories we transform a set of metadata into XML, which we then upload to and store in the institutional repository).
Why XML? But shouldn't we ask ourselves whether all this copying, transforming and re-storing of data in XML format is necessary and really the way to go? Would it not be better to have a solution that needs only the original data sources and leaves these intact, without transforming and re-storing the data somewhere else?
Part 3: Towards another solution?
Towards another solution? Conclusions: METIS can automatically harvest the metadata already stored in Elsevier's SCOPUS database, so these do not have to be entered separately in METIS again. However, up to now METIS still stores the harvested data in its own database. This probably should not be necessary, so we should consider solutions for it. This brings us to the next step.
Business Intelligence view may be inspiring In one sentence, Business Intelligence (BI) could be defined as: knowledge of all aspects of the business in a comprehensive, integrated and manageable way. BI tools are software products that supply this knowledge (e.g. Business Objects, JasperReports/iReport). Great consumers of BI are managers of big companies, who need to know all aspects of their business in a comprehensive, manageable form (statistics, charts, diagrams, etc.).
Business Intelligence view may be inspiring The problem that BI is confronted with is more or less the same as the one we face when trying to get a full, appropriate and integrated view on research information: the data is dispersed over various heterogeneous resources (databases, XML repositories, files, etc.). Solutions are emerging that solve this problem, in other words: that supply timely data from heterogeneous sources in an integrated way without first copying, transforming and storing these data in intermediate resources.
Business Intelligence view may be inspiring The following builds upon the ideas expressed by Rick van der Lans, an internationally acclaimed Dutch expert on software architecture and solutions for Business Intelligence, and notably on his recent publication: Rick F. van der Lans, Developing a Data Delivery Platform With Informatica Data Services. A Technical Whitepaper on Next Generation Data Virtualization, February 28th, 2011, Copyright 2011 R20/Consultancy. http://vip.informatica.com/ricklans8761?elqpurlpage=6013&docid=1571&lsc=na-Ongoing-2011Q1-JP-DI_Developing_Data_Delivery_Platform_WP_www
Federation Server
Federation Server Works with all kinds of input resources: relational databases, data warehouses, XML resources, Excel sheets, text files, web services, etc. Based on the relational database concept, but with virtual tables. No (re-)storage of the data. On-demand (on-the-fly) transformation of incoming data to the virtual table structure.
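The virtual-table idea can be sketched in Python as follows. This is not how any particular federation server is implemented; the `VirtualTable` class, the `publications_from_xml` mapping, the column names and the second publication record are all hypothetical. The sketch only shows the principle: the table holds a mapping, not data, and each query re-reads and transforms the source on the fly.

```python
import xml.etree.ElementTree as ET
from typing import Callable, Dict, Iterator, List, Optional

Row = Dict[str, Optional[str]]

# Stand-in for an external XML resource; a real source could be a file or a web service.
xml_source = {
    "doc": "<publications><publication><id>arttitle4778</id>"
           "<title>On the relations between ISE and structure</title>"
           "<volume>43</volume></publication></publications>"
}


def publications_from_xml() -> Iterator[Row]:
    """Mapping: transform the XML source into rows each time it is called."""
    root = ET.fromstring(xml_source["doc"])
    for pub in root.iter("publication"):
        yield {
            "publ_id": pub.findtext("id"),
            "title": pub.findtext("title"),
            "volume": pub.findtext("volume"),
        }


class VirtualTable:
    """A table definition that stores no rows, only a mapping to its source."""

    def __init__(self, name: str, columns: List[str], fetch: Callable[[], Iterator[Row]]):
        self.name = name
        self.columns = columns
        self._fetch = fetch

    def select(self, where: Optional[Callable[[Row], bool]] = None) -> Iterator[Row]:
        for row in self._fetch():  # the source is read at query time
            projected = {col: row.get(col) for col in self.columns}
            if where is None or where(projected):
                yield projected


publications = VirtualTable("publications", ["publ_id", "title", "volume"], publications_from_xml)
first_read = list(publications.select())

# The source changes (hypothetical second record); nothing is copied or reloaded.
xml_source["doc"] = xml_source["doc"].replace(
    "</publications>",
    "<publication><id>examplepubl1</id><title>Hypothetical second record</title>"
    "<volume>44</volume></publication></publications>",
)
second_read = list(publications.select())

print(len(first_read), len(second_read))
```

Because nothing is stored, the second query sees the new record immediately; the cost is that the transformation runs on every query, which is the trade-off a real federation server has to manage.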
Federation Server: virtualization Copyright 2011 R20/Consultancy B.V., The Hague, The Netherlands
Mapping foreign to virtual table Copyright 2011 R20/Consultancy B.V., The Hague, The Netherlands
Mapping XML document to virtual table Copyright 2011 R20/Consultancy B.V., The Hague, The Netherlands
Joining Relational and XML data Copyright 2011 R20/Consultancy B.V., The Hague, The Netherlands
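A join across a relational source and an XML source can be sketched in Python like this. The in-memory SQLite table stands in for an existing relational database such as a CRIS, and the XML string stands in for a repository document. The table and column names, the `joined_rows` helper and the person identifier `kedingr` are assumptions made for the example.

```python
import sqlite3
import xml.etree.ElementTree as ET

# Relational source (stand-in for an existing database; it is queried, not copied).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE authorship (person_id TEXT, publ_id TEXT)")
conn.executemany(
    "INSERT INTO authorship VALUES (?, ?)",
    [("daucer", "arttitle4778"), ("kedingr", "arttitle4778")],  # "kedingr" is hypothetical
)

# XML source (stand-in for a repository document).
publications_xml = (
    "<publications><publication><id>arttitle4778</id>"
    "<title>On the relations between ISE and structure</title>"
    "</publication></publications>"
)


def xml_rows():
    """Map the XML document to rows on demand."""
    root = ET.fromstring(publications_xml)
    for pub in root.iter("publication"):
        yield {"publ_id": pub.findtext("id"), "title": pub.findtext("title")}


def joined_rows():
    """Join both sources on publ_id at query time, without an intermediate store."""
    titles = {row["publ_id"]: row["title"] for row in xml_rows()}
    query = "SELECT person_id, publ_id FROM authorship ORDER BY person_id"
    for person_id, publ_id in conn.execute(query):
        if publ_id in titles:
            yield (person_id, titles[publ_id])


result = list(joined_rows())
print(result)
```

The consumer sees one combined result set and does not need to know that one half came from SQL and the other from XML.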
Concrete Application
Concrete Application (Nirvana?)
To conclude: It may be worthwhile to explore these kinds of technologies as well, instead of just sticking to XML-based solutions. Thank you for your attention!