From Data Deluge to Data Curation


Philip Lord, Alison Macdonald, Liz Lyon, David Giaretta
The Digital Archiving Consultancy Limited and the Digital Curation Centre

Abstract

e-Science, or e-Research, enables new forms and layers of research. It generates massive amounts of data at different research stages. Yet the many technologies used also transform data and put its integrity at risk. Readability and usefulness are jeopardized not just by technical factors. Data's future quality (its richness and trustworthiness) is a function of investment in it. But should all data be kept? What other issues arise, and for whom? We highlight findings of the recent e-Science Data Curation report commissioned by JISC with the support of the e-Science Core Programme, and present the Digital Curation Centre, the first of its kind in the world, and its role in providing resources and support for digital curation and research.

1. Introduction

The volume of data being created is growing at an astonishing rate [i]. e-Science, or perhaps more inclusively e-Research, enables a new order of collaborative, more inter-disciplinary research, based on shared research expertise, instruments and computing resources, and, crucially, increasing access to collections of primary research data and information. This is the knowledge base of research.

There are challenges, however: these same technology changes, and the flexibility in use of information technology tools, put the very data they create and transform at risk. They raise serious and complex issues of strategy, policy and practice regarding the creation, management and long-term care of data: its curation. A recent study [ii] commissioned by the JISC Joint Committee for the Support of Research showed that much needs to be done at all levels to enable the data which is being created by this revolution to remain available and valid to future researchers. Much is already being done by the e-Science community, in projects, research and other initiatives, which will be reported at the e-Science All Hands Meeting of 2004.

As part of their response to this problem, the JISC and the e-Science Core Programme are jointly funding the newly established Digital Curation Centre (DCC) [iii]. Its remit is to provide practical guidance and outreach concerning curation, and to undertake research into digital curation. The DCC is the first initiative of its kind in the world, and is expected to become a centre of excellence in the area.

In this paper we highlight some of the technical, strategic and policy findings emerging from the e-Science Data Curation report and discuss the DCC's role in addressing some of the practical challenges.

2. e-Science Curation

This is a relatively new field, and terminologies are not yet stable. We have used the following working definitions of three key activities [iv] [v]:

Curation: The activity of managing and promoting the use of data from its point of creation, to ensure it is fit for contemporary purpose, and available for discovery and reuse. For dynamic datasets this may mean continuous enrichment or updating to keep it fit for purpose. Higher levels of curation will also involve maintaining links with annotation and other published materials.

Archiving: A curation activity which ensures that data is properly selected and stored, that it can be accessed, and that its logical and physical integrity is maintained over time, including security and authenticity.

Preservation: An activity within archiving in which specific items of data are maintained over time so that they can still be accessed and understood through changes in technology.

2.1 Survey Findings

The e-Science Curation report surveyed and reported on the provision of curation for e-Science in the UK, listing some 13 major findings and making ten major recommendations for action at a strategic level.

Strategic and policy level findings are not presented in detail here, but in summary they showed that:

- Urgent action was needed for the UK to capitalize on the opportunities presented by e-Science.
- Action was needed to address a short-term funding regime which militates against the essentially long-term needs of curation.
- Before data-based research can flourish, questions of trust in data (and as it ages) need to be addressed, such as security, confidentiality, ownership, provenance and authenticity.
- Awareness of long-term curation was generally low among research workers, and researchers need to be encouraged to engage more in the curation of their own data.
- Provision of services for curation tended to be patchy, but was more advanced in some areas, particularly the biosciences and big, collaborative science such as astronomy and particle physics.

Areas for further research, debate and action include:

Preservation: How is data to survive the constant changes in information technology, which bring rapid obsolescence of hardware architectures, software and file formats? How do we decide what to keep, and how? Various proposals have been made for addressing this problem, but the area remains one where more work is required, both theoretical and practical.

Awareness and compliance: The viability of data over the longer term depends on awareness. The originators of data, or of annotation, need to be aware of the issues of preservation and curation, and they also need practical guidance if they are to engage in the process. Forums such as the e-Science programme and the All Hands Meetings are opportunities to spread the curation word, and to encourage our audience to do so too. Of course, there needs to be awareness at all levels.

Trust: As we noted above, in a digital environment it is not obvious how to engender trust in data which has been passed on to us. How can we be sure of its provenance, its quality, its freedom from corruption, and its continued privacy and security (where that is an issue, as in medical science)? We need to determine to what extent these are real issues, and for which data. Work is proceeding on a number of fronts; examples include the work being done by Professor Buneman on databases and the provenance question, and the Qurator project, which is looking at tools to help discover and document the quality of information resources. However, we are still a long way from complete solutions.

Data selection: What criteria should be applied when selecting data for longer-term retention? Some data is obviously of unique value, but what else should be kept? Selection introduces uncertainties: how do we know what we should keep? Questions of costs and risks arise. Who sets the selection criteria? How can selection be assessed, when, and by whom? Or should we keep everything, bearing in mind the costs of maintaining it (its curation)?

The work being carried out and the tools being developed, such as in the e-Science projects, will contribute to the practicality, economics and thus viability of curation. Thanks to grids, portals, defined taxonomies and ontologies, users will be able to discover resources (which may include the metadata about data) without having to worry about loading the data, establishing its reliability, or not finding it in the first place because of a spelling error. This work is surely also important for funders: on the one hand it lightens the cost burden entailed in keeping data, and on the other it can protect the value of data generated in research.

Grid infrastructure provides a distributed computing environment which facilitates the creation and analysis of large volumes of data from e-Research experimentation and applications. It creates opportunity for versatility in the data model, as well as opening the knowledge base described above to much broader research communities.
Data curation is an emergent field and an exciting one, with many current areas of active research.

2.2 Curation report recommendations

The report's findings led to endorsement of the creation of a Digital Curation Centre in the UK as part of the national provision of curation facilities. Since the report was drafted the DCC has become a reality, and its programme of work is described briefly in section 4 of this paper. Of the other nine strategic recommendations made, three are of direct relevance to the DCC's programme:

- The production of research-led exemplars to demonstrate and promote the benefits of curation should be co-ordinated by the DCC.
- National and international activities should be initiated to promote incentives which will foster a scientific culture of engagement in curation.
- Educational materials, guidelines and policy documents for researchers need to be developed and publicized.

3. A curation model for e-sciences

The accompanying diagram (Figure 1) shows a model of the newly emergent research knowledge cycle. This has three major components: research and creation, publishing, and the maturing area of curation.

[Figure 1: Model of the Curation Process. Diagram elements: the research process; primary, secondary (derived) and tertiary data; metadata; the peer review process; the curator and the curation process; data repositories, archived e-prints, archives and libraries; and consumers (peers, public, industry). Philip Lord, 2003]

In this model the traditional cycle, in which research findings go through the publishing process and back to consumers (the research community, peers, libraries, the public and industry), is shown to the top and right of the information flow diagram (indicated in blue and white on the diagram, and referred to as Level 1 Curation in the report). More recently, on the research side, this has been augmented by research methods based primarily on the re-use and interpretation of data held in archives (indicated by red in the diagram, and referred to as Level 2 Curation in the report).

This somewhat enhanced cycle is exemplified by the work done by social scientists re-using data held in repositories such as those of the UK Data Archive (UKDA) at the University of Essex. Similar models also appear in the life sciences, and within the arts community too, with the Arts and Humanities Data Service (AHDS), a distributed resource with a central base at King's College London. Another example is the astronomical domain, where two types of data collection are common: observatory mode, where data is taken on behalf of the observer and is processed by the observatory system, and principal-investigator mode, where the observer has hands-on control and processes the data him/herself. The latter case is more likely to pose problems with archiving and curation.

We are now entering a phase where a third level of curation is demanded. In this matured situation, repositories which are actively curated are a reality, rather than mere archival stores. This new part of the information cycle is depicted in the lower left of the diagram (in green). In this phase the data is not merely stored, but is preserved to overcome the technical obsolescence problem noted above, and is subject to revision and enhancement as necessary, perhaps augmented with tools to assist discovery, (re-)exploitation and presentation, such as the use of ontologies.

We note that accompanying this trend to curation there is a parallel movement towards enhanced bibliographic facilities in digital libraries and, even more significantly for the scientific information cycle, an increasing role for enhanced electronic pre-print services (e-prints) and electronic delivery of completed articles. This trend has been described in other work sponsored by JISC in its initiatives under the e-Research Programme [vi] and in the Digital Preservation and Continuing Access Strategy [vii] [viii] [ix].

A good example of a curated resource at this level is the UniProt/Swiss-Prot Protein Knowledgebase. UniProt/Swiss-Prot [x] is an annotated protein sequence database, which was first established in 1986. The knowledgebase contains curated protein sequence information that provides a high level of annotation, a minimal level of redundancy and a high level of integration with other databases. It is a "one-stop shop" that allows easy access to all publicly available information on protein sequence annotation. It is maintained collaboratively by the Swiss Institute for Bioinformatics (SIB) and the European Bioinformatics Institute (EBI), and employs approximately 100 scientists in the curation process. Release 43.6 (21-Jun-04) of the knowledgebase contains 153,320 sequence entries, comprising 56,402,618 amino acids abstracted from 117,067 references.

4. The Digital Curation Centre

The DCC was awarded funding from 1st March 2004. It is based at the National e-Science Centre in Edinburgh and the consortium comprises four partner institutions:

- University of Edinburgh (lead; Informatics, Law, Information Services and research institutes)
- University of Glasgow (HATII and Information Services)
- UKOLN, University of Bath
- Council for the Central Laboratory of the Research Councils (CCLRC)

The DCC aims to provide a comprehensive advisory service, a repository of user tools and knowledge base, outreach and dissemination activities including an e-journal, and an innovative research programme. The DCC is also forming an Associates Network to provide a forum for engaging with the communities of practice and with key organisations working in this area. The Centre is currently gathering information and feedback from disciplinary representatives and users, which will inform its research and development initiatives and begin the process of building a user base and community network.

The DCC is also developing an Approach to Curation which will inform and provide underlying principles and technical standards for the curation activity. The DCC is monitoring existing architecture work and developments elsewhere with the aim of positioning its research and development programmes within the wider landscape. Further information about the DCC is presented in a separate AHM poster [xi].

5. Conclusion

New avenues of research, within which digital data and its continued care and enhancement are central, are now emerging. We can expect these to become part of mainstream research in a few years. To take best advantage of this nationally and to contribute fully internationally, strategic and policy level recommendations have been made. These initiatives are required on both management and technical fronts. Action has already been initiated on some of these, most notably with the founding of the Digital Curation Centre this year, with the objective of supporting the scientific community in taking best advantage of new opportunities.

References

[i] Tony Hey & Anne Trefethen, 2003. The Data Deluge. In: Grid Computing: Making the Global Infrastructure a Reality, Wiley, January 2003. Summary: JISC Senior Management Briefing 2004.
[ii] Lord and Macdonald, 2004. Data Curation for e-Science in the UK. See: http://www.jisc.ac.uk/uploaded_documents/e-ScienceReportFinal.pdf
[iii] See: http://www.dcc.ac.uk
[iv] Hedstrom, M., 1998. Digital Preservation: A Time Bomb for Digital Libraries, Computers and the Humanities (31), no. 3, 189-202.
[v] Cedars, 2002. Cedars Guide to Technical Strategies. See: http://www.leeds.ac.uk/cedars
[vi] eBank UK project. See: http://www.ukoln.ac.uk/projects/ebank-uk/
[vii] Jones, Maggie, 2003. Archiving E-Journals Consultancy: Final Report. See: http://www.jisc.ac.uk/uploaded_documents/ejournalsfinal.pdf
[viii] James, Hamish, et al, 2003. Feasibility and Requirements Study on Preservation of E-Prints. See: http://www.jisc.ac.uk/uploaded_documents/eprints_report_final.pdf
[ix] Parker, Elizabeth, 2003. Study of the Records Lifecycle (revised edition of original first published 1999) (Joint Information Systems Committee). See: http://www.jisc.ac.uk/index.cfm?name=srl_structure
[x] European Bioinformatics Institute, 2004. See: http://www.ebi.ac.uk/swissprot/
[xi] Giaretta, D., Robinson, B., Lyon, L., 2004. Curating for the Future: the work of the Digital Curation Centre. AHM 2004 Poster.