Data Curation for the Long Tail of Science: The Case of Environmental Sciences




Carole L. Palmer, Melissa H. Cragin, P. Bryan Heidorn, Linda C. Smith

Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
{clpalmer, cragin, heidorn, lcsmith}@uiuc.edu

Abstract. Universities and consortial groups need to rationalize how they invest in and manage their growing data assets. While some sciences have organized sharing and deposit activities around standardized disciplinary repositories, the data curation and stewardship needs of sciences that rely on smaller, research-level data collections are less well understood. This paper outlines a research agenda, and a foundational study of data practices and needs of a loosely organized group of environmental scientists, to advance our understanding of the potential for inter-institutional data coordination to promote and preserve the long tail of science. Preliminary findings of the survey component of the study will be reported.

Keywords: data curation, data management, research collections, environmental sciences, ecoinformatics

1 Introduction

The success and promise of national data centers, such as those for Arabidopsis and other model organisms, have demonstrated the critical role of reference-level data initiatives. At the same time, resource collections are maturing in some fields, as evidenced by efforts such as the Biomedical Informatics Research Network (BIRN) (http://www.nbirn.net/) and the Global Biodiversity Information Facility (GBIF) (http://www.gbif.org/) [7]. But we do not yet have a good model for coordinating the large proportion of small-scale scientific projects producing research-level data collections (footnote 1). We theorize that if we were to plot data collections by size we would see a moderate number of very large collections and then a long tail of smaller collections. We can view this as the long tail of science data. These long tail collections in aggregate are highly heterogeneous and tend to be isolated in scientists' offices and laboratories, yet they account for a substantial portion of the data assets at any given research university.

Footnote 1: For an explanation of the distinction among reference, resource, and research level collections, see the National Science Board's (2005) report, Long-lived digital data collections: Enabling research and education in the 21st century (http://www.nsf.gov/pubs/2005/nsb0540/).

Anderson [1] coined the term "the long tail" to describe the power of collective small business markets, and later Dempsey [3] aptly applied the concept to libraries, noting that success of this model requires that consumers (or readers) have access to and be aware of long tail products (materials), and that services to match supply with demand must also be in place. We believe this model also applies to scientific data. Long tail sciences generate small but numerous data collections. The questions are how to unleash the market potential of these data collections and how to lower the barriers to access and reuse. This requires much more than putting the information on the web. Data curation and stewardship are needed to manage and add value to the data collections produced by long tail science, and to facilitate their integration with other coordinated data collections.

Karasti, Baker, & Halkola make a valuable distinction between data curation and data stewardship, stating that they have different views about the nature of data, their life cycles and relations with their environments of science conduct [5, p. 352]. They argue that curation activities characterized in the e-Science literature focus on ingest, archive and delivery, whereas stewardship activities span data planning to sampling, from data archive to use and reuse including both data care and information infrastructure work [5, p. 352]. While we hold that curation includes a much broader spectrum of activities, both data curation and stewardship are necessary to maintain the long-term usefulness of data. Importantly, while large-volume and more homogeneous data collections are now relatively well curated (such as in astronomy and seismology), the more specialized and heterogeneous, but small and numerous, collections are not being well served.

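The rank-size picture described above can be illustrated with a small sketch. It is not drawn from the study's data: the number of collections, the size of the largest collection, and the Zipf-like rule (the k-th largest collection has size largest / k) are all hypothetical assumptions chosen only to show how a head/tail split might be summarized.

```python
def zipf_sizes(n_collections: int, largest: float, exponent: float = 1.0) -> list[float]:
    """Hypothetical rank-size rule: the k-th largest collection
    has size largest / k**exponent. This is an assumption for
    illustration, not a finding of the paper."""
    return [largest / (rank ** exponent) for rank in range(1, n_collections + 1)]


def tail_summary(sizes: list[float], head_count: int) -> dict[str, float]:
    """Split collections into a 'head' (the head_count largest) and a
    'tail' (everything else), then report how much the tail holds."""
    ordered = sorted(sizes, reverse=True)  # largest collections first
    total = sum(ordered)
    tail = ordered[head_count:]
    return {
        "tail_collections": len(tail),
        "tail_share_of_collections": len(tail) / len(ordered),
        "tail_share_of_volume": sum(tail) / total,
    }


# Assumed example: 1000 collections, the 10 largest treated as the head.
sizes = zipf_sizes(n_collections=1000, largest=1000.0)
summary = tail_summary(sizes, head_count=10)
print(summary)
```

Under these assumed parameters, the 990 smallest collections together hold more than half of the total volume, which is the sense in which a long tail can account for a substantial portion of an institution's data assets. A different exponent or head size would change the split.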
We believe there is a need for a mix of data management solutions to address the range of collections, from large, disciplinary collections with relatively homogeneous data to cross-institutional, centralized or distributed heterogeneous collections. Institutional repository (IR) efforts at a number of universities are beginning to explore ways to support local researchers and laboratories with data curation and management. Just as collaborative models will be necessary for collection development activities [2], we believe that coordinated or consortial IR initiatives can provide the economy of scale needed for data curation and stewardship services to improve access, preservation, and use of research-level, long-tail data collections.

2 Research Questions

To better understand the potential of cross-institutional curation for research-level data collections, the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) is undertaking a series of studies. The initiative includes a study of IR development models and a project to develop data curation profiles across scientific research domains. Over the long term our studies aim to answer the following research questions: How long is the tail? Which disciplines or research areas are tail-dependent, requiring high levels of data integration? What is it about the data or the science that makes them less likely candidates for resource or reference federations? How can data collections best be represented, in terms of collection granularity and data description, to support integration and reuse? How can research-level collections best share and exchange data with existing resource- and reference-level collections? And, ultimately, what is the impact of accessibility to the long tail of data for the conduct of science?

In this paper we focus on one project the DCEP is conducting in cooperation with the Environmental Council (EC) at the University of Illinois at Urbana-Champaign (UIUC) to investigate the data curation needs among the Faculty of the Environment, a campus-level organization that works to build the University's capacity for leadership in environmental discovery, learning, and public engagement and to promote interdisciplinary discovery and learning. The Faculty of the Environment consists of approximately 400 members from all colleges and most departments at the University; they have an impressive variety of research interests and are dedicated to environmental excellence. Many EC researchers participate in long tail of science projects, collecting and analyzing data that are essential to solving critical problems related to climate change, energy, and ecology. The DCEP and the EC are working together to understand how current data sets or collections are being used across UIUC, and the nature and extent of associated data management practices and problems.

This study will serve as the foundation for further investigations across regional institutions to address a more specific set of research questions related to environmental science curation and stewardship: What facets of data are of particular value to environmental and ecoinformatics research questions and projects? How can curation of data sets be extended and refined to encourage integration and reuse for ongoing ecological and environmental research? What are the collecting and reuse relationships among local research collections and disciplinary and national collections? How can curation encourage better data exchange among research, reference, and resource collections?

3 Research Approach

The study uses complementary techniques to gather data from research scientists affiliated with the EC. A campus-based survey is currently underway and is being pre-tested with representative earth scientists and sociologists. This will be followed by a pilot test of the revised survey on a larger sample of 8-12 researchers to allow for fine-tuning of the survey questions for web delivery. The remaining members of the Faculty of the Environment (approximately 390) will be invited by e-mail to participate in the survey. One question on the survey will ask for volunteers to participate in follow-up interviews. Semi-structured interviews will be conducted with this subsample. The two techniques will support gathering first broad, and then more focused, details on the specifics of respondents' data management activities and needs.

Preliminary results will be reported on data types, current in-house data management activities and outsourcing, how and when data are used and shared, data archiving plans, as well as how data curation professionals can contribute to research operations and the management of valued data for long-term use.

4 Environmental Sciences as Exemplar Case

Because environmental sciences are multidisciplinary and rely on data collected in various ways to answer a broad range of research questions, they can profit greatly from more prudent management of the data resulting from long-tail science. Long Term Ecological Research (LTER) sites and the National Ecological Observatory Network (NEON) are projects where long tail data management is a critical issue. Moreover, environmental data are often used by constituencies beyond academic researchers and data managers. Audiences or stakeholders in environmental research include citizen scientists, politicians and policy makers, businesses and the general public [6].

Problem-oriented research domains, such as biofuels and land use research, require a large amount of data integration and interoperability to address questions that span multiple disciplines. These areas of research stretch across scales from the molecular to the ecosystem and across landscape scales from greenhouse growing trays to large multiuse land tracts, from agricultural to urban settings. Researchers from many fields of science need to work with and understand each other's data to design intelligent experimentation and inform land use planning. Examples of research questions include: What happens to land, water and atmosphere if we replace one land use with another? What crops can produce maximum sustainable energy yield for a given environment?

5 Conclusion

The results from this study will be used most immediately by the EC to develop centralized indexing, data storage, and support services to lower the barriers to retention and access. In addition, the DCEP project will use the findings as a basis for curriculum planning and course development for training data professionals. Most importantly, this preliminary study will provide insights into the data needs across the long tail of science and related data curation and stewardship requirements. Data integration and reuse in the environmental sciences will require effective data curation and stewardship [4, 5]. The EC case will serve as an exemplar of research-level data collection practices, problems, and potentials, advancing the longer-term research agenda on inter-institutional data coordination discussed above.

Acknowledgments. We acknowledge our co-author Bryan Heidorn for sharing his ongoing ideas about long tail science. This work was supported in part by a grant from the Institute of Museum and Library Services (RE-05-06-0036-06) and a University of Illinois Environmental Council, Earth and Society grant.

References

1. Anderson, C.: The Long Tail. Wired, Issue 12.10, Oct. (2004). Available: http://www.wired.com/wired/archive/12.10/tail.html
2. Day, M., Pennock, M., Allinson, J.: Cooperation for Digital Preservation and Curation: Collaboration for Collection Development in Institutional Repository Networks. DigCCurr2007: An International Symposium on Digital Curation, April 18-20, 2007, Chapel Hill, NC (2007). Available: http://www.ils.unc.edu/digccurr2007/papers/daypennock_paper_9-3.pdf
3. Dempsey, L.: Libraries and the Long Tail. D-Lib Magazine, 12(4), April (2006). Available: http://www.dlib.org/dlib/april06/dempsey/04dempsey.html
4. Heidorn, P.B., Palmer, C.L., Cragin, M.H., Smith, L.C.: Data Curation Education and Biological Information Specialists. DigCCurr2007: An International Symposium on Digital Curation, April 18-20, 2007, Chapel Hill, NC (2007). Available: http://www.ils.unc.edu/digccurr2007/papers/heidornetal_paper_8-2.pdf
5. Karasti, H., Baker, K.S., Halkola, E.: Enriching the notion of data curation in e-science: Data managing and information infrastructuring in the Long Term Ecological Research (LTER) network. CSCW, 15(4), 321-358 (2006).
6. Van House, N.A., Butler, M., Schiff, L.: Cooperative knowledge work and practices of trust: Sharing environmental planning data sets. CSCW '98: Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, 335-343 (1998).
7. Wooley, J.C., Lin, H. (Eds.): Chapter 4. Catalyzing Inquiry at the Interface of Computing and Biology (pp. 57-115). Washington, D.C.: National Academies Press (2005).