Data Curation for the Long Tail of Science: The Case of Environmental Sciences

Carole L. Palmer, Melissa H. Cragin, P. Bryan Heidorn, Linda C. Smith
Graduate School of Library and Information Science
University of Illinois at Urbana-Champaign
{clpalmer, cragin, heidorn, lcsmith}@uiuc.edu

Abstract. Universities and consortial groups need to rationalize how they invest in and manage their growing data assets. While some sciences have organized sharing and deposit activities around standardized disciplinary repositories, the data curation and stewardship needs of sciences that rely on smaller, research-level data collections are less well understood. This paper outlines a research agenda, and a foundational study of data practices and needs of a loosely organized group of environmental scientists, to advance our understanding of the potential for inter-institutional data coordination to promote and preserve the long tail of science. Preliminary findings of the survey component of the study will be reported.

Keywords: data curation, data management, research collections, environmental sciences, ecoinformatics

1 Introduction

The success and promise of national data centers, such as those for Arabidopsis and other model organisms, have demonstrated the critical role of reference level data initiatives. At the same time, resource collections are maturing in some fields, as evidenced by efforts such as the Biomedical Informatics Research Network (BIRN) (http://www.nbirn.net/) and the Global Biodiversity Information Facility (GBIF) (http://www.gbif.org/) [7]. But we do not yet have a good model for coordinating the large proportion of small-scale scientific projects producing research level data collections.¹ We theorize that if we were to plot data collections by size, we would see a moderate number of very large collections and then a long tail of smaller collections. We can view this as the long tail of science data.
These long tail collections in aggregate are highly heterogeneous and tend to be isolated in scientists' offices and laboratories, yet they account for a substantial portion of the data assets at any given research university. Anderson [1] coined "the long tail" to describe the power of collective small business markets, and later Dempsey [3] aptly applied the concept to libraries, noting that success of this model requires that consumers (or readers) have access to and be aware of long tail products (materials), and that services to match supply with demand must also be in place. We believe this model also applies to scientific data. Long tail sciences generate small but numerous data collections. The questions are how to unleash the market potential of these data collections and lower the barrier to access and reuse. This requires much more than putting the information on the web.

¹ For an explanation of the distinction among reference, resource, and research level collections, see the National Science Board's (2005) report, Long-lived digital data collections: Enabling research and education in the 21st century (http://www.nsf.gov/pubs/2005/nsb0540/).

Data curation and stewardship are needed to manage and add value to the data collections produced by long tail science, and to facilitate their integration with other coordinated data collections. Karasti, Baker, & Halkola make a valuable distinction between data curation and data stewardship, stating that they have different views about the nature of data, their life cycles and relations with their environments of science conduct [5, p. 352]. They argue that curation activities characterized in the e-Science literature focus on ingest, archive and delivery, whereas stewardship activities span data planning to sampling, from data archive to use and reuse including both data care and information infrastructure work [5, p. 352]. While we hold that curation includes a much broader spectrum of activities, both data curation and stewardship are necessary to maintain the long-term usefulness of data. Importantly, while large-volume and more homogeneous data collections are now relatively well curated (such as in astronomy and seismology), the more specialized and heterogeneous, but small and numerous, collections are not being well served.
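As a rough illustration of the long-tail picture described earlier in this section (collections ranked by size), the following sketch generates hypothetical collection sizes and measures how much of the total volume lies outside the few largest collections. The Zipf-like rank-size law, the number of collections, and the function names `rank_size` and `tail_share` are all assumptions made for illustration; they are not measurements or findings from this study.

```python
def rank_size(n_collections: int, largest: float, exponent: float = 1.0) -> list[float]:
    """Hypothetical Zipf-like law: size of the r-th largest collection.

    ASSUMPTION: real collection sizes may follow a different distribution;
    this law is used only to draw a plausible long-tailed curve.
    """
    return [largest / (rank ** exponent) for rank in range(1, n_collections + 1)]


def tail_share(sizes: list[float], head_count: int) -> float:
    """Fraction of total volume held outside the `head_count` largest collections."""
    ordered = sorted(sizes, reverse=True)
    total = sum(ordered)
    return sum(ordered[head_count:]) / total


# Illustrative values only: 1000 collections, the largest of size 1000 (arbitrary units).
sizes = rank_size(n_collections=1000, largest=1000.0)
share = tail_share(sizes, head_count=10)
print(f"Share of data held outside the 10 largest collections: {share:.0%}")
```

Under these assumed parameters, the many small collections together hold more than half of the total volume, which is the sense in which the tail can matter in aggregate. Changing `exponent` or `head_count` shows how sensitive that share is to the shape of the distribution.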
We believe there is a need for a mix of data management solutions to address the range of collections, from large, disciplinary collections with relatively homogeneous data to cross-institutional, centralized or distributed heterogeneous collections. Institutional repository (IR) efforts at a number of universities are beginning to explore ways to support local researchers and laboratories with data curation and management. Just as collaborative models will be necessary for collection development activities [2], we believe that coordinated or consortial IR initiatives can provide the economy of scale needed for data curation and stewardship services to improve access, preservation, and use of research-level, long-tail data collections.

2 Research Questions

To better understand the potential of cross-institutional curation for research level data collections, the Data Curation Education Program (DCEP) at the Graduate School of Library and Information Science (GSLIS) is undertaking a series of studies. The initiative includes a study of IR development models and a project to develop data curation profiles across scientific research domains. Over the long term our studies aim to answer the following research questions:

How long is the tail?
Which disciplines or research areas are tail-dependent, requiring high levels of data integration?
What is it about the data or the science that makes them less likely candidates for resource or reference federations?
How can data collections best be represented, in terms of collection granularity and data description, to support integration and reuse?
How can research level collections best share and exchange data with existing resource and reference level collections?
And, ultimately, what is the impact of accessibility to the long tail of data for the conduct of science?

In this paper we focus on one project the DCEP is conducting in cooperation with the Environmental Council (EC) at the University of Illinois at Urbana-Champaign (UIUC) to investigate the data curation needs among the Faculty of the Environment, a campus-level organization that works to build the University's capacity for leadership in environmental discovery, learning, and public engagement and to promote interdisciplinary discovery and learning. The Faculty of the Environment consists of approximately 400 members from all colleges and most departments at the University; they have an impressive variety of research interests and are dedicated to environmental excellence. Many EC researchers participate in long tail of science projects, collecting and analyzing data that are essential to solving critical problems related to climate change, energy, and ecology. The DCEP and the EC are working together to understand how current data sets or collections are being used across UIUC, and the nature and extent of associated data management practices and problems.

This study will serve as the foundation for further investigations across regional institutions to address a more specific set of research questions related to environmental science curation and stewardship:

What facets of data are of particular value to environmental and ecoinformatics research questions and projects?
How can curation of data sets be extended and refined to encourage integration and reuse for ongoing ecological and environmental research?
What are the collecting and reuse relationships among local research collections and disciplinary and national collections?
How can curation encourage better data exchange among research, reference, and resource collections?

3 Research Approach

The study uses complementary techniques to gather data from research scientists affiliated with the EC. A campus-based survey is currently underway; it is being pre-tested with representative earth scientists and sociologists. This will be followed by a pilot test of the revised survey on a larger sample of 8-12 researchers to allow for fine-tuning of the survey questions for web delivery. The remaining approximately 390 members of the Faculty of the Environment will be invited by e-mail to participate in the survey. One question on the survey will ask for volunteers to participate in follow-up interviews. Semi-structured interviews will be conducted with this subsample. The two techniques will support gathering first broad and then more focused details on the specifics of respondents' data management activities and needs.

Preliminary results will be reported on data types, current in-house data management activities and outsourcing, how and when data are used and shared, data archiving plans, as well as how data curation professionals can contribute to research operations and the management of valued data for long-term use.

4 Environmental Sciences as Exemplar Case

Because environmental sciences are multidisciplinary and rely on data collected in various ways to answer a broad range of research questions, they can profit greatly from more prudent management of the data resulting from long-tail science. Long Term Ecological Research (LTER) sites and the National Ecological Observatory Network (NEON) are projects where long tail data management is a critical issue. Moreover, environmental data are often used by constituencies beyond academic researchers and data managers. Audiences or stakeholders in environmental research include citizen scientists, politicians and policy makers, businesses, and the general public [6].

Problem-oriented research domains, such as biofuels and land use research, require a large amount of data integration and interoperability to address questions that span multiple disciplines. These areas of research stretch across scales from the molecular to the ecosystem and across landscape scales from greenhouse growing trays to large multiuse land tracts, from agricultural to urban settings. Researchers from many fields of science need to work with and understand each other's data to design intelligent experimentation and inform land use planning. Examples of research questions include: What happens to land, water, and atmosphere if we replace one land use with another? What crops can produce maximum sustainable energy yield for a given environment?

5 Conclusion

The results from this study will be used most immediately by the EC to develop centralized indexing, data storage, and support services to lower the barriers to retention and access.
In addition, the DCEP project will use the findings as a basis for curriculum planning and course development for training data professionals. Most importantly, this preliminary study will provide insights into the data needs across the long tail of science and related data curation and stewardship requirements. Data integration and reuse in the environmental sciences will require effective data curation and stewardship [4, 5]. The EC case will serve as an exemplar of research level data collection practices, problems, and potentials, advancing the longer-term research agenda on inter-institutional data coordination discussed above.

Acknowledgments. We acknowledge our co-author Bryan Heidorn for sharing his ongoing ideas about long tail science. This work was supported in part by a grant from the Institute of Museum and Library Services RE-05-06-0036-06 and a University of Illinois Environmental Council, Earth and Society grant.

References

1. Anderson, C.: The Long Tail. Wired, Issue 12.10, Oct. (2004). Available: http://www.wired.com/wired/archive/12.10/tail.html
2. Day, M., Pennock, M., Allinson, J.: Cooperation for Digital Preservation and Curation: Collaboration for Collection Development in Institutional Repository Networks. DigCCurr2007: An International Symposium on Digital Curation, April 18-20, 2007, Chapel Hill, NC. (2007). http://www.ils.unc.edu/digccurr2007/papers/daypennock_paper_9-3.pdf
3. Dempsey, L.: Libraries and the Long Tail. D-Lib Magazine, 12(4), April, (2006). Available: http://www.dlib.org/dlib/april06/dempsey/04dempsey.html
4. Heidorn, P.B., Palmer, C.L., Cragin, M.H., Smith, L.C.: Data Curation Education and Biological Information Specialists. DigCCurr2007: An International Symposium on Digital Curation, April 18-20, 2007, Chapel Hill, NC. (2007). http://www.ils.unc.edu/digccurr2007/papers/heidornetal_paper_8-2.pdf
5. Karasti, H., Baker, K.S., Halkola, E.: Enriching the notion of data curation in e-science: Data managing and information infrastructuring in the Long Term Ecological Research (LTER) network. CSCW, 15(4), 321--358 (2006).
6. Van House, N.A., Butler, M., Schiff, L.: Cooperative knowledge work and practices of trust: Sharing environmental planning data sets. CSCW '98: Proceedings of the 1998 ACM Conference on Computer Supported Cooperative Work, 335--343 (1998).
7. Wooley, J.C., Lin, H. (Eds.): Chapter 4. Catalyzing Inquiry at the Interface of Computing and Biology (pp. 57--115). Washington, D.C.: National Academies Press (2005).