INTRODUCTION TO DATA MANAGEMENT

By Michelle Lloyd, Kate Crosby, Peter Lawton
Data Management Team
Canadian Healthy Oceans Network
November 2013

Approved and endorsed by the Canadian Healthy Oceans Network Scientific Advisory Committee (SAC) and Board of Directors

Copyright by Canadian Healthy Oceans Network, 2008-2014
TABLE OF CONTENTS

Introduction
Responsibilities and Benefits of a structured approach to data management
So what is all the fuss about METADATA?
What are we asking CHONe researchers to do?

INTRODUCTION

Management of marine ecological data is a complex and daunting task, especially given the variety of data types that may be acquired in scientific studies (genetic, ecological, taxonomic, oceanographic, geological, geographical, etc.). In addition to managing data within the original time frame and scope of ecological studies, there is also the issue of ensuring the legacy value of newly acquired data and derived information products. Legacy in this context refers both to the protection of the new information from loss (e.g. through a comprehensive backup strategy), and to the extent to which the data can be reused by others in the future (e.g. by storing contextual information (metadata) in an accessible format that lets others know what, where, when, why, and how the original data were collected).

Given the national scope and diversity of its research projects, the Canadian Healthy Oceans Network (CHONe) faces all the above challenges and more. Prior efforts to develop a comprehensive data management approach within CHONe were not successful, due to a number of interacting factors. Nonetheless, as we enter the final stages of the first research network and begin to prepare proposals for a renewal, it is critical that we undertake a data management and data rescue process. To achieve this goal, CHONe has recently hired two personnel, Michelle Lloyd and Kate Crosby, as Data Management Coordinator and Data Manager, respectively.
They, along with guidance from Peter Lawton (co-theme leader for the Marine Biodiversity theme), form the Data Management Team. The data management process must be completed by March 2014 to meet CHONe funding [...] gauge the range and magnitude of issues facing CHONe in meeting some of its end-of-program obligations to the Natural Sciences and Engineering Research Council (NSERC) [...] with research project and subproject outputs (theses, publications, thesis chapters) accessible through CHONe was found to be incomplete, and research data documentation and metadata were incomplete, out of date and/or in some cases non-existent. These issues have complicated, and in some cases halted, our data management progress.

Terms that appear in bold are typically words or phrases that have a specific meaning with respect to the data management approach. These terms, and various other abbreviations used in the text, are defined in a Glossary.

[...]-government research network undertaking a broad range of marine biodiversity-[...] oceans, CHONe has to address the problem of archiving and reusing diverse ecological data. Recent surveys have estimated that only 1% of ecological data are accessibly archived for reuse (Reichman et al. 2011). CHONe aims to have a searchable and widely available Discovery Database and Ocean Biogeographic Information System (OBIS) [...] Discovery Database will contain discovery metadata from all research projects and subprojects, while the OBIS [...] marine species. Both databases will provide a link to and/or the location of complete data packages that may be accessed through Public Digital Repositories (e.g. DataONE, Data Dryad, Figshare, Integrated Science Data Management (ISDM)). Researchers will be able to use CHONe data to infer larger regional and global patterns about marine biodiversity, ecosystem function and population connectivity, and begin to examine the effects of cumulative impacts and risks for ocean sustainability.

RESPONSIBILITIES AND BENEFITS OF A STRUCTURED APPROACH TO DATA MANAGEMENT

Data management refers to all aspects of creating, storing, delivering, maintaining, archiving and preserving data. It is one of the essential areas of responsible research conduct (Whitlock et al. 2010). From an individual project standpoint, data management approaches may be user-specific, minimally structured and minimally documented (Wallis et al. 2013), yet still meet the needs of the researcher in completing and publishing a particular study.

The benefits of a structured data management approach include:

1. Meeting NSERC Strategic Network Grant requirements;
2. Enabling reproducibility;
3. Increasing research efficiency and organization among projects within CHONe, as well as subprojects within projects;
4. Ensuring research metadata and data are accurate, complete, authentic and reliable;
5. Enabling the development of CHONe summary products from structured data and metadata that can be applied across projects;
6. Saving time and resources in the long run;
7. Enhancing validation and quality control of the data;
8. Enhancing data durability and minimising the risk of data loss;
9. Preventing duplication of effort by enabling others to re-use data; and,
10. Complying with practices conducted in industry and commerce.

[...] research discovery, private data repository, data submission, and data sharing and permanent archival, and timeline, to ensure data is accessible, understandable and reusable by CHONe partners or other users.

So what is all the fuss about METADATA?

Metadata is the data (i.e. documentation and information) about research data. One of the founding principles of science is reproducibility and replication; however, ecological studies are not easily reproduced or replicated (Reichman et al. 2011). While the fields of evolution and genetics have been archiving and re-using data in centralized data repositories for ~30 years, ecology, because of its diverse nature, is still developing its own framework for data archival and retrieval. Meta-analyses make use of past genetic, ecological, taxonomic, oceanographic, geological, and geographical data and metadata to infer larger regional and global patterns. Without metadata to explain the data, the interpretation and merging of data records becomes impossible, or at the very least an arduous task. Here are three examples of different levels of metadata:

Example 1
Consider a simple data record with no column headers. Without metadata, a data record like this one is useless. In addition, there is no information about where the data were collected, the focal organism or system, or the identity of the data owner.

Example 2
Now the simple data record contains column headers. We know the day, month, year and time the samples were collected, but what do d, T, S, Fl, w, v, u, w, v, u, RI, SN, and A stand for? The lack of adequate metadata makes this data record useless to anyone other than the owner.
Example 3
Now the simple data record contains defined column headers, including the variable units, and a corresponding metadata record that identifies the data record owner, where, when, why and how the data were collected, who funded the research, etc.

[Figure: Data Record and Metadata Record]
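The difference between Example 2 and Example 3 can be sketched as a simple check of column headers against a data dictionary. This is only an illustration: the `ColumnDefinition` class, the `undefined_columns` function, and in particular the meanings and units assigned to `d` and `T` are assumptions made for this sketch, not the definitions used in the original CHONe data record.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class ColumnDefinition:
    """One data dictionary entry: what a column header means and its unit."""
    description: str
    unit: str


def undefined_columns(headers: List[str],
                      data_dictionary: Dict[str, ColumnDefinition]) -> List[str]:
    """Return the headers that have no entry in the data dictionary."""
    return [h for h in headers if h not in data_dictionary]


# Hypothetical headers in the spirit of Example 2: terse codes only.
headers = ["day", "month", "year", "time", "d", "T", "S"]

# Hypothetical data dictionary. The descriptions and units are illustrative
# assumptions; "S" is deliberately left undefined.
data_dictionary = {
    "day": ColumnDefinition("day of month the sample was collected", "1-31"),
    "month": ColumnDefinition("month the sample was collected", "1-12"),
    "year": ColumnDefinition("year the sample was collected", "YYYY"),
    "time": ColumnDefinition("local time the sample was collected", "HH:MM"),
    "d": ColumnDefinition("sampling depth", "m"),
    "T": ColumnDefinition("water temperature", "degrees Celsius"),
}

# Any header listed here would leave a future user guessing, as in Example 2.
print(undefined_columns(headers, data_dictionary))
```

A check of this kind only confirms that each column has a definition. It cannot confirm that the definition is correct, which is why the metadata must come from the researcher who collected the data.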
See Dryad data package or Appendix 1 and 3 for more examples.

What are we asking CHONe researchers to do?

[...] researchers (i.e. investigators, students and postdoctoral fellows) for all CHONe-related research projects and subprojects. The data management team has estimated that 36 project and >60 subproject discovery metadata records will need to be collected, and that >100 data packages must be collected by March 2014 (or 2 years after completion of data collection, whichever comes last) (Figure 1).
Within this context, project denotes the 36 project grants awarded to CHONe investigators. Subproject refers to individual student projects and other initiatives within those 36 projects. As of June 2013 there were ~40 completed and 58 ongoing research subprojects. The data management team will first target those researchers who have completed their research projects and subprojects.

Figure 1. Anticipated number of discovery metadata records to accompany each data record.

The data management process is multi-stepped, requiring researchers to complete 4 tasks:

1. Update researcher information and research project or subproject description.
2. Complete an online standardized discovery metadata survey for each project or subproject (<15 minutes to complete). [1]
3. [...] data record) for each subproject (or project if no subprojects exist) to the CHONe Data Repository to back up and secure the data against loss, unless your research data is already archived in a publicly accessible repository, in which case provide us with the permanent DOI link to and/or location of your data. [2]
4. Submit the corresponding metadata record and documentation for each data record. Only researchers really know their data. The metadata record must be accurate, complete, authentic and reliable.

See Data Management Plan for further details.

[1] Students and postdoctoral fellows MUST have their investigator approve their surveys.
[2] Should CHONe's funding be renewed, the raw data record and processed data record will be uploaded regularly using versioning software to prevent accidental loss.
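The four tasks and the first footnote can be read as a checklist for each subproject. The sketch below is one hypothetical way to track it; the `Submission` class, its field names, the `outstanding_tasks` function and the placeholder DOI string are all assumptions for illustration and do not describe any actual CHONe system.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Submission:
    """Hypothetical progress record for one project or subproject."""
    researcher_role: str                           # "investigator", "student" or "postdoc"
    description_updated: bool = False              # task 1
    survey_completed: bool = False                 # task 2
    survey_approved_by_investigator: bool = False  # footnote 1
    data_uploaded: bool = False                    # task 3, upload to the repository
    public_archive_doi: Optional[str] = None       # task 3, already publicly archived
    metadata_submitted: bool = False               # task 4


def outstanding_tasks(s: Submission) -> List[str]:
    """List the tasks that still need attention, in task order."""
    todo = []
    if not s.description_updated:
        todo.append("1: update researcher information and description")
    if not s.survey_completed:
        todo.append("2: complete the discovery metadata survey")
    elif (s.researcher_role in ("student", "postdoc")
          and not s.survey_approved_by_investigator):
        todo.append("2: obtain investigator approval of the survey")
    # Task 3 is met either by an upload or by a DOI/location of archived data.
    if not (s.data_uploaded or s.public_archive_doi):
        todo.append("3: upload the data record or provide its DOI/location")
    if not s.metadata_submitted:
        todo.append("4: submit the metadata record and documentation")
    return todo


# Usage: a student whose data is already publicly archived (placeholder DOI).
example = Submission(
    researcher_role="student",
    description_updated=True,
    survey_completed=True,
    public_archive_doi="doi:placeholder",
)
for task in outstanding_tasks(example):
    print(task)
```

Treating task 3 as satisfied by either an upload or an existing DOI mirrors the exception stated in the task list; how approval and archiving are actually verified is outside the scope of this sketch.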