Research Data Archival Guidelines

LEROY MWANZIA
RESEARCH METHODS GROUP
APRIL 2012

Table of Contents

1 World Agroforestry Centre's Mission and Research Data
2 Definitions
3 Guidance
  3.1 Responsibilities for Preparing Datasets for Archiving
  3.2 Type of Dataset Archived on the ICRAF Dataverse
  3.3 When to Archive a Dataset
  3.4 Documentation
    3.4.1 Cataloging Information (Metadata)
    3.4.2 Additional Documentation
  3.5 Data File Formats
    3.5.1 Size Limitation
    3.5.2 Tabular Data
    3.5.3 Preparation of CSV and TAB Metadata Files
  3.6 Confidentiality
Appendix 1: Dataset Cataloging Information Template
Appendix 2: Description of Recommended Cataloging Information

1 World Agroforestry Centre's Mission and Research Data

The World Agroforestry Centre's (ICRAF) mission is to generate science-based knowledge about the diverse roles that trees play in agricultural landscapes and to use its research to advance policies and practices that benefit the poor and the environment. The Centre sees the dissemination of both research data and the knowledge derived from it as central to fulfilling this mission. Data generated from publicly funded research is essentially an International Public Good and should be made publicly available. The Centre has endorsed an 'Open Access' policy for research data that is in line with the CGIAR Principles on the Management of Intellectual Assets.

To ensure that research data is available, the Centre has set up a publicly accessible Dataverse repository where scientists deposit copies of final research outputs along with 'metadata' (information that describes the deposited items). The function of the repository is to make research data available (can be found) and accessible (can be used) by ensuring that these data objects are searchable, preserved, described by appropriate metadata, assigned appropriate access rights and provided with official citations. Metadata of all research data is made public, whereas the access rights to the actual data will be decided jointly by the data author, the project leader and the regional coordinator.

2 Definitions:

Data author: The individual(s) responsible for generating the data and for the work's substantive and intellectual content. For large research projects this could include the following individuals:
o People who substantially participated in the design of the study
o People who facilitated the data collection (field coordinators)
o People who facilitated the data validation

Primary data: Raw, verified data that has been obtained directly from the source. The data can be captured through experiments, surveys, interviews, focus groups, or other direct interactions with individuals in the field. Primary data does not include any analysis of the data.

Secondary data: Pre-existing data not gathered or collected by the authors of the current research project. Usually it has been collected by another organization or source, or taken from government publications.

Metadata: A set of data that describes and gives information about the dataset. In our data repository this is also referred to as cataloging information.

Dataverse: An online data repository. It is an application for publishing, sharing, referencing, extracting and analyzing research data. A dataverse is hosted on a website known as a Dataverse network. The World Agroforestry Centre dataverse can be found at http://dvn.iq.harvard.edu/dvn/dv/icraf.

3 Guidance:

3.1 Responsibilities for Preparing Datasets for Archiving

Data authors are responsible for ensuring the following:
o That the dataset has been cleaned and verified for correctness.

o That adequate dataset documentation is provided: This will include metadata provided online through the dataset cataloging information form that will be available for submission on the RMG website (see Appendix 1 for the required fields) and key study documents such as research methods, questionnaires, unpublished reports and manuals.
o That personal identifiers are eliminated by modifying data elements so that individual participants cannot be identified.

3.1.1 Examples of datasets

Do's: Comma-separated or tab-separated text files, Excel, SPSS or STATA files (see Figure 2 and Figure 5).

[Figure 1 - Tab-separated dataset]
[Figure 2 - Excel dataset]

Don'ts: Datasets submitted as PDF or Word documents, figures or analysis (see Figure 3 and Figure 4).

[Figure 3 - Excel figures]
[Figure 4 - Dataset in Word document]
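
The Do's and Don'ts in section 3.1.1 can be sketched as a simple pre-submission check. This is only an illustration: the function name classify_submission and the two extension lists are assumptions made for this example and are not part of any RMG or Dataverse tool.

```python
from pathlib import Path

# Assumed extension lists, derived from the Do's and Don'ts above.
# They are illustrative only and can be adjusted to local practice.
DATASET_EXTENSIONS = {".csv", ".tab", ".txt", ".xls", ".xlsx", ".sav", ".dta"}
DOCUMENT_EXTENSIONS = {".pdf", ".doc", ".docx"}


def classify_submission(filename: str) -> str:
    """Return 'dataset', 'document' or 'unknown' based on the file extension."""
    extension = Path(filename).suffix.lower()
    if extension in DATASET_EXTENSIONS:
        return "dataset"
    if extension in DOCUMENT_EXTENSIONS:
        return "document"
    return "unknown"
```

A file classed as 'document' can still be archived as supporting documentation; it should simply not be submitted in place of the dataset itself.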

3.2 Type of Dataset Archived on the ICRAF Dataverse

The Centre's Dataverse repository is used to archive two major categories of data:
1. Primary data used in the production of a publication.
2. Unpublished datasets that are described by:
   o Materials and methods
   o A clear description of the variables presented (please see Figure 5)
   o Supporting unpublished reports (e.g. country reports)
   o Any other relevant material

[Figure 5 - Variable description]

3.3 When to Archive a Dataset

Complete datasets can be archived in the data repository at any time; however, a dataset should be archived no later than the following points:
1. When a publication with primary data is submitted to the ICRAF library for dissemination, the accompanying data should also be submitted to the Research Methods Group (RMG) for archival.
2. When a major study is starting, e.g. a baseline survey or multi-country survey, a dataverse study will be set up by RMG to facilitate easy data management and archiving. Data and supporting documentation such as methods and questionnaires can then be archived as soon as they are made available. Please see the Initial Setup of Dataverse Study for Project Data Management document on the RMG website, http://www.worldagroforestry.org/research-methods/data-management.

3.4 Documentation

3.4.1 Cataloging Information (Metadata)

The data author will provide cataloging information to RMG when submitting data for archival, using the dataset cataloging information template document (see Appendix 1). In dataverse this information is entered in the Cataloging Information tab. It includes information such as:
o Title of Survey/Project/Study
o Original publication (where relevant)
o Authors: Person responsible for the data
o Producers: Organizations that contributed to the data production
o Funding Agencies (if available)
o Abstract: A brief description of the project and its intellectual goals
o Keywords that describe the data
o Type of data: Survey data, experimental data, geospatial data, or others
o Country of data collection
o Time period covered
o Dates of data collection

3.4.2 Additional Documentation

Further documentation of the dataset should be provided to enable other scientists who may not be familiar with your dataset to use it. This is especially important if the dataset is not linked to any publication. If the study was conducted in more than one language, it would be useful to provide the documents in all the applicable languages. In dataverse these documents should be uploaded to the Data & Analysis tab. The following study documents could be provided along with the dataset:
1. Method documentation
2. Dataset description document: Describe all the variables in the dataset and the measurement units used.
3. Codebook: This should provide a list of variable names, variable labels, and label values. It should specify the data position of each variable, describe the contents of each variable, and identify the range of possible codes and the meanings of those codes.
4. Questionnaires: An unused copy of the questionnaire.
5. Data collection tools: If data collection tools were used, e.g. CS-Pro forms or Access databases, these should also be included, with documentation of how they work.
6. Handbooks, guides and manuals
7. Unpublished reports, e.g. country reports or workshop reports

3.5 Data File Formats

Dataverse is format-agnostic; it does not require archived files to be in any particular format. Essentially, dataverse supports every kind of data format (txt, csv, pdf, xls, doc, mdb, jpg, dat, mdf, avi).

Data authors are therefore encouraged to archive research datasets regardless of the data format.

3.5.1 Size Limitation

A dataverse study does not have any size limitation; however, each archived file is limited to 2 GB.

3.5.2 Tabular Data

Some tabular data formats, when uploaded, can be automatically processed and reformatted by Dataverse. The data is subsequently stored in an open format that ensures accessibility in the future. The data can also be downloaded in several formats: text (tab separated), R data, Stata and S-Plus. Reformatted datasets also allow variable subsetting and analysis online. The following file formats are available for automatic data reformatting:
o SPSS data files (.sav, .por)
o STATA data files (.dta)
o CSV data files with SPSS control card (.csv, comma separated)
o TAB data files with DDI control card (.tab, .csv, tab separated)

SPSS and STATA files: These are proprietary data formats, and you must use the SPSS or STATA programs to generate these datasets before uploading the data. Those who wish to use a sub-settable non-proprietary data format can use the CSV and TAB data formats described below.

CSV data files: Unlike the SPSS and STATA formats, this data format requires two files: the CSV raw data file proper and an SPSS setup file ("control card") with the dataset metadata.

TAB data files: Like the CSV format, this data format also requires two files: the tab-separated raw data file and a DDI (XML-based) metadata file.

3.5.3 Preparation of CSV and TAB Metadata Files

RMG will provide a software application to assist data authors in generating metadata files: SPSS metadata files for CSV files and DDI metadata files for TAB files. For each CSV or TAB data file, users will be required to provide the following details to the software in order to generate the metadata file:
o variable fields and associated data types
o variable labels
o value labels

3.6 Confidentiality

Before a study is made public, or when a project is archiving its final dataset, the data needs to be modified to ensure that individual participants, whether households or persons, cannot be identified. Personal identifiers such as names, addresses and social security numbers should be replaced with running numbers, while group identifiers such as village names should be replaced by categorical names, e.g. Village A, Village B.

3.7 Data Access Permissions

All of the metadata associated with an archived dataset/study will be publicly accessible.

The actual data files are initially restricted for download. To download restricted data, users will be asked to send a request to our data management team, who will direct that request to the project leader. The project leader has to authorize the request before the data can be downloaded. The Research Data Management policy stipulates that all data will be made publicly available 2 years after the project has been finalized, unless the data author has made a special request to the DDG Research. Exceptions will be made when there is a valid reason for the data to remain restricted, e.g. for long-term monitoring studies.

3.7.1 Data Download Requests

When a user requests to download data archived in Dataverse, an email is automatically sent to the Data Management team. The requester may be asked to give more information on what they intend to do with the data. The data authors, together with the project leader or regional coordinator, will then be contacted and will decide whether the data can be shared. Once the request is approved, RMG will share the data.

Users of datasets from our repository will be required to acknowledge the source of the data through proper citation. The Dataverse citation standard can be found at http://thedata.org/citation.
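
Before data is shared or made public, the confidentiality step in section 3.6 has to be applied. The following is a minimal sketch of that step, assuming the dataset is held as a list of rows (dictionaries). The function names, column names and sample values are hypothetical and chosen only for illustration; real datasets may need further checks, since combinations of other variables can also identify participants.

```python
def letter_label(index: int) -> str:
    """Convert 0, 1, 2, ... into A, B, ..., Z, AA, AB, ... (spreadsheet style)."""
    label = ""
    index += 1
    while index > 0:
        index, remainder = divmod(index - 1, 26)
        label = chr(ord("A") + remainder) + label
    return label


def anonymize_rows(rows, id_column, group_column, drop_columns=()):
    """Replace personal identifiers with running numbers and group
    identifiers with categorical names such as 'Village A'."""
    running_numbers = {}  # original identifier -> running number
    group_labels = {}     # original group name -> 'Village X'
    result = []
    for row in rows:
        # Remove columns that should not be archived at all (e.g. addresses).
        clean = {key: value for key, value in row.items() if key not in drop_columns}
        # The same person always receives the same running number.
        original_id = row[id_column]
        if original_id not in running_numbers:
            running_numbers[original_id] = len(running_numbers) + 1
        clean[id_column] = running_numbers[original_id]
        # The same village always receives the same categorical name.
        group = row[group_column]
        if group not in group_labels:
            group_labels[group] = "Village " + letter_label(len(group_labels))
        clean[group_column] = group_labels[group]
        result.append(clean)
    return result
```

Any lookup table that maps running numbers back to the original identifiers should be kept by the project and not archived with the public dataset.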

Appendix 1: Recommended Cataloging Information

The online dataset cataloging information form will contain the following metadata fields for you to fill in when archiving data. For fields with multiple entries, e.g. multiple authors, please separate the entries with a comma (,).

Meta Data Field                    Metadata
Title:
Original Publication:
Author Name and Affiliation:
Producer and Abbreviation:
Production Date:
Funding Agency:
Grant Number:
Abstract:
Abstract Date:
Keywords:
Topic Classification:
Related Publications:
Related Material:
Related Studies:
Time Period Covered Start:
Time Period Covered End:
Date of Collection Start:
Date of Collection End:
Country/Nation:
Kind of Data:
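
The comma convention for multiple entries can be illustrated with a short sketch. The helper names join_entries and fill_form are hypothetical and do not correspond to the online form or to any Dataverse function; the check on Title reflects the note in Appendix 2 that this element is required.

```python
from __future__ import annotations


def join_entries(entries: list[str]) -> str:
    """Join several entries for one field into a comma-separated string."""
    cleaned = [entry.strip() for entry in entries if entry.strip()]
    return ", ".join(cleaned)


def fill_form(record: dict[str, list[str]]) -> list[str]:
    """Render each field as 'Field: entry1, entry2'. Title must be present."""
    if not join_entries(record.get("Title", [])):
        raise ValueError("Title is a required cataloging field")
    return [f"{field}: {join_entries(entries)}" for field, entries in record.items()]
```

Because the comma is the separator, individual entries should not themselves contain commas.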

Appendix 2: Description of Recommended Cataloging Information

Meta Data Field: Title
Description: Full authoritative title for the work.
Comment: The study title will in most cases be identical to the title of the publication if the study is replication data for the publication. This element is required in the cataloging information.

Meta Data Field: Original Publication
Description: If the study is a replication, cite the original study(ies) in this field.

Meta Data Field: Author Name
Description: Person, corporate body, or agency responsible for the work's substantive and intellectual content. Repeat the element for each author, and use the affiliation attribute if available.
Comment: Use the form FirstName LastName (Affiliation).

Meta Data Field: Author Affiliation
Description: Organization with which the author is affiliated.

Meta Data Field: Producer
Description: The producer is the organisation or person that prepared the data, i.e. brought the data into existence.
Comment: Use the form Producer (Abbreviation).

Meta Data Field: Producer Abbreviation
Description: Abbreviation by which the producer affiliation is commonly known.

Meta Data Field: Production Date
Description: Production or published date (if the distributor date is not filled in, this date is used for the Dataverse study citation).

Meta Data Field: Funding Agency
Description: Source of funds for production of the work.

Meta Data Field: Grant Number
Description: The grant or contract number of the project that sponsored the effort.

Meta Data Field: Abstract
Description: A short summary describing the purpose, nature and scope of the data collection, special characteristics of its contents, major subject areas covered and what questions the Principal Investigators attempted to answer when conducting the study. A listing of the major variables in the study should also be added here.

Meta Data Field: Abstract Date
Description: The date attribute follows the ISO convention of YYYY-MM-DD.

Meta Data Field: Keyword
Description: Words or phrases that describe salient aspects of a data collection's content.
Comment: ICRAF keywords should be used here.

Meta Data Field: Topic Classification
Description: The classification field indicates the broad substantive topics that the data cover.

Meta Data Field: Related Publications
Description: If there are other publications that are relevant to this study, cite them in related publications.

Meta Data Field: Related Material
Description: Any related material.

Meta Data Field: Related Studies
Description: Any dataverse studies that are relevant to this one, such as prior research on this subject.

Meta Data Field: Time Period Covered Start
Description: Starting date of the time period covered by the data, not the dates of coding or making documents machine-readable or the dates the data were collected. Also known as span. The ISO standard for dates (YYYY-MM-DD) is recommended, although this form accepts YYYY or YYYY-MM as well.

Meta Data Field: Time Period Covered End
Description: Ending date of the time period covered by the data, not the dates of coding or making documents machine-readable or the dates the data were collected. Also known as span. The ISO standard for dates (YYYY-MM-DD) is recommended, although YYYY or YYYY-MM is acceptable.

Meta Data Field: Date of Collection - Start
Description: Starting date when the data were collected. The ISO standard for dates (YYYY-MM-DD) is recommended, although YYYY or YYYY-MM is acceptable.

Meta Data Field: Date of Collection - End
Description: Ending date when the data were collected. The ISO standard for dates (YYYY-MM-DD) is recommended, although YYYY or YYYY-MM is acceptable.

Meta Data Field: Country/Nation
Description: Country where the data was collected. If more than one, they can be separated by commas.

Meta Data Field: Kind of Data
Description: Type of data included in the file: survey data, census/enumeration data, aggregate data, clinical data, event/transaction data, program source code, machine-readable text, administrative records data, experimental data, psychological test, textual data, coded textual, coded documents, time budget diaries, observation data/ratings, process-produced data, or other.
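
The date fields above accept YYYY-MM-DD, YYYY-MM or YYYY. A data author who wants to check dates before filling in the form could use something like the sketch below. The function name is_valid_catalog_date is hypothetical, and the check is an illustration only; it is not the validation performed by the cataloging form or by Dataverse.

```python
import re
from datetime import date

# Accepts YYYY, YYYY-MM or YYYY-MM-DD with zero-padded month and day.
_DATE_PATTERN = re.compile(r"(\d{4})(?:-(\d{2})(?:-(\d{2}))?)?")


def is_valid_catalog_date(text: str) -> bool:
    """Return True if text is a real calendar date in one of the accepted forms."""
    match = _DATE_PATTERN.fullmatch(text.strip())
    if match is None:
        return False
    year, month, day = match.groups()
    try:
        # Missing month or day defaults to 1 so that partial dates can be checked.
        date(int(year), int(month or 1), int(day or 1))
    except ValueError:
        return False
    return True
```

For example, is_valid_catalog_date("2012-04") is accepted, while "2012-13" and "2012-02-30" are rejected because they do not correspond to real months or days.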