Data Management for Multi-Environment Trials in Excel Cathy Garlick (c.a.garlick@reading.ac.uk), Statistical Services Centre, University of Reading, UK 27 May 2010
1. Introduction Excel is widely available and easy to use, and thus, is frequently used in agricultural research for entering and managing data. However, it is easy to use badly. The Statistical Services Centre has produced a guide on using Excel with discipline, available from the SSC web site www.reading.ac.uk/ssc - the booklet is entitled Disciplined Use of Spreadsheets for Data Entry. This is recommended reading for all researchers using Excel. In this supplement we emphasise the use of Excel for preparing data from multi-environment trials (METs) for analysis. 2. List Format Many users of Excel are tempted to make their spreadsheets look pretty ; effectively laying out the data in tabular form. The following is a simple example for an experiment with 3 treatments and 4 breeds of hen. The number of eggs laid and the cholesterol levels in the blood were measured. The researcher laid out the data in table format as shown below: This layout is fine for a summary report but is not suitable for analysis. Instead you should store your data in a List Format. List format is a rectangular block of data where the rows represent cases and the columns represent variables for experimental data these are generally factors or measurement variables. The list should include column headers. When transferring to a statistics package the column headers are generally used as variable names. There should be no blank rows in the data a statistics package will often interpret a blank row as the end of the data. Each column should have the same data type throughout a statistics package often determines the data type from the first few rows of data. So the data for our example experiment should be stored as follows:
The layout is less pretty but more practical. You can always use Pivot Tables to create the summary table from the list. Note we have included a longer description of the variable in the top row with a short variable name in row 2. It is useful to have the longer description, although some statistics packages restrict variable names to 8 characters and do not allow spaces. Thus, we have the description in row 1 and the shorter name in row 2. When the data are exported, we would export from row 2 onwards. 3. Multiple Site Experiments In multiple site experiments the same rules apply, for example, data from all plots in all sites should be together in one list. A column should be added to distinguish the site. This is illustrated below: The last case for site 1 is on row 14, and the first case for site 2 is on row 15. Do not attempt to make this prettier by adding blank rows between sites - there should be no blank rows in the data as this is often interpreted as the end of the data.
If the layout factors differ at some sites, then include extra columns and leave blanks as appropriate. For example let s assume site 1 has blocks as a layout factor, but site 2 has ROWs and COLUMNs 1 ; we need 3 columns for the layout factors and would leave blanks for site 1 for ROW and COLUMN and blanks for site 2 for block : 4. Dealing with Numeric Codes Many variables are set up as numeric codes. For example let s assume we have codes associated with districts as follows: District Code District Name 1 Nsitu 2 Kalulu 3 Mkanda 4 Dididi 5 Chitunda 6 Linga 7 Mwansamba 8 Nkhunga The best way to deal with these codes is to enter just the numeric code in the main data sheet. Then, on a separate sheet, enter the list of codes with the district names this is like a codebook. You can then use the Vlookup function in Excel to include the district names in a new column in the main sheet. So we would have a sheet which we ll call Codes, shown below: 1 Note when we say ROW and COLUMN using capital letters we are referring to rows and columns as used in agricultural experiments. When we use local case letters we are referring to rows and columns in the Excel spreadsheet.
This is our lookup table or code list. On the main data sheet shown below, we have entered the district code into column A. We insert a new column B and use the vlookup function as shown: Note the formula vlookup (A3, Codes!$A$2:$B$9, 2, FALSE) in cell B3. Vlookup takes 4 parameters: The first in this case A3 is the reference for the cell containing the value we are looking up. In this case A3 contains the value 1. The second parameter Codes!$A$2:$B$9 defines the range of cells, which defines the lookup table or code list. In this example it is the range of cells A2:B9 on the sheet named Codes. Note, here we use absolute cell references (i.e. with the $ signs), so that if we copy this formula the reference remains the same. The third parameter specifies the column in the lookup table from where we should take the value. In this case it is column 2, i.e. column B. Thus in our example, Excel will look for a 1 in the first column of the range A2:B9 on the sheet called Codes. It then takes the corresponding value in column B, i.e. the second column. There is a 1 in cell A2 so the function returns the value in B2 which is Nsitu. The fourth parameter is TRUE or FALSE and determines the action if the code is not found in the lookup table. If you choose TRUE, Excel will return the nearest value; if you choose
FALSE it will give the result as #N/A. For example, our coding table includes codes from 1 to 8 if we were to look up the value of 9, then Excel would consider this 4 th parameter. If TRUE, it would find the nearest value which is 8 and return Nkhunga; if FALSE it would return #N/A. For most cases it is best to use FALSE so you can easily see if you have used invalid codes. This method of using codes improves data quality, as you are only typing the text values once entering text values repeatedly can lead to errors. Note you should not attempt to change the district names in the main data sheet if the district is wrong you should change the code and the name will automatically update. If the district name has a mistake, then change it in the Codes sheet. 5. Importing Excel data into a Statistics Package Note that when Excel data are imported into statistics packages, it is the values that are imported and not the formulae. Therefore, if the data mentioned above with the district codes and names were imported into Genstat, changing the district code in Genstat would not automatically change the district name. You must be aware that once you start importing data into other packages you are producing copies of the data, and it is important to keep all copies up-to-date or to have some level of version control. A typical scenario is that a researcher has the data in Excel and imports it into Genstat. She then notices an error which she corrects in Genstat. Meanwhile, another researcher takes the Excel data into STATA where he notices a different error, which he corrects within STATA. There are now 3 copies of the data, all slightly different and none of them totally correct. The solution is to designate the Excel file as the MASTER dataset. If and when errors are found, the correction must be made immediately in the MASTER dataset and ideally a log should be kept of the change so that others using the data will be aware of the change. The corrected data can then be reimported into the statistics package and analysis can continue or be rerun. In most statistics packages it is possible to save the syntax for any analysis you do, so rerunning the analysis should not be a problem. 6. Raw data and Results It is advisable to keep raw data and results on separate sheets in the Excel file. For example Pivot tables and graphs should not be on the same sheet with the raw data. Some researchers like to include summaries (totals and averages) at the end of the data again we would recommend storing these summaries on a separate sheet otherwise you will have problems when exporting the data. 7. Calculations As far as possible, all calculations should be done using formulae in Excel. For example, in the dataset below we have the Fresh Pod Weight in column H and the Sample Grain weight in column I; to calculate the Total grain per plot we use the formula I*H as shown below.
We keep the formulae in the spreadsheet so that the values can easily be checked in this case we would not enter values into column K but only the formulae. Doing calculations by hand and entering the results into the spreadsheet can and will lead to errors both in the calculations themselves and in the transcription of the results let the computer do the boring work! 8. Data at different levels Often you will have data at different levels. For example you may have data at the plot level and at the plant level; or data at the site or district level, as well as plot level data. Data at different levels should be stored on separate sheets in the workbook to avoid unnecessary duplication, which can lead to errors. There must be a column or variable which links the 2 sets of data. The example below shows some data from a pigeon pea experiment. Note the farmers are identified by a code (Farmer ID). We also have data at the farmer level, which are shown below:
The data are linked by the FarmerID, which appears in both sheets. Data from the farmer level can be displayed in the pigeon pea sheet using Vlookup. The example below shows farm size a farmer level variable included with the pigeon pea data. This is an extension of using Vlookup to display labels for numeric codes. Here we just have more columns in our lookup table, the range for which is A3:K57, so we have 11 columns in the table and we are picking data from the 10 th column. In each sheet you should be able to determine what each row represents e.g. one row per plot, one row per farmer, etc. You should know the number of plots, farmers and other subjects in your experiment, and easily check that you have the correct number of rows. If you have 10 farmers, you must have 10 rows of data at the farmer level; if you have 50 plots in total, you must have 50 rows of data at the plot level. 9. One item per cell dealing with repetitions and missing data A single cell in an Excel spreadsheet must only contain a single item. Two or more numbers in the same cell will be treated as a text value, and will not be unusable. The examples below to the left show where the diameter was measured twice for each tree; this should be split into two columns as shown in the example below to the right.
Where data are missing, the cell should be left blank in Excel. You might want to include the reason for the missing value, and this can be done by giving the explanation as a comment or in a separate column. In this example, the tree in row 15 died and the text Dead was entered instead of the diameter measurements. This is text in an otherwise numeric column, which will cause problems and should be avoided. The cells should be left blank and a comment added explaining the missing data as shown below. 10. Sorting the data Data should be entered and stored in field collection order. There is no need to sort the data. The layout factors can be entered prior to data collection, and the spreadsheet printed and used as a data collection sheet.
Sorting the sheet can lead to problems and it is easy in Excel to accidentally sort a single column. It is important to check your data after every manipulation, and of course to keep backup copies of your data. 11. Summary The ideas introduced here can be summarised into a set of rules or recommendations as follows: 1. Store the data in List format ; 2. Data from a multiple site experiment should be combined into a single worksheet, not a separate sheet of file for each location; 3. Do not have blank rows in the data; 4. Keep data and results separate; 5. Let the computer do the calculations don t do hand calculations; 6. Store data at different levels on separate sheets; 7. Use vlookup for coded data and to link data at different levels; 8. Only enter one item per cell; 9. Do not include text in numeric columns use comments instead; 10. Keep regular backups of all your data; 11. Designate a MASTER dataset and ensure all corrections are made to this dataset.