DATA VALIDATION FOR THE 2006 CENSUS OF AGRICULTURE

Charlie Arcaro
Statistics Canada, R.H. Coats Building, 17th floor, Ottawa, Ontario, K1A 0T6

ABSTRACT

The Census of Agriculture (CEAG) is held every five years and collects information on farming practices, commodities and finances. This census is usually conducted at the same time as the Census of Population and is used in redesigning the major agricultural surveys and in updating the Farm Register. Data Validation is a major part of the data quality assurance evaluation and is the most expensive component of the CEAG process. The validation process is done at both the macro and micro levels to ensure quality at the major geographical and small area levels, as well as to ensure good sampling frames for the major survey collections. As users increasingly scrutinize CEAG data at finer geographical levels, these data are examined even more closely in the Validation Process. There is also a need to maximize the efficiency of validation resources during this long process. This paper provides an outline of the 2006 Data Validation process and presents some new methods that addressed the issues of resource management and improved data quality for lower levels of geography.

Keywords: Data Validation; Referential Sources; Imputation; Census of Agriculture

1. INTRODUCTION

The Census of Agriculture (CEAG) provides a five-yearly snapshot of Canadian agriculture and collects a wide variety of farm commodity, financial and operator information. The 2006 CEAG was conducted in conjunction with the 2006 Canadian Census of Population to lower survey under-coverage and to locate missing CEAG farms.

Between 2001 and 2006, the number of census farms and farm operators in Canada continued their long-term decline. The 2006 CEAG counted 229,373 farms, down 17,550 (7.1%) from 2001. It also accounted for 327,060 farm operators, a decline of 19,140 (5.5%) from 2001.

The CEAG is also used in redesigning many of the major agricultural surveys at Statistics Canada. Most of these surveys use CEAG data to construct their sampling frames and as auxiliary information in their estimation process. CEAG data are the main source of new farms as well as of updates to corporate, operator and geographical information for existing farms on the Farm Register (FR).

2. CEAG PROCESS

As with most surveys, the 2006 CEAG process began with the collection of data. Most of the forms were either dropped off by the enumerator and mailed back by the respondent or completed via the Internet. Once the forms came back, they were scanned and the captured data were checked for intelligent character recognition (ICR) errors. There was also extensive follow-up of non-responding farms, as well as micro-level data editing and assignment of headquarters geography to each farm. Data imputation followed these steps and was in turn followed by data validation. Once validation was complete, the results were sent for certification and finally to output and dissemination.

Table 2.1 2006 CEAG Process

1. Collection
2. Data Scanning
3. Editing, Matching, Follow up
4. Imputation
5. Data Validation
6. Output & Dissemination
3. DATA VALIDATION

3.1 Introduction

Data Validation involves analyzing and correcting CEAG data using a series of aggregate and record-level tables. These tables allow us to compare the number of farms, totals and structural changes with those from the last CEAG. We also compare the data with recent surveys (Referential Sources) to see whether there are any major discrepancies between them. Farms that contribute significantly to the provincial totals are also scrutinized, as are the impacts of imputation and validation on our final data.

As shown in Table 2.1, Data Validation is performed after the imputation stage of the CEAG process but just before final output and dissemination. Data Validation is also the costliest part of the CEAG post-collection process and involves analyzing, and possibly changing, data that look questionable. This analysis is not limited to the current CEAG; it also compares record and aggregate level data with other collections such as previous CEAGs and other agricultural surveys.

For the 2006 CEAG, the validation process was done for each province separately, beginning in September 2006 with the Atlantic provinces (New Brunswick, Nova Scotia, Prince Edward Island, and Newfoundland and Labrador) and ending in February 2007 with the province of Alberta.

A series of macro and micro-level tools is used to analyze the data. Aggregate data are analyzed at various geographic levels, from the province level down to the Census Consolidated Subdivision (CCS), of which there are approximately 2,135 across Canada. This lower level of data evaluation allows us to determine the quality of small area data, which is an important part of the validation process. Data at the finer geographical levels are expected to come under greater scrutiny from data users than ever before.

Senior validators prepare extensive plans for all variables to be analyzed, based upon their field knowledge and their expectations, before the Validation Process begins.
These validators are responsible for a number of variables on each questionnaire and are usually the subject matter experts in that field. Their tasks also include supervising junior validators and helping them with their work.

For the 2006 CEAG, efforts were made to make more efficient use of financial and human resources. The results of the new strategies developed for determining which farms to validate are given at the end of this paper.

3.2 Validation Tools

A series of Data Validation tools is used to create reports, which are then analyzed by the respective validation teams. These tools are located on the CEAG Central Processing System (CPS), which also keeps track of all validation reports generated. These reports provide information at both the farm (record) and aggregate levels, including comparisons against the previous CEAG and other Referential Sources. It is important to note that any tables reproduced in this paper are either fictitious or from the 2004 CEAG Test.

3.2.1 Comparison Tables

Comparison Tables provide numbers that measure agricultural activity within a specific province and break them down to finer levels of geography. The tables provide data for the 2006 CEAG and the 2001 CEAG as well as the percentage change for each variable. Comparison Tables are used to explain and justify any dramatic changes that make the data questionable. In general, we assume that the numbers will not change dramatically from one CEAG to the next; however, we can expect some commodities to change significantly, and the Comparison Tables allow us to track and evaluate these differences.

3.2.2 Impact of Processing Tables

Impact of Processing (IP) Tables allow Data Validators to assess the impact of changes made to CEAG data by the imputation and validation processes. These tables are only available at the provincial level and allow validators to locate any potential problems with both the automated and manual imputation systems.
IP Tables also allow validators to monitor the impact of their own work on the final data to be released. There are two tables produced in an IP report:

1. Certification/Summary Report. This report lists all the report variables and provides data similar to the Comparison Tables, as well as the effects of imputation and validation on the 2006 CEAG data.

2. Detail Report. This report provides a detailed breakdown of the changes in value for each variable in the summary report. It measures not only the effects of imputation and validation on the 2006 CEAG data but also which of these changes were positive and which were negative.
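The kind of breakdown described for the Detail Report can be illustrated with a small sketch. It is not the CPS implementation: the record layout (the keys "variable", "collected", "imputed" and "validated") and the function name are hypothetical, chosen only to show how changes made at each processing stage could be split into positive and negative parts for each variable.

```python
def impact_detail(records):
    """Sum positive and negative changes per variable and processing stage.

    records: iterable of dicts with the hypothetical keys
    "variable", "collected", "imputed" and "validated", where each value
    is the farm's figure for that variable after the named stage.
    """
    report = {}
    for rec in records:
        row = report.setdefault(rec["variable"], {
            "imputation_pos": 0, "imputation_neg": 0,
            "validation_pos": 0, "validation_neg": 0,
        })
        # Each stage is compared with the value it received as input.
        stages = (
            ("imputation", rec["collected"], rec["imputed"]),
            ("validation", rec["imputed"], rec["validated"]),
        )
        for stage, before, after in stages:
            change = after - before
            if change > 0:
                row[stage + "_pos"] += change
            elif change < 0:
                row[stage + "_neg"] += change
    return report


# Fictitious records for one variable (values are invented for illustration).
example_records = [
    {"variable": "TCATTL", "collected": 100, "imputed": 120, "validated": 120},
    {"variable": "TCATTL", "collected": 50, "imputed": 50, "validated": 40},
    {"variable": "TCATTL", "collected": 0, "imputed": 30, "validated": 25},
]
example_report = impact_detail(example_records)
```

In this sketch a farm with no collected value is assumed to carry a zero before imputation; a real system may distinguish missing values from reported zeros.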
3.2.3 Top Contributor Tables

Top Contributor (TC) Tables display the farm records that have the largest values for the main variable being considered. The tables also show each farm's impact upon the provincial estimate and its pre-imputation value. The values are usually displayed in descending order; however, there is an option to list the smallest non-zero contributors in ascending order (Bottom Contributors). TC Tables are only available at the provincial level. Like the IP Tables, Top Contributor Tables are presented in two forms:

1. Summary Report, which contains one record per row, ordered by rank.

2. Detailed Report, which is the same as a Summary Report with the addition of administrative data.

By default, the top 100 farms for a given variable are investigated in the validation process, with the added functionality to investigate Bottom Contributors. Bottom Contributors generally matter for greenhouse and mushroom variables, as these must be reported in square feet or square metres. Sometimes these variables are reported in acres instead, even when the area is only one acre (1 acre = 43,560 square feet). Validators therefore look for low greenhouse or mushroom values, as these may indicate that the wrong unit of measure was used.

3.2.4 Match Tables

Match Tables are a series of three reports that compare specific farms in a certain geographical area using 2006 CEAG data and a selected Referential Source. This source can be either the previous Census of Agriculture or any of a number of surveys from the Agricultural Statistics Program, such as the June Crops Survey, the Fruit and Vegetable Survey, the Livestock Survey and the Annual Greenhouse, Sod and Nursery Survey.

The first report lists those farms that appear on the Referential Source List but not on the current CEAG.
The second report lists all operations that appear on both the Referential Source List and the CEAG but show a change in value for a selected variable greater than a specified difference. The third report lists all the farms that received a census questionnaire but did not appear on the Referential Source List. Each report has two parts: a summary list of all operations, with a few variables, and an expanded section that provides more information. Match Reports allow access to farm records in order to identify those operations that account for significant differences for specific variables.

3.2.5 Distribution Tables

Distribution Tables (DT) present data for specific variables sorted into meaningful classes (e.g., distributions by farm type). Depending on the table, they can provide frequency counts, percentage changes and other valuable information. Most Distribution Tables compare current and past census data. These reports are the Data Validation tool of choice for the analysis of tick-box variables, but they are useful for some numeric variables as well. There are three main DT variations:

1. Operator Tables. These tables cover all operators at detailed levels of geography. They include information such as the number of operators resident on their operations, age and sex distributions, hours per week worked on and off the farm, and workplace injuries.

2. Livestock Tables. These tables show counts of various livestock and poultry within ranges of farm size, the number of farms reporting and the average number of units per reporting farm at the provincial level. The ranges represent the total number of animals on Census Day, while the counts are for specific types of animals. The range categories are determined by the survey experts and vary by province.

3. Specialty Tables. These tables provide data on tick-box variables such as land management, computer usage, certified production and organization type.
They also provide a table for a numerical variable, the distribution of farming sales.

4. NEW METHODS FOR 2006 CEAG

As mentioned, the major goal for the 2006 CEAG Validation Process was to make more efficient use of both human and financial resources. The main improvements were made to the farm-level analysis for both the Top Contributor Tables and the Match Reports, as these were the most time-consuming and costly parts of the Validation Process.
4.1 Top Contributors

The Top Contributor Tables display the largest farm values for a specific variable. In the 2001 CEAG, the top 100 farms from each province for a designated variable were analyzed. For some of these variables a handful of farms account for most of the provincial total, so it was considered inefficient to validate 100 farms in each province for all variables. For the 2006 CEAG it was decided to validate either the top 100 contributors or the smallest number of farms that contributed at least 80% of the provincial estimate, whichever was smaller. We also searched Bottom Contributors for variables that were required in square feet or square metres.

For example, if for a specific variable the top 6 farms contributed 75% of the provincial estimate and the top 7 contributed 82%, then we would have validated only the top 7 farms for our table. If for another variable the top 100 contributors accounted for 65% of the provincial estimate, then we would have focused our efforts on the top 100 farms.

The finest level of geography for data validation is the Census Consolidated Subdivision (CCS) level. Given the expected increased scrutiny of CEAG data at this finer level, it was also determined that any farm that contributed at least 30% of its CCS estimate for a particular variable would be classed as a Top Contributor and included in the validation process.

4.2 Match Tables

As mentioned, the Match Tables produce a series of three reports comparing farms between the CEAG and the Referential Sources. In 2001 each of these reports generated a maximum of 100 farms for each variable in each province. For the 2006 CEAG it was decided to use a Collection Management System algorithm that had been used to detect missing CEAG farms. This algorithm identifies all farms that, combined, contributed to at least 2% of all farms and 50% of the provincial estimate and, individually, 1% of the provincial estimate for a specified variable.
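Looking back at the Top Contributor rule of Section 4.1, the selection can be sketched in a few lines. This is an illustration under stated assumptions rather than the production system: the function name, the (farm_id, ccs_id, value) record layout and the parameter names are hypothetical, while the thresholds (100 farms, 80% of the provincial estimate, 30% of the CCS estimate) are the ones given in the text.

```python
from collections import defaultdict


def select_top_contributors(farms, max_farms=100, prov_share=0.80, ccs_share=0.30):
    """Return the farm ids to validate for one variable in one province.

    farms: list of (farm_id, ccs_id, value) tuples (hypothetical layout).
    """
    prov_total = sum(value for _, _, value in farms)
    if prov_total <= 0:
        return []

    # Provincial rule: walk down the ranking until 80% of the provincial
    # total is covered or 100 farms are reached, whichever comes first.
    ranked = sorted(farms, key=lambda farm: farm[2], reverse=True)
    selected = []
    covered = 0.0
    for farm_id, _, value in ranked[:max_farms]:
        selected.append(farm_id)
        covered += value
        if covered >= prov_share * prov_total:
            break

    # CCS rule: add any farm holding at least 30% of its CCS total.
    ccs_totals = defaultdict(float)
    for _, ccs_id, value in farms:
        ccs_totals[ccs_id] += value
    already = set(selected)
    for farm_id, ccs_id, value in ranked:
        if farm_id in already or value <= 0:
            continue
        if value >= ccs_share * ccs_totals[ccs_id]:
            selected.append(farm_id)
            already.add(farm_id)
    return selected


# Fictitious province mirroring the worked example in Section 4.1:
# the top 6 farms hold 75% of the total and the top 7 hold 82%.
example_farms = [
    ("f1", "A", 20), ("f2", "A", 15), ("f3", "A", 12), ("f4", "A", 10),
    ("f5", "A", 10), ("f6", "A", 8), ("f7", "A", 7),
] + [("s%d" % i, "A", 1) for i in range(18)]
example_selection = select_top_contributors(example_farms)
```

With the fictitious data above, the provincial rule stops at the seventh farm, and no farm reaches 30% of the single CCS total, so only the top 7 farms are returned. A small farm that is alone in its CCS would be added by the second rule even though it is far down the provincial ranking.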
We also identified all farms that were on the Referential Sources, missing from the CEAG and individually contributed more than 30% of the CCS estimate for a specified variable. This allowed us to analyze the missing farms that were significant contributors at the CCS level.

4.3 Analysis Results

For our study we looked at three CEAG variables: Total Cattle (TCATTL), Total Pigs (TOPIGS) and Alfalfa (ALFALFA). The tables below estimate how many farms were checked for validation using the new procedures for Top Contributors and the Match Reports.

Table 4.3.1 Number of Farms on Validation Reports (TOTAL CATTLE)

PROV   Total Validated   Match Report   Top Contributor   Top CCS
NF                  54              4                31        26
PEI                112              2               100        16
NS                 106              2               100         5
NB                 142              3               100        46
PQ                 303             59               100       207
ON                 185             92               100        36
MB                 134             75               100        11
SK                 154            104               100        18
AB                 221            199               100         4
BC                 140             44               100        32
Total            1,551            584               931       401

Table 4.3.2 Number of Farms on Validation Reports (TOTAL PIGS)

PROV   Total Validated   Match Report   Top Contributor   Top CCS
NF                   7              3                 2         5
PEI                 56              8                36        30
NS                  47              4                23        23
NB                  50              5                18        37
PQ                 348             26               100       269
ON                 287             30               100       190
MB                 149             10               100        69
SK                 268              8                42       244
AB                 121             10               100        29
BC                 133             10                21       112
Total            1,466            114               542     1,008
Table 4.3.3 Number of Farms on Validation Reports (ALFALFA)

PROV   Total Validated   Match Report   Top Contributor   Top CCS
NF                  31              0                19        21
PEI                142              0               100        80
NS                 112              0               100        23
NB                 143              0               100        82
PQ                 613             33               100       550
ON                 251             96               100        91
MB                 124             40               100         8
SK                 155             69               100        47
AB                 171             95               100         5
BC                 174             21               100        76
Total            1,916            354               919       983

Note: The Top Contributor, Match Report and Top CCS columns do not add up to the Total column, as any one farm can belong to multiple categories. For example, a farm that appears in a Match Report can also be a Top Contributor for the same variable.

Comparing the new methods for the 2006 CEAG with the old methods used for the 2001 CEAG gives the following results. The number at the top of each table is the number of farms reporting the variable; for example, 109,920 farms reported Total Cattle on the 2006 CEAG.

TOTAL CATTLE (109,920 farms reporting)

Methods      # Validated   % of Farms       Total   % Total
2006 CEAG          1,551         1.41   3,346,211     20.15
2001 CEAG          2,928         2.66   3,992,312     20.73

TOTAL PIGS (11,506 farms reporting)

Methods      # Validated   % of Farms       Total   % Total
2006 CEAG          1,466        12.74   7,018,275     45.71
2001 CEAG          2,808        24.40   7,255,827     47.25

ALFALFA (88,064 farms reporting)

Methods      # Validated   % of Farms       Total   % Total
2006 CEAG          1,916         2.18   1,220,928      9.62
2001 CEAG          2,562         2.91   1,187,804      9.36

The new methods implemented for the 2006 CEAG significantly reduced the number of farms to be validated, with very little compromise on the variable totals covered. For example, we validated 1,551 farms accounting for 3,346,211 Total Cattle using the new methods, as opposed to 2,928 farms accounting for 3,992,312 Total Cattle under the previous ones. This significant reduction was also evident when validating the Total Pigs variable but was less apparent for Alfalfa. For both Total Cattle and Total Pigs there is a large variation between the largest and the smaller producers. Conversely, Alfalfa is grown fairly evenly across all farms, which explains the small changes in growers and production between the different methods.

4.4 Conclusions

The emphasis for the Validation Process in 2006 was to concentrate on those farms that had the most impact on the provincial and CCS estimates, as well as those farms that showed significant differences between the CEAG and the Referential Sources. From the study, it is apparent that the 2006 CEAG Validation Process reduced the number of farms required for analysis with little compromise to data quality. The analysis has also shown that the process works well at the finer levels of geography and maintains the quality of small area estimates.

Acknowledgements

The author would like to thank Claude Julien, Erik Dorff and Joseph Duggan for their valuable comments on this paper. Special thanks also to Martin Lessard, Steven Danford, Elizabeth Abraham, Leon Laborde, Olga Olni and Mike Westland for all their help, assistance and encouragement during this project.

References

2006 Census of Agriculture Data Validation Training Manual, Internal Document, Statistics Canada.

Davidson, G., Dornan, J., Robinson, C. and Lim, A. (2005). 2006 Census of Agriculture Call Management System Specifications, Internal Document, Statistics Canada.