Child Labour Survey Data Processing and Storage of Electronic Files




Statistical Information and Monitoring Programme on Child Labour (SIMPOC)
International Programme on the Elimination of Child Labour (IPEC)

Child Labour Survey Data Processing and Storage of Electronic Files
A Practical Guide

Revised December 2003
International Labour Office, Geneva

Copyright © International Labour Organization 2004

Publications of the International Labour Office enjoy copyright under Protocol 2 of the Universal Copyright Convention. Nevertheless, short excerpts from them may be reproduced without authorization, on condition that the source is indicated. For rights of reproduction or translation, application should be made to the ILO Publications Bureau (Rights and Permissions), International Labour Office, CH-1211 Geneva 22, Switzerland. The International Labour Office welcomes such applications.

Libraries, institutions and other users registered in the United Kingdom with the Copyright Licensing Agency, 90 Tottenham Court Road, London W1T 4LP [Fax: (+44) (0)20 7631 5500; e-mail: cla@cla.co.uk], in the United States with the Copyright Clearance Center, 222 Rosewood Drive, Danvers, MA 01923 [Fax: (+1) (978) 750 4470; e-mail: info@copyright.com], or in other countries with associated Reproduction Rights Organizations, may make photocopies in accordance with the licences issued to them for this purpose.

ISBN 92-2-113629-9
First published 2004

The designations employed in ILO publications, which are in conformity with United Nations practice, and the presentation of material therein do not imply the expression of any opinion whatsoever on the part of the International Labour Office concerning the legal status of any country, area or territory or of its authorities, or concerning the delimitation of its frontiers. The responsibility for opinions expressed in signed articles, studies and other contributions rests solely with their authors, and publication does not constitute an endorsement by the International Labour Office of the opinions expressed in them. Reference to names of firms and commercial products and processes does not imply their endorsement by the International Labour Office, and any failure to mention a particular firm, commercial product or process is not a sign of disapproval.
ILO publications can be obtained through major booksellers or ILO local offices in many countries, or direct from ILO Publications, International Labour Office, CH-1211 Geneva 22, Switzerland. Catalogues or lists of new publications are available free of charge from the above address.

Photocomposed in Switzerland. Printed in Switzerland.

Foreword and acknowledgements

The production of survey results in a presentable format is often delayed, one of the main reasons being that data processing issues are addressed neither properly nor early enough. Emphasizing the importance of careful and informed data processing, this guide provides detailed guidelines for survey planners, data processors, and computer system administrators with respect to data processing planning, actual data processing activities, and the storage of generated files. The guide also outlines the requirements and procedures for transferring electronic files to the ILO at the completion of child labour surveys, a process that is contributing to a growing global child labour data repository. The main aim is to facilitate the generation of high-quality micro-data derived from child labour surveys.

This guide has been prepared by Muhammad Q. Hasan of SIMPOC/IPEC, ILO. Many people involved with child labour surveys helped in the exercise. We would like to express sincere thanks to all concerned. We particularly wish to thank Mr. Sylvester Young, Director of the ILO Bureau of Statistics, and Mr. Farhad Mehran of the ILO's Department of Policy Integration for their valuable comments and suggestions.

This guide is planned to be revised and reproduced on a regular basis. To this end, suggestions and comments are always welcome; users should direct any feedback to simpoc@ilo.org

Contents

1. Introduction
   1.1 Background ... 1
   1.2 Field data collection: A brief overview ... 2
   1.3 Importance of data processing ... 2
2. Planning
   2.1 Introduction ... 5
   2.2 Data processing policy planning ... 6
   2.3 Defining the relevant aspects of a dataset ... 6
   2.4 Selection of hardware and software ... 14
   2.5 Identification of personnel ... 17
   2.6 Scheduling the data processing ... 18
   2.7 Data preservation strategy and access procedure ... 18
3. Data processing
   3.1 Introduction ... 21
   3.2 Data entry and preliminary validations ... 22
   3.3 Appending/merging/splitting files ... 23
   3.4 Data validation ... 28
   3.5 Final decisions on errors ... 30
   3.6 Completion of data processing and generation of data file(s) ... 31
   3.7 Preparation of public use datasets ... 32
   3.8 Final documentation ... 33
   3.9 Final tabulation ... 40
   3.10 Conversion of data files to other formats ... 41
   3.11 Storage of all files ... 42
4. Data preservation
   4.1 Introduction ... 45
   4.2 Organization of files ... 45
   4.3 Transfer of files to a preservation machine ... 47
   4.4 Backups ... 47
   4.5 Transfer of files to the ILO ... 49
Bibliography and further resources ... 51
Glossary ... 53
Annexes
   Annex I Comparison of statistical packages ... 57
   Annex II English country names and code elements ... 59
   Annex III Zambia end of decade and child labour questionnaire (education module) ... 66
   Annex IV A sample codebook for ASCII data created in SAS ... 67
   Annex V Structure of dataset ... 76

1. Introduction

1.1 Background

ILO/IPEC's Statistical Information and Monitoring Programme on Child Labour (SIMPOC) supports child labour surveys conducted in a large number of countries. One of the most important aspects of this programme is the collection, archiving, and dissemination of credible, well-documented, and easily accessible micro-data. This requires extensive planning, organization, and execution of the planned activities, especially at the country level, where, it is expected, collected data will be archived for an indefinite period. At the ILO, meanwhile, this information will provide the basis of a global child labour data repository for use by a variety of people in a variety of countries and computing environments. Thus, the data must be clean, free of inconsistencies, well documented, and readily accessible for use at any time in research and policy-making activities. The dataset received by the ILO also needs to be complete (incorporating codebooks, questionnaires, and so on) and ready for straightforward use by any analyst in any computing environment.

Child labour surveys include three phases. First, data are collected through interviews with children and other family members. Data collection is followed by data processing, in which the collected information is checked for errors, and micro-data and relevant documentation files are created. Finally, data analysis is performed in the light of any additional requirements or policy. Data processing is a difficult and complex process, but in many cases it is the stage that receives the least attention. Data processing activities such as planning for equipment, software, and training of personnel can be conducted concurrently with activities such as survey design and field data collection.
Since all child labour surveys are carried out under strict time constraints, it is recommended that all planning, training, and testing procedures be completed before the field data collection is undertaken. The data processing phase includes several distinct stages, each comprising multiple steps where errors can and do occur. Child labour surveys are smaller operations than censuses, but, since most are first-time surveys and collect a greater amount of information than many other general household surveys, they tend to be more complex. While overall data processing activities are in many respects similar to those of other general household-type surveys, child labour surveys, given their larger sample sizes and questionnaires, sometimes make greater demands on time and other resources. Presentable survey results are commonly delayed because data processing issues are addressed neither appropriately nor early enough.

This guide presents a brief overview of the data collection phase before going on, first, to highlight the importance of data processing and, second, to provide detailed guidelines for its conduct, with particular emphasis on issues pertinent to child labour surveys. Chapter 2 addresses planning issues involved in data processing. Chapter 3 looks at the conduct of data processing and, immediately upon completion of a child labour survey, the generation of files, including well-documented public use datasets. One main purpose of this guide is to help data processors at the country level to produce clean, reliable datasets, together with all the necessary documentation, for use by secondary analysts, at the conclusion of surveys, in producing reliable aggregate data. Chapter 4 provides information on how to preserve datasets, allowing continued ease of access over an indefinite period. Survey design issues, data analysis, and data dissemination lie beyond the scope of this guide.

The information presented in the following chapters should be viewed only as guidelines; the procedures outlined here may of course be adapted in the light of available national resources and experience. This guide as a whole is intended for planners and technical experts supervising data processing activities. Chapter 3, however, is designed specifically for those who perform the actual data processing, while Chapter 4 is for computer system administrators responsible for storage of child labour survey data. The guide also provides an overview of data processing activities that can be carried out at the survey design stage.

1.2 Field data collection: A brief overview

In general, data collection can involve a variety of methods, from face-to-face or telephone interviews to aerial photography. Child labour surveys, however, involve only face-to-face interviews, and only two such methods are feasible.

PAPI. With paper-and-pencil interviews, enumerators administer questionnaires on paper, recording the data with pencils. Data entry operators then key the data into computers or convert them into machine-readable form through some scanning technique coupled with character recognition technology. No matter which method of data entry is selected, the information needs to be rechecked, and various means are used to ensure that data are entered properly; much of this is explained in the following chapters.

CAPI. With computer-aided personal interviews, enumerators are supplied with handheld electronic devices (e.g. palmtop or laptop computers), permitting the direct digital recording of data. This method offers advantages compared to PAPI, since major errors occur only while keying in the data, and these can be rechecked immediately after data collection. Data are then transferred to computers, with almost no time needed for additional data entry, and data cleaning can begin immediately.
This guide primarily addresses PAPI, which is the data collection method applied in most child labour surveys.

1.3 Importance of data processing

In many respects, child labour remains a hidden issue, and surveys seek reliable quantitative data about the various related concerns. National surveys require substantial sums of money and enormous organizational efforts involving ministries, national statistical offices, and other agencies. The resultant data are then provided to policy-makers, researchers, global estimators, and campaigners responsible for publicizing the adverse effects of child labour. All of the above need easily accessible, reliable data on the various aspects of child labour.

Both non-sampling and sampling errors appear in survey datasets. Sampling errors are handled during the sample design phase, and are not addressed in this guide. Non-sampling errors may originate with respondents, interviewers, data-entry clerks, or processing programmers. One main objective of data processing is to find these errors and fix them in the shortest possible time. Where irreparable errors are found, they should be flagged with explanations. Unidentified and unflagged errors can corrupt interpretations of data and, ultimately, may result in the adoption of inappropriate policies.

Competent and thorough data processing activities, including error correction, logical checks, and compilation of information as the basis of documentation, are vital to reliable

survey information. Otherwise, the output from a successful survey (the collected field data) may be limited to a few tables. Secondary analysts will find it difficult, if not impossible, to use the data, while national and international policy-makers may be misguided by the survey results.

One key to successful data processing is careful planning. The various related activities need to be detailed as early as possible, and should include fallback plans. Data processing is immensely important to the survey output, and cleaning and verifying the data is essential.

2. Planning

2.1 Introduction

The preparation of high-quality datasets requires proper planning, and involves two essential elements:

Statistical method. One should employ good data collection tools and a well-developed survey methodology.

Processing and subsequent storage of datasets. A second essential element involves the informed use of established data processing tools, processing methodology, and up-to-date computer hardware and software where applicable.

In most cases, child labour surveys are conducted either as stand-alone operations or are attached as a module to some form of national household survey. In stand-alone surveys, children and their parents are usually interviewed. On the basis of initial investigations, this manual assumes that all child labour surveys employ paper-and-pencil interviews (PAPI) for data collection. Survey planning and data cleaning are discussed in light of this assumption.

Upon completion of the interviews, the collected data are entered on a computer. Data entry may occur in the field under the supervision of field office supervisors, or at survey headquarters, which is normally the national statistics office. If data are entered in the field, there will be a minimum of one file at each field location. Since the same survey questionnaire is used, all files generated in field locations will be similar with respect to the number of variables. No matter how the data are entered, the different files are either appended before data cleaning or cleaned and then appended. These activities are normally conducted at survey headquarters.

Where a child labour survey is conducted as a module attached to some other household-based survey (e.g. a household member health and education module), child labour data may be collected together with other modules (as with stand-alone surveys) or as a complete module without the household information (which is collected as part of another module). The data may also be collected at different times (e.g.
if attached to a quarterly labour-force survey, the total sample will be covered over a period of one year). In such cases, household information needs to be extracted from the other data file(s) and then combined with the child labour data. Such cases entail both appending and merging of data. (Merging and appending are described in greater detail in later sections.) Completion of a data file is followed by data cleaning (partial cleaning may also occur in each modular file). It should be noted that child labour is difficult to define unless all relevant information about children is thoroughly investigated; understanding the causes and consequences of child labour requires analysis of information about the household and other family members.

Another scenario [1] presents itself when the survey is conducted in phases, with a series of questionnaires referring to different entities or differing in their respective coverage. In this situation, data may have to be presented in separate files, with no merging or appending.

All of the situations described above warn us that careful planning is needed before the collected information is processed and made available for analysis. All planning issues can

[1] One example of this is the SIMPOC-assisted country report Survey of activities of young people in South Africa 1999, http://www.ilo.org/childlabour/simpoc/southafrica/report/rep1999.pdf.
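The distinction between appending (stacking records collected with the same questionnaire) and merging (attaching household-level information to child records through a linking variable) can be sketched in plain Python. The field names and sample values below are hypothetical, not taken from any actual survey:

```python
# Appending: files from two field locations share the same variables,
# so their records are simply stacked into one file.
location_a = [{"hhid": "0101001", "age": 12}, {"hhid": "0101002", "age": 9}]
location_b = [{"hhid": "0205001", "age": 14}]
appended = location_a + location_b  # one file, same variables

# Merging: household information collected in a separate module is
# attached to each child record through the linking variable "hhid".
households = {"0101001": {"region": "01"},
              "0101002": {"region": "01"},
              "0205001": {"region": "02"}}
merged = [{**child, **households[child["hhid"]]} for child in appended]

print(len(appended))  # 3 records after appending
print(merged[0])      # child record with household information attached
```

In practice, the same two operations are performed with the append/merge facilities of whichever statistical package is used (see Annex I); the point is that appending changes the number of records, while merging changes the number of variables.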

be addressed while the survey design is in progress. If financial and time constraints are not an issue, all data processing activities should be tested during the pilot survey. (If the CAPI data collection method is used, this is essential.) It should be noted that careful advance planning considerably reduces actual processing time. The following sections discuss planning issues that need consideration before the actual data processing.

2.2 Data processing policy planning

Two areas of planning are important to data processing. On the one hand, we must decide how the actual data processing is to be accomplished, and this is treated in detail in Chapter 3. But first we must ask what resources and definitions are required for effective and efficient data processing. We may term this initial step policy planning. Policy planning comprises the following essential features: defining the relevant aspects of a dataset; selecting hardware and software; identifying personnel; scheduling the time needed for data processing; formulating a data preservation strategy; and designing an access procedure.

2.3 Defining the relevant aspects of a dataset

If analysts are to use a dataset effectively, the micro-data must first be properly processed. This involves a number of stages. Preliminary planning is essential, and includes the identification and definition of such aspects of the dataset as the following.

Record identification variable

To identify a case or record, an identifying variable is usually created and encoded with a unique value. The encoding method and the elements that constitute this variable have to be determined, and the variable, often referred to as the unique record identifier, should be named in accordance with the procedure described later in this chapter.
This identification variable provides the only linkage between the original dataset containing all the variables and a public use dataset (where many identification variables may have been deleted for reasons of confidentiality), or between different files when a cross-comparison of information is required. For example, a combination of state or provincial code, enumeration area code, and house number, appended one after another, may be enough to identify a house uniquely. A line number (the position of a person in a house) can then be used to identify a person in the house uniquely. Other approaches may achieve the same goal, but care should always be taken when appending these numbers, and each household, as well as each person living in a household, should always have a unique identifier.
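As a minimal sketch, the concatenation described above might be implemented as follows. The code widths (two digits for the province, three for the enumeration area, and so on) are illustrative assumptions, not prescribed by this guide; zero-padding each component is what keeps the concatenated identifier unambiguous:

```python
def household_id(province, ea, house):
    """Build a unique household identifier by appending zero-padded codes:
    province (2 digits) + enumeration area (3 digits) + house number (3 digits)."""
    return f"{province:02d}{ea:03d}{house:03d}"

def person_id(province, ea, house, line):
    """Extend the household identifier with the person's line number (2 digits)."""
    return household_id(province, ea, house) + f"{line:02d}"

print(household_id(7, 45, 12))   # -> "07045012"
print(person_id(7, 45, 12, 3))   # -> "0704501203"
```

Without the padding, province 1 / area 12 and province 11 / area 2 would both produce "112...", so two different households could collide under one identifier.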

File structure

In child labour surveys, the unit of analysis is the child or the person, whereas the medium is the household, because information about the child or person is collected by first identifying a house. Thus, it is worth deciding what the final data files should look like. The structure of data files may vary considerably in format and organization when, upon completion of data entry, they become available to secondary analysts. Is a large data file with one long data record preferable (describing both a child and the house in which he/she is living, for example), or does one want several small data files with short data records (where child and house information, for instance, reside in different files with a linking variable)? This decision will depend on factors such as how the survey is conducted and what statistical software is being used for data entry and processing. The following considerations may serve as guidelines.

A data file may contain one long record or several smaller records. A large number of records slows processing speed. Some statistical packages (e.g. Stata) limit records to a maximum number of variables. On the other hand, one advantage of long records in a single file is that secondary analysts do not have to merge files at a later date. Annex I describes limitations associated with statistical packages such as SPSS, SAS, and Stata.

Data may be organized in a file such that household records are followed by person records (with different record types in an ASCII hierarchical file). Alternatively, there may be two separate files, one for a house and one for the persons living in that house, with well-defined linking variables common to both files in a package-specific format. There may also be a single merged file with long records; in such files, the values of many variables will be repeated for members of the same house, thus occupying more storage space.
Each system has its pros and cons, and one planning decision must address the questions of how many data files are to be included in the dataset and what the structure of each should be. Because of the way specific software handles data files, processing large data files within a Windows environment may be a problem. A child labour survey data file may become large when associated with a labour force survey, so the data file may need to be split before analysis. The file structure should be chosen according to available computing resources and the experience of the data processors. Because of its simplicity, however, a single flat data file is recommended where possible for child labour surveys.

Naming files

As soon as a file is created, it must be named, and it is worth deciding beforehand how all files are to be named. This means, at a minimum, adopting a naming convention. For one thing, it is always recommended that names reflect the file contents. The version number of the file can also be included. (In Chapter 3 we see how different versions may be generated.) For child labour surveys, specifically, it is recommended that the following information is included in file names: file content (data, documentation, questionnaire, etc.); to whom the file relates (child, parent, both); version number; relevant country; and whether the file is for general or restricted use.

Such a standard naming convention greatly assists users in choosing the correct file from the dataset. In general, it facilitates processing the contents, often at a much later date, of computer-based storage systems that may contain thousands of files. Other information, such as survey year and survey round, may also be included in the file name. However, there are generally restrictions on the number of characters used in naming a file, with most computer systems allowing 8.3 structures, i.e. eight characters for the actual file name and three characters for the file extension (e.g. MY_FILE.DOC). The extension is usually allocated by the package that created the file. (MSWord, for example, will use DOC as the extension.) In other words, only eight characters can be manipulated to express as much information as possible about the nature of a file. In view of these limitations, the following naming convention is recommended.

All filenames should start with a country code (Annex II lists the two-character codes), followed by the abbreviations C for child, P for parents, F for family (both parents and children), or H for house (dwelling). The version number follows and, since more than nine versions can easily evolve over time, two characters should be used. G indicates that the file is available for general use, and R marks it as restricted. Finally, the eighth character, D, Q, or C, representing data, questionnaire, or codebook respectively, indicates the file contents. If any field in the file name is not applicable, it should be replaced with an underscore (_), thereby simplifying manipulations during computer processing. In summary, when naming files according to an 8.3 structure, use the following convention.
The first eight characters:
   first and second characters: country code
   third and fourth characters: child/parent (person), house (dwellings), or both
      C_ for child only
      F_ for child and parent (family)
      H_ for house
      P_ for parent only
      FH for a single file containing information on child, parent, and house (dwellings) combined
      Note: an underscore (_) is used to fill the blank space of the fourth character
   fifth and sixth characters: version number
      01 first or original version
      02 second version, and therefore not the original version, and so on
   seventh character: file use
      G for general (public) use
      R for restricted (internal) use (in the case of data only)
   eighth character: file contents
      C for codebook (normally associated with an ASCII data file)
      D for data
      I for summary of classification of industries

      L for internal consistency check rules
      Q for questionnaire
      S for summary of classification of occupations
      V for variable list

The last three characters, following the decimal point, denote the type of file (proprietary or otherwise). The following examples should clarify the convention.

BDC_01RD.DOC/SAV/POR
A file containing data about children in Bangladesh, and which is the original version, might be named BDC_01RD, where BD stands for Bangladesh; C stands for child; _ indicates there is no information regarding the house or dwelling; 01 marks this file as the first version; R shows that the file is restricted; and D stands for data. The corresponding public use data file derived from the above would carry the name BDC_01GD. The associated questionnaires would be called BDC_01GQ. (Since the questionnaires are for general public use, they would always include the G code.) The extension indicates whether it is a package-specific data file or documentation. For example, an SPSS data file takes a SAV or POR extension, while documentation in MSWord takes a DOC extension.

UAFH04RD.[xxx]
Similarly, a file containing data about parents, children, and their house in Ukraine, and which is the fourth version, can be named UAFH04RD. The public use version would then be named UAFH04GD. Associated questionnaires are named UAFH04GQ, while a variable description file is named UAFH04GV. A summary classification of occupations file would be named UAFH04GS. All the file names would include appropriate three-character extensions.

PAFH02RD.txt
An ASCII data file that contains data about parents, children, and households in Panama, and is the second version, can be named PAFH02RD.txt, and the public use version would be PAFH02GD.txt. The associated codebook file should be named PAFH02GC, with a TXT or DOC extension, depending on file type.

Creation and naming of variables

Once a survey is completed, a set of variables is created from the questionnaire (primary variables).
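The file-naming convention set out above lends itself to a small helper function, sketched here in Python. The function name and its validation checks are illustrative; the character codes themselves follow the lists given in the text:

```python
def simpoc_name(country, unit, version, use, contents, ext):
    """Compose an 8.3 file name following the naming convention above.

    country  : two-character country code, e.g. "BD" (see Annex II)
    unit     : "C_", "F_", "H_", "P_", or "FH"
    version  : version number, zero-padded to two characters
    use      : "G" (general/public) or "R" (restricted/internal)
    contents : "C", "D", "I", "L", "Q", "S", or "V"
    ext      : three-character extension, e.g. "SAV", "TXT", "DOC"
    """
    assert len(country) == 2 and unit in {"C_", "F_", "H_", "P_", "FH"}
    assert use in ("G", "R") and contents in set("CDILQSV")
    return f"{country}{unit}{version:02d}{use}{contents}.{ext}"

print(simpoc_name("BD", "C_", 1, "R", "D", "SAV"))  # -> "BDC_01RD.SAV"
print(simpoc_name("UA", "FH", 4, "G", "Q", "DOC"))  # -> "UAFH04GQ.DOC"
```

Generating names programmatically in this way, rather than typing them by hand, makes it much harder for a restricted file to be mislabelled as public, or for a version number to be skipped.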
At a later stage, manipulating the primary variables may produce derived variables. Unless conventions are followed, naming these variables can prove awkward. Here are a few rules of thumb:

- Variable names should convey the meaning of the data content they represent. Any potential analyst should be confident that the same variable names apply to the same data. For example, if two questions are used to determine the work status of a respondent (enquiring as to both current work and usual work), variables representing these questions should never be named work1 and work2, since this leaves it unclear which variable refers to which question.
- Ideally, questionnaires should be prepared such that each question comes with a predesignated variable name. For example, "How old are you?" would be annotated with the variable name AGE. This type of questionnaire is often referred to as an annotated questionnaire.
- As with files, naming variables often depends on statistical packages that restrict the number of code characters to eight or fewer (SPSS, for example). [2] The prevailing computing environment in any particular country will also influence naming conventions.
- Each answer in a multiple-choice question should also be assigned a variable name. For example, if question number 9 has 2 multiple-choice answers, then the variables may be named Q9A and Q9B.

Several different methods may be applied for naming variables. [3]

One-up numbers. In this approach, variables are numbered sequentially. Thus if there are 100 variables in a data file, they can be numbered from 1 to 100. However, many statistical software packages do not allow a digit as the first character in a variable name; in that case a letter can be added as a first character (in SPSS, for example, variable names will be assigned automatically, either v1 to v100 or var00001 to var00100). Variable names can be changed manually afterwards. The problem with this method is that it is often impossible to comprehend the meaning of a variable, or to match variable names with the respective questions, without additional labels. Errors can easily happen if variables are named in this way.

Question numbers. A possible alternative to the one-up number method is to name variables with the respective question number; for example, Q1 is the variable that corresponds to question 1. Since multiple-answer questions would require more than one variable to be created for a single question, a letter can be appended after the question number: Q4a, Q4b, etc.
Since all child labour questionnaires consist of multiple sections, the first letter can be chosen to represent the section (A1, A2, B4a, B4b, etc., where A and B are two different sections). Again, additional labels can be used to explain the actual meaning of the variables.

Mnemonic names. In this method, variables are named with words representing the concept of the variable. However, the same word may suggest different meanings to different users. Also, the maximum of eight permissible characters in the variable name may impose severe restrictions on conveying the actual meaning. It is also hard to manually assign the same word to different variables conveying the same type of meaning.

Prefix, root, suffix systems. A possible alternative to the mnemonic method of constructing variable names is to use predefined abbreviated words and join them as prefix, root, and suffix. For example, all variables related to children may use CH as a prefix; WW and WY, denoting last week's work and last year's work respectively, as a root; and GRP, to group cases, as a suffix.

Derived variables. As mentioned earlier, derived variables are created from primary variables or by combining multiple primary variables. For example, age may be a primary variable, but analysts might need information about children in the 5- to 9-year age group. Information about individual children's ages can then be grouped to form the derived variable age group. It is always recommended that primary and derived variables be distinguishable. For a variety of reasons, it is also advised that public use datasets should not contain large numbers of derived variables: they are costly in terms of data-processing time; if they are to be properly used, they need adequate explanations; and the datasets may become too large and unwieldy. Moreover, data analysts may not have occasion to use these derived variables at a later date, and may prefer to tailor derived variables to their own requirements.

Remember that the weight factor included in a dataset is not a variable from the questionnaire, and it should be treated separately. It should be named WEIGHT, using the naming convention applied to a primary variable.

Individual countries are of course free to choose the naming convention appropriate to their variables. With the aim of establishing international consistency with regard to child labour data, however, the following rules are recommended:

- use the question-number method in naming variables, with the character representing the section appearing as the first character in the variable name;
- use the prefix method in naming derived variables;
- use capital letters for primary variables, when possible;
- use lower-case characters for derived variables; and
- name the weight factor according to the rules for primary variables, while keeping it distinguishable from a primary variable.

[2] See Annex I for the maximum number of characters allowed in naming a variable in some statistical packages.
[3] This follows the approaches outlined in: Inter-university Consortium for Political and Social Research (ICPSR), Guide to Social Science Data Preparation and Archiving. Retrieved from http://www.icpsr.umich.edu/access/dpm.html

Variable labels

It is more difficult to understand a dataset if attributes associated with the variables (for example, the literal question asked) are not properly described inside that dataset. People who want to perform secondary analyses of child labour surveys prefer that all information be contained in the dataset. One sign-posting method is to provide an adequate label for each variable. Since almost all data processing software (e.g. SPSS) nowadays provides the option to add labels, this option should be used to describe each variable.
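The recommended conventions can be applied mechanically when a dataset is built. The sketch below is illustrative only: the questions, section letters, and the "ch" prefix are invented. It derives section-prefixed, upper-case names for primary variables and attaches the literal question as a label:

```python
# Illustrative application of the recommended conventions:
# question-number names with a section letter, upper case for
# primary variables, lower case for derived variables.
questions = {
    ("A", 1): "How old are you?",
    ("A", 2): "What is your sex?",
    ("B", 4): "Did you work last week?",
}

def primary_name(section, number, sub=""):
    """Build an upper-case primary variable name such as A1 or B4A."""
    return f"{section}{number}{sub}".upper()

# Variable labels: keep the literal question with its number.
labels = {
    primary_name(sec, num): f"Q{num} ({sec}): {text}"
    for (sec, num), text in questions.items()
}

# A derived variable uses a prefix and lower case; e.g. an age group
# derived from A1 (the prefix "ch" for child-related is illustrative).
derived = {"chagegrp": "Age group derived from A1"}
```

Generating names and labels from a single question table in this way keeps the annotated questionnaire, the dataset, and the documentation consistent with one another.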
If no suitable labels can be found, the literal question, together with the appropriate question number, should be used as a label. If the variable is a derived variable, a label can be added to indicate which variable or variables were used to create the new variable and, if possible, the reason for creating it.

Coding

A statistical software package is used to analyse the information collected in the field. The information therefore needs to be transformed into data that the software can handle. To this end, each answer is coded, and the process that determines which symbol represents which item is known as coding. Coding should be undertaken during the survey design process, and it is important that the data processors themselves are involved. Child labour surveys should be pre-coded before data entry. All possible values, including those such as "not available", "not applicable", and "refused to answer", ought to be included in the questionnaire, and interviewers should receive proper training. These measures will greatly reduce the time that data entry or data processing personnel need to spend on coding. Following are a few guidelines drawing on the ICPSR Guide to Social Science Data Preparation and Archiving [4] and the Audience Dialogue Survey analysis. [5]

[4] ibid.
[5] Audience Dialogue: Survey analysis. Retrieved from http://www.audiencedialogue.org/kya5.html

- Should the need for additional codes arise (for example, assigning a specific code for open-ended questions), this should be carried out with proper consideration of the coding scheme defined during the questionnaire design. It is particularly important to ensure that there are no overlaps between code categories and that each code fits into only one category.
- For open-ended questions, major categories/classifications should be identified by examining the number of responses and should be used for additional coding. The meaning of each code should be clearly documented.
- During the additional coding procedure it is also good practice to preserve as much information as possible in the data as they are collected (i.e. no collapsing or bracketing, etc.).
- With occupational coding, it is important to follow a standard format defined by an accepted standards institution (e.g. the International Standard Classification of Occupations, ISCO-88), and to use as many digits, and therefore include as many details, as possible.
- Specify all possible missing values (such as "no response" or "not applicable"). Assign the same value (99, for example) to each type (e.g. "not applicable") throughout the same dataset.

One of the following factors is usually responsible for missing values in child labour survey data, and a different code should be assigned to each case:

- Refused to answer. A child or parent did not answer the question.
- Don't know. A child or parent was unable to answer the question. The respondent might not have had any concept of time or arithmetic, for example, and replied "Don't know" to the question "What was your total income last year?". (Respondents should be discouraged from answering "Don't know".)
- Not applicable. For some valid reason, the question was not asked. Following the response "Not working", for example, any questions related to income were not asked.
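A scheme of this kind can be enforced in software so that the same code is used for the same reason throughout the dataset. The sketch below is illustrative; the specific code numbers 97 to 99 are assumptions, not values prescribed by this manual:

```python
# Illustrative missing-value scheme: one distinct code per reason,
# used consistently across the whole dataset.
MISSING = {
    "refused": 97,        # respondent did not answer
    "dont_know": 98,      # respondent was unable to answer
    "not_applicable": 99, # question was legitimately skipped
}

def code_answer(raw, reason=None):
    """Return the coded value, or the agreed missing code for `reason`."""
    if raw is not None:
        return raw
    if reason not in MISSING:
        raise ValueError("every missing value needs a documented reason")
    return MISSING[reason]

# A skipped income question for a child reported as not working:
income = code_answer(None, reason="not_applicable")  # -> 99
```

Raising an error when no reason is supplied prevents the blank or undefined-zero entries discussed below from ever reaching the dataset.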
It has been observed in many child labour surveys that missing values were left blank or coded with a zero that was not pre-defined. It is of paramount importance, therefore, that all such cases are assigned distinct codes during the coding process; these should then appear pre-coded in the questionnaire. If for any reason missing values are assigned codes, the documentation should include clear descriptions.

It is often quite difficult to code such items as occupations and industries. Where codes are developed locally, some classifications (occupation, for example) may be missed, making the jobs of enumerators and data processors even more difficult. Consequently, countries are encouraged to consult the following resources for help:

- International Standard Classification of Occupations (ISCO) [6]
- International Classification of Status in Employment (ICSE) [7]
- International Standard Industrial Classification of all Economic Activities (ISIC) [8]
- Classifications of Occupational Injuries [9]

[6] Retrieved from http://www.ilo.org/public/english/bureau/stat/class/isco.htm
[7] Retrieved from http://www.ilo.org/public/english/bureau/stat/class/icse.htm
[8] Retrieved from http://www.ilo.org/public/english/bureau/stat/class/isic.htm
[9] Retrieved from http://www.ilo.org/public/english/bureau/stat/class/acc/index.htm

This list, which is not exhaustive, can be accessed through the Bureau of Statistics web page. [10] Child labour classifications are not yet in a finalized form (the relevant categories vary from country to country), and additional coding schemes may need to be developed.

Consistency and logic check rules

It is important to develop as many logic check rules as possible by going through the questionnaire. This requires a detailed understanding of the questionnaire and its flow, and will greatly help computer programmers at later stages. First, consistency check rules have to be generated by studying the routing of each question (e.g. if the answer to question 20 is "yes", the skip pattern is entered as the answers to questions 21 and 22). Sample responses from questionnaires that suggest other consistency checking rules include these:

- A child aged younger than six years is reported as having completed secondary school.
- A child is reported as not working but as nevertheless bringing cash into the household.
- A child did not work, but reported a work-related injury.

Another type of logic check rule needs to be developed where data contain a legal value but nevertheless do not look right. For example, a parent is reported as having 11 children. This may be true, but may not look right, and could well represent a typographical error; the correct value is more likely 1 child. The corresponding rule could read: "Flag cases where parents reported having more than 10 children." These flagged cases then need to be checked manually.

Imputations

Once consistency checks are performed, many missing values can be replaced following imputation rules. Imputations estimate what would otherwise be missing values, where survey respondents failed to provide responses to given items. One rule might indicate, for example, that a person's income can be imputed by a formula involving age, type of work, wage rate, and number of days worked in a particular geographical area.
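An imputation rule of this kind can be written as a small function. The sketch below is purely illustrative: the formula (income estimated from wage rate and days worked) and the field names are invented, not taken from any survey:

```python
# Illustrative imputation: replace a missing income with an estimate
# from wage rate and days worked, and flag the record as imputed.
def impute_income(record):
    """Fill a missing 'income' field and set an 'income_imputed' flag."""
    out = dict(record)
    if out.get("income") is None:
        # Invented rule: income ~ wage_rate * days_worked.
        out["income"] = out["wage_rate"] * out["days_worked"]
        out["income_imputed"] = 1
    else:
        out["income_imputed"] = 0
    return out

filled = impute_income({"income": None, "wage_rate": 2.5, "days_worked": 20})
# filled["income"] == 50.0 and filled["income_imputed"] == 1
```

Carrying the flag alongside the value implements the recommendation, noted shortly, to mark imputed observations so that analysts can distinguish them from reported ones.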
As many of these formulae as possible should be developed by going through the questionnaire. It must be decided how imputed variables are to be incorporated in the dataset, and, where needed, relevant computer programs may be developed and tested. For simplicity, a completely new variable can be created which includes imputed values in place of missing codes; alternatively, missing codes can be replaced with imputed values, together with a flag variable taking the value 1 if imputed and 0 if not.

[10] Details may be obtained from http://www.ilo.org/public/english/bureau/stat

Weights

Since all child labour surveys are sample surveys, weights need to be calculated in order to produce national estimates. In choosing a sampling procedure, we should ask whether standard errors based on simple random sampling are appropriate, or whether more complex methods are required. If weights are required, they should be described. A clear indication of the response rate should be provided in the documentation, indicating what proportion of those sampled actually participated in the survey. The retention rate, if applicable, should also be noted. Weights are usually developed by specialists, and it is essential that a weighting formula, with descriptions of all its elements, is obtained well before data processing begins.
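As a simple illustration of why the weighting formula and its elements matter, the sketch below grosses sample counts up to a national estimate using design weights. The records and weight values are invented for illustration:

```python
# Illustrative grossing-up with design weights: each sampled child
# represents `weight` children in the population, so national
# estimates are weighted sums, not raw counts.
records = [
    # (working_child, weight)
    (1, 120.5),
    (0, 120.5),
    (1, 310.0),
]

raw_count = sum(working for working, _ in records)
weighted_estimate = sum(weight for working, weight in records if working == 1)

# raw_count == 2, but the national estimate is 430.5 working children
```

The gap between the raw count and the weighted estimate is exactly why the dataset must carry a properly documented WEIGHT variable.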

Documentation

Documentation should be as much a part of overall planning as is analysis. It has to be decided who is responsible for keeping a log of what happens during data processing, including such considerations as problems encountered, major decisions taken, and any imputation method adopted. A more detailed account of this process is presented in "Final documentation" (Section 3.8).

2.4 Selection of hardware and software

Marshalling resources for a child labour survey strongly depends on what hardware, software, and national statistics office personnel are available. Given those constraints, the following aspects must be considered when selecting hardware and software for data processing:

- computers and printers
- data entry and data cleaning
- statistical processing and tabulations
- documentation and other tabulations
- software utility tools
- automation tools (to perform repeated tasks)
- tools for transferring files among different computers
- virus-checking software
- hardware accessories (cables, disks, CDs, UPS, etc.)

Computers and printers

Since data will be entered in batches, and probably in parallel, one PC is needed for each data entry operator. Different data entry operators, however, can often share the same computer at different times. Printers capable of printing in landscape format are also necessary. If line printers/dot matrix printers are used, they should have a capacity of 120 characters per line. A Pentium computer with a 1 GB hard disk is more than enough for data processing and temporary storage of child labour survey data. A permanent computer is also needed on which the final dataset will be archived. It is highly recommended that the computer used for permanent storage of data is not the same one used for day-to-day work; it may, however, be a central computer shared by different sections of the national statistics office for permanent storage of their data.
Data entry and data cleaning

A great number of staff-hours are sometimes devoted to developing custom software for checking data entry errors. A better solution can be to use dedicated data entry software, most of which has some form of built-in checking facility. Over the years, a variety of organizations have developed data entry software, and many national statistics offices use one or more of the following programs for data entry and initial data validation (this list is not exhaustive):

- Blaise. [11] A flexible, relatively powerful system developed by Statistics Netherlands for computer-assisted interviewing, data entry, and data editing, Blaise is a software system for survey processing on microcomputers. Blaise also simplifies subsequent processing of the collected data. This software is used primarily by European Union countries.
- IMPS. [12] Developed by the US Census Bureau, the original DOS-based Integrated Microcomputer Processing System has been superseded by a Windows-based version. Many developing countries use this software for data entry.
- ISSA. [13] Integrated Systems for Survey Analysis is produced jointly by SerPro Ltd of Chile and Macro International of the USA. A number of developing countries use this software for data entry. Evidence suggests that ISSA does not have a wide user base in SIMPOC countries and offers limited support in the form of training courses and documentation.
- EpiInfo. [14] This word-processing, database, and statistics program for public health on IBM-compatible microcomputers is produced by the Centers for Disease Control and Prevention in the USA. Many developing countries use this software for data entry.
- CSPro. [15] The Census and Survey Processing System was also developed by the US Census Bureau. Incorporating many features of IMPS, ISSA, and EpiInfo, CSPro is designed eventually to replace both IMPS and ISSA.

A detailed evaluation of the above software lies beyond the scope of this manual. In general, however, the availability of financial resources, trained personnel, and microcomputers are all important considerations in choosing any child labour survey software. Where no other data entry software is available and trained national statistics office personnel are lacking, CSPro (see above), public domain software from the US Census Bureau, can be used for entering, tabulating, and mapping survey data.
This software, together with its documentation, is available free online, although online registration may be required. The US Census Bureau can arrange training programmes, but charges for them. According to the software documentation, it is possible to handle child labour survey data with this software. Although some national statistics offices reportedly use versions of this software, however, it has yet to be tried on child labour surveys specifically, and the training may be worth the cost.

An alternative is Blaise (see above), a user-friendly, high-speed data entry and data manipulation package with an interactive editing facility and survey management capabilities. The software is not free, but is offered at a discounted price to developing countries. However, it has a number of characteristics that can make it harder for non-programmers to learn. One such characteristic is the use of advanced programming concepts such as data typing and procedure parameters. Another is the lack of structured forms to aid in defining questionnaire forms and variables. Moreover, Blaise is not widely used outside Europe, so an established user base in developing countries does not yet exist.

[11] Details may be obtained from Statistics Netherlands: http://neon.vb.cbs.nl/blaise
[12] Details may be obtained from the US Census Bureau: http://www.census.gov/ipc/www/imps/index.html
[13] More information is available at SERPRO: http://www.serpro.com/about.asp
[14] Details may be obtained from the Centers for Disease Control and Prevention: http://www.cdc.gov/epiinfo/
[15] Details may be obtained from the US Census Bureau: http://www.census.gov/ipc/www/cspro/index.html

In any case, software should be tested beforehand, and data entry operators should be both trained in the software and familiar with child labour surveys before actual data entry begins.

Processing and tabulations

Evidence suggests that virtually all national statistics offices have access to either SAS or SPSS, or both. Where that is not the case, national statistical offices should try to adopt one standard statistical software package (e.g. SPSS, SAS, or Stata). Where that is not possible, data entry software can also be used for child labour survey data processing purposes. (Data analysis can be performed using EpiInfo, for example.) See Annex I for a comparison of the SAS, SPSS, and Stata statistical packages.

Documentation and other tabulation

The Microsoft Office suite, comprising Word, Excel, and Access, is used by many statistical offices and is adequate for creating the appropriate documentation, including the questionnaires. Both MS Excel, a spreadsheet program, and Access, a database program, are user-friendly means of preparing tables. TPL, table generation software from QQQ Software, [16] can also be used. Again, the availability of resources and trained personnel should be the main criteria for choosing particular software.

Software utility tools

The following list of software utilities is not exhaustive, and many other utility tools may currently be in use in various countries.

- Databases. General users are often unfamiliar with statistical packages, and they might prefer to have a subset of the data (or even the entire dataset) in a database format. Many statistical packages allow data to be saved in a database format, and database programs such as Microsoft Access are sometimes quite helpful.
- File compression software (e.g. WinZip, PKZIP, gzip). This software is used for compressing files. It is sometimes possible to reduce file sizes by as much as 80 per cent or more using these kinds of software.
Compression is useful where a hard disk is short of storage space or when using floppy disks to transfer files between computers.
- Compiling software (e.g. Visual Basic, FoxPro, C++). This is programming software other than that incorporated in the statistical package. Compilers can be used, for example, to develop a user-friendly front-end for data entry, or to produce customized in-house automation software for performing repetitive tasks.
- Conversion software. Utility software such as Stat/Transfer and DBMS/Copy converts files from one statistical package format to another. SAS PROC CONVERT statements can easily convert SPSS portable files into SAS datasets.
- File transfer software. This is software that allows files to be transferred between computers, whether networked or not. These utilities include Direct Cable Connection, which is included in the Windows operating system, or LL3 for non-Windows-based transfers. FTP programs are also helpful in transferring files among networked computers.

[16] More details are available at QQQ Software, Inc: http://www.qqqsoft.com

- Virus checking and recovery software. Programs such as Norton Utilities, McAfee VirusShield, and ScanDisk (which may or may not come with the operating system) not only provide protection against virus attacks, but can also sometimes be useful in recovering corrupted files.

Hardware accessories

Apart from computers, required resources include hardware accessories such as cables, floppy disks, CDs, uninterruptible power supplies, air-conditioning systems, and dehumidifiers. The associated problems will vary from country to country.

2.5 Identification of personnel

Human resources are required in the following areas:

Data entry personnel. These individuals are responsible for such tasks as data entry and initial validations. Although some countries are trying to use scanning technology coupled with optical character recognition systems for data entry, most child labour survey data are still entered manually. Data entry personnel should be familiar with the data entry software, as well as with the questionnaire design. Ideally, they should have previous data entry experience. At a minimum, data entry operators should be familiar with the computer keyboard and have typing skills. A rule of thumb: 10 data entry operators working in parallel for about 40 hours a week are needed to enter and carry out preliminary validation of data on 8,000 households over a period of about 2 months. Using the CAPI method of data collection eliminates the need for such data entry operators.

Data processing personnel. These persons should be thoroughly familiar with the survey questionnaire, data processing activities, editing, and the necessary tabulations. They need to be familiar with statistical packages, and should be capable of finding errors and correcting certain types of errors in the dataset. They also have to be capable of performing repetitious tasks efficiently.

Computer programmer.
This person should be able to develop programs based on the consistency checking rules, either in the software-specific format or using other computer programming languages. Ideally, the person should also be capable of understanding the survey questionnaire and developing the consistency rules. If any programmers are involved in the questionnaire design, it is strongly recommended that they are subsequently included in the programming team.

Computer system administrator. This person should be a competent computer systems administrator familiar with managing stand-alone or networked systems, printers, file transfer methods, virus-checking systems, back-up operations, and corrupted-file recovery methods.

Supervisor. This position requires a highly qualified data processing specialist with programming experience, capable of overseeing the entire data processing operation. He or she should have previous experience managing survey or census data processing, and be familiar with the software packages used to process the child labour survey data.

It is likely that one and the same person could perform a number of the activities described above. Where this is the case, the supervisor should decide which activities can be combined, and specify how that person's time should be allocated.

2.6 Scheduling the data processing

Time is always a crucial factor in child labour surveys. Administrative procedures, non-submission of progress reports to funding agencies, non-availability of resources, and training of personnel are among the factors that can delay data capture and data processing. Plans should stipulate that all data processing activities be completed within three months of the data entry starting date, if not sooner. Other major considerations at this stage are identifying tasks that can be conducted in parallel and ascertaining the availability of human and machine resources.

In what follows, we present guidelines for processing 8,000 household records with about 50 questions. A greater number of household records or questions will normally mean that data entry, cleaning, and error correction require extra resources, including more time; fewer records or questions, conversely, will require less. The following time allocations should serve as rules of thumb for stand-alone child labour surveys:

- about one month for data entry, including additional coding; and
- about one month for data validations.

2.7 Data preservation strategy and access procedure

Surveys often conclude with the preparation of tables. If the micro-data are not properly archived, they may eventually become obsolete; for example, where data are stored in a package-specific format and the package used to create the data is superseded by a newer version. Planning must take into account data storage and strategies for how this information can later be accessed. When establishing data preservation and access procedures, certain considerations require careful attention:

Hardware.
Sometimes a shared machine, on which other datasets are stored by the national statistics office, is used. This may be a workstation with offline storage capacity or any server where data are stored. The minimum requirement is a Pentium PC that is not used for day-to-day work.

Automation software. This may be in-house, purpose-built software, and will vary depending on the hardware platforms available in the individual country. Such software is used to perform repetitive tasks, for example checking that all files are transferred and labelled.

Directory structure. Design a structure for the storage of data, documentation, and program files. Remember that files will be created using a variety of software packages. It is not good practice to store all files related to the same dataset in a single directory. (A model directory structure is presented in Chapter 4.)

Access policy. Decide who is allowed to access the datasets. A data access request may come from someone within the department, from another ministry or organization, or from a complete outsider. The general access policy should be to make data available to all users; however, certain data may be restricted to a particular group of users.

Backup policy. Child labour survey data backup procedures will probably resemble those that exist for the organization's data in general. In any case, aspects to consider when backing up include these:

- which files are to be backed up;
- how often they are to be backed up (daily, weekly, monthly, etc.);
- what backup medium is to be used (CD, tape, etc.); and
- what procedure is to be used: who is responsible, and how the backups are to be performed.

Backup procedures during and after data processing differ. During processing, files are incomplete and backups are only short term (yet the latest versions can still be recovered if the system crashes). Typically, these temporary files are small; child labour data processing files can normally be accommodated on a couple of floppy disks. Following data processing, permanent backups are needed.

Dissemination procedure. The access policy determines in part how the data will be disseminated. Dissemination procedures should be simple. Approaches to consider include online dissemination through the Internet or an intranet, and offline dissemination using diskettes or CD-ROMs. Detailed procedures need to be formulated.

All policy planning activities can be performed in parallel with survey design and field data collection. Policy planning should be completed before field data collection finishes, so that data can be entered immediately afterwards.
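The directory-structure point above can be made concrete with a small script that creates separate subdirectories per dataset. The layout shown is a hypothetical example, not the model structure presented in Chapter 4:

```python
import pathlib

# Hypothetical layout: one directory per survey, with files grouped
# by kind rather than kept together in a single directory.
SUBDIRS = ["data", "documentation", "programs", "questionnaires"]

def create_layout(root, survey):
    """Create the per-survey subdirectories and return their paths."""
    base = pathlib.Path(root) / survey
    paths = []
    for sub in SUBDIRS:
        path = base / sub
        path.mkdir(parents=True, exist_ok=True)
        paths.append(path)
    return paths

# The survey name "BD_2003" is invented for illustration.
paths = create_layout("archive", "BD_2003")
```

Scripting the layout, rather than creating directories by hand, keeps every archived survey's structure identical and so makes automated backup checks straightforward.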

3. Data processing

3.1 Introduction

In many respects, survey data processing has remained unchanged over the past few decades. With the advent of more sophisticated technology, however, it has become faster and more reliable. Data are first collected using manually completed questionnaires. Next, a manual count of the completed questionnaires is cross-checked against the number of persons interviewed. The data are then coded and sent for data entry. Data entry is accomplished as quickly as possible by operators who enter exactly what they see. Some countries may use scanning technology followed by optical character recognition (OCR), whereby the completed questionnaires are scanned and the responses are then recognized as codes in a form that statistical software can handle. If CAPI is used for data collection, this kind of data entry is unnecessary. Most child labour surveys, however, are conducted using the PAPI method, so the data must be captured as quickly as possible to produce a preliminary count for the survey.

As mentioned above, data processing represents the second stage in the survey process. Data are first received, usually from multiple sources, and then converted to a format suitable for the following stage, which is analysis. Data may be collected on paper or as digital information. Similarly, data processing may be either electronic or non-digital. Initial investigations at the country level reveal that data processing activities for child labour surveys are in most cases performed electronically, using personal desktop computers. This guide therefore addresses only electronic data processing.
Whichever way data are entered, the following phases need attention during data processing:

- data entry and preliminary validations;
- appending, merging, and splitting files;
- data validation (further checking, editing, and imputations);
- final decisions on errors;
- completion of data processing and generation of data file(s);
- preparation of public use datasets;
- final documentation;
- final tabulations;
- conversion of data files to other formats as required; and
- storage of all files.

Child labour survey data should go through these stages at a minimum, and there should be no shortcuts. Shortcuts are rarely effective, since they increase the risk of producing unreliable and thus less credible datasets, which then require more time for error correction. The following sections of this chapter elaborate on these stages. It is recommended that those involved in data processing read this chapter carefully before approaching any processing tasks. Remember that it is also important to include proper weighting factors in the data.

3.2 Data entry and preliminary validations

Depending on the prevailing situation in a given country, data entry may occur either at the field level, under the supervision of a field supervisor, or at survey headquarters. Where data are entered in batches, each batch should appear in a separate file, rather than together in a single large file. Most importantly, at this stage the data should be entered right after collection and checked to ensure all information has been entered correctly. Error-detection procedures should be in place, and errors should be corrected immediately. Data entry operators should not leave their computer while entering data for a household record: however short the interval away from the task, this practice tends to generate errors. Once a batch of data is entered, the questionnaires should be bundled, labelled, and stored for future reference.

A variety of common data entry errors, with the appropriate precautionary or corrective measures, follow:

- Data from an old questionnaire entered (e.g. from a pilot survey). This can be verified by referring to interview dates or to the colour of the questionnaire paper (different colours should be used for pilot surveys and actual surveys). Data entry software should be programmed to recognize this problem.

- Wrong data but within range. The sex of a female child is entered as male. Both male and female are legal values, so this type of error evades normal statistical checking. Custom-built programs involving different questions need to be developed for this and tested beforehand.

- Wrong data and out of range (wild code). If 1 stands for male and 2 for female, a value of 3 represents erroneous data. Frequency distribution procedures will flag these cases. Once the error is found, compare the erroneous record with other answers to correct it.

- False logic (consistency). A six-year-old child is reported as having completed secondary school.
The child may have responded appropriately, but the answer was typed incorrectly. This type of error may also be caught with custom-built programming involving different questions. Once found, compare the erroneous record with other answers to correct the error.

- Data not typed (missing data). Missing data codes for items such as "not applicable" and "refused to answer" may not have been pre-coded in the questionnaire, even though they should have been. Or, during data entry, all such values may have been left as blanks to be filled in later. In either case, all these instances should be found and replaced with the appropriate data code.

- Duplicate entry (same records or cases entered more than once). As data are entered in batches, the same cases or records may be entered twice. Checks can be performed to capture this type of error (e.g. by referring to unique identification numbers). Once such cases are identified, appropriate actions include deletion (this may not be possible with some data entry software), flagging, or reporting to the supervisor.

- Unmatched record (for hierarchical files). A household record may be followed by a "persons living in the house" record. In this case, two types of error may occur. One is where there are either fewer or more persons records than were actually

collected. A second is where entire households have been missed. Both cases can be caught and corrected with proper programming.

- Dropped cases (interviewed but not entered). Sometimes data are not entered in the computer. Where undesired cases/records are dropped, this should be verified.

- Appending error (data entered but not appropriately joined). Where data are entered in batches, programs need to be developed to join/merge all files. The number of cases must correspond to the sample size or number of persons interviewed (or records collected), whichever is applicable.

The data entry software described above is capable of catching many of these errors. Data entry using interactive software is often referred to as intelligent data entry. The double entry method, in which two different people enter the same data and the two files are then compared to find any differences, is also used to validate the data entry process. It is recommended that both double entry and intelligent data entry methods be applied in child labour surveys.

Electronic files generated after data entry may be formatted as modules. In this case, check each module separately, revisiting the questionnaires as necessary. Once data entry is complete, record identification variables should be checked to ensure that each is unique for each record or case, as applicable. If not, those cases with errors should be corrected by revisiting the questionnaires, which helps processors avoid problems when merging files.

3.3 Appending/merging/splitting files

Appending is the method of combining multiple files containing different observations (consisting of variables) into a single file. The properties of each variable are usually the same in each file. Conceptually, it helps to understand appending as increasing the data size vertically.

Merging is the method of combining variables from multiple files into a single file.
The variables in each file describe the same observation, usually at different units, such as household and person. Conceptually, it helps to understand merging as increasing the data size horizontally. The files to be merged must have one or more unique identifying variables in common. Merging operations can be of different types, depending on how the files are merged.

Splitting, on the other hand, also called sub-setting, refers to dividing files. This may occur in terms of numbers of either variables or observations.

Extreme care should be taken in merging files. Merging often leads to missing values, even though the files to be merged may be perfectly clean and correct. The different types of merging, appending, and sub-setting of files are described below. They are based on the SPSS class notes of the UCLA Academic Technology Services(17) and the University of North Carolina's Carolina Population Center.(18)

(17) SPSS Learning Module: Match merging data files. http://www.ats.ucla.edu/stat/spss/modules/merge.htm
(18) Stata Programming: Data Management. http://www.cpc.unc.edu/services/computer/presentations/statatutorial#combining
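Before turning to the specific cases, the merge operation itself can be made concrete with a small sketch. Assuming records are held as Python dictionaries (the field names are illustrative), a match-merge on a shared ID looks like this:

```python
def merge_on_id(house_rows, person_rows, key="id"):
    """Match-merge two lists of dict records on a unique identifier.

    Each merged record carries the variables from both files; a house
    with no matching person record simply keeps its own variables,
    which is one way missing values arise during merging.
    """
    persons_by_id = {row[key]: row for row in person_rows}
    return [{**house, **persons_by_id.get(house[key], {})}
            for house in house_rows]

houses = [{"id": 1, "a1": "owner"}, {"id": 2, "a1": "tenant"}]
persons = [{"id": 1, "x1": 40}]
merged = merge_on_id(houses, persons)
# merged[0] carries both a1 and x1; merged[1] has only the house variables
```

This mirrors what a sorted MATCH FILES operation does in SPSS, without the intermediate sorted files.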

One-to-one merging

One-to-one merging refers to the process of joining files where one record in each file constitutes a case, and each record in each file must have at least one unique identifying variable. There may or may not be more than one common variable. Merging is performed according to the unique identifying variables. This procedure is usually applied when data are collected at two different times, or when data are entered as two different modules, thus generating more than one file.

For example, File 1 (house file) may include three variables (a1, a2, and a3) representing the age of the household head, energy sources in the house, and the number of people living in the house (see Table 1, below). File 2 (person file), on the other hand, may include more detailed information about the household head, such as number of hours worked per week (x1), educational level (x2), and income (x3). The numbers 1, 2, and 3 represent unique record/case identification numbers based, perhaps, on cluster, household, and line number, nested in order. In this case, there will be a one-to-one merge, since one house has one household head. The merged file will present all six items of information (variables) about the household head in a single file.

During one-to-one merging, some statistical packages place restrictions on the number of variables (Stata, for example, has a limit of 2,047 variables; limitations imposed by some statistical packages are listed in Annex I). Although it is unlikely, in child labour surveys, that the total number of variables will exceed the number allowed by a particular statistical package, data processors should remain alert to the possibility during a one-to-one merge.
Table 1. Example of one-to-one merging (numbers are unique identifiers used for merging)

Before merge:
  File 1 (house file)    File 2 (person file)
  1 a1 a2 a3             1 x1 x2 x3
  2 b1 b2 b3             2 y1 y2 y3
  3 c1 c2 c3             3 z1 z2 z3

After merge:
  1 a1 a2 a3 x1 x2 x3
  2 b1 b2 b3 y1 y2 y3
  3 c1 c2 c3 z1 z2 z3

Some exceptions: one of the files may have more cases than the other, or the two files may have the same variables. Different statistical packages will handle such situations differently.

The operation can be performed in the following way:

- SORT the household file by the unique variable ID and save it as a separate file (File1), thereby preserving the original in case of mishaps.
- SORT the persons file by the unique variable ID and save it as a separate file (File2), thereby preserving the original in case of mishaps.
- Make sure both files (File1 and File2) are properly saved by closing both files and reopening them.
- Execute the MERGE FILES command to merge File1 and File2.
- SAVE the merged file (New File).

Sample SPSS syntax:

  GET FILE='Household.sav'.
  SORT CASES BY ID.
  SAVE OUTFILE='File1.sav'.
  GET FILE='Persons.sav'.
  SORT CASES BY ID.

  SAVE OUTFILE='File2.sav'.
  MATCH FILES FILE='File2.sav' /FILE='File1.sav' /BY ID.
  SAVE OUTFILE='New.sav'.

One-to-many merging

One-to-many merging is the process of merging files when multiple records constitute one observation and records from the same observation are located in different files. Each record in each file must have at least one unique identifying variable. There may or may not be more than one common variable. For example, there may be a house where three people live. Information about the house (e.g. building type, owner or tenant status, rent) resides in File 1, whereas information about each person (e.g. age, sex, socio-economic status) resides in File 2. The house information will be the same for each member of the house.

An example of one-to-many merging is shown in Table 2, where one household record may be associated with more than one person's record, depending on the number of people living in that house. Each record in each file must have a common unique identifying variable, and merging is performed on the basis of this unique variable.

Table 2. Example of one-to-many merging (numbers are unique identifiers used for merging)

Before merge:
  File 1 (house)          File 2 (person)
  1 a1 a2 a3              1 x1 x2 x3
  (same as in house 1)    1 y1 y2 y3
  (same as in house 1)    1 z1 z2 z3
  2 b1 b2 b3              2 m1 x1 z1
  (same as in house 2)    2 z1 m1 m2
  3 c1 c2 c3              3 m1 y1 y2
  (same as in house 3)    3 x1 y1 y2

After merge:
  1 a1 a2 a3 x1 x2 x3
  1 a1 a2 a3 y1 y2 y3
  1 a1 a2 a3 z1 z2 z3
  2 b1 b2 b3 m1 x1 z1
  2 b1 b2 b3 z1 m1 m2
  3 c1 c2 c3 m1 y1 y2
  3 c1 c2 c3 x1 y1 y2

Exception: one of the files may have records that do not match the other. Different statistical packages will handle such a situation differently.

The operation can be performed in the following way:

- SORT the household file by the variable ID and save it as a separate file (File1), thereby preserving the original in case of mishaps.
- SORT the persons file by the variable ID and save it as a separate file (File2), thereby preserving the original in case of mishaps.
- Make sure both files (File1 and File2) are properly saved by closing both files and reopening them.
- Execute the MERGE FILES command to merge File1 and File2.
- SAVE the merged file (New).

Sample SPSS syntax:

  GET FILE='Household.sav'.
  SORT CASES BY ID.
  SAVE OUTFILE='File1.sav'.

  GET FILE='Persons.sav'.
  SORT CASES BY ID.
  SAVE OUTFILE='File2.sav'.
  MATCH FILES FILE='File2.sav' /TABLE='File1.sav' /BY ID.
  SAVE OUTFILE='New.sav'.

Note that if the /FILE option is used instead of /TABLE with the MATCH FILES command, system-missing values will be generated.

Many-to-many merging

Many-to-many merging is used when many observations in one file match but the rest do not. This situation may arise when field data are collected at different times (e.g. in a modular survey or when there is a follow-up survey) with some common questions, some common households, and some common people living in the same house, but where others are not common. It will not be known how much overlap there is between the files before the merge is performed. Merging is again performed according to unique identifying variables that are the same in all the files to be merged. In a sense, many-to-many merging is a combination of the one-to-one and one-to-many processes.

In Table 3, two files are merged. Both File 1 and File 2 contain variables describing the house and the people living in the house, but the information was collected at different times. In the first survey, two people lived in the same house (File 1), so there are two records. In Case 1, however, one of the persons was no longer living there at the time of the second survey; after merging, therefore, there are empty variables for the second person. In Case 2, no members of the house were interviewed in the first survey, or the house was not an enumerated house, but in the second survey it was included. Similarly, in Case 3, members of a house were interviewed in the first survey, while in the second survey the house was not included as an enumerated house, perhaps because the house no longer existed. In many-to-many merged files, there are usually missing values of the "not applicable" type.
Table 3. Example of many-to-many merging (numbers are unique identifiers used for merging)

Before merge:
  File 1 (house & person)       File 2 (house & person)
  1 a1 a2 a3 (person 1)         1 x1 x2 x3
  1 b1 b2 b3 (person 2)         1 (person not interviewed)
  2 (person not interviewed)    2 z1 z2 z3
  3 d1 d2 d3                    3 (person not interviewed)

After merge:
  1 a1 a2 a3 x1 x2 x3
  1 b1 b2 b3
  2 z1 z2 z3
  3 d1 d2 d3

Appending data files

Appending is used when all data files contain the same variables, but each file records different observations. This is usually essential during data entry, and the process continues until all batch files are appended. Appending is necessary, for example, when data are entered in different locations and sent to survey headquarters at different times, or where data are entered in batches (modules) and each batch (module) constitutes a file. In Table 4, data were entered in batches. Each file has the same number of variables, but each case/record refers to a different house/person.

So the cases/records are simply appended at the end. After appending, the number of cases/records in the final file should equal the total number of cases/records across the input files. There may be exceptions when records have different variables.

Table 4. Example of appending data files

Before append:
  File 1        File 2
  1 a1 a2 a3    1 x1 x2 x3
  2 b1 b2 b3    2 y1 y2 y3
  3 c1 c2 c3    3 z1 z2 z3

After append:
  1 a1 a2 a3
  2 b1 b2 b3
  3 c1 c2 c3
  4 x1 x2 x3
  5 y1 y2 y3
  6 z1 z2 z3

Potential problems in appending and merging files

Merging can be very complex and, in general, one should be familiar with the data before attempting a merge. Potential problems (some of them really only inconveniences) to watch for before appending or merging files include these:

- Different variable names may have been used to represent the same thing in two data files (e.g. age in one file, c_age in another).
- The same variable name may have been used to represent different things in two data files (e.g. the variable wage represents income per week in one file and income per month in another). This is not a problem with to-be-merged files, but can be a problem with merged files.
- Variable names in two files may be the same but of different types (e.g. numeric vs string).
- String variables in two different files may be of different lengths (e.g. 8 and 16 characters).
- The two data files may have variables with the same name but different codes (e.g. the values for yes and no are reversed).
- Records may have been mismatched during merging.

Be sure to look at package-specific warning messages, and apply the prescribed precautions or remedies where appropriate. It is essential that such problems, if present in the data, be resolved before attempting to merge or append. Different statistical packages handle appending and merging differently. One should be careful about how particular empty cells will be filled, and consult the manuals where appropriate.
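Two of the safeguards above, comparing variable names across batches before appending and verifying the final record count afterwards, can be scripted. A minimal Python sketch, with batches represented as lists of dict records (field names are illustrative):

```python
def append_batches(batches):
    """Append batch files after checking they share the same variables.

    A mismatch in field names (e.g. 'age' in one batch, 'c_age' in
    another) is reported instead of silently producing missing values.
    """
    if not batches:
        return []
    expected = set(batches[0][0].keys())
    for i, batch in enumerate(batches):
        for rec in batch:
            if set(rec.keys()) != expected:
                raise ValueError(f"batch {i}: variables {sorted(rec)} "
                                 f"do not match {sorted(expected)}")
    combined = [rec for batch in batches for rec in batch]
    # the appended file must contain every record from every batch
    assert len(combined) == sum(len(b) for b in batches)
    return combined
```

The same count check, appended total versus sum of batch totals, is what should be reconciled against the sample size or the number of persons interviewed.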

Splitting data files

The process of splitting a file, also referred to as sub-setting, is often necessary during data cleaning and data analysis. Data files may be split according to the number of cases or records, or according to the number of variables. During split operations, most statistical packages offer options to delete or filter unwanted cases/variables. One should be careful when saving split files, since carelessness may result in overwriting the original file with a subset file.

3.4 Data validation

With effective entry procedures, data are already basically clean as soon as they have been entered. Secondary editing involves complex internal consistency and structure checks that require the review of several sections of the questionnaire and, where corrections are needed, must follow detailed recommendations. More advanced data processors can do this interactively. Some people prefer to carry out all data validations before merging files. Individual countries will decide which approach is appropriate for a given situation.

Even when the first statistical operations are performed with due care on new data, it is not uncommon to find cases such as six-year-olds reported as having completed secondary schooling. Validation checks are needed to find these errors and fix them. Although no system is perfect, the number of errors can be reduced more reliably if, at a minimum, the following steps are taken.(19)

- Number of variables check. It sometimes happens that the number of variables that should be generated from a questionnaire does not match the number of variables in the data. Various factors may be responsible: for example, a variable may not have been created in the first place, or the questionnaires may have been imperfectly translated from one language to another. Although this type of error should be recognized at earlier stages, it is best to recheck after all files have been merged or appended.

- Number of records/cases check.
If the household is considered as a case, check that the number of cases entered equals the expected number of households (which may be equal to the sample size). Also check that the number of person records equals the number of persons interviewed (or for whom data were collected).

- Record matches and counts. If household records and records about persons living in the household are in two different files, check that the identifying variables required for merging are clearly defined. Also make sure that all members belonging to a household have been properly entered, by comparing the number of persons in the household file with the number of persons in the same household in the person file.

- Wild codes and out-of-range values. Wild codes are codes that are not defined as acceptable legal codes in the data, whereas out-of-range values are values assigned to acceptable legal codes that may nevertheless be wrong. For example, if 1 stands for male and 2 for female, 3 is a wild code, whereas giving the weekly income of a child as 1000 is an out-of-range value when it should actually be 100. Frequency distributions, as well as graphs, expose these types of errors, so frequency distributions of all variables should be examined for possible anomalies. Revisit the questionnaire as necessary to correct these problems.

(19) Further development of the procedures outlined in Inter-university Consortium for Political and Social Research (ICPSR), Guide to Social Science Data Preparation and Archiving, op. cit., and Audience Dialogue: Survey analysis, op. cit.

- Missing values. Flag all values that are missing, in each case indicating the reason why the value is missing. Responses such as "do not know", "not applicable", "not available", and "refused to answer" should be clearly marked. Their codes, to the extent possible, should also be uniform throughout the dataset. No cell in the dataset should be left blank.

- Consistency checks. There are always possibilities of inconsistencies among responses to related questions. For example, 100 children say they did work, but 105 children report earnings. Inconsistencies may also involve more than two variables: for example, the five extra children who reported earnings may be wrong because they were in fact attending school. One of the easiest ways to perform consistency checks is to check the question route. For instance, where the questionnaire says "When the answer to question number 10 is 2 (NO), skip to question number 14", a logical rule can be developed: if Q10 = 2, then Q11 = Q12 = Q13 = 99 (meaning "not applicable"). If the data indicate that this is not true, then it is possible that Q10 is really 1 but was entered as 2 during data entry. This is probably the case if Q11, Q12, and Q13 all have valid codes, so the answer to Q10 can easily be changed to 1 (YES). If they have been coded but are not all valid, one may need to refer back to the original questionnaire to find out which need to be changed. Comparing frequency counts or cross-tabulations among all possibly related variables will reveal many inconsistencies. Revisit the questionnaire to correct these problems.
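Both the wild-code check and the skip-pattern rule above can be scripted. A minimal Python sketch, using the codes from the examples in the text (1 = male, 2 = female; 99 = "not applicable"); the question and variable names are illustrative:

```python
from collections import Counter

NOT_APPLICABLE = 99

def find_wild_codes(records, variable, legal_codes):
    """Frequency-count a variable and return values outside the codebook."""
    freq = Counter(rec[variable] for rec in records)
    return {code: n for code, n in freq.items() if code not in legal_codes}

def check_skip_pattern(record):
    """Apply the Q10 skip rule: if Q10 = 2 (NO), Q11-Q13 must be 99.

    If all three follow-ups carry valid (non-99) codes, the likely error
    is in Q10 itself, so it is changed to 1 (YES); a mixed pattern is
    flagged for review against the paper questionnaire.
    """
    followups = [record["q11"], record["q12"], record["q13"]]
    if record["q10"] == 2 and any(v != NOT_APPLICABLE for v in followups):
        if all(v != NOT_APPLICABLE for v in followups):
            record["q10"] = 1          # follow-ups answered: Q10 mistyped
        else:
            record["flag"] = "review"  # mixed pattern: check questionnaire
    return record

wild = find_wild_codes([{"sex": 1}, {"sex": 2}, {"sex": 3}], "sex", {1, 2})
# wild → {3: 1}
```

Each such rule should be tested on known cases before being run over the full dataset, since a faulty rule can introduce more errors than it removes.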
A useful example (see also Annex 3) of how logic checking rules can be developed is presented in the following consistency check rules, derived from the education module of the 1999 Zambia End of Decade and Child Labour Survey(20), in which information about children aged 5-17 years was collected:

- If a child answers YES to "ever attended school", then skip the (not applicable) question "why not attended school".
- Where a child answers NO to "ever attended school", yet reports a grade to the question "highest grade attained", his/her answer should be changed to YES to "ever attended school".
- Those who answer neither the "ever attended school" nor the "highest grade attained" question should be taken as responding NO to "ever attended school" and GRADE 0 to "highest grade attained". Rationale: neither of the two related questions is answered, so the likelihood is that the correct answers are actually NO.
- If the child answers NO to "ever attended school" and NO to "attending school", then the rest of the questions in the education module should be considered not applicable.
- If a child answers NO to "attending school", then skip the (not applicable) question "grade currently attending".
- Where a child answers NO to the question "attending school last year", but YES to "type of school attended last year" and YES to "grade attending last year", the first response should be changed to YES. Rationale: the two related YES responses suggest that YES is a more likely response than NO for the first question.

(20) Central Statistical Office, Zambia End of Decade and Child Labour Survey (1999), Household Questionnaire. http://www.ilo.org/public/english/standards/ipec/simpoc/zambia/document/zafh01gq.pdf
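Two of these rules can be sketched in code. The variable names and missing-value convention below are illustrative, not taken from the Zambia questionnaire; the logic follows the rules above (a reported grade implies attendance, and two unanswered related questions default to NO and grade 0):

```python
MISSING = None  # illustrative missing-value marker

def clean_education(record):
    """Apply two Zambia-style education consistency rules to one record."""
    attended = record["ever_attended"]
    grade = record["highest_grade"]
    if attended == "NO" and grade not in (MISSING, 0):
        record["ever_attended"] = "YES"   # a reported grade implies attendance
    elif attended is MISSING and grade is MISSING:
        record["ever_attended"] = "NO"    # neither answered: assume NO
        record["highest_grade"] = 0       # and grade 0
    return record
```

Because the second rule writes values that the first rule reads, the order in which such rules are applied matters; this is the sequencing issue discussed in the next paragraphs.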

During consistency checks, extreme care should be taken to avoid executing scripts based on false logic (a faulty consistency rule). If several consistency check operations are to be carried out involving the same variable, moreover, take great care to choose the appropriate sequence when executing the checks and changing values. Some programmers find it more effective to split files before running consistency checks and to merge the files again once the checks are complete. Errors are inevitable during complex consistency checks, so it is good practice to keep old files so that it is always possible to refer to a copy of the original data.

At a minimum, consistency checks should be run to ensure that no fields are blank and that all fields contain valid values. Finally, the first three to five per cent of the records should be carefully checked to ascertain that those records are error free. Afterwards, random checks should be conducted to test the overall integrity of the dataset.

At the end of the data validation stage, there should be no missing values of any type (e.g. "not applicable" codes should be properly included), no consistency errors, and all records should be matched, with all record/case identifiers uniquely defined. In other words, all missing values must be properly defined. However, if any erroneous values remaining in the dataset cannot be rectified, then a file should be generated that contains the following information:

- case/record identification;
- type of error (missing value, non-response error, etc.);
- a detailed breakdown in terms of numbers of cases, records, etc.;
- the reasons why such errors could not be corrected;
- some tabulations to show their impact on the overall dataset;
- the number of mismatches between cases and applicable records in a case; and
- mismatches between the number of cases and the data collected, and possible reasons for the errors.

In addition, a list of all variables with labels needs to be generated.
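Assembling such an error report file is mechanical once the decisions are recorded; a minimal sketch, with the field names chosen here purely for illustration:

```python
import csv
from collections import Counter

def write_error_report(errors, path):
    """Write unresolved errors to a CSV file for the supervisor.

    `errors` is a list of dicts keyed by the illustrative field names
    below; the returned Counter gives a breakdown by error type, so the
    impact on the overall dataset can be tabulated.
    """
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["case_id", "error_type", "reason"])
        writer.writeheader()
        writer.writerows(errors)
    return Counter(e["error_type"] for e in errors)
```

The returned counts can feed directly into the tabulations showing the errors' impact on the dataset.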
These tables, together with the error report file and variable list, should then be forwarded to the supervisor for consideration.

3.5 Final decisions on errors

Once a checklist of errors is produced, together with tables showing the overall impact of these errors on the data, the supervisor, in consultation with survey associates, should decide what to do with those items. Depending on error type, the necessary decisions may include the following:

- how errors in the data should be flagged;
- which cases/records/variables can be imputed, under what conditions, and how that information will be incorporated in the dataset once a decision is made;
- which records/cases need to be referred back to the survey questionnaires for further investigation, and how that information will be incorporated in the dataset;

- which cases/records/variables can be dropped, what the reasons are for dropping them, and how that will affect the data overall; and
- the wording of the documentation for all of the above cases.

As soon as the necessary decisions are taken, they should be communicated to the data processors, who should then incorporate them into the datasets as quickly as possible.

3.6 Completion of data processing and generation of data files

Most data processors find their job a never-ending process. Even where data are already cleaned, data processors often go on to sub-set the data, to create additional variables, and so on. This frequently leads to a major problem: various data processors find themselves working on different versions of the dataset. When all the information needed for preparation of the final documentation is available, the supervisor should declare all data processing activities ended. The data file or files at this point can be marked as Version 1 of the dataset. Supervisors should then assign someone to compile all the information collected by the data processing personnel into a single document (file).

Processing management

One supervisor should oversee all data entry and data processing operations. Supervisory roles include the following:

- making sure that all data processing activities are progressing according to schedule;
- providing administrative assistance to data processing personnel (e.g. offering an alternative computer where another has crashed);
- ensuring that data files are structured (e.g.
flat file or hierarchical, record types) before data entry, and that all variables are coded, labelled, and assigned values, including types and codes for missing values;
- trying to ensure that data processing personnel need not worry about coding and, where they do have to, minimizing the task;
- ensuring that all files are merged and/or appended;
- controlling the master file, so that personnel responsible for cleaning and analysing data are always working with up-to-date files;
- in consultation with other concerned parties (e.g. data analysts and survey designers), making the necessary decisions regarding errors (see section 3.5, above);
- overseeing random household record checks as an overall quality-control measure throughout the data processing procedures;
- ensuring that all decisions are recorded during data processing, and that the final documentation contains all relevant information;
- ensuring that the necessary steps have been followed in creating an effective public use dataset;
- deciding when to declare an end to data processing activities, and then taking control of all data and documentation files;

- making sure that the necessary files are appropriately located in the main preservation system for future reference; and
- serving, once processing is complete, as a contact point for the datasets.

The supervisor's main concern should be to reduce as much as possible the time between capturing the data after field collection and preparing it for analysis, without jeopardizing data quality.

3.7 Preparation of public use datasets

Confidentiality issues

Most child labour surveys are expensive and of great national importance. They provide statistics that will help to improve schooling, eliminate poverty, and increase healthcare resources and other public or private services. Each individual survey response is significant, and it is important to share the survey information with as many people as possible. The following concepts are adapted from two reference documents.(21, 22)

Anonymity of data. At the same time, survey respondents should be able to remain confident that their personal identities will not be compromised. On the one hand, then, child labour survey data should be made available to the public for in-depth secondary analysis; on the other hand, it is essential that proper procedures be followed to ensure the anonymity of the data, so that no children, their families, or the people/organizations they work for can be identified from the raw data. Unless a dataset is anonymous, it cannot be freely distributed.

Alteration of files. Information that could imperil the confidentiality of children, parents, households, or organizations, especially those involved in hazardous work, must not be disclosed. Thus, public use child labour datasets may require alteration of the files. Two kinds of variable can compromise the identity of an individual or organization:

- Direct identifiers. These are variables that directly identify a person, entity, etc. (e.g. post code).
These variables, coupled with other linkable identification numbers such as date of birth, can be used to identify an individual. Such identifiers should be removed or properly coded.

- Indirect identifiers. These are variables that are not treated as direct identifiers but that, used together with other variables or other publicly available information, can identify a person, entity, etc. For instance, where a very low number of people are involved in a particular occupation in a particular region, that information can be coupled with publicly available recent census data to identify a person.

The data analysts, together with other members of the national statistics office and, if necessary, ILO technical experts, should review and deal with potential identifiers. So far, though, in all SIMPOC national surveys, indirect identifiers have not been found to be much of a problem.

(21) Inter-university Consortium for Political and Social Research (ICPSR), Guide to Social Science Data Preparation and Archiving, op. cit.
(22) Rasinski, K. et al.: "Producing a public use file: A case study", American Statistical Association (1997). Retrieved from http://www.amstat.org/sections/srms/proceedings/papers/1997_074.pdf
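Alterations of this kind are mechanical once the rules are fixed. A minimal sketch of two common disclosure-control transformations, age bracketing and income top-coding; the variable names, the 5-year bracket width, and the 100-unit cap are illustrative choices:

```python
def anonymize(record, income_cap=100):
    """Replace exact age with a 5-year bracket and top-code income.

    The original record is left untouched; a transformed copy is
    returned for the public use file.
    """
    out = dict(record)
    age = out.pop("age")                     # drop the exact value
    low = (age // 5) * 5
    out["age_bracket"] = f"{low}-{low + 4}"  # e.g. age 7 → "5-9"
    if out["income"] > income_cap:
        out["income"] = income_cap           # report only "cap or more"
        out["income_topcoded"] = True
    return out

anon = anonymize({"age": 7, "income": 250})
# anon["age_bracket"] → "5-9"; anon["income"] → 100
```

Working on a copy reflects the principle that the restricted-access original must be preserved unmodified.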

Without being exhaustive, methods of ensuring anonymity in child labour survey data include the following.

Suppressing. A variable or a number of variables may be suppressed from the dataset. For example, building numbers and addresses should be removed from public use datasets.

Bracketing. A range of values of a variable may be collapsed into a single value. For example, a 7-year-old child can be placed in the 5- to 9-year age bracket.

Top/bottom coding. One can restrict the upper and/or lower range of a variable. For example, children who earn excessive amounts might be identified and grouped with other children simply described as earning more than 100 units of local currency per day.

Recoding. One can combine two or more similar variables and recode them into one. For example, province, district, town, and building number can be combined to form a unique variable.

Data swapping. One can swap the personal records of one household with those of another household in such a way that the overall result is not altered. Household identifiers then cannot be used to identify the people living in a given house.

Data perturbation. One can modify individual records (for example, by adding or subtracting a fixed number) in such a way that individual records are changed but the overall counts remain the same.

Public use datasets should always be copied from the original dataset and named according to an appropriate convention (see section 2.3). The original copy can retain its restricted access status, and should always be kept for reference without further modification.

3.8 Final documentation

Preparing high-quality documentation can be a time-consuming task, but clear and complete documentation greatly enhances the survey process. It is always best to engage those personnel who have been involved with the dataset from the beginning.
They will know better than anyone else how the datasets were created, why the derived variables were created, and which major decisions were taken or editing rules applied during the data processing, and why. On the other hand, people involved with the survey and initial data processing are often so close to the project that they tend to think some of the information does not need to be documented. However, the data is intended for use by a variety of people, and thorough, clear, and concise documentation greatly enhances its usability.

Preparation of documentation or metadata should begin well before the start of actual data processing. As soon as data processing is complete, all the relevant information should be compiled. Two items can serve as final documentation. The first is a short file describing the structure of the dataset together with information concerning variables and values, coding and classification schemes, and weighting. A brief description of the survey should also be included. The second document is a more detailed one, and is described in what follows.

The following are extracts relevant to child labour surveys from the Data Documentation Initiative (DDI) codebook DTD Version 1.0 (FINAL).23 Note that the information organized under the following headings is not a replacement for the codebook or data dictionary for ASCII datasets that defines the micro-data layout.

Summary survey description

Title. The full, authoritative title of the survey will be used for all data and documentation, and it should indicate the geographic scope of the data collection as well as the time period covered. For example: Child labour in Portugal: Social characterization of school-age children and their families, 1998.

Subtitle. A secondary title may be used to amplify or state limitations on the main title. For example: Child labour in Portugal, 1998.

Alternative title. The alternative title may be the title by which a data collection is commonly referred to, or it may be an abbreviation of the title. For example: SIMPOC Portugal survey, 1998.

Parallel title. A title may be translated into another language. For example: Trabalho Infantil em Portugal: Caracterização social dos menores em idade escolar e suas famílias, 1998.

Keywords. Words or phrases should be specified that describe salient aspects of the survey and which may be used in building keyword indexes for classification and retrieval purposes.

Abstract. This is a summary describing the purpose, nature, and scope of the child labour data collection. Special characteristics of the contents and a listing of major variables in the data can be added here.

Summary data description

This should briefly describe the child labour survey in terms of its duration and data collection dates, geographic coverage, and unit of analysis.

Time period covered. This is the time period to which the data refer (the period covered by the data), not the dates of coding, of making documents machine-readable, or of collecting the data.
For example, if the data was collected in 1999, and one question was "Did you work last year?", the time period should be 1998-99.

Date of collection. Contains the date(s) when the data were collected.

Country. Name of the country where the survey was conducted.

23 The Data Documentation Initiative Codebook DTD, http://www.icpsr.umich.edu/ddi/users/dtd/codebook.html (The excerpts have been modified, since the document was prepared as a guide to different types of surveys conducted in a variety of situations, and some fields might not be applicable to a particular country. In addition, the same information is sometimes presented in different sections of the codebook, since, during on-line dissemination, software might seek the same information under a variety of headings. A full version of the codebook is available from the Codebook DTD website.)

Geographic coverage. Includes the total geographic scope of the data, and any additional levels of geographic coding provided in the variables. Most child labour surveys are national in scope.

Geographic unit. This item refers to the lowest level of geographic aggregation covered by the data, for example province, state, or district.

Unit of analysis. For most child labour surveys, the basic unit of analysis or observation is the individual person.

Universe. The summary should also include a description of the population covered by the data in the file: the group of persons or other elements who are the objects of the survey and to which the survey results refer. Age, nationality, and residence commonly help to delineate a given universe (also known as a universe of interest, population of interest, or target population), but a number of other factors may be involved, among them age limits, sex, marital status, race, ethnic group, nationality, income, veteran status, and history of criminal conviction. The universe may consist of elements other than persons, including housing units and countries. In general, it should be possible to tell from the description of the universe whether a given individual or element (hypothetical or real) is a member of the population under survey (for example, where a child labour survey interviewed only children in the 5- to 15-year age group).

Kind of data. This item refers to the type of data included in the file, for example survey, aggregate, clinical, or event/transaction data; program source code; machine-readable text; administrative records data; textual data; coded textual data; coded documents; time budget diaries; observation data/ratings; or process-produced data. All applicable data types should be included.

Notes. Notes should be used to provide additional information, clarifying and annotating codebook information on the scope of the data collection.

Survey methodology and processing

Time method.
Panel, cross-sectional, trend, and time-series are some ways of approaching the time dimension of data collection.

Data collector. This refers to the entity (e.g. a national statistics office) responsible for administering the questionnaire or interview or for compiling the data.

Frequency of data collection. If the data were collected at different times, indicate the frequency with which this happened. For example, in first-time child labour surveys, "first time" would suffice.

Sampling procedure. This is the type of sample and sample design used to select survey respondents representative of the target population. It may include reference to the target sample size and the sampling fraction.

Major deviations from the sample design. Show correspondences as well as discrepancies between the sampled units (obtained) and available statistics for the population as a whole (age, sex ratio, marital status, etc.).

Mode of data collection. This is the method used to collect the data (e.g. face-to-face interviews).

Type of research instrument. "Structured" indicates a questionnaire that presents all respondents with the same questions, and that may include pre-coded answers. If a

small portion of such a questionnaire includes open-ended questions, provide appropriate comments. "Semi-structured" indicates that the questionnaire contains mainly open-ended questions. "Unstructured" indicates that in-depth interviews were conducted. Most child labour surveys are structured in nature.

Actions to minimize losses. The summary should include actions taken to minimize data loss, such as follow-up visits, supervisory checks, historical matching, and estimation.

Control operations. Describe the methods used to facilitate data control during the survey and subsequent data processing.

Weighting. The use of sampling procedures may make it necessary to apply weights to produce accurate statistical results. Describe here the criteria for using weights in the analysis of a data collection. If a weighting formula or coefficient was developed, provide the formula, define its elements, and indicate how the formula was applied to the data.

Cleaning operation. Methods used to clean the data collected may include consistency checking and wild code checking, for example.

Study-level error note. Include any information annotating or clarifying the methodology and data processing procedures.

Data appraisal information

Response rate. This refers to the percentage of sample members who provided information.

Estimates of sampling error. Include a measure of how precisely one can estimate a population value from a given sample.

Other forms of data appraisal. Include such issues as response variance, non-response rate, testing for question bias, interviewer and response bias, and confidence levels.

Data access

This section describes access conditions and terms of use, as well as other information regarding the availability and storage of the data collection.

Location. Say where the data is currently stored (e.g. a national statistics office).

Archive where study originally stored. Give the place, if any, where the data was stored earlier (e.g. another ministry or department).
Availability status. Provide a statement of data availability. For example, data may be unavailable because it was embargoed before formal dissemination of the final report.

Extent of data. Summarize the number of physical files that exist in a dataset, recording the number of files that contain data and noting whether the collection contains machine-readable documentation or other supplementary files and information such as data dictionaries, data definition statements, and data collection instruments.

Completeness of study stored. Describe the relationship of the data collected to the amount of data coded and stored in the data collection. Where appropriate, explain why certain items of collected information were not included in the data file.

Number of files. Give the total number of physical files associated with a collection.

Collection notes. Provide any additional information regarding data availability.

Access authority. Identify the contact person or organization that controls access to the data collection at the country level (with full address and telephone number, if available).

Data use statement. Explain the terms of use for the data collection, if any.

Conditions. Where appropriate, describe use and access conditions not covered elsewhere.

Citation requirement. Specify any text that should be cited in publications based on analysis of the data.

Deposit requirement. Provide information regarding the responsibility of external users for informing countries or the ILO of their use of the data when citing or providing copies of the published work.

Notes. Include a generic notes sub-section in the data access section to facilitate annotation/clarification of information regarding data access.

File-by-file descriptions

All files, including data and documentation files, should be individually described.

File name. Use a short title to distinguish a particular file/part from other files/parts in the data collection.

File contents. Provide an abstract or short description of the file describing its purpose, nature, and scope, special characteristics of its contents, major subject areas covered, and the reason the file was first created. It is also important to list the major variables contained in the file. In the case of multi-file collections, describe the contents of each file individually.

File structure. Describe the type of file structure, for example indicating whether a given file is hierarchical, rectangular, or relational.

Record or record group. If the file is hierarchical or relational, then describe the record groupings.

Label (of record). Provide more detailed information for each record group.

Dimensions (of record).
Describe the physical characteristics of the record, including such items as number of variables per record, number of cases, and record length if applicable.

Notes (on record or record group). Indicate any additional information regarding this record type.

Dimensions of the overall file

Overall case count. With rectangular files, specify the number of cases or observations in the entire file.

Overall variable count. With rectangular files, specify the number of variables in the entire file.

Logical record length. The logical record length is the number of characters in each record. Provide this for rectangular files or where all records in a hierarchical file are the same length.

Type of file. If the data files are of mixed types (e.g. both ASCII and software-dependent), mention their types.

Data format. Specify the physical format of the data file, i.e. delimited format, free format, software-dependent, etc.

Place of file production. Indicate which department produced the file.

Extent of processing checks. Indicate here, at the file level, the types of checks and operations performed on the data file.

Processing status. Indicate the processing status of the file, if part of a multi-file collection.

Missing data. Provide information that can be used to account for missing data: show that missing data have been standardized across the collection, that missing data are the result of merging, etc.

Software. Identify the software used to create the file, including the software version number.

Version statement. Provide a version statement for the data file.

Notes. Provide additional information about the data file not covered in the other elements of this summary.

Variable group

This refers to a group of variables that may share a common subject, arise from the interpretation of a single question, or be linked by some other factor. Specify whichever of the following apply:

Type. Show the general type of variable grouping (topic, multiple responses, etc.).

Var. This lists the IDs of all constituent variables in the group.

Variable group. This indicates all the subsidiary variable groups nested under the current variable group, allowing the encoding of a hierarchical structure of variable groups.

Name. This is the unique ID for the group.

Summary data description references.
These record the ID values of all elements within the summary data description, referred to previously, that apply to this variable group. These elements include time period covered, date of collection, nation or country, geographic coverage, geographic unit, unit of analysis, universe, and type of data.

Methodology and processing references. These record the ID values of all elements within the study methodology and processing section, described previously, that apply

to this variable group. These elements include information on data collection and data appraisal (e.g. sampling, sources, weighting, data cleaning, response rates, and sampling error estimates).

Variable group label. Create a short description of the variable group.

Variable group text. This is a lengthier description of the variable group.

Variable group definition. Provide a rationale for why the variables are grouped in this way.

Notes. Add any clarifying information/annotation regarding the variable groups.

Variable

Each variable needs a name to serve as its unique ID. For each variable provide the following information: whether the variable is a weight; a reference to the weight variable for this variable; a question ID for the variable; a reference to the file to which the variable belongs; which format has been used (e.g. SAS, SPSS); the number of decimal points in the variable; whether the values are discrete or continuous; which record type this variable belongs to; references to the summary data description recording the ID values of all elements that apply to this variable; and references to the methodology and processing section recording the ID values of all elements that apply to this variable.

Variable label. This is a descriptive phrase that defines the variable. The length of the phrase may depend on the statistical analysis system used.

Imputation. Imputation is the process by which missing values for items that survey respondents failed to provide are estimated. If applicable in this context, mention the procedure used.

Embargo. This provides information on variables that may not be currently available because of policies established by national statistics offices or ministries.

Response unit. This describes who provided the information contained within the variable (e.g. respondent, proxy, interviewer).

Analysis unit. This provides details of whom or what the variable describes.

Literal question. This is the literal text of the actual question asked.
Post-question text. This text describes what occurred, if anything, after the literal question was asked.

Interviewer instructions. These are the specific instructions to the individual conducting an interview.

Range of valid data values. This refers to the values for a particular variable that represent legitimate responses.

Range of invalid data values. This refers to the values for a particular variable that represent missing data, not-applicable responses, etc.

List of undocumented codes. These are values whose meanings are unknown.

Summary statistics. This refers to one or more statistical measures that describe the responses to a particular variable, and which may include one or more standard summaries, e.g. minimum and maximum values.

Variable text. This refers to an extended description of the variable, beyond that provided by the variable name and variable label.

Coder instructions. These are any special instructions to those who converted the information from one form to another for a particular variable. These might include the reordering of numeric information into another form, or the conversion of textual information into numeric information.

Version statement. If a variable has undergone changes, a version statement is required.

Derivation. Used only in the case of derived variables, this element provides both a description of how the derivation was performed and the command used to generate the derived variable, as well as indicating the other variables in the study used to generate the derivation.

Derivation description. This is a textual description of the way in which the variable was derived, for display to users.

Derivation command. This is the actual command used to generate the derived variable. The syntax attribute is used to indicate the command language employed (e.g. SPSS, SAS, Fortran).

Variable format.
This refers to the format of the variable in question, and includes type (character or numeric), name of the particular format (if applicable: schema, i.e. the vendor or standards body which defines the format, one of SAS, SPSS, IBM, ANSI, ISO, or XML-DATA), category (date, time, currency, other), and network identifier for the format definition.

3.9 Final tabulation

This section concentrates on aspects of tabulation especially relevant to data processors. Final tabulation plans are usually developed at the very outset of the survey process, and are not necessarily seen as part of the data processing. Usually, however, it will be the data processors who prepare the tables for the analysts. All surveys involve some kind of tabulation plan. With child labour surveys, the tabulation plan is normally formulated in discussion among key stakeholders. Once data processing is complete, the data processors should develop a complete set of tables based on variables specified by the data analysts.

During tabulation the dataset will undergo changes, including additions and deletions, and, even where the most rigorous data cleaning was performed in earlier stages, additional errors may be found. In addition, unanticipated variables will need to be constructed, and various analysts will want to subset the data by cases and/or by variables. In consequence, multiple versions of the dataset will probably find their way into use. Copies of these various versions may end up on different desktop computers, further adding to the potential confusion.
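The core operation behind most survey tables is summing sampling weights within categories to estimate population counts. The following standard-library sketch illustrates that step; the records, the variable names ("sex", "working"), and the weight values are hypothetical, and real SIMPOC tables would be produced from the cleaned, weighted dataset in the chosen statistical package.

```python
from collections import defaultdict

def weighted_tab(records, row_var, weight_var="weight"):
    """Sum sampling weights by category to estimate population counts."""
    totals = defaultdict(float)
    for rec in records:
        totals[rec[row_var]] += rec[weight_var]
    return dict(totals)

records = [
    {"sex": "M", "working": 1, "weight": 120.5},
    {"sex": "F", "working": 0, "weight": 98.0},
    {"sex": "M", "working": 1, "weight": 101.5},
]
print(weighted_tab(records, "sex"))  # weighted counts by sex
```

Cross-tabulations follow the same pattern with a tuple of variables as the key, which is why a tabulation run is sensitive to which version of the dataset, and which weight variable, is in use.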

Please note that any additional derived variables created and included in the dataset for preservation must also follow a proper naming convention and be labelled in the manner described earlier. These derived variables should also be subjected to standard error-checking procedures such as wild coding and missing value checks. Explanations as to why and how they were created should be included in the codebook or metadata. Finally, all the files should be given distinct names indicating their proper version number. Extreme care should be taken that only up-to-date and complete files (not subsets) are used for tabulations.

3.10 Conversion of data files to other formats

Data files are usually generated in a package-specific format such as SPSS or SAS, and can only be read efficiently in that package. A secondary analyst might not have access to the package in which the data was originally created. To read the files in another package (e.g. where a file is created in SAS for Windows and then read in SPSS for Windows) or in a different computing environment (e.g. created on a PC with Windows and read on a workstation with Unix) is not usually a straightforward process. Sometimes it is not possible at all. Thus, it is always advisable to record data in alternative formats.

The best option is to create the dataset in ASCII format (text file). ASCII has the advantage that it can be imported by any software, provided the necessary documentation is available. It is also better suited to long-term preservation. A dataset created and preserved today may not be used for years. By the time someone does want to use the data, that specific software version may no longer be available, and currently available versions may be incompatible with the older one. In such cases, whole datasets become obsolete. From the archival point of view, therefore, it is desirable that data be recorded in ASCII format as well as in the vendor-specific format.
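Independent of any particular statistical package, the target format is simple: a fixed-width text file plus a data dictionary recording where each variable starts and how wide it is. This sketch (with a hypothetical four-variable layout) shows both files being produced; it is an illustration of the format, not a replacement for the package-specific export steps below.

```python
# Each entry: (variable name, width in characters). Hypothetical layout.
LAYOUT = [("hhid", 6), ("age", 2), ("sex", 1), ("school", 1)]

def write_fixed_width(rows, data_path, dict_path):
    """Write rows as fixed-width ASCII plus a minimal text data dictionary."""
    with open(data_path, "w") as f:
        for row in rows:
            f.write("".join(str(row[name]).rjust(width)
                            for name, width in LAYOUT) + "\n")
    with open(dict_path, "w") as f:
        col = 1
        for name, width in LAYOUT:
            f.write(f"{name}: columns {col}-{col + width - 1}\n")
            col += width

rows = [{"hhid": 101, "age": 9, "sex": 1, "school": 2}]
write_fixed_width(rows, "cls.dat", "cls_dict.txt")
```

Without the dictionary file the data file is unreadable, which is why the conversion steps below insist on exporting the variable layout and value labels alongside the .dat file.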
SPSS data can be converted to ASCII format in the following manner:

1. Open the data in variable view.
2. Label all variables with proper wordings (refer to Final documentation, Section 3.8, above). If no appropriate words are found, the exact question text can be used as a last resort.
3. Add all values in the value column.
4. Make sure the weight option is turned off; otherwise there will be more cases in the data file.
5. Save the data file as a fixed-width ASCII file using the Save As command and selecting the fixed ASCII option; a data file with a DAT extension will be created.
6. Variable names and required column numbers will be shown in the output window.
7. In the output window, select File and then Display Data Info, and all variables with their labels and values will be displayed.
8. Take frequencies of all variables (using the Without Table option).

9. Export all objects from the output window into a text file by selecting File, then Export, then Save; a file with a TXT extension will be created.
10. Edit the text file, adding or deleting extra information as necessary.
11. This text file is now the data dictionary and codebook for the ASCII data file (with the DAT extension) that was created in Step 5, above.
12. Take the frequency tables for all variables and save them as an output file. Since some tables, such as those for unique identifiers, will be very large and may take a lot of memory space, they can be discarded, although this is not advisable. These tables will prove helpful to those who read the ASCII data into a package, as a cross-check to ensure that the data have been properly read.

A sample codebook and data dictionary produced in SAS is included in Annex IV. Please bear in mind that the above steps would not produce the codebook in Annex IV. Whatever statistical package is used for the data processing, it is strongly recommended that one create an ASCII dataset with the necessary data dictionary and documentation.

3.11 Storage of all files

Different people will access data generated from child labour surveys. Initially, data and associated documentation are usually prepared in a software-specific format. The choice of software for the processing of survey data, again, depends both on the availability of such software and on the human and financial resources at the country level. Nevertheless, different people in different countries using different software will want to access these data. Thus, accessibility of the required data and documentation files is of paramount importance. This means that different types of files need to be generated and stored efficiently. Once all the files are ready, they should be transferred into a new directory. The following list of the file types involved is typical; the actual number of files will vary from country to country.

1.
Data in a package-specific format (e.g. SAS for Windows, the software used to clean and analyse the data).
2. Data in delimited ASCII format with the necessary data dictionary (all files in text, and more than one file).
3. Public use dataset stripped of any variables that could be used to identify a person/institution, in a package-specific format (file modified from Item 1, above).
4. Public use dataset without variables that could be used to identify a person/institution, in delimited ASCII format with the necessary data dictionary (modified from Item 1 or converted from Item 3, above).
5. Final documentation, preferably both in the Data Documentation Initiative Codebook DTD format and in ASCII text, in the original language.
6. Questionnaire with answers, text in the original language (preferably annotated with variable names, including derived variables, and created using any MS Office Suite software package).

7. Any logical rules that were developed as a part of data processing and that are not included elsewhere, in the original language.
8. Programs that were developed as a part of the data processing and tabulation activities, in the original language (preferably in ASCII text).
9. Interviewer and/or supervisor's instruction manual, preferably in MS Word and in the original language.
10. Documentation, preferably in MS Word and in the original language, describing the structure of the dataset and providing information concerning variables and values, coding and classification schemes, details about derived variables, weighting, and grossing, together with details of any process undertaken to make the dataset anonymous (derived from Item 5, above, for easy reference).
11. Any reports based on the dataset, preferably in MS Word and in the original language.
12. Generated codes, such as those for occupation, industry, and injury.
13. Any classification (e.g. occupation, injury) files that were specifically created for the child labour survey.
14. All items mentioned in Items 5-13 in any other languages into which they have been translated.

A completeness check should also be performed to ensure that:

the correct dataset has been moved to temporary storage and is ready for transfer to permanent storage (data processing personnel are often involved in a number of different activities at the same time, and the wrong files may have been moved to temporary storage);

the questionnaire is included in its exact original form; and

all documentary materials (e.g. codebooks, programs) are accounted for and ready for use by secondary analysts.

Finally, an index file should be generated that contains, at a minimum, the following three types of information about each file: file name; creation or last modification date; and a one-line description of the file contents. Information such as file size and who created the file and why can also be included in the index file, which should be saved as a text file.
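An index file of the kind just described can be generated mechanically, leaving only the one-line descriptions to be supplied by the data manager. The sketch below (file names and descriptions are hypothetical) produces one tab-separated line per file with name, last-modification date, and description.

```python
import os
import time

def build_index(directory, descriptions):
    """Return index text: one 'name<TAB>date<TAB>description' line per file."""
    lines = []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue
        # Last-modification date of the file, formatted YYYY-MM-DD.
        mdate = time.strftime("%Y-%m-%d",
                              time.localtime(os.path.getmtime(path)))
        desc = descriptions.get(name, "(no description)")
        lines.append(f"{name}\t{mdate}\t{desc}")
    return "\n".join(lines)

# Hypothetical usage: descriptions keyed by file name.
DESCRIPTIONS = {
    "cls.dat": "Main dataset, fixed-width ASCII",
    "cls_dict.txt": "Data dictionary for cls.dat",
}
```

Saving the returned text as a plain text file in the same directory keeps the index readable by any secondary analyst, regardless of software.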

4. Data preservation

4.1 Introduction

Data preservation, one of the most important stages in a survey, is the phase most often ignored. Too often, the output of large surveys is limited to a few reports based on selected tables, while the raw data used in creating these reports simply get lost. These data, however, should remain available for wider use by secondary analysts, and this demands a clearly defined strategy for preservation and effective dissemination. Effective data preservation requires the following measures:

transfer of files to the preservation machine;

indexing of files;

development of a storage structure in the main archival machine;

efficient backup procedures;

physical and technical security procedures; and

continuous monitoring of all the above procedures.

The final preservation system should reside on a machine other than those used for day-to-day operations. Neither the data nor the documentation should be stored on a desktop or on any other computer used for routine business. Once the processing is complete, this information should be transferred to an independent machine, preferably one that will not be used for future processing of data. Where insufficient resources make these measures impracticable, all files should be copied onto an off-line medium such as CD-ROMs. These should be properly labelled, dated, and stored in a secure place. As insurance against destruction of the original versions by fire, for example, multiple copies should be stored at multiple locations.

Access to the preservation machine or to the off-line medium (e.g. CDs, tapes) should be controlled, with only authorized personnel having read/write privileges. If someone wants to use the dataset, they should be given copies of the required files. Changes to any file should be made only according to established data management procedures. All previous versions should be kept, and the index should be updated.
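One practical safeguard when files are copied to the preservation machine or to off-line media (a suggestion consistent with, though not mandated by, the measures above) is to record a checksum for each file, so that later copies can be verified bit-for-bit against the originals.

```python
import hashlib
import os

def checksum_manifest(directory):
    """Map each file name in a directory to the SHA-256 hash of its contents."""
    manifest = {}
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                manifest[name] = hashlib.sha256(f.read()).hexdigest()
    return manifest
```

The manifest itself can be stored as a text file alongside the data; rerunning the function on a backup copy and comparing the two manifests confirms that nothing was corrupted or altered in transfer.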
4.2 Organization of files

Once data processing is complete and all files have been generated, the personnel responsible for long-term storage of files (usually the system administrator) will create a directory structure for permanent storage on a computer. The files can be grouped together in various ways, taking into account such information as file types in terms of content, how they were created, and so on. In terms of content, files may be grouped in the following ways:

Data. These are the actual data files, and may be prepared in various formats (e.g. SPSS, ASCII).

Documentation. These are files that describe the data. They may be created using a word-processing package or they may be plain text files.

Programs. These are program files developed during data processing (programs developed to design the data entry screen, for example). Again, these files may be in either package-specific or plain text formats.

Questionnaires. Questionnaires used for the survey are usually in package-specific formats.

Survey manual. Instruction manuals provide specific instructions to enumerators on how to conduct the field data collection.

Reports. Reports (including tables) produced from the data are usually in a package-specific format.

Codes (e.g. occupation, industry, injury). Standard or country-specific codes are used for the survey.

There may also be public use data that differ from the datasets used for internal purposes. In addition, there may be files (such as detailed computer programs developed for consistency checks) that may not be available for public use.

Model organizational structure

One model organizational structure for files is provided in what follows.

CLS. For ease of administration, all child labour survey-related information should be stored in one directory, which may be named CLS, the child labour survey directory. The CLS is the root directory for all related information.

INTERNAL and EXTERNAL. Two sub-directories, one named INTERNAL and the other EXTERNAL, fall immediately under the CLS directory. The INTERNAL sub-directory contains all files generated during data processing activities. All files in this directory are restricted to internal use. The EXTERNAL sub-directory contains files that are available for public use. The file structure in both directories is similar, except that the INTERNAL sub-directory may contain a greater number of files.

VER_1. Since different versions of the data inevitably evolve, one option is to create a sub-directory called VER_1.

DATA, DOCUMENT, and REPORT. Three sub-sub-directories named DATA, DOCUMENT, and REPORT may then be subsumed under VER_1. An index file describes the contents of each directory.
Data directory. The data directory contains only data files, which may be package-specific or text. Since text data files are always associated with codebooks, the codebooks also appear in the same directory. An index file explains the contents of each file.

Document directory. This directory contains all necessary documentation regarding the data. All programs developed for consistency checks, tabulations, etc. also reside in this directory, as do questionnaires. An index file explains the contents of each file.

Report directory. This directory contains all reports associated with the data, including the country report, country profile, etc. An index file explains the contents of each file.

VER_2. When a change in the data file creates a different version, the related documentation file must also change. At the least, it should specify what has been changed

in the data file, why, and when. This will normally require a complete replication of the directory structure. If storage capacity is a problem, however, only the changed files need to be stored anew in the VER_2 directory; all unchanged files should simply be moved (rather than copied) into VER_2. The index file in VER_1 should then be updated to record the current location of the missing files (the VER_2 directory, in this example) so that all files remain easily locatable. The rationale behind this is that people will first explore the latest version in searching for files.

A similar pattern should be followed for the EXTERNAL directory structure. However, the number of EXTERNAL files may be fewer than the corresponding number of files in the INTERNAL directory, since some files may not be available for external users. No VER_1 directory should be available to external users, who should always be offered the latest available version.

If files are translated into a different language, each language version will need a complete, separate directory structure similar to that illustrated in Figure 1, below. Names within squares represent directories, while names alone represent files. For ease of exposition, file names have not been prepared according to the 8.3 method (see Section 2.3, Naming files).

4.3 Transfer of files to a preservation machine

Once files are transferred for permanent storage, the administrators have to ensure that no files were corrupted during the transfer. To ensure error-free transfer of all files, the following checks are required:

file numbers and names must match the related index file information;
files must be certified as virus free;
source and destination files must be of equal size (in bits);
files should be randomly opened to check that they have been transferred properly; and
system administrators should perform any other checks deemed necessary.

If the system allows, the file creation dates should remain unchanged.
4.4 Backups

Once data processing is complete, backup procedures should be implemented. A complete child labour dataset is usually smaller than 640 MB, and is best stored on a single CD. The CD should be clearly marked to identify its contents and date of creation. Any change in the data or documentation requires either a separate storage CD or a whole new dataset, which can be marked with its creation date and transferred to a separate directory on the same CD.

Figure 1. Model directory structure. The CLS root directory contains the INTERNAL and EXTERNAL branches. Under INTERNAL, Ver_1 holds a Data directory (Index.txt, Data.sav, Data.por, Data.txt, Codebook.txt), a Document directory (Index.txt, metadata, programs, questionnaire, etc.) and a Report directory (Index.txt, country profile, country report, other reports); Ver_2 repeats the same structure as Ver_1. Under EXTERNAL, the Data directory (Index.txt, Data.por, ASCII data and associated files), Document directory (Index.txt, metadata, questionnaire) and Report directory (Index.txt, country profile, country report, other reports) follow the same pattern.

4.5 Transfer of files to the ILO

All child labour micro-data, together with the necessary documentation and reports, will be stored in the ILO's central child labour data repository. Always taking confidentiality issues into account, these data will be made available to secondary users. Files can be transferred to the ILO using FTP procedures. Details of how to transfer files using FTP will be provided before the actual transfer. Anyone with queries is welcome to contact the ILO by e-mail: simpoc@ilo.org

Files to be sent to the ILO:

1. the public use dataset, stripped of variables that might be used to identify a person or institution, in delimited ASCII format with the necessary data dictionary, and also in the package-specific format that was used for data cleaning and/or tabulations (e.g. SPSS, SAS, etc.);

2. the final documentation, preferably in the Data Documentation Initiative codebook DTD format and in ASCII text, in the original language;

3. the questionnaire with response categories in the original language (preferably annotated with variable names, including derived variables, and created using any MS Office suite package);

4. any logical rules developed in the course of data processing and not included elsewhere, in the original language;

5. programs that were developed during the course of data processing and tabulation, in the original language (preferably in ASCII text);

6. the interviewer's and/or supervisor's instruction manual, preferably in MS Word and in the original language;

7. documentation, preferably in MS Word and in the original language, describing the structure of the dataset and providing information on variables and values, coding and classification schemes, details about derived variables, weighting, grossing, and details of any process undertaken to make the dataset anonymous, together with the mean, standard deviation, maximum, and minimum values of each variable;

8. any coding scheme or reference to coding information (e.g. occupation, industry, and injury), in MS Word;

9.
any reports based on the dataset, preferably in MS Word and in the original language; and

10. all items mentioned in Items 7-15 in any other language, if translated.
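Once the ILO has supplied the FTP details, the transfer itself can be scripted with Python's standard ftplib. This is a hedged sketch: the host name, credentials, and remote directory below are placeholders, not the ILO's actual server details, which are provided before the transfer:

```python
from ftplib import FTP
import os

def remote_name(path):
    """Name under which a local file will be stored on the server."""
    return os.path.basename(path)

def send_files(host, user, password, remote_dir, local_paths):
    """Upload the listed files to an FTP server in binary mode."""
    ftp = FTP(host)
    ftp.login(user, password)
    ftp.cwd(remote_dir)
    try:
        for path in local_paths:
            with open(path, "rb") as f:
                # STOR in binary mode, so that package-specific
                # (non-text) files are not corrupted in transit.
                ftp.storbinary("STOR " + remote_name(path), f)
    finally:
        ftp.quit()
```

A call might look like send_files("ftp.example.org", "user", "secret", "/incoming", ["Data.por", "Codebook.txt"]), with every argument replaced by the values the ILO supplies.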

Further resources

Active Server Corner. What's in a name? Part I: Variables and methods. http://www.kamath.com/columns/squareone/so001_whatname1.asp

Audience Dialogue. Survey analysis. http://www.audiencedialogue.org/kya5.html

Carolina Population Center. Stata Programming: Data Management. University of North Carolina. http://www.cpc.unc.edu/services/computer/presentations/statatutorial#combining

Center for Statistical Consultation and Research (CSCAR). Guide to data entry. University of Michigan. http://www.umich.edu/~cscar/software/dataentry.html

Centers for Disease Control and Prevention. Epi Info, Version 6. http://www.cdc.gov/epiinfo/epi6man/epi6titl.htm

Data Documentation Initiative. A project of the social science community. http://www.icpsr.umich.edu/ddi/

Data Documentation Initiative. Codebook DTD Version 1.0 (Final), 17 March 2000. http://www.icpsr.umich.edu/ddi/index.html

Data, Government and Geographic Information Services. SSDC Data File Structure in a Nutshell. University of California, San Diego. http://ssdc.ucsd.edu/ssdc/browse/dataformat.html#structure

Deakin University, School of Information Technology. Introduction to Data Collection and Analysis: Processing survey data. http://www.deakin.edu.au/~agoodman/sci101/chap9.php

History Data Service. Creating Data. http://hds.essex.ac.uk/create.asp

ILO. Classifications of Occupational Injuries. http://www.ilo.org/public/english/bureau/stat/class/acc/index.htm

ILO. International Classification of Status in Employment (ICSE). http://www.ilo.org//public/english/bureau/stat/class/icse.htm

ILO. International Programme on the Elimination of Child Labour. http://www.ilo.org/childlabour

ILO. International Standard Classification of Occupations. http://www.ilo.org/public/english/bureau/stat/class/isco.htm

ILO. International Standard Industrial Classification of all Economic Activities (ISIC). http://www.ilo.org//public/english/bureau/stat/class/isic.htm

ILO. Survey of activities of young people 1999. http://www.ilo.org/public/english/standards/ipec/simpoc/southafrica/document/quest_2.pdf

Inter-University Consortium for Political and Social Research (ICPSR). Guide to Social Science Data Preparation and Archiving. http://www.icpsr.umich.edu/access/dpm.html

North Carolina State University, Department of Statistics. How to Collect Survey Data. http://www.stat.ncsu.edu/info/srms/survcoll.html

North Carolina State University, Department of Statistics. How to Plan a Survey. http://www.stat.ncsu.edu/info/srms/survplanl.html

Office of Information Technology Services. An introduction to SPSS. Murdoch University. http://www.its.murdoch.edu.au/services/software/sitelic/spss/spss-intro.html

QQQ Software Inc. http://www.qqqsoft.com/

Rasinski, K., Timberlake, J., Lee, L., Porras, J. and Mulrow, J. Producing a Public Use File: A Case Study. American Statistical Association. http://www.amstat.org/sections/srms/proceedings/papers/1997_074.pdf

SRS Data Library. Introduction to data handling. University of Chicago. http://www.spc.uchicago.edu/datalib/dlguides/gdathand.html

The Blaise System Homepage. http://neon.vb.cbs.nl/blaise

U.S. Census Bureau. CSPro. http://www.census.gov/ipc/www/cspro/index.html

U.S. Census Bureau. The Integrated Microcomputer Processing System. http://www.census.gov/ipc/www/imps/index.html

U.S. Department of Health and Human Services. Documenting Survey Data Files. http://aspe.hhs.gov/hsp/leavers99/datafiles/ch_4.pdf

U.S. Department of Health and Human Services. Producing welfare outcomes data files. http://aspe.hhs.gov/hsp/leavers99/datafiles/ch_1.pdf

UCLA Academic Technology Services. SPSS Learning Module: Match merging data files. University of California, Los Angeles. http://www.ats.ucla.edu/stat/spss/modules/merge.htm

UCLA Academic Technology Services. SPSS Class Notes: Splitting and merging files. University of California, Los Angeles. http://www.ats.ucla.edu/stat/spss/notes/merge.htm

UK Data Archive. http://www.data-archive.ac.uk

UNICEF. MICS data processing. http://childinfo.org/mics2/dproc/ver2/m2dprocb.htm

Glossary 24

Aggregate data. Data calculated from micro-data.

ASCII. Abbreviation of American Standard Code for Information Interchange. One way in which many computers encode characters, digits, and special characters. In simple terms, all characters are in text format and no conversion is needed to read/write ASCII characters (as is needed with package-specific formats).

Case. Complete data about a person, household, or other entity. This is sometimes also referred to as an observation or unit of observation. A single record or multiple records, depending on the survey and data structure, constitute a case.

CAPI. Computer-assisted personal interviewing, where data are collected in face-to-face interviews and small computers (palmtop, laptop, or handheld electronic devices that can be connected to computers) are used for data collection.

CATI. Computer-assisted telephone interviewing, where data are collected by telephone interviews and small computers (palmtop, laptop, or handheld electronic devices that can be connected to computers) are used for data collection.

Code. In most numeric data files, answers to questions are recorded with numbers rather than text, and often even numeric answers are recorded with numbers instead of the actual response. The numbers used in the data file are called codes. Thus, when a respondent identifies herself as a working child, a code of 1 might be used for fetching water, 2 for begging, etc. Similarly, an age of 18 might be coded as a 2, indicating 18 or older. The codes used and their correspondence to the actual responses are listed in a codebook or, when precoded, in the questionnaire.

Codebook. Generically, any information on the structure, contents, and layout of a data file.
Typically, a codebook includes column locations and widths for each variable; definitions of different record types; response codes for each variable; codes used to indicate non-response and missing data; the exact questions and skip patterns used in a survey; and other details of the content of each variable. Many codebooks also include frequencies of response. Codebooks, which may be machine-readable, paper copy, or microfiche, vary widely in quality and amount of information included.

Direct cable connection. A Windows-based procedure through which files are transferred between computers that are not networked. The program is installed during Windows set-up, and is usually found under the Accessories | Communications program group. The computers first need to be connected through their serial/parallel ports. Both computers also need to be appropriately configured before the file transfer.

DDI. The Data Documentation Initiative (DDI) is an effort to establish an international criterion and methodology for the content, presentation, transport, and preservation of metadata about datasets in the social and behavioural sciences.

Double entry. A procedure in which, during data entry, the same data are entered by two different people and the two versions are then compared to reveal errors.

24 Many definitions in the glossary are based on the Glossary of Selected Social Science Computing Terms and Social Science Data Terms, http://odwin.ucsd.edu/glossary/glossary.html
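The double-entry comparison described in the glossary can be sketched in a few lines. The record and field structure here is hypothetical; in practice the two entry files would first be read into matching lists of records:

```python
def compare_double_entry(first_pass, second_pass):
    """Compare two independently keyed versions of the same records.

    Each argument is a list of records, and each record is a list of
    field values.  Returns 1-based (record_number, field_number) pairs
    where the two data entry operators disagree.
    """
    mismatches = []
    for rec_no, (a, b) in enumerate(zip(first_pass, second_pass), start=1):
        for field_no, (x, y) in enumerate(zip(a, b), start=1):
            if x != y:
                mismatches.append((rec_no, field_no))
    return mismatches
```

Every reported pair is then resolved by checking the original questionnaire, which is what makes double entry an effective verification method.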

DTD. A document type definition (DTD) is a set of rules, determined by an application, that applies SGML (Standard Generalized Mark-up Language) to the mark-up of documents of a particular type.

Flat file. This refers to the structure of a file. A flat file is one in which each respondent or unit of analysis has the same number of variables. Often referred to as a rectangular file. Contrasts with a hierarchical file.

FTP. The file transfer protocol (FTP) is a reliable method of transferring files electronically between networked computers.

Hierarchical file. An ASCII data file organized into more than one type of record. The type and (usually) the number of variables associated with each respondent or unit of analysis differ for each type of record. For example, a household may be record type 1, which consists of 10 variables describing the building, whereas a person may be record type 2, which consists of 20 separate variables describing each member living in that house.

HTML. HyperText Markup Language. HTML is the lingua franca for publishing hypertext on the World Wide Web.

Imputation. The process by which one estimates missing values for items that a survey respondent failed to provide.

Intelligent data entry. The use of computer software that helps capture errors during data entry, as well as performing preventive measures to avoid errors in the first place.

LapLink. Software that allows the transfer of files between computers that are not connected through a network. Two computers are first connected by a LapLink cable by way of their parallel or serial ports. When the computers are booted in DOS mode and LL3 (LapLink 3.0) software is executed, files can be transferred between the two computers in a user-friendly way.

Metadata. Data about data; metadata represent the information that enables the effective, efficient, and accurate use of the datasets to which they refer.

Micro-data.
Information regarding individuals collected through some form of data collection procedure (face-to-face interviews, in most child labour surveys). Files containing micro-data are referred to as micro-data files.

Missing values. Values (codes) missing from a dataset. Sometimes data for a particular variable are left blank for a particular record. This may happen, for example, where the question is not applicable for that case.

PAPI. Paper-and-pencil interviewing, where data are collected in face-to-face interviews and answers are recorded on paper (on the questionnaires). All data are then entered on a computer for further processing.

Precoded questionnaire. An interview questionnaire where codes for each answer are already included in the questionnaire.

Questionnaire. Sometimes referred to as the survey instrument, this is the set of questions asked during interviews.

Raw data. Same as micro-data.

RDF. Resource description framework. A standard way of describing an entity, for example the conditions under which certain data may not be available to certain users.

Record. Complete data about a person, household, or other entity. A number of variables constitute a record. When one record completes a case, the number of records is the same as the number of cases (observations or units of observation) in a data file.

Record type. Sometimes, in the same ASCII data file, the same column refers to a different variable depending on the type of record. The codebook associated with an ASCII data file explains how a statistical package will interpret each column of the data file depending on the record type.

Rectangular file. Same as flat file.

SGML. Standard Generalized Markup Language.

Telnet. A process by which one can remotely access another networked computer and use the resources available to the remote computer.

Unit of analysis. The basic observable entity analysed by a survey, and for which data are collected in the form of variables. Although a unit of analysis is sometimes referred to as the case or observation, these terms are not always synonymous. In child labour surveys, the unit of analysis is a person while a case is a household, because the household may contain different variables for the different units of analysis: i.e. a physical shelter, a family within the structure, and a person within the family.

Variable. In social science research, for each unit of analysis, each item of data (e.g. age of person, income of family) is called a variable.

Weight. In survey research, this refers to a number associated with a case or unit of analysis. The weight is used as a measure of the relative contribution of the variables of that case when making estimates for the entire population. When a probability sample is used, there is often a chance that some elements of the population are under- or over-represented in the sample.
In order to allow more accurate estimates of a complete population, therefore, weights are assigned to each case and used to adjust the overall results so that they conform more closely to the total population.

Wild code. In survey research, wild codes are codes that are not authorized for a particular question. For instance, if a question that records the sex of the respondent has documented codes of 1 for female, 2 for male, and 9 for missing data, a code of 3 would be a wild code, sometimes called an undocumented code.

XML. The Extensible Markup Language is the universal format for structured documents and data on the World Wide Web.
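The "record type" and "hierarchical file" entries above can be illustrated with a short parsing sketch. The layout is hypothetical: column 1 holds the record type, with type 1 marking a household record and type 2 a person record belonging to the most recent household:

```python
def read_hierarchical(lines):
    """Group person records (type 2) under the preceding household
    record (type 1), as in the hierarchical file described above.

    Assumes a hypothetical layout in which the first column holds the
    record type and the rest of the line holds that record's data.
    """
    households = []
    for line in lines:
        record_type, data = line[0], line[1:].rstrip("\n")
        if record_type == "1":          # household record
            households.append({"household": data, "persons": []})
        elif record_type == "2":        # person record
            households[-1]["persons"].append(data)
    return households
```

In a real child labour survey file the two record types would carry different variable layouts, each documented in the codebook.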

Annex I

Comparison of statistical packages 25

Max. width of input raw data file: Stata 6: 1,024; Stata 7: 1 megabyte (in Windows); SAS 6.12: 8,192; SAS 8.x: 8,192; SPSS 10: 32,767 (using Data List; longer if using File Handle).

Max. size of data file: Stata 6: limited by free memory of machine; Stata 7: limited by free memory of machine; SAS 6.12: Windows NTFS: 17 GB, other Windows: 2 GB; SAS 8.x: Windows NTFS: 4 trillion GB, other Windows: 2 GB; SPSS 10: no limit (only limited by disk space).

Max. number of observations: Stata 6: 2,147,483,647; Stata 7: 2,147,483,647; SAS 6.12: no limit (only limited by disk space); SAS 8.x: no limit (only limited by disk space); SPSS 10: no limit (only limited by disk space).

Max. number of variables: Stata 6: 2,047; Stata 7: 2,047; SAS 6.12: 32,767; SAS 8.x: 32,767; SPSS 10: no limit (only limited by disk space).

Max. length of variable name: Stata 6: 8; Stata 7: 32; SAS 6.12: 8; SAS 8.x: 32; SPSS 10: 8.

Max. length of a variable label: Stata 6: 80; Stata 7: 80; SAS 6.12: 40; SAS 8.x: 256; SPSS 10: 255.

Max. length of a value label: Stata 6: 80; Stata 7: 80; SAS 6.12: 40; SAS 8.x: 256; SPSS 10: 60.

Max. length of a data set label: Stata 6: 80; Stata 7: 80; SAS 6.12: ?; SAS 8.x: 32; SPSS 10: 60.

Max. length of a string variable: Stata 6: 80; Stata 7: 80; SAS 6.12: 200; SAS 8.x: 32,767; SPSS 10: 8 for short string, 255 for long string.

Max. number of missing value codes: Stata 6: 1; Stata 7: 1; SAS 6.12: 27; SAS 8.x: 27; SPSS 10: no limit.

Max. number of notes that can be attached to a data file: Stata 6: 9,999; Stata 7: 9,999; SAS 6.12: N/A; SAS 8.x: N/A; SPSS 10: no limit.

Number of data sets that can be opened at once: Stata 6: 1; Stata 7: 1; SAS 6.12: no limit; SAS 8.x: no limit; SPSS 10: 1.

Dates calculated as: Stata 6: number of days from 1/1/1960; Stata 7: number of days from 1/1/1960; SAS 6.12: number of days from 1/1/1960; SAS 8.x: number of days from 1/1/1960; SPSS 10: number of seconds from 14 Oct. 1582.

Max. number of key variables in a merge: Stata 6: 10; Stata 7: 10; SAS 6.12: no limit; SAS 8.x: no limit; SPSS 10: no limit.

Max. number of levels in encode/autorecode: Stata 6: 80; Stata 7: 80; SAS 6.12: N/A; SAS 8.x: N/A; SPSS 10: no limit.

Max. number of conditions in an if statement: Stata 6: 30; Stata 7: 100; SAS 6.12: no limit; SAS 8.x: no limit; SPSS 10: no limit.

Max. number of rows in a one-way table: Stata 6: 3,000; Stata 7: 3,000; SAS 6.12: 32,760 cells; SAS 8.x: 32,760 cells; SPSS 10: no limit.

Max. number of rows in a two-way table: Stata 6: 300; Stata 7: 300; SAS 6.12: 32,760 cells; SAS 8.x: 32,760 cells; SPSS 10: no limit.

Max. number of columns in a two-way table: Stata 6: 20; Stata 7: 20; SAS 6.12: 32,760 cells; SAS 8.x: 32,760 cells; SPSS 10: no limit.

25 Based on UCLA Academic Technology Services: SPSS FAQ, What are the limits of SPSS version 10 and other statistical packages, http://www.ats.ucla.edu/stat/spss/faq/spsslimits.htm

Annex II

English country names and code elements 26

This list states the country names (official short names in English) in alphabetical order as given in ISO 3166-1, together with the corresponding ISO 3166-1 alpha-2 code elements. The list is updated whenever a change to the official code list in ISO 3166-1 is effected by the ISO 3166 Maintenance Agency. It is complete and up to date as of 26 February 2001, and lists 239 official short names and code elements.

AFGHANISTAN  AF
ALBANIA  AL
ALGERIA  DZ
AMERICAN SAMOA  AS
ANDORRA  AD
ANGOLA  AO
ANGUILLA  AI
ANTARCTICA  AQ
ANTIGUA AND BARBUDA  AG
ARGENTINA  AR
ARMENIA  AM
ARUBA  AW
AUSTRALIA  AU
AUSTRIA  AT
AZERBAIJAN  AZ
BAHAMAS  BS
BAHRAIN  BH
BANGLADESH  BD
BARBADOS  BB
BELARUS  BY
BELGIUM  BE
BELIZE  BZ
BENIN  BJ
BERMUDA  BM
BHUTAN  BT

26 Based on International Organization for Standardization: http://www.iso.ch/iso/en/prodsservices/iso3166ma/02iso-3166-code-lists/list-en1.html

BOLIVIA  BO
BOSNIA AND HERZEGOVINA  BA
BOTSWANA  BW
BOUVET ISLAND  BV
BRAZIL  BR
BRITISH INDIAN OCEAN TERRITORY  IO
BRUNEI DARUSSALAM  BN
BULGARIA  BG
BURKINA FASO  BF
BURUNDI  BI
CAMBODIA  KH
CAMEROON  CM
CANADA  CA
CAPE VERDE  CV
CAYMAN ISLANDS  KY
CENTRAL AFRICAN REPUBLIC  CF
CHAD  TD
CHILE  CL
CHINA  CN
CHRISTMAS ISLAND  CX
COCOS (KEELING) ISLANDS  CC
COLOMBIA  CO
COMOROS  KM
CONGO  CG
CONGO, THE DEMOCRATIC REPUBLIC OF THE  CD
COOK ISLANDS  CK
COSTA RICA  CR
CÔTE D'IVOIRE  CI
CROATIA  HR
CUBA  CU
CYPRUS  CY
CZECH REPUBLIC  CZ
DENMARK  DK
DJIBOUTI  DJ
DOMINICA  DM
DOMINICAN REPUBLIC  DO
EAST TIMOR  TP
ECUADOR  EC
EGYPT  EG

EL SALVADOR  SV
EQUATORIAL GUINEA  GQ
ERITREA  ER
ESTONIA  EE
ETHIOPIA  ET
FALKLAND ISLANDS (MALVINAS)  FK
FAROE ISLANDS  FO
FIJI  FJ
FINLAND  FI
FRANCE  FR
FRENCH GUIANA  GF
FRENCH POLYNESIA  PF
FRENCH SOUTHERN TERRITORIES  TF
GABON  GA
GAMBIA  GM
GEORGIA  GE
GERMANY  DE
GHANA  GH
GIBRALTAR  GI
GREECE  GR
GREENLAND  GL
GRENADA  GD
GUADELOUPE  GP
GUAM  GU
GUATEMALA  GT
GUINEA  GN
GUINEA-BISSAU  GW
GUYANA  GY
HAITI  HT
HEARD ISLAND AND MCDONALD ISLANDS  HM
HOLY SEE (VATICAN CITY STATE)  VA
HONDURAS  HN
HONG KONG  HK
HUNGARY  HU
ICELAND  IS
INDIA  IN
INDONESIA  ID
IRAN, ISLAMIC REPUBLIC OF  IR
IRAQ  IQ

IRELAND  IE
ISRAEL  IL
ITALY  IT
JAMAICA  JM
JAPAN  JP
JORDAN  JO
KAZAKSTAN  KZ
KENYA  KE
KIRIBATI  KI
KOREA, DEMOCRATIC PEOPLE'S REPUBLIC OF  KP
KOREA, REPUBLIC OF  KR
KUWAIT  KW
KYRGYZSTAN  KG
LAO PEOPLE'S DEMOCRATIC REPUBLIC  LA
LATVIA  LV
LEBANON  LB
LESOTHO  LS
LIBERIA  LR
LIBYAN ARAB JAMAHIRIYA  LY
LIECHTENSTEIN  LI
LITHUANIA  LT
LUXEMBOURG  LU
MACAU  MO
MACEDONIA, THE FORMER YUGOSLAV REPUBLIC OF  MK
MADAGASCAR  MG
MALAWI  MW
MALAYSIA  MY
MALDIVES  MV
MALI  ML
MALTA  MT
MARSHALL ISLANDS  MH
MARTINIQUE  MQ
MAURITANIA  MR
MAURITIUS  MU
MAYOTTE  YT
MEXICO  MX
MICRONESIA, FEDERATED STATES OF  FM
MOLDOVA, REPUBLIC OF  MD
MONACO  MC

MONGOLIA  MN
MONTSERRAT  MS
MOROCCO  MA
MOZAMBIQUE  MZ
MYANMAR  MM
NAMIBIA  NA
NAURU  NR
NEPAL  NP
NETHERLANDS  NL
NETHERLANDS ANTILLES  AN
NEW CALEDONIA  NC
NEW ZEALAND  NZ
NICARAGUA  NI
NIGER  NE
NIGERIA  NG
NIUE  NU
NORFOLK ISLAND  NF
NORTHERN MARIANA ISLANDS  MP
NORWAY  NO
OMAN  OM
PAKISTAN  PK
PALAU  PW
PALESTINIAN TERRITORY, OCCUPIED  PS
PANAMA  PA
PAPUA NEW GUINEA  PG
PARAGUAY  PY
PERU  PE
PHILIPPINES  PH
PITCAIRN  PN
POLAND  PL
PORTUGAL  PT
PUERTO RICO  PR
QATAR  QA
RÉUNION  RE
ROMANIA  RO
RUSSIAN FEDERATION  RU
RWANDA  RW
SAINT HELENA  SH
SAINT KITTS AND NEVIS  KN

SAINT LUCIA  LC
SAINT PIERRE AND MIQUELON  PM
SAINT VINCENT AND THE GRENADINES  VC
SAMOA  WS
SAN MARINO  SM
SAO TOME AND PRINCIPE  ST
SAUDI ARABIA  SA
SENEGAL  SN
SEYCHELLES  SC
SIERRA LEONE  SL
SINGAPORE  SG
SLOVAKIA  SK
SLOVENIA  SI
SOLOMON ISLANDS  SB
SOMALIA  SO
SOUTH AFRICA  ZA
SOUTH GEORGIA AND THE SOUTH SANDWICH ISLANDS  GS
SPAIN  ES
SRI LANKA  LK
SUDAN  SD
SURINAME  SR
SVALBARD AND JAN MAYEN  SJ
SWAZILAND  SZ
SWEDEN  SE
SWITZERLAND  CH
SYRIAN ARAB REPUBLIC  SY
TAIWAN, PROVINCE OF CHINA  TW
TAJIKISTAN  TJ
TANZANIA, UNITED REPUBLIC OF  TZ
THAILAND  TH
TOGO  TG
TOKELAU  TK
TONGA  TO
TRINIDAD AND TOBAGO  TT
TUNISIA  TN
TURKEY  TR
TURKMENISTAN  TM
TURKS AND CAICOS ISLANDS  TC
TUVALU  TV

UGANDA  UG
UKRAINE  UA
UNITED ARAB EMIRATES  AE
UNITED KINGDOM  GB
UNITED STATES  US
UNITED STATES MINOR OUTLYING ISLANDS  UM
URUGUAY  UY
UZBEKISTAN  UZ
VANUATU  VU
Vatican City State  See HOLY SEE
VENEZUELA  VE
VIET NAM  VN
VIRGIN ISLANDS, BRITISH  VG
VIRGIN ISLANDS, U.S.  VI
WALLIS AND FUTUNA  WF
WESTERN SAHARA  EH
YEMEN  YE
YUGOSLAVIA  YU
Zaire  See CONGO, THE DEMOCRATIC REPUBLIC OF THE
ZAMBIA  ZM
ZIMBABWE  ZW

Annex III

Zambia end of decade and child labour questionnaire (education module) 27

EDUCATION MODULE. CHECK AGE: IF 15 AND ABOVE, ASK.

Q1. Can read a letter or newspaper?
EASILY 1 / WITH DIFFICULTY 2 / NOT AT ALL 3 / N/A OR BLIND 8 / DK 9

Q2. Has ever attended primary/secondary school?
YES 1 (>> Q4) / NO 2

Q3. Why has never attended school?
WORKING 1 / EXPENSIVE 2 / TOO FAR 3 / ENROLLMENT REFUSED 4 / OTHER (SPECIFY) 5
>> NEXT PERSON

Q4. What is/was the highest grade attained?
ENTER GRADE

EDUCATION MODULE CONTINUED. ASK WITH RESPECT TO THOSE AGED 5-30 YEARS ONLY; ALL OTHERS >> NEXT MODULE.

Q5. Is attending primary/secondary school this year, regardless of whether on holiday at the moment?
YES 1 / NO 2 (>> Q8)

Q6. What type of school is attending?
GOVERNMENT 1 / PRIVATE 2 / MISSION 3 / COMMUNITY 4 / OTHER 5

Q7. What grade is attending?
ENTER GRADE (>> Q9)

Q8. What is the main reason is not attending school?
WORKING 1 / EXPENSIVE 2 / TOO FAR 3 / NOT SELECTED/FAILED 4 / PREGNANCY 5 / COMPLETED SCHOOL 6 / GOT MARRIED 7 / OTHER (SPECIFY) 8

Q9. Was attending primary/secondary school last year?
YES 1 / NO 2 (>> NEXT MODULE)

Q10. What type of school was attending last year?
GOVERNMENT 1 / PRIVATE 2 / MISSION 3 / COMMUNITY 4 / OTHER 5

Q11. What grade was attending last year?
ENTER GRADE

27 Based on ILO/IPEC/SIMPOC/Zambia, http://www.ilo.org/public/english/standards/ipec/simpoc/zambia/document/zafh01gq.pdf

Annex IV

A sample codebook for ASCII data created in SAS 28

Index

File Name: HHOLD

Question - Variable Name
Unique Number - UQNR
Q4.1 Type of dwelling household occupies - Q41DWELL
Q4.2 Number of rooms - Q42NOROO
Q4.3 Source of energy (cooking) - Q43COOKI
Q4.3 Source of energy (heating) - Q43HEATI
Q4.3 Source of energy (lighting) - Q43LIGHT
Q4.4a Who collects wood/dung - Q44ACOLL
Q4.4b Gender of persons who collect wood/dung - Q44BCOLL
Q4.5 Household's main source of water - Q45WATER
Q4.6a Who collects the water - Q46ACOLL
Q4.6b Gender of persons who collect water - Q46BCOLL
Q4.7a Cultivate land or keep any stock - Q47AHHCU
Q4.7b Household member who owns the land - Q47BRELA
Q4.7b Land allocated by tribe - Q47BRELB
Q4.7b Household allowed to use land by owner - Q47BRELC
Q4.7b Household pays cash to rent land - Q47BRELD
Q4.7b Household provides worker - Q47BRELE
Q4.7b Pay rent through portion of produce - Q47BRELF
Q4.7b Right to use land because working - Q47BRELG
Q4.7b Household has access to land for free - Q47BRELH
Q4.8a Total annual gross household income - Q48AGROS
Q4.8b Regular wages/salaries - Q48BREGU
Q4.8b Casual wages - Q48BCASU
Q4.8b Income from self-employment/business - Q48BSELF
Q4.8b Remittances from outside the household - Q48BREMI
Q4.8b Income from agriculture - Q48BAGRI
Q4.8b Old age pension - Q48BPENS
Q4.8b Child support grant, foster care grant - Q48BCHIL
Q4.8c Free subsidised accommodation - Q48CSUBA
Q4.8c Free subsidised food or meals - Q48CSUBB
Province - PROV
Area type - STRATUM
Qualify for second phase - QUALIFY
Selected for second phase - SELECTED
Person number of main respondent - Q49MAINR
Language of interview - Q410LANG
Household weight for phase one - HHWGT

28 See also http://www.ilo.org/public/english/standards/simpoc/southafrica/index.htm

File Name: HHOLD
Section: Section 4

Unique Number
Var. Name: UQNR    Position: 1    Type/Length: Numerical 10
Valid Code: 1011010101-9313021471

Q4.1 Which type of dwelling does this household occupy? (If this household lives in more than one dwelling, circle the main type of dwelling)
Var. Name: Q41DWELL    Position: 11    Type/Length: Numerical 3
Valid Code:
1 House or brick structure on a separate stand or yard*
2 Traditional dwelling/hut/structure made of traditional materials*
3 Flat in a block of flats*
4 Town/cluster/semi-detached house (simplex, duplex or triplex)*
5 House/flat/room in backyard
6 Informal dwelling/shack in backyard
7 Informal dwelling/shack not in backyard, e.g. in an informal/squatter settlement/traditional area*
8 Room(s)/garage not in backyard but on a shared property*
9 Caravan/tent*
10 Other, specify
-99 Unspecified
* Include in categories 1 to 4 and 7 to 9 similar structures on commercial farms

Q4.2 How many rooms, including kitchens, are there for this household? (Excluding toilets and bathrooms)
Var. Name: Q42NOROO    Position: 14    Type/Length: Numerical 3
Valid Code: 1-20; -99 Unspecified

Q4.3 What is the main source of energy/fuel for this household? (Cooking)
Var. Name: Q43COOKI    Position: 17    Type/Length: Numerical 3
Valid Code:
1 Electricity
2 Gas
3 Paraffin
4 Wood
5 Coal
6 Candles
7 Animal dung
8 Solar energy
9 Other, specify
-99 Unspecified

Q4.3 What is the main source of energy/fuel for this household? (Heating)
Var. Name: Q43HEATI    Position: 20    Type/Length: Numerical 3
Valid Code:
1 Electricity
2 Gas
3 Paraffin
4 Wood
5 Coal

6 Candles 7 Animal dung 8 Solar energy 9 Other, specify -99 Unspecified Q4.3 What is the main source of energy/fuel for this household? (Lighting) Var.Name: Q43LIGHT Position: 23 Type/Length: Numerical 3 Valid Code: 1 Electricity 2 G as 3 Paraffin 4 Wood 5 Coal 6 Candles 7 Animal dung 8 Solar energy 9 Other, specify -99 Unspecified Q4.4a (Ask Q4.4 if any answer to Q4.3 is 4 or 7) Who usually collects this wood/dung? Var.Name: Q44ACOLL Position: 26 Type/Length: Numerical 4 Valid Code: 1 Someone outside household (Wood/dung bought or person hired) (Go to Q4.5) 2 Someone outside household (Free to household) (Go to Q4.5) 3 Only an adult/adults in the household 4 Only a child/children (under 18) in the household 5 An adult/adults and a child/children (under 18) in the household 6 Other, specify -99 Unspecified -999 Not Applicable Q4.4b Are the persons usually collecting wood/dung? Var.Name: Q44BCOLL Position: 30 Type/Length: Numerical 4 Valid Code: 1 Mostly males 2 Mostly females 3 Equally males and females -99 Unspecified -999 Not Applicable Q4.5 What is the household~s main source of water? Var.Name: Q45WATER Position: 34 Type/Length: Numerical 3 Valid Code: 1 Piped (tap) water in dwelling (Go to Q4.7) 2 Piped (tap) water on site or in yard (Go to Q4.7) 3 Public tap 4 Water-carrier/tanker 5 Borehole on site 6 Borehole off site/communal 7 Rain-water tank on site 8 Flowing water/stream 9 Dam/pool/stagnant water 10 Well 69

Q4.6a 11 Spring 12 Other, specify -99 Unspecified Who usually collects the water? Var.Name: Q46ACOLL Position: 37 Type/Length: Numerical 4 Valid Code: 1 Someone outside household (water bought or person hired)(go to Q4.7) 2 Someone outside household (provided for free to household)(go to Q4.7) 3 Only an adult/adults in the household 4 Only a child/children (under 18) in the household 5 An adult/adults an a child/children (under 18) in the household 6 Other, specify -99 Unspecified -999 Not Applicable Q4.6b Are the persons usually collecting water.? Var.Name: Q46BCOLL Position: 41 Type/Length: Numerical 4 Valid Code: 1 Mostly males 2 Mostly females 3 Equally males and females -99 Unspecified -999 Not Applicable Q4.7a Does your household cultivate any land or keep any stock, even chickens, for sale or for own use? Var.Name: Q47AHHCU Position: 45 Type/Length: Numerical 4 Valid Code: 1 Yes 2 No (Go to Q4.8) -99 Unspecified -999 Not Applicable Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land) a) A household member is owner of the land or a member of the legal entity that owns the land Var.Name: Q47BRELA Position: 49 Type/Length: Numerical 4 Valid Code: 1 Yes 2 No -99 Unspecified -999 Not Applicable Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land) b) Land has been allocated by tribal or traditional authority to a household member Var.Name: Q47BRELB Position: 53 Type/Length: Numerical 4 Valid Code: 1 Yes 2 No -99 Unspecified -999 Not Applicable 70

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
c) Person in charge of the land allows a household member to use the land
Var. Name: Q47BRELC   Position: 57   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
d) A household member pays cash to rent the land
Var. Name: Q47BRELD   Position: 61   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
e) Household has to provide a worker to work for the person in charge of the land
Var. Name: Q47BRELE   Position: 65   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
f) Pay rent through portion of produce (share cropping)
Var. Name: Q47BRELF   Position: 69   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
g) Right to use land because working for land owner
Var. Name: Q47BRELG   Position: 73   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.7b What is the relationship between your household and the land which you cultivate or keep stock on? (May be more than one plot or piece of land)
h) Household has access to the land for free
Var. Name: Q47BRELH   Position: 77   Type/Length: Numerical 4
Valid Codes: 1 Yes; 2 No; -99 Unspecified; -999 Not Applicable

Q4.8a For the past 12 months, could you please tell me in which of the following ranges your total annual gross household income falls? Include remittances and all sources of income. (Show prompt card)
Var. Name: Q48AGROS   Position: 81   Type/Length: Numerical 3
Valid Codes:
   1  No income
   2  R 1 - R 1 200
   3  R 1 201 - R 2 400
   4  R 2 401 - R 4 200
   5  R 4 201 - R 6 000
   6  R 6 001 - R 9 000
   7  R 9 001 - R 12 000
   8  R 12 001 - R 18 000
   9  R 18 001 - R 30 000
  10  R 30 001 - R 42 000
  11  R 42 001 - R 54 000
  12  R 54 001 or more
  13  Don't know
  14  Refuse
 -99  Unspecified

Q4.8b Does the household income include the following?
a) Regular wages/salaries
Var. Name: Q48BREGU   Position: 84   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
b) Casual wages
Var. Name: Q48BCASU   Position: 87   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
c) Income from self-employment/business
Var. Name: Q48BSELF   Position: 90   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
d) Remittances from outside the household
Var. Name: Q48BREMI   Position: 93   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
e) Income from agriculture
Var. Name: Q48BAGRI   Position: 96   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
f) Old age pension
Var. Name: Q48BPENS   Position: 99   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8b Does the household income include the following?
g) Child support grant, state maintenance grant or foster care grant, i.e. state grants directly related to children
Var. Name: Q48BCHIL   Position: 102   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8c Does anyone in the household receive any of the following free or subsidized because he/she is working?
a) Accommodation
Var. Name: Q48CSUBA   Position: 105   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Q4.8c Does anyone in the household receive any of the following free or subsidized because he/she is working?
b) Food or meals
Var. Name: Q48CSUBB   Position: 108   Type/Length: Numerical 3
Valid Codes: 1 Yes; 2 No; -99 Unspecified

Province (Derived variable: first digit of a PSU number)
Var. Name: PROV   Position: 111   Type/Length: Numerical 1
Valid Codes:
   1  Western Cape
   2  Eastern Cape
   3  Northern Cape
   4  Free State
   5  KwaZulu-Natal
   6  North West
   7  Gauteng
   8  Mpumalanga
   9  Northern Province

Area Type (Derived variable: from enumeration area types)
Var. Name: STRATUM   Position: 112   Type/Length: Numerical 1
Valid Codes: 1 Formal Urban; 2 Informal Urban; 3 Other Rural Areas; 4 Commercial Farms

Qualify (Derived variable: from question 3.4)
Var. Name: QUALIFY   Position: 113   Type/Length: Numerical 1
Valid Codes: 1 If the household qualified for selection for second phase; 2 Otherwise

Selected (Derived variable)
Var. Name: SELECTED   Position: 114   Type/Length: Numerical 1
Valid Codes: 1 If the household qualified for selection and was selected; 2 Otherwise

Q4.9 Person number of main respondent
Var. Name: Q49MAINR   Position: 115   Type/Length: Numerical 3
Valid Codes: 0 - 22; -99 Unspecified

Q4.10 Language in which the interview was conducted
Var. Name: Q410LANG   Position: 118   Type/Length: Numerical 3
Valid Codes:
  10  Afrikaans
  26  Arabic
  23  Chinese
   3  English
  17  French
  13  German
  14  Greek
   1  Gujarati
  19  Hindi
   1  Isindebele/Ndebele/South Ndebele/North Ndebele
   2  Isixhosa/Xhosa
   3  Isizulu/Sizulu/Zulu
  15  Italian
   4  Netherlands
  16  Portuguese
   4  Sepedi/Northern Sotho
   5  Sesotho/Southern Sotho/Sotho
   6  Setswana/Tswana
  25  Shona
   7  Siswati/Swazi
  24  Swahili
  18  Tamil
  20  Telugu
   8  Tshivenda/Venda
  22  Urdu
   9  Xitsonga/Tsonga/Shangaan
   0  Other
 -99  Not reported

Household weight for phase I (Derived variable: weighted to 1996 population census on the basis of province and area type)
Var. Name: HHWGT   Position: 121   Type/Length: Numerical 4
Valid Codes: 7 - 3 686
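The Position and Type/Length entries above define a fixed-width record layout. As an illustration only (the layout list below copies just the first four variables from the codebook; helper names such as parse_record are ours, not from the survey documentation), one HHOLD line could be sliced in Python as follows:

```python
# A minimal sketch of reading one fixed-width HHOLD record using the
# codebook's 1-based positions and lengths. Extend LAYOUT for the
# remaining variables; -99/-999 are the codebook's missing-value codes.

LAYOUT = [                 # (variable, start position, length)
    ("UQNR",     1, 10),
    ("Q41DWELL", 11, 3),
    ("Q42NOROO", 14, 3),
    ("Q43COOKI", 17, 3),
]

MISSING = {-99, -999}      # "Unspecified" / "Not Applicable"

def parse_record(line: str) -> dict:
    """Slice one fixed-width line into named numeric fields."""
    rec = {}
    for name, start, length in LAYOUT:
        raw = line[start - 1 : start - 1 + length].strip()
        value = int(raw) if raw else None
        rec[name] = None if value in MISSING else value
    return rec

# Example line: UQNR=1011010101, dwelling=1, rooms=3, cooking fuel unspecified
sample = "1011010101" + "  1" + "  3" + "-99"
print(parse_record(sample))
```

The same pattern applies to any of the files documented in this annex; only the layout list changes.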

Descriptive Statistics: HHOLD.dat

Variable      N          Mean        Std Dev        Minimum        Maximum
UQNR      26105    5335171383     2527122034     1011010101     9313021471
Q41DWELL  26105     2.3880866      7.8654339    -99.0000000     10.0000000
Q42NOROO  26105     2.0488795     13.2371657    -99.0000000     20.0000000
Q43COOKI  26105     2.1672477      5.8887941    -99.0000000      9.0000000
Q43HEATI  26105     1.6592990     11.1480539    -99.0000000      9.0000000
Q43LIGHT  26105     2.0647002      7.7083625    -99.0000000      9.0000000
Q44ACOLL  26105  -701.9144608    457.2320285   -999.0000000      6.0000000
Q44BCOLL  26105  -734.6049033    440.5416210   -999.0000000      3.0000000
Q45WATER  26105     2.1295920      9.5600602    -99.0000000     12.0000000
Q46ACOLL  26105  -582.6624401    492.7074229   -999.0000000      6.0000000
Q46BCOLL  26105  -606.4159356    487.2248795   -999.0000000      3.0000000
Q47AHHCU  26105     0.5947137     10.7473066    -99.0000000      2.0000000
Q47BRELA  26105  -744.4700249    434.0146826   -999.0000000      2.0000000
Q47BRELB  26105  -744.7195173    433.6185791   -999.0000000      2.0000000
Q47BRELC  26105  -744.7502011    433.5729078   -999.0000000      2.0000000
Q47BRELD  26105  -744.7125455    433.6372692   -999.0000000      2.0000000
Q47BRELE  26105  -744.7552959    433.5690465   -999.0000000      2.0000000
Q47BRELF  26105  -744.7124689    433.6378471   -999.0000000      2.0000000
Q47BRELG  26105  -744.7303582    433.6061679   -999.0000000      2.0000000
Q47BRELH  26105  -744.7078721    433.6408503   -999.0000000      2.0000000
Q48AGROS  26105     6.3336526      7.8815061    -99.0000000     14.0000000
Q48BREGU  26105     0.6804444      8.5040063    -99.0000000      2.0000000
Q48BCASU  26105     1.1208964      8.6699237    -99.0000000      2.0000000
Q48BSELF  26105     1.1241525      8.8029268    -99.0000000      2.0000000
Q48BREMI  26105     1.0083509      9.4543846    -99.0000000      2.0000000
Q48BAGRI  26105     1.0907106      9.4173120    -99.0000000      2.0000000
Q48BPENS  26105     1.0495307      8.7774109    -99.0000000      2.0000000
Q48BCHIL  26105     1.0351657      9.6180221    -99.0000000      2.0000000
Q48CSUBA  26105     1.1292856      8.3520700    -99.0000000      2.0000000
Q48CSUBB  26105     1.2311435      8.2393964    -99.0000000      2.0000000
PROV      26105     5.1251867      2.5748888      1.0000000      9.0000000
STRATUM   26105     2.4666539      1.1889267      1.0000000      4.0000000
QUALIFY   26105     1.6494924      0.4771381      1.0000000      2.0000000
SELECTED  26105     1.8278491      0.3775188      1.0000000      2.0000000
Q49MAINR  26105     1.2481900      6.5899414    -99.0000000     22.0000000
Q410LANG  26105     5.5339207      7.1309987    -99.0000000     22.0000000
HHWGT     26105   356.7791228    294.7206583      7.0000000        3868.00
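The strongly negative means in the table above (e.g. Q44ACOLL and the Q47BREL* items) arise because the -99/-999 sentinel codes were included in the computation rather than treated as missing. A toy illustration (invented data, not survey data) of how sentinels drag a mean down and how excluding them restores an interpretable value:

```python
# Illustrative only: sentinel codes -99/-999 included in a mean versus
# treated as missing, mirroring the pattern in the statistics above.

values = [1, 2, 3, -999, -999, 2, 1, -99]

def mean(xs):
    return sum(xs) / len(xs)

raw_mean = mean(values)                              # sentinels included
clean = [v for v in values if v not in (-99, -999)]  # drop missing codes
clean_mean = mean(clean)

print(round(raw_mean, 2))    # -261.0
print(round(clean_mean, 2))  # 1.8
```

Analysts using these files would normally recode -99 and -999 to a missing value before computing summary statistics.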

Annex V Structure of dataset 29

Hierarchical data and codebooks

Data in ASCII:

01234 1 1
32161 232 0
19082 230 1
02234 1 0
11231 240 1
03234 1 0
43711 227 0
04234 1 0
40221 213 1
41162 222 0
16173 224 1
10234 1 1
30111 220 0
36222 211 1
21234 1 0
21751 217 0
33962 210 1
32143 226 1

Codebook for house record (Record Type 1):
column 1-5   HOUSE
column 7     record type
column 9     GROUP

Codebook for person record (Record Type 2):
column 1-4   PERSON
column 5     P_NUM (PERSON NUMBER)
column 7     record type
column 8-9   AGE
column 11    SEX

Flat data file (after the above data is loaded in SPSS):

HOUSE  GROUP  PERSON  P_NUM  AGE  SEX
 1234      1    3216      1   32    0
 1234      1    1908      2   30    1
 2234      0    1123      1   40    1
 3234      0    4371      1   27    0
 4234      0    4022      1   13    1
 4234      0    4116      2   22    0
 4234      0    1617      3   24    1
10234      1    3011      1   20    0
10234      1    3622      2   11    1
21234      0    2175      1   17    0
21234      0    3396      2   10    1
21234      0    3214      3   26    1

29 Based on UCLA Academic Technology Services SPSS FAQ: Reading hierarchical data http://www.ats.ucla.edu/stat/spss/faq/hierspss.htm
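The SPSS load described above spreads each Record Type 1 (house) line onto the Record Type 2 (person) lines that follow it. A hedged Python sketch of the same flattening, using the column positions from the two codebooks and the first few sample records (variable names as in the annex; the code itself is ours, not from the source):

```python
# Flatten hierarchical two-record-type ASCII data: record type 1 carries
# HOUSE/GROUP, record type 2 carries PERSON/P_NUM/AGE/SEX, and each person
# row inherits the most recent house row, as in the SPSS flat file.

lines = [
    "01234 1 1",     # house 1234, group 1
    "32161 232 0",   # person 3216, p_num 1, age 32, sex 0
    "19082 230 1",
    "02234 1 0",     # house 2234, group 0
    "11231 240 1",
]

rows, house, group = [], None, None
for line in lines:
    rectype = line[6]                      # column 7 in both codebooks
    if rectype == "1":                     # house record
        house, group = int(line[0:5]), int(line[8])
    else:                                  # person record
        rows.append({
            "HOUSE": house, "GROUP": group,
            "PERSON": int(line[0:4]), "P_NUM": int(line[4]),
            "AGE": int(line[7:9]), "SEX": int(line[10]),
        })

for r in rows:
    print(r)
```

Run on the full sample, this reproduces the twelve-row flat data file shown in the annex.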

Flat fixed-width ASCII file created from the SPSS file:

    1234       1    3216       1      32       0
    1234       1    1908       2      30       1
    2234       0    1123       1      40       1
    3234       0    4371       1      27       0
    4234       0    4022       1      13       1
    4234       0    4116       2      22       0
    4234       0    1617       3      24       1
   10234       1    3011       1      20       0
   10234       1    3622       2      11       1
   21234       0    2175       1      17       0
   21234       0    3396       2      10       1
   21234       0    3214       3      26       1

Codebook for ASCII file:

Variable  Start  End
HOUSE         1    8
GROUP         9   16
PERSON       17   24
P_NUM        25   32
AGE          33   40
SEX          41   48
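The Start/End codebook above can drive a simple fixed-width reader. A minimal Python sketch (illustrative, not from the source; read_row is our name) that slices one line of the 8-column-wide file into named fields:

```python
# Read one row of the flat fixed-width file using the Variable/Start/End
# codebook above (1-based, inclusive column ranges, 8 columns per field).

CODEBOOK = [
    ("HOUSE", 1, 8), ("GROUP", 9, 16), ("PERSON", 17, 24),
    ("P_NUM", 25, 32), ("AGE", 33, 40), ("SEX", 41, 48),
]

def read_row(line: str) -> dict:
    """Slice a line by the codebook's start/end columns; int() tolerates padding."""
    return {name: int(line[start - 1:end]) for name, start, end in CODEBOOK}

# First row of the flat file, built field by field to keep the widths exact
sample = "    1234" + "       1" + "    3216" + "       1" + "      32" + "       0"
print(read_row(sample))
```

Looping read_row over every line of the file recovers the flat data table shown earlier in this annex.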